Building a practical alerting system for Fivetran and dbt failures with AWS Lambda, DynamoDB, Teams, and Jira
A useful alerting system does not just say something broke. It knows whether the failure is new, whether it already alerted, whether it recovered, and when to create a ticket.
TL;DR / Key Takeaways
- Alerting should distinguish between a new failure, an ongoing failure, and a recovery.
- Teams messages are good for fast visibility, but Jira tickets are better for durable follow-up.
- DynamoDB can track alert state so the system does not spam the team every time it runs.
- Fivetran and dbt failures need different metadata, but the alerting pattern can be shared.
- A good alert includes what failed, when it failed, current status, severity, owner path, and next action.
Bad alerting says, "Something broke."
Useful alerting says, "This specific thing failed, this is the first time we saw it, this is who should own it, here is the next action, and we will not keep yelling unless something changes."
This is a sanitized architecture pattern for monitoring Fivetran and dbt jobs with AWS Lambda, DynamoDB, Microsoft Teams, and Jira.
The problem with naive alerting
Naive alerting usually has one move: send a message every time a scheduled check sees a failure.
That creates noise fast:
- The same failure posts every run.
- Nobody knows whether a ticket already exists.
- Recoveries are invisible.
- Ongoing failures look like new failures.
- Teams becomes a scrolling wall of repeated alerts.
Alerting should reduce uncertainty, not create a second system people learn to ignore.
What the system needs to know
A useful alerting system needs memory.
It should know:
- Is this failure new?
- Have we already alerted?
- Have we already created a ticket?
- Is the failure still active?
- Did the job recover?
- Who should investigate?
- What action should happen next?
That is why the alert state matters as much as the API polling.
Architecture overview
A practical architecture can stay small:
- Scheduled AWS Lambda runs every few minutes.
- Lambda polls the Fivetran API for connector or sync status.
- Lambda polls dbt job or run status.
- Failures are normalized into common alert objects.
- DynamoDB stores current alert state.
- Microsoft Teams receives a webhook message for new failures.
- Jira receives a ticket for new actionable failures.
- Recovery messages are sent when jobs return to healthy state.
- Duplicate Teams messages and duplicate Jira tickets are suppressed.
This is not complicated infrastructure. It is careful state management.
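A minimal sketch of the scheduled Lambda entry point, in Python, might look like the following. The helper names (poll_fivetran_failures, poll_dbt_failures, handle_alert, check_for_recoveries) are illustrative placeholders, not real library calls; the state-handling pieces are sketched in later sections.

def lambda_handler(event, context):
    # Poll both systems; each assumed helper returns a list of normalized alert objects
    failures = poll_fivetran_failures() + poll_dbt_failures()

    # New vs ongoing failures are decided against DynamoDB state (see handle_alert below)
    for failure in failures:
        handle_alert(failure)

    # Open alerts that did not show up in this poll are candidates for recovery
    check_for_recoveries(failures)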
Normalizing Fivetran and dbt failures
Fivetran and dbt failures have different shapes. The alerting system should normalize them before deciding what to do.
| Source | Useful metadata |
| --- | --- |
| Fivetran | Connector name, sync status, field changes, last sync time, failure message |
| dbt | Job name, run ID, environment, status, failed step, run URL |
| Shared alert | Severity, fingerprint, summary, first seen, last seen, next action |
The normalized object lets downstream logic handle both systems with the same alert workflow.
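One way to express that shared shape is a small dataclass with a constructor per source. This is a sketch: the raw field names (connector["schema"], for example) are assumptions about the API responses, not exact keys.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Alert:
    source: str        # "fivetran" or "dbt"
    job_name: str
    status: str
    severity: str
    summary: str
    fingerprint: str
    first_seen_at: str
    last_seen_at: str
    next_action: str

def normalize_fivetran(connector: dict) -> Alert:
    # Field names on the raw response are illustrative, not exact Fivetran keys
    now = datetime.now(timezone.utc).isoformat()
    name = connector["schema"]
    return Alert(
        source="fivetran",
        job_name=name,
        status="failed",
        severity="high",
        summary=f"Connector {name} failed during sync",
        fingerprint=f"fivetran:{name}:failed",
        first_seen_at=now,
        last_seen_at=now,
        next_action="Review connector logs and retry history",
    )

# A normalize_dbt function follows the same pattern with job name, run ID, and run URL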
DynamoDB as alert memory
DynamoDB is a practical fit for alert state because the access pattern is simple:
- Get alert by fingerprint.
- Insert alert when new.
- Update last_seen_at while ongoing.
- Mark recovered when healthy.
- Store Teams and Jira state.
The table does not need to be clever. It needs a stable key and predictable updates.
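With boto3, the state operations stay small. The table name pipeline-alerts and the attribute names here are placeholders; the function signatures match the pseudocode later in this post.

import boto3

# Table name and attribute names are assumptions for illustration
table = boto3.resource("dynamodb").Table("pipeline-alerts")

def load_alert_state(alert_key):
    return table.get_item(Key={"alert_key": alert_key}).get("Item")

def save_alert_state(alert, ticket_key):
    table.put_item(Item={
        "alert_key": alert.fingerprint,
        "status": "open",
        "first_seen_at": alert.first_seen_at,
        "last_seen_at": alert.last_seen_at,
        "teams_alert_sent": True,
        "jira_ticket_key": ticket_key,
        "recovery_sent": False,
    })

def update_last_seen(existing, current_failure):
    table.update_item(
        Key={"alert_key": existing["alert_key"]},
        UpdateExpression="SET last_seen_at = :ts",
        ExpressionAttributeValues={":ts": current_failure.last_seen_at},
    )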
Teams alerts for visibility
Microsoft Teams is useful for fast visibility. The message should be short enough to scan but specific enough to act.
Good Teams alerts include:
- Source system
- Job or connector name
- Status
- Severity
- First seen time
- Summary
- Next action
- Jira ticket link if one exists
Teams is not the system of record. It is the signal.
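A sketch of the Teams message, assuming a legacy incoming-webhook URL stored in an environment variable. If your tenant has moved to Power Automate workflows, the card format will differ.

import os
import requests

TEAMS_WEBHOOK_URL = os.environ["TEAMS_WEBHOOK_URL"]  # assumed environment variable

def send_teams_alert(alert, jira_ticket_key=None):
    # Legacy incoming-webhook MessageCard format; adjust for Adaptive Cards if needed
    facts = [
        {"name": "Source", "value": alert.source},
        {"name": "Job", "value": alert.job_name},
        {"name": "Status", "value": alert.status},
        {"name": "Severity", "value": alert.severity},
        {"name": "First seen", "value": alert.first_seen_at},
        {"name": "Next action", "value": alert.next_action},
    ]
    if jira_ticket_key:
        facts.append({"name": "Jira", "value": jira_ticket_key})
    payload = {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "summary": alert.summary,
        "sections": [{"activityTitle": alert.summary, "facts": facts}],
    }
    requests.post(TEAMS_WEBHOOK_URL, json=payload, timeout=10).raise_for_status()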
Jira tickets for durable ownership
Jira is better for durable follow-up. A ticket gives the team a place to assign ownership, add investigation notes, and track resolution.
Create tickets for failures that are actionable. Not every transient warning deserves a ticket.
The ticket should include:
- What failed
- When it first failed
- Current status
- Severity
- Source metadata
- Links to relevant logs or dashboards
- Suggested next action
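A sketch of ticket creation against the Jira REST API v2 create-issue endpoint. The base URL, credentials, and the DATA project key are placeholders to swap for your own.

import os
import requests

JIRA_BASE_URL = os.environ["JIRA_BASE_URL"]   # e.g. https://example.atlassian.net (assumed)
JIRA_AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])
JIRA_PROJECT_KEY = "DATA"                     # illustrative project key

def create_jira_ticket(alert):
    description = (
        f"Status: {alert.status}\n"
        f"Severity: {alert.severity}\n"
        f"First seen: {alert.first_seen_at}\n"
        f"Next action: {alert.next_action}"
    )
    fields = {
        "project": {"key": JIRA_PROJECT_KEY},
        "issuetype": {"name": "Task"},
        "summary": f"[{alert.source}] {alert.job_name} {alert.status}",
        "description": description,
    }
    response = requests.post(
        f"{JIRA_BASE_URL}/rest/api/2/issue",
        json={"fields": fields},
        auth=JIRA_AUTH,
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["key"]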
Recovery detection
Recovery detection is where alerting starts to feel trustworthy.
If an alert is open and the next poll reports the connector or job as healthy, the system should:
- Send a Teams recovery message
- Comment on or close the Jira ticket if appropriate
- Mark the alert recovered in DynamoDB
- Avoid reopening unless a new failure appears
Recovery messages tell humans when the system is no longer in the same state.
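A sketch of that detection step, assuming a hypothetical helper that returns the keys of currently open alerts from DynamoDB. Any open alert that did not appear in the latest poll is treated as recovered.

def check_for_recoveries(current_failures):
    # load_open_alert_keys is an assumed helper that queries DynamoDB
    # for items with status == "open"
    still_failing = {f.fingerprint for f in current_failures}
    for alert_key in load_open_alert_keys():
        if alert_key not in still_failing:
            handle_recovery(alert_key)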
Duplicate suppression
Duplicate suppression depends on a stable fingerprint.
Example fingerprint pattern:
source:job_name:status
For a generic Fivetran connector:
fivetran:example_connector:failed
The fingerprint should be specific enough to avoid merging unrelated failures, but stable enough to recognize the same ongoing issue.
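One possible helper, normalizing case and whitespace so the same job always maps to the same key:

def make_fingerprint(source, job_name, status):
    # Lowercase and strip so the same failure always produces the same key
    return f"{source.strip().lower()}:{job_name.strip().lower()}:{status.strip().lower()}"

make_fingerprint("fivetran", "example_connector", "failed")
# -> "fivetran:example_connector:failed"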
Severity and routing
Not every failure has the same impact.
Severity can be derived from:
- Source system
- Job type
- Time since last success
- Business criticality
- Number of failed runs
- Whether downstream dashboards or workflows depend on it
Routing can stay simple at first: high severity creates a ticket immediately, lower severity can alert in Teams and wait for repeated failure before ticket creation.
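A rough sketch of that routing logic; the thresholds are placeholders to tune against your own pipelines.

def derive_severity(failed_runs, has_downstream_dependents):
    # Thresholds are placeholders; tune them to your own pipelines
    if has_downstream_dependents or failed_runs >= 3:
        return "high"
    if failed_runs >= 2:
        return "medium"
    return "low"

def should_create_ticket(severity, failed_runs):
    # High severity tickets immediately; lower severity waits for repeated failures
    return severity == "high" or failed_runs >= 3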
Example alert payload
{
"source": "fivetran",
"job_name": "example_connector",
"status": "failed",
"severity": "high",
"first_seen_at": "2026-05-03T12:00:00Z",
"last_seen_at": "2026-05-03T12:05:00Z",
"fingerprint": "fivetran:example_connector:failed",
"summary": "Connector failed during sync",
"next_action": "Review connector logs and retry history"
}
Keep alert objects generic and sanitized. Do not include sensitive values, raw payloads, or private links.
Example DynamoDB state model
{
"alert_key": "fivetran:example_connector:failed",
"status": "open",
"first_seen_at": "2026-05-03T12:00:00Z",
"last_seen_at": "2026-05-03T12:05:00Z",
"teams_alert_sent": true,
"jira_ticket_key": "DATA-123",
"recovery_sent": false
}
The DynamoDB item is alert memory. It tells the next Lambda run what already happened.
Example pseudocode
def handle_alert(current_failure):
    existing = load_alert_state(current_failure.fingerprint)
    if not existing:
        # First time we have seen this failure: alert, ticket, and remember it
        send_teams_alert(current_failure)
        ticket_key = create_jira_ticket(current_failure)
        save_alert_state(current_failure, ticket_key)
        return
    # Ongoing failure: update state quietly, no new Teams message or ticket
    update_last_seen(existing, current_failure)
Recovery should be explicit too:
def handle_recovery(alert_key):
    existing = load_alert_state(alert_key)
    if existing and existing["status"] == "open":
        # Close the loop: notify Teams, update Jira, and mark the state recovered
        send_teams_recovery(existing)
        close_or_comment_on_jira(existing["jira_ticket_key"])
        mark_alert_recovered(alert_key)
The important part is not the exact code. The important part is that new failures, ongoing failures, and recoveries are treated differently.
Operational checklist
- Define which Fivetran connectors and dbt jobs are in scope.
- Normalize source-specific failures into a common alert object.
- Create a stable fingerprint for each failure type.
- Store alert state in DynamoDB.
- Send Teams messages only for new failures and recoveries.
- Create Jira tickets only for actionable failures.
- Update last_seen_at for ongoing failures.
- Add severity and routing rules.
- Avoid logging raw API responses or sensitive values.
- Test recovery behavior, not just failure behavior.
FAQ
Why use DynamoDB for alert state?
DynamoDB works well when the state model is simple and key-based. Alerting needs fast reads and writes by fingerprint, not complex relational queries.
Should every failure create a Jira ticket?
No. Some failures are transient. A ticket should represent work someone may need to own. Teams can handle fast visibility, while Jira handles durable follow-up.
Why send recovery messages?
Recovery messages close the loop. They tell the team that the system returned to a healthy state and reduce the need for manual checking.
How do you avoid Teams spam?
Store alert state and only send a new Teams message when the alert is first seen or when it recovers. Ongoing failures should update state quietly.
Can the same pattern monitor other systems?
Yes. Any system that can report status can fit the same pattern: normalize the failure, fingerprint it, store state, alert once, ticket when needed, and detect recovery.