Building a practical alerting system for Fivetran and dbt failures with AWS Lambda, DynamoDB, Teams, and Jira
A useful alerting system does not just say something broke. It knows whether the failure is new, whether it already alerted, whether it recovered, and when to create a ticket.
TL;DR / Key Takeaways
- Alerting should distinguish between a new failure, an ongoing failure, and a recovery.
- Teams messages are good for fast visibility, but Jira tickets are better for durable follow-up.
- DynamoDB can track alert state so the system does not spam the team every time it runs.
- Fivetran and dbt failures need different metadata, but the alerting pattern can be shared.
- A good alert includes what failed, when it failed, current status, severity, owner path, and next action.
Bad alerting says, "Something broke."
Useful alerting says, "This specific thing failed, this is the first time we saw it, this is who should own it, here is the next action, and we will not keep yelling unless something changes."
This is a sanitized architecture pattern for monitoring Fivetran and dbt jobs with AWS Lambda, DynamoDB, Microsoft Teams, and Jira.
The problem with naive alerting
Naive alerting usually has one move: send a message every time a scheduled check sees a failure.
That creates noise fast:
- The same failure posts every run.
- Nobody knows whether a ticket already exists.
- Recoveries are invisible.
- Ongoing failures look like new failures.
- Teams becomes a scrolling wall of repeated alerts.
Alerting should reduce uncertainty, not create a second system people learn to ignore.
What the system needs to know
A useful alerting system needs memory.
It should know:
- Is this failure new?
- Have we already alerted?
- Have we already created a ticket?
- Is the failure still active?
- Did the job recover?
- Who should investigate?
- What action should happen next?
That is why the alert state matters as much as the API polling.
Architecture overview
A practical architecture can stay small:
- Scheduled AWS Lambda runs every few minutes.
- Lambda polls the Fivetran API for connector or sync status.
- Lambda polls dbt job or run status.
- Failures are normalized into common alert objects.
- DynamoDB stores current alert state.
- Microsoft Teams receives a webhook message for new failures.
- Jira receives a ticket for new actionable failures.
- Recovery messages are sent when jobs return to healthy state.
- Duplicate Teams messages and duplicate Jira tickets are suppressed.
This is not complicated infrastructure. It is careful state management.
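A minimal sketch of the scheduled Lambda entry point, in Python, might look like the following. The helper names (poll_fivetran_failures, poll_dbt_failures, handle_alert, check_for_recoveries) are illustrative placeholders, not real library calls; the state-handling pieces are sketched in later sections.

def lambda_handler(event, context):
    # Poll both systems; each assumed helper returns a list of normalized alert objects
    failures = poll_fivetran_failures() + poll_dbt_failures()

    # New vs ongoing failures are decided against DynamoDB state (see handle_alert below)
    for failure in failures:
        handle_alert(failure)

    # Open alerts that did not show up in this poll are candidates for recovery
    check_for_recoveries(failures)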
Normalizing Fivetran and dbt failures
Fivetran and dbt failures have different shapes. The alerting system should normalize them before deciding what to do.
| Source | Useful metadata |
| --- | --- |
| Fivetran | Connector name, sync status, field changes, last sync time, failure message |
| dbt | Job name, run ID, environment, status, failed step, run URL |
| Shared alert | Severity, fingerprint, summary, first seen, last seen, next action |
The normalized object lets downstream logic handle both systems with the same alert workflow.
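One way to express that shared shape is a small dataclass with a constructor per source. This is a sketch: the raw field names (connector["schema"], for example) are assumptions about the API responses, not exact keys.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Alert:
    source: str        # "fivetran" or "dbt"
    job_name: str
    status: str
    severity: str
    summary: str
    fingerprint: str
    first_seen_at: str
    last_seen_at: str
    next_action: str

def normalize_fivetran(connector: dict) -> Alert:
    # Field names on the raw response are illustrative, not exact Fivetran keys
    now = datetime.now(timezone.utc).isoformat()
    name = connector["schema"]
    return Alert(
        source="fivetran",
        job_name=name,
        status="failed",
        severity="high",
        summary=f"Connector {name} failed during sync",
        fingerprint=f"fivetran:{name}:failed",
        first_seen_at=now,
        last_seen_at=now,
        next_action="Review connector logs and retry history",
    )

# A normalize_dbt function follows the same pattern with job name, run ID, and run URL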
DynamoDB as alert memory
DynamoDB is a practical fit for alert state because the access pattern is simple:
- Get alert by fingerprint.
- Insert alert when new.
- Update last_seen_at while ongoing.
- Mark recovered when healthy.
- Store Teams and Jira state.
The table does not need to be clever. It needs a stable key and predictable updates.
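With boto3, the state operations stay small. The table name pipeline-alerts and the attribute names here are placeholders; the function signatures match the pseudocode later in this post.

import boto3

# Table name and attribute names are assumptions for illustration
table = boto3.resource("dynamodb").Table("pipeline-alerts")

def load_alert_state(alert_key):
    return table.get_item(Key={"alert_key": alert_key}).get("Item")

def save_alert_state(alert, ticket_key):
    table.put_item(Item={
        "alert_key": alert.fingerprint,
        "status": "open",
        "first_seen_at": alert.first_seen_at,
        "last_seen_at": alert.last_seen_at,
        "teams_alert_sent": True,
        "jira_ticket_key": ticket_key,
        "recovery_sent": False,
    })

def update_last_seen(existing, current_failure):
    table.update_item(
        Key={"alert_key": existing["alert_key"]},
        UpdateExpression="SET last_seen_at = :ts",
        ExpressionAttributeValues={":ts": current_failure.last_seen_at},
    )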
Teams alerts for visibility
Microsoft Teams is useful for fast visibility. The message should be short enough to scan but specific enough to act.
Good Teams alerts include:
- Source system
- Job or connector name
- Status
- Severity
- First seen time
- Summary
- Next action
- Jira ticket link if one exists
Teams is not the system of record. It is the signal.
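A sketch of the Teams message, assuming a legacy incoming-webhook URL stored in an environment variable. If your tenant has moved to Power Automate workflows, the card format will differ.

import os
import requests

TEAMS_WEBHOOK_URL = os.environ["TEAMS_WEBHOOK_URL"]  # assumed environment variable

def send_teams_alert(alert, jira_ticket_key=None):
    # Legacy incoming-webhook MessageCard format; adjust for Adaptive Cards if needed
    facts = [
        {"name": "Source", "value": alert.source},
        {"name": "Job", "value": alert.job_name},
        {"name": "Status", "value": alert.status},
        {"name": "Severity", "value": alert.severity},
        {"name": "First seen", "value": alert.first_seen_at},
        {"name": "Next action", "value": alert.next_action},
    ]
    if jira_ticket_key:
        facts.append({"name": "Jira", "value": jira_ticket_key})
    payload = {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "summary": alert.summary,
        "sections": [{"activityTitle": alert.summary, "facts": facts}],
    }
    requests.post(TEAMS_WEBHOOK_URL, json=payload, timeout=10).raise_for_status()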
Jira tickets for durable ownership
Jira is better for durable follow-up. A ticket gives the team a place to assign ownership, add investigation notes, and track resolution.
Create tickets for failures that are actionable. Not every transient warning deserves a ticket.
The ticket should include:
- What failed
- When it first failed
- Current status
- Severity
- Source metadata
- Links to relevant logs or dashboards
- Suggested next action
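A sketch of ticket creation against the Jira REST API v2 create-issue endpoint. The base URL, credentials, and the DATA project key are placeholders to swap for your own.

import os
import requests

JIRA_BASE_URL = os.environ["JIRA_BASE_URL"]   # e.g. https://example.atlassian.net (assumed)
JIRA_AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])
JIRA_PROJECT_KEY = "DATA"                     # illustrative project key

def create_jira_ticket(alert):
    description = (
        f"Status: {alert.status}\n"
        f"Severity: {alert.severity}\n"
        f"First seen: {alert.first_seen_at}\n"
        f"Next action: {alert.next_action}"
    )
    fields = {
        "project": {"key": JIRA_PROJECT_KEY},
        "issuetype": {"name": "Task"},
        "summary": f"[{alert.source}] {alert.job_name} {alert.status}",
        "description": description,
    }
    response = requests.post(
        f"{JIRA_BASE_URL}/rest/api/2/issue",
        json={"fields": fields},
        auth=JIRA_AUTH,
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["key"]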
Recovery detection
Recovery detection is where alerting starts to feel trustworthy.
If an alert is open and the next poll reports the connector or job as healthy, the system should:
- Send a Teams recovery message
- Comment on or close the Jira ticket if appropriate
- Mark the alert recovered in DynamoDB
- Avoid reopening unless a new failure appears
Recovery messages tell humans when the system is no longer in the same state.
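A sketch of that detection step, assuming a hypothetical helper that returns the keys of currently open alerts from DynamoDB. Any open alert that did not appear in the latest poll is treated as recovered.

def check_for_recoveries(current_failures):
    # load_open_alert_keys is an assumed helper that queries DynamoDB
    # for items with status == "open"
    still_failing = {f.fingerprint for f in current_failures}
    for alert_key in load_open_alert_keys():
        if alert_key not in still_failing:
            handle_recovery(alert_key)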
Duplicate suppression
Duplicate suppression depends on a stable fingerprint.
Example fingerprint pattern:
source:job_name:status
For a generic Fivetran connector:
fivetran:example_connector:failed
The fingerprint should be specific enough to avoid merging unrelated failures, but stable enough to recognize the same ongoing issue.
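One possible helper, normalizing case and whitespace so the same job always maps to the same key:

def make_fingerprint(source, job_name, status):
    # Lowercase and strip so the same failure always produces the same key
    return f"{source.strip().lower()}:{job_name.strip().lower()}:{status.strip().lower()}"

make_fingerprint("fivetran", "example_connector", "failed")
# -> "fivetran:example_connector:failed"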
Severity and routing
Not every failure has the same impact.
Severity can be derived from:
- Source system
- Job type
- Time since last success
- Business criticality
- Number of failed runs
- Whether downstream dashboards or workflows depend on it
Routing can stay simple at first: high severity creates a ticket immediately, lower severity can alert in Teams and wait for repeated failure before ticket creation.
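A rough sketch of that routing logic; the thresholds are placeholders to tune against your own pipelines.

def derive_severity(failed_runs, has_downstream_dependents):
    # Thresholds are placeholders; tune them to your own pipelines
    if has_downstream_dependents or failed_runs >= 3:
        return "high"
    if failed_runs >= 2:
        return "medium"
    return "low"

def should_create_ticket(severity, failed_runs):
    # High severity tickets immediately; lower severity waits for repeated failures
    return severity == "high" or failed_runs >= 3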
Example alert payload
{
"source": "fivetran",
"job_name": "example_connector",
"status": "failed",
"severity": "high",
"first_seen_at": "2026-05-03T12:00:00Z",
"last_seen_at": "2026-05-03T12:05:00Z",
"fingerprint": "fivetran:example_connector:failed",
"summary": "Connector failed during sync",
"next_action": "Review connector logs and retry history"
}
Keep alert objects generic and sanitized. Do not include sensitive values, raw payloads, or private links.
Example DynamoDB state model
{
"alert_key": "fivetran:example_connector:failed",
"status": "open",
"first_seen_at": "2026-05-03T12:00:00Z",
"last_seen_at": "2026-05-03T12:05:00Z",
"teams_alert_sent": true,
"jira_ticket_key": "DATA-123",
"recovery_sent": false
}
The DynamoDB item is alert memory. It tells the next Lambda run what already happened.
Example pseudocode
def handle_alert(current_failure):
    existing = load_alert_state(current_failure.fingerprint)
    if not existing:
        # First time we have seen this failure: alert, ticket, and remember it
        send_teams_alert(current_failure)
        ticket_key = create_jira_ticket(current_failure)
        save_alert_state(current_failure, ticket_key)
        return
    # Ongoing failure: update state quietly, no new Teams message or ticket
    update_last_seen(existing, current_failure)
Recovery should be explicit too:
def handle_recovery(alert_key):
    existing = load_alert_state(alert_key)
    if existing and existing["status"] == "open":
        # Close the loop: notify Teams, update Jira, and mark the state recovered
        send_teams_recovery(existing)
        close_or_comment_on_jira(existing["jira_ticket_key"])
        mark_alert_recovered(alert_key)
The important part is not the exact code. The important part is that new failures, ongoing failures, and recoveries are treated differently.
Operational checklist
- Define which Fivetran connectors and dbt jobs are in scope.
- Normalize source-specific failures into a common alert object.
- Create a stable fingerprint for each failure type.
- Store alert state in DynamoDB.
- Send Teams messages only for new failures and recoveries.
- Create Jira tickets only for actionable failures.
- Update last_seen_at for ongoing failures.
- Add severity and routing rules.
- Avoid logging raw API responses or sensitive values.
- Test recovery behavior, not just failure behavior.
FAQ
Why use DynamoDB for alert state?
DynamoDB works well when the state model is simple and key-based. Alerting needs fast reads and writes by fingerprint, not complex relational queries.
Should every failure create a Jira ticket?
No. Some failures are transient. A ticket should represent work someone may need to own. Teams can handle fast visibility, while Jira handles durable follow-up.
Why send recovery messages?
Recovery messages close the loop. They tell the team that the system returned to a healthy state and reduce the need for manual checking.
How do you avoid Teams spam?
Store alert state and only send a new Teams message when the alert is first seen or when it recovers. Ongoing failures should update state quietly.
Can the same pattern monitor other systems?
Yes. Any system that can report status can fit the same pattern: normalize the failure, fingerprint it, store state, alert once, ticket when needed, and detect recovery.