
Why idempotency is non-negotiable in ETL pipelines

December 19, 2025

Microservice architectures (MSA) are inherently unreliable. Networks fail, services crash, and retries are inevitable. In this environment, idempotency isn't just a best practice; it's a survival mechanism for your ETL jobs. Here's why it matters and how to implement it.

What is Idempotency?

An operation is idempotent if running it multiple times produces the same result as running it once. In the context of ETL, this means: no matter how many times a job runs, or in what order, the final output should be consistent and correct.
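As a tiny illustration (plain Python, with a hypothetical in-memory `results` store standing in for your output table): an append double-counts on retry, while a keyed upsert does not.

```python
results = {}  # hypothetical output store, keyed by (instance_id, window)

def append_duration(key, duration):
    # Not idempotent: a retry of the same run double-counts the event.
    results[key] = results.get(key, 0) + duration

def upsert_duration(key, duration):
    # Idempotent: re-running with the same input leaves the same final state.
    results[key] = duration

upsert_duration(("web-1", "2025-12-19T10:00"), 42)
upsert_duration(("web-1", "2025-12-19T10:00"), 42)  # retry; result unchanged
```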

Why does this matter? Let me illustrate with a real-world example:

Suppose your system aggregates downtime events. Each row in the raw data represents a single downtime-related event, and the data contains events from multiple instances, each with an error code.

  1. An error event is detected in Time Window 1.

[Figure: Time Window 1]

  2. Due to an unknown reason, Time Window 3 data arrives before Time Window 2, and a success event is detected. However, you cannot determine whether the downtime event that started in Time Window 1 ended in Time Window 2 or continued until Time Window 3. The missing Time Window 2 data creates ambiguity:

[Figure: Time Window 2]

  • There might be a single downtime event (Error at T1 → Success at T3)

[Figure: Time Window 3-1]

  • There might be multiple downtime events (Error at T1 → Success at T2 → Error at T2 → Success at T3)

[Figure: Time Window 3-2]

Therefore, it is important to ensure your ETL jobs do not make any assumptions about the order or completeness of the data.
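To make the scenario concrete, here is roughly what the job observes (hypothetical records; Time Window 2 has not arrived yet):

```python
# Events observed so far, in arrival order; Time Window 2 is still missing.
observed = [
    {"window": "T1", "status": "ERROR"},
    {"window": "T3", "status": "SUCCESS"},  # arrived before T2
]

# Both interpretations are consistent with what has arrived:
single_downtime    = [("T1", "T3")]                # one event spanning T1..T3
multiple_downtimes = [("T1", "T2"), ("T2", "T3")]  # two shorter events
```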

How do you make an idempotent job?

Baseline: Full Scan

The most straightforward solution is to periodically scan all historical events and recompute the aggregation from scratch. This approach is inherently idempotent; since it recomputes everything, the result is always consistent regardless of how many times the job runs.

  • Time complexity: O(N log N), where N is the total number of events (due to sorting for LAG computation)

The downside is obvious: as your data grows, this becomes increasingly expensive.
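As a rough PySpark sketch of the full-scan approach, assuming a raw `events` table with hypothetical columns `instance_id`, `event_time`, and `status` (the names and the simple ERROR→SUCCESS pairing rule are illustrative, not the exact production logic):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Full scan: read every historical event and recompute the aggregate from scratch.
events = spark.table("events")  # hypothetical: instance_id, event_time, status

w = Window.partitionBy("instance_id").orderBy("event_time")

downtimes = (
    events
    .withColumn("prev_status", F.lag("status").over(w))
    .withColumn("prev_time", F.lag("event_time").over(w))
    # A downtime closes when a SUCCESS directly follows an ERROR for the same instance.
    .where((F.col("status") == "SUCCESS") & (F.col("prev_status") == "ERROR"))
    .withColumn(
        "downtime_seconds",
        F.col("event_time").cast("long") - F.col("prev_time").cast("long"),
    )
    .select(
        "instance_id",
        F.col("prev_time").alias("start_time"),
        F.col("event_time").alias("end_time"),
        "downtime_seconds",
    )
)

# Overwriting the output makes the whole job idempotent by construction:
# every run replaces the aggregate with a freshly recomputed one.
downtimes.write.mode("overwrite").saveAsTable("downtime_agg")
```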

Bounded Lookback Window

Instead of scanning everything, define a fixed window (e.g., 7 days) based on your requirements and scan all records within that window.

  • Time complexity: O(W log W), where W = number of records in the window

If the majority of downtime events are shorter than a week, for example, this is a good solution: simple and efficient.
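A sketch of the bounded window under the same hypothetical schema; only the scan filter changes, and the recomputation itself mirrors the full-scan version:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

LOOKBACK_DAYS = 7  # fixed bound, chosen because most downtimes are shorter than a week

windowed_events = (
    spark.table("events")  # same hypothetical table as above
    .where(F.col("event_time") >= F.date_sub(F.current_date(), LOOKBACK_DAYS))
)

# Recompute downtimes exactly as in the full scan, but over W records instead of N.
# When writing back, overwrite only the output rows that fall inside the window,
# so that re-runs stay idempotent for that range.
```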

One critical issue is false positives. If data is missing in the middle of the window, the job might produce an incorrect downtime duration. Edge cases should be handled carefully if your system has strict accuracy requirements.

Additional State Table

Another approach is to maintain a separate table for unclosed events. When new data arrives, join only against open events, not the entire history.

  • Time complexity: O(new_events + open_events)

This adds complexity to the system, but produces more robust results. This pattern is common in streaming systems like Flink and Spark Structured Streaming, often called "stateful processing."
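A hedged sketch of the state-table join in PySpark, assuming a hypothetical `open_events` table that holds ERROR events not yet matched to a SUCCESS, and a hypothetical `staged_events` table with the newly arrived batch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

open_events = spark.table("open_events")    # hypothetical: instance_id, error_time
new_events  = spark.table("staged_events")  # hypothetical: instance_id, event_time, status

new_successes = new_events.where(F.col("status") == "SUCCESS")

# Join new SUCCESS events against open ERROR events only, not the entire history.
closed = (
    open_events.alias("o")
    .join(new_successes.alias("n"),
          F.col("o.instance_id") == F.col("n.instance_id"))
    .where(F.col("n.event_time") >= F.col("o.error_time"))
    .select(
        F.col("o.instance_id").alias("instance_id"),
        F.col("o.error_time").alias("start_time"),
        F.col("n.event_time").alias("end_time"),
    )
)

# Remaining steps (not shown): append `closed` to the downtime aggregate,
# delete the matched rows from open_events, and insert any new, still-unmatched
# ERROR events as fresh open events.
```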

If your data is stored in Delta Lake, this pattern introduces another challenge. Since Delta Lake MERGE only supports merging between two tables, ensuring atomicity can require multiple steps:

1. Merge new events with open events that precede the new event's time window
2. Merge the result with events that follow the new event's time window
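A sketch of the first step using the delta-spark Python API; the table names, columns, and match condition are hypothetical, and the second MERGE would follow the same pattern:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
new_events = spark.table("staged_events")  # hypothetical staging table

# Keep one closing SUCCESS per instance so each open event matches at most one source row.
closing = (
    new_events.where(F.col("status") == "SUCCESS")
    .groupBy("instance_id")
    .agg(F.min("event_time").alias("event_time"))
)

# Step 1: close open ERROR events that precede the new event's time window.
open_tbl = DeltaTable.forName(spark, "open_events")
(
    open_tbl.alias("open")
    .merge(
        closing.alias("new"),
        "open.instance_id = new.instance_id AND open.error_time <= new.event_time",
    )
    .whenMatchedUpdate(set={"closed_at": F.col("new.event_time")})
    .execute()
)

# Step 2 would be a second MERGE against events that follow the new event's
# time window. Each MERGE commits on its own, which is why atomicity across
# the two steps has to be sequenced (and made re-runnable) carefully.
```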

Conclusion

Building idempotent ETL jobs requires accepting an uncomfortable truth: you cannot assume data will arrive in order, on time, or only once. Design for this reality from the start.

In practice, most teams combine approaches: bounded lookback for daily efficiency, with periodic full scans for reconciliation. Whatever you choose, the principle remains the same: always assume your system is unreliable, and design accordingly.