
December 19, 2025
Microservice architectures (MSA) are inherently unreliable. Networks fail, services crash, and retries are inevitable. In this environment, idempotency isn't just a best practice; it's a survival mechanism for your ETL jobs. Here's why it matters and how to implement it.
An operation is idempotent if running it multiple times produces the same result as running it once. In the context of ETL, this means: no matter how many times a job runs, or in what order, the final output should be consistent and correct.
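To make the definition concrete, here is a toy Python sketch (all names are illustrative): a blind append double-counts on retry, while a keyed upsert converges to the same state no matter how often it runs.

```python
# Toy illustration: an append-only load is NOT idempotent,
# a keyed upsert IS. All names here are hypothetical.
events_list = []       # append-only sink
events_by_id = {}      # keyed sink

def load_append(row):
    events_list.append(row)               # a retry double-counts this row

def load_upsert(row):
    events_by_id[row["event_id"]] = row   # a retry rewrites the same key

row = {"event_id": "evt-1", "instance": "i-01", "duration_sec": 30}
for _ in range(2):                        # simulate a retried job run
    load_append(row)
    load_upsert(row)

assert len(events_list) == 2    # duplicated: not idempotent
assert len(events_by_id) == 1   # unchanged on rerun: idempotent
```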
Why does this matter? Let me illustrate with a real-world example:
Suppose your system aggregates downtime events, where each row represents a single event. The raw data contains events from multiple instances, each tagged with an error code.
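For the examples below, assume the raw input looks roughly like this (schema and values are hypothetical):

```python
# Hypothetical raw downtime events; column names and values are illustrative.
raw_events = [
    {"instance_id": "i-001", "error_code": "E42", "status": "DOWN",
     "event_time": "2025-12-01T09:00:00Z"},
    {"instance_id": "i-001", "error_code": "E42", "status": "UP",
     "event_time": "2025-12-01T09:05:00Z"},
    {"instance_id": "i-002", "error_code": "E17", "status": "DOWN",
     "event_time": "2025-12-01T09:02:00Z"},
    # the matching UP for i-002 may arrive late, out of order, or twice
]
```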
In a distributed system, these events can arrive late, out of order, or more than once. It is therefore important to ensure your ETL jobs make no assumptions about the order or completeness of their input.
The most straightforward solution is to periodically scan all historical events and recompute the aggregation from scratch. This approach is inherently idempotent; since it recomputes everything, the result is always consistent regardless of how many times the job runs.
The downside is obvious: as your data grows, this full recomputation becomes increasingly expensive.
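As a sketch of this full-scan recompute, assuming PySpark and the hypothetical schema above, consecutive events can be paired per instance with a LAG window function:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Full scan: read ALL historical events every run (idempotent but expensive).
events = spark.read.table("raw_downtime_events")  # hypothetical table name

w = Window.partitionBy("instance_id").orderBy("event_time")

downtimes = (
    events
    .withColumn("prev_status", F.lag("status").over(w))
    .withColumn("prev_time", F.lag("event_time").over(w))
    # an UP event whose previous event was DOWN closes one downtime interval
    .where((F.col("status") == "UP") & (F.col("prev_status") == "DOWN"))
    .select(
        "instance_id",
        "error_code",
        F.col("prev_time").alias("down_at"),
        F.col("event_time").alias("up_at"),
        (F.unix_timestamp("event_time") - F.unix_timestamp("prev_time"))
            .alias("duration_sec"),
    )
)

# Overwrite the aggregate atomically; rerunning produces the same table.
downtimes.write.mode("overwrite").saveAsTable("downtime_intervals")
```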
Instead of scanning everything, define a fixed window (e.g., 7 days) based on your requirements and scan all records within that window.
For example, if the majority of downtime events are shorter than a week, this is a good fit: simple and efficient.
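A sketch of the bounded-lookback variant under the same assumptions; the `replaceWhere` overwrite is a Delta Lake option, so a rerun replaces the affected slice instead of duplicating it:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

LOOKBACK_DAYS = 7  # chosen because most downtimes are shorter than a week

# Scan only the lookback window instead of the full history.
recent = (
    spark.read.table("raw_downtime_events")  # hypothetical table name
    .where(F.col("event_time") >= F.date_sub(F.current_date(), LOOKBACK_DAYS))
)

w = Window.partitionBy("instance_id").orderBy("event_time")
intervals = (
    recent
    .withColumn("prev_status", F.lag("status").over(w))
    .withColumn("prev_time", F.lag("event_time").over(w))
    .where((F.col("status") == "UP") & (F.col("prev_status") == "DOWN"))
    .select(
        "instance_id",
        F.col("prev_time").alias("down_at"),
        F.col("event_time").alias("up_at"),
    )
)

# Overwrite only the slice this window can affect, so a rerun replaces
# (rather than duplicates) previous output. replaceWhere on non-partition
# columns requires a recent Delta Lake version; adjust for your sink.
(intervals.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere",
            f"up_at >= date_sub(current_date(), {LOOKBACK_DAYS})")
    .saveAsTable("downtime_intervals"))
```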
One critical issue is false positives. If data is missing in the middle of the window, the job might compute an incorrect downtime duration. If your system has strict accuracy requirements, these edge cases must be handled carefully.
Another approach is to maintain a separate table of unclosed ("open") events. When new data arrives, join it only against the open events, not the entire history.
This adds complexity to the system, but produces more robust results. This pattern is common in streaming systems like Flink and Spark Structured Streaming, often called "stateful processing."
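One possible shape of this pattern with the delta-spark Python API; the table names, schema, and matching conditions are assumptions, and deduplication of the incoming batch is elided:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

batch = spark.read.table("raw_downtime_events_new")    # hypothetical staging input
open_dt = DeltaTable.forName(spark, "open_downtimes")  # one row per unclosed DOWN

# Close open intervals: an incoming UP matches the open DOWN for the same
# instance. Assumes the batch is deduplicated so at most one UP row
# matches each open interval.
(open_dt.alias("o")
    .merge(
        batch.where(F.col("status") == "UP").alias("n"),
        "o.instance_id = n.instance_id AND o.up_at IS NULL "
        "AND n.event_time >= o.down_at")
    .whenMatchedUpdate(set={"up_at": "n.event_time"})
    .execute())

# Open new intervals: DOWN events not already tracked. The merge key makes
# re-processing the same batch a no-op, keeping the job idempotent.
(open_dt.alias("o")
    .merge(
        batch.where(F.col("status") == "DOWN").alias("n"),
        "o.instance_id = n.instance_id AND o.down_at = n.event_time")
    .whenNotMatchedInsert(values={
        "instance_id": "n.instance_id",
        "error_code": "n.error_code",
        "down_at": "n.event_time",
        "up_at": "CAST(NULL AS TIMESTAMP)",
    })
    .execute())
```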
If your data is stored in Delta Lake, this pattern introduces another challenge. Since Delta Lake MERGE only supports merging between two tables, ensuring atomicity can require multiple steps, sketched below:
1. Merge new events with open events that precede the new event's time window
2. Merge the result with events that follow the new event's time window
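A rough sketch of those two steps with delta-spark; because each MERGE commits as its own transaction, both steps must themselves be safe to rerun if the job dies between them. All names and conditions are illustrative:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

state = DeltaTable.forName(spark, "open_downtimes")       # hypothetical
new_events = spark.read.table("raw_downtime_events_new")  # hypothetical

# Step 1: merge the new batch against open intervals that started BEFORE
# the batch's window (the batch may close them). Assumes a deduplicated
# batch so at most one source row matches each open interval.
(state.alias("s")
    .merge(new_events.alias("n"),
           "s.instance_id = n.instance_id "
           "AND s.up_at IS NULL AND s.down_at <= n.event_time")
    .whenMatchedUpdate(set={"up_at": "n.event_time"})
    .execute())

# Step 2: merge the result against events AFTER the batch's window, e.g. a
# late-arriving DOWN that an already-stored UP should close.
later = spark.read.table("events_after_window")           # hypothetical
(state.alias("s")
    .merge(later.alias("l"),
           "s.instance_id = l.instance_id AND s.up_at IS NULL")
    .whenMatchedUpdate(set={"up_at": "l.event_time"})
    .execute())

# No transaction spans both MERGEs; if the job fails between them, the
# next run must converge to the same final state (idempotency again).
```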
Building idempotent ETL jobs requires accepting an uncomfortable truth: you cannot assume data will arrive in order, on time, or only once. Design for this reality from the start.
In practice, most teams combine approaches: bounded lookback for day-to-day efficiency, with periodic full scans for reconciliation. Whatever you choose, the principle remains the same: always assume your system is unreliable, and design accordingly.