Been part of a couple of large data teams at one of the Magnificent 7, each serving orgs running tens of billions of dollars in revenue and owning hundreds of mission-critical core workflows and data products.
What did I learn?
Trying to get full control of deep data lineages is a lost battle. There is too much entropy to hope for a clean, deterministic system.
Fixing the common root cause of recurring issues is a must, but not always realistic (at least in the short term).
Instead, the most effective way is to persist the tribal knowledge and make it easily accessible (searchable) by people under duress:
* In the incident ticket, document each step of the investigation and what was done to fix and test (this can be done postmortem).
* Comment the code in common-sense plain English that can be understood by someone who has just been paged at 3:00 am (make this a priority in code review).
* Have written standard operating procedures for each main category of incident; revise them often (a core part of each sprint review).
* VERY IMPORTANT: Make sure all this persisted tribal knowledge is searchable, ideally from one place (never tried it, but I suspect tuning a decent off-the-shelf LLM on your code and incident history would be very cool; see the sketch after this list).
* MOST IMPORTANT (AND THE HARDEST ONE):
Encourage all your teammates to do due diligence in writing all of the above.
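To make the "searchable from one place" idea concrete, here is a minimal sketch of a single index over tickets, SOPs and postmortems. It uses plain TF-IDF similarity as a stand-in for the LLM tuning mentioned above, and every document and query in it is hypothetical:

```python
# Minimal sketch: one searchable index over incident tickets, SOPs and postmortems.
# Uses TF-IDF keyword similarity; documents and queries below are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: one entry per persisted piece of tribal knowledge.
documents = [
    "INC-1042: upstream table arrived empty, backfilled partition and reran the DAG",
    "SOP: late-arriving data in the revenue pipeline - check ingestion lag, then rerun the staging job",
    "INC-0978: schema change in source API broke the parser, pinned schema version and added a contract test",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)

def search(query: str, top_k: int = 3):
    """Return the most similar knowledge-base entries for a free-text query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(documents[i], float(scores[i])) for i in ranked if scores[i] > 0]

# The kind of lookup someone paged at 2 am might run.
for doc, score in search("empty upstream partition, pipeline produced no rows"):
    print(f"{score:.2f}  {doc}")
```

The point is less the retrieval technique and more that everything lives behind one query box; swapping TF-IDF for embeddings or an LLM is an implementation detail.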
My experience is that this takes time (6 months to a year). You start by doing the extra work yourself on each on-call rotation and let people see how useful it is, at 2 am, to find that 80% of your problem is very similar to one that happened a few months ago.
Awesome, Cezar, and thank you for sharing your experience and advice; I think you are spot on. I've seen teams keep a Google Sheet as a lookup for error messages and the steps to resolve them. I would like to see that directly in my incident management tool if possible, rather than in yet another tab with more context switching.
Nice read! I'm wondering, though -- apart from the tools, what aspects of this are actually data-specific, and what are simply the challenges of software incident management? If, say, the application's transactional database starts getting hosed (and you don't know why), aren't engineers forced to do basically the same process?
In general I think data incident management is a bigger challenge due to highly interdependent components, indirect and far-reaching consequences, fragmented ownership, and the complexity of ensuring data integrity across the entire pipeline. For example, the complexity of data lineage means that a single issue can impact multiple downstream systems, making it harder to trace and resolve than in more isolated software components. Data issues often manifest as silent failures: pipelines may continue running with incorrect data, causing delayed but widespread consequences. Ownership is also more fragmented, with multiple teams interacting with the same datasets, which can make it difficult to determine who is responsible for fixing the root cause. Additionally, data problems can affect critical business processes, such as decision-making tools or customer-facing products, requiring careful prioritization and clear communication with stakeholders. And unlike in software systems, where fixing the code is often enough, data issues may also corrupt historical datasets, adding the complexity of cleaning and restoring past data to an accurate state. Those are some of the things that immediately strike me, but with that said, there is much more in common than different between data and software incident management.
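On the silent-failure point, a lightweight data quality gate at the end of a pipeline run is often enough to turn quiet corruption into a loud, traceable incident. A minimal sketch, where the column names and thresholds are made up for illustration:

```python
# Minimal sketch of catching a "silent failure": the pipeline ran, but the data
# it produced is suspect. Column names and thresholds are hypothetical.
import pandas as pd

def validate_daily_load(df: pd.DataFrame, expected_min_rows: int = 1000) -> list[str]:
    """Return human-readable problems; an empty list means the load looks healthy."""
    problems = []
    if len(df) < expected_min_rows:
        problems.append(f"row count {len(df)} below expected minimum {expected_min_rows}")
    null_rate = df["revenue"].isna().mean()
    if null_rate > 0.01:
        problems.append(f"revenue null rate {null_rate:.1%} exceeds 1% threshold")
    if (df["revenue"] < 0).any():
        problems.append("negative revenue values found")
    return problems

# Fail loudly instead of letting bad data flow downstream.
df = pd.DataFrame({"revenue": [120.0, None, 95.5]})
issues = validate_daily_load(df, expected_min_rows=3)
if issues:
    raise ValueError("data quality check failed: " + "; ".join(issues))
```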
Good read.
This is an annoyingly good read