Discussion about this post

Cezar Mihaila

Good read.

I've been part of a couple of large data teams at one of the Magnificent 7, each serving orgs running tens of billions of dollars in revenue and owning hundreds of mission-critical core workflows and data products.

What did I learn?

Trying to get full control of deep data lineages is a losing battle. There is too much entropy to hope for a clean, deterministic system.

Fixing the common root cause of recurring issues is a must, but it is not always realistic (at least in the short term).

Instead, the most effective approach is to persist the tribal knowledge and make it easily accessible (searchable) to people under duress:

* In the incident ticket, document each step taken in the investigation and what was done to fix and test (this can be done postmortem).

* Comment the code in common-sense plain English, so it can be understood by someone who has just been paged at 3:00 am (a priority in code commit review).

* Have written standard operating procedures for each main category of incident; revise them often (a core part of each sprint review).

* VERY IMPORTANT: Make sure that all this persisted tribal knowledge is searchable, ideally from one place (I have never tried it, but I guess tuning a decent off-the-shelf LLM on your code and incident history would be very cool).

* MOST IMPORTANT (AND THE HARDEST ONE): Encourage all your team buddies to do their due diligence in writing all the stuff above.

In my experience this takes time (6 months to 1 year). You start by doing the extra work yourself on each on-call rotation and let people see how useful it is, at 2 am, to find that 80% of your problem is very similar to another one that happened a few months ago.
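
To make the "searchable from one place" idea concrete, here is a minimal sketch of ranking past incident write-ups against a new symptom description. It uses plain word-overlap (cosine similarity over word counts) instead of an LLM, and everything in it is an assumption for illustration: the ticket IDs, the notes, and the `find_similar` helper are hypothetical, not a description of any real system.

```python
import math
import re
from collections import Counter

# Hypothetical incident notes, keyed by a made-up ticket ID.
INCIDENTS = {
    "INC-101": "orders_daily load failed: upstream partition missing, "
               "reran ingestion job after backfill",
    "INC-102": "dashboard latency spike caused by warehouse queue "
               "saturation, scaled cluster",
    "INC-103": "schema change in payments feed broke parser, "
               "added nullable column handling",
}


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_similar(query, incidents, top_n=3):
    """Return up to top_n (ticket_id, score) pairs, best match first.

    Incidents with no words in common with the query are dropped.
    """
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(notes))), ticket_id)
        for ticket_id, notes in incidents.items()
    ]
    scored.sort(reverse=True)
    return [(tid, round(score, 2)) for score, tid in scored[:top_n] if score > 0]


if __name__ == "__main__":
    symptom = "orders_daily failed because partition missing upstream"
    for ticket_id, score in find_similar(symptom, INCIDENTS):
        print(ticket_id, score)
```

Even something this crude only works if the tickets contain the investigation steps in the first place, which is why the writing habit matters more than the search tool.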

Stephen Bailey

Nice read! I'm wondering, though: apart from the tools, what aspects of this are actually data-specific, and what are simply the challenges of software incident management? If, say, the application's transactional database starts getting hosed (and you don't know why), aren't engineers forced to go through basically the same process?
