DataHem odyssey - the evolution of a data…

Jan 19, 2024

Unified platform and data contracts

3 Comments

Feb 4, 2024

Nice writeup! The YAML -> Pulumi -> BigQuery flow is the same as the example I created for my book, so it's interesting to read about it in production (and again, interesting that similar ideas were created independently around the same time).

You mentioned that you used the medallion architecture, and the data platform team owned the Bronze and Silver layer. Were the data contracts you describe used to present the data to the analytics engineers working in the Gold layer?

Were data contracts also used as the data was ingested from the source systems (events, DB queries, files, CDC)? And if so, how did you encourage adoption there?


Robert Sahlin

Feb 6, 2024

Thanks Andrew. The contracts had a section that defined which representations (log, replica, realtime view, etc.) should be created in the silver layer and at what cadence new data should be merged (tables were incremental). The dbt model was generated automatically from the contract. But the transformations/modelling in the gold layer were not defined in the contracts or automated, as they required more manual modelling, naming, etc.
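To make the contract-to-dbt step concrete, here is a minimal sketch of how a representations section might drive model generation. The contract layout, field names (`primary_key`, `ordering_column`, `merge_cadence`), the `bronze` source name and both helper functions are assumptions for illustration, not the actual DataHem format. The contract is shown as the dict a YAML file might parse into, and only the `log` and `replica` representations are handled.

```python
# Hypothetical contract, shown as the dict a YAML file might parse into.
CONTRACT = {
    "name": "orders",
    "primary_key": "order_id",
    "ordering_column": "updated_at",
    "representations": [
        {"kind": "log", "merge_cadence": "hourly"},
        {"kind": "replica", "merge_cadence": "daily"},
    ],
}


def render_dbt_model(contract: dict, representation: dict) -> str:
    """Render one incremental dbt model for a single representation."""
    kind = representation["kind"]
    if kind not in ("log", "replica"):
        raise ValueError(f"unsupported representation: {kind}")

    options = ["materialized='incremental'"]
    if kind == "replica":
        # A replica keeps one row per key, so dbt merges on the primary key.
        options.append("unique_key='" + contract["primary_key"] + "'")
    # The cadence becomes a tag that a scheduler could select on (assumption).
    options.append("tags=['" + representation["merge_cadence"] + "']")

    col = contract["ordering_column"]
    lines = [
        "{{ config(" + ", ".join(options) + ") }}",
        "",
        "select *",
        "from {{ source('bronze', '" + contract["name"] + "') }}",
        "{% if is_incremental() %}",
        "where " + col + " > (select max(" + col + ") from {{ this }})",
        "{% endif %}",
    ]
    return "\n".join(lines)


def render_all(contract: dict) -> dict:
    """Map a file name to the generated SQL for every representation."""
    return {
        f"{contract['name']}_{rep['kind']}.sql": render_dbt_model(contract, rep)
        for rep in contract["representations"]
    }


if __name__ == "__main__":
    for filename, sql in render_all(CONTRACT).items():
        print(f"-- {filename}\n{sql}\n")
```

In this sketch the log model is append-only, while the replica model sets `unique_key` so that dbt merges new rows into existing ones. A realtime view would need a different materialization and is left out.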

The contracts started off as a means to automate the onboarding of new data streams (i.e. data ingestion) while ensuring data quality. From there we discovered more use cases, such as supporting transformations. Adoption was encouraged by:

- a decision to use data contracts

- data producers being interested in analytics of their own data (data scientists embedded in teams)

- removing as much friction as we could: switching to YAML was one thing, creating a CLI to generate contracts from data objects in code was another, and providing support was a third. The contracts also improved data quality in the operational systems, which made them valuable in that regard too. CI/CD gave immediate feedback on contracts and on the data they were supposed to validate.
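As a rough illustration of the kind of immediate feedback a CI step could give, the sketch below checks sample records against a contract's field definitions and reports every violation at once. The schema layout and the `validate_record` helper are hypothetical; the real pipeline may have used a different schema language and tooling.

```python
# Hypothetical field definitions taken from a contract.
SCHEMA = {
    "order_id": {"type": str, "required": True},
    "amount": {"type": float, "required": True},
    "coupon": {"type": str, "required": False},
}


def validate_record(record: dict, schema: dict) -> list[str]:
    """Return human-readable violations; an empty list means the record passes."""
    errors = []
    for field, rule in schema.items():
        value = record.get(field)
        if value is None:
            # Missing or null: only a problem when the contract requires it.
            if rule["required"]:
                errors.append(f"{field}: missing required field")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(
                f"{field}: expected {rule['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the producer sends but the contract does not declare.
    for field in record:
        if field not in schema:
            errors.append(f"{field}: not declared in contract")
    return errors


if __name__ == "__main__":
    sample = {"order_id": 1, "extra": True}
    for problem in validate_record(sample, SCHEMA):
        print(problem)
```

Collecting all violations instead of failing on the first one keeps the feedback loop short for the data producer, who can fix the contract or the payload in a single pass.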

We had a number of planned activities to reduce friction even more, and we took a platform engineering perspective to support stakeholders. I think providing the oil for the machinery is as important as the machinery itself when it comes to getting adoption.


Andrew Jones

Feb 12, 2024

Interesting, thanks for the followup!

Again, very similar to what we found.

The teams who were most keen to adopt were those that worked more closely with data and/or had data scientists embedded.

And we had to keep the friction relatively low and/or build tooling that data producers wanted to use (I wrote a bit about that here, and this was pretty much quoting one of our principal software engineers: https://medium.com/gocardless-tech/3-things-our-software-engineers-love-about-data-contracts-3106e1f1602d).


Data Platform Engineering