March 2023 recommended content
Links shared with my data platform engineering team during March 2023
My monthly summary of links to great and interesting data and ML engineering content that my team members and I share with each other has been appreciated on LinkedIn, so I thought I would re-post the summaries here as well. First, a recap from March.
Using a data contracts service in a data mesh
Jordan Tigani (MotherDuck, BigQuery) questions whether we really need distributed scale-out data warehouses, or whether a single instance would be enough for 99% of our use cases.
Data and ML playbook: a really nice overview and a solid architecture.
Confluent’s recommendations and best practices for event design (tip from Johan Ekegren Gunnarsson)
An interesting technical read about BigQuery's latest performance improvements
Surprisingly good paper from Google! Three pillars for building a modern data strategy.
Nice example of using Dataform plus BigQuery ML
Using GCP Data Catalog extensively? Tag Engine to the rescue!
Is MLOps mostly data engineering? (tip from Mattias Liljenzin)
GCP has finally released support for schema evolution in Pub/Sub. It looks quite nice, though it only covers Protobuf and Avro schemas.
Is SQLMesh how dbt should have been built, if it were done today?
Facebook built their own data modeling approach, similar to how BigQuery works. It was driven by challenges with data volumes, but along the way they found many benefits in using nested structures already in the bronze/raw layer rather than star schemas.
I have only skimmed this white paper, but it looks really good and worth reading: Building the analytics lakehouse on Google Cloud
Another really interesting piece from Meta on their evolution in data engineering
Another post on denormalized data modeling, but from a different angle than Meta's: Entity-Centric modeling
Some really useful queries for BigQuery cost optimization