Recommended content Aug - Oct 2023
Links shared with/by my data platform engineering team during Aug - Oct 2023
I realized it has been a while since I last shared content recs from my team, so the list is quite long this time (and still not complete). Some content may be regarded as “old” but is included in case you’ve missed it. Enjoy!
Useful BigQuery feature to reduce window declarations. (Jonathan)
https://medium.com/data-engineers-notes/tidying-up-window-functions-in-bigquery-with-named-windows-1f5e76198ad6
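To give a feel for what a named window looks like, here is a minimal sketch. It runs against an in-memory SQLite database (3.28 or newer) so it works locally without a BigQuery project; the `WINDOW w AS (...)` clause used here is also accepted by BigQuery. The `sales` table and its columns are made up for illustration.

```python
import sqlite3

# In-memory database with a tiny, hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", 1, 10), ("north", 2, 20), ("south", 1, 5), ("south", 2, 15)],
)

# The window is declared once as `w` and reused by both functions,
# instead of repeating PARTITION BY ... ORDER BY ... in every OVER clause.
query = """
SELECT region,
       day,
       SUM(amount)  OVER w AS running_total,
       ROW_NUMBER() OVER w AS row_num
FROM sales
WINDOW w AS (PARTITION BY region ORDER BY day)
ORDER BY region, day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Changing the partitioning or ordering then only requires touching the single `WINDOW` declaration.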
There's now a 1.7 version of dbt (core release, bq release), which adds support for BigQuery materialized views. (Kasper)
Great podcast episode from MLOps.community, especially if you are an IC who has moved into a lead/manager role within data. (Mattias)
The BigQuery Storage Write API now allows DML statements on recently streamed data, including rows still in the streaming buffer: https://cloud.google.com/bigquery/docs/write-api#use_data_manipulation_language_dml_with_recently_streamed_data
Entertaining read! I’m afraid many can relate and have had somewhat similar experiences, though not at this level: https://ludic.mataroa.blog/blog/i-accidentally-saved-half-a-million-dollars/
A look at the evolution of data platforms into machine learning (ML) platforms, emphasizing the importance of integrating ML seamlessly into existing data infrastructure. https://towardsdatascience.com/from-data-platform-to-ml-platform-4a8192edab5d
Data contracts with rule sets, using CEL and JSONata for data quality checks and major version migrations in the Kafka schema registry. https://www.confluent.io/blog/data-contracts-confluent-schema-registry/
Malloy is pretty cool and becoming more mature; I wonder when it will become a service. This post covers a bump chart created by Malloy Data that visualizes how rankings change over a given time frame. https://malloydata.github.io/blog/2023-10-26-malloy-bump-chart/
Ruff formatter, an extremely fast Python formatter written in Rust. https://astral.sh/blog/the-ruff-formatter
The challenges faced in platform engineering - various issues and pitfalls that can arise during the development and maintenance of a platform. https://thenewstack.io/9-steps-to-platform-engineering-hell/ (Mattias)
BigQuery clustered tables lower the cost of MERGE by ~95%, even when not matching on the clustered column. No one seems to know why :) We can verify that this is the case: we saw bytes scanned in MERGE statements drop by 90%!
https://github.com/dbt-labs/dbt-core/issues/2196 (Johan)
Stacked diffs look really interesting. I had never heard of them before (but I don’t come from a SW engineering background).
BigQuery TABLESAMPLE!!! It has been around since 2021 in preview! How did I miss this? https://cloud.google.com/bigquery/docs/table-sampling
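For anyone else who missed it, the clause is just appended to the table reference. Below is a small sketch that builds such a query string; the helper name `sampled_query` and the table `my_project.my_dataset.events` are hypothetical, and actually running the query against BigQuery (for example with the client library) is left out.

```python
def sampled_query(table: str, percent: float) -> str:
    """Build a BigQuery query that reads a random sample of a table.

    `table` is a fully qualified name such as project.dataset.table
    (hypothetical in this example); `percent` must be between 0 and 100.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # TABLESAMPLE SYSTEM picks a random subset of the table's data blocks,
    # so the share of rows returned is only approximately `percent`.
    return f"SELECT * FROM `{table}` TABLESAMPLE SYSTEM ({percent:g} PERCENT)"


query = sampled_query("my_project.my_dataset.events", 10)
print(query)
```

Because sampling happens per data block, small tables that fit in a single block may come back in full or not at all.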
A really good Python course (completely free) that the team is currently studying together: https://github.com/dabeaz-course/python-mastery
dbt dry run capabilities for BigQuery: https://github.com/autotraderuk/dbt-dry-run (Jonathan)