April 2023 recommended content

Links shared with my data platform engineering team during April 2023

Apr 29, 2023

Good read on how HelloFresh makes ETL more unified and distributed through their low-code ETL tool. There are some similarities to our data integration pipelines, but HelloFresh runs everything in batch on a data lake on AWS, while we use streaming and a data warehouse on GCP.

Interesting use of DuckDB with dbt instead of Spark in data pipelines. It looks performant, but spinning up a 0.5TB instance also comes at a cost ($0.07/vCPU hour, which is similar to BigQuery?). With spot pricing, and if DuckDB really is more performant, it would be cheaper.

Cool BigQuery feature for custom CDC (upserts on streaming data via the Storage Write API), with an example in Python using Protobuf.
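
To make the upsert semantics concrete, here is a minimal in-memory sketch of what applying a change stream to a keyed table looks like. It does not call BigQuery at all. As far as I understand the feature, each streamed row carries a `_CHANGE_TYPE` pseudocolumn set to `UPSERT` or `DELETE`, and the table needs a primary key; the function name `apply_changes`, the `id` key and the sample rows are my own assumptions for illustration.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(table: Dict[Any, Row], changes: Iterable[Row], key: str = "id") -> Dict[Any, Row]:
    """Apply change rows to a copy of an in-memory table keyed on `key`.

    Hypothetical helper: it only mimics UPSERT/DELETE semantics locally.
    """
    result = dict(table)
    for change in changes:
        # Rows without an explicit change type are treated as upserts here (an assumption).
        change_type = change.get("_CHANGE_TYPE", "UPSERT")
        row = {k: v for k, v in change.items() if k != "_CHANGE_TYPE"}
        if change_type == "UPSERT":
            # Insert the row, or overwrite the existing row with the same key.
            result[row[key]] = row
        elif change_type == "DELETE":
            # Deleting a key that is not present is a no-op.
            result.pop(row[key], None)
        else:
            raise ValueError(f"Unknown change type: {change_type}")
    return result


# Made-up change stream: order 1 is created and updated, order 2 is created and deleted.
changes = [
    {"id": 1, "status": "new", "_CHANGE_TYPE": "UPSERT"},
    {"id": 2, "status": "new", "_CHANGE_TYPE": "UPSERT"},
    {"id": 1, "status": "paid", "_CHANGE_TYPE": "UPSERT"},
    {"id": 2, "_CHANGE_TYPE": "DELETE"},
]
final_table = apply_changes({}, changes)
```

The point is that the stream is append-only while the resulting table reflects only the latest state per key, which is what you would otherwise have to build yourself with a scheduled MERGE.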

Finally dbt is starting to deliver what the community has been talking about for the last 2-3 years. I am not sure it qualifies as a data contract if you ask the evangelists, but hey, it is a start, and it is good to be able to declare a table schema and avoid running null tests etc.

DuckDB has been getting a lot of attention over the last year, and many BI vendors use it in their services. Interesting read on why Rill chose to build on DuckDB.

Really interesting and valuable piece on cohort analysis and user retention
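
For anyone on the team who has not built one, a cohort retention table boils down to a small computation. This is a minimal sketch and not taken from the linked piece: it assumes monthly cohorts defined by each user's first activity month, and the `retention` function and the sample events are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def month_index(d: date) -> int:
    """Number of months since year 0, so month differences are simple subtraction."""
    return d.year * 12 + (d.month - 1)


def month_label(index: int) -> str:
    """Turn a month index back into a 'YYYY-MM' label."""
    return f"{index // 12}-{index % 12 + 1:02d}"


def retention(events):
    """Return {(cohort 'YYYY-MM', months since first activity): share of cohort active}."""
    # A user's cohort is the month of their first recorded activity.
    first_seen = {}
    for user, day in events:
        m = month_index(day)
        if user not in first_seen or m < first_seen[user]:
            first_seen[user] = m
    cohort_size = Counter(first_seen.values())

    # Collect the distinct users active in each (cohort, month offset) cell.
    active = defaultdict(set)
    for user, day in events:
        cohort = first_seen[user]
        active[(cohort, month_index(day) - cohort)].add(user)

    return {
        (month_label(cohort), offset): len(users) / cohort_size[cohort]
        for (cohort, offset), users in active.items()
    }


# Made-up activity log: (user_id, activity date).
events = [
    ("a", date(2023, 1, 5)),
    ("b", date(2023, 1, 20)),
    ("a", date(2023, 2, 3)),
    ("c", date(2023, 2, 10)),
    ("a", date(2023, 3, 1)),
    ("c", date(2023, 3, 15)),
]
table = retention(events)
```

In this toy data the January cohort has two users and only one of them comes back in later months, so its retention drops to 0.5 after month 0.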

Interesting comparison of GCP Config Connector and Terraform based on a small PoC. However, the PoC did not use Config Controller or kpt, which would have made Config Connector easier and more flexible. This is yet another example of Google being fantastic at engineering but limited at product management and sales. I think packaging Config Controller as a serverless service to orchestrate cloud resources would be a super interesting Infrastructure as Code (or Data?) offering; the biggest limitation IMO is the coverage of GCP services in Config Connector resources.

Data-diff from Datafold is one of the open-source tools I’m eager to try out at my team’s next hackday.

Keep your local repos clean with gh-poi (tip from Mohamad El-baba)

This is a really nice overview of the medallion architecture (bronze, silver, gold) and its best practices.
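
As a quick reminder of what the three layers mean in practice, here is a toy sketch in plain Python. It is my own illustration and not from the linked post: bronze holds raw records as ingested, silver holds cleaned, typed and deduplicated records, and gold holds a business-level aggregate. The field names, the helpers `to_silver` and `to_gold`, and the sample orders are all hypothetical.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived (strings, duplicates, bad rows).
bronze = [
    {"order_id": "1", "country": "se", "amount": "100.0"},
    {"order_id": "1", "country": "se", "amount": "100.0"},  # duplicate delivery
    {"order_id": "2", "country": "NO", "amount": "50.5"},
    {"order_id": "3", "country": "SE", "amount": None},  # missing amount
]


def to_silver(rows):
    """Clean, type and deduplicate bronze rows (keeps the first row per order_id)."""
    seen = set()
    silver = []
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append(
            {
                "order_id": int(row["order_id"]),
                "country": row["country"].upper(),
                "amount": float(row["amount"]),
            }
        )
    return silver


def to_gold(rows):
    """Aggregate silver rows into revenue per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

In a real lakehouse each layer would be a set of tables rather than Python lists, but the contract between the layers is the same.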

Lots of pro-Data Vault posts and talks over the last year, so it is refreshing to hear someone question building a Data Vault in a lakehouse architecture.

Building dbt CI/CD at scale. Good stuff from Checkout, they beat us to it (blogging about it)! I will try to get my team to write one about our setup, which has some similarities and definitely some really good learnings to share.

A creative and interesting way of enabling real-time pull and micro-batch Cloud Run execution.

Creative use of BigQuery remote functions! Sending chat messages, e.g. KPIs. Love the thinking outside the box.

Interesting take on organizational setup for effective MLOps, covering team setup and evolution across ML engineering, data science and MLOps. It resonates with my team's ongoing pivot towards a data and ML platform team. (tip from Mattias Liljenzin)

Great post by Voltron on deploying Arrow-native data storage and analytics stacks leveraging DuckDB, Arrow Flight and Superset as the interface.