Recommended data engineering content Jan-Feb 2024
Links I would have shared with my former data platform engineering team...
I realized it had been a while since I last shared content recommendations, so I had to cherry-pick some favorites. Some of the content may be regarded as “old” but is still worth reading in case you’ve missed it. Enjoy!
Are you an Elite DevOps performer? Find out with the Four Keys Project: This is a three-year-old post, but as a (former) manager of a data platform engineering team, the four key metrics it identifies really resonate with me, especially since data teams in general are relatively bad at identifying and monitoring relevant KPIs and metrics for themselves. I also love that they provide a script and guidelines for setting up the ETL pipeline, which makes it easy to get started monitoring Deployment Frequency, Lead Time for Changes, Change Failure Rate and Time to Restore Service.
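To make the four metrics concrete, here is a minimal sketch of how they could be computed from a list of deployment records. This is not the Four Keys project's own pipeline or queries; the `Deployment` record, its field names, the `four_keys` function and the sample data are all hypothetical, and the definitions are deliberately simplified (for example, deployment frequency is just deployments per day over a fixed window).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    commit_at: datetime                     # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_keys(deployments: list[Deployment], window_days: int) -> dict:
    """Compute simplified versions of the four key metrics for one time window."""
    if not deployments:
        raise ValueError("need at least one deployment")

    lead_times = [d.deployed_at - d.commit_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]

    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # None when nothing failed (or nothing has been restored yet)
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Made-up sample data: four deployments in one week, one of which failed.
SAMPLE = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 11)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 13),
               failed=True, restored_at=datetime(2024, 1, 2, 14)),
    Deployment(datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 10)),
    Deployment(datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 12)),
]

metrics = four_keys(SAMPLE, window_days=7)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

In a real setup the records would come from your CI/CD and incident tooling rather than a hard-coded list, and a data team would probably also want to decide what counts as a "deployment" and a "failure" for pipelines and data products.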
PyIceberg 0.6.0: Write support: I think this is one of the most interesting releases in the open source data engineering domain for quite some time (in addition to the development in DuckDB and Arrow). Apache Iceberg is perhaps the managed table format with the biggest momentum right now (followed by Delta and Hudi) and will play a central role in most multi-modal and open lakehouse implementations. Until now, however, writing to Iceberg tables has required a setup with a powerful query or processing engine (e.g. Spark, Flink or Trino). This release (AFAIK) makes it possible to write Iceberg tables without the need for a JVM.
Evidence - open source BI as code: This one looks really interesting and I will definitely give it a try at a hackathon or similar. It may not replace a self-serve BI tool for business users (if that is the path you have chosen), but it could be very useful for data analysts and analytics engineers who are comfortable with SQL and dbt and want to produce more guided and snappy reports.
AMIE: A research AI system for diagnostic medical reasoning and conversations: ChatGPT and all the LLMs that are currently hyped for generating text, images, sound and videos are fascinating. However, applying LLMs to aid clinicians is just mind-blowing, and the results (given the limitations in the test) are very promising.
Snowplow goes source available: Snowplow is great and it has been open source for more than 10 years, but end users are not the only ones hosting it themselves... I understand the rationale behind this move, but I think it is sad that yet another open source project goes source available. This will affect all users who run Snowplow in production and want to upgrade to versions released after January 8, 2024. I’m surprised I haven’t seen or heard many complaints among users; perhaps most of them saw this coming anyway.