This is a re-post of my contribution to a post by Heroes of Data.
What's an example of how your organization uses semi-structured/nested data (for analytics purposes) today?
A majority of the analytical data in our data platform at Mathem is semi-structured/nested JSON that we ingest in streaming mode. These messages represent events or entities from our microservices, which are built on top of Lambda and DynamoDB, and they often contain nested/repeated structures.
One example of such an object is the order entity. It contains not only the order header, as is customary for an RDBMS table, but also the order lines. These can in turn contain fields related to product, discount, etc. Each change to an entity generates a new version that is published to our data platform, which results in a history of immutable order records.
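As a minimal sketch of the shape described above, a versioned order entity might look like this (the field names are invented for illustration, not Mathem's actual schema):

```python
# A hypothetical nested order entity: order header plus repeated,
# nested order lines (field names are illustrative only).
order_v1 = {
    "orderId": "o-1001",
    "version": 1,
    "status": "placed",
    "orderLines": [  # repeated/nested structure
        {"productId": "p-1", "quantity": 2,
         "discount": {"code": "SUMMER", "amount": 5.0}},
        {"productId": "p-2", "quantity": 1, "discount": None},
    ],
}

# Each change produces a new immutable version rather than an
# update in place, giving a full history of the entity.
order_v2 = {**order_v1, "version": 2, "status": "picked"}

history = [order_v1, order_v2]
latest = max(history, key=lambda o: o["version"])
print(latest["version"], latest["status"])  # -> 2 picked
```

Because every version is kept, the latest state is just the record with the highest version number, and the full change history remains queryable.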
Since we do not flatten the order entity, we avoid complex joins across multiple tables to recreate business logic that already exists in the application layer of the operational source system. And since our data warehouse (BigQuery) has native support for nested and repeated structures, we can stream records into the warehouse with a schema already in the raw layer, making them immediately available for real-time analysis. In addition, analysts don't have to join multiple tables to get what they need, and the structures are very intuitive. Another benefit of streaming semi-structured data to the data platform is that it makes streaming analytics much easier, as joins on streaming data are a different beast than joins on data at rest.
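The difference for analysts can be sketched in plain Python (the real work happens in the warehouse, and the tables and fields here are hypothetical): with the nested record an order total falls out of a simple traversal, whereas a flattened model first has to join a heads table back to a lines table.

```python
# Sketch: computing an order total from a nested record vs. from
# flattened tables. Illustrative only; field names are invented.
order = {
    "orderId": "o-1001",
    "orderLines": [
        {"productId": "p-1", "quantity": 2, "unitPrice": 10.0},
        {"productId": "p-2", "quantity": 1, "unitPrice": 4.0},
    ],
}

# Nested: the lines travel with the order, so no join is needed.
total_nested = sum(l["quantity"] * l["unitPrice"] for l in order["orderLines"])

# Flattened: the same data split into two tables must be joined back first.
order_heads = [{"orderId": "o-1001"}]
order_lines = [
    {"orderId": "o-1001", "productId": "p-1", "quantity": 2, "unitPrice": 10.0},
    {"orderId": "o-1001", "productId": "p-2", "quantity": 1, "unitPrice": 4.0},
]
total_flat = sum(
    l["quantity"] * l["unitPrice"]
    for h in order_heads
    for l in order_lines
    if l["orderId"] == h["orderId"]  # the join condition
)

assert total_nested == total_flat == 24.0
```

The two computations give the same answer, but the nested form keeps the join implicit in the record itself, which is also what makes it practical in a streaming context.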
What role do you envision that semi-structured/nested data will have in the future "Modern Data Stack"?
I don't know to what degree it matters, if any, but both source systems (RDBMS) and client systems (spreadsheets) have historically been mostly tabular, so it makes sense that most analytical systems have been designed for tabular data. But I think semi-structured data will become more common because:
Companies embrace microservice architectures with NoSQL databases as the storage layer, and enable CDC on those databases or tap into the event bus that routes the messages (JSON, Protobuf, Avro) exchanged by the microservices
Companies start publishing analytical events that are decoupled from the storage layer
Data is often fetched from third-party (SaaS) sources over REST/gRPC APIs that return nested structures (i.e. JSON)
This is part of a movement towards (distributed) event-driven and streaming data architectures. It's also due to the fact that the analytical system to a large degree reflects the corresponding operational system. I think that is true not only in terms of technology but also of organization, skill sets, and processes, as we see the data domain picking up software engineering best practices one by one. Also, much more data consumption will be done continuously by machines to operationalize analytical data; it is no longer limited to humans using tabular data in BI systems.
What hindrances do you foresee that will slow down the adoption of semi-structured/nested data in analytics pipelines?
I think the biggest challenge is rarely technology. What takes time is usually changing processes, organizations, and people. We need to educate data roles across the whole data lifecycle on how and when to take advantage of semi-structured data.
In terms of technology, I would like to see more examples and best practices for modeling semi-structured data; out of the box it is close to One Big Table (OBT), which is not uncommon for data marts. I also want better support for nested data in BI tools. Malloy is one very interesting initiative, but it is still immature for production use. Tooling and services are gaining better support for nested structures, but data warehouses in general still have limited support; I don't think the JSON data type is enough. The fact that BigQuery and Dataflow have great support for nested structures and streaming ingest are two major reasons we use them as the heart of our data platform.