80% of the data we stream into our data warehouse is produced by our microservices as events in JSON format. This makes BigQuery a great match, since it natively supports nested structures (much more useful than the JSON data types offered by alternative data warehouses). However, there is one relatively common complex data type in JSON that isn’t supported out of the box in BigQuery: maps.
That is a challenge, but it is also why I love that we are using Apache Beam running on Dataflow: it is open source and “high code”, so we can fix it ourselves. All our messages are serialized to Beam Rows before we apply the logic expressed in our data contracts (e.g. PII tokenization) and write them to BigQuery (or Pub/Sub for streaming SQL). Beam Row supports the Map data type and knows how to write it to BigQuery, so we only had to add support for maps in the serialization step, i.e. JSON → Beam Row, which we have done (we will see if it is accepted as a contribution to Beam).
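As a minimal sketch of that serialization target (using the Beam Java SDK; the product fields and values here are invented for illustration and this is not our actual converter code), this is how a map-typed field can be declared and populated in a Beam Schema and Row:

```java
import java.util.Map;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;
import org.apache.beam.sdk.values.Row;

public class MapRowSketch {
  public static void main(String[] args) {
    // Schema for the map values (a hypothetical "product" record).
    Schema productSchema = Schema.builder()
        .addStringField("name")
        .addDoubleField("price")
        .build();

    // Top-level message schema: "products" is a MAP<STRING, ROW>.
    Schema messageSchema = Schema.builder()
        .addMapField("products", FieldType.STRING, FieldType.row(productSchema))
        .build();

    // Build the Row the way a JSON -> Row converter would after parsing.
    Row milk = Row.withSchema(productSchema).addValues("Milk", 1.99).build();
    Row oatmeal = Row.withSchema(productSchema).addValues("Oatmeal", 3.49).build();
    Row message = Row.withSchema(messageSchema)
        .addValue(Map.of("milk", milk, "oatmeal", oatmeal))
        .build();

    System.out.println(message);
  }
}
```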
The way to translate a map to BigQuery is as a REPEATED RECORD with a key field (STRING) and a value field (RECORD). Hence we can now stream JSON records containing map data types to BigQuery, and also to Dataflow SQL.
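For illustration, here is roughly what that translation gives as a BigQuery table schema, expressed with the BigQuery API model classes that Beam’s BigQueryIO uses (the products field and its value sub-fields are hypothetical):

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.List;

public class MapTableSchemaSketch {
  // A map field becomes a REPEATED RECORD of (key, value) pairs.
  public static TableSchema messageSchema() {
    TableFieldSchema value = new TableFieldSchema()
        .setName("value")
        .setType("RECORD") // the map's value type, itself a record
        .setFields(List.of(
            new TableFieldSchema().setName("name").setType("STRING"),
            new TableFieldSchema().setName("price").setType("FLOAT")));

    TableFieldSchema products = new TableFieldSchema()
        .setName("products")
        .setType("RECORD")
        .setMode("REPEATED") // one element per map entry
        .setFields(List.of(
            new TableFieldSchema().setName("key").setType("STRING"),
            value));

    return new TableSchema().setFields(List.of(products));
  }
}
```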
Below you see an example of a message where products is a map containing two products, with the keys milk and oatmeal respectively. Instead of a schema that would have to contain a nested record for each product (i.e. milk and oatmeal in this case), we translate it to a generic, repeated key-value structure.
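A sketch of what that looks like (the order_id field and the product attributes are invented for illustration; only the milk and oatmeal keys come from the example above). An incoming message:

```json
{
  "order_id": "1234",
  "products": {
    "milk": {"name": "Milk", "price": 1.99},
    "oatmeal": {"name": "Oatmeal", "price": 3.49}
  }
}
```

After translation it lands in BigQuery as a generic, repeated key-value structure:

```json
{
  "order_id": "1234",
  "products": [
    {"key": "milk", "value": {"name": "Milk", "price": 1.99}},
    {"key": "oatmeal", "value": {"name": "Oatmeal", "price": 3.49}}
  ]
}
```

Individual entries can then be looked up with standard SQL, e.g. `SELECT p.value.price FROM events, UNNEST(products) AS p WHERE p.key = 'milk'`.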
We also benefit from using Beam (open source and “high code”) to handle a few other data types, but that will be another post.