SELECT BigQuery_dataset WHERE physical_storage_cost < logical_storage_cost
And save 80% in BigQuery storage cost!
Google Cloud Platform has introduced an alternative BigQuery storage billing model based on the physical compressed storage of data rather than the logical data that is the default model. I think this is great, but also inevitable as open and managed table formats such as Apache Iceberg (inspired by the Google Dremel paper AFAIK) enables compressed file storage on cloud storage to a competitive price.
The cost is also divided into active vs long-term based on if the the table or table partition has been modified in the last 90 days. The physical storage billing is about twice the rate of the logical storage billing, but if compression is more the 2x then there is potential savings to be made.
Logical storage billing → Active: 0.02 $/GB, Long-term: 0.01 $/GB
Physical storage billing → Active: 0.044 $/GB, Long-term: 0.022 $/GB
I won’t go into details about the options as this is very well described by Xiaoxu Gao in this post.
There’s actually quite easy to extract what datasets that have the biggest potential of cost savings, you can query the information schema. If your data compress well and is partitioned and append only then there is a good chance that you will save cost by switching billing model to physical storage. For example, in our project used as bronze layer we could see as much as 80% cost savings potential!
/*
Schema details - https://cloud.google.com/bigquery/docs/information-schema-table-storage
Storage pricing details - https://cloud.google.com/bigquery/pricing#storage
*/
DECLARE active_logical_gb_price FLOAT64 DEFAULT 0.02;
DECLARE long_term_logical_gb_price FLOAT64 DEFAULT 0.01;
DECLARE active_physical_gb_price FLOAT64 DEFAULT 0.044;
DECLARE long_term_physical_gb_price FLOAT64 DEFAULT 0.022;
WITH storage_cost AS(
SELECT
table_schema,
ACTIVE_LOGICAL_BYTES / POW(1024, 3) + LONG_TERM_LOGICAL_BYTES / POW(1024, 3) as logical_storage_gb,
ACTIVE_PHYSICAL_BYTES / POW(1024, 3) + LONG_TERM_PHYSICAL_BYTES / POW(1024, 3) as physical_storage_gb,
((ACTIVE_LOGICAL_BYTES / POW(1024, 3)) * active_logical_gb_price) + ((LONG_TERM_LOGICAL_BYTES / POW(1024, 3)) * long_term_logical_gb_price) logical_storage_cost,
((ACTIVE_PHYSICAL_BYTES / POW(1024, 3)) * active_physical_gb_price) + ((LONG_TERM_PHYSICAL_BYTES / POW(1024, 3)) * long_term_physical_gb_price) physical_storage_cost
FROM
`region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE
NOT deleted
)
SELECT
storage_cost.table_schema,
SAFE_DIVIDE(SUM(logical_storage_gb),SUM(storage_cost.physical_storage_gb)) as compression_reate,
SUM(storage_cost.logical_storage_cost) as sum_logical_cost,
SUM(storage_cost.physical_storage_cost) as sum_physical_cost,
SUM(storage_cost.logical_storage_cost) - SUM(storage_cost.physical_storage_cost) as sum_cost_difference,
REPLACE("ALTER SCHEMA DATASET_NAME SET OPTIONS(storage_billing_model = 'physical');", "DATASET_NAME", storage_cost.table_schema) as command
FROM storage_cost
group by table_schema
order by sum_cost_difference desc
There’s one issue though, you may encounter an error telling you ALTER SCHEMA does not support the option: storage_billing_model.
and you can help prioritize fixing that bug by voting on this ticket in GCP’s issue tracker. Edit (2023-04-12): unfortunately it seems like the bug could be due to not being listed for the preview. Knowing that you can still prepare for working with it starting July 2023.
It’s worth noting that physical storage billing includes the bytes used for time travel storage (7 days by default) which isn’t taken into account in the query. Also, after switching to physical storage billing you can’t go back to the logical billing model.
Thanks for sharing! 🙌 Just ran it on one of our GCP projects. Potentially a few thousand dollars a year we can save 💸
Run at own risk, but it's pretty cool that you can wrap the query above in:
FOR resulting_row IN (
... the query with a filter to only include rows that save cost ...
) DO EXECUTE IMMEDIATE(resulting_row.command); END FOR;
to actually apply the command directly with the query. I find scripts like above quite powerful and potentially dangerous :)