Serverless orchestration with GCP Workflows, part 1
Do we really need Composer in a modern GCP data stack?
This is a re-post from my blog (February 2022), as I’m moving my blogging to Substack.
There’s been a lot of discussion lately about Airflow’s role in the Modern Data Stack and how different tools and services are unbundling Airflow’s responsibilities. It is a very interesting discussion, but to be fair to Airflow, it was built as a workflow manager; its flexibility has invited users to pile on additional responsibilities, resulting in an anti-pattern where Airflow fills the gap left by a missing control plane across data tools and services. I really encourage you to read those posts. This post isn’t about that discussion, but rather about whether we can replace GCP Composer (managed Airflow) with a more lightweight option.
Why not GCP Composer?
Don’t get me wrong, Composer is a powerful beast, but it isn’t trivial to operate and it definitely isn’t cheap. Even as a managed Airflow service it will leave you scratching your head over Kubernetes-related alerts and errors that show up in what seems to be a random but recurring pattern. It is also easy to slip into the anti-pattern of not only orchestrating tasks on Composer but actually running them there, or to start using pre-built operators that are later abandoned and may cause conflicts when you upgrade Airflow. Composer is also usually deployed as a centralized monolith, where noisy neighbours (other DAGs) can have a negative impact on your DAGs. But nowadays there is a GCP option that may be better suited to lightweight orchestration needs - GCP Workflows.
At my employer most of our data integration is streaming rather than batch. But we do have some batch data integration jobs, and we’ve used Composer extensively over the last few years for data warehouse related tasks such as transformations, validations and data feeds/exports. With the introduction of DBT (scheduled and orchestrated by GCP Cloud Scheduler and GCP Workflows), the transformation and validation tasks are now the responsibility of DBT. What is left is running Composer on a weekly/daily schedule for fewer than 10 simple DAGs that generate data feeds, and that seems like overkill.
GCP Workflows
Hence I wanted to give GCP Workflows a try and see how it compares with Composer. Here are some reflections:
Separation of concerns
GCP Workflows really forces you to use it ONLY for orchestrating tasks, not for actually running them. Even the scheduling is the responsibility of another service (Cloud Scheduler).
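To make that split concrete, here is a minimal Python sketch of what the scheduling side has to provide: an HTTP POST to the Workflow Executions REST endpoint, with the workflow's runtime arguments passed as a JSON-encoded string in the `argument` field. The project, region, workflow name and the `feed` argument are hypothetical placeholders, and the function only builds the request without sending it.

```python
import json

EXECUTIONS_API = "https://workflowexecutions.googleapis.com/v1"


def build_execution_request(project, location, workflow, args):
    """Return (url, body) for starting one workflow execution."""
    url = (
        f"{EXECUTIONS_API}/projects/{project}/locations/{location}"
        f"/workflows/{workflow}/executions"
    )
    # The runtime arguments travel as a JSON string nested inside the JSON body.
    body = json.dumps({"argument": json.dumps(args)})
    return url, body


# Hypothetical names, for illustration only.
url, body = build_execution_request(
    "my-project", "europe-west1", "daily-feed", {"feed": "orders"}
)
```

A Cloud Scheduler job with an HTTP target would carry this URL and body, authenticated as a service account that is allowed to invoke the workflow. The workflow itself then only calls out to the services that do the actual work.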
Isolation of jobs
Each job in GCP Workflows runs independently of other jobs, so you avoid the risk of noisy neighbours starving your cluster resources or of conflicting package dependencies between jobs. The experience from running it at scale (400+ jobs/hour) for more than 6 months is that Workflows is extremely reliable.
Cost
GCP Workflows is serverless and you pay per execution. A minimal Composer cluster costs approximately $500/month, which corresponds to 50M executions per month (about 20 executions/s) in GCP Workflows, a volume that is probably quite rare. This also lets you provide isolated orchestration capabilities in multiple projects without the significant overhead of Composer.
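The arithmetic behind those numbers can be checked with a few lines of Python. This is a back-of-the-envelope sketch that takes the figures in this post at face value and assumes a 30-day month and one execution per job; actual billing depends on current GCP pricing.

```python
COMPOSER_MONTHLY_USD = 500             # approximate figure from the text
EXECUTIONS_FOR_SAME_COST = 50_000_000  # figure from the text

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month

# Price per execution implied by the two figures above.
implied_cost_per_execution = COMPOSER_MONTHLY_USD / EXECUTIONS_FOR_SAME_COST

# Sustained rate needed before Workflows costs as much as Composer.
break_even_rate = EXECUTIONS_FOR_SAME_COST / SECONDS_PER_MONTH

# The workload mentioned earlier: 400 jobs per hour, around the clock,
# assuming each job is a single execution.
monthly_executions = 400 * 24 * 30
share_of_break_even = monthly_executions / EXECUTIONS_FOR_SAME_COST

print(f"{break_even_rate:.1f} executions/s to break even")
print(f"{share_of_break_even:.2%} of the break-even volume at 400 jobs/hour")
```

Under these assumptions the break-even rate is roughly 19 executions per second, and a steady 400 jobs/hour uses well under 1% of the break-even volume.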
Security
Since GCP Workflows is lightweight, isolated and supports IAM, you can provide end-users greater access to orchestration and the flexibility of a “distributed” orchestration rather than a more centralized architecture.
IaC
I also like that it is so easy to set up using IaC tools such as Pulumi. There is a lot of talk about no/low code in the Modern Data Stack, but I’ve experienced that pain and appreciate having APIs and defining my workflows in code, with version control and CI/CD.
Re-runs and backfills
This is perhaps where Composer has an edge, and it is often put forward as an argument for using Composer over GCP Workflows, since the latter doesn’t have those features built in. However, that is quite easy to remedy yourself, which I show in the next blog post.
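One possible shape for such a remedy is to let the workflow take the run date as an argument and start one execution per date in the range you want to backfill. This is a sketch under that assumption, not necessarily the approach taken in the follow-up post. The `run_date` and `feed` argument names are hypothetical, and the code only builds the argument payloads.

```python
import json
from datetime import date, timedelta


def backfill_arguments(start, end, extra=None):
    """Yield one JSON argument string per day in [start, end], inclusive."""
    day = start
    while day <= end:
        args = {"run_date": day.isoformat(), **(extra or {})}
        yield json.dumps(args)
        day += timedelta(days=1)


# Four days spanning a month boundary, for a hypothetical "orders" feed.
payloads = list(
    backfill_arguments(date(2022, 1, 30), date(2022, 2, 2), {"feed": "orders"})
)
```

Each payload would then be submitted as the argument of a separate workflow execution, so a re-run of a single day is simply a range where start and end are the same date.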