I'm committed to keeping this content free and accessible for everyone interested in data platform engineering. If you find this post valuable, the most helpful way you can support this effort is by giving it a like, leaving a comment, or sharing it with others via a restack or recommendation. It truly helps spread the word and encourage future content!
We have probably all felt this pain. We spend too much time fixing data pipelines that keep breaking. Something small changes upstream, and suddenly everything stops working. Data arrives late, or maybe it's even wrong. We feel like firefighters, always putting out fires instead of building new value or helping people understand the data. The "modern data stack" was supposed to help, but sometimes it just gave us different kinds of headaches.
We've talked before about why data platforms are needed and the issues they help with. So, let's get a little practical about how we build them successfully. What are the key things we really need to focus on, based on what I've seen work, to create a data platform that people find genuinely helpful and that can grow with the company?
For me, it's much more than just giving people access to BigQuery or setting up an Airflow instance somewhere. Good data platform engineering means building systems that are strong, reliable, and make work easier for everyone involved with data. This includes the application teams who create the data in the first place, as well as the analytics engineers, data analysts, and data scientists who use it to build reports, models, or their own data products.
It's a real change in how we work. We need to stop thinking like we only build pipelines. We need to think like we are building a platform. It took me some time to understand what this platform approach really means in practice. I think there are at least four important parts to get right if we want to build data platforms that people actually like using and that truly help the business.
Good data platform engineering means building systems that are strong, reliable, and make work easier for everyone involved with data.
1. Thinking like a product owner for data
Maybe the most important change is this: we must treat our data platform like it's a product. It's not just infrastructure hidden somewhere, it has users (our colleagues in different teams) and they have jobs to do. Our platform should help them do their jobs better. This means we need to be focused on their needs and think carefully about what we offer.
We must choose what to support
We cannot possibly support every data tool or every way of working. That would be chaos, and we could not make anything reliable. We need to make smart choices. We need to be opinionated, in a good way. For example, we might decide that dbt is our standard way to transform data. Or that getting data from databases should use a specific change data capture (CDC) tool that feeds into Pub/Sub, then maybe Dataflow processes it into BigQuery. We choose these standards based on what most teams need and what we can realistically support well. Offering fewer things, but making them work very well, is much better than offering too many things poorly. We need to be clear about what our platform does and does not do.
Focus on the user's experience
We must always think about the people using our platform. Are they software engineers trying to send data? Are they analysts building dashboards? What are their biggest frustrations? How can we make their work less painful, maybe even enjoyable? We need to talk to them often, listen to their problems, and build solutions that make sense for them. We shouldn't just build things because the technology is cool. Their experience matters most.
When we think like this, we tend to build two kinds of things.
1. Standard pathways
These are like the main roads for common data tasks. We want to make the normal things super easy and safe. For example, maybe we provide simple tools or templates (using Pulumi or Terraform) to create a new BigQuery table. Such a tool automatically sets up the right security permissions based on the team, adds standard labels for cost tracking, and maybe even registers the table as a dbt source following our company's best practices, including CI/CD setup. The idea is to cover the most frequent needs (maybe 80% of them) with a really smooth, reliable path. We guide people toward these standard ways, but we don't force them if they have a very special need. It's a suggestion, not a prison.
2. Building specific new capabilities
Sometimes, we see that many teams are struggling with the same problem, and there's no good tool available in our platform to solve it. In these cases, we might need to build something bigger and more specific. It's a focused investment. For instance, maybe we see teams struggling with data quality issues or unexpected schema changes breaking downstream processes. A great example of a focused platform investment here could be building a central data contract service. Imagine a simple service, maybe built with FastAPI and running on Cloud Run, that allows teams to define and register contracts for their data (schemas, quality expectations, ownership). This service could then be integrated into CI/CD pipelines for both data producers (checking outgoing data) and consumers (validating incoming data or alerting on contract breaches). This isn't just infrastructure; it's a specific software component built by the platform team to solve a widespread problem and enforce standards programmatically across many teams. This kind of targeted build adds significant new power across the company.
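To make this concrete, here is a hedged sketch of the core validation logic such a contract service might wrap (the FastAPI/Cloud Run layer and persistence are omitted). The `DataContract` shape and `check_payload` helper are illustrative assumptions, not a real library:

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    """A registered contract: schema, ownership, and required columns."""
    name: str
    owner: str
    schema: dict  # column name -> expected type, e.g. {"user_id": "INT64"}
    required_columns: set = field(default_factory=set)

def check_payload(contract: DataContract, payload_schema: dict) -> list[str]:
    """Return a list of contract violations for a producer's proposed schema."""
    violations = []
    # Required columns must be present at all.
    for col in contract.required_columns:
        if col not in payload_schema:
            violations.append(f"missing required column: {col}")
    # Columns that are present must match the contracted type.
    for col, expected in contract.schema.items():
        actual = payload_schema.get(col)
        if actual is not None and actual != expected:
            violations.append(f"type mismatch on {col}: expected {expected}, got {actual}")
    return violations
```

A producer's CI pipeline could call this check before deploying a schema change and fail the build on a non-empty violation list, which is exactly how the service would enforce standards programmatically.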
The main point is always asking:
Does this truly help our users?
Does it solve a real problem for them?
Sometimes the best platform feature is one that completely automates a task they used to do manually.
2. Building software, not just connecting pipes
Here's something I believe strongly: if your data platform team doesn't write any software, it's probably not doing data platform engineering. Just managing cloud services isn't enough to create real leverage. We need to build software that tailors these data systems to our company's specific needs and makes them better to use.
Now, this doesn't mean we should build complex custom APIs to hide powerful tools like BigQuery completely. That would be a mistake. Users often need the full power of the underlying tools, like writing complex SQL queries. But we do need to build the software that connects things together smoothly, enforces our standards automatically, and makes complicated tasks much simpler.
What kind of software might a data platform team build?
Internal tools and APIs
Often small, focused services. Maybe a little web service (using something simple like FastAPI in Python, maybe running on Cloud Run) that checks data schema definitions submitted by teams before allowing data ingestion via Pub/Sub. Or a tool that handles requests for data access based on rules stored in our system. Or maybe code that interacts with cloud provider APIs or tools to automate workflows. These tools automate processes and contain our specific workflow logic and rules.
Helper tools for developers
A command line tool can be incredibly useful. Imagine typing data-gen start my-new-data-product and having it create the standard folders and basic dbt models and tests automatically from metadata or a data contract. That saves a lot of time and ensures consistency. Also, shared code libraries that provide easy ways to add standard data quality checks, or to send metadata updates to our catalog, are great examples of helpful software abstractions.
Infrastructure as Code templates
We must use IaC tools like Pulumi or Terraform. It's essential. We build reusable pieces of code (modules) that define how our data infrastructure should look. For example, a Pulumi module to create a BigQuery dataset that always has our standard company labels and the correct IAM access controls based on the team requesting it. This is much safer and more reliable than manual setup. We can track changes, review them, and apply them consistently.
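The resource creation itself would go through Pulumi's or Terraform's GCP provider; what's worth sketching is the conventions layer such a module encodes. The label keys, roles, and `example.com` domain below are assumptions for illustration:

```python
def standard_dataset_config(team: str, env: str, owner_group: str) -> dict:
    """Compose the labels and access entries our hypothetical IaC module
    applies to every BigQuery dataset, so conventions live in one place."""
    return {
        "labels": {
            "team": team,
            "env": env,
            "managed-by": "platform",  # marks the resource as IaC-managed
            "cost-center": team,       # feeds cost attribution dashboards
        },
        "accesses": [
            # The owning team's group can write; project members can read.
            {"role": "WRITER", "group_by_email": f"{owner_group}@example.com"},
            {"role": "READER", "special_group": "projectReaders"},
        ],
    }
```

Keeping this logic in one reusable function (or module) means a label rename or an IAM policy change is a single reviewed commit, not a hunt across dozens of hand-built datasets.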
Working with open source
Sometimes we use open source tools that are almost perfect, but need a little adjustment for us. Maybe we write a custom plugin for dbt, or build a specific connector for Airflow to talk to an old internal system. Sometimes we might even fix bugs or add features to the open source project itself. Being able to work with, and sometimes modify, open source code is a valuable skill for the platform team.
Connecting to metadata
This is critical in the data world. Our platform tools must automatically capture and send information to our central metadata system or data catalog. For instance, after a dbt job runs successfully in CI/CD, the platform should automatically update the descriptions, tags, and ownership information for the created tables in the catalog (maybe BigQuery's universal catalog, OpenMetadata, or another tool). People need to easily find data and understand where it came from (i.e. lineage). Making metadata capture automatic is key, because asking people to update documentation manually never works well for long.
The important question for any software we build is:
Does this really make things easier or safer for the user?
Or does it just add complexity and hide useful features of the tools they already know?
We need to be careful not to build abstractions just for the sake of it. Sometimes direct access is best.
3. Making it work for everyone
A data platform's success depends on whether it helps many different kinds of people across the company. It's not just for one team of data experts; we need to support:
Software engineers who need simple ways to get data from their apps into the data platform (e.g., publishing events to Pub/Sub).
Analytics engineers who build the core, reliable data models for the company (often using dbt on BigQuery).
Data analysts who explore data and build dashboards for business users (maybe using Looker or Looker Studio).
Data scientists who need access to data for training models (perhaps using Vertex AI).
To serve such a wide audience well, the platform needs to provide certain things.
Self service options
Users should be able to do many common tasks themselves, quickly, without needing to create a ticket and wait for the platform team. Maybe they can register a new data source by adding a configuration file to a Git repository, and the platform automatically sets up the ingestion. Or they can use an internal developer portal or CLI tool to request resources (like a BigQuery dataset and service account) for a new project. Less waiting means people can move faster.
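A sketch of the validation step that could gate such a Git-based registration flow in CI, so a bad config file is rejected at review time rather than breaking ingestion later. The config keys and allowed ingestion types are invented for illustration:

```python
# Hypothetical schema for a source-registration config file.
REQUIRED_KEYS = {"source_name", "owner_team", "ingestion_type"}
ALLOWED_INGESTION = {"pubsub", "cdc", "batch_file"}

def validate_source_config(cfg: dict) -> list[str]:
    """Validate a source-registration config before the platform's CI
    provisions ingestion for it. Returns human-readable errors."""
    errors = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - cfg.keys())]
    ingestion = cfg.get("ingestion_type")
    if ingestion is not None and ingestion not in ALLOWED_INGESTION:
        errors.append(f"unknown ingestion_type: {ingestion}")
    return errors
```

With a check like this wired into the repository's CI, "register a new source" becomes a reviewed pull request instead of a ticket to the platform team.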
Good visibility for users
People need to understand what's happening with their data and their processes running on the platform. Can an analyst easily see when the data they are using was last updated? Can an analytics engineer find the error logs quickly if their dbt transformation job fails? Can a team manager see how much their team's usage of BigQuery is costing? The platform must provide dashboards, logs, and metrics so users can monitor their own work and solve their own problems when possible. They need to be able to figure out if the problem is in their code or if the platform itself has an issue.
Built in safety rules
We cannot assume everyone using the platform is an expert in data governance, security, cost control, or even data modeling best practices. Mistakes can be easily made, and sometimes those mistakes can be expensive or cause compliance problems. So, the platform needs automatic safety features, like "guardrails" on a highway. For example, automatically scanning incoming data for sensitive information and masking or encrypting it before it lands in BigQuery. Alerting teams if their BigQuery spending seems too high. Perhaps enforcing that certain critical datasets must pass specific quality checks before they can be used in production reports. These automatic checks help prevent common problems.
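As a toy illustration of the masking guardrail: a production system would lean on a dedicated service such as Cloud DLP, and these regexes are deliberately naive, but the shape of the idea is the same — detect sensitive values and replace them before the record lands in the warehouse.

```python
import re

# Hypothetical first-pass patterns; real deployments need far more coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace detected sensitive values with a type tag so the raw value
    never reaches downstream storage."""
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[MASKED:{kind}]", text)
    return text
```

Run as a step in the ingestion path (for example, inside a Dataflow transform), this kind of guardrail protects every team by default instead of relying on each one remembering to do it.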
Handling many teams efficiently
To be cost effective, data platforms usually need to share some resources (like computing power) across many different teams or projects. For example, several teams might run their dbt jobs on the same shared Cloud Composer instance, or use the same pool of BigQuery processing capacity. The platform needs to manage this sharing carefully, ensuring each team has its own secure space and that one team's work doesn't negatively impact others. This sharing makes the platform cheaper and easier to operate than running separate copies of everything for every single team.
Here we should ask ourselves:
Are we truly enabling a wide range of users to be productive and safe?
Or do our platform choices accidentally make life harder for some groups while helping others?
Can users actually help themselves when they need to, or are they always blocked waiting for us?
4. Focusing on reliable operations
This last part is absolutely essential for building trust. The data platform must be solid. If the platform is often broken, if data deliveries are frequently late or incorrect, or if the tools are buggy and unreliable, then users will quickly lose faith. They will start building workarounds, and all the potential benefits of the platform will be lost. Reliability is maybe the most important feature. Making the platform reliable requires a few things.
Taking full responsibility
The platform team needs to own the smooth running of the entire system they offer to users, not just the pieces of code they wrote themselves. If the platform involves getting data via Pub/Sub, processing it with Dataflow, storing it in BigQuery, and orchestrating with Cloud Composer and dbt, the platform team is responsible for making sure this whole chain works reliably for the end user. It's not good enough to just provide the tools or the basic infrastructure and then tell the user teams, "good luck running it in production!" That approach pushes complexity onto the users and doesn't scale well. The platform team must handle the operations side.
Providing good help to users
Even with the best platform, users will have questions or run into problems sometimes. They might hit an unusual error, need help understanding how to use a feature, find a bug, or want to request an improvement. The platform team needs to be prepared to help its users. This means having clear ways for users to ask questions (maybe a chat channel, a ticketing system, or office hours), providing good documentation and examples (maybe in a shared wiki, Confluence or Backstage), and generally having a helpful attitude. Building a culture where the team cares about user problems is important.
Being serious about running things smoothly
Operating complex, distributed data systems requires care and attention. We can't just set things up and hope they keep working forever. We need good operational practices. This means having thorough monitoring, useful alerts that are actionable, good logging so we can investigate problems, performing regular maintenance (like updating software versions for dbt, Airflow/Composer, or underlying containers before they become unsupported), having plans for how to handle incidents when things do break, and keeping an eye on costs. Because our platforms often rely on many external pieces, we need this constant attention to catch problems early and keep everything running reliably. It's not the most glamorous work, but it's essential for a platform people can trust.
And finally, we must constantly check:
Do people actually trust our platform and the data it provides?
Is it a stable base they can rely on every day, or is it a source of frustration and surprises?
Is our operational work manageable, or are we just barely keeping things running?
Wrapping up
So, building a data platform that really works well is definitely not a simple task. It requires a different mindset than just building individual pipelines. It needs a mix of skills: thinking about user needs like a product manager, writing solid software, understanding the specific challenges of data, and being very focused on making the system reliable day to day.
But if we focus on these areas, really understanding user needs and offering curated solutions, building helpful software to automate and standardize, making the platform easy and safe for many different users, and putting a lot of effort into stable operations, then we can build something truly valuable. We can move away from being constantly stressed by broken pipelines and instead provide a foundation that lets our whole company use data more effectively to make better decisions and innovate faster.
What are your thoughts on this?
Is your team also moving towards building data platforms?
What are the biggest challenges you face in making them truly helpful and reliable?
I would be interested to hear your experiences.
P.S. Just a quick note on context. Most of my experience, especially over the last ten years as a data engineer and data platform engineer, has been focused heavily on Google Cloud Platform (GCP). While I've also worked with Azure and AWS, you'll see many examples here referring to GCP services like BigQuery, Pub/Sub, or Dataflow simply because that's where my deepest hands-on experience lies. However, the core ideas and logic we talked about are generally applicable no matter which cloud you use.
In my experience leading platform teams in cross-functional environments, the shift from “pipeline firefighting” to “platform product thinking” was the real inflection point. Especially liked your distinction between building standard pathways vs solving shared pain points with specific new capabilities.
One thing I’d add: platform teams often underestimate how much “last-mile” friction users face, especially analysts and less technical stakeholders. We found that even when infrastructure was solid, adoption lagged until we invested in two things: (1) a simple internal UI built on top of our metadata layer for lineage tracing and (2) lightweight onboarding playbooks for common tasks (like setting up streaming ingestion or debugging dbt tests). Neither was technically complex, but both dramatically boosted perceived usability and trust.
Also agree strongly on the need for real software development inside the platform team. We started seeing real leverage only once we treated internal tooling (like provisioning workflows and data contracts) as first-class products with version control, feedback loops, and basic SLAs.
Curious if you’ve seen any success patterns around platform usability metrics? We’ve been experimenting with proxy measures like time-to-first-insight or number of manual Slack requests dropped over time, but it still feels more art than science.