What a disaster

I once worked on a project that was described as the biggest software disaster ever. I was a young university student, freelancing over the summer break, and I took a job with a small company that was working on a small part of a much larger project. I went to some of the offsite meetings where all the different companies involved got together. I remember there being lots of people dressed very smartly, very good food laid on for all the attendees (you pay attention to free food when you’re a starving student), and a big fancy conference center with water features and modern architecture. I was very impressed; it all looked very professional. The actual meetings, though, were very slow and a little boring, and even though I was very young and inexperienced, it was quite clear what was causing the problems: expanding requirements. The meetings were constantly documenting new requirements and things to follow up on, and each meeting generated more questions than answers.

The project was about healthcare messaging, and the engineers were trying to figure out the schemas for all the messages that could be sent between healthcare systems. The problem was that the domain was far more complicated than they’d anticipated, and the requirements soon spiralled out of control. Combine this with organisation by committee (always a good way to slow down any project), bike shedding, and the communication overhead of the many teams involved, and you have a recipe for the biggest software disaster ever. Was it really the biggest ever? I’m not in a position to say, but it was a great learning experience for me.

Why data projects fail

Nice story, right? But what has this got to do with data projects? Well, this was a data project. It sounds simple: just model the manual data workflows that people are using in software, so we don’t have to rely on paper. How hard could that be? It turns out that data projects are particularly prone to going over budget and failing for exactly the same reasons this project was considered a huge disaster. Namely, data projects have huge amounts of hidden complexity, despite seeming simple. Once you actually drill into the data and the workflows, the requirements quickly explode. I want to talk about how to account for this so that the project isn’t a disaster, and I want to explain why the choice of tools is a very significant factor in achieving success.

Here are the common reasons why data projects fail:

  • Data is hard
  • Change is hard
  • Organisations are hard
  • Customer usability is hard

I say ETL, you say data pipeline

We’ll cover these reasons in more detail, but first it’s worth discussing how these problems show up and how they’re typically dealt with. These reasons for failure tend to manifest as large unexpected costs, particularly in one area of data projects: ETL. A large part of most data projects is moving data from source locations, transforming it into a usable format, and placing it into a destination location, ready to be used. This is often called ETL, ELT, data pipelines, or whatever term you like to use; from here on let’s use the term data workflow. It’s often estimated that data workflow accounts for >70% of the total cost of data projects. The way most companies choose to deal with this is to buy a very expensive tool, support, and consultants (if your company would like to go that route, feel free to reach out to me, I’m happy to match or increase the highest consultant rate you’ve been quoted). I want to argue that a lot of tools in this area actually make these problems worse, not better, and that the right kind of tool can improve this situation drastically, reducing the cost and risk of data projects significantly.

A data workflow tool by any other name

Traditional data workflow tools often define a language of actions and tasks that can be linked together and run on some kind of schedule. This domain specific language (DSL) is usually represented as some kind of proprietary script, or as a visual language with a drag and drop UI and a library of components. These kinds of tools are typically desktop applications that first became popular as part of the relational DB and data warehousing ecosystem of the late 1990s and early 2000s. They’re usually designed to streamline particular vendor ecosystems and use cases, and to make linking the systems easy for DBAs. Out of the box they work really well for a specific set of scenarios, and often focus on speed of data movement as a key feature. They’re also often extensible, meaning you can add new functionality and connect to new types of system, but they’re not usually EASY to extend. Extending usually requires creating a library that conforms to some API, making the custom connector compatible with the other parts of the DSL. These tools really shine when most of the jobs match a known pattern that has been anticipated by the tool vendor. Stepping outside of these pre-planned use cases is usually possible, but it’s much more work, and not the sweet spot of these tools.

Contrast this with the new generation of data workflow tools. These tools assume that the user is a data engineer wanting to code, instead of a DBA wanting a UI. They start from the assumption that there is no common pattern, and that the tool should be completely generic, enabling the user to define any kind of data workflow that’s needed. The interface is an API in a general purpose programming language, instead of a restricted DSL. These kinds of tools have typically emerged as part of the Big Data movement; a prominent example is the Apache Airflow project, which is strongly influenced by Facebook’s Data Swarm. These tools are programmatic, so they are inherently flexible and extensible. The only assumptions programmatic workflow tools make are that there are dependencies, tasks, and schedules, and that engineers need the power and control to easily fill in the blanks. Being programmatic makes these tools flexible and powerful, but they also have a much steeper learning curve than the alternatives, and can often carry a significant operational burden to keep them up and running.
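To make the programmatic style concrete, here’s a minimal sketch of what a workflow definition looks like in Apache Airflow. The DAG name and the task functions are made up for illustration, and the exact import paths and parameters vary a little between Airflow versions:

    # A minimal Airflow workflow definition: two placeholder tasks and a
    # dependency between them, scheduled to run daily.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract():
        print("pull data from a source system")


    def load():
        print("write data to the destination")


    with DAG(
        dag_id="example_workflow",        # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> load_task         # dependencies are ordinary Python expressions

Everything here is just Python, which is exactly why these tools are so easy to extend, and also why they demand engineering skills.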

Now that we know the types of tools that exist, let’s see how they cope with the problems that arise when creating data workflows:

Problem 1: Data is hard

Software data systems are difficult because they model real-world phenomena that don’t map cleanly onto software models. The real world is messy, data is unclean, and there are LOTS of edge cases. Uncovering this complexity and modelling it in a data system is hard and takes time. There are often multiple teams dedicated just to finding out where all the data sources are, what they mean, how the business needs to use the data, and how it should be modelled.

Traditional data workflow tools work great for certain use cases, but can have difficulty with the many edge cases and with system complexity. I would argue that in most medium to large businesses the data ecosystem has become so large and complex that many of these edge cases are no longer dealt with by the traditional tools. Thanks to the explosion of new types of database sources and the Big Data movement, companies now expect all these previously siloed data sources to be connected and available to everyone. This means that the tools must be extended, or alternative solutions have to be used when the tool doesn’t work well.

The new programmatic tools, on the other hand, are set up for exactly this scenario. General purpose programming languages can specify arbitrarily complex workflows, and are built to express such logic in a manageable and concise way. Popular languages also have vibrant ecosystems of libraries for connecting to all kinds of data sources. For example, almost every popular database has multiple client libraries for Python and Java. Using these components and libraries allows programmatic data workflow tools to connect to almost any data source as needed, usually with quite a small amount of effort. This enables the programmatic tools to cope very well with unanticipated change, complexity and edge cases, and makes specifying arbitrary data workflows much less risky and costly.
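For instance, a task that pulls rows out of a PostgreSQL source can lean on an off-the-shelf driver like psycopg2. This is only a sketch; the connection details, table and output path are placeholders:

    # Sketch of a workflow task that reads from PostgreSQL with the psycopg2
    # driver and dumps the rows to a CSV file for a later load step.
    import csv

    import psycopg2


    def extract_orders(output_path="/tmp/orders.csv"):
        conn = psycopg2.connect(
            host="db.example.internal",   # placeholder connection details
            dbname="sales",
            user="etl_user",
            password="...",               # use a secrets backend in practice
        )
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT order_id, amount, created_at FROM orders")
                with open(output_path, "w", newline="") as f:
                    writer = csv.writer(f)
                    writer.writerow([col[0] for col in cur.description])  # column names
                    writer.writerows(cur.fetchall())
        finally:
            conn.close()

A function like this can be dropped straight into a workflow as a task, and swapping the source for MySQL, an API, or a flat file is just a different library call.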

Problem 2: Change is hard

Real-world data systems change frequently, so the requirements gathering and modelling process is iterative and perpetual. If a data system is small, or simple, traditional data workflow tools have an advantage: they’re very easy and quick to use. But if data systems are large, then frequent change can be bottlenecked when the person(s) who operate the tools can’t make changes fast enough (see the next section on organisations). While programmatic data workflow tools aren’t as quick and easy to change, they are very amenable to change management, because they’re just code. This means that version control, code review, bug tracking, testing and other software development practices can be used on the workflow definitions. This is very valuable when these workflows become important, which they often do. Having a quick turnaround in a UI based tool is great, but having a peer reviewed change process with check-in and integration tests can help reduce the number of costly mistakes that occur in production data workflows. If using non-programmatic tools, the definitions of workflows can sometimes still be put into version control, but often they aren’t, because they’re not operated by software engineers.
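As a small example of what that buys you, a workflow codebase can include a check that every workflow definition at least loads cleanly before it’s deployed. Here’s a hedged sketch for Airflow, assuming the definitions live in a dags/ folder:

    # CI-style check: load every DAG file and fail if any of them has an
    # import error, so broken definitions never reach production.
    from airflow.models import DagBag


    def test_dags_load_without_errors():
        dag_bag = DagBag(dag_folder="dags/", include_examples=False)
        assert dag_bag.import_errors == {}, f"Broken DAG files: {dag_bag.import_errors}"
        assert len(dag_bag.dags) > 0, "Expected at least one workflow to be defined"

Run under pytest in a pre-merge pipeline, a test like this catches a whole class of silly mistakes before they reach production.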

Problem 3: Organisations are hard

Real-world data is often dispersed across many teams and organisations. The more people and cross-organisation communication needed, the bigger the communication overhead, and the greater the likelihood that a change in one team is going to have ripple effects (link: law about communication being exponential), upstream and downstream, that must be accounted for. This communication overhead can become a bottleneck if only a limited group of people can use, update, or understand the workflows in the data workflow tool. Traditional data workflow tools are more likely to be controlled by a small set of knowledgeable people, and can become the bottleneck. However, if the workflow definitions can be edited by anyone in the organisation, and deployed easily, then other teams can use the tool to unblock themselves, without being bottlenecked on a central team/tool. Programmatic tools usually allow easier sharing and fewer bottlenecks, because the definition is just stored as a text file, but some DSL based tools can be made self-service too.

One advantage that programmatic workflow platforms have is that the definitions they work with can be modular, and these modules can be defined by different teams. For example, the team that owns the data warehouse might expose some libraries that can be used to access the warehouse using approved best practices, or the finance team might write a re-usable library for finding and using the metrics for financial reports. Having the full power of a programmatic data platform allows teams to share and re-use components, so discovering how to do things is easier, using best practices is easier, and duplicate effort is reduced through code sharing. This does require engineering discipline (testing, code reviews, issue tracking, release management) to make it effective, but it can really help reduce the impact of communication overhead, especially if many of the different teams have engineering resources (which they often do in large organisations). On the other hand, if engineering resources are not available, programmatic tools can actually increase friction, because learning a general purpose programming language is harder than learning a DSL.
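As a sketch of what that sharing can look like, the warehouse team might publish a small helper module that other teams simply import in their workflow code. The module, function names and connection details below are invented for the example:

    # warehouse_lib.py -- hypothetical helper library owned by the warehouse
    # team, wrapping the approved connection settings and load conventions.
    import psycopg2


    def warehouse_connection():
        """Return a connection to the warehouse using the team's approved settings."""
        return psycopg2.connect(
            host="warehouse.example.internal",  # placeholder
            dbname="analytics",
            user="shared_loader",
            password="...",                     # fetched from a secrets store in practice
        )


    def load_csv(table, csv_path):
        """Bulk-load a CSV file into the given warehouse table."""
        conn = warehouse_connection()
        try:
            with conn.cursor() as cur, open(csv_path) as f:
                cur.copy_expert(f"COPY {table} FROM STDIN WITH CSV HEADER", f)
            conn.commit()
        finally:
            conn.close()


    # In another team's workflow code:
    #   from warehouse_lib import load_csv
    #   load_csv("finance.daily_metrics", "/tmp/metrics.csv")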

Problem 4: Customer usability is hard

The way that the data workflow is specified must be understood by the person who wants to author the workflow. Often, the person who needs the workflow is a data scientist or analyst, but the person managing the workflow can be a DBA or a data engineer. Translating the needs of the workflow customer to the workflow author can be a challenge, made more difficult if the changes are frequent. Ideally, the workflow customer and author will be the same person, but often they’re not, so there either needs to be a request each time a change is made, or there needs to be some kind of self-serve interface for users to create and/or modify their workflows. Many traditional data workflow tools don’t have an easy way to do this, other than having the customers learn to use the tool itself. Programmatic platforms are more flexible, and can expose abstractions for authoring workflows through interfaces such as web UIs or simple configuration files. For example, the workflow engineer might generate a workflow for every YAML file found in a particular directory, so that workflow customers just write the YAML file and don’t have to know much else about the system, or they could expose monitoring and management via a custom web UI. Creating usable interfaces is not trivial, and requires software engineering processes to capture requirements and implement interfaces that satisfy them, but with programmatic tools and engineering resources, it’s possible to create very usable interfaces for customers of the data systems.
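Here’s a hedged sketch of that YAML-driven pattern in Airflow: scan an assumed workflows/ directory and build one workflow per YAML file, so the customer only ever touches a few lines of YAML. The directory name, the YAML fields and the copy step are all assumptions for the example:

    # Generate one Airflow DAG per YAML file found in a directory, so that
    # workflow customers can author workflows without writing any Python.
    from datetime import datetime
    from pathlib import Path

    import yaml
    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def copy_data(source, destination):
        print(f"copy data from {source} to {destination}")


    for path in Path("workflows").glob("*.yaml"):    # customer-authored files
        config = yaml.safe_load(path.read_text())     # e.g. name, schedule, source, destination

        with DAG(
            dag_id=config["name"],
            start_date=datetime(2024, 1, 1),
            schedule_interval=config.get("schedule", "@daily"),
            catchup=False,
        ) as dag:
            PythonOperator(
                task_id="copy",
                python_callable=copy_data,
                op_kwargs={"source": config["source"], "destination": config["destination"]},
            )

        globals()[config["name"]] = dag               # expose each DAG so the scheduler discovers it

The customer’s file is then just a name, a schedule, a source and a destination, and they never need to know what sits behind it.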

Recap

Each of the problems listed above is likely to be faced by large data projects. In organizations where software engineers are readily available, the flexibility and power of programmatic data workflow platforms can be used to reduce the cost and risk involved. Data source and workflow complexity can be dealt with by using general purpose programming languages that can express arbitrarily complex tasks and make use of large community libraries and frameworks. Frequent requirements change can be dealt with by software engineering change management tools and processes. Organizational communication overhead can be dealt with by using modular code design, code sharing practices, and the creation of team-specific libraries. Finally, customer usability can be improved by creating abstractions over the authoring or maintenance process and exposing these via simple interfaces to the customers (e.g. custom web-based UIs).

Programmatic data workflow tools can be of great benefit in the right circumstances, but are not suitable for all organisations. They require engineering skills, resources and discipline to use effectively. If the engineering resources are not readily available, using these tools may cause more harm than good. Likewise, if the data problems of the organisation are small, or specific to a particular vendor or ecosystem, then there might be more focused tools that are a better fit.

Summary

In summary, the complex and expanding requirements of data projects can be a big problem. The risk involved in choosing tool support can be mitigated by choosing the most flexible and powerful tools available, and these are programmatic data workflow tools. Using programmatic data workflow tools brings the full power of software engineering to tackle these problems, and has numerous advantages, but these tools are not the best choice for all projects and situations, as they are difficult to use effectively. However, if you have a large data project and an organization with available software engineering resources, you could potentially avoid a software disaster by using programmatic data workflow tools.
