The book ‘The Selfish Gene’ was life-changing for me. It’s one of the most fascinating books I’ve ever read, and if you haven’t read it, I highly recommend it. When I first read it, as a young undergrad, it changed my views on religion, atheism, and economics, and introduced me to a lot of science and mathematics that have really shaped my view of the world. Yet I put off reading it for a very long time because of the title. I incorrectly assumed that I knew what it was about: I thought it would be about a gene that causes selfishness, and that didn’t sound very interesting to me. Now, after actually reading it, I understand the nuance behind the title and it makes sense, but beforehand I couldn’t have guessed what it was really about. I judged the book by its title, nearly didn’t read it because of that, and nearly missed out on one of my favorite books.

The name Big Data has a similar problem. Some people hear the term Big Data and assume they know what it’s about. They get the impression that you need “Big”, meaning huge, amounts of data to get value from Big Data technologies. That’s the impression I had too, before becoming really involved in data engineering, but it’s not entirely true, and I’m going to explain why. Even if you’re not operating anywhere near Google scale, you can still get a great deal of value out of Big Data technologies; in fact, the hype and growth of Big Data has had more to do with other factors than with the “Big” part. I want to explain why Big Data is a misleading name, and convince you not to judge it by its name. Don’t miss out on learning about one of today’s most interesting technologies by making the same mistake I did when I thought I knew what The Selfish Gene was about.

Defining Big Data

First, let’s define some terms. Big Data here means the new generation of distributed data technologies: typically, software from the Hadoop stack, built on top of the HDFS distributed file system and other Hadoop ecosystem projects. These are distributed systems built on a distributed storage layer (such as HDFS), a resource management layer, and a pluggable computation layer. Examples of Big Data technologies are Hive, Spark, Presto, Impala, and Flink. We call these solutions Big Data because they were originally developed to cope with very large-scale data. Their origins are in the Google File System (GFS) and MapReduce papers, which were then turned into the open-source implementation called Hadoop. These technologies were originally built to index huge volumes of web data for running a search engine, so the name Big Data was initially appropriate. But what if you’re not building a search engine? Do you really need huge volumes of data to justify using these technologies? I’ll argue you don’t, and that there are three main use cases showing Big Data is not just about data being “Big”.

Exploratory BI

The first use case is exploratory BI. Traditionally, with most databases, you have to define a schema and load the data into that schema before you can query it. This lets the database heavily optimize the physical storage of the data so it can be queried efficiently. The disadvantage of defining a schema ahead of time is that it can take a lot of effort to design the schema and load your data into it (it’s often estimated that around 70% of data project work is ETL, which is basically moving data into a place where it can be queried). In contrast, some of the most popular Big Data query engines, like Hive, do something known as schema-on-read. Because Hadoop stores physical data in one layer, independent of its metadata and the query engines, it’s possible to access the storage layer directly. This design was useful for the original MapReduce use case of Hadoop, allowing data to be stored independently of how it is processed. Hadoop-based query engines like Hive have continued this design: data is registered in the metadata layer with pointers to the location of the physical data in the storage layer. What this means is that you can just dump data into Hadoop, without worrying too much about the schema, and then query it almost immediately, with little or no pre-processing. In other words, you can dump all your data to a file system as simple CSV or JSON files before you know how you’re going to use it, and then choose how to make it queryable later. Being able to arbitrarily query any raw stored data like this, and easily change the schema independently of the physical data, is what enables exploratory BI: users can explore the data in a raw, relatively unprocessed form before it is curated and made more performant later.
Yes, exploratory BI might have slower query performance than if the data had been stored in a more optimized way, but the data is so much easier to process, load and explore that you can investigate it before choosing the best way to store it. If, after some exploration, the data turns out to be useful, the user might then refine or model it, put it in another type of database or warehouse for distribution, change its physical structure, and/or schedule an ETL process to keep it up to date. But before all that, users often just want to explore and play with the data, even if the queries are not fast. With Big Data tech, exploratory BI over unprocessed data becomes possible, and this is useful whether or not your data is “Big”.
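To make the schema-on-read idea concrete, here is a minimal sketch in plain Python (not Hive or any real Hadoop API; the record names and fields are invented for illustration). Raw JSON lines are dumped with no schema declared up front, and a schema is applied only at query time:

```python
import io
import json

# Hypothetical raw dump: event records landed as JSON lines, with no
# schema declared at load time. In a real cluster this would be files
# sitting on a shared filesystem like HDFS.
raw_dump = io.StringIO(
    '{"user": "alice", "action": "login", "ms": 120}\n'
    '{"user": "bob", "action": "search", "ms": 340, "query": "hadoop"}\n'
    '{"user": "alice", "action": "logout", "ms": 15}\n'
)

# Schema-on-read: the schema is chosen at query time, not at load time.
# Here we decide, only now, that we care about (user, action) pairs.
schema = ["user", "action"]

rows = [
    tuple(record.get(col) for col in schema)
    for record in (json.loads(line) for line in raw_dump)
]

print(rows)  # [('alice', 'login'), ('bob', 'search'), ('alice', 'logout')]
```

Note that records can carry extra fields (like "query" above) without breaking anything: a different schema can be projected over the same raw dump later, with no reloading.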

Specialized query engines

The second use case for Big Data tech on non-big data is specialized query engines. As the Hadoop platform matured, something interesting happened: people realized the platform had other potential uses, and they started extending it. The fact that you could swap out and plug in fundamental components, such as query engines, was hugely impactful. In other words, you could think of Hadoop as a traditional database where you can swap out the different components. This flexibility, combined with the ease of use and the ability to scale horizontally, allowed companies like Hortonworks and Cloudera to be built around the stack. Research groups like AMPLab have also built hugely successful academic projects around these open stacks, and have been pushing the boundaries of the technology. The result of all this is a vibrant ecosystem that is large and varied, out of which a lot of popular new compute engines have emerged, built on top of Hadoop, e.g. Presto and Spark. If you have a specialized use case for storing and processing data, e.g. time-series data, graph databases, geo-databases, or one of many others, there may be a Big Data solution out there for you. Given an easy, pluggable foundation in HDFS, many groups have chosen to build on top of it and gain the network effects of the community, rather than building their own systems from scratch. So, if you have highly specialized data and need a specialized engine, you might be able to find a Hadoop-based solution, and again, this can be valuable even if your data is not large.
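The pluggability described above boils down to one design decision: storage and compute are decoupled, so different engines can serve the same stored data. Here is a toy sketch of that idea in plain Python (these "engines" are hypothetical stand-ins, not real Hadoop components):

```python
# Illustrative sketch: the same raw storage can be served by
# interchangeable "engines", because storage and compute are decoupled.
# This list stands in for files on a shared filesystem like HDFS.
storage = [
    {"ts": 1, "value": 10},
    {"ts": 2, "value": 30},
    {"ts": 3, "value": 20},
]

def batch_engine(data):
    """A hypothetical batch-style engine: aggregate everything."""
    return sum(r["value"] for r in data)

def timeseries_engine(data):
    """A hypothetical time-series-style engine: return the latest point."""
    return max(data, key=lambda r: r["ts"])["value"]

# Swap engines without touching the stored data.
print(batch_engine(storage))       # 60
print(timeseries_engine(storage))  # 20
```

A specialized engine like a graph or geo query engine plugs in the same way: it brings its own compute model, but reads the data where it already lives.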

The data platform

The third use case is the data platform scenario. You can get a lot of network effects by having all your data online and accessible in one place (especially when combined with exploratory BI). If all your data is in one shared platform, you get the following advantages:

- Making all your archived data online and queryable is now relatively cheap, and this can be very valuable for scenarios like compliance and auditing.
- Modelling initiatives and joining datasets from different sources are much easier, because all your data is in the same place and ready to be processed.
- Building custom tools on top of the Hadoop foundations can be much cheaper than building from scratch, because a lot of the fundamental building blocks are already provided by the Hadoop ecosystem.
- Specialized tools can be used by everyone who has access to the platform, and query engines can be made accessible to all.
- Operations can be centralized and the cost amortized amongst teams.
- You get proven horizontal scale.
- Operational cost is relatively low compared to other data solutions.

There are, of course, many trade-offs in selecting a technical stack, and this is just as true for data stacks. If you have huge data sets, Big Data technologies are a natural choice, but even if your data isn’t “Big” it’s still worth taking a look. The misleading focus on “Big” and scale shouldn’t stop you from exploring a set of technologies that might be more applicable to your situation than the name would imply. You can get the benefits of exploratory BI, the use of specialized ecosystem tools, and the opportunity to build a shared data platform. What would be a better name than Big Data? I’ll leave that discussion for another post.


