The data pipeline is the central pillar of modern data-intensive applications. In the first post of this series, we’ll take a look at the history of the data pipeline and how these technologies have evolved over time. Later we’ll describe how we are leveraging some of these systems at Barracuda, things to consider when evaluating components of the data pipeline, and novel example applications to help you get started with building and deploying these technologies.
In 2004 Jeff Dean and Sanjay Ghemawat of Google published MapReduce: Simplified Data Processing on Large Clusters. They described MapReduce as:
“[…] a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
With the MapReduce model, they were able to simplify the parallelized workload of generating Google’s web index. This workload was scheduled against a cluster of nodes and offered the ability to scale to keep up with the growth of the web.
An important consideration of MapReduce is how and where data is stored across the cluster. At Google this was handled by what they dubbed the Google File System (GFS). An open-source implementation of GFS from the Apache Nutch project was ultimately folded into an open-source alternative to MapReduce called Hadoop, which emerged at Yahoo! in 2006. (Doug Cutting named Hadoop after a toy elephant that belonged to his son.)
Apache Hadoop: An open source implementation of MapReduce
Hadoop was met with wide popularity, and soon developers were introducing abstractions to describe jobs at a higher level. Where the inputs, mapper, combiner, and reducer functions of jobs were previously specified with much ceremony (usually in plain Java), users now had the ability to build data pipelines using common sources, sinks, and operators with Cascading. With Pig, developers specified jobs at an even higher level with an entirely new domain-specific language called Pig Latin. See word count in Hadoop, Cascading (2007), and Pig (2008) for comparison.
Apache Spark: A unified analytics engine for large-scale data processing
In 2009 Matei Zaharia began work on Spark at the UC Berkeley AMPLab. His team published Spark: Cluster Computing with Working Sets in 2010, which described a method for reusing a working set of data across multiple parallel operations, and released the first public version in March of that year. A follow-on 2012 paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, won Best Paper at the USENIX Symposium on Networked Systems Design and Implementation. It describes Resilient Distributed Datasets (RDDs), an abstraction that lets programmers keep intermediate results in memory and achieve order-of-magnitude speedups over equivalent Hadoop jobs for iterative algorithms like PageRank and machine learning.
Along with performance improvements for iterative algorithms, another major innovation that Spark introduced was the ability to perform interactive queries. Spark leveraged an interactive Scala interpreter to allow data scientists to interface with the cluster and experiment with large data sets much more rapidly than the existing approach of compiling and submitting a Hadoop job and waiting for the results.
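The working-set idea can be illustrated without a cluster at all. The toy class below (my own illustrative names, not Spark's API) loads a dataset once on first use and serves the in-memory copy thereafter, which is the essence of why iterative algorithms are so much faster on Spark than on Hadoop, where every pass rereads from disk:

```python
def expensive_load():
    """Stand-in for reading a large dataset from distributed storage."""
    return [float(x) for x in range(1, 101)]

class CachedDataset:
    """Loads lazily on first access, then serves the in-memory copy."""
    def __init__(self, load):
        self._load = load
        self._data = None
        self.loads = 0  # count how many times we hit "storage"

    def collect(self):
        if self._data is None:
            self._data = self._load()
            self.loads += 1
        return self._data

data = CachedDataset(expensive_load)

# An iterative computation touches the dataset many times...
estimate = 0.0
for _ in range(10):
    points = data.collect()  # served from memory after the first pass
    estimate += sum(points) / len(points) - estimate  # toy update rule

# ...but storage was only read once (data.loads == 1).
```

Real RDDs add the parts this sketch omits: partitions spread across a cluster, and a recorded lineage of transformations so that a lost in-memory partition can be recomputed rather than replicated.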
One problem persisted, however: Hadoop and Spark jobs only consider data from a bounded source, with no accounting for new data arriving at runtime. A job is pointed at an input source, decomposed into parallelizable chunks or tasks, executed across the cluster simultaneously, and finally the results are combined and the output stored somewhere. This works well for jobs like generating PageRank indexes or fitting a logistic regression, but it is the wrong tool for the vast number of jobs that must work against an unbounded, streaming source, such as click-stream analysis or fraud prevention.
Apache Kafka: A distributed streaming platform
In 2010 the engineering team at LinkedIn was undertaking the task of rearchitecting the underpinnings of the popular career social network [A Brief History of Kafka, LinkedIn’s Messaging Platform]. Like many websites, LinkedIn transitioned from a monolithic architecture to one with interconnected microservices — but adopting a new architecture based on a universal pipeline built upon a distributed commit log called Kafka enabled LinkedIn to rise to the challenge of handling event streams in near real-time and at considerable scale. Kafka was so-named by LinkedIn principal engineer Jay Kreps because it was “a system optimized for writing,” and Jay was a fan of the work of Franz Kafka.
The primary motivation for Kafka at LinkedIn was to decouple the existing microservices such that they could evolve more freely and independently. Previously, whatever schema or protocol was used to enable cross-service communication had bound the coevolution of services. The infrastructure team at LinkedIn realized that they needed more flexibility to evolve services independently. They designed Kafka to facilitate communication between services in a way that could be asynchronous and message-based. It needed to offer durability (persist messages to disk), be resilient to network and node failure, offer near real-time characteristics, and scale horizontally to handle growth. Kafka met these needs by delivering a distributed log (see The Log: What every software engineer should know about real-time data's unifying abstraction).
Kafka was open-sourced in 2011 and quickly saw widespread adoption. It improved on earlier message-queue and pub-sub systems like RabbitMQ and HornetQ in a few key ways:
- Kafka topics (queues) are partitioned to scale out across a cluster of Kafka nodes (called brokers).
- Kafka leverages ZooKeeper for cluster coordination, high-availability, and failover.
- Messages are persisted to disk for very long durations.
- Messages are consumed in order.
- Consumers maintain their own state regarding the offset of the last message they consumed.
These properties free producers from keeping state about the acknowledgment of each individual message, so messages can be streamed to the filesystem at a high rate. And because consumers are responsible for maintaining their own offset into a topic, they can handle updates and failures gracefully.
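The division of responsibility described above can be sketched in miniature. In this toy model (the names `Topic`, `Consumer`, and `poll` are illustrative, not the Kafka client API), the broker side is nothing but append-only partition lists, and all read-position state lives with the consumer:

```python
class Topic:
    """A topic is a set of partitions, each an append-only log."""
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key, message):
        # Key-based partitioning keeps all messages for one key in order
        # within a single partition.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)

class Consumer:
    """Consumers own their offsets; the broker keeps no per-consumer state."""
    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        """Return all unread messages, advancing this consumer's offsets."""
        batch = []
        for i, part in enumerate(self.topic.partitions):
            batch.extend(part[self.offsets[i]:])
            self.offsets[i] = len(part)
        return batch

topic = Topic()
topic.produce("user-1", "login")
topic.produce("user-1", "click")

consumer = Consumer(topic)
first = consumer.poll()   # both messages, in production order
second = consumer.poll()  # nothing new; no broker-side tracking needed
```

A second `Consumer` on the same topic would start from offset zero and replay the whole log, which is exactly what makes the retained, durable log so useful for rebuilding downstream state.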
Apache Storm: Distributed real-time computation system
Meanwhile, in May of 2011, Nathan Marz was inking a deal with Twitter to acquire his company BackType. BackType was a business that “built analytics products to help businesses understand their impact on social media both historically and in real-time” [History of Apache Storm and Lessons Learned]. One of the crown jewels of BackType was a real-time processing system dubbed “Storm.” Storm brought an abstraction called a “topology” whereby stream operations were simplified in a similar nature to what MapReduce had done for batch processing. Storm became known as “the Hadoop of real-time” and quickly shot to the top of GitHub and Hacker News.
Apache Flink: Stateful computations over data streams
Flink also made its public debut in May 2011. It owes its roots to a research project called “Stratosphere” [http://stratosphere.eu/], a collaborative effort across a handful of German universities. Stratosphere was designed with the goal “to improve the efficiency of massively parallel data processing on Infrastructure as a Service (IaaS) platforms” [http://www.hpcc.unical.it/hpc2012/pdfs/kao.pdf].
Like Storm, Flink provides a programming model to describe dataflows (called “Jobs” in Flink parlance) that include a set of streams and transformations. Flink includes an execution engine to effectively parallelize the job and schedule it across a managed cluster. One unique property of Flink is that the programming model facilitates both bounded and unbounded data sources. This means that the difference in syntax between a run-once job that sources data from a SQL database (what may have traditionally been a batch job) versus a run-continuously job operating upon streaming data from a Kafka topic is minimal. Flink entered the Apache incubation project in March 2014 and was accepted as a top-level project in December 2014.
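The unified bounded/unbounded idea can be illustrated with plain Python iterables (this is a conceptual sketch, not Flink's DataStream API): the same transformation pipeline runs unchanged over a finite list or an endless generator, because both are simply iterated:

```python
import itertools

def pipeline(source):
    """Parse, filter, and transform events from any iterable source."""
    events = (line.strip() for line in source)
    errors = (e for e in events if e.startswith("ERROR"))
    return (e.upper() for e in errors)

# Bounded source: a list, as a run-once job might read from a database dump.
bounded = ["INFO ok", "ERROR disk full", "ERROR timeout"]
batch_result = list(pipeline(bounded))

# Unbounded source: a generator that never ends, like a Kafka topic.
def stream():
    i = 0
    while True:
        yield f"ERROR event {i}"
        i += 1

# A run-continuously job would consume this forever; here we take a
# finite slice just to observe the output.
stream_result = list(itertools.islice(pipeline(stream()), 2))
```

In Flink the analogous job definition is likewise identical in shape; only the source connector (a JDBC input versus a Kafka topic) and the job's termination behavior differ.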
In May of 2014, Spark 1.0.0 was released, and it included the debut of Spark SQL. Although Spark at the time offered streaming capability only by splitting a data source into “micro-batches,” the groundwork was in place for executing SQL queries as streaming applications.
Apache Beam: A unified programming model for both batch and streaming jobs
In 2015 a collective of engineers at Google released a paper entitled The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. An implementation of the Dataflow model was made commercially available on Google Cloud Platform in 2014. The core SDK of this work, as well as several IO connectors and a local runner, were donated to Apache and became the initial release of Apache Beam in June of 2016.
One of the pillars of the Dataflow model (and Apache Beam) is that the representation of the pipeline itself is abstracted away from the choice of execution engine. At the time of writing, Beam is able to compile the same pipeline code to target Flink, Spark, Samza, GearPump, Google Cloud Dataflow, and Apex. This affords the user the option to evolve the execution engine at a later time without altering the implementation of the job. A “Direct Runner” execution engine is also available for testing and development in the local environment.
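The separation Beam draws between pipeline and engine can be sketched as follows. Here the pipeline is plain data and each "runner" is a function that interprets it; the names and structure are my own illustration of the idea, not Beam's SDK:

```python
# The pipeline definition: an ordered list of transforms, independent of
# any execution engine.
pipeline = [
    ("map", lambda x: x * x),
    ("filter", lambda x: x % 2 == 0),
]

def direct_runner(pipeline, source):
    """Execute eagerly in the local process, like a development runner."""
    data = list(source)
    for kind, fn in pipeline:
        if kind == "map":
            data = [fn(x) for x in data]
        elif kind == "filter":
            data = [x for x in data if fn(x)]
    return data

def lazy_runner(pipeline, source):
    """Build a lazy iterator chain instead: same pipeline, different engine."""
    data = iter(source)
    for kind, fn in pipeline:
        data = map(fn, data) if kind == "map" else filter(fn, data)
    return list(data)

# Both engines yield the same result from the same pipeline description,
# so the engine can be swapped without touching the job.
assert direct_runner(pipeline, range(5)) == lazy_runner(pipeline, range(5))
```

Beam's real runners translate the pipeline graph into each engine's native job representation (a Flink job graph, a Spark DAG, and so on), but the contract is the same: the pipeline is a description, and the runner is an interpreter.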
In 2016 the Flink team introduced Flink SQL. KSQL, a streaming SQL engine for Kafka, was announced in August 2017, and in May of 2019 a group of engineers from the Apache Beam, Apache Calcite, and Apache Flink communities presented “One SQL to Rule Them All: An Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables,” a step toward a unified streaming SQL.
Where we’re headed
The tools available to software architects designing the data pipeline continue to evolve at an increasing velocity. We’re seeing workflow engines like Airflow and Prefect integrate systems like Dask, letting users parallelize and schedule massive machine learning workloads against a cluster. Emerging contenders like Apache Pulsar and Pravega are competing with Kafka over the storage abstraction for the stream. We’re also seeing projects like Dagster, Kafka Connect, and Siddhi integrate existing components and deliver novel approaches to visualizing and designing the data pipeline. The rapid pace of development in these areas makes it a very exciting time to build data-intensive applications.
If working with these kinds of technologies interests you, we encourage you to get in touch! We’re hiring for multiple engineering roles in multiple locations.