
Data Pipeline Evolution At LinkedIn In A Few Pictures

The LinkedIn Engineering blog is a great source of technical posts on building large-scale data pipelines with Kafka and its “ecosystem” of tools. In this post I show, in a few pictures, how the data pipeline has evolved at LinkedIn over the years.

Problem Definition

“We had dozens of data systems and data repositories. Connecting all of these would have led to building custom piping between each pair of systems, something like this:”

[Figure: datapipeline_complex]

Source: “The Log: What every software engineer should know about real-time data’s unifying abstraction” by Jay Kreps.

Idealistic Vision

“Instead, we needed something generic like this:”

[Figure: datapipeline_simple]

Source: “The Log: What every software engineer should know about real-time data’s unifying abstraction” by Jay Kreps.

Kafka Does All The Magic!

Kafka became a universal pipeline (…) It enabled near realtime access to any data source, empowered our Hadoop jobs, allowed us to build realtime analytics, vastly improved our site monitoring and alerting capability, and enabled us to visualize and track our call graphs.

[Figure: kafka_broker]

Source: “A Brief History of Scaling LinkedIn” by Josh Clemm.
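To make the “universal pipeline” idea concrete, here is a minimal sketch of what publishing to and consuming from such a pipeline looks like with the plain Kafka Java clients. The topic name page-views, the broker address and the consumer group are made up for illustration; the real pipelines obviously involve schemas, many partitions and many consumer groups.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewPipeline {
    public static void main(String[] args) {
        // Producer side: any service can publish an event to the shared topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("page-views", "member-42", "{\"page\": \"/jobs\"}"));
        }

        // Consumer side: Hadoop loads, monitoring and real-time analytics each
        // read the same topic with their own consumer group, at their own pace.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "analytics");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("auto.offset.reset", "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("key=%s value=%s%n", record.key(), record.value());
            }
        }
    }
}

The key property of this log-centric design is that adding one more consumer, say a new Hadoop load or a new monitoring job, does not require touching the producers at all.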

Loading To Hadoop Looks So Simple …

As simple as the picture below:

[Figure: kafka-to-hadoop-simple]
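Conceptually it really is that simple: consume a topic and append the records to files in HDFS. Below is a minimal sketch of that naive version (this is not Camus or Gobblin code; the topic name, the hdfs://namenode:8020 URI and the output path are made up). Everything the picture hides is what the next section is about: offset management, file rolling, partitioning by time, schema evolution and data quality.

import java.net.URI;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NaiveKafkaToHdfs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "hdfs-loader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Connect to HDFS and open one output file for this batch.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             FSDataOutputStream out = fs.create(new Path("/data/page_views/part-00000"))) {
            consumer.subscribe(Collections.singletonList("page-views"));
            // Drain one batch and write each record as a line in the HDFS file.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(10))) {
                out.writeBytes(record.value() + "\n");
            }
        }
    }
}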

Deployment Reality (loading to Hadoop only)

“The figure shows the complexity of the data pipelines. Some of the solutions like our Kafka-etl (Camus), Oracle-etl (Lumos) and Databus-etl pipelines were more generic and could carry different kinds of datasets, others like our Salesforce pipeline were very specific. At one point, we were running more than 15 types of data ingestion pipelines and we were struggling to keep them all functioning at the same level of data quality, features and operability.”

[Figure: gobblin_complex2]

Source: Gobblin’ Big Data With Ease by Shirshanka Das and Lin Qiao.

Deployment Reality (loading to and from Hadoop)

As Jay Kreps wrote several years ago:

Note that data often flows in both directions, as many systems (databases, Hadoop) are both sources and destinations for data transfer. This meant we would end up building two pipelines per system: one to get data in and one to get data out.

It seems that this didn’t save LinkedIn from building a complex system like the one below:

[Figure]

Source: Gobblin’ Big Data With Ease @ QConSF 2014 by Lin Qiao.

Deployment Reconsidered

“Late last year (2013), we took stock of the situation and tried to categorize the diversity of our integrations a little better. (…) We also realized there were some common patterns and requirements. (…) We’ve brought these demands together to form the basis for our uber-ingestion framework Gobblin. As the figure below shows, Gobblin is targeted at “gobbling in” all of LinkedIn’s internal and external datasets through a single framework.”

[Figure]

Source: Gobblin’ Big Data With Ease by Shirshanka Das and Lin Qiao.

Reconsidered Idea

Our motivations for building Gobblin stemmed from our operational challenges in building and maintaining disparate pipelines for different data sources across batch and streaming ecosystems. (…) Our first target sink was Hadoop’s ubiquitous HDFS storage system and that has been our focus for most of last year. (…) At LinkedIn, Gobblin is currently integrated with more than a dozen data sources including Salesforce, Google Analytics, Amazon S3, Oracle, LinkedIn Espresso, MySQL, SQL Server, SFTP, Apache Kafka, patent and publication sources, CommonCrawl, etc.

[Figure: gobblin-ingest-ecosystem]

Source: Bridging Batch and Streaming Data Ingestion with Gobblin by Shirshanka Das.
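To give a feel for what “a single framework” means, a Gobblin job is described declaratively as a chain of source, converters, writer and publisher, driven by a job configuration rather than custom pipeline code. The snippet below is a rough sketch modeled on Gobblin’s public Kafka-to-HDFS quickstart; treat the exact property and class names as approximate, since they differ between Gobblin versions, and the topic, broker and paths are made up.

# Illustrative Gobblin job config (Kafka -> HDFS), modeled on the public quickstart
job.name=KafkaToHdfsQuickStart
job.group=GobblinKafka

kafka.brokers=localhost:9092
topic.whitelist=page-views
bootstrap.with.offset=earliest

source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=txt
data.publisher.type=gobblin.publisher.BaseDataPublisher

fs.uri=hdfs://namenode:8020
data.publisher.final.dir=/data/gobblin/page_views

Swapping Kafka for Salesforce or MySQL means swapping the source class and its connection properties; the writer, publisher and the operational machinery around them stay the same, which is the whole point of the framework.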

Sooner or later, Gobblin will also be integrated with sinks other than Hadoop, such as real-time stream processing frameworks, e.g. Samza, Storm, Flink Streaming and Spark Streaming.

The ideal deployment scenario is where we can deploy Gobblin in continuous ingestion mode. (…) This will bring further latency reductions in our ingestion from streaming sources, enable resource utilization efficiencies and allow us to integrate with streaming sinks seamlessly.

Source: Bridging Batch and Streaming Data Ingestion with Gobblin by Shirshanka Das.

Coming Full Circle?

Even though Gobblin is probably the most recent data-ingestion innovation at LinkedIn, there is one more brand-new project that might soon make a difference.

Kafka Connect is a tool for copying data between Kafka and a variety of other systems, ranging from relational databases, to logs and metrics, to Hadoop and data warehouses, to NoSQL data stores, to search indexes, and more.

[Figure: kafka_connect_source_sink_flow_diagram_on_white]

Source: Confluent Platform 2.0 is GA! by Neha Narkhede.
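With Kafka Connect, moving data into or out of Kafka is again a matter of configuration rather than custom code. As a simple illustration, the FileStreamSource connector that ships with Kafka can be run in standalone mode with a properties file roughly like this (the file path and topic name are made up):

# connect-file-source.properties (illustrative)
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/var/log/app/events.log
topic=app-events

It is launched with bin/connect-standalone.sh config/connect-standalone.properties connect-file-source.properties; sink connectors, for example to HDFS or JDBC, are configured in the same declarative way.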

Summary

Although it’s great to see new world-class open-source tools that simplify Big Data ingestion, the vision often differs from the reality.

A picture is not always worth a thousand words; sometimes the picture should be explained with a thousand words :)

