Wednesday, August 26, 2015

Real-time Data Distribution with Apache Kafka



I wrote a blog post about CenturyLink Cloud's usage of Kafka for the CenturyLink Cloud website. That post, in its entirety, can be found below. Thanks. Chris

Real Time Data Distribution with Apache Kafka


Like most enterprises and service providers, here at CenturyLink Cloud we aim to please our customers by making data-driven decisions. As our business grew, so did the amount of data we collected. The effort required to distill, distribute and analyze that data in meaningful ways became increasingly strenuous. In short, we needed to become faster with our analytics. If your organization is becoming overwhelmed with data, read on, as I'll share how we used Apache Kafka to solve the following challenges with collecting and distributing data at scale.

·      Efficient access to data
·      Speed
·      Scalability
·      Distributed, Fault Tolerant and Highly Available


Challenge #1: Efficient access to data

Getting access to data is sometimes the hardest part. Sometimes the data you need lives in some isolated software service and a one-off data export is required. Perhaps your data integration flow is overly complex due to your adoption of a service-oriented architecture. Maybe you had a process that required multiple teams to do work before data moved around. We saw all of this and more. We wanted to make it easier to move data around, which would increase the overall velocity of our organization. Our data distribution flow between services looked something like this:

Previous Data Distribution Model


To fix this, we centralized our data flow with Apache Kafka. Kafka is commonly referred to as a pub/sub (publish and subscribe) messaging system. Simply put, systems publish their data into a Kafka 'topic'. Other systems subscribe to the desired 'topic' within Kafka and extract the data. Kafka acts as a messenger, moving data to where it needs to go. Kafka, or any pub/sub messaging system, decouples the publishing of data from the subscription of data. Publishers only need to worry about sending the message. Subscribers only need to worry about receiving the message. It's important to know that Kafka is not intended to be used as a permanent data store. Rather, it provides a reliable, fast and scalable method to distribute data from data publishers to subscribers. Data storage is left to the consumers. By using Kafka we had an asynchronous, decoupled, centralized communication mechanism that looked like this:


Data Distribution Model with Kafka


We moved away from the many-to-many data distribution model to a centralized model. All of our services need to send their data only once, regardless of the number of destinations for the data. This also means that our services only need to pull from one source versus worrying about implementing a variety of different data integration technologies. This centralized design reduced the overall effort our engineers spent distributing data.
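To make the pub/sub flow concrete, here is a minimal sketch using the open source kafka-python client (other client libraries work similarly). The broker address and the 'server-metrics' topic are hypothetical stand-ins for whatever your services publish:

```python
# Minimal pub/sub sketch using the kafka-python client.
# Broker address and topic name are hypothetical.
from kafka import KafkaProducer, KafkaConsumer

# Publisher: hands a message to the cluster and moves on;
# it knows nothing about who (if anyone) will read it.
producer = KafkaProducer(bootstrap_servers='kafka01:9092')
producer.send('server-metrics', b'{"host": "web-01", "cpu": 42.5}')
producer.flush()  # block until the broker has acknowledged the send

# Subscriber: reads from the topic at its own pace;
# it knows nothing about who published the data.
consumer = KafkaConsumer('server-metrics', bootstrap_servers='kafka01:9092')
for message in consumer:
    print(message.value)  # persistence is the consumer's job, not Kafka's
```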

Note: There are alternative messaging software systems out there. RabbitMQ is similar in capabilities and may suit your needs better, depending on your requirements.

Challenge #2: The Need for Speed

Time is money. The quicker we can collect, detect, process, predict, and take action on our data, the faster we can act on behalf of our customers. We needed a data distribution model that enabled both near real time (streaming) analytics as well as the more traditional batch analytics. Enter the Lambda Architecture framework. The Lambda framework calls for all new data to be distributed simultaneously to both stream and batch processing pipelines, as such:


Lambda’s view of data distribution

Using the Lambda framework and Kafka as the messaging queue, we gained the ability to seamlessly add streaming analytics when needed (we use Apache Spark's Streaming module). In addition, it helps that Kafka itself is fast. Kafka's data structure is essentially a set of log files. As data comes into Kafka, Kafka appends the data to a log file. If subscribers are pulling data out in real time, they are also reading from the end of that file. This allows Linux's page cache to serve those reads from memory while writes go to disk. Lambda, Spark and Kafka's low-latency messaging allow us to make near real-time decisions for our customers.
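The fan-out to both pipelines falls naturally out of Kafka's consumer-group model. Here is a sketch with the kafka-python client, again with hypothetical topic, broker and group names:

```python
# Sketch of the Lambda fan-out using Kafka consumer groups (kafka-python).
# Each consumer group tracks its own offsets, so both pipelines read the
# same messages independently. Topic, broker and group names are hypothetical.
from kafka import KafkaConsumer

# Speed layer: a long-running process that reads messages as they arrive.
streaming = KafkaConsumer('server-metrics',
                          bootstrap_servers='kafka01:9092',
                          group_id='streaming-pipeline')

# Batch layer: a separate process (run on a schedule) with its own group,
# consuming the very same messages without disturbing the stream above.
batch = KafkaConsumer('server-metrics',
                      bootstrap_servers='kafka01:9092',
                      group_id='batch-pipeline')
```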

Note: An alternative framework to Lambda that is also worth looking into, as described in this article, suggests that users can feed data into a stream processing pipeline which in turn feeds the data into the batch pipeline.

Challenge #3: Scalability

The amount of data we collect and process is large and growing. We needed a data distribution model that would scale horizontally, use cloud infrastructure, support large workloads and auto-scale without impacting the data distribution flow. Kafka was built to scale and has proven to work well under heavy load, as described in this LinkedIn post. Kafka's ability to scale is achieved through its clustering technology. It relies on the concept of data partitioning to distribute load across the members of a cluster. If you need more data throughput in a particular location, you simply add more nodes to the Kafka cluster. Many cloud providers, like ours, and infrastructure automation frameworks provide the mechanisms to automate the creation and scaling of Kafka clusters.
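As an illustration, here is a sketch of keyed publishing with the kafka-python client, assuming a hypothetical 'server-metrics' topic created with multiple partitions. Messages with the same key always land on the same partition, while different keys spread the load across the brokers in the cluster:

```python
# Sketch of partitioned publishing with kafka-python. The 'server-metrics'
# topic is assumed to have been created with multiple partitions.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='kafka01:9092,kafka02:9092')
for host in ('web-01', 'web-02', 'db-01'):
    # The key is hashed to choose a partition: one host's messages stay
    # ordered on one partition, while different hosts spread across the
    # cluster. Adding partitions (and brokers) raises total throughput.
    producer.send('server-metrics', key=host.encode(),
                  value=b'{"cpu_percent": 7.0}')
producer.flush()
```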

Challenge #4: Distributed, Fault Tolerant and Highly Available

Losing data is bad. Having data stop flowing is bad. Having data only available in one datacenter is bad. We needed a solution that wasn't bad, and Kafka had us covered. Kafka supports data replication inside a cluster. This means that even if a server in a Kafka cluster crashes, data in the Kafka message bus is not lost. Great. In addition to replication, Kafka uses sequential IDs to enable the graceful restarting of a data flow. If a client stops for any reason, another client can use the sequential IDs to pick up where the other client left off. To process data in aggregate, or to have high availability on the aggregate data, we need to replicate data to multiple data centers. Kafka provides an easy mechanism (its MirrorMaker tool) to have a Kafka cluster pull data from Kafka clusters in other data centers. We use this to pull data from all our data centers into a single data center in order to perform analytics in aggregate. Kafka is reliable, distributed and fault tolerant. It just works.

Note: Each message's sequential ID is its offset. The delta between the newest offset published and the offset a subscriber has consumed (often called consumer lag) is commonly used as a KPI on data flow through Kafka.
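Here is a sketch of that restart behavior with the kafka-python client; the topic and group names are hypothetical. With a group_id set and offset commits enabled, a replacement consumer resumes from the last committed offset:

```python
# Sketch of offset-based restart with kafka-python. With a group_id set and
# auto-commit enabled, a replacement consumer resumes from the last committed
# offset instead of losing its place. Names are hypothetical.
from kafka import KafkaConsumer

def process(payload):
    print(payload)  # stand-in for real processing

consumer = KafkaConsumer('server-metrics',
                         bootstrap_servers='kafka01:9092',
                         group_id='analytics',
                         enable_auto_commit=True)  # periodically record position
for message in consumer:
    process(message.value)
    # If this process dies here, another consumer in the 'analytics' group
    # picks up from the last committed sequential ID (offset).
```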


What’s wrong with Kafka?

So what's wrong with Kafka? Not much. It solves our challenges and the price is right: free. The one drawback with Kafka that I'd like to call out is the complexity it puts on its clients. For example, each client needs to connect to every Kafka server in a particular cluster versus making only a single connection to the cluster. It needs to do this because the data put into a Kafka cluster is split across the members of the cluster. Thankfully there are SDKs, APIs and tools available that help engineers overcome these challenges if a little time is spent researching and testing them. In the end, we gladly accepted the burden on the client in exchange for Kafka's reliability, performance and scalability.
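For example, with the kafka-python client you hand the SDK a short bootstrap list of brokers and it discovers and manages the connections to the rest of the cluster for you (broker names hypothetical):

```python
# With kafka-python, the client is given a bootstrap list of brokers and
# then discovers and manages connections to the rest of the cluster itself.
# Broker names are hypothetical.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['kafka01:9092', 'kafka02:9092'])
```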

Summary


Kafka is the heart of our data stack. It pumps our data around so we can quickly and accurately act on behalf of our customers. Before Kafka, our data was moving around all willy-nilly. Post Kafka, we have a simplified, scaled and structured dataflow that enables our business to move faster than was previously possible. Today, we are publishing operational metrics, logs, usage and performance information into Kafka, and we are planning to import additional types of data. We have a variety of software systems consuming the data from Kafka: Apache Spark for real-time and batch analytics, Cassandra for data storage, a SQL-based data warehouse for reporting purposes, custom-written applications and ELK (Elasticsearch, Logstash and Kibana) for operational visualization and search. Kafka is pumping here at CenturyLink Cloud.
