I wrote a blog post for CenturyLink Cloud's website about our usage of Kafka. That post, in its full content, can be found below. Thanks. Chris
Real-Time Data Distribution with Apache Kafka
Like most enterprises and service providers, here at CenturyLink Cloud we aim to please our customers by making data-driven decisions. As we grew our business, so did the amount of data we collected. The effort required to distill, distribute, and analyze the data in meaningful ways was increasingly strenuous. In short, we needed to become faster with our analytics. If your organization is becoming overwhelmed with data, read on, as I'll share how we used Apache Kafka to solve the following challenges with collecting and distributing data at scale:
· Efficient access to data
· Speed
· Scalability
· Distributed, Fault Tolerant and Highly Available
Challenge #1: Efficient Access to Data
Getting access to data is sometimes the hardest part. Sometimes the data you need lives in an isolated software service and a one-off data export is required. Perhaps your data integration flow is overly complex due to your adoption of a service-oriented architecture. Maybe you had a process that required multiple teams to do work before data moved around. We saw all of this and more. We wanted to make it easier to move data around, which would increase the overall velocity of our organization. Our data distribution flow between services looked something like this:
[Figure: Previous Data Distribution Model]
To fix this, we centralized our data flow with Apache Kafka. Kafka is commonly referred to as a pub/sub (publish and subscribe) messaging system. Simply put, systems publish their data into a Kafka 'topic'. Other systems subscribe to the desired 'topic' within Kafka and extract the data. Kafka acts as a messenger, moving data to where it needs to go. Kafka, or any pub/sub messaging system, decouples the publishing of data from the subscription of data. Publishers only need to worry about sending the message. Subscribers only need to worry about receiving the message. It's important to know that Kafka is not intended to be used as a permanent data store. Rather, it provides a reliable, fast, and scalable method to distribute data from publishers to subscribers. Data storage is left to the consumers. By using Kafka we had an asynchronous, decoupled, centralized communication mechanism that looked like this:
[Figure: Data Distribution Model with Kafka]
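To make the pub/sub model concrete, here's a minimal sketch of a publisher and a subscriber. It assumes the kafka-python client library; the broker address, topic name, and message shape are illustrative, not our production setup.

    # Publisher: send a JSON message to a topic (kafka-python assumed).
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers='kafka01:9092',  # hypothetical broker address
        value_serializer=lambda v: json.dumps(v).encode('utf-8'))
    producer.send('server-metrics', {'host': 'web01', 'cpu': 0.42})
    producer.flush()

    # Subscriber: read from the same topic, fully decoupled from the publisher.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'server-metrics',
        bootstrap_servers='kafka01:9092',
        value_deserializer=lambda b: json.loads(b.decode('utf-8')))
    for message in consumer:
        print(message.value)

The publisher never knows who consumes the data, and the subscriber never knows who produced it; Kafka sits in the middle and brokers the exchange.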
We moved away from the many-to-many data distribution model to a centralized model. All of our services need to send their data only once, regardless of the number of destinations for the data. This also means that our services only need to pull from one source instead of implementing a variety of different data integration technologies. This centralized design reduced the overall effort our engineers spent distributing data.
Note: There are alternative messaging systems out there. RabbitMQ is similar in capabilities and may suit your needs better, depending on your requirements.
Challenge #2: The Need for Speed
Time is money. The quicker we can collect, detect, process, predict, and take action on our data, the faster we can act on behalf of our customers. We needed a data distribution model that enabled both near real-time (streaming) analytics and more traditional batch analytics. Enter the Lambda Architecture. The Lambda framework calls for all new data to be distributed simultaneously to both stream and batch processing pipelines, as shown below:
[Figure: Lambda's view of data distribution]
Using the Lambda framework and Kafka for the messaging queue, we gained the ability to seamlessly add streaming analytics when needed (we use Apache Spark's Streaming module). In addition, it helps that Kafka itself is fast. Kafka's data structure is essentially a set of log files. As data comes into Kafka, Kafka appends the data to a log file. If subscribers are pulling data out in real time, they are also reading from the end of that file. This allows Linux's page cache to serve reads from memory while writes go to disk. Lambda, Spark, and Kafka's low-latency messaging allow us to make near real-time decisions for our customers.
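As an illustration of the streaming leg, here's a minimal sketch of Spark reading the same Kafka topic the batch pipeline consumes. I'm using PySpark's Structured Streaming API for brevity (it requires the spark-sql-kafka connector on the classpath); the broker and topic names are again illustrative.

    # Streaming leg of the Lambda pipeline: Spark tails a Kafka topic.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('metrics-stream').getOrCreate()

    stream = (spark.readStream
              .format('kafka')
              .option('kafka.bootstrap.servers', 'kafka01:9092')  # hypothetical
              .option('subscribe', 'server-metrics')              # hypothetical topic
              .load())

    # Kafka delivers binary key/value pairs; cast the value before processing.
    values = stream.selectExpr('CAST(value AS STRING) AS json')

    query = (values.writeStream
             .format('console')  # a real sink would be Cassandra, ELK, etc.
             .start())
    query.awaitTermination()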
Note: An alternative framework to Lambda that is also worth looking into, as described in this article, suggests feeding data into a stream processing pipeline which in turn feeds the batch pipeline.
Challenge #3: Scalability
The amount of data we collect and process is large and growing. We needed a data distribution model that would scale horizontally, use cloud infrastructure, support large workloads, and auto-scale without impacting the data distribution flow. Kafka was built to scale and has proven to work well under heavy load, as described in this LinkedIn post. Kafka's ability to scale is achieved through its clustering technology. It relies on the concept of data partitioning to distribute load across members of a cluster. If you need more data throughput in a particular location, you simply add more nodes to the Kafka cluster. Many cloud providers, like ours, and infrastructure automation frameworks provide the mechanisms to automate the creation and scaling of Kafka clusters.
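Partitioning is configured per topic. As a sketch, here's how a topic could be created spread across several partitions using kafka-python's admin client; the partition count and replication factor below are illustrative, not a recommendation.

    # Create a topic whose data is split across 6 partitions so load can be
    # spread over the brokers in the cluster (kafka-python admin API assumed).
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers='kafka01:9092')  # hypothetical
    admin.create_topics(new_topics=[
        NewTopic(name='server-metrics',   # hypothetical topic
                 num_partitions=6,        # more partitions -> more parallelism
                 replication_factor=3)])  # each partition kept on 3 brokers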
Challenge #4: Distributed, Fault Tolerant and Highly Available
Losing data is bad. Having data stop flowing is bad. Having data only available in one datacenter is bad. We needed a solution that wasn't bad, and Kafka had us covered. Kafka supports data replication inside a cluster. This means that even if a server in a Kafka cluster crashes, data in the Kafka message bus is not lost. Great. In addition to replication, Kafka uses sequential IDs to enable the graceful restarting of a data flow. If a client stops for any reason, another client can use the sequential IDs to pick up where the other client left off. To process the data in aggregate, or to have high availability of the aggregate data, we need to replicate data to multiple data centers. Kafka provides an easy mechanism to have a Kafka cluster pull data from Kafka clusters in other data centers. We use this to pull data from all our data centers into a single data center in order to perform analytics in aggregate. Kafka is reliable, distributed, and fault tolerant. It just works.
Note: The sequential IDs are referred to as offsets. The delta between the publisher's latest offset and a subscriber's current offset (the lag) is often used as a KPI for data flow through Kafka.
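Measuring that lag is straightforward. Here's a minimal sketch using kafka-python; the consumer group and topic names are illustrative.

    # Consumer lag: the gap between the newest offset in a partition and the
    # offset this consumer group has committed (kafka-python assumed).
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers='kafka01:9092',
                             group_id='metrics-aggregator')  # hypothetical group
    tp = TopicPartition('server-metrics', 0)                 # partition 0

    latest = consumer.end_offsets([tp])[tp]  # newest offset publishers wrote
    committed = consumer.committed(tp) or 0  # where this group last left off
    print('lag on partition 0:', latest - committed)

A lag that grows without bound tells you subscribers are falling behind the publishers.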
What’s wrong with Kafka?
So what's wrong with Kafka? Not much. It solves our challenges and the price is right: free. The one drawback with Kafka that I'd like to call out is the complexity it puts on its clients. For example, each client needs to connect to every Kafka server in a particular cluster, versus making only a single connection to the cluster. It needs to do this because the data being put into a Kafka cluster is split across members of the cluster. Thankfully, there are SDKs, APIs, and tools available that help engineers overcome these challenges if a little time is spent researching and testing them. In the end, we gladly accepted the burden on the client in exchange for Kafka's reliability, performance, and scalability.
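In practice, the client libraries hide most of that complexity: you hand them a few 'bootstrap' brokers and they discover the rest of the cluster and its partition layout on their own. A sketch, with hypothetical broker addresses:

    # The client needs only a few seed brokers; the library then discovers
    # every broker and partition in the cluster (kafka-python assumed).
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'server-metrics',                                   # hypothetical topic
        bootstrap_servers=['kafka01:9092', 'kafka02:9092',  # seed brokers, not
                           'kafka03:9092'],                 # the whole cluster
        group_id='metrics-aggregator')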
Summary
Kafka is the heart of our data stack. It pumps our data around so we can quickly and accurately act on behalf of our customers. Before Kafka, our data was moving around all willy-nilly. Post Kafka, we have a simplified, scaled, and structured data flow that enables our business to move faster than was previously possible. Today, we are publishing operational metrics, logs, usage, and performance information into Kafka, and we are planning on importing additional types of data. We have a variety of software systems consuming the data from Kafka: Apache Spark for real-time and batch analytics, Cassandra for data storage, a SQL-based data warehouse for reporting purposes, custom-written applications, and ELK (Elasticsearch, Logstash, and Kibana) for operational visualization and search. Kafka is pumping here at CenturyLink Cloud.