Wednesday, August 26, 2015

Real-time Data Distribution with Apache Kafka



I wrote a blog post about CenturyLink Cloud's use of Kafka for the CenturyLink Cloud website. That post, in its full content, can be found below. Thanks. Chris

Real Time Data Distribution with Apache Kafka


Like most enterprises and service providers, here at CenturyLink Cloud we aim to please our customers by making data-driven decisions. As we grew our business, so did the amount of data we collected. The effort required to distill, distribute and analyze that data in meaningful ways became increasingly strenuous. In short, we needed to become faster with our analytics. If your organization is becoming overwhelmed with data, read on, as I'll share how we used Apache Kafka to solve the following challenges with collecting and distributing data at scale.

·      Efficient access to data
·      Speed
·      Scalability
·      Distributed, Fault Tolerant and Highly Available


Challenge #1: Efficient access to data

Getting access to data is sometimes the hardest part. Sometimes the data you need lives in some isolated software service and a one-off data export is required. Perhaps your data integration flow is overly complex due to your adoption of a service-oriented architecture. Maybe you had a process that required multiple teams to do work before data moved around. We saw all of this and more. We wanted to make it easier to move data around, which would increase the overall velocity of our organization. Our data distribution flow between services looked something like this:

Previous Data Distribution Model


To fix this, we centralized our data flow with Apache Kafka. Kafka is commonly referred to as a pub/sub (publish and subscribe) messaging system. Simply put, systems publish their data into a Kafka 'topic'. Other systems subscribe to the desired 'topic' within Kafka and extract the data. Kafka acts as a messenger, moving data to where it needs to go. Kafka, or any pub/sub messaging system, decouples the publishing of data from the subscription to that data. Publishers only need to worry about sending the message. Subscribers only need to worry about receiving the message. It's important to know that Kafka is not intended to be used as a permanent data store. Rather, it provides a reliable, fast and scalable way to distribute data from publishers to subscribers. Data storage is left to the consumers. By using Kafka we had an asynchronous, decoupled, centralized communication mechanism that looked like this:


Data Distribution Model with Kafka


We moved away from a many-to-many data distribution model to a centralized model. Our services only need to send their data once, regardless of the number of destinations for that data. It also means our services only need to pull from one source instead of implementing a variety of different data integration technologies. This centralized design reduced the overall effort our engineers spend distributing data.
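
To make the pub/sub flow concrete, here is a minimal sketch using the console tools that ship with Kafka. The topic name and the host addresses are placeholders for illustration, not our actual setup:

#Create a shared topic for publishers and subscribers
> bin/kafka-topics.sh --create --zookeeper zk1:2181 --replication-factor 3 --partitions 8 --topic service-events

#A publisher writes messages into the topic (one message per line typed here)
> bin/kafka-console-producer.sh --broker-list kafka1:9092 --topic service-events

#Any number of subscribers can independently read the same topic
> bin/kafka-console-consumer.sh --zookeeper zk1:2181 --topic service-events --from-beginning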

Note: There are alternative messaging systems out there. RabbitMQ is similar in capabilities and may suit your needs better, depending on your requirements.

Challenge #2: The Need for Speed

Time is money. The quicker we can collect, detect, process, predict and take action on our data, the faster we can deliver for our customers. We needed a data distribution model that enabled both near-real-time (streaming) analytics and the more traditional batch analytics. Enter the Lambda Architecture framework. The Lambda framework calls for all new data to be distributed simultaneously to both the stream and batch processing pipelines, as such:


Lambda’s view of data distribution

Using the Lambda framework and Kafka as the messaging queue, we gained the ability to seamlessly add streaming analytics when needed (we use Apache Spark's Streaming module). In addition, it helps that Kafka itself is fast. Kafka's underlying data structure is essentially a log file. As data comes in, Kafka appends it to the end of the log. Subscribers pulling data in real time are also reading from the end of that file, which lets Linux's page cache serve those reads from memory while the disk handles the sequential writes. Lambda, Spark and Kafka's low-latency messaging allow us to make near-real-time decisions for our customers.
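
As a rough illustration of the dual-pipeline idea (again with placeholder names), the same Kafka topic can feed both sides of Lambda: a streaming consumer that tails only new messages, and a batch consumer that replays the whole log on each run:

#Streaming pipeline: by default the console consumer starts at the end of the log
#and only sees new messages as they arrive
> bin/kafka-console-consumer.sh --zookeeper zk1:2181 --topic service-events

#Batch pipeline: replay the topic from the beginning on each batch run
> bin/kafka-console-consumer.sh --zookeeper zk1:2181 --topic service-events --from-beginning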

Note: An alternative framework to Lambda that is also worth looking into, as described in this article, suggests that users can feed data into a stream processing pipeline which in turn feeds the data into the batch pipeline.

Challenge #3: Scalability

The amount of data we collect and process is large and growing. We needed a data distribution model that scales horizontally, uses cloud infrastructure, supports large workloads and auto-scales without impacting the data distribution flow. Kafka was built to scale and has proven to work well under heavy load, as described in this LinkedIn post. Kafka's ability to scale comes from its clustering technology. It relies on data partitioning to distribute load across the members of a cluster. If you need more data throughput in a particular location, you simply add more nodes to the Kafka cluster. Many cloud providers, like ours, and infrastructure automation frameworks provide the mechanisms to automate the creation and scaling of Kafka clusters.
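
For example, once new brokers have been added to a cluster, a topic's partition count can be raised so the load spreads across them (topic and host names below are placeholders; note that Kafka only lets you increase a topic's partition count, never decrease it):

#Spread a topic across more brokers by increasing its partition count
> bin/kafka-topics.sh --alter --zookeeper zk1:2181 --topic service-events --partitions 16

#Verify which brokers now lead which partitions
> bin/kafka-topics.sh --describe --zookeeper zk1:2181 --topic service-events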

Challenge #4: Distributed, Fault Tolerant and Highly Available

Losing data is bad. Having data stop flowing is bad. Having data only available in one datacenter is bad. We needed a solution that wasn't bad, and Kafka had us covered. Kafka supports data replication inside a cluster, which means that even if a server in a Kafka cluster crashes, the data in the Kafka message bus is not lost. Great. In addition to replication, Kafka uses sequential IDs to enable the graceful restart of a data flow. If a client stops for any reason, another client can use the sequential IDs to pick up where the first one left off. To process data in aggregate, or to keep that aggregate data highly available, we need to replicate data across multiple data centers. Kafka provides an easy mechanism for one Kafka cluster to pull data from Kafka clusters in other data centers. We use this to pull data from all of our data centers into a single data center where we perform analytics in aggregate. Kafka is reliable, distributed and fault tolerant. It just works.
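
The cross-data-center pull is handled by Kafka's MirrorMaker tool, which is simply a consumer of the remote cluster chained to a producer into the local one. A rough sketch of how it is launched, where the config file names and topic pattern are placeholders:

#Run in the aggregate data center: consume from the remote cluster (consumer config),
#re-publish into the local cluster (producer config)
> bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config remote-dc-consumer.properties --producer.config local-dc-producer.properties --whitelist 'service-events'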

Note: These sequential IDs are referred to as offsets. The delta between the publisher's latest offset and a subscriber's current offset (the consumer lag) is often used as a KPI for data flow through Kafka.
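
For reference, Kafka bundles a tool that reports each consumer group's position and lag per partition. A sketch is below; the group name and ZooKeeper address are placeholders, and the exact flag names vary a bit between Kafka releases, so check the tool's help output for your version:

#Report, per partition, the log end offset, the group's committed offset, and the lag between them
> bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper zk1:2181 --group analytics-pipeline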


What’s wrong with Kafka?

So what's wrong with Kafka? Not much. It solves our challenges and the price is right: free. The one drawback I'd like to call out is the complexity it puts on its clients. For example, each client ends up connecting to every Kafka server in a cluster rather than making a single connection to the cluster, because the data in a Kafka cluster is split across its members. Thankfully, there are SDKs, APIs and tools available that help engineers overcome these challenges if a little time is spent researching and testing them. In the end, we gladly accepted the burden on the client in exchange for Kafka's reliability, performance and scalability.
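
In practice the client libraries hide most of this: you hand them a list of brokers to bootstrap from, they fetch the cluster metadata, and then they talk to whichever brokers own the topic's partitions. A hypothetical example with the console producer (host names are placeholders):

#The broker list is only a starting point for metadata discovery; the client then
#connects to the brokers that lead the topic's partitions
> bin/kafka-console-producer.sh --broker-list kafka1:9092,kafka2:9092,kafka3:9092 --topic service-events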

Summary


Kafka is the heart of our data stack. It pumps our data around so we can quickly and accurately act on behalf of our customers. Before Kafka, our data was moving around all willy-nilly. Post Kafka, we have a simplified, scalable and structured data flow that enables our business to move faster than was previously possible. Today, we are publishing operational metrics, logs, usage and performance information into Kafka, and we are planning to import additional types of data. We have a variety of software systems consuming the data from Kafka: Apache Spark for real-time and batch analytics, Cassandra for data storage, a SQL-based data warehouse for reporting, custom-written applications, and ELK (Elasticsearch, Logstash and Kibana) for operational visualization and search. Kafka is pumping here at CenturyLink Cloud.

Thursday, August 20, 2015

Minecraft on Kubernetes

This is a post about launching a minecraft server, inside docker, on a kubernetes cluster.

Why do this?

  • I have a 5 year old son 
  • we like to play games and build stuff with blocks
  • I'm interested in kubernetes
  • I saw this post on minecraft in docker 
  • I wanted to get to know minecraft and kubernetes better

Note: After I got this working, but before I posted this, I came across Julia Ferraioli's series of blog posts about minecraft, docker and kubernetes. These posts are good and you may want to check them out too.


Docker


One of the advantages of Docker is sharing. In the public docker hub, there were a few different minecraft builds available. I chose to reuse Geoff Bourne's (@itzg) dockerfile listed here. It's a great build that lets you specify many server configuration options via environment variables, so you don't have to worry about putting together a minecraft configuration file. With my container already created for me, getting minecraft up in kubernetes was easy.

Game on Kubernetes


First, you will need a kubernetes cluster. It doesn't matter where your cluster is; the beauty of kubernetes is that this will work regardless. Since I work at CenturyLink Cloud, I used the scripts I created here to create my kubernetes cluster on CenturyLink Cloud.


Kubernetes has the concept of pods. A pod in k8s is a logical grouping of one or more containers, zero or more storage volumes, and an IP address. We will launch the minecraft docker container above in a pod in kubernetes. And since minecraft isn't a clustered server, we only need to run one pod per minecraft server.


Here is the yaml file I used to create this pod.
(Also located here: https://github.com/ckleban/minecraft-k8)


apiVersion: v1
kind: Pod
metadata:
  name: minecraft-server
  labels:
    app: minecraft-server
spec:
  containers:
  - name: minecraft-server
    image: itzg/minecraft-server
    env:
    - name: OPS
      value: madsurfer79
    - name: EULA
      value: "TRUE"
    ports:
    - containerPort: 25565
   

You will want to at least change the username in the OPS env variable to your own minecraft username. This lets the server know who has admin privileges on the server. You can add more settings, like game type, minecraft version, etc., as you wish. To see a more detailed list of the available options, go check out here. Now, go create the minecraft server:

-------------------------------------


#Create pod
> kubectl create -f pod.yml

#See it up and running:
> kubectl get pods
NAME                      READY     STATUS    RESTARTS   AGE
minecraft-dfhse           1/1       Running   0          2h

#If you want this server to be reachable on the internet, you probably need to do some more things. If you are on a public cloud, like Centurylink, amazon or google, this is easy to do. Simply run:

kubectl expose pod minecraft-server --port=25565 --type=LoadBalancer

You will be given a public IP address, which you can see by running:

> kubectl get services


-------------------------------------


That is it. You and your friends can now play minecraft, inside docker, on kubernetes. 

A few things to note:
  • you may need to wait a few minutes for the external IP address to be published. 
  • you may also need to open up a firewall rule to this IP address and port. 
  • this isn't a durable configuration. If the pod dies, it won't come back online, and the saved games and world will be gone forever. These problems are easily solved in kubernetes; we just need to add a replication controller and persistent storage (a rough sketch of what that might look like is below). Perhaps I'll work on that next and post it sometime. 
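
For the curious, here is a rough, untested sketch of what that could look like: a replication controller keeps one copy of the pod running, and a persistent volume claim keeps the world data around between restarts. The claim name (minecraft-data-claim) is hypothetical and would need to be created in your cluster first, and I believe the itzg image keeps its world data under /data.

apiVersion: v1
kind: ReplicationController
metadata:
  name: minecraft-server
spec:
  replicas: 1
  selector:
    app: minecraft-server
  template:
    metadata:
      labels:
        app: minecraft-server
    spec:
      containers:
      - name: minecraft-server
        image: itzg/minecraft-server
        env:
        - name: OPS
          value: madsurfer79
        - name: EULA
          value: "TRUE"
        ports:
        - containerPort: 25565
        volumeMounts:
        #Mount the persistent volume where the image stores its world data
        - name: minecraft-data
          mountPath: /data
      volumes:
      #A pre-created persistent volume claim (hypothetical name) so the world survives pod restarts
      - name: minecraft-data
        persistentVolumeClaim:
          claimName: minecraft-data-claim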

Thanks
Chris