Thursday, July 9, 2015

Containers and Clusters - Disrupting how cloud services are delivered

The Problem


Cloud computing and the higher-level cloud services that public cloud companies offer have changed our industry. They allow developers to focus on solving customer needs instead of worrying about servers, databases, asynchronous messaging, analytics, storage, media encoding, graphic rendering, content delivery and so on. These cloud services enable engineers to create applications more quickly and reduce their costs. However, one of the drawbacks of using the higher-level cloud services is cloud lock-in. The SDKs and APIs developers need to use to interact with these services are, for the most part, not standardized. If you use these services and want to run the same application on another cloud or on premises, you need to rewrite some code. I don't think this lock-in should stop people from using these services. However, we need a model that allows these companies to provide innovative high-level cloud services while also allowing us the freedom and flexibility of true portability. 

Disruption is needed 


One approach could be to get all the cloud companies to standardize their services. Ha. Anyone see pigs flying? Another approach, and the one I believe in, would be for mainstream adoption of containers and clusters to provide a path for cloud services to run anywhere. While I don't think cloud providers will jump at this notion, I do think there is an ongoing technology change that might force their hand. If they don't jump on board, startups will create similar services and take away their market share. Here's how.

Containers


Containers will be the foundation of this revolution. Docker and the rest of the container technologies offer a way for an engineer to package their software, deliver it to a server and run the package anywhere. If you want to learn more about containers and Docker, you should google it or read this webpage. Docker Docker Docker. It's all you hear about these days. In time, containers will be mainstream.

Clusters


In order to run a production service or application, you will need to build, run and manage many different containers. For redundancy, performance and scalability you will need to spread these containers across multiple servers and data centers. This is starting to sound complex. To make this easier, people have created clusters. 

Kubernetes, Apache Mesos and Docker Swarm are all clustering and scheduling software frameworks that allow organizations to create logical collections of compute power called clusters. These clusters are made up of servers or VMs and enable engineers to deploy their containers across the infrastructure. In addition, the cluster software provides container replication, auto-scaling, load balancing, monitoring, logging, scheduling, resource management and so on. The end result is that we can create clusters on the hardware of our choice, in the cloud or on premises. We package up our code and deploy it wherever we want. True utility compute. 
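The core of what a cluster scheduler does can be sketched in a few lines: given a set of containers and a set of servers, pick a placement that spreads replicas out so no one machine holds them all. Real schedulers like Kubernetes, Mesos and Swarm also weigh resources, constraints and health, but here is a minimal round-robin sketch (the container and server names are made up for illustration):

```python
import itertools

def schedule(containers, servers):
    """Spread containers across servers round-robin, so that replicas
    don't pile up on one machine (or one failure domain)."""
    placement = {}
    rr = itertools.cycle(servers)       # walk the server list forever
    for container in containers:
        placement[container] = next(rr)  # each replica gets the next server
    return placement

# Three replicas of a web container land on three different servers.
print(schedule(["web-1", "web-2", "web-3"], ["node-a", "node-b", "node-c"]))
```

A production scheduler would rebalance when servers die and respect CPU/memory limits, but the principle is the same: the cluster, not the engineer, decides where each container runs.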

Operating systems for Clusters


To make things easier to manage, clustering software allows engineers to group containers and various cluster attributes, like load balancing and security, into logically defined services. This lets engineers manage services on their clusters instead of a collection of containers. To make things even easier, enter the concept of cluster operating systems. Two exist today: Mesosphere's data center operating system (DCOS), which offers a web portal and CLI, and the Kubernetes CLI tool, which some might consider a cluster operating system as well. These cluster operating systems make it easier for engineers to deploy and manage services on one or more clusters. They let us see the status of the cluster, the services, the containers and so on. All the things you need to run and maintain your service in production. Cluster operating systems will make deploying your services across multiple clusters a breeze. 

The app store for clusters 


It's pretty obvious how this will enable us to do great things. But how will this change how cloud services are delivered? The answer will be a cluster service repository. Like an app store for clusters. Want to run an HA database with your application? You simply download the DB service you wish to integrate with your application. Want a web server? Go choose a web stack. Need caching? Need messaging? Simply pick your service. Then you write code to use it, package it up and deploy it to whatever cluster you have running. 

Disruption in how services are delivered


Folks in the community will create and manage services in the service repo based on open source software packages. I see traditional software companies packaging their software into cluster services and perhaps charging for licensing. I see startups jumping at the chance to be first to offer new services that run on these clusters. And the kicker: I see cloud providers packaging their existing services so that you can run a copy of their service in your cluster, perhaps for a fee. Imagine running services like AWS Kinesis, Google machine learning, Oracle DB, Azure's data warehouse or whatever, anywhere you want. Awesome.  

The end state unicorn


Containers, clusters, cluster operating systems, cluster services and a service repository will change the way we use data centers. They will change the way cloud services and open source software are packaged and delivered. This will have all the benefits of Platform as a Service, with the control of doing things yourself. The dream of utility compute will be realized. Cloud provider lock-in will be a thing of the past. Clustering and the service app store will rock the cloud industry. 

What now


There is still a lot of work to do. Some of the things I described are here today and some are just visions that are being worked on. These technologies have huge momentum and I'm personally very interested in all this.  I believe in this disruption and I'll be investing in it one way or another. If you are in the cloud business you should consider this idea and decide if it's worth embracing. If you are an engineer or developer, keep an eye on these developments. 


Excited 
--Chris


Friday, May 15, 2015

Global Internet Access with Lightbulbs and Mesh Networking

What if the world were connected together by wireless-enabled lightbulbs and mesh networking software?

A problem worth solving

Many amazing people, groups and companies are working on how to better provide Internet and network access to the masses. Some ideas currently in development by the likes of Google, Facebook and SpaceX (to name a few) are low orbit satellites, solar-powered planes, hot air balloons and fiber to the home. All of these have merit and I applaud them for what they are trying to do and ultimately will do. But I think there is another idea to explore.

The internet isn't everywhere

One way to extend the Internet's reach is through distributed mesh networks. According to Wikipedia, mesh networking is defined as "... a network topology in which each node relays data for the network. All mesh nodes cooperate in the distribution of data in the network. Mesh networks can relay messages using either a flooding technique or a routing technique." 

Let me explain how mesh networks can help. Take my home Internet connection, provided by my cable company. Anything within wifi range of my router (100 feet?) has Internet access. However, if I leave my house, I no longer have wifi access. If there were hundreds or thousands of devices in the city that formed a mesh network, I would be able to use the mesh network to reach my home's Internet connection no matter where I went in the city. Or, if other people around the city offered to connect their Internet connections to the mesh network, I would be able to use the mesh network to reach the closest or best Internet connection based on where I was. 
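The flooding technique from the Wikipedia definition is simple enough to sketch: every node relays to its neighbors, and a "seen" set stops the message from looping forever. The toy city mesh below (all node names are invented for illustration) shows how my house can reach an Internet gateway it has no direct wifi link to:

```python
def flood(network, start):
    """Flood a message from `start` through a mesh. `network` maps each
    node to the neighbors within its radio range. Returns every node
    the message reached."""
    seen = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue          # already relayed this message; don't loop
        seen.add(node)
        frontier.extend(network[node])  # relay to all neighbors
    return seen

# A toy city mesh: my house isn't in wifi range of the Internet gateway,
# but the message hops neighbor -> cafe -> gateway.
mesh = {
    "my-house":   ["neighbor-1"],
    "neighbor-1": ["my-house", "cafe"],
    "cafe":       ["neighbor-1", "gateway"],
    "gateway":    ["cafe"],
}
print("gateway" in flood(mesh, "my-house"))  # True: the mesh carried it hop by hop
```

Flooding is wasteful (every node relays everything), which is why real mesh protocols add routing on top, but it shows why more nodes means more reach.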

Lightbulbs!

Let's look at some simple facts. Lightbulb sockets are everywhere. Some lightbulbs today have wifi. Some have computers in them, creating 'smart' lightbulbs. Some are energy efficient. I've recently read about a lightbulb product that has a built-in speaker and Bluetooth. This allows people to use lightbulbs in their house as a house-wide speaker system. 

What if we built lightbulbs for the purpose of acting as nodes in mesh networks? What if we put software in these lightbulbs that joins and creates mesh networks automatically, so that all someone needs to do is screw one into a socket? What if people, laptops, phones and IoT devices could freely connect to this lightbulb-enabled mesh network to communicate with each other and the Internet? What if we put these lightbulbs all over the world? What if people, organizations and Internet service providers connected their Internet connections to these mesh networks so that the mesh networks have gateways to the Internet? 

We would have a series of mesh networks throughout the world that would together bring the Internet to the masses and to the billions of IoT devices that will be coming in the near future. 

Beyond Lightbulbs

Lightbulbs are just one way to create mesh networks. What if a bunch of other things did the same: cars, drones, consumer devices (phones, watches, laptops), home routers, artificial birds, etc.? Some people are already working on these things, which is amazing to see. We just need more of them, and for all of the efforts to integrate, versus creating a series of isolated mesh networks.

Privacy and Security

Besides basic Internet connectivity, others have privacy and security concerns. Software and protocols exist today on mesh networks that provide encryption, protection and anonymous network access to users. These features can easily be enabled by the mesh provider or by the end user through overlay networks.

Path forward

Communities are popping up that are organizing the creation and expansion of these mesh networks. (My local organization is https://seattlemesh.net/.) It would be great to speed up this process with major investment in the hardware rollout and node creation (lightbulbs, cars, etc.). In an ideal world, governments, companies, communities, individuals and organizations would all work together to roll out mesh networks that interact freely with one another. 






Wednesday, March 18, 2015

8 Tips - Build a Highly Available Service

Working at AWS, Citrix, Register.com, Above.net and CenturyLink has taught me a lot about availability and scale. These are the lessons I've learned over time. From infrastructure as a service to web applications, these themes will apply.


Build for failures

Failures happen all the time. Hard drives fail, systems crash, performance decreases due to congestion, power outages happen, etc. Build a service that can handle failures at any level in the stack. No one server, app, database or employee should be able to cause your service to go offline. Test your failure modes and see how your system recovers. Better yet, use a chaos monkey to continuously test for failures.
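One concrete way to build for failure is to assume every dependency call can fail transiently, and retry with exponential backoff and jitter rather than falling over. A minimal sketch (the `flaky` dependency is hypothetical, standing in for any network call):

```python
import random
import time

def call_with_retries(fn, attempts=3, base_delay=0.05):
    """Retry a flaky dependency with exponential backoff and jitter,
    so one transient failure doesn't become an outage."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            # exponential backoff, randomized so many clients don't
            # all retry at the same instant (a retry storm)
            time.sleep(base_delay * (2 ** attempt) * random.random())

# A hypothetical dependency that fails twice, then recovers.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_retries(flaky))  # "ok": the transient failures were absorbed
```

Retries handle the transient case; the chaos-monkey point above is about verifying the non-transient cases too.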


Build across multiple failure domains

A failure domain consists of a set of infrastructure that can all go down due to a single event. Data centers and public cloud availability zones are examples of failure domains, as they can go down due to one event (fire, power, network, etc.). Build your service so that it actively serves customers in multiple failure domains. Test it. A simple example is to use global load balancing to route customers to multiple failure domains.
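The global load balancing example boils down to a routing decision: prefer the customer's closest healthy domain, and fail over to any other healthy domain when it's down. A sketch with made-up domain names:

```python
def route(customer_region, domains):
    """Pick a healthy failure domain for a customer, preferring one in
    the customer's region. `domains` maps name -> {"region", "healthy"}."""
    healthy = [d for d, info in domains.items() if info["healthy"]]
    if not healthy:
        raise RuntimeError("total outage: no healthy failure domains")
    for d in healthy:
        if domains[d]["region"] == customer_region:
            return d          # closest healthy domain
    return healthy[0]         # fail over to any healthy domain

domains = {
    "us-east": {"region": "us-east", "healthy": True},
    "eu-west": {"region": "eu-west", "healthy": True},
}
print(route("us-east", domains))   # us-east: healthy and closest
domains["us-east"]["healthy"] = False
print(route("us-east", domains))   # eu-west: automatic failover
```

Real global load balancers do this with DNS or anycast plus health checks, but the decision they make per customer is essentially the one above.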


Don't have real time dependencies across failure domains

Don't build a distributed system that relies on synchronous communication across failure domains to serve your customers. Instead, build systems that can service your customers completely within a failure domain, and make any communication between failure domains asynchronous. Inter-failure-domain dependencies increase the blast radius of any single outage and increase the overall likelihood of service-impacting issues. Also, there are often network instabilities between failure domains that can cause variable performance and periods of slowness for your systems and your customers. One example is data replication. Don't require storage writes to be replicated across failure domains before the client considers the data 'stored'. Rather, store it inside the failure domain and consider it committed. Handle any cross-failure-domain replication requirements asynchronously, i.e., after the fact.
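The replication example can be sketched with a queue: the write commits locally inside the failure domain and returns immediately, while a background worker ships the copy to the other domain. The two dicts below are stand-ins for the local and remote stores:

```python
import queue
import threading

replication_queue = queue.Queue()
local_store, remote_store = {}, {}   # stand-ins for the two domains' storage

def write(key, value):
    """Commit locally, then replicate asynchronously. The client never
    waits on the remote failure domain."""
    local_store[key] = value             # committed: durable in this domain
    replication_queue.put((key, value))  # cross-domain copy happens later

def replicator():
    # Background worker: ships writes to the other failure domain.
    while True:
        key, value = replication_queue.get()
        remote_store[key] = value
        replication_queue.task_done()

threading.Thread(target=replicator, daemon=True).start()

write("user:42", "hello")
replication_queue.join()   # only this demo waits; real clients don't
print(local_store, remote_store)
```

The trade-off is explicit: a domain lost before the queue drains loses the unshipped writes, which is exactly the eventual-consistency bargain asynchronous replication makes.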


Reduce your blast radius

If a single change or failure can impact 100% of your customers, your blast radius is too large. Break your system up in some way so that any single issue only impacts a portion of your customers. User partitioning, multiple failure domains (global load balancing), rolling deployments, separate control planes, SOA and A/B testing are a few ways to accomplish this. One example is partitioning an email sending service: assign groups of customers to different groups of email sending servers. If any group of servers has an issue, only a portion of your customers are impacted instead of all of them.
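The email example needs one mechanism: a deterministic, stable mapping from customer to server group. Hashing the customer ID works; a given customer always lands in the same group, and a bad group takes out roughly 1/N of customers rather than all of them:

```python
import hashlib

def partition(customer_id, num_groups):
    """Deterministically map a customer to one of `num_groups` server
    groups. An outage in one group impacts ~1/num_groups of customers."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_groups

# Every send for a given customer lands on the same server group.
print(partition("customer-1234", 4))
```

The same mapping also gives you a deployment order for free: roll a change out one group at a time and you've bounded its blast radius too.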


Reduce the level of impact

Having your service go completely down for a portion of your customers is much worse than having only part of your service unavailable for a portion of your users. Break apart your system into smaller units. An example is user authentication. Consider having a scalable, read-only, easily replicated system for user logins, but another system for account changes. If you need to bring down the account change system for whatever reason, your users will still be able to log in to the service.
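The authentication example can be sketched as two independent paths through one service: logins read from an easily replicated copy, while account changes go through a separate write path that can be disabled without blocking logins. (The class, the flag and the plaintext password store are all simplifications for illustration; real systems would hash passwords and split these into separate deployments.)

```python
class AuthService:
    """Logins hit a replicated read-only store; account changes go
    through a separate write path that can be taken down independently."""

    def __init__(self):
        self.users = {"alice": "secret"}  # replicated read-only copy
        self.writes_enabled = True        # the account-change subsystem

    def login(self, user, password):
        # Read path only: works even during a write-path outage.
        return self.users.get(user) == password

    def change_password(self, user, new_password):
        if not self.writes_enabled:
            raise RuntimeError("account changes temporarily unavailable")
        self.users[user] = new_password

svc = AuthService()
svc.writes_enabled = False            # take the change system down
print(svc.login("alice", "secret"))   # True: users can still log in
```

The impact of the outage drops from "nobody can use the service" to "nobody can change their password for a while", which most users will never notice.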



Humans make mistakes

Humans are the reason for most service impacts. Bad code deployments, network changes with unintended consequences, copy/paste errors, unknown dependencies, typos and skill-set deficiencies are just a few examples. As the owner of a service, it is critical that you apply the appropriate level of checks, balances, tools and frameworks for the people working on your system. Prevent the inevitable lessons learned of one individual from impacting the service. Peer reviews, change reviews, system health checks and tools that reduce the manual inputs required for making changes can all help reduce service impacts due to human error. The important thing is that the sum of the safeguards you put in place to prevent human error must not make the overhead of making changes so high that your velocity falls to unacceptable levels. Find the right balance.


Reduce complexity

I am not a fan of the band Kiss, but I do like to keep it simple stupid. A system that is too complex is hard to maintain. Dependency tracking becomes impossible. The human mind can only grasp so much. Don't require geniuses to make successful changes on your system.


Use the ownership model

That is, use the DevOps model. If you build it, you should also own it end to end (uptime, operations, performance, etc.). If a person feels the pain of a broken system, that person will do what's needed to stop the pain. This makes uptime and system serviceability a priority instead of an afterthought.


Good luck

--chris