Wednesday, March 18, 2015

8 Tips - Build a Highly Available Service

Working at AWS, Citrix,, and CenturyLink has taught me a lot about availability and scale. These are the lessons I've learned over time. From infrastructure as a service to web applications, these themes will apply.

Build for failures

Failures happen all the time. Hard drives fail, systems crash, performance decreases due to congestion, power outages, etc. Build a service that can handled failures at any level in the stack. No one server, app, database or employee should be able to cause your service to go offline. Test your failure modes and see how your system recovers. Better yet, use a chaos monkey to continuously test for failures.

Build across multiple failure domains

A failure domain consists of a set of infrastructure that can all go down due a single event. Data centers and public cloud availability zones are examples of failure domains as they can go down due to one event (fire, power, network, etc). Build your service so that it actively serves customers in multiple failure domains. Test it. A simple example is to use global load balancing to route customers to multiple failure domains.

Don't have real time dependencies across failure domains

Don't build a distributed system that relies on synchronous communication across failure domains to serve your customers. Instead, build systems that can independently service your customers completely within a failure domain and make any communication between failure domains asynchronous. Having inter-failure domain dependencies will increase the blast radius for any single outage and increases the overall likelihood of service impacting issues. Also, there are often network instabilities between failure domains that can cause variable performance and periods of slowness to your systems and your customers. One example of this is data replication. Don't require storage writes to be replicated accross failure domains before the client considers the data 'stored'. Rather, store it inside the failure domain and consider it committed. Handle any cross failure domain replication requirements asynchronously, IE, after the fact.

Reduce your blast radius

If a single change or failure can impact 100% of your customers, your blast radius is too large. Break your system up in some way so that any single issue only impacts a portion of your customers. User partitioning, using multiple failure domains (global load balancing), rolling deployments, separate control planes, SOA and A/B testing are a few ways to accomplish this. One example is using partitioning for an email sending service. Assign groups of customers to different groups of email sending servers. If any group of servers has an issue, only a portion of your customers are impacted versus all of them.

Reduce the level of impact

Having your service go completely down for a portion of your customers is much worse than only having a part of your service unavailable for a portion of your users. Break apart your system into smaller units. An example is user authentication. Consider having a scalable, read only, easily replicated system for user logins but have another system for account changes. If you need to bring down the account change system for whatever reason, your users will still be able to login to the service.

Humans make mistakes

Humans are the reason for most service impacts. Bad code deployments, network changes with unintended consciousness, copy/paste errors, unknown dependencies, typos and skill-set deficiencies are just a few examples. As the owner of a service it is critical that you apply the appropriate level of checks, balances, tools and frameworks for the people working on your system. Prevent the inevitable lessons learned of one individual from impacting the service. Peer reviews, change reviews, system health checks and tools that reduce the manual inputs required for 'making changes' can all help reduce service impacts due to human error. The important thing here is that the sum of the things you put in place to prevent human error can not make the overhead of making change so high that your velocity falls to unacceptable levels. Find the right balance.

Reduce complexity

I am not a fan of the band Kiss, but I do like to keep it simple stupid. A system that is too complex is hard to maintain. Dependency tracking becomes impossible. The human mind can only grasp so much. Don't require geniuses to make successful changes on your system.

Use the ownership model

IE, use the devops model. If you build it you should also own it end to end (uptime, operations, performance, etc). If a person feels the pain of of a broken system, that person will do the needed to stop the pain. This has the result of making uptime and system serviceability a priority versus an after thought.

Good luck