nanog mailing list archives

Re: Amazon diagnosis

From: Paul Graydon <paul () paulgraydon co uk>
Date: Sun, 01 May 2011 10:03:49 -1000

On 5/1/2011 9:29 AM, Jeff Wheeler wrote:

On Sun, May 1, 2011 at 2:18 PM, Andrew Kirch <trelane () trelane net> wrote:

Sure they can, but as a thought exercise fully 2n redundancy is
difficult on a small scale for anything web facing.  I've seen a very
simple implementation for a website requiring 5 9's that consumed over
$50k in equipment, and this wasn't even geographically diverse.  I have

What it really boils down to is this: if application developers are
doing their jobs, a given service can be easy and inexpensive to
distribute to unrelated systems/networks without a huge infrastructure
expense.  If the developers are not, you end up spending a lot of
money on infrastructure to make up for code, databases, and APIs which
were not designed with this in mind.

These same developers who do not design and implement services with
diversity and redundancy in mind will fare little better with AWS than
any other platform.  Look at Reddit, for example.  This is an
application/service which is utterly trivial to implement in a cheap,
distributed manner, yet they have failed to do so for years, and
suffer repeated, long-duration outages as a result.  They probably buy
a lot more AWS services than would otherwise be needed, and truly have
a more complex infrastructure than such a simple service should.

IT managers would do well to understand that a few smart programmers,
who understand how all their tools (web servers, databases,
filesystems, load-balancers, etc.) actually work, can often do more to
keep infrastructure cost under control, and improve the reliability of
services, than any other investment in IT resources.

If you want a perfect example of this, consider Netflix. Their
infrastructure runs on AWS and we didn't see any downtime with them
throughout the entire affair.

One of the interesting things they've done to try to enforce
reliability of services is an in-house service called Chaos Monkey,
whose sole purpose is to randomly kill instances and services inside
the infrastructure. Courtesy of Chaos Monkey and the defensive
programming it enforces, no component depends on another being up, so
you will always get at least some form of service. For example, if the
recommendation engine dies, the application is smart enough to catch
that and instead return a list of the most popular movies, and so on.
There is an interesting blog post from their Director of Engineering
about what they learned on their migration to AWS, including using
less chatty APIs to reduce the impact of typical AWS latency:

http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
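
To make the fallback idea concrete, here is a rough sketch of the
pattern as I understand it. It is not Netflix's code: the names
RecommendationEngine, ServiceDown, titles_for, chaos_monkey and the
MOST_POPULAR list are all made up for illustration, and the "kill" is
just a flag flipped in memory rather than a real instance termination.

```python
import random


class ServiceDown(Exception):
    """Raised when a backing service is unavailable (hypothetical)."""


class RecommendationEngine:
    """Stand-in for a remote recommendation service (hypothetical)."""

    def __init__(self):
        self.alive = True

    def recommend(self, user_id):
        if not self.alive:
            raise ServiceDown("recommendation engine is down")
        return [f"pick-for-{user_id}-{n}" for n in range(1, 4)]


# Static fallback data, assumed to be cheap and always available.
MOST_POPULAR = ["popular-1", "popular-2", "popular-3"]


def titles_for(user_id, engine):
    """Return personal picks, degrading to the popular list on failure."""
    try:
        return engine.recommend(user_id)
    except ServiceDown:
        return list(MOST_POPULAR)


def chaos_monkey(services, rng=random):
    """Pick one service at random and mark it dead; return the victim."""
    victim = rng.choice(services)
    victim.alive = False
    return victim
```

Calling chaos_monkey([engine]) and then titles_for("u1", engine)
returns the popular list instead of raising, which is the "always get
at least some form of service" behaviour. The point of running the
killer continuously is that any caller missing the except branch
shows up quickly instead of during a real outage.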

Paul
