Amazon customers often tell us that they want to know more about how we build and run our business. On the retail side, they tour Amazon Fulfillment Centers and see how we organize our warehouses. Corporate customers often ask about our Leadership Principles, and sometimes adopt (and then adapt) them for their own use. I regularly speak with customers in our Executive Briefing Center (EBC), and talk to them about working backwards, PRFAQs, narratives, bar-raising, accepting failure as part of long-term success, and our culture of innovation.

The same curiosity that surrounds our business surrounds our development culture. We are often asked how we design, build, measure, run, and scale the hardware and software systems that underlie Amazon.com, AWS, and our other businesses.

New Builders’ Library

Today I am happy to announce The Amazon Builders’ Library. We are launching with a collection of detailed articles that will tell you exactly how we build and run our systems, each one written by the senior technical leaders who have deep expertise in that part of our business.

This library is designed to give you direct access to the theory and the practices that underlie our work. Students, developers, dev managers, architects, and CTOs will all find this content to be helpful. This is the content that is “not sold in stores” and not taught in school!

The library is organized by category:

Architecture – The decisions that we make when designing a cloud service in order to optimize for security, durability, high availability, and performance.

Software Delivery & Operations – The process of releasing new software to the cloud and maintaining health & high availability thereafter.

Inside the Library

I took a quick look at two of the articles while writing this post, and learned a lot!

Avoiding insurmountable queue backlogs – Principal Engineer David Yanacek explores the ins and outs of message queues, covering the benefits and the risks, including many of the failure modes that can arise. He talks about how queues are used to power AWS Lambda and AWS IoT Core, and describes the sophisticated strategies that are used to maintain responsiveness and to implement (in his words) “magical resource isolation.” David shares multiple patterns that are used to create asynchronous multitenant systems that are resilient, including use of multiple queues, shuffle sharding, delay queues, back-pressure, and more.
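To make one of those patterns a bit more concrete, here is a minimal Python sketch of shuffle sharding across a set of queues. This is my own illustration and not code from David's article: the function names (`shuffle_shard`, `pick_queue`), the hash-based seeding, and the "least-loaded queue in the shard" rule are all assumptions chosen for brevity. The idea is that each tenant is deterministically assigned a small subset of the queues, so a tenant that floods its own subset is unlikely to fully overlap with any other tenant's subset.

```python
import hashlib
import random


def shuffle_shard(tenant_id: str, num_queues: int, shard_size: int) -> list[int]:
    """Deterministically pick `shard_size` distinct queue indexes for a tenant.

    Hypothetical helper for illustration only.
    """
    # Seed a private RNG from a stable hash of the tenant id, so the same
    # tenant maps to the same queues in every process and on every run.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(range(num_queues), shard_size))


def pick_queue(tenant_id: str, queue_depths: list[int], shard_size: int = 2) -> int:
    """Choose the least-loaded queue within the tenant's shard (an assumed policy)."""
    shard = shuffle_shard(tenant_id, len(queue_depths), shard_size)
    return min(shard, key=lambda q: queue_depths[q])


if __name__ == "__main__":
    depths = [0] * 8  # current backlog of each of 8 queues (made-up numbers)
    for tenant in ("tenant-a", "tenant-b", "tenant-c"):
        print(tenant, shuffle_shard(tenant, len(depths), 2), pick_queue(tenant, depths))
```

With 8 queues and a shard size of 2 there are 28 possible shards, so two tenants chosen at random share both of their queues only occasionally; larger fleets make full overlap rarer still.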

Challenges with distributed systems – Senior Principal Engineer Jacob Gabrielson discusses the many ways that distributed systems can fail. After defining three distinct types of systems (offline, soft real-time, and hard real-time), he uses an analogy with Bizarro to explain why hard real-time systems are (again, in his words) “frankly, a bit on the evil side.” Building on an example based on Pac-Man, he adds some request/reply communication and enumerates all of the ways that it can succeed or fail. He discusses fate sharing and how it can be used to reduce the number of test cases, and also talks about many of the other difficulties that come with testing distributed systems.

These are just two of the articles; be sure to check out the entire collection.

More to Come

We’ve got a lot more content in the pipeline, and we are also interested in your stories. Please feel free to leave feedback on this post, and we’ll be in touch.

— Jeff;