Etsy DevOps Case Study: The Secret to 50 Plus Deploys a Day
With almost 4.4 million sellers and 81.9 million buyers, Etsy enables eCommerce on an enormous scale for arts and crafts communities around the world. Etsy adopted DevOps practices early, in 2009, and has been living and breathing the methodology ever since. Naturally, Etsy is often held up as a poster child of DevOps for showing the world what performance outcomes are possible once traditional software development approaches are left behind.
Etsy went public in 2015 with a $100 million IPO and generated a net income of $349.25 million in 2020. Companies today operate in an ultra-competitive landscape and focus hard on software quality and speed to market. Though there is no one-size-fits-all approach, any organization can take away a few critical lessons from the Etsy DevOps case study for its own DevOps implementation strategy.
In this article, you’ll learn why Etsy ventured into a DevOps journey, how it achieved high deployment rates, and how it built a culture that many companies admire today.
When and why did Etsy become interested in DevOps?
Etsy started with a single web server and database in 2005.
Back then, Etsy’s engineering team was siloed into developers, database administrators, and operations teams. The teams were small, with no more than 35 employees across support and development, and collaboration between them was inconsistent. As at many other startups, these barriers limited what Etsy could get out of its software development efforts.
In 2008, a team of Etsy engineers began to recognize the drawbacks of a monolithic architecture and a waterfall development model:
- Frequent file changes
- Inconsistencies in deployment
- Lack of confidence in deployment
- Increased deployment time
- Less flexibility to experiment, iterate or react
- Significant pressure on developers
Given the above issues, Etsy needed a crucial change in its software development approach to stay ahead of the technology curve. Let’s have a look at what they did.
The great Etsy cultural shift
Etsy focused on building a culture where teams could collaborate and synchronize in real-time for all their tasks. However, such cultural shifts can be challenging because every business has unique market requirements, resource constraints, and willingness to change.
But, as John Willis, a pioneer of the DevOps movement, once said, “If you do not have a DevOps culture, a culture to support your DevOps adoption, all automation efforts will be fruitless.” The interesting part is that his statement from 2010 still rings true more than a decade later.
And Etsy, because of its early vision and its willingness to change for the better, came through the DevOps culture shift successfully.
Kellan Elliott-McCrea’s handing-off note as Etsy’s CTO
Instead of a top-down, order-taking culture, Etsy encouraged people to make decisions based on the situation at hand.
How did Etsy deploy more than 50 times a day?
The company’s engineers had built a piece of middleware called Sprouter (short for stored procedure router). The tool was intended to make work easier for developers and database administrators and to help scale the site’s performance. However, it turned out to be a single point of failure, and almost every deployment caused website downtime.
Etsy started a two-year journey to eliminate the use of Sprouter in 2009. Soon, it became more focused on creating an engineering culture centered around the philosophy “Code as Craft.” Gradually, the company started making slow yet significant changes in its engineering team, software delivery, automation, and development approaches.
First, it stabilized the site by monitoring it with a metrics-based system. Second, it scaled its database vertically as far as possible and gave developers access to production so they could help troubleshoot issues.
Etsy’s DevOps teams found a quick and friction-free way to deploy code. The solution involved the implementation of a continuous delivery pipeline.
Continuous integration and continuous delivery (CI/CD)
At Etsy, CI is the essential process of integrating new code with a “master” branch frequently throughout the day. CI systems automatically run a series of tests whenever the latest changes are merged to confirm that the integration succeeded.
Try
Etsy came up with Try, a library that allows developers to test their changes in Jenkins without having to commit to trunk. This tool is central to Etsy’s continuous integration process. Try is responsible for keeping the trunk clean and deployable while enabling developers to quickly and reliably test their changes. In 2011, after Etsy introduced Try to the team, the number of deploys rose to more than 20 a day and kept growing from there.
Deployinator
Etsy’s team created Deployinator, a one-button, web-based deployment app, to make code deployment as easy and painless as possible. With Deployinator, a single person could push an update of any size in under two minutes. Before Etsy adopted DevOps, a deployment required at least three developers and one operations engineer, with a production engineer on standby. Deployinator did a lot of heavy lifting for Etsy and is truly at the core of the company’s development and deployment model. In 2015, the company announced the re-release of Deployinator as an open-source Ruby gem.
Main Deployinator screen. Here is how Etsy deployed the “web stack.”
Princess
The deployment pipeline passes through “Princess,” Etsy’s staging environment, before going into production. Princess uses production data stores, networks, and resources. While it has the same code and data as Etsy.com, it is hosted on a separate server, so the development team can test new code without affecting the live site.
When the code is ready to go live, an engineer hits the “Prod” button. As soon as the code is live, everyone in Internet Relay Chat (IRC) knows who pushed what.
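The flow described above (Princess first, then production, with a chat announcement at each step) can be sketched roughly as follows. This is only an illustration in Python, not Deployinator's actual code, which is a Ruby application. The `deploy` function and the `push` and `announce` hooks are hypothetical names introduced for this example.

```python
from typing import Callable, List

# Staging first, then production. The stage names mirror the article;
# everything else in this sketch is hypothetical.
STAGES = ["princess", "prod"]


def deploy(revision: str, user: str,
           push: Callable[[str, str], bool],
           announce: Callable[[str], None]) -> List[str]:
    """Push a revision through each stage in order, stopping at the first failure.

    push(stage, revision) is assumed to return True on success.
    announce(message) stands in for posting to a chat channel such as IRC.
    Returns the list of stages that were deployed successfully.
    """
    completed: List[str] = []
    for stage in STAGES:
        if not push(stage, revision):
            announce(f"{user} FAILED to push {revision} to {stage}")
            break
        completed.append(stage)
        announce(f"{user} pushed {revision} to {stage}")
    return completed


if __name__ == "__main__":
    messages: List[str] = []
    done = deploy("abc123", "alice",
                  push=lambda stage, rev: True,
                  announce=messages.append)
    print(done)
    print(messages)
```

Stopping at the first failed stage means a change that breaks on Princess never reaches production, and every attempt leaves a message naming who pushed what.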
Automated testing
Continuous deployment allows Etsy to test various scenarios continuously. After investigating a few methods, including O’Brien-Fleming, Pocock, and sequential testing using the difference in successful observations, Etsy ultimately settled on the last of these. With this method, the team looks at the raw difference in successes between the old version and the new. It worked well for detecting small changes quickly.
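The article does not spell out the exact stopping rule Etsy used. One well-known formulation of a sequential test based on the difference in successes is Evan Miller's "simple sequential A/B test": pick a budget of N total successes in advance, stop and declare the treatment the winner once its lead over the control reaches 2√N, and stop with no winner once N total successes have been observed. Assuming that formulation (which may differ from Etsy's implementation), a minimal Python sketch looks like this. The function name and the "T"/"C" labels are illustrative.

```python
import math
from typing import Iterable


def sequential_test(outcomes: Iterable[str], n: int) -> str:
    """Run a one-sided sequential test on a stream of successes.

    outcomes: one label per successful conversion, in arrival order,
              "T" for the treatment (new version), "C" for the control (old version).
    n:        the total number of successes chosen in advance.
    Returns "treatment wins", "no winner", or "undecided" if the stream
    ends before either stopping condition is met.
    """
    threshold = 2 * math.sqrt(n)
    t = c = 0
    for arm in outcomes:
        if arm == "T":
            t += 1
        elif arm == "C":
            c += 1
        else:
            raise ValueError(f"unknown arm: {arm!r}")
        # Stop early as soon as the treatment's lead is large enough.
        if t - c >= threshold:
            return "treatment wins"
        # Stop without a winner once the success budget is used up.
        if t + c >= n:
            return "no winner"
    return "undecided"


if __name__ == "__main__":
    print(sequential_test(["T"] * 20, n=100))
    print(sequential_test(["T", "C"] * 50, n=100))
```

Because the decision depends only on the running difference in successes, the test can be checked after every observation without inflating the false-positive rate the way repeated fixed-sample tests would.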
Config flags (feature flags) are an integral part of Etsy’s deployment process as well. The company uses an internal A/B testing tool, the Feature API, to test new features. With config flags, a feature is made live for a certain percentage of users so the team can understand its behavior before rolling it out globally, which helps ensure quality.
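A percentage rollout like the one described here is commonly implemented by hashing each user into a stable bucket and comparing the bucket with the configured percentage. The Python sketch below illustrates that general technique only. It is not Etsy's Feature API, and `is_enabled`, the flag name, and the user IDs are hypothetical.

```python
import hashlib


def is_enabled(flag: str, user_id: str, percent: float) -> bool:
    """Return True if the flag is on for this user at the given rollout percentage.

    The user is hashed together with the flag name so that:
    - the same user always gets the same answer for the same flag, and
    - different flags split the user base differently.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # stable bucket in the range 0..9999
    return bucket < percent * 100         # e.g. 5% enables buckets 0..499


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10000)]
    enabled = sum(is_enabled("new_checkout", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users see the feature")
```

Raising the percentage only adds users to the enabled group, so anyone who already saw the feature at 5% keeps seeing it at 10%.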
Continuous monitoring
Etsy spends a lot of time gathering metrics for all its processes. The development team ran at least 14,000 tests per day. Tracking each deployment also allowed them to quickly detect any bugs they might have missed.
Monitoring is how Etsy’s team builds confidence in their CI/CD processes. The company used various monitoring tools like Nagios, StatsD, Graphite, and Ganglia to correlate issues that arise across its architecture. For instance, in 2009, Etsy started using Graphite to monitor application-level metrics such as new registrations, items sold, images uploaded, shopping carts, forum posts, and application errors.
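To give a sense of how lightweight application-level metrics can be, here is a minimal Python sketch of emitting a counter in the StatsD line format (`name:value|type`, sent as a UDP datagram, conventionally to port 8125). The metric name `registrations.new` and the host are made up for the example, and a real application would normally use an existing StatsD client library rather than hand-rolling this.

```python
import socket


def format_metric(name: str, value: int, metric_type: str = "c") -> bytes:
    """Build a StatsD line: "c" is a counter, "ms" a timer, "g" a gauge."""
    if metric_type not in {"c", "ms", "g"}:
        raise ValueError(f"unsupported metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; the application never waits for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric name; call send_metric(...) with the payload
    # when a StatsD daemon is listening.
    print(format_metric("registrations.new", 1))
```

Because the send is a single UDP datagram with no acknowledgement, instrumenting a code path adds almost no latency, which is part of why teams can afford to measure nearly everything.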
Over time, Etsy’s love of graphs led to more than a quarter million distinct metrics. To make sense of them, the team built Kale to detect anomalous patterns and support informed diagnosis. All logs are also available through Supergrep, which increases the logs’ signal-to-noise ratio.
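With that many metrics, nobody can watch every graph, so detection has to be automated. The Python sketch below shows one of the simplest possible detectors, a three-sigma rule that flags a new data point when it falls far outside the recent history. It is a deliberately simplified illustration of the idea and should not be read as how Kale works internally. The function name and sample values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, sigmas: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `sigmas` standard deviations
    from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        # A perfectly flat series: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) > sigmas * sd


if __name__ == "__main__":
    errors_per_minute = [10, 11, 9, 10, 10, 11, 9, 10]
    print(is_anomalous(errors_per_minute, 10.5))
    print(is_anomalous(errors_per_minute, 20))
```

A single rule like this produces false alarms on seasonal or trending metrics, which is why production systems typically combine several detectors before alerting anyone.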
Data recording with post-mortem
After a successful development and deployment cycle, even the finance and support teams can take part in the data recording process to ensure everyone is on the same page.
Who is involved in an Etsy post-mortem?
“Morgue” is a tool Etsy built specifically for post-mortems. All information related to a post-mortem is recorded in Morgue: dates, IRC logs, severity, graphs, and remediation actions. In other words, Morgue keeps a record of everything needed for the post-mortem. Etsy acknowledges that post-mortems have played a considerable part in instilling a collaborative culture, and the eCommerce giant was able to grow rapidly on the strength of this experimentation-friendly practice.
Communication
Etsy uses IRC (Internet Relay Chat), a flexible communication medium, for various collaborative tasks. For example, there is a channel for coordinating the programmers who want to deploy at any given time. “Push” is used to build a queue so that deploys go out in groups. The first person in each deploy group is responsible for deploying the code to Princess.
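The queueing convention described above can be modeled in a few lines. The Python sketch below is an assumption-laden illustration rather than Etsy's tooling: the `PushQueue` and `DeployGroup` names, the fixed group size, and the `driver` property (the first person in a group) are all invented for this example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class DeployGroup:
    members: List[str] = field(default_factory=list)

    @property
    def driver(self) -> str:
        """The first person to join the group deploys to staging for everyone."""
        return self.members[0]


class PushQueue:
    def __init__(self, group_size: int = 3) -> None:
        self.group_size = group_size  # hypothetical cap on engineers per deploy
        self.groups: Deque[DeployGroup] = deque()

    def join(self, engineer: str) -> DeployGroup:
        """Add an engineer to the last group, opening a new group if it is full."""
        if not self.groups or len(self.groups[-1].members) >= self.group_size:
            self.groups.append(DeployGroup())
        self.groups[-1].members.append(engineer)
        return self.groups[-1]

    def next_group(self) -> Optional[DeployGroup]:
        """Pop the group at the head of the queue, or None if nobody is waiting."""
        return self.groups.popleft() if self.groups else None


if __name__ == "__main__":
    queue = PushQueue(group_size=2)
    for name in ["alice", "bob", "carol"]:
        queue.join(name)
    group = queue.next_group()
    print(group.members, "driver:", group.driver)
```

Batching changes into small groups keeps each deploy easy to reason about while still letting many engineers ship on the same day.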
How has DevOps at Etsy evolved over the years?
So far, we’ve discussed the practices and tools that have played a key role in giving Etsy a historic status in DevOps implementation. As technology has evolved, Etsy has reevaluated its decisions. For instance, Etsy has used Google Cloud Platform (GCP) as its cloud provider since 2018, after years of hosting its services in self-managed data centers. After that, Etsy started using Jenkins to securely deploy Kubernetes-based authentication tokens and container orchestration.
The development team also built an entirely new tool called Switchboard to resolve several Etsy-specific problems like having no load balancer or API endpoints, etc. Soon, the blue-green deployment method was “lifted and shifted” to include a canary-like gradual traffic ramp-up via Switchboard. This transformation allowed the company to make its web architecture more scalable and efficient while designing for a better developer-centric experience.
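A canary-like ramp-up on top of blue-green deployment means shifting traffic to the new ("green") version in steps and backing out if it looks unhealthy. The Python sketch below illustrates that general pattern only. It is not Switchboard's API, and `choose_backend`, `ramp_up`, the step percentages, and the `healthy` callback are hypothetical.

```python
import random
from typing import Callable, Iterable


def choose_backend(green_percent: float, rng: random.Random) -> str:
    """Route one request: 'green' (new version) with the given probability, else 'blue'."""
    return "green" if rng.random() * 100 < green_percent else "blue"


def ramp_up(steps: Iterable[float], healthy: Callable[[float], bool]) -> float:
    """Walk through traffic percentages for the new version.

    healthy(percent) is assumed to report whether the new version looks fine
    while serving that share of traffic. On the first unhealthy step, all
    traffic goes back to the old version (0% green).
    Returns the final green percentage.
    """
    current = 0.0
    for percent in steps:
        if not healthy(percent):
            return 0.0  # roll back: blue serves everything again
        current = percent
    return current


if __name__ == "__main__":
    final = ramp_up([1, 5, 25, 50, 100], healthy=lambda p: True)
    print("green now serves", final, "percent of traffic")
    rng = random.Random(42)
    sample = [choose_backend(25, rng) for _ in range(1000)]
    print("green share in sample:", sample.count("green") / len(sample))
```

Starting at a small percentage limits how many users are exposed to a bad release, while keeping the old version running makes rollback a routing change instead of a redeploy.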
Lessons to learn from Etsy’s DevOps strategy
- Begin with an honest assessment and identify where your codebase and team stand regarding deployments and where you want to take it from there.
- Keep it simple at the start. Prepare small chunks of code. Yes, “go slow and iterate” is the way to move forward.
- Keep your deployments small and frequent. Enable people to focus on the quality of their code instead of the reliability and stability of the deployment platform.
- Automate everything you can to foster greater accuracy, reliability, consistency, and speed.
- Evaluate your business, development, and deployment culture. Visualizations will help you make informed business decisions.
- Encourage value-based learning and collaboration between teams to bring about transformational change within your organization.
Simform’s solution
There are no quick fixes when it comes to tapping into the DevOps trend. Every step you take to implement DevOps in your organization will pay dividends in code quality and overall developer productivity. Simform can be your partner in that transformational journey.
Our extended engineering team can manage the right DevOps tech stack and streamline the delivery and deployment pipeline. We can also automate processes, strengthen app security, and walk you through architecture upgrades. Furthermore, Simform does not treat new projects just as a sequence of actions. Instead, we focus on managing your entire DevOps product lifecycle with integrity, expertise, and open-mindedness towards future-proofing.