In today’s rapidly evolving technology landscape, building and maintaining reliable distributed systems is more critical than ever. However, merely implementing observability tools is not enough to ensure the reliability and efficiency of these systems. To truly harness the power of observability, you must ingrain it in your development culture.
In this article, we’ll explore the importance of fostering an observability culture and the benefits it brings to your organization. We’ll also share practical steps on how to build a culture of observability in your team and create a seamless collaboration between development and production environments.
The status quo
Observability tools aren’t the final frontier when it comes to building reliable distributed systems. Yes, they give you much better visibility into the system, its elements, and the underlying operations. But just because essential information is easily available doesn’t mean the system will automatically become more reliable. You still need people to act on the information and take ownership of their systems.
Despite what many want to think, most engineering teams haven’t been able to leverage observability to its full potential. The general sentiment going around in development teams is that they need to worry about the code only until it goes into production.
When something breaks in production, “but it runs on my machine” is still the most common response you hear from devs. On that machine, the code doesn’t have to work alongside the many other elements of a complex distributed system. In production, it does.
That’s where the problems start. Code changes, i.e., new releases into production, are the leading trigger for incidents. And understandably so.
Why do you need an observability culture?
You can’t truly reap the benefits of observability until you embed it in your development culture. That’s because observability doesn’t mean setting up some tools and having SREs react to the outputs.
Observability is all about developing a holistic understanding of the entire system and its goals to prevent and address issues. It’s about creating a deeper sense of ownership by having developers understand what goes on in production and making SREs aware of the developmental challenges.
Here are some of the reasons why you should encourage an observability culture across your teams.
1. Makes the system less scary
Modern distributed systems have more moving parts than ever, and they are only getting more complex. It’s difficult for individuals to understand the context and relationships between the various elements, and consequently to assess how the system works as a whole.
With an observability culture, engineers are not just concerned about their lines of code, but they start thinking about how their lines of code interact with the rest of the system. This helps them develop a deeper understanding of the system and its components. With time, they are able to better predict how individual changes will reflect throughout the system. They know the steps to take in case of outages and are overall more confident when it comes to dealing with the system.
2. Helps move away from the hero culture
There used to be days when a team would have one all-knowing, all-seeing hero developer or SRE who would often save the day with their infinite wisdom. But what happens when the hero leaves?
Relying on a single individual, or even a handful of them, for all the troubleshooting is never a good idea. It means your team is always at risk of losing the valuable knowledge needed to keep things in order. Moreover, it’s getting harder for anyone to be that ‘hero’ as systems grow more complex.
The observability culture encourages everyone to share the load. It democratizes knowledge and enables everyone to step up when needed.
3. Helps build a robust and responsive business
At the end of the day, you can consider the software to be successful only if the customers are happy. And customers don’t care much about coding practices, development strategies, system architecture, etc. They care only about whether the software fulfills their requirements satisfactorily.
The requirements and satisfaction criteria are encapsulated through SLOs, and the observability culture is all about adhering to these SLOs. It means the teams focus on what’s most impactful and work towards making sure it functions the way it’s supposed to.
It’s a no-brainer that if the entire team is collectively working towards keeping the end users happy, the business is going to benefit a lot from that. Prioritizing what’s important for customers ensures the business is able to pivot to an evolving market landscape with ease.
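To make the SLO idea concrete, here is a minimal sketch of how a team might track how much of its error budget is left. The function name, the 99.9% target, and the request counts are illustrative assumptions, not figures from any particular tool or team.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget that is still unspent.

    slo_target is the promised success ratio (e.g. 0.999 for "99.9% of
    requests succeed"). A negative result means the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        # No traffic yet, so nothing has been spent.
        return 1.0
    # Number of failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
```

A number like this gives developers and SREs one shared figure to look at when deciding whether to ship the next risky change or spend time on reliability work.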
4. Unifies dev and prod
It’s easy for dev and prod to get lost in their respective deliverables and become indifferent towards the challenges faced by the other side. This not only leads to rifts between the job roles but also makes the business suffer.
With the observability culture, you make success a shared responsibility. It makes the two teams speak a common language and work together towards achieving SLOs.
As shared in an example by Christine Yen, cofounder of Honeycomb, in a QCon talk, you don’t want exchanges between dev and prod to look like, “CPU is way up on half of the Cassandra cluster.” Instead, it should look like, “API latency is way up on our most expensive endpoints for our biggest customer since our last build.”
It’s through a fresh perspective like this that observability culture is able to bring everyone on the same page to work towards a common goal.
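One way to make that second kind of conversation possible is to attach business context to every request you record. The sketch below is only an illustration using Python's standard library; the field names (endpoint, customer_id, build_id) are hypothetical and are not taken from the talk or from any vendor's API.

```python
import json
import time


def build_request_event(endpoint: str, customer_id: str, build_id: str,
                        duration_ms: float, status_code: int) -> dict:
    """Describe one request as a single structured event.

    Carrying the endpoint, customer, and build alongside the latency lets
    anyone ask "which customer, which endpoint, since which build?".
    """
    return {
        "timestamp": time.time(),
        "endpoint": endpoint,
        "customer_id": customer_id,
        "build_id": build_id,
        "duration_ms": duration_ms,
        "status_code": status_code,
        "is_error": status_code >= 500,
    }


def emit(event: dict) -> None:
    """Write the event as one JSON line; a real setup would ship it to a backend."""
    print(json.dumps(event, sort_keys=True))


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    emit(build_request_event("/api/export", "acme-corp", "build-123", 840.5, 200))
```

With events shaped like this, a latency question can be grouped by customer or build rather than by host, which is the vocabulary both sides can share.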
5. Reduces the frequency of errors and outages
‘Trust’ is at the core of observability culture. Trust builds over time through small and regular efforts. And eventually, this trust erodes the fear that holds individuals back from taking ownership when something goes wrong.
This heightened sense of ownership is what enables individuals to share all the necessary context when analyzing an outage or error. In a fear-driven culture, no one would even come forward to acknowledge that something went wrong and there would be finger-pointing all around.
It’s when people are willing to share all the necessary context that the incident-analysis exercise actually becomes helpful for everyone. Everyone becomes aware of the do’s and don’ts and leaves the exercise with a slightly better understanding of the system and its components. Consequently, there are fewer incidents moving forward.
The perks of having an observability culture are numerous. But the million-dollar question is: how do you build an observability culture in your team?
How to build a culture of observability?
Let’s discuss some of the ways you can instill an observability culture across teams.
1. Share larger code life-cycle responsibility with developers
Most incidents happen after a release. That’s not to say that developers are to blame for incidents and should therefore be the ones fixing them. Rather, since they made the most recent changes, they are the ones best equipped to know what might have gone wrong and how it can be fixed. They know what the change was intended to do and can better assess the incident by comparing that intent with the actual output.
Developers taking more responsibility for the code is one of the first steps to nurture an observability culture. And this also means putting developers on call. This doesn’t necessarily mean having them available on call at all times for all the code they wrote. Instead, it should be done just so that they get to better understand the real-world consequences of changes they made, and also use their knowledge to resolve issues faster.
2. Set up reasonable on-call processes
Irrespective of whether it’s a developer or an SRE on call, you need to ensure your on-call rotations are fair to everyone. Production systems should be the responsibility of the entire team and not just of the people on call. It means everyone should have a sufficient understanding of the system and should be able to dissect incidents to a reasonable degree.
And once team members possess the relevant knowledge, the on-call rotations will be easier to manage. Being on call would still be a turbulent experience at times, but with adequate knowledge and the right systems in place, the frequency of such experiences is only going to reduce.
3. Prioritize documentation and encourage knowledge sharing
When we talk of having systems in place to deal with incidents, it’s not just observability tooling. Documentation plays a key role when it comes to helping everyone better understand the nuances of the system. You should encourage members to document everything including past incidents, processes, escalation paths, and more. The more information there is, the better context for everyone to work towards resolving issues.
But let’s face it, documentation isn’t the favorite part of most engineers. Therefore, you need to look beyond conventional ways of documentation and make it easier for everyone to log their experiences. We are talking about breadcrumbs, notes on dashboards, and even the ability to search if other members have worked on the same issue before.
You need to clearly express the importance of documentation to ensure everyone participates. Make it a priority and let everyone know that they should contribute by documenting what they learned.
4. Shift focus to what matters most
There’s so much going on in a modern distributed system at any given time. And since these systems are dynamic, there can also be numerous issues that you might think you need to address asap. But are those really critical?
Engineering teams should stop looking at the system just from an infrastructural point of view and start using the business lens as well. If it isn’t something that’s directly impacting the business then it means no one should wake up in the middle of the night to fix it. Consequently, when you start alerting engineers only to what’s important, they’ll start paying more attention to those particular facets that matter the most to your business.
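As a rough illustration of alerting only on what matters, the sketch below pages someone only when a customer-facing service is burning through its SLO error budget quickly. The function names and the threshold of 10 are assumptions chosen for the example, so treat it as a starting point rather than a recommended policy.

```python
def burn_rate(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means errors arrive exactly as fast as the SLO tolerates;
    higher values mean the budget is being spent faster than planned.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 0.0
    observed_error_rate = failed_requests / total_requests
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_page(slo_target: float, total_requests: int, failed_requests: int,
                customer_facing: bool, threshold: float = 10.0) -> bool:
    """Wake someone up only for fast budget burn on a customer-facing service.

    The default threshold is an illustrative assumption, not a standard.
    """
    if not customer_facing:
        # Internal-only problems can wait for a ticket in working hours.
        return False
    return burn_rate(slo_target, total_requests, failed_requests) >= threshold


if __name__ == "__main__":
    # Hypothetical window: 2% of 10,000 requests failing against a 99.9% SLO.
    print(should_page(0.999, 10_000, 200, customer_facing=True))
    print(should_page(0.999, 10_000, 200, customer_facing=False))
```

Anything that fails the check can still be recorded and reviewed later; it simply does not interrupt someone's night.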
5. Make it okay to fail fast and learn
Encouraging increased ownership also means encouraging people to make changes as they deem necessary. And not all of those changes are going to go your way. However, it’s alright as long as the team learns from them.
To propagate the observability culture, allow people to fail fast and learn from it. This also means not focusing on individuals during incident reviews but paying attention to the context that led to it.
Ideally, everyone should take responsibility for incidents, and the goal should always be to improve the output moving forward. A blameless culture ensures individuals are forthcoming and open to discussing ideas when things go wrong. In the process, it helps everyone else avoid the assumptions that led to the incident.
6. Start testing in production
Technically, everyone tests in production. It’s more about whether you are doing it intentionally or not. No matter what, the staging environments are never going to be able to replicate production. And therefore, you can either spend weeks perfecting your design in staging environments or test the hypothesis in production within a single day.
It’s important to note that testing in production is going to be a successful practice only if there is a culture of trust and ownership. It’s only then that developers will be comfortable deploying changes, setting up instrumentation to measure them, observing the results, and so on.
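One common way to test in production intentionally is to expose a change to a small, stable slice of users and watch how it behaves. The sketch below shows a deterministic percentage rollout using only the standard library; the function name, feature name, and user IDs are hypothetical, and a real team may well rely on a feature-flag service instead.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Decide whether a user sees a feature, given a rollout percentage (0-100).

    Hashing the feature name together with the user ID keeps the decision
    stable across requests, so each user consistently sees one code path.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets, i.e. steps of 0.01%.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    # Hypothetical check: roughly 5% of users should land in the rollout.
    users = [f"user-{i}" for i in range(10_000)]
    exposed = sum(in_rollout(u, "new-checkout", 5.0) for u in users)
    print(f"{exposed} of {len(users)} users see the new code path")
```

Pairing a gate like this with instrumentation on both code paths lets the team compare the two groups and roll back quickly if the new path misbehaves.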
7. Deploy observability before production
You may also look at things the other way around. For many teams, testing in production isn’t the ideal scenario. Instead, they should use all the pre-production environments to practice observability. There are staging, development, testing, and other sorts of environments where you can implement observability.
Using these low-risk stages is a great way to get iterative feedback and save all the legwork you would’ve otherwise done after the changes were live. It’s usually also a great opportunity to practice observability and even improve upon that iteratively.
Conclusion
Fostering an observability culture has numerous benefits that can significantly impact your organization’s ability to build and maintain reliable distributed systems. By encouraging a sense of ownership, prioritizing documentation, focusing on what matters most, and embracing a fail-fast-and-learn mindset, you can create an environment where developers and operations teams work together harmoniously.
Embracing observability as a part of your development culture is not just a one-time effort; it’s a continuous process that requires commitment and collaboration from everyone involved. By implementing the strategies discussed in this article, your team can effectively harness the power of observability and build more robust, responsive, and reliable systems for your business.