Site reliability engineers (SREs) and security teams are more powerful when they work together, and being able to combine our efforts can make or break our teams' experiences and outputs.
COMMENTARY
Security teams and site reliability engineers (SREs) are natural allies in the fight against randomness, terrible systems, and the ways those systems can hurt people and give us yet another reason to hate surprises. But where do our interests overlap, and how can we take advantage of that overlap to improve how our teams work?
I'm not an SRE, but I've had the immense luck to work very closely with multiple SRE teams, and the more work I've done with them, the more I've noticed that what security wants and what SRE wants overlap by at least 90%.
Security teams protect against malicious adversaries, whereas SREs protect against the system effectively acting like a malicious adversary. If we work together, we can plan ahead and avoid duplicate work, ultimately making for a better business.
We Actually Have a Lot in Common
We are natural allies because we want a lot of the same things. Though it may not be immediately apparent, SREs and security teams share many priorities, including:
Access controls (or, in the SRE world, dependency control systems): An access control system can double as a dependency control system, since a record of what is allowed to talk to what is also a record of what depends on what. No one likes it when people and systems touch things they're not supposed to. It's amazing how few people realize that, with careful design, we can build one system for both security teams and SREs. When I show up and offer an SRE team a dependency control system for the low, low price of working with security to roll out better access controls, I become their instant best friend.
Network design: If possible, we don't want unauthorized folks even reaching the access controls. For example, we want to shut down denial-of-service (DoS) attacks (whether from outside attackers or from your own systems because of things like buggy retry logic) before they even take up the resources for a network handshake. And why give potential attackers the chance to look for or exploit a security flaw? There is a cost angle, too: Amazon charges a lot of money for a network address translation (NAT) gateway. Running your own proxy, like Envoy, means you get egress controls (which security loves) and you pay less for them. SRE would definitely like to spend that money on something else. Who wouldn't?
Observability: Both our teams are trying to detect incidents, albeit different types. Tracking an attacker is a bit different from debugging, but the information you need for each (such as traffic flow, tracing, and crash dumps) heavily overlaps. Both teams want comprehensive coverage, and neither of us wants to do redundant work.
Releases: Failing to patch, and especially failing to patch on a predictable schedule, is what gets a lot of people in trouble. Production services get hacked because they're not up to date on patches or are only partially patched. A mess of slow, partial patching is a total headache to run or debug, even without attackers in the mix.
Release engineering: Fast, safe releases are how you fix incidents fast — so release speed is a security feature.
Incident response: Enough said.
Eliminating toil: We've both got way more than enough to do, thank you — and need to move fast in a crisis.
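To make the access controls item above concrete, here is a minimal sketch of how one allowlist could serve both teams. It is an illustration under stated assumptions, not a description of any particular product: the service names, the `ALLOWED_CALLS` list, and the helper functions are all hypothetical. The same data answers the security question ("is this call permitted?") and the SRE question ("what can this service end up depending on?").

```python
from collections import defaultdict

# Hypothetical allowlist: each pair means "caller may call callee".
# All service names are invented for illustration.
ALLOWED_CALLS = [
    ("frontend", "checkout"),
    ("frontend", "search"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("payments", "ledger"),
]


def build_graph(allowed_calls):
    """Index the allowlist by caller so both views share one source of truth."""
    graph = defaultdict(set)
    for caller, callee in allowed_calls:
        graph[caller].add(callee)
    return graph


def is_allowed(graph, caller, callee):
    """Security view: deny any call that is not explicitly allowlisted."""
    return callee in graph.get(caller, set())


def transitive_dependencies(graph, service):
    """SRE view: every service this one can directly or indirectly depend on."""
    seen = set()
    stack = [service]
    while stack:
        current = stack.pop()
        for dep in graph.get(current, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


graph = build_graph(ALLOWED_CALLS)
print(is_allowed(graph, "frontend", "checkout"))  # True
print(is_allowed(graph, "frontend", "ledger"))    # False: no direct grant
print(sorted(transitive_dependencies(graph, "frontend")))
```

In this sketch, `frontend` cannot call `ledger` directly, yet `ledger` still shows up in its transitive dependencies, which is the kind of fact an SRE wants during an outage and a security engineer wants during an access review.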
The above is not even an exhaustive list of the many ways SREs and security teams overlap. So, what's different? Here are a few things:
Error budget: SREs have an error budget, which (usually) isn't zero. One security error is still a security incident. One of the places where this crops up is tiny experimental launches — a tiny experimental vulnerability is still a vulnerability for a security person, and someone who exploits it is still a problem after the launch is rolled back. The attacker might even still be in the system! With that in mind, security teams tend to be a lot more concerned about launches than SREs are.
Measuring: Security can't always measure things as well as SRE can. A lot of what we're trying to measure is actively hiding from us, and we almost never get ground truth to work with, so this continues to be a challenge.
Compliance: Most engineers hate compliance, probably because it's both inflexible and sometimes seems useless. And hey, I agree. However, we have to do compliance anyway, even if some bits are annoying, because it turns out that at least one of two things is true for most compliance frameworks: People will give you money for doing it, or governments will get involved if you don't do it.
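The error budget difference above comes down to simple arithmetic. The sketch below assumes a 99.9% availability target over a 30-day window, which is an illustrative figure and not one taken from this article, and the function names are hypothetical. Treating security as a 100% target is only an analogy, but it shows why the budget disappears.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability target over a window."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Budget left after subtracting downtime already spent in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# Assumed SRE target: 99.9% over 30 days leaves a nonzero budget to spend.
print(f"{error_budget_minutes(0.999):.1f} minutes")

# Analogy for security: a 100% target leaves no budget for experiments.
print(f"{error_budget_minutes(1.0):.1f} minutes")
```

With a nonzero budget, a small experimental launch that misbehaves just spends some of it. With a budget of zero, the same launch is an incident.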
As you can see, our list of common priorities is much longer than our list of differences. So how do we work together and optimize our organizational efficiency? Here are three ways:
Respect: People are different. Companies are different. This is a feature, not a bug. If we're going to do a good job at security, privacy, trust and safety, or building any kind of product or system, we need to be solving problems for actual people — otherwise we're not solving problems at all.
Collaboration: We can elevate each other's work, even when it's not immediately apparent why we should be doing it. One way this can work is by raising the priority of projects that SRE teams really want to do but couldn't get prioritized previously, and that security can't do on its own. Security can help by framing them as joint projects that are critical for the company. Dashboards and monitoring can also make a big difference in your collaboration efforts: They help security teams get over their measuring obstacles, and they give SREs helpful visibility into security's work. And remember: Dashboards are not there just to inform people or make sure they know things. What you need is a dashboard that gets people to do things when they look at it, like patch their vulnerabilities.
Choosing each other: Many chief information security officers (CISOs) are not engineers and, as such, do not know how to get engineers to want to work with them. If you're not working closely with your CISO already, I suggest looking at the security team's list of priorities and saying, "Hey, I can solve some of your problems." They would love to team up and help.
We're more powerful together, and being able to combine our efforts more productively can make or break our teams' experiences and outputs. There's an obvious opportunity for SRE and security teams to do this and, of course, to avoid upsetting our legal teams in the process.
About the Author(s)
CISO, Lacework
Lea Kissner is chief information security officer at Lacework, bringing more than 20 years of experience leading security, privacy, and anti-abuse efforts at global organizations to the company. Their experience includes serving as CISO at Twitter, chief privacy officer at Humu, and global lead of privacy technology at Google. In the spring of 2020, when Zoom experienced security concerns after a massive increase in usage due to the COVID-19 pandemic, Kissner served as a security and privacy consultant for the company to improve the security, privacy, and anti-abuse features of Zoom's products and systems. Kissner currently serves as a board member to the USENIX Association, a nonprofit organization dedicated to supporting the advanced computing systems communities and furthering the reach of innovative research.