Many security teams are plagued by the eternal question of whether to “build” or “buy” technical solutions for the complex challenges they face. At BuzzFeed, our security infrastructure team experienced this tension when our newly developed microservice ecosystem began to outgrow our existing auth patterns, which relied on Bitly’s tried-and-tested oauth2_proxy. This solution was reliable, but it lacked a centralized way for our users to sign on, which presented a handful of challenges for our growing platform. So we did what any good engineers would do: We looked for other existing, tested solutions in open source. However, while we heard from our peers that this was a common problem, we never found something that met our specifications. Thus, we created sso, a centralized authentication proxy of our own. It provided us with what we saw as a novel solution to a common industry problem, so we made a commitment to open source it—to provide that tried-and-tested solution we too had been searching for in the past.

Our problem: microservice auth

When BuzzFeed launched over 10 years ago, we started, as many companies do, with a glorious single application. Over time, when both the engineering team and its monolith application had doubled in size, we came to a familiar realization for companies of our size: We lacked the sophistication of infrastructure and tooling to support the organizational model of our teams and systems. Thus, rig was born. Rig is our opinionated Platform as a Service (PaaS), which enables high-velocity development and productionizing of services. The transformation was wonderful—we went from struggling to deploy several times a week to deploying 150 times a day! However, as anyone who has made the move to microservices can attest, there were pain points in migrating to a much more distributed set of services. In this article, we’ll explore one pain point in particular: authenticating and authorizing access to internal services.
Today, BuzzFeed’s software ecosystem comprises hundreds of microservices that interact with each other in various ways. A large subset of these services consists of tools that support our content creation and business teams, which means we need a secure way for users to access these internal services. Early in the development process, when we were building out proofs of concept, we were inspired by Google’s BeyondCorp methodology, which addresses the challenge of allowing users to access internal services over untrusted networks without a traditional VPN. This approach requires these services to have a user interface open to the internet, and those interfaces need to both authenticate and authorize their users. Rather than pushing this responsibility onto each service, we opted to protect each of them with an instance of oauth2_proxy, a reverse proxy that uses third-party OAuth2 providers (Google, GitHub, and others) to authenticate and authorize requests. Using an auth proxy is generally an effective pattern for microservices because it allows developers to focus on their services’ primary functionality instead of re-implementing authentication every time a new service is developed. This proved to be the case for us, and we rapidly grew the number of deployed internal applications. However, problems soon emerged. As the number of individual boilerplate oauth2_proxy services grew beyond 70, we ran into a variety of fresh issues—not just as maintainers and developers of the platform, but also as users of the internal applications. While our platform made it easy to create new services, correctly configuring the accompanying auth proxy was overly difficult and confusing for developers.
This negatively impacted the productivity of our product teams and led to inconsistent and difficult-to-audit OAuth2 applications and permissions with Google, our third-party provider. Furthermore, as operators, we found it difficult to manage the proliferation of auth proxy services. Critical security fixes required over 70 patches and deploys instead of just one, and auditing and controlling access across services was an ongoing challenge. To remediate this, we needed a way to centralize access and administration. This would also create a more pleasant experience for the end user, who in the current state needed to log in separately to every service. In an ideal world, a user would perform a single sign-on and have access to all authorized services for a configured time span.

Our solution: centralized auth

Our solution to these pain points is sso, our OAuth2-friendly adaptation of the Central Authentication Service (CAS) protocol. sso allowed us to replace every individual oauth2_proxy service with a single, centralized system, providing a seamless and secure single sign-on experience, easy auditing, rich instrumentation, and a painless developer experience. Our implementation comprises two services, sso-auth and sso-proxy, that cooperate to perform a nested authentication flow and proxy requests.

sso-auth

sso-auth acts as a central authentication service, directing a user through an authentication flow with a third-party provider (e.g., Google). It uses the third-party provider’s groups API (e.g., Google Groups) to provide a simple administrative user experience for authorization.

sso-proxy

sso-proxy goes through an OAuth flow similar to sso-auth’s, but with sso-auth as its authentication provider. After going through this flow, it proxies the request back to the upstream. Additionally, it signs the requests, providing a mechanism for upstreams to verify that each request originated from sso-proxy. Both sso-auth and sso-proxy store user session information in long-lived, encrypted cookies, but sso-proxy transparently revalidates the user’s session with sso-auth on a short, configurable interval to ensure quick propagation of authentication and authorization changes.

User experience using sso

These two services work together to create a single sign-on experience that, behind the scenes, consists of nested OAuth flows. When end users visit an internal resource, like cms.example.com:

1. sso-proxy first attempts to authenticate the user by validating the session cookie stored client-side. If the cookie is valid, the request is proxied to the upstream.
2. If the cookie does not exist or has hit a refresh deadline, sso-proxy begins an OAuth flow with sso-auth by redirecting the user to auth.example.com.
3. sso-auth checks its own session cookie to see if it exists and is still valid. If the cookie does not exist or is invalid, sso-auth redirects the user to the third-party provider to authenticate, then stores the resulting session information in a session cookie.
4. If the flow is successful, sso-proxy receives a callback from sso-auth with the authentication code, exchanges it for an access token, and executes an API call to sso-auth to retrieve authorization information about the user.

sso-proxy sets a refresh cookie that times out after a short period, at which point the proxy re-requests the user’s identifying information from sso-auth to verify that the user has not been removed from our access control lists (ACLs). Then, when the user is sent to auth.example.com again—for example, from a different protected service—the cookie that was previously set can be used to authenticate and authorize the user from the centralized service, sso-auth.

Alternatives

We considered several alternative approaches. We thought about using something like Keycloak, but we ultimately felt it would be easier to migrate from our existing cluster of distributed oauth2_proxy instances to something centralized. We also didn’t believe it was necessary to introduce a database, which Keycloak depends on, to address our requirements. Stateless and cloud-native systems are easier to deploy, especially on containerized platforms like rig. Furthermore, our reliance on and experience with oauth2_proxy made OAuth-based solutions a more natural candidate than something like SAML. Finally, we also explored using a VPN, but BuzzFeed is a large, distributed organization, which made this a less attractive and viable option from a cost and usability perspective.

Our mission: open source

After 12 months of running this project in production in front of services across the company, we felt confident that we could open source it. Happily, this coincided with a reorganization of our infrastructure team into concentrated squads. Our site reliability engineers and platform engineers joined together to form four distinct squads across disciplines, including a squad dedicated to securing BuzzFeed applications. The combination of these skill sets created the dream team to lead the initiative to open source this project.

Why open source?

First of all, sso was born out of an open-source project, and it seemed only natural to give back to the community. Second, we understood from talking to folks in similar roles at other companies that the need for centralized auth was a common problem among platform engineering teams. We learned that many teams had built out their own solutions internally because there was no ideal open-source solution, and we hoped to work together in the open to tackle this. Additionally, we knew empirically that oauth2_proxy, from which sso was originally forked, has a large and active community of users, so we felt confident that we could achieve similar traction with sso. Finally, we believed that granting access to our code would help improve our security practices. As we will discuss in the next section, security encompasses a variety of risk factors you cannot fully prepare for, and we believed that the transparency of open source would shine a light on the things we could improve.

In our efforts to open source this project, we quickly realized that there can be a swath of issues to navigate when trying to open source any project, let alone one with the security footprint and risk of an authentication proxy. Here, we’ll share the steps we took to ensure the safety of our systems and the lessons we learned along the way.
Securing our systems

At BuzzFeed, we use a monorepo, so we started by migrating the sso code to a new repo that we would eventually make public. We quickly ran into the issue of path dependence: Decisions we’d made in the past, like JSON encoding of environment variables and other platform-specific code and configuration, no longer made sense. They weren’t the kinds of interfaces we wanted to expose in the open-source project, so we had to refactor these integration points to allow sso to stand on its own, without our opinionated workflow. We also crafted an internal workflow to complement the expected open-source flow of sso, which meant developing a good process for migrating changes back to the monorepo. Initially, this process involved cloning the repository in a pre-build script before running the service, with the understanding that we could eventually eliminate the workflow challenges by using published container images.

This new repo felt like a fresh start, and we took advantage of the opportunity to refactor many aspects of the application. Since we started out with a double clone of oauth2_proxy, there was a lot of duplicated and unnecessary code. The looming reality that all of its flaws would soon be made public proved to be a great motivator for cleaning up our codebase. The codebase is in Go, so we took the opportunity to read up on Go best practices, which provided a learning opportunity for members of the team who were less familiar with the language. We audited our dependencies, standardized our Go project layout, and generally improved code hygiene within the sso codebase.

The decision to open source sso was fraught with tension around whether open sourcing critical security software would lead to an increased risk of vulnerability for BuzzFeed’s infrastructure. How could we ensure the security of our systems while granting access to newly written code?
While we had been using sso in production for almost a year at this point, we understood that by opening up its codebase, we were essentially showing the world the design of all the locks on all our doors. Thus, we took a careful set of steps to minimize our security risk.

This was both our first security and our first Go open-source project, so we opted for a three-phase auditing process. First, we had our consulting security architect, Eleanor Saitta, look over the initial architecture of the project. She reviewed the design and code in depth and pointed out places where we could improve our perimeter security. One of the most interesting and helpful issues she raised had to do with how we encrypt our session state. This is a crucial part of the code because it holds the user information, as well as the refresh and access tokens associated with the user. Through our refactoring we learned all about nonce-misuse resistance, and opted to use a misuse-resistant symmetric encryption library for Go. Next, we opened up our repository to some of our HackerOne researchers, who were given access to both the code and an unstable environment for penetration testing. Finally, we retained a security consultancy that had never seen our applications or code before, and they performed penetration tests over the course of a week.

In the final weeks before the open-source launch, we focused on addressing any lingering concerns around risk mitigation. Careful planning and organization were crucial during this time. We created a comprehensive checklist of steps to follow and complete, including associated timelines and who was responsible for what, which drastically reduced uncertainty. Our team, the security infrastructure squad, discussed up front what we would do if a critical vulnerability was found.
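To make the session-encryption concern from the audit concrete, here is a simplified sketch of sealing session state using the standard library’s AES-GCM. This relies on a fresh random nonce for every message; a misuse-resistant construction (the route we took) additionally stays safe if a nonce ever repeats. The key handling and session format below are illustrative only:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
)

// encrypt seals the serialized session state with AES-GCM, prepending
// the randomly generated nonce to the ciphertext.
func encrypt(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt splits off the nonce and authenticates + decrypts the rest.
func decrypt(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, fmt.Errorf("ciphertext too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // demo only; derive and store real keys carefully
	if _, err := io.ReadFull(rand.Reader, key); err != nil {
		panic(err)
	}
	sealed, _ := encrypt(key, []byte(`{"email":"user@example.com"}`))
	plain, _ := decrypt(key, sealed)
	fmt.Println(string(plain))
}
```

Because the cookie is authenticated as well as encrypted, any client-side tampering causes decryption to fail rather than produce a corrupted session.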
We developed runbooks that documented possible remediation paths, and we asked ourselves what, if anything, would make us decide to close source again. Having these worst-case-scenario runbooks available allowed us to open source the project more confidently; we knew what to do in the face of disaster.

The importance of working as a team on this undertaking cannot be overstated. Through teamwork, we were able to vastly improve our documentation in the final weeks of our sprints, including providing a quickstart guide for setting up sso. We were especially proud to see a comment on Hacker News that commended us for not just “dropping” the project with no context or documentation. Documentation is the heart of open source, and the quality of our docs was the product of collaboration between the security infrastructure squad, IT security, the HackerOne researchers, our site reliability and platform engineers, and the wider BuzzFeed tech community. And, in the end, we were able to celebrate the project’s success as a team as well.

Beyond preparedness, understanding that security is never completely “done” was crucial. Our team has a learning and growth mindset about all of our work, and that includes acknowledging that unknown unknowns exist and that we will have to continuously adapt. Nothing is ever 100 percent guaranteed to be secure, but careful planning, good communication, and clear expectations allowed us to assuage our initial fears.