Intro

High level architecture for BeyondCorp

Identifying and gating services



How do you identify and categorize all the services that should be gated?

Google began as a web-based company, and as it matured in the modern era, most internal business applications were developed with a web-first approach. These applications were hosted on similar internal architecture as our external services, with the exception that they could only be accessed on corporate office networks. Thus, identifying services to be gated by BeyondCorp was made easier for us due to the fact that most internal services were already properly inventoried and hosted via standard, central solutions. Migration, in many cases, was as simple as a DNS change. Solid IT asset inventory systems and maintenance are critical to migrating to a zero trust security model.



Enforcement of zero trust access policies began with services which we determined would not be meaningfully impacted by the change in access requirements. For most services, requirements could be gathered via typical access log analysis or consulting with service owners. Services which could not be readily gated by default ACL requirements required service owners to develop strict access groups and/or eliminate risky workflows before they could be migrated.







What do you do with services that are incompatible with BeyondCorp ideals?



It may not always be possible to gate an application by the preferred zero trust solution. Services that cannot be easily gated typically fall into these categories:

Type 1: "Non-proxyable protocols", e.g. non-HTTP/HTTPS traffic.

Type 2: Low latency requirements or localized high throughput traffic.

Type 3: Administrative and emergency access networks. The typical first step in finding a solution for these cases is finding a way to remove the need for that service altogether. In many cases, this was made possible by deprecating or replacing systems which could not be made compatible with the BeyondCorp implementation.



When that was not an option, we found that no single solution would work for all critical requirements:

Solutions for the "Type 1" traffic have generally involved maintaining a specialized client tunneling which strongly enforces authentication and authorization decisions on the client and the server end of the connection. This is usually client/server type traffic which is similar to HTTP traffic in that connectivity is typically multi-point to point.

Solutions to the "Type 2" problems generally rely on moving BeyondCorp-compatible compute resources locally or developing a solution tightly integrated with network access equipment to selectively forward "local" traffic without permanently opening network holes.

As for “Type 3,” it would be ideal to completely eliminate all privileged internal networks. However, the reality is that some privileged networking will likely always be required to maintain the network itself and also to provide emergency access during outages. It should be noted that server-to-server traffic in secure production data center environments does not necessarily rely on BeyondCorp, although many systems are integrated regardless, due to the Service-Oriented Design benefits that BeyondCorp inherently provides. It may not always be possible to gate an application by the preferred zero trust solution. Services that cannot be easily gated typically fall into these categories:The typical first step in finding a solution for these cases is finding a way to remove the need for that service altogether. In many cases, this was made possible by deprecating or replacing systems which could not be made compatible with the BeyondCorp implementation.When that was not an option, we found that no single solution would work for all critical requirements:It should be noted that server-to-server traffic in secure production data center environments does not necessarily rely on BeyondCorp, although many systems are integrated regardless, due to the Service-Oriented Design benefits that BeyondCorp inherently provides.



How do you prioritize gating?



Prioritization starts by identifying all the services that are currently accessible via internal IP-access alone and migrating the most critical services to BeyondCorp, while working to slowly ratchet down permissions via exception management processes. Criticality of the service may also depend on the number and type of users, sensitivity of data handled, security and privacy risks enabled by the service.



Migration logistics



Most services required integration testing with the BeyondCorp proxy. Service teams were encouraged to stand up "test" services which were used to test functionality behind the BeyondCorp proxy. Most services that performed their own access control enforcement were reconfigured to instead rely on BeyondCorp for all user/group authentication and authorization. Service teams have been encouraged to develop their own "fine-grained" discretionary access controls in the services by leveraging session data provided by the BeyondCorp proxy.





Lessons learnt



Allow coarse gating and exceptions



Inventory: It's easy to overlook the importance of keeping a good inventory of services, devices, owners and security exceptions. The journey to a BeyondCorp world should start by solving organizational challenges when managing and maintaining data quality in inventory systems. In short, knowing how a service works, who should access it, and what makes that acceptable are the central tenets of managing BeyondCorp. Fine-grained access control is severely complicated when this insight is missing.



Legacy protocols: Most large enterprises will inevitably need to support workflows and protocols which cannot be migrated to a BeyondCorp world (in any reasonable amount of time). Exception management and service inventory become crucial at this stage while stakeholders develop solutions.

Run highly reliable systems

The BeyondCorp initiative would not be sustainable at Google’s scale without the involvement of various Site Reliability Engineering (SRE) teams across the inventory systems, BeyondCorp infrastructure and client side solutions. The ability to successfully achieve wide-spread adoption of changes this large can be hampered by perceived (or in some cases, actual) reliability issues. Understanding the user workflows that might be impacted, working with key stakeholders and ensuring the transition is smooth and trouble-free for all users helps protect against backlash and avoids users finding undesirable workarounds. By applying our The BeyondCorp initiative would not be sustainable at Google’s scale without the involvement of various Site Reliability Engineering (SRE) teams across the inventory systems, BeyondCorp infrastructure and client side solutions. The ability to successfully achieve wide-spread adoption of changes this large can be hampered by perceived (or in some cases, actual) reliability issues. Understanding the user workflows that might be impacted, working with key stakeholders and ensuring the transition is smooth and trouble-free for all users helps protect against backlash and avoids users finding undesirable workarounds. By applying our reliability engineering practices , those teams helped to ensure that the components of our implementation all have availability and latency targets, operational robustness, etc. These are compatible with our business needs and intended user experiences.



Put employees in control as much as possible



Employees cover a broad range of job functions with varying requirements of technology and tools. In addition to communicating changes to our employees early, we provide them with self-service solutions for handling exceptions or addressing issues affecting their devices. By putting our employees in control, we help to ensure that security mechanisms do not get in their way, helping with the acceptance and scaling processes.





Summary