What were the challenges in designing it?

Challenge 1: Multi-tenancy design

Challenge 2: Which network ownership model should we use to enforce IP address management?

Challenge 3: How big do we need to think?

Challenge 4: Private IPv4 addresses exhaustion

Challenge 5: Identifying edge cases

Challenge 6: Managing multiple regions in a Shared VPC

Challenge 7: Making GCE instances private only

Challenge 1: Multi-tenancy design

Each of our microservices is treated as a tenant in our company, and we built many automated tools around this model, in which each team provisions its GCP projects and resources through GitHub and Terraform.

Giving flexibility to developers while providing adequate guardrails is a core concept of our microservices platform.

Logically, the network should follow the same vision. In the current architecture, each microservice has its own GCP project and VPC. If a microservice wanted to access another microservice's resources outside the central GKE cluster, it had to create a VPC peering, which resulted in static groups of VPCs.

VPC peering has a hard limit of 25 VPCs per peering group (VPCs that are peered together), which makes this option impossible to use as we already have over 100 microservices. VPC peering also requires that participating VPCs have no overlapping IP ranges, which would prevent any microservice VPCs that use the default subnet IP ranges from peering with each other.
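The overlap constraint can be illustrated with a minimal sketch using Python's standard `ipaddress` module. The project names are hypothetical, and `10.128.0.0/20` is used here only as a stand-in for a default subnet range shared by several VPCs.

```python
import ipaddress
from itertools import combinations

def find_overlaps(vpc_ranges):
    """Return pairs of VPC names whose subnet ranges overlap."""
    conflicts = []
    for (name_a, cidr_a), (name_b, cidr_b) in combinations(vpc_ranges.items(), 2):
        net_a = ipaddress.ip_network(cidr_a)
        net_b = ipaddress.ip_network(cidr_b)
        if net_a.overlaps(net_b):
            conflicts.append((name_a, name_b))
    return conflicts

# Hypothetical VPCs: two kept the same default range, one was given a custom range
vpc_ranges = {
    "service-a": "10.128.0.0/20",
    "service-b": "10.128.0.0/20",
    "service-c": "10.200.0.0/20",
}
conflicts = find_overlaps(vpc_ranges)
print(conflicts)
```

Any pair reported here could not be peered, which is why VPCs that all keep the same default ranges cannot be connected this way.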

Solution 1: Use Shared VPC to enable multi-tenancy

GCP Shared VPC allows different GCP projects belonging to the same organization to share the same VPC network.

By sharing one VPC, participating GCP projects don’t have to create any VPC peering to connect. On top of that, we can define permissions on VPC subnets to create a multi-tenancy model by granting each GCP project its own subnet. All VPC subnets are natively routed within the same VPC network, regardless of region.

Moving to Shared VPC forces you to take a fresh look at the architecture, as it is a centralized model. It brought many challenges we hadn’t anticipated, which we describe below.

tl;dr: We found out the pros outweighed the cons.

By nature, Shared VPC requires that participating GCP projects have no overlapping IP ranges. This is one of the main challenges to solve when adopting such a solution, which leads to the next challenge:

Challenge 2: Which network ownership model should we use to enforce IP address management?

Microservices network ownership model

Traditionally, enterprises have a dedicated network team responsible for the network architecture design and infrastructure configuration (routing, firewall…). When a company starts adopting microservices, the network team sees its workload increase linearly with the number of microservices. It rapidly becomes a Single Point of Failure (SPOF) if it relies on manual and reactive operations.

To avoid this bottleneck, it is essential to automate network-related processes and operations. Ideally, microservice owners would take on this responsibility, but in reality they often lack the skills to properly understand and manage network component configurations, firewall policies, routing…

Consequently, the network team needs to provide automated configurations that interface with the other automation tools used to provision microservices infrastructure.

Entity network ownership model

When entities are separate, as in our case with Mercari JP, Merpay and Mercari US, it is usual to have one network team per entity. Each network team is responsible for its entity and collaborates with other entities in specific cases. Still, the network is such a common foundation layer for a business that it is typical to see one global network shared by entities from the same group, sometimes even with third parties. A complex network architecture with multiple datacenters, global network backbones, Multiprotocol Label Switching (MPLS) and Network Address Translation (NAT) everywhere has been an enterprise standard for decades.

However, in our experience, the more complicated the network ownership model is, the more tedious the collaboration between stakeholders becomes.

If you have an opportunity to simplify, don’t overlook it!

This principle is very useful when you have doubts about a design or a feature, especially in networking. Don’t use NAT unless you have no choice, don’t use complex routing policies where simple ones get the job done, and so on.

Solution 2: Having a central network team with automation

In the end, we decided to have one network team that would manage the global network for all entities.

The reason lies in Conway’s law:

‘organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations’

Enforcing non-overlapping IP ranges is much harder when multiple teams manage IP assignment.

Having a central team is the best way to ensure the network standards are clear and respected. It brings cohesion to the overall network design and management. But as stated above, the central team mustn’t become a gatekeeper or a bottleneck for other teams. Providing developers flexibility is crucial.

A central team needs the proper automation and self-servicing capabilities to scale with microservices.

This is what we did by implementing a Terraform-based GitHub repository that automatically generates Shared VPC projects, attachments and subnets with the help of helper scripts. Any microservice team that needs a subnet for its GCP project can get one automatically by sending a Pull Request to this repository with the configuration generated by our helper script. The IP assignment is not fully automated, but its overhead is low enough that the network engineers do not drown.
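As an illustration of what subnet assignment from a central pool can look like, here is a minimal sketch in Python. It is not our helper script: the function name, the pool and the allocated ranges are all hypothetical, and the strategy (first free block of the requested size) is just one simple option.

```python
import ipaddress

def next_free_subnet(pool, allocated, prefix_len):
    """Return the first subnet of the given prefix length in the pool
    that overlaps none of the already allocated ranges."""
    pool_net = ipaddress.ip_network(pool)
    taken = [ipaddress.ip_network(cidr) for cidr in allocated]
    for candidate in pool_net.subnets(new_prefix=prefix_len):
        if not any(candidate.overlaps(t) for t in taken):
            return candidate
    raise ValueError(f"no free /{prefix_len} left in {pool}")

# Hypothetical pool and existing assignments
allocated = ["10.0.0.0/24", "10.0.1.0/24"]
subnet = next_free_subnet("10.0.0.0/16", allocated, 24)
print(subnet)
```

Keeping the list of allocated ranges in a single repository is what lets one team guarantee that no two projects receive overlapping subnets.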

Challenge 3: How big do we need to think?

When working on such an important design, one that involves and impacts an entire group, it is easy to feel overwhelmed by the amount of information and the scope.

What is the best way to get from blank page syndrome to something you can deliver?

Solution 3.1: Define a “rough” capacity planning

This is an important part of the network architecture design as it should be one of the requirements for the design.

Scalability should not be compromised by the architecture. Consequently, capacity planning requires input from all stakeholders likely to consume the infrastructure.

By discussing with several internal customers, we defined some capacity planning for the GCP services that would require private IPs:

Solution 3.2: Keep flexibility in the process

Don’t be too rigid when designing the architecture. While it is easy to be conservative, there is no point if your design takes years to complete.

Identify as many two-way door decisions as possible while keeping the base of your architecture a high-quality decision.

Not every part of the design is set in stone, even less so nowadays.

First identify which decisions are one-way doors and which are two-way doors, then tackle the one-way doors first.

Using this mental model proved very useful in delivering a high-quality design in a relatively short period (~6 weeks).

Below are some questions useful to ask:

How much will our GKE, GCE usage grow over the next 3 years? 5 years?

Is our design enabling future technologies such as serverless?

How much capacity do we need for Disaster Recovery?

Is our design future-proof? Can it evolve?

What would be the pain points in managing such a design?

With this information, we were able to design an IP address assignment for the entire Group, including our US entity and to get a clear network architecture design.

Challenge 4: Private IPv4 addresses exhaustion

Private IPv4 addresses are a very scarce resource. RFC 1918 defines only around 18 million of them, split into 3 blocks (10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16). Kubernetes brings new requirements for IPv4 address consumption by giving every pod a private IPv4 address.

While this didn’t cause many issues in the past as overlay networks were isolated, GCP made pods first-class citizens of the network by releasing Alias IP. Alias IP grants every pod in a Kubernetes cluster a private IPv4 address from the VPC CIDR block the cluster belongs to.

Below is a breakdown of Kubernetes IP address usage (for a 1000-node GKE cluster with default settings):

GKE Nodes CIDR: /22 (1024 IPs)

Pods CIDR: /14 (262144 IPs), each node has a /24 (256 IPs) portion allocated

Services CIDR: /20 (4096 IPs)

A 1000-node GKE cluster requires around 267k IP addresses, which is ~1.5% of the total RFC 1918 IPv4 pool!

When you want to scale to several clusters, the number becomes worrying: 8 clusters take 12% of the RFC 1918 IPv4 pool. Add a simple failover region for disaster recovery on top of that and a quarter of the pool is eaten up!
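The arithmetic behind these percentages can be reproduced with a short Python sketch. The CIDR sizes are the ones listed above; the doubling for disaster recovery assumes a failover region that mirrors the 8 clusters one-to-one.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918
RFC1918_BLOCKS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]
rfc1918_total = sum(ipaddress.ip_network(b).num_addresses for b in RFC1918_BLOCKS)

def cluster_ips(nodes_prefix, pods_prefix, services_prefix):
    """Total addresses reserved by one cluster for nodes, pods and services."""
    return sum(2 ** (32 - p) for p in (nodes_prefix, pods_prefix, services_prefix))

per_cluster = cluster_ips(nodes_prefix=22, pods_prefix=14, services_prefix=20)
share_one = per_cluster / rfc1918_total
share_eight = 8 * share_one
share_eight_with_dr = 2 * share_eight  # assumes a mirrored failover region

print(rfc1918_total)  # 17891328
print(per_cluster)    # 267264
print(f"{share_one:.1%}, {share_eight:.1%}, {share_eight_with_dr:.1%}")
```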

Kubernetes is extremely “IPvore” so we had to find solutions to make it use fewer IP addresses.

Solution 4: Using Flexible Pod CIDR

We partially solved the problem using Flexible Pod CIDR, sacrificing pod density for IPv4 address savings. Sacrificing pod density is an important matter as it effectively limits the total compute capacity of a GKE cluster. We carefully compared this limitation with our capacity planning to find a good balance between loss of capacity and IPv4 address savings.

Reducing the per-node Pod CIDR from /24 (256 IPs, max 110 pods per node) to /26 (64 IPs, max 32 pods per node) has a huge impact. Taking the earlier example, we go from 267k IP addresses to 70k, a substantial 74% decrease! On the other hand, we theoretically go from 110k max pods to 32k pods per cluster, which is about a 71% decrease in pod capacity. 32 pods per node felt like the sweet spot based on our GKE utilization and capacity planning. We made the call since this was the best compromise we could make, considering IPv4 address savings more important than pod capacity and the max number of pods per node. Your mileage may vary depending on your priorities.
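The trade-off can be checked with a small Python sketch. It assumes the same 1000-node cluster, with the nodes (/22) and services (/20) ranges unchanged, and a pod range sized as the smallest power-of-two block that holds one per-node block for every node.

```python
def pod_range_prefix(nodes, per_node_prefix):
    """Smallest prefix length whose range holds one per-node block for every node."""
    addresses_needed = nodes * 2 ** (32 - per_node_prefix)
    # Integer ceil(log2(n)) without floating point
    return 32 - (addresses_needed - 1).bit_length()

def cluster_total(nodes, per_node_prefix, nodes_prefix=22, services_prefix=20):
    """Addresses reserved by the cluster for nodes, pods and services."""
    pods_prefix = pod_range_prefix(nodes, per_node_prefix)
    return sum(2 ** (32 - p) for p in (nodes_prefix, pods_prefix, services_prefix))

before = cluster_total(1000, 24)   # /24 per node, pod range /14
after = cluster_total(1000, 26)    # /26 per node, pod range /16
ip_saving = 1 - after / before

pods_before = 1000 * 110           # max 110 pods per node
pods_after = 1000 * 32             # max 32 pods per node
pod_loss = 1 - pods_after / pods_before

print(before, after)               # 267264 70656
print(f"IP saving: {ip_saving:.0%}, pod capacity loss: {pod_loss:.0%}")
```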

Challenge 5: Identifying edge cases

As there is no magic solution, cloud providers also come with their share of technical limitations that they try to hide from customers until a given edge case is found. This is what we experienced in our exchanges with the GCP teams: we had to bend our design to ensure we could do what we planned with minimal friction. Identifying edge cases takes a lot of time, and it is easy to end up chasing mirages.

Solution 5: Research limitations extensively but with moderation

We started by sifting through the GCP network documentation to identify the technical limitations of each product, which was sometimes painful as there are dependencies between some products, especially with Shared VPC.

We were able to list a first set of technical limitations that matter at large scale, such as (as of August 2019):

Max Shared VPC Service Project per Host Project: 100

Max number of subnets per project: 275

Max secondary IP ranges per subnet: 30

Max number of VM instances per VPC network: 15000

Max number of firewall rules per project: 500

Max number of Internal Load Balancers per VPC network: 50

Max nodes for GKE when using GCLB Ingress: 1000

These numbers look scary when operating at a very large scale. 15k VMs means 15 GKE clusters of 1000 nodes if Kubernetes nodes were the only GCE resources we used in GCP. This is the limit we fear the most, yet we agreed to live with it.

The other one is the maximum number of subnets per project, which means we can have at most 275 microservices with a dedicated subnet, minus the subnets reserved for Disaster Recovery. We agreed on using Shared VPC since only a few microservices will require a dedicated VPC subnet.

These limitations also confirmed our intention to have mirrored development and production network infrastructures, completely isolated from each other, to avoid cross-environment violations and to prevent a single all-encompassing Shared VPC from reaching its limits twice as quickly.

The important takeaway here is the need to find a balance between edge cases, your capacity planning and your risk assessment.
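One way to keep an eye on these limits is to compare planned usage against them. The sketch below encodes the limits quoted above (as of August 2019); the planned figures and the 80% alert threshold are hypothetical and only serve as an illustration.

```python
# Limits quoted above (as of August 2019)
LIMITS = {
    "service_projects_per_host_project": 100,
    "subnets_per_project": 275,
    "secondary_ranges_per_subnet": 30,
    "vm_instances_per_network": 15000,
    "firewall_rules_per_project": 500,
    "internal_load_balancers_per_network": 50,
}

def usage_report(planned, limits=LIMITS):
    """Return {limit name: fraction of the limit consumed} for each planned value."""
    return {name: value / limits[name] for name, value in planned.items()}

def over_threshold(planned, threshold=0.8, limits=LIMITS):
    """Names of the limits whose planned usage reaches the threshold."""
    report = usage_report(planned, limits)
    return sorted(name for name, share in report.items() if share >= threshold)

# Hypothetical planning figures, for illustration only
planned = {
    "subnets_per_project": 120,
    "vm_instances_per_network": 13000,
}

# 15000 VMs split into 1000-node clusters
max_clusters = LIMITS["vm_instances_per_network"] // 1000
print(max_clusters)             # 15
print(over_threshold(planned))  # ['vm_instances_per_network']
```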

In our case, we decided to go with these limitations predicting that:

These would be lifted in the future, with as few redesigns as possible

We might not achieve this scale (obviously we want to!)

We made many two-way doors decisions so it is a calculated risk

Challenge 6: Managing multiple regions in a Shared VPC

Shared VPC spans across the globe by definition so it seemed easy to create a multi-region network architecture. However, we had to choose the design which would fulfil our architecture goals the best while solving the challenges we mentioned.

We defined 4 options for the multi-region Shared VPC design:

Option 1: 1 Global Shared VPC Host Project, 1 Shared VPC network per region peered with VPC peering

Option 2: 1 Global Shared VPC Host Project, 1 Global Shared VPC network

Option 3: 1 Shared VPC Host Project per region with VPC peering

Option 4: 1 Shared VPC Host Project per region without VPC peering

After weighing each option’s pros and cons, we chose Option 2 for the following reasons:

It has the simplest management with a centralized Shared VPC Host Project for the entire group, referring to Solution 2.

It is the easiest way to implement the infrastructure logic in GitHub and Terraform

Interconnection between regions is straightforward and leverages Google Global VPC Network

It fulfils the architecture goals and our guesses in Solution 5

Challenge 7: Making GCE instances private only

With Shared VPC, internal connectivity between all GCE instances within the VPC is straightforward and secure. This allows us to remove public IP addresses from all GCE instances, but when we do so, the instances lose Internet connectivity.

How can we provide Internet connectivity to VMs while ensuring scalability?

Solution 7: Use Cloud NAT in the Shared VPC

Cloud NAT is a managed NAT service provided by GCP. It focuses on outbound NAT, giving GCE instances outbound Internet connectivity. It is deployed regionally, so we need to create at least one Cloud NAT instance per region. Unlike existing NAT services, Cloud NAT is embedded into the GCP Software Defined Network (SDN), does not rely on traffic passing through VM network interfaces and is highly scalable. Each public IP defined in Cloud NAT supports at most 64k TCP ports and 64k UDP ports. The default setting for each GCE instance is 64 ports, which might not be enough for GKE pods.

Therefore, we need to fine-tune the number of NAT IPs and the number of ports allocated per VM to find a good balance for GKE nodes.

For specific use cases, a secure project can also choose to use a dedicated Cloud NAT while participating in Shared VPC, which is a plus for our sensitive workloads.
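The balance between NAT IPs and ports per VM can be estimated with a rough Python sketch. It assumes 64,512 usable ports per NAT IP (65,536 minus the 1,024 well-known ports, which corresponds to the "64k" figure above) and a fixed port allocation per VM; the node count and the 1,024-port setting are hypothetical values.

```python
import math

# Assumption: usable source ports per NAT IP (65,536 minus the 1,024 well-known ports)
PORTS_PER_NAT_IP = 64512

def nat_ips_needed(vm_count, ports_per_vm):
    """Number of NAT IPs required to give every VM its fixed port allocation."""
    vms_per_ip = PORTS_PER_NAT_IP // ports_per_vm
    return math.ceil(vm_count / vms_per_ip)

# Hypothetical 1000-node cluster with the default 64 ports per VM
default_ips = nat_ips_needed(vm_count=1000, ports_per_vm=64)

# Same cluster with a hypothetical 1024 ports per node to serve many pods
tuned_ips = nat_ips_needed(vm_count=1000, ports_per_vm=1024)

print(default_ips, tuned_ips)  # 1 16
```

Raising the ports per VM gives the pods on each node more concurrent outbound connections, at the cost of more NAT IPs to reserve and manage.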