A piece of history

It is almost 2020 and the days when microservices were the hottest new trend are long gone. The dust has settled and most companies have already formed strong opinions about them. Some completely ignored the fad, calling it the rebirth of SOA, and the rest jumped on the bandwagon. Those who went all in rapidly faced the consequences. Turns out it was not a silver bullet (surprise, huh?) and without thoughtful decisions and careful design, companies ended up with horrible, messy and unstable applications, often called distributed monoliths. Most hype articles did not mention challenges like versioning your services or managing dependencies between them. They skipped the parts describing how services should communicate and how deployments should be performed. Without any guidelines, an inexperienced team in a startup environment could swiftly lose its development velocity and waste countless hours figuring out how to ship the thing without breaking it and which version of service A works with service B.

And worst of all, from the programmer’s perspective it did not really change the everyday struggle much. It was the same old programming. It did not differ at all from your friendly neighborhood greenfield monolith project. You just called external services more frequently and used more queues (if you knew how). Maybe you could bump the version of your framework more often, convince your tech lead to write a tiny part of an app in a language you’ve always wanted to try out and put a big “yes, we use microservices” banner on your company’s landing page, but that was pretty much it.

Don’t get me wrong – my goal is not to demonize microservices. They have their uses and many products could not work properly without this pattern. After all, why would the industry constantly sound off about them (…right?). Well, with the rising complexity of microservice systems people had to tackle many new problems that popped up, with internal communication between your applications among them. Sooner or later (and usually sooner) you needed to call another internal service. Of course, it worked perfectly most of the time, as performing HTTP requests was bread and butter to most of us. However, a big issue occurred when it didn’t. Suddenly, your entire stack of independent services malfunctioned just because a tiny part of it was being redeployed. You ended up with unhappy customers (because your application did not work), unhappy product owners (feeling cheated about the independence of your services) and unhappy programmers (tracing which request triggered the issue was often not trivial). All that unhappiness just because of one 500 status code.

Maybe we should treat internal communication differently, they asked. What if we think different? Perhaps we could retry our calls? Can we load balance the traffic internally? What if there are various services ready to answer the call? Eureka! Let’s write a piece of code that solves this.(*)

Companies that pioneered such architectures quickly came up with a bunch of solutions (for example Hystrix, Eureka, Zipkin or Ribbon in JVM-heavy environments). And so, time flew by and followers added these fancy new tools to their stacks. With proper usage of specialized frameworks, applications could finally discover and talk to each other in a safe way.

Unfortunately, that added a big bundle of new technologies to the mix. If you worked in a large organization this wasn’t much of a problem. A specialized team taught or maintained the “network side” of the code and everything worked just fine. But not all of us could afford such an expense. It also required engineers to adapt straight away, and without proper abstraction, your pure business logic could suddenly get mixed up with load balancing logic. Not a very pleasant situation to be in. Fortunately, new technologies began to rise – with the growing interest in cloud, containers and container orchestration, a new solution was born.

(*) – I have no proof such a conversation actually happened, but it is very likely.

Enter the service mesh

A brilliant thought came to someone’s mind (*) – business applications should not be bothered by all the issues that can occur inside your network. They should act as if they called an external service somewhere on the web. The network layer, however, should intercept the call and direct it to an appropriate, healthy instance of the receiver. If there is more than one receiver, it should split the traffic between them evenly. If a request times out, it should retry it. When an instance is booting and warming up, it should make sure the instance is not flooded with thousands of requests at once (which wouldn’t differ much from a DDoS attack). Basically, it should try to keep the internal network as stable as possible.

The implementation is pretty straightforward – you add a network proxy (often called a sidecar) to every application running in your stack (data plane) and connect all those proxies to the control plane. The control plane specifies all routing, retry or load balancing rules and keeps proxies up to date. That concept is known as a service mesh. It is a simple and ingenious idea.

The main benefit is that you no longer need to clutter the application with code that registers the instance with service discovery or load balances the internal traffic. Most networking stuff is done at the network level. Your developers can once again focus on the business code.

You can also specify a retry policy – if a single HTTP call to some service fails, we could retry it once or twice after a few seconds, transparently to the caller. Throttling policies, on the other hand, make sure that booting instances are not instantly flooded with requests and have some time to warm up.

Applications often terminate SSL on the load balancer. However, in some security-sensitive cases, you still want to encrypt the communication that takes place inside your network. A service mesh may do exactly that, without any modifications to the services’ code.

Using a service mesh enables you to do some routing magic. You can tweak the routing live, changing the communication patterns without touching the microservices. Also, if you plan to perform A/B testing with different application versions (think of rolling out a new feature but only to 10% of the overall traffic) or route requests to different services based on HTTP headers (for example, in a multi-tenant application), then such tasks are trivial to implement when your applications live inside the service mesh.

Another perk of the service mesh is smart deployments. After all, A/B testing is no different from the very popular canary deployment – just keep shifting the percentages from service A to service B until service B serves all the traffic. Or run a new version of your application, wait until it boots and then move the traffic there. Congratulations, you just performed a blue/green deployment with tiny effort.

Lastly, some service meshes allow you to perform distributed tracing. It is often a challenge to deduce what actually caused an error deep (architecture-wise) inside your stack. A service has been called by another service, which also has been called by another service and so on. If you generate an arbitrary request-id after receiving the call and pass it around between services (just like a passport getting stamped whenever you visit different countries), then tracing the entire request’s path is a breeze. You also get information on how much time each part of the entire request took, which can prove useful in identifying bottlenecks and overused paths.

Figure: an example from the Linkerd documentation. You can clearly see how much time a request (well, the 99th percentile) “spends” in each of the applications.

(*) – I have no proof such a thought actually occurred to anyone, but it is very likely.

Introducing AWS App Mesh

Amazon, the leader of the cloud industry, quickly acknowledged the idea and introduced its own solution at re:Invent 2018 – AWS App Mesh. It uses the open source Envoy proxy, integrates with AWS Cloud Map for service discovery and manages the control plane for you. AWS App Mesh is integrated with three AWS compute services – EC2, ECS (Fargate/EC2) and EKS – as well as AWS X-Ray for distributed tracing.

There are four basic building blocks in App Mesh: Mesh, Virtual Service, Virtual Node and Virtual Router. For our purposes, we’ll use the example of a tiny setup with three applications – a gateway, an order service and a pricing_server service.

Mesh is just a logical grouping of all your microservices. Your production environment will have its own mesh, and so will your preprod and QA environments. You can also specify at the Mesh level whether your services may communicate with the outside world (by setting the egress filter to ALLOW_ALL) or not (DROP_ALL).

Virtual Service – practically speaking, this is an abstract definition of your microservice. By defining three different Virtual Services you tell AWS:

I will have a gateway, pricing_server, and order in my stack.

Each of them will be available internally at http://<<service_name>>.<<route53.private_namespace>>, as long as you allow other services to connect to them.

Then you pick an implementation for each Virtual Service. You can provide one directly, using a Virtual Node that describes how a given Virtual Service is implemented:

My gateway Virtual Service containers listen on port 80 (listener). You should check their health by pinging /health every 10 seconds (healthcheck). Every container implementing this Virtual Node will register in Route53 private namespace.foo as gateway (service discovery). It will communicate with other Virtual Services: pricing_server and order (backends).

or indirectly, using Virtual Router – it enables you to balance the traffic between various Virtual Nodes.

When pricing_server Virtual Service is called, route 90% of the traffic to Virtual Node pricing_server:1.0.0 and the rest to pricing_server:2.0.0.

One Virtual Router consists of many Routes. Every Route can match prefixes or HTTP headers.
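
To make the two routing ideas above more concrete, here is a minimal Terraform sketch (Terraform is covered in the next section). It is an illustration under assumptions, not a drop-in configuration: the router `aws_appmesh_virtual_router.pricing_server`, the three virtual nodes, the `x-tenant-id` header and the tenant name `acme` are all hypothetical and assumed to be defined elsewhere.

```hcl
# Hypothetical sketch: the router and virtual nodes referenced below are
# assumed to exist elsewhere in the same Terraform configuration.

# Canary split: 90% of pricing_server traffic goes to 1.0.0, 10% to 2.0.0.
resource "aws_appmesh_route" "pricing_server_canary" {
  mesh_name           = var.mesh_name
  name                = "pricing_server_canary"
  virtual_router_name = aws_appmesh_virtual_router.pricing_server.name

  spec {
    # Matched after the tenant route below (lower numbers are matched first).
    priority = 10

    http_route {
      match {
        prefix = "/"
      }

      action {
        weighted_target {
          virtual_node = aws_appmesh_virtual_node.pricing_server_v1.name
          weight       = 90
        }

        weighted_target {
          virtual_node = aws_appmesh_virtual_node.pricing_server_v2.name
          weight       = 10
        }
      }
    }
  }
}

# Header-based routing: requests carrying "x-tenant-id: acme" (an assumed
# header) always go to a dedicated virtual node.
resource "aws_appmesh_route" "pricing_server_tenant" {
  mesh_name           = var.mesh_name
  name                = "pricing_server_tenant_acme"
  virtual_router_name = aws_appmesh_virtual_router.pricing_server.name

  spec {
    priority = 1

    http_route {
      match {
        prefix = "/"

        header {
          name = "x-tenant-id"

          match {
            exact = "acme"
          }
        }
      }

      action {
        weighted_target {
          virtual_node = aws_appmesh_virtual_node.pricing_server_acme.name
          weight       = 100
        }
      }
    }
  }
}
```

Shifting a canary forward is then just a matter of changing the two weights and applying again.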

AWS App Mesh with Terraform

In 2019, Terraform finally caught up with all the pieces you need to implement AWS App Mesh in your infrastructure. We are running it in production with Terraform, and so can you.

Defining a mesh is a breeze:

```hcl
resource "aws_appmesh_mesh" "mesh" {
  name = "${var.env_name}_mesh"

  spec {
    egress_filter {
      type = "ALLOW_ALL"
    }
  }
}
```

…and you can quickly create an abstract module of a simple Virtual Service -> Virtual Router -> Route -> Virtual Node connection to use in your ECS Services:

```hcl
# The Virtual Service is the name other services call.
resource "aws_appmesh_virtual_service" "virtual_service" {
  mesh_name = var.mesh_name
  name      = "${var.service_name}.${var.namespace_name}"

  spec {
    provider {
      virtual_router {
        virtual_router_name = aws_appmesh_virtual_router.virtual_router.name
      }
    }
  }
}

resource "aws_appmesh_virtual_router" "virtual_router" {
  mesh_name = var.mesh_name
  name      = "${var.service_name}_router"

  spec {
    listener {
      port_mapping {
        port     = var.container_port
        protocol = "http"
      }
    }
  }
}

# A single route sending all traffic to one Virtual Node.
resource "aws_appmesh_route" "route_to_virtual_nodes" {
  mesh_name           = var.mesh_name
  name                = "${var.service_name}_route"
  virtual_router_name = aws_appmesh_virtual_router.virtual_router.name

  spec {
    http_route {
      match {
        prefix = "/"
      }

      action {
        weighted_target {
          virtual_node = aws_appmesh_virtual_node.node.name
          weight       = 100
        }
      }
    }
  }
}

resource "aws_appmesh_virtual_node" "node" {
  mesh_name = var.mesh_name
  name      = var.virtual_node_name

  spec {
    # One backend per Virtual Service this node is allowed to call.
    dynamic "backend" {
      for_each = var.services_to_connect_with

      content {
        virtual_service {
          virtual_service_name = "${backend.value}.${var.namespace_name}"
        }
      }
    }

    listener {
      port_mapping {
        port     = var.container_port
        protocol = "http"
      }

      health_check {
        healthy_threshold   = 2
        interval_millis     = 15000
        protocol            = "http"
        path                = var.healthcheck_path
        timeout_millis      = 2000
        unhealthy_threshold = 5
      }
    }

    service_discovery {
      aws_cloud_map {
        namespace_name = var.namespace_name
        service_name   = var.ecs_service_discovery_name
      }
    }
  }
}

variable "mesh_name" {}
variable "namespace_name" {}
variable "service_name" {}
variable "virtual_node_name" {}
variable "ecs_service_discovery_name" {}
variable "healthcheck_path" {}
variable "container_port" {}

variable "services_to_connect_with" {
  type = list(string)
}
```

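The route in the module above does not configure any retries. As a rough sketch of the retry policy described earlier, the route could be swapped for a variant like the one below. It assumes an AWS provider version that supports the `retry_policy` block; the resource label `route_with_retries` and the chosen numbers are illustrative, not values we recommend.

```hcl
# Hypothetical variant of the module's route with a retry policy added.
# It would replace "route_to_virtual_nodes", not live next to it.
resource "aws_appmesh_route" "route_with_retries" {
  mesh_name           = var.mesh_name
  name                = "${var.service_name}_route"
  virtual_router_name = aws_appmesh_virtual_router.virtual_router.name

  spec {
    http_route {
      match {
        prefix = "/"
      }

      retry_policy {
        # Retry on 5xx responses and on gateway errors (502, 503, 504).
        http_retry_events = ["server-error", "gateway-error"]

        # Also retry when the connection to the target fails.
        tcp_retry_events = ["connection-error"]

        # At most two additional attempts after the original request.
        max_retries = 2

        # Each attempt is given up on after two seconds.
        per_retry_timeout {
          unit  = "s"
          value = 2
        }
      }

      action {
        weighted_target {
          virtual_node = aws_appmesh_virtual_node.node.name
          weight       = 100
        }
      }
    }
  }
}
```

Keep in mind that retries are only safe for requests the target can handle more than once, so this is best applied per route rather than blindly everywhere.
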
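If you run it on AWS ECS, remember to attach the policy arn:aws:iam::aws:policy/AWSAppMeshEnvoyAccess to the IAM role the Envoy container runs with. Envoy calls App Mesh using the task’s own credentials, so this is normally the task role rather than the Task Execution Role.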

ECS Task definition requires correct proxy configuration:

```hcl
resource "aws_ecs_task_definition" "task_definition" {
  [...]
  network_mode = "awsvpc"

  # Routes the task's traffic through the Envoy sidecar.
  proxy_configuration {
    container_name = "envoy"
    type           = "APPMESH"

    properties = {
      AppPorts         = var.container_port
      EgressIgnoredIPs = "169.254.170.2,169.254.169.254"
      IgnoredUID       = "1337"
      ProxyEgressPort  = 15001
      ProxyIngressPort = 15000
    }
  }
}
```

…and you must add a sidecar in the containers definition:

```json
[
  {
    [...] ###### your application container stuff [...],
    "dependsOn": [{
      "containerName": "envoy",
      "condition": "HEALTHY"
    }]
  },
  {
    "name": "envoy",
    ###### this varies depending on your region - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/appmesh-getting-started.html
    "image": "840364872350.dkr.ecr.eu-west-1.amazonaws.com/aws-appmesh-envoy:v1.11.2.0-prod",
    "essential": true,
    "memoryReservation": 256,
    "environment": [
      {
        "name": "APPMESH_VIRTUAL_NODE_NAME",
        "value": "mesh/${mesh_name}/virtualNode/${virtual_node_name}"
      }
    ],
    "healthCheck": {
      "command": [
        "CMD-SHELL",
        "curl -s http://localhost:9901/server_info | grep state | grep -q LIVE"
      ],
      "startPeriod": 10,
      "interval": 5,
      "timeout": 2,
      "retries": 3
    },
    "portMappings": [
      {
        "containerPort": 9901,
        "protocol": "tcp"
      },
      {
        "containerPort": 15000,
        "protocol": "tcp"
      },
      {
        "containerPort": 15001,
        "protocol": "tcp"
      }
    ],
    "user": "1337",
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "${cloudwatch_logs_group_name}",
        "awslogs-region": "${cloudwatch_logs_region}",
        "awslogs-stream-prefix": "envoy${service_name}"
      }
    },
    "ulimits": [
      {
        "softLimit": 15000,
        "hardLimit": 15000,
        "name": "nofile"
      }
    ]
  }
]
```

When your stack is correctly modularized with Terraform, adding the above snippets and tweaking them to your needs should be fairly easy.

Things to consider

Like most new AWS products, though, App Mesh has its limits (mostly caused by the immaturity of the product), which you need to be aware of before you try to implement it.

AWS ENI limits

If your application runs on AWS ECS with EC2, then it is mandatory for tasks to use the awsvpc networking mode – but then each task consumes one network interface and, as you may know, these are fairly limited. Soon after releasing App Mesh, Amazon increased the limits to a fairly decent amount, but only for the a, p, c, r and m families. This may be problematic if your cluster’s instances are not on the list (yes, burstable instances like t2 and t3 are not there and unfortunately, due to some internal architecture decisions, AWS has no plans to increase the limits for them). You can run only two ECS tasks on a t3.small instance (it has three network interfaces and the instance itself uses one). You might have to add more instances to the cluster and pack your tasks more loosely.

The problem vanishes on Fargate (but as you may know, it’s a pretty pricey service) and impacts EKS only to a lesser extent (the Amazon VPC CNI plugin limits you to (IPv4 Addresses per Interface * Maximum Network Interfaces – 1) pods per instance, as stated in its documentation – running at most 11 containers on t3.small sounds like enough to me).
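
The arithmetic behind those two numbers can be written down as a small, self-contained Terraform snippet. This is only a worked example: the local and output names are made up for illustration, and the instance figures (3 network interfaces with 4 IPv4 addresses each for a t3.small) are hardcoded assumptions taken from the EC2 instance type limits rather than looked up dynamically.

```hcl
# Worked example for a t3.small: 3 network interfaces, 4 IPv4 addresses each.
locals {
  max_network_interfaces = 3
  ipv4_addresses_per_eni = 4

  # ECS in awsvpc mode: one interface per task, and the instance keeps one.
  max_ecs_awsvpc_tasks = local.max_network_interfaces - 1

  # EKS with the Amazon VPC CNI plugin, using the formula quoted above.
  max_eks_pods = local.ipv4_addresses_per_eni * local.max_network_interfaces - 1
}

output "max_ecs_awsvpc_tasks" {
  value = local.max_ecs_awsvpc_tasks # 2
}

output "max_eks_pods" {
  value = local.max_eks_pods # 11
}
```

Plugging in the limits of your own instance type gives a quick estimate of how densely you can pack tasks before committing to App Mesh on ECS with EC2.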

Ingress traffic

Another problem arises when you want to handle traffic from the outside world. You cannot connect your virtual App Mesh components directly to either ALB or API Gateway. This means that if you want to perform true canary deployments, route ingress traffic from the web smartly and utilize 100% of your mesh capabilities, you have to direct your traffic first to a gateway of some kind (HAProxy / nginx / zuul / kong / whatever). As you might have noticed, both the AWS hello-world-app-mesh and my example have a gateway – it is often overlooked at first, but this is how your infrastructure should look as well.

Fortunately, it seems like there are plans for ingress in AWS API Gateway, but you never know if they will implement it next week or next decade. ALB is dead silent though, which is a shame, as ALB is a pretty basic AWS building block.

As of July 2020 this is no longer true and App Mesh Ingress is finally here!

Egress traffic

Well, if your mesh is set to DROP_ALL then it does exactly that – every service can connect only to the virtual services specified as its backends. All “unspecified” HTTP routes will fail with a 404 code from Envoy. This is troublesome, as our applications often connect to external APIs, databases and so on. You can drop the DROP_ALL and open the mesh to the outside world, or define a Virtual Service implemented by a Virtual Node with DNS service discovery and hardcode the URL there (but this is pretty hacky).
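
For illustration, the DNS-based workaround could be sketched in Terraform roughly as follows. Everything specific here is an assumption: `api.example.com` stands in for whatever external host you call, the resource labels are made up, and the `tcp` listener on port 443 is just one way to pass HTTPS traffic through untouched.

```hcl
# Hypothetical sketch: expose an external API to a DROP_ALL mesh.
# "api.example.com" is a placeholder for the real external hostname.
resource "aws_appmesh_virtual_node" "external_api" {
  mesh_name = var.mesh_name
  name      = "external_api"

  spec {
    listener {
      port_mapping {
        port     = 443
        protocol = "tcp"
      }
    }

    # Resolve the target through plain DNS instead of AWS Cloud Map.
    service_discovery {
      dns {
        hostname = "api.example.com"
      }
    }
  }
}

# The Virtual Service name is what callers must list as a backend.
resource "aws_appmesh_virtual_service" "external_api" {
  mesh_name = var.mesh_name
  name      = "api.example.com"

  spec {
    provider {
      virtual_node {
        virtual_node_name = aws_appmesh_virtual_node.external_api.name
      }
    }
  }
}
```

Every Virtual Node that needs the external API then has to declare this Virtual Service among its backends, which is exactly why the approach gets tedious once the list of external dependencies grows.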

This is a pretty young AWS product

AWS tends to ship its toys rapidly and many people are scared of using new AWS products, as they might be unstable in the beginning. We haven’t noticed any issues so far, but you need to be aware of the overall novelty of this technology. You also need to keep an eye on which service mesh features you actually need in your stack. Not all of those mentioned before are included in AWS App Mesh. Its feature set may feel lackluster for people coming from Consul or Istio. Amazon adds new stuff every month though, as seen on the roadmap.

Which mesh should I choose?

AWS App Mesh is just one of the various implementations of service meshes, all requiring different skillsets and using different technology stacks.

If your company is Kubernetes-heavy, then Linkerd is the way to go. This is probably the most popular mesh, with all the capabilities you need and Helm charts available.

If your company is HashiCorp-heavy, though, then you could have a great time utilizing Consul, which is also an all-in-one solution. It has great integration with other HashiCorp tools like Nomad or Vault and has first-class support for the Envoy proxy. If your DevOps team uses HashiCorp widely then they can easily adopt Consul.

You may also opt to use Istio with Envoy, which arguably has even more functionalities than Linkerd or Consul.

All three of the technologies mentioned (well, four with Envoy) are actively developed and widely used. All of them are pretty much feature complete and you can’t go wrong with any of them. However, when you already ship to Amazon Web Services and your app runs on AWS ECS or AWS EKS then you should definitely consider AWS App Mesh instead.

As you may have noticed, AWS App Mesh lacks some features for now. Its strength, however, comes from the ease of implementation and low upkeep cost.

A quick victory with few downsides? I have seen that before!

Of course, with great power comes great responsibility – if we mindlessly replace Hystrix with Istio, Consul or App Mesh we may end up in a worse place than before. There are various things to consider before diving deep into the ocean of service meshes.

With a service mesh in place, a single misconfiguration may render your applications useless (thankfully, reverting it usually propagates almost instantly). Worst of all, people may not be aware that the 404 or 503 status code they receive is not coming from the called application but from the service mesh. That can and will waste your engineering team’s precious time in the beginning.

This is yet another layer of abstraction. Whenever you add a new layer, you should ask yourself if it is really worth the effort. Too much abstraction may be unhealthy, too.

Depending on the service mesh implementation, you might have to run and maintain additional containers and ensure that they are highly available at all times. This is not really a challenge when you can add servers dynamically in a cloud environment, but it still requires some attention from the DevOps team.

Your billing team might also pay attention – cloud costs might increase. Even though the proxies are usually very lightweight, they still require a couple hundred megabytes of RAM. When you run hundreds of them (each container with your app has its own proxy) it adds up to a significant amount.

Do I really need to bother?

Yes. No. Well, maybe. This is certainly a sweet piece of technology and I guess there is a reason why many application-level technologies that did similar things are now in maintenance-only mode. People are moving on from a solution that added too much complexity inside the applications to one that can be tamed far more easily outside of them, while also being agnostic of programming languages.

However, if your infrastructure is rather small and consists only of several microservices then maybe it is too early to implement it. If you haven’t encountered any problems with service-to-service communication then maybe you don’t need it at all. If you are in the monolithic (beautiful!) world, then you most likely won’t benefit that much from an additional abstraction burden. But if you constantly yelled “Damn, that is exactly the problem in our stack!” in the process of reading this monologue (and you use AWS widely), then definitely consider App Mesh to be your light introduction course into this idea. Its set of features can be enough for many – and not having to manage the control plane is indeed a killer feature.