CONFERENCE SUMMARY

KubeCon San Diego Summary: Top Ten Takeaways (Part 2)

Day two operations, new architecture paradigms, and end users

In this second part of my KubeCon NA 2019 takeaways article (part 1 here), I’ll focus on the takeaways relating to the “day two” operational aspects of cloud native tech, new architecture paradigms, and the end user perspective of CNCF technologies.

As a reminder, here is the full top 10 list of key takeaways from KubeCon San Diego 2019:

1. Kubernetes needs a Ruby on Rails moment
2. Vendors have moved higher up the stack (a.k.a. build versus buy analysis is valuable)
3. The edge(s) are increasingly important
4. The cluster as a unit of deployment: cattle vs pets moving up the stack
5. Workflows may be diverging? GitOps vs UI-driven config
6. Reducing SRE toil is important, and directly impacts the bottom line, productivity, and fun
7. Multi-cloud is big business
8. “Deep systems” (microservices) create new problems in understandability, observability, and debuggability
9. Cloud platforms and tooling are embracing event-driven computing
10. The CNCF community is amazing, but the tech is challenging to navigate as a beginner

You can read my summary of takeaways 1–5 here, and I’ll cover takeaways 6–10 in this post. In the future, I may publish a further commentary on all of the takeaways combined.

6. Reducing SRE toil is important, and directly impacts the bottom line, productivity, and fun.

The role of “site reliability engineer” (SRE) has been around for a number of years, and was arguably popularised by the original Google SRE book and the accompanying workbook (and also the great follow-on Seeking SRE book). However, this is the first KubeCon where I talked to such a large number of people self-identifying as SREs at the Datawire booth, and also where I have seen SREs being marketed to en masse in the sponsor hall.

In fact, I think the concept of reducing SRE toil was the most popular sentiment I saw in the sponsor hall. I appreciate that there is some irony with me saying this, as at our booth we had the phrase “Reduce SRE toil and release faster” as a headline. It could also be easy to scoff at this as a vendor bias, but I believe this trend makes sense from an end user perspective too, as not only are the cloud native technologies maturing, but so are the operational responsibilities and specialisms (like SRE) associated with them. The industry is beginning to better understand the responsibilities and challenges associated with SRE.

This better understanding of the SRE role, combined with the arrival of the early majority and late majority adopters into the cloud native space, is encouraging everyone to think much more about the “day two” operational aspects of running platforms based on Kubernetes, such as backups, observability, and config change workflows. If left unchecked (or not considered properly), it is easy for much of this day two work to become toil — which impacts both the fun of working with this tech and the productivity of the operations/SRE team — which is why I think the focus of this year’s event has been on reducing toil. I think this is a good thing, and it’s also the sign of a maturing ecosystem.

On a related note, after chatting with several of the booth vendors and also many of my fellow attendees, I realised that there is some confusion (and conflation) around the roles of operations, platform engineers, and SREs. However, the common theme was that there is most definitely a need for specialists that work on the operational Kubernetes front lines who are responsible for keeping the platforms running. I heard several people praise the Certified Kubernetes Administrator exam for guiding their learning.

Another key takeaway in this space worth mentioning is the continued focus on providing “guardrails” and policy as code, with the Open Policy Agent (OPA) community at the vanguard. I mentioned that “policy as code is moving up the stack” in my KubeCon EU 2019 takeaways article. The subtle change in narrative I’ve seen over the past few months is that OPA is not about restricting what engineers can do per se; it’s about limiting the risks of the scariest possibilities, e.g. accidentally opening all ports or exposing databases to the Internet.
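To make the guardrail idea a little more concrete, here is a minimal Python sketch of a check that flags the two scary cases mentioned above. To be clear, this is not OPA: real OPA policies are written in Rego and evaluated by the OPA engine. The resource shape (`kind`, `name`, `ports`, `public`) and the function names are hypothetical simplifications for illustration only.

```python
from typing import Dict, List

# Hypothetical, simplified resource description. Real Kubernetes manifests
# (and real OPA policies, which are written in Rego) look different.
Resource = Dict[str, object]


def violations(resource: Resource) -> List[str]:
    """Return human-readable guardrail violations for one resource."""
    found = []
    name = resource.get("name", "<unnamed>")

    # Guardrail 1: a firewall rule must not open the entire port range.
    if resource.get("kind") == "FirewallRule" and resource.get("ports") == "all":
        found.append(f"{name}: opens all ports")

    # Guardrail 2: a database must not be reachable from the Internet.
    if resource.get("kind") == "Database" and resource.get("public", False):
        found.append(f"{name}: database exposed to the Internet")

    return found


def check(resources: List[Resource]) -> List[str]:
    """Collect violations across every resource in a proposed change."""
    return [v for r in resources for v in violations(r)]


proposed_change = [
    {"kind": "FirewallRule", "name": "allow-web", "ports": [80, 443]},
    {"kind": "FirewallRule", "name": "allow-everything", "ports": "all"},
    {"kind": "Database", "name": "orders-db", "public": True},
]

for problem in check(proposed_change):
    print("DENY", problem)
```

Note that the everyday change (`allow-web`) passes untouched; only the dangerous configurations are rejected, which is the spirit of the “guardrails, not gates” narrative.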

There were a lot of great OPA talks at this KubeCon, including “Applying Policy Throughout The Application Lifecycle with Open Policy Agent” by Gareth Rushgrove, an “OPA Deep Dive” by Tim Hinrichs and Torin Sandall, and a lightning talk “CRDs All the Way Down — Using OPA for Complex CRD Validation and Defaulting” by Puja Abbassi. Even the excellent (and very popular) talk “Kubernetes at Reddit: Tales from Production” by Greg Taylor referenced OPA, and Chris Hoge perfectly captured the sentiments of this in his tweet:

7. Multi-cloud is big business

Looking at the list of this year’s KubeCon diamond and platinum sponsors, it was easy to see that the large public cloud vendors are heavily invested in the CNCF ecosystem. This has been the case for a couple of years now, but what was interesting at this KubeCon was the focus of these organisations on multi-cloud and hybrid cloud. Fellow diamond sponsors VMware and Red Hat have been banging the drum of hybrid cloud for quite some time (alongside folks like HashiCorp, Rackspace, and Upbound), but it was noticeable this time that many of the cloud vendors were claiming to be the best vendor for multi-cloud. Companies appeared keen to sponsor anything related to this topic, and there was even a KubeCon Day Zero “multicloudcon” event run by GitLab and Upbound.

The conversations I had in the hallway track and at the Datawire booth led me to conclude that many small-medium enterprises and most large enterprises are using more than one cloud, but that they aren’t looking for full workload portability. Most of the related conversations consisted of folks explaining how they run their website workloads in one cloud and their data pipelines or machine learning jobs in another. They liked the idea of having disaster recovery/business continuity (DR/BC) plans that allowed for the failover of one cloud’s workloads to another, but admitted that in reality this wasn’t viable at the moment, primarily because running on multiple clouds multiplies the cost of learning, implementation, and maintenance.

With the exception of AWS and its Outposts offering (although this is all subject to change at the AWS re:Invent conference this week), both Google, with Anthos, and Azure, with Arc, appear to be betting on Kubernetes becoming the de facto multi-cloud deployment substrate.

Although the cloud vendors were largely talking about extending the private data center into the cloud (and vice versa) via the compute abstraction (e.g. managing VMs, containers, and k8s via a common cloud control plane), the Datawire team and I have been talking about the option of instead using a networking abstraction with common communication control planes. For example, the typical approach to application modernisation is often to package all of the existing applications in containers, then install Kubernetes, and finally deploy all of the containerised applications in a big bang switch over to this new platform. As an alternative, you can install a Kubernetes cluster and deploy a cloud native API gateway (and potentially a service mesh) here, and then route from this new target cloud environment to the existing applications, which may be running in VMs or on bare metal.

Once the new routing and transport layer has been implemented, this solution effectively provides location transparency for the old applications, and (if required) they can be incrementally containerised and migrated to Kubernetes with the API gateway and service mesh handling the underlying IP address and port changes. If you want more information on this idea, I shared my thoughts on it in a talk at the O’Reilly Software Architecture Conference in Berlin a few weeks back: “API Gateways and Service Meshes: Opening the Door to Application Modernisation”.
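The location transparency idea can be sketched in a few lines of Python. This is a toy model rather than how any particular API gateway is implemented: the route table, the addresses, and the `resolve` helper are all hypothetical. The point is that clients keep calling the same path while only the gateway’s mapping changes during a migration.

```python
from typing import Dict

# Hypothetical route table: public path prefix -> upstream address.
# Both upstreams start out as legacy deployments outside Kubernetes.
routes: Dict[str, str] = {
    "/billing": "10.0.4.12:8080",  # legacy app running on a VM
    "/catalog": "10.0.4.27:9000",  # legacy app running on bare metal
}


def resolve(path: str) -> str:
    """Pick the upstream for a request path using longest-prefix match."""
    matches = [p for p in routes if path == p or path.startswith(p + "/")]
    if not matches:
        raise LookupError(f"no route for {path}")
    return routes[max(matches, key=len)]


# Before migration, billing traffic goes to the VM.
before = resolve("/billing/invoices/42")

# Migrate billing into Kubernetes: only the gateway mapping changes,
# and the client-facing path stays exactly the same.
routes["/billing"] = "billing.default.svc.cluster.local:80"

after = resolve("/billing/invoices/42")
print(before, "->", after)
```

Because the catalog route is untouched, that application can stay where it is until it is ready to move, which is what makes the migration incremental rather than a big bang.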

One final topic within this takeaway that I wanted to share is that there appears to be a general consensus in this space that Kubernetes custom resource config and Custom Resource Definitions (CRDs) are the best way to handle configuration. If all parts of the stack are defined via custom resources (application deployment config, runtime traffic management, database deployments, etc.), then a single workflow and deployment pipeline can be used to deliver this into production.

8. “Deep systems” (microservices) create new problems in understandability, observability, and debuggability

I’ve been hearing some interesting buzz about “deep systems” for the past few months, primarily from Ben Sigelman and the Lightstep team. The key concept is that as a microservice-based system grows, it does so primarily by the addition of more services. Some of these services are exposed at the edge, but some are simply called by other services. This lengthens the chain of service calls needed to handle an end user’s request, and as these systems grow “deeper”, it becomes much more challenging to understand, observe, and debug them. In their keynote, “(Open)Telemetry Makes Observability Simple”, Sarah Novotny and Liz Fong-Jones explored this concept in more depth (including a very brave live coding demo in front of 12k people!), and talked about how the CNCF-hosted OpenTelemetry specification can help provide a common abstraction for observability with cloud native systems.

My Datawire colleague, Alex Gervais, and I chatted about this topic in the hallway track, and he noted that observability solutions and practices within a cloud-native microservice-based system are in reality still quite immature. He attended several end user sessions and came to the conclusion that observability tooling is currently quite intrusive for developers, and also hard to support and scale for operators. For example, distributed tracing offers many benefits in relation to observing a system, but the trade-off is that the tracing context has to be propagated (and coded accordingly) within each application along the request call chain, and any “baggage” (tracing and logging metadata) has to be collected reliably, and at scale, out of band by the underlying platform.
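To show why propagation is intrusive, here is a minimal Python sketch of what each service in the call chain has to do. As an assumption for illustration, it uses the `traceparent` header layout from the W3C Trace Context specification (version, trace id, parent span id, flags); the helper names are hypothetical, and a real application would normally rely on a tracing library instead of hand-rolling this.

```python
import secrets
from typing import Dict


def new_traceparent() -> str:
    """Start a trace: version 00, random trace id and span id, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the trace id, but record a fresh span id for this hop."""
    version, trace_id, _parent_span, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def call_downstream(incoming_headers: Dict[str, str]) -> Dict[str, str]:
    """Build the headers a service must attach to its outgoing request."""
    incoming = incoming_headers.get("traceparent") or new_traceparent()
    return {"traceparent": child_traceparent(incoming)}


# Simulate edge -> service A -> service B.
edge_headers = {"traceparent": new_traceparent()}
service_a_out = call_downstream(edge_headers)
service_b_out = call_downstream(service_a_out)

for hop in (edge_headers, service_a_out, service_b_out):
    print(hop["traceparent"])
```

Every hop shares the same trace id but carries its own span id. If a single service in the chain forgets to copy the header onto its outgoing calls, the trace is broken at that point, which is why this work lands on every development team.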

In regard to debugging, the Datawire team presented a number of sessions to help with this. Flynn presented “Building a Dev/Test Loop for a Kubernetes Edge Gateway with Envoy Proxy” and shared what he learned from building the Ambassador API gateway and establishing the correct balance between unit, integration, and end-to-end tests. Rafi Schloming and I presented “Introduction to Telepresence”, and outlined how this CNCF-hosted Kubernetes two-way proxy tool can be used to improve the build/debug inner development loop. Shout out to Nick Santos for tweeting from my session.

Due to the heavy rain causing electrical issues in some of the rooms in the conference venue, my colleague Abhay Saxena was promoted to the keynote stage for his talk “Use Your Favorite Developer Tools in Kubernetes With Telepresence”. Abhay walked through the debugging process with Telepresence with several applications deployed into Kubernetes, one written in Java, one in Node, and one in Python. The live demonstrations proved to be very popular, and Abhay also provided an early access preview to the new “edgectl” tool that supports multi-user service debugging with Telepresence, which will be released as part of the Ambassador Edge Stack. In the Q&A section of both my and Abhay’s talks, several folks commented that their minds had been blown with the new possibilities of debugging with Telepresence. Obviously I’m biased, but this could be a good cue to check out the tool if you haven’t before!

Also worth mentioning in the debugging space was the release of ephemeral containers in Kubernetes 1.16. Joe Elliot presented a fantastic session on this, “Debugging Live Applications the Kubernetes Way: From a Sidecar”.

Finally in relation to this takeaway, there was a lot of great discussion around understandability at this event. For me, the most interesting open source tool in this space was Octant, from VMware. Octant is a tool for developers to facilitate the understanding of how applications run on a Kubernetes cluster, and it provides a combination of introspective tooling, cluster navigation, and object management, along with a plugin system to further extend its capabilities.

I chatted to Bryan Liles about this in a recent InfoQ podcast, and he also gave the tool a shout out in the morning keynote.

9. Cloud platforms and tooling are embracing event-driven computing

Event-driven architecture (EDA) has been around for a long time, and the rise of new paradigms like function as a service (FaaS) and new distributed logging/messaging technologies like Apache Kafka have brought this approach back into the limelight in the cloud native space. This KubeCon definitely saw more attention being paid to event-driven and message-based systems, from the supporting infrastructure right through to the event format. For example, the CNCF graduation of the open source messaging system, NATS, was discussed in an opening day keynote.

CloudEvents, the CNCF-hosted specification for describing event data in a common way, was announced as a 1.0 release, and was discussed in detail in an extended introductory/deep dive session. There were also sessions showing CloudEvents integrated with Knative, and the Microsoft team announced first-class support for the spec with Azure Event Grid, alongside the existing support offered within Red Hat’s Event Flow and SAP’s Kyma platforms. Amazon were also talking about their EventBridge offering (which uses a proprietary event format), and TriggerMesh were demonstrating their EveryBridge offering, which is being pitched as part of their SaaS-based multi-cloud serverless management platform.
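For readers who haven’t seen the format, here is a minimal Python sketch of what “describing event data in a common way” looks like. It builds an event as a plain dictionary using the four required CloudEvents 1.0 context attributes (`specversion`, `id`, `source`, `type`) plus a few optional ones. The event type, source, and payload are made-up examples, and the helper functions are hypothetical; official SDKs exist and would be the better choice in production.

```python
import json
import uuid
from datetime import datetime, timezone

# The four context attributes that CloudEvents 1.0 marks as required.
REQUIRED = ("specversion", "id", "source", "type")


def make_cloudevent(event_type: str, source: str, data: dict) -> dict:
    """Wrap a payload in a CloudEvents 1.0 style envelope."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),  # optional attribute
        "datacontenttype": "application/json",           # optional attribute
        "data": data,
    }


def is_valid(event: dict) -> bool:
    """Check that every required attribute is a non-empty string."""
    return all(isinstance(event.get(k), str) and event[k] for k in REQUIRED)


# Hypothetical event type, source, and payload, for illustration only.
event = make_cloudevent(
    "com.example.order.created",
    "/shop/orders",
    {"orderId": "1234", "total": 42.5},
)

print(json.dumps(event, indent=2))
print("valid:", is_valid(event))
```

Because the envelope is the same regardless of the producer, a consumer can route on `type` and `source` without knowing anything about the payload, which is what makes the spec attractive to platforms such as those listed above.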

Microsoft also announced the 1.0 version of the Kubernetes-based event-driven autoscaling (KEDA) component, an open-source project that can run in a Kubernetes cluster to provide fine-grained autoscaling for every pod and application. KEDA also serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
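The general idea of event-driven autoscaling can be illustrated with a small Python sketch. This is not KEDA’s implementation or configuration format; the function and its parameters are hypothetical. It simply shows the kind of rule involved: derive a replica count from the size of an event backlog, clamp it to configured bounds, and allow scaling down to zero when there is nothing to process.

```python
import math


def desired_replicas(
    queue_length: int,
    target_per_replica: int,
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """Return how many replicas are needed to work through a backlog.

    Illustrative rule only: one replica per `target_per_replica` pending
    events, rounded up, then clamped to [min_replicas, max_replicas].
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for backlog in (0, 1, 12, 1000):
    print(backlog, "pending events ->", desired_replicas(backlog, 5), "replicas")
```

With a target of five events per replica, an empty queue needs no replicas at all, while a very large backlog is capped at the configured maximum.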

On a related note, in September the Linux Foundation announced that Alibaba, Lightbend, Netifi and Pivotal had joined forces to establish the Reactive Foundation, a new, neutral open source foundation to accelerate the availability of reactive programming specifications and software. I’m not sure if that means we’ll see more or less of reactive systems at future KubeCons.

10. The CNCF community is fantastic, but the tech is challenging to navigate as a beginner

As with all of the previous KubeCons, I was impressed with the community spirit. The hallway track has always been great here, but I also felt that this year more effort was made in this regard with the keynotes. Sure, there were a couple of vendor pitches, and perhaps an overly long keynote or two, but being able to recruit sponsors is vital for the success of the event and I know from personal experience of running conferences that getting the sponsor/editorial balance correct is very difficult. The standout keynotes for me were: Kelsey Hightower’s poignant reminder about the need for diversity, inclusion, and empathy within communities; Bryan Liles’ exploration of Kubernetes needing its Rails moment; and Ian Coldwater’s focus on thinking like an attacker when working with Kubernetes.

As I mentioned in part 1 of my KubeCon NA 2019 takeaways, there were a lot more folks that were new to the space attending the event this year, and although the collection of “101” style talks was no doubt useful, I did wonder if a dedicated introductory/newbie track would have been beneficial (and I saw several other folks like Liz Rice asking the same question via Twitter).

For many of us that have been involved in the cloud native space since day one, all of the technology and terminology implicitly makes sense to us now, and it can be difficult to see just how complicated all of this is for someone new to the space. However, you only need to cast your mind back to when you began this journey and had questions related to topics like: Kubernetes Ingress vs load balancers, annotations vs CRDs, resource requests and limits, persistent claims and volumes, debugging loops when using containers etc. There are so many ways that folks can get involved with the CNCF community. I invite you to help new folks out as best you can.

Wrapping Up Part 2

This second part of this KubeCon NA 2019 summary has focused on the “day two” operational aspects of cloud native tech, new architecture paradigms, and the end user perspective of CNCF technologies.

There was clearly a lot more attention paid by the program committee, attendees, and vendors to the “day two” issues of operating Kubernetes at this event, which for me is a sign of a maturing understanding of the development lifecycle and market. The topic of event-driven architecture was also much more of a focus, both from a software development and infrastructure perspective, which I believe is another sign of a maturing ecosystem; the true promise of cloud native computing can’t be realised without the supporting software architectures, and microservices alone don’t provide the elasticity and scalability required. And finally, more effort is going to be needed to help the influx of folks that are new to the cloud native space. Joining this ecosystem can be daunting, and I believe that it’s the responsibility of folks that have been in this space for some time to step up and welcome new engineers and the rest of the community.

If you missed part 1 of my KubeCon Summary, check it out here. You can learn more about the Ambassador API gateway at www.getambassador.io and you can sign up to get notified of the upcoming release of the Ambassador Edge Stack.

If you have any questions, please join the team in the Datawire OSS Slack.