When developing a new service, you will want to decide on the following few things:

SLI: service level indicator — a binary value: is the criterion for a specific service met or not?

SLO: service level objective — your target for the SLI over time: what portion of the time do you want to meet the SLI?

SLA: service level agreement — a business agreement on top of the previously established metrics.

Each of these concepts involves different people from the organisation:

SLIs involve software engineers, site reliability engineers and product managers

SLOs involve site reliability engineers and product managers

SLAs involve sales and the customers

Naturally, things go wrong from time to time. This is where the concept of an ‘error budget’ comes in: how many failures can we still accept within the SLO for our service? This depends on the risk you can accept, which in turn determines your SLO (how many nines you want to offer depends on how critical your application is, how time-sensitive the delivery is, …). Once you go over your error budget, the development effort needs to shift from delivering new features to improving reliability and availability, until your error budget is replenished.
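To make the error budget concrete, here is a minimal sketch (the function name and the 30-day window are my own illustration, not from the session) that converts an SLO into an allowed-downtime budget:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for a given SLO over a window.

    The error budget is simply the complement of the SLO:
    a 99.9% SLO leaves 0.1% of the window as acceptable failure.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

# Three nines over a 30-day window leave roughly 43 minutes of downtime;
# each extra nine shrinks the budget by a factor of ten.
print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9999), 2))  # 4.32
```

This also shows why the number of nines should follow from how critical the application is: every additional nine cuts the room for failure tenfold.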

The end of the session went into ‘toil’: manual, repetitive work that is devoid of long-term value yet highly automatable, and when automating it is or isn’t worthwhile. In general, you want to automate as much as possible, but toil can also have advantages, and when considering automating something you should look at the ROI. For example, automating a job that needs to be done once a year, takes 15 minutes, and would take 20 hours to automate is not a good ROI; in that case you’d rather document the job and share the knowledge.
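The break-even reasoning behind that ROI call can be sketched in a few lines (the function name is mine, not the speaker’s):

```python
def automation_payback_years(task_minutes: float, runs_per_year: float,
                             automation_hours: float) -> float:
    """Years until the time invested in automating a task pays for itself."""
    saved_minutes_per_year = task_minutes * runs_per_year
    return (automation_hours * 60) / saved_minutes_per_year

# The example from the session: a 15-minute yearly job vs. 20 hours of
# automation work only breaks even after 80 years — document it instead.
print(automation_payback_years(15, 1, 20))   # 80.0
# A 15-minute daily job, by contrast, pays back in under a week.
print(round(automation_payback_years(15, 365, 20), 3))
```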

Find out more about SRE at google.com/sre or read the free books.

BigQuery GIS — A GeoVisual Exploration

This session was held in the DevZone and covered the datatypes and query capabilities in BigQuery related to geospatial data.

BigQuery GIS demo

Using latitude and longitude, you can describe a point on earth, but for describing more complex shapes like a polygon, BigQuery has a `GEOGRAPHY` datatype, supporting GeoJSON, WKT (well-known text) and WKB (well-known binary).

Combining this with the power of BigQuery — being able to cope with huge amounts of data in a massively parallel way — enables you to join and analyse data from a geospatial point of view, e.g. ask how many data points you have within range of another data point using native primitives in the WHERE clause of your SQL query.
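As an illustration of such a geospatial filter, here is a sketch that builds a query using BigQuery’s native `ST_DWITHIN` and `ST_GEOGPOINT` primitives — the table and column names are made up for the example, and actually running it requires the BigQuery Python client and credentials:

```python
def points_within_range_sql(table: str, lng: float, lat: float,
                            metres: float) -> str:
    """Build a BigQuery query counting rows whose GEOGRAPHY column `geo`
    lies within `metres` of a reference point (lng/lat in degrees)."""
    return f"""
        SELECT COUNT(*) AS nearby_points
        FROM `{table}`
        WHERE ST_DWITHIN(geo, ST_GEOGPOINT({lng}, {lat}), {metres})
    """

# Hypothetical dataset: count places within 1 km of central Brussels.
sql = points_within_range_sql("my_project.my_dataset.places", 4.35, 50.85, 1000)
# To execute (needs google-cloud-bigquery and credentials):
# from google.cloud import bigquery
# rows = bigquery.Client().query(sql).result()
print("ST_DWITHIN" in sql)  # True
```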

BigQuery Geo Viz enables you to quickly visualise the results of your query on a map, making the results more intuitive and business decisions easier.

A quick walkthrough can be found here.

Securing serverless by breaking in

This session was held in #DevZone as well.

Securing serverless

The session started with an introduction to serverless, framing it against other architectures in the cloud: monolithic (the cloud handles the hardware), containers (the cloud handles the VM), serverless (the cloud handles the container, …).

Even when deploying just a tiny app of 200 lines to a serverless runtime, you have to realise that, with all its dependencies, your app is probably a lot bigger in terms of lines of code. This raises the question: are all those lines actually secure as well?

Key takeaways were:

Check vulnerabilities within your dependencies. Your code might be fine, but someone else’s might not be!

Deploy granular functions and permissions.

Don’t rely on function ordering: you need to secure every function separately and not only the ones that get exposed.

Worry about all functions.

Don’t rely on immutability: assume servers can be reused.

Full slides here.

Meet the Authors — Go language

In this session (again in the DevZone — yeah, I hung out there a lot), the panel consisted of the team behind Go. The session was mostly structured as a Q&A with both prepared questions as well as questions from the audience, which made it hard to capture a lot of information, and unfortunately I haven’t found the recording.

Meet the Authors — Go language

What I was able to capture:

Why use Go? Easier to manage, easy to learn, very performant, good to use on cloud (Kubernetes is written in it). To sum it up: fast & fun.

What is new? Warming up for moving to Go 2! Listening to community input and contributions, both additions to and removals from the language, improvements to dependency management, checking and responding to errors, deciding on the inclusion of generics or not, and fewer things in the standard library.

What was the motivation for making Go? Do better than Java and C++, make a compact, performant language in which concurrency is easy to do.

Biggest challenge in Go? Saying no (to new features).

For a small intro to Go, check this article by Hackernoon on Medium.

How Twitter replicates Petabytes of Data to Google Cloud Storage

In this session, Lohit, Senior Staff Software Engineer at Twitter, went into the architecture of the Data Infrastructure for Analytics at Twitter. This infrastructure is mostly based on Hadoop clusters and records over 1.5 trillion events every day. Several features were introduced, among them the FileSystem abstraction, the Data Access Layer (DAL, containing metadata) and a front-end for exploring data sets (Eagle Eye).

GCP at Twitter

On top of ViewFS, Twitter built a replication service where the destination is responsible for replicating and syncing with the source. They decided to extend this service with replication to Google Cloud Storage in order to leverage Google Cloud’s data processing capabilities like BigQuery. In the process they relied heavily on the Google Storage Connector. The total move involved over 300 PB of storage.

More info on the move on this page.

Chaos: Breaking your systems to make the unbreakable

This session was about chaos and what chaos is about: systems are in a constant state of failure (it is not binary). The best way to avoid failure is to fail constantly. Failure is there to learn from.

So how do you practice chaos (without being a jerk)? You need to establish rules to keep it fun and educative.

1. Keep it short: 90 minutes should be enough.

Spend 30 minutes on planning:

Schedule it (when are you going to do it?)

Pick tests (what are you going to break?)

Write down what you expect to happen (what should happen?)

What will you do when things go wrong (what is the fallback plan?)

Share the document with the engineering organisation

50 minutes are allocated to playing (fun part — break things and see what happens):

Start in staging, run it in production (off-peak), later run it in production (primetime).

Announce that you will start in group chat.

Maintain discussion in group chat.

Monitor for outages.

Run your tests and take notes.

Add on 10 minutes for reporting:

Create tickets to track issues that need work

Write a summary & key lessons

E-mail to engineering

CELEBRATE!

2. Have a small team (usually 2 people)

Subject matter expert, the person that built the service

An SRE, who is an expert on keeping things up and running

(Optional: junior engineer or developer for mentoring and a fresh view on the systems)

3. What are the levels you want to play at — there are different options:

Level 0: Terminate service. Block access to 1 dependency

Level 1: Block all dependencies

Level 2: Terminate host

Level 3: Degrade environment (e.g. slow network, dropped packets, malformed information)

Level 4: Spike traffic (DDOS yourself)

Level 5: Terminate region/cloud: failover to other cloud or on-prem
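A level-0 experiment (blocking access to a single dependency and checking the planned fallback kicks in) can be sketched like this; the service, the fallback content and all names are toy stand-ins of my own, not from the demo:

```python
class DependencyBlocked(Exception):
    """Raised by the chaos wrapper instead of calling the real dependency."""

def fetch_recommendations(user_id: str) -> list:
    """Stand-in for a downstream recommendation service that the chaos
    experiment has blocked."""
    raise DependencyBlocked("chaos: recommendations blocked")

def homepage(user_id: str) -> list:
    """The behaviour under test: degrade gracefully when the dependency fails."""
    try:
        return fetch_recommendations(user_id)
    except DependencyBlocked:
        # Expectation written down in the planning phase:
        # the page still renders, with static fallback content.
        return ["editor-pick-1", "editor-pick-2"]

print(homepage("alice"))  # ['editor-pick-1', 'editor-pick-2']
```

The point of the exercise is exactly this comparison: write down what you expect before breaking things, then verify the system actually does it.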

The session ended with a demo of a chaos experiment in which the above concepts were applied — check it out here.

Advances in Stream Analytics

I covered this session quite extensively on Twitter — lots of exciting announcements and learnings from the Google Cloud Dataflow team, as well as from an Apache Beam deployment at Lyft:

Towards Zero Trust at GitLab.com

This session dealt with a topic that has been grabbing my attention lately and I was keen to learn more about: how modern companies do security.

Traditional companies have a hard-on-the-outside, soft-on-the-inside approach. As an industry, we know this does not work, but this is still how most businesses are set up.

Zero Trust: all devices and users that are trying to access an endpoint need to be authorized and authenticated to do so. All the decisions involved in this process are dynamic and risk-based.

It is not a product — it is a process. It is not new, and it is not often implemented before a major breach (only ~20% of cloud-native companies have implemented or started implementing zero trust).
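The ‘dynamic and risk-based decisions’ can be pictured as a small policy function — the signals and the threshold here are purely illustrative, not GitLab’s actual implementation:

```python
def allow_access(user_authenticated: bool, device_compliant: bool,
                 risk_score: float, max_risk: float = 0.5) -> bool:
    """Zero-trust style check: every request is evaluated on identity,
    device posture and a dynamic risk score — never on network location."""
    return user_authenticated and device_compliant and risk_score <= max_risk

print(allow_access(True, True, 0.2))   # True: known user, compliant device
print(allow_access(True, False, 0.2))  # False: unmanaged device is denied
print(allow_access(True, True, 0.9))   # False: risk too high for this request
```

In a real deployment the risk score would itself be dynamic (location, time, behaviour), which is what makes the decision risk-based rather than a static allow-list.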

What are the benefits:

Lateral movement is much harder (services are separate perimeters)

Stolen credentials are less valuable

Known vulnerabilities that are easy to exploit will be rarer

Non-targeted attacks have less value (resulting in higher cost for the attacker)

Before Gitlab.com embarked on their zero trust journey, they had a few things already in place: data classification policy, GCP security guidelines (enforced by Forseti), internal acceptable use policy in order not to rely on good intentions, and an HR system to know who sits in your organisation in order to give the appropriate access.

They then went into the 3 problems they solved on their journey to zero trust:

1. Managing User Identity and Access: answering a series of questions:

How do you verify endpoint integrity?

Is the person accessing data appropriate to the role?

How do you streamline onboarding/offboarding?

How do we minimize cred theft?

How are we enforcing our data classification policy?

2. Securing our applications:

Shift security to the left in the pipeline and merge requests by educating developers and scanning every commit.

Applying Binary Authorization: only deploying trusted container images, removing the human from the deploy process, and signing and annotating images during the CI phase.

Key management service

User and entity behaviour analytics

3. Securing our Infrastructure:

Vulnerability management: deploying patches in a timely manner

Who owns what asset: answered by the asset database

How to mitigate abusive activities?

How to make it harder for the attacker to move laterally?

Apply Google’s security best practices for GitLab.com

Enforcing policies in order to avoid having to rely on best intentions

How was the journey organised? Bucketise the different parts:

GitLab.com: infrastructure that handles customer data (centrally-managed)

Endpoints: user and employee laptops (individually-managed)

Backend infrastructure: 3rd party applications

Implementation across these buckets was done in parallel.

Lessons learned? It is an ongoing implementation; ordering matters (some implementations facilitate others); UX is important, as people need to be able to get their work done; automation is key to scale; and zero trust is personal to your company and your requirements.

Slides can be found here and I have also enjoyed reading through the papers Google has put on their website here.

Extracurricular activities

Of course, the Summit was a great opportunity to meet, learn from and hang out with people from all over the planet. A couple of highlights best captured in pictures:

GDE karaoke at the Community dinner

Beam Summit organisation in the after hours

Hanging out with (old and new) open-source friends!

Hanging out with Gwen Stefani

Visiting the Google San Francisco offices and making more Apache Beam friends

Beam meetup in the Community Corner at DevZone

Wrapping up

In case you are interested in learning more about other sessions and Google Cloud Platform in general, these are good resources to check out: