The command in the popup, when copied and pasted into the command line, would create a new Kubernetes pod that reruns the job. The catch is that only those with access to Prow’s underlying Kubernetes cluster could run the command, and that’s not many people. The standard protocol became sending the admins a Slack DM whenever you wanted a job rerun, which was annoying for everyone involved.

My Project

This is where my project came in! My task was to make a rerun button that, when clicked, directly triggered the job. Seems easy, right?

Well, I couldn’t just let anyone rerun jobs willy-nilly. A malicious actor could rerun tons of jobs and rack up a huge cloud bill or even DoS the system. An inexperienced user could rerun an old deployment job and push a stale image to production. This meant I needed to design a way to identify the user and decide whether to let them rerun the job in question.

Authenticating the User

Other parts of Prow already used GitHub OAuth, which made it a natural choice for my authentication system.

When an unauthenticated user attempts to rerun a job, we redirect them to GitHub’s login flow. After they successfully log in, GitHub redirects them back to Prow’s homepage and returns an access token to the server, which we store securely as a cookie.

This access token allows us to make requests to GitHub on behalf of the user, namely a request for their username. Storing the token rather than the username itself makes this system robust; in order to impersonate someone, an adversary would have to trick our server into storing that person’s access token, which is itself difficult to obtain.

The GitHub username provides a unique identifier for the user, which is exactly what we needed.
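To make the lookup concrete, here is a minimal Go sketch of turning a stored access token into a username by calling GitHub's REST endpoint for the authenticated user. The function names (`fetchLogin`, `loginFromUserResponse`) are my own for illustration and are not taken from Prow's code; the sketch also assumes the public api.github.com host rather than a GitHub Enterprise instance.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
)

// githubAPI is the public GitHub REST host (an assumption; a GitHub
// Enterprise instance would use a different base URL).
const githubAPI = "https://api.github.com"

// loginFromUserResponse extracts the "login" field from the JSON body
// returned by GitHub's "get the authenticated user" endpoint.
func loginFromUserResponse(body []byte) (string, error) {
	var u struct {
		Login string `json:"login"`
	}
	if err := json.Unmarshal(body, &u); err != nil {
		return "", fmt.Errorf("decoding user response: %w", err)
	}
	if u.Login == "" {
		return "", errors.New("user response has no login")
	}
	return u.Login, nil
}

// fetchLogin (hypothetical helper) asks GitHub who owns the given token.
func fetchLogin(ctx context.Context, client *http.Client, token string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, githubAPI+"/user", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Accept", "application/vnd.github+json")

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("GitHub returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return loginFromUserResponse(body)
}

func main() {
	// Parse a canned response so the example runs without network access.
	login, err := loginFromUserResponse([]byte(`{"login":"octocat","id":1}`))
	fmt.Println(login, err)
}
```

Splitting the parsing from the HTTP call keeps the part that is easy to get wrong (decoding and validating the response) testable without touching the network.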

Setting up Permissions

Now that we could determine who wanted to rerun a job, we needed a way to decide whether we wanted to let them.

This meant implementing a permissions system that was easy to configure but also provided the desired specificity, which raised the question: what was the desired specificity?

As someone brand new to Prow who had never needed to rerun a job, I wasn’t yet qualified to answer this. I needed to talk to seasoned developers, and fortunately since Prow is open source, there were many!

After discussing with my team, drafting a design doc, publicizing it via Slack for community review, coming up with a couple of final proposals, and presenting them in a community meeting, I decided on the following configuration system:

Global whitelist of users/groups who can rerun all jobs

Per-job lists of users/groups who can rerun that specific job

Optionally allow everyone to rerun a specific job, or all jobs

Kubernetes Prow has an on-call rotation of admins who solve a wide range of issues across the system. To do this effectively, they need to be able to rerun all jobs, necessitating the global whitelist.

Kubernetes has a number of different subprojects, whose developers should be able to rerun jobs relevant to them, but not all jobs. This motivated the per-job lists.

Finally, there are some private instances of Prow that only trusted users can access. In these, anyone who can reach the rerun button should be able to use it. If the allow-everyone option is enabled globally, we bypass GitHub OAuth entirely, making setup simple. Allowing everyone is also an option for individual jobs, which suits public instances with a few particularly harmless jobs.
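The three rules above can be sketched as a small Go decision function. This is only an illustration under my own assumptions: the type and field names (`RerunAuthConfig`, `AllowAnyone`, `GitHubUsers`, `CanRerun`) are hypothetical rather than Prow's actual configuration schema, and groups are left out to keep the example short.

```go
package main

import (
	"fmt"
	"strings"
)

// RerunAuthConfig is a hypothetical shape for one set of rerun rules.
type RerunAuthConfig struct {
	AllowAnyone bool     // let everyone rerun, no login required
	GitHubUsers []string // usernames explicitly allowed
}

// Config pairs the global rules with optional per-job rules.
type Config struct {
	Global RerunAuthConfig
	PerJob map[string]RerunAuthConfig // keyed by job name
}

// containsUser compares case-insensitively, since GitHub usernames
// are not case-sensitive.
func containsUser(list []string, user string) bool {
	for _, u := range list {
		if strings.EqualFold(u, user) {
			return true
		}
	}
	return false
}

// CanRerun reports whether user may rerun job. An empty user means the
// requester is not authenticated.
func (c Config) CanRerun(user, job string) bool {
	jobCfg := c.PerJob[job] // zero value if the job has no entry
	if c.Global.AllowAnyone || jobCfg.AllowAnyone {
		return true
	}
	if user == "" {
		return false
	}
	return containsUser(c.Global.GitHubUsers, user) ||
		containsUser(jobCfg.GitHubUsers, user)
}

func main() {
	cfg := Config{
		Global: RerunAuthConfig{GitHubUsers: []string{"oncall-admin"}},
		PerJob: map[string]RerunAuthConfig{
			"subproject-e2e": {GitHubUsers: []string{"subproject-dev"}},
			"harmless-lint":  {AllowAnyone: true},
		},
	}
	fmt.Println(cfg.CanRerun("oncall-admin", "deploy-prod"))
	fmt.Println(cfg.CanRerun("subproject-dev", "deploy-prod"))
	fmt.Println(cfg.CanRerun("", "harmless-lint"))
}
```

Checking the allow-everyone flags before looking at the username mirrors the point that a globally open instance never needs to send anyone through GitHub OAuth.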

GitHub Teams

In both the global whitelist and job-specific lists, I wanted to make setting up the configuration as easy as possible. Specifically, typing out usernames over and over again gets tedious, and I wanted a way to group users together. GitHub teams let me achieve this.

After creating a team of users on GitHub, you could list that team as being allowed to rerun a job instead of listing the individual members. This was especially convenient for Kubernetes, where we already had teams.

This part of the project was especially fun because I got to play around with the GitHub API (which had quite a few outages) and find an appropriate balance between mocking for simplicity and using actual calls in tests.
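As a rough sketch of how a team check can sit behind a small interface, the Go example below resolves team membership through a `TeamLister` and swaps in an in-memory fake for tests. All names here (`TeamLister`, `TeamRef`, `fakeTeams`, `allowedByTeams`, and the example team) are hypothetical and chosen for illustration; a real implementation would back the interface with GitHub API calls.

```go
package main

import (
	"fmt"
	"strings"
)

// TeamLister is a hypothetical abstraction over "list the members of a
// GitHub team". Production code would implement it with the GitHub API.
type TeamLister interface {
	TeamMembers(org, slug string) ([]string, error)
}

// TeamRef identifies a team by organization and team slug.
type TeamRef struct {
	Org  string
	Slug string
}

// fakeTeams is an in-memory TeamLister for tests, keyed by "org/slug".
type fakeTeams map[string][]string

func (f fakeTeams) TeamMembers(org, slug string) ([]string, error) {
	members, ok := f[org+"/"+slug]
	if !ok {
		return nil, fmt.Errorf("team %s/%s not found", org, slug)
	}
	return members, nil
}

// allowedByTeams reports whether user belongs to any of the listed teams.
func allowedByTeams(lister TeamLister, teams []TeamRef, user string) (bool, error) {
	for _, t := range teams {
		members, err := lister.TeamMembers(t.Org, t.Slug)
		if err != nil {
			return false, err
		}
		for _, m := range members {
			if strings.EqualFold(m, user) {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	lister := fakeTeams{"example-org/rerun-admins": {"alice", "bob"}}
	teams := []TeamRef{{Org: "example-org", Slug: "rerun-admins"}}

	ok, err := allowedByTeams(lister, teams, "alice")
	fmt.Println(ok, err)
}
```

Keeping the fake this small is one way to strike the balance mentioned above: the permission logic is tested against the fake, while a handful of tests can still exercise the real API client.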

Config-Free Logic

Where possible, I tried not to require configuration at all. A good rule of thumb was that if someone could rerun a job by other means, they should be able to rerun a job via my button. Trusted users could already rerun presubmit test jobs from GitHub via our handy trigger plugin. I replicated that logic for my button.

CSRF Protection

Now that we had a robust authentication system and an effective permissions system, there was one last thing I needed to deal with. To avoid making a user log into GitHub every single time they tried to rerun a job, I stored their login session in a cookie, as is standard. Without any additional protection, this left us vulnerable to cross-site request forgery (CSRF) attacks.

In these attacks, a malicious actor convinces an already authenticated user to perform an action on their behalf. In this case, someone evil could trick someone with rerun access into submitting a form that makes a POST request to our rerun endpoint. To us, it would just look like the good user is trying to rerun a job, and we would allow it — not good.

We protect against such attacks by ensuring that all requests to our endpoints originate from our site. I chose to use the gorilla/csrf library for this. The library requires an unchanging stored secret, to which it adds a level of encryption to create per-request tokens. By encrypting per request, we manage to maintain only one stored secret and also shield this secret from exposure. When an endpoint is hit by a POST request, the server expects a valid token. This token can only be obtained through our server, so outside requests fail.
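To illustrate the general idea of per-request tokens derived from a single stored value, here is a simplified Go sketch that uses only the standard library. It is not gorilla/csrf's implementation or API: the function names are hypothetical, and the masking scheme (XOR with a fresh random pad) is my assumption about how such a scheme can work, shown purely for intuition.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/base64"
	"errors"
	"fmt"
)

// maskToken returns a fresh encoding of the stored token: a random pad
// followed by token XOR pad. Every call yields a different string, so
// the stored value itself never appears in a page.
func maskToken(stored []byte) (string, error) {
	pad := make([]byte, len(stored))
	if _, err := rand.Read(pad); err != nil {
		return "", err
	}
	masked := make([]byte, len(stored))
	for i := range stored {
		masked[i] = stored[i] ^ pad[i]
	}
	return base64.StdEncoding.EncodeToString(append(pad, masked...)), nil
}

// unmaskToken reverses maskToken and recovers the underlying token.
func unmaskToken(issued string) ([]byte, error) {
	raw, err := base64.StdEncoding.DecodeString(issued)
	if err != nil {
		return nil, err
	}
	if len(raw) == 0 || len(raw)%2 != 0 {
		return nil, errors.New("malformed token")
	}
	half := len(raw) / 2
	pad, masked := raw[:half], raw[half:]
	token := make([]byte, half)
	for i := range token {
		token[i] = pad[i] ^ masked[i]
	}
	return token, nil
}

// validRequestToken reports whether the token sent with a POST request
// unmasks to the value the server stored. The comparison is constant-time.
func validRequestToken(stored []byte, issued string) bool {
	got, err := unmaskToken(issued)
	if err != nil {
		return false
	}
	return subtle.ConstantTimeCompare(stored, got) == 1
}

func main() {
	stored := []byte("0123456789abcdef0123456789abcdef") // placeholder value
	issued, err := maskToken(stored)
	if err != nil {
		panic(err)
	}
	fmt.Println(validRequestToken(stored, issued))   // token issued by us
	fmt.Println(validRequestToken(stored, "forged")) // token made up elsewhere
}
```

A forged form submitted from another site has no way to read a token our server issued, so its request fails the check even though the browser still attaches the session cookie.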

In addressing this vulnerability in my project, I discovered similar vulnerabilities in other parts of Prow. The protection I added secures POST requests across our entire application.

Final Result

New Rerun Flow

I kept the UI almost exactly the same and just added a rerun button in the pre-existing popup. This preserved the old copy-paste command to avoid disrupting the old workflow.