03. How to Find and Prevent Secrets on GitHub

Depending on the scenario, you can receive near-real-time notifications when secrets are published to a repository or, even better, prevent the secrets from being published in the first place.

You Want to Study Historical GitHub Data

Before discussing how to find and prevent new secrets from being committed, it helps to know where to find historical GitHub data. This data is useful for studying issues like secret leakage across all of GitHub over time.

There are a few academic services that gather this data and make it available. Some, like GHTorrent and GH Archive, make data available as snapshots for offline processing. There are also datasets available on query platforms such as Google BigQuery (from GH Archive, GHTorrent, and even GitHub itself) that allow queries to be executed against historical data.
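As a rough illustration of offline processing, the sketch below builds the URL of one hourly GH Archive snapshot and counts event types in a downloaded file. It assumes the hourly naming scheme GH Archive documents (YYYY-MM-DD-H.json.gz, one JSON event per line); the function names are hypothetical.

```python
import gzip
import io
import json
from collections import Counter
from datetime import datetime


def gharchive_url(hour: datetime) -> str:
    """Build the URL of the GH Archive file covering one UTC hour."""
    # Assumed naming scheme: YYYY-MM-DD-H.json.gz, with the hour not zero-padded.
    return f"https://data.gharchive.org/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def count_event_types(gz_bytes: bytes) -> Counter:
    """Count events by type in one gzip-compressed, newline-delimited JSON file."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                counts[json.loads(line)["type"]] += 1
    return counts
```

Downloading the file (for example with urllib.request) is left out so the sketch stays independent of network access.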

While historical data is incredibly useful for measuring the scale of the problem or identifying trends over time, most organizations will be more interested in how to monitor for or prevent new secrets from being committed in the future. There are a few methods to accomplish this, depending on the scenario.

You Have Full Control Over the Development Environment

Git supports hooks, which let you run a script at various points in the Git workflow. For some hooks, the script's exit status determines whether the workflow continues. If you have control over the development environment used by committers, you can create a pre-commit hook that checks for sensitive information before allowing the commit to occur. This is the preferred way to prevent secrets from ever being committed to a repository.
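A minimal sketch of such a pre-commit hook, written in Python, might look like the following. The two patterns are illustrative assumptions only (an AWS-style access key ID and a PEM private key header); a dedicated scanner covers far more cases.

```python
#!/usr/bin/env python3
import re
import subprocess
import sys

# Illustrative patterns only; real scanners ship much larger rule sets.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def find_secrets(diff_text: str) -> list[tuple[str, str]]:
    """Return (pattern name, line) pairs for added lines that look like secrets."""
    findings = []
    for line in diff_text.splitlines():
        # Only inspect added lines; skip the "+++ b/file" header lines.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((name, line[1:].strip()))
    return findings


def main() -> int:
    # "git diff --cached" shows what is staged for the commit being created.
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    findings = find_secrets(diff)
    for name, line in findings:
        print(f"possible {name}: {line}", file=sys.stderr)
    # A non-zero exit status makes Git abort the commit.
    return 1 if findings else 0

# In the hook file itself, finish with: sys.exit(main())
```

Saved as an executable file named pre-commit in the repository's .git/hooks directory, with the final sys.exit(main()) call added, a script like this would block any commit whose staged changes match one of the patterns.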

While you can use any secret scanning tool, such as truffleHog or gitleaks, in a pre-commit hook, other tools such as detect-secrets from Yelp or git-secrets from Amazon Web Services make this step even easier by handling the installation for you.

Hooks are local to the development machine and are not committed to the repository. You can emulate similar behavior by committing hooks to a separate folder in the repository and including a script to copy or link the hooks into .git/hooks, but this is not an automatic process. Tools like pre-commit make this process a bit easier.
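The link step could be sketched as follows. The folder name hooks and the function install_hooks are assumptions for illustration; each developer would still need to run the script once after cloning.

```python
from pathlib import Path


def install_hooks(repo_root: Path, hooks_dir: str = "hooks") -> list[str]:
    """Symlink every file in <repo_root>/<hooks_dir> into <repo_root>/.git/hooks."""
    source = repo_root / hooks_dir
    target = repo_root / ".git" / "hooks"
    target.mkdir(parents=True, exist_ok=True)
    installed = []
    for hook in sorted(source.iterdir()):
        if not hook.is_file():
            continue
        link = target / hook.name
        # Replace any existing hook or stale link with the same name.
        if link.exists() or link.is_symlink():
            link.unlink()
        link.symlink_to(hook.resolve())
        # Git only runs hooks that are executable.
        hook.chmod(hook.stat().st_mode | 0o111)
        installed.append(hook.name)
    return installed
```

Linking rather than copying means later updates to the committed hooks take effect without reinstalling.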

If you’re using GitHub Enterprise, you have the option of using pre-receive hooks to run secret scanning tools before accepting the commit into the remote repository. GitLab offers similar hooks, including a predefined blacklist designed to catch secrets being committed without the need for a separate scanning tool. Even though these checks happen server-side, a push that fails a pre-receive hook is rejected, so its commits never appear in the remote repository's commit history.
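For context, Git passes a pre-receive hook one line per updated ref on standard input, in the form "old-sha new-sha ref-name". The sketch below parses those lines and works out which git rev-list arguments would select the newly pushed commits for scanning. The helper names are hypothetical, and actually running the scanner is left out.

```python
from typing import Optional


def is_null_sha(sha: str) -> bool:
    """An all-zero object name marks a ref being created or deleted."""
    return set(sha) == {"0"}


def parse_updates(stdin_text: str) -> list[tuple[str, str, str]]:
    """Parse pre-receive input lines into (old, new, ref) tuples."""
    updates = []
    for line in stdin_text.splitlines():
        if line.strip():
            old, new, ref = line.split(" ", 2)
            updates.append((old, new, ref))
    return updates


def rev_list_args(old: str, new: str) -> Optional[list[str]]:
    """Arguments for `git rev-list` selecting the commits a ref update introduces."""
    if is_null_sha(new):
        return None  # the ref is being deleted; nothing new to scan
    if is_null_sha(old):
        # New ref: scan commits reachable from it but from no existing ref.
        return [new, "--not", "--all"]
    return [f"{old}..{new}"]
```

A hook built on this would scan each selected commit and exit with a non-zero status to reject the push when a secret is found.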

You Have Control of the Repository or GitHub Organization

GitHub supports webhooks, which can be triggered for various events in a repository or organization. The push event tells you when new commits are pushed to a repository, and you can use these events to trigger secret scanning tools.
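A receiver for push events could start from something like the sketch below. It assumes the webhook was configured with a shared secret, so GitHub signs each delivery in the X-Hub-Signature-256 header, and that the payload carries repository.full_name and a commits list with id fields. The function names are hypothetical, and the HTTP server itself is omitted.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header sent with a webhook delivery."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, signature_header)


def commits_to_scan(body: bytes) -> list[tuple[str, str]]:
    """Return (repository full name, commit SHA) pairs from a push payload."""
    payload = json.loads(body)
    repo = payload["repository"]["full_name"]
    return [(repo, commit["id"]) for commit in payload.get("commits", [])]
```

Each returned pair identifies one commit that a scanning tool can then fetch and inspect.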

Receiving webhooks will not prevent secrets from being committed to the upstream repository (as opposed to pre-commit or pre-receive hooks), but you will be notified immediately when commits are made and can use this to rapidly respond to found secrets. This could mean automatic revocation, rollback of repository state, or alerting based on the results of the scan.

Removing secrets from a Git repository after they’ve been committed is a painful process. You must assume that any secret that does get committed is compromised, and it should be invalidated if at all possible, regardless of the time it takes to scrub the secret from the repository. GitHub’s article on removing sensitive data from a repository demonstrates the process required to scrub sensitive information from GitHub, including all the ways it can go wrong. Git is very good at its job of keeping track of all historic repository states, so it is correspondingly difficult to “change history” by removing commit contents.

You’re a Service Provider Wanting to Keep Customers Safe

It’s been documented that service providers such as Microsoft monitor GitHub for accidentally committed API keys. GitHub recently made this same capability accessible to other service providers through its Token Scanning service.

In this scenario, you as a service provider give GitHub regular expressions that match your service’s tokens, and GitHub alerts you via webhook when matching tokens are found. This frees you from the need to monitor specific repositories, as GitHub will monitor all commits for your service’s tokens. By leveraging this product, you can monitor for your own tokens being uploaded and revoke them before they can be abused.

You Control Neither the Repository nor the Developer Machines

All of the previous methods of secret detection and prevention require some level of control over the repository or organization. However, if you cannot use pre-commit hooks or receive webhooks, you still have options. GitHub exposes two APIs, the Events and Search APIs, which provide near-real-time information about commits, and can be scoped on a repository, organization, or user basis.

While the Events and Search APIs fill an important gap when it comes to secret monitoring, there are downsides to consider with both approaches. The Search API provides search results for all public repositories and allows results to be sorted by the time they were indexed, making it possible to identify new secrets a short time after they were committed. However, as we detail in the Limitations of Existing Methods section below, the Search API is limited in that it can only search for hardcoded values, as opposed to regular expressions or more advanced searches. If the secrets you’re searching for include a hardcoded value (e.g. a prefix), then tools like GitGot are available to handle the searching for you.
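To illustrate searching for a hardcoded value, the sketch below builds an authenticated request against the code search endpoint, sorted by index time. It assumes the REST parameters documented at the time of writing (q, sort=indexed, order, per_page, page) and that a personal access token is available; the function names are hypothetical.

```python
import urllib.request
from urllib.parse import urlencode


def code_search_url(fixed_string: str, page: int = 1) -> str:
    """Build a code search URL for an exact string, most recently indexed first."""
    params = {
        "q": f'"{fixed_string}"',  # quoted so the value is matched as a phrase
        "sort": "indexed",
        "order": "desc",
        "per_page": 100,
        "page": page,
    }
    return "https://api.github.com/search/code?" + urlencode(params)


def code_search_request(fixed_string: str, token: str, page: int = 1) -> urllib.request.Request:
    """Wrap the URL in a request carrying the token; code search requires authentication."""
    return urllib.request.Request(
        code_search_url(fixed_string, page),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the request to urllib.request.urlopen returns JSON whose items list names the matching files. The search rate limit is low, so polling has to be spaced out.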

The Events API provides a firehose of all public events at /events. This is useful if you want to watch for certain events happening across all of GitHub, though as we detail in the What Didn’t Work section, this approach isn’t feasible due to request rate limits. Instead, watching the events for a given repository, organization, or user meets our needs for secret detection.
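Watching a single repository could be sketched as below. The code assumes the /repos/{owner}/{repo}/events endpoint, conditional requests via ETag, and that each PushEvent carries the pushed head commit SHA in payload.head. The function names are hypothetical, and the polling loop is left out.

```python
import urllib.request
from typing import Optional


def repo_events_request(owner: str, repo: str, etag: Optional[str] = None) -> urllib.request.Request:
    """Build a request for a repository's recent events."""
    headers = {"Accept": "application/vnd.github+json"}
    if etag:
        # A conditional request: GitHub answers 304 if nothing has changed.
        headers["If-None-Match"] = etag
    return urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/events", headers=headers
    )


def new_push_heads(events: list[dict], seen_ids: set[str]) -> list[tuple[str, str]]:
    """Return (repo name, head SHA) for push events not seen in earlier polls."""
    heads = []
    for event in events:
        if event.get("type") != "PushEvent" or event["id"] in seen_ids:
            continue
        seen_ids.add(event["id"])
        head = event.get("payload", {}).get("head")
        if head:
            heads.append((event["repo"]["name"], head))
    return heads
```

Tracking seen event IDs matters because consecutive polls return overlapping pages of events.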

Once we have the events, we need a way to execute our secret detection tools with the correct context. We recognized a gap in existing tooling here and created secret-bridge as a result.