For organizations using cloud architectures, security is essential. However, that doesn’t mean finding a solution is easy. One reason is that the complexity of security systems reflects the complexity of an organization as a whole. For example, when working with a centralized IT department, security regulations are likely dictated by that department, and may not mesh smoothly with autonomous working teams.

In this article, we’ll look at the strategy we implemented to find a balance between these autonomous working teams and a thorough cloud security setup, without centralized IT.

What we do at Otto Group data.works

At Otto Group data.works we maintain one of the largest retail user data pools in the German-speaking area. Additionally, our product teams develop machine learning-based recommendation and personalization services. Otto Group’s many online retailers — such as otto.de and baur.de — integrate our services into their shops to enhance the customer experience.

Our interdisciplinary product teams, made up of data engineers, software engineers, and data scientists, work autonomously and are supported by two centralized infrastructure and data management teams. One of the main reasons we’re able to develop new products so quickly is the autonomy our teams enjoy, especially when it comes to dependencies on central infrastructure teams. For example, when we start development on a new product, a GCP project is bootstrapped following basic security policies and configurations and handed over to the product team within minutes. After that, team members get full access to work freely in their GCP environment, including any Google services they need to securely access our unique data pool.

This freedom comes with responsibility for the security of the projects. To support product teams with their dual (and often dueling) goals of developing quickly and staying secure, we have established a security framework based on five pillars. (You may notice that our model resembles other security frameworks, such as the NIST Cybersecurity Framework. This isn’t an accident; we researched and were influenced by many established resources when developing these pillars.)

Staying secure: our 5 pillars

For the rest of this post, we’ll look at each pillar of our security framework individually, including how it helps us stay secure while keeping up with the need to develop quickly.

Pillar 1: Detection

“Detection” is the foundation that the other pillars are built upon, defining a uniform way to audit and govern our cloud environment. To accomplish this, we built a database of cloud resources that stores the history of each resource as a series of revisions, with each revision consisting mainly of the resource’s metadata. This database, which we can run queries against, lets us keep track of each resource’s lifetime.
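
To make the idea of a revision history concrete, here is a minimal sketch in Python. It is not the schema Security Monkey uses; the class names `Revision` and `ResourceHistory` and the example metadata are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Revision:
    """One observed state of a resource: when it was seen and its metadata."""
    observed_at: datetime
    metadata: dict


@dataclass
class ResourceHistory:
    """All revisions recorded for a single cloud resource, oldest first."""
    resource_id: str
    revisions: list = field(default_factory=list)

    def record(self, observed_at, metadata):
        """Store a new revision only if the metadata changed since the last one."""
        if self.revisions and self.revisions[-1].metadata == metadata:
            return False
        self.revisions.append(Revision(observed_at, dict(metadata)))
        return True

    def state_at(self, moment):
        """Return the metadata that was current at `moment`, or None if unknown."""
        current = None
        for revision in self.revisions:
            if revision.observed_at <= moment:
                current = revision.metadata
        return current


# Hypothetical usage: two scans see the same state, a third sees a change.
history = ResourceHistory("projects/demo/buckets/example")
day1 = datetime(2020, 1, 1, tzinfo=timezone.utc)
day2 = datetime(2020, 1, 2, tzinfo=timezone.utc)
day3 = datetime(2020, 1, 3, tzinfo=timezone.utc)
history.record(day1, {"public": False})
history.record(day2, {"public": False})  # unchanged, so no new revision
history.record(day3, {"public": True})
```

Storing a revision only when the metadata changes keeps the history compact while still letting a query reconstruct what a resource looked like at any point in its lifetime.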

We first tried to implement this layer with the help of Google Cloud Asset Inventory, but soon realized that we couldn’t cover our whole cloud environment — all running Cloud Functions, Cloud Dataflow jobs, and Composer environments, for example — with the Cloud Asset Inventory API at that time. After researching other solutions, we decided on Security Monkey from Netflix due to its maturity and flexibility.

With some contributions and extensions, we can cover about 90% of the cloud services we use through the Google Discovery API. With this system, we periodically scan all our GCP projects to produce a snapshot of the resource metadata, which can then be aligned over time.

This setup works quite well, but we have noticed some drawbacks. Because the system is based on a pull model, it tracks resources only when they’re scanned, not from the moment they’re created. This leaves a time gap between a resource’s creation and its first discovery, and a resource that is created and shut down between two scans is never detected at all. With an entire scan of all projects taking up to 45 minutes, reducing this gap isn’t possible. We’ve also encountered occasional uptime problems under high load, caused either by scanning so many resources at once or by a shortage of available compute in the Kubernetes cluster where the scan runs. These problems aren’t pressing right now, but since Security Monkey is reaching the end of its product lifetime, we are looking at alternatives based on a push model, such as Historical.
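
The detection gap of a pull model can be illustrated with a few lines of Python. This is a simplified sketch: times are plain minutes, and the assumption that a new scan starts every 45 minutes is ours, made only so the numbers are easy to follow. The function name `first_detection` is hypothetical.

```python
def first_detection(created_at, deleted_at, scan_times):
    """Return the first scan time at which the resource exists, else None."""
    for scan_time in sorted(scan_times):
        if created_at <= scan_time < deleted_at:
            return scan_time
    return None


# Assumed schedule: scans start at minute 0, 45, 90, 135, and 180.
scan_times = range(0, 181, 45)

# A resource that lives from minute 50 to minute 60 falls between two scans.
short_lived = first_detection(50, 60, scan_times)

# A resource created at minute 50 that keeps running is first seen at minute 90.
long_lived = first_detection(50, 200, scan_times)
detection_gap = long_lived - 50
```

In this sketch the short-lived resource is never seen, and the long-lived one goes undetected for 40 minutes. A push model avoids both cases because the creation event itself triggers the inventory update.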

Pillar 2: Identification

With the resource inventory in place, it’s easy to identify security-related issues by running a query against the database. A query could list all compute instances whose network tag is associated with a firewall rule that allows overly broad IP ranges, all project IAM bindings for Service Accounts that do not belong to the same project, or all Kubernetes clusters where no authorized networks are configured for cluster master access. These are only three examples of the more than 100 queries we have used to identify security issues.
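
As an illustration of the second example, the sketch below checks an IAM policy for service accounts that belong to a different project. It is not one of our actual queries. It assumes the policy is available as a dictionary with the usual `bindings`, `role`, and `members` fields, and it only considers user-managed service accounts, whose email addresses end in `<project-id>.iam.gserviceaccount.com`. The function name and the sample data are hypothetical.

```python
SERVICE_ACCOUNT_PREFIX = "serviceAccount:"
USER_MANAGED_SUFFIX = ".iam.gserviceaccount.com"


def foreign_service_account_bindings(project_id, policy):
    """List (role, email) pairs for service accounts owned by another project."""
    findings = []
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            if not member.startswith(SERVICE_ACCOUNT_PREFIX):
                continue  # users, groups, and domains are out of scope here
            email = member[len(SERVICE_ACCOUNT_PREFIX):]
            domain = email.rpartition("@")[2]
            if not domain.endswith(USER_MANAGED_SUFFIX):
                continue  # Google-managed accounts use other domains; skipped
            owner_project = domain[: -len(USER_MANAGED_SUFFIX)]
            if owner_project != project_id:
                findings.append((binding["role"], email))
    return findings


# Hypothetical policy for a project called "project-a".
policy = {
    "bindings": [
        {
            "role": "roles/storage.admin",
            "members": [
                "serviceAccount:etl@project-a.iam.gserviceaccount.com",
                "serviceAccount:batch@project-b.iam.gserviceaccount.com",
                "user:alice@example.com",
            ],
        }
    ]
}
findings = foreign_service_account_bindings("project-a", policy)
```

Here only the `batch` service account from `project-b` is reported, since it is the only member owned by a project other than the one being checked.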

Identified issues are given a security score and categorized into topics like compliance, security, best practices, and cost control. Finally, they’re stored alongside a resource’s metadata and presented in the Security Monkey UI, where they can be commented on and justified by an authorized group of product team members. We keep track of metrics like the mean time an issue remains unresolved and the accumulated score over time, which we visualize with other metrics in a Grafana dashboard like the one depicted below.

Overview of our Grafana dashboard showing different metrics which are derived from identified issues.
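
The two metrics mentioned above could be computed roughly as follows. This is a sketch under our own assumptions: the `Issue` fields, the function names, and the rule that an issue still open at measurement time counts with its age so far are illustrative and not taken from Security Monkey.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Issue:
    category: str
    score: int
    opened_at: datetime
    resolved_at: Optional[datetime] = None  # None means still unresolved


def accumulated_score(issues, at):
    """Sum the scores of all issues that were open at the moment `at`."""
    return sum(
        issue.score
        for issue in issues
        if issue.opened_at <= at
        and (issue.resolved_at is None or issue.resolved_at > at)
    )


def mean_time_unresolved(issues, now):
    """Average time issues stayed open; open issues count up to `now`."""
    durations = [
        min(issue.resolved_at or now, now) - issue.opened_at
        for issue in issues
        if issue.opened_at <= now
    ]
    if not durations:
        return timedelta(0)
    return sum(durations, timedelta(0)) / len(durations)


# Hypothetical data: one resolved issue and one that is still open.
issues = [
    Issue("security", 5, datetime(2020, 1, 1), datetime(2020, 1, 3)),
    Issue("compliance", 3, datetime(2020, 1, 2)),
]
now = datetime(2020, 1, 6)
score_now = accumulated_score(issues, now)
mean_open = mean_time_unresolved(issues, now)
```

Evaluating `accumulated_score` at regular intervals yields the kind of time series a dashboard can plot.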

Pillar 3: Prevention

While identifying security issues brings transparency to a cloud environment, important issues could still remain unresolved for some time. For certain issues, however, such as unauthorized access to storage objects or sharing of sensitive data, fast resolution must be guaranteed.

For those topics, we created a prevention layer which actively takes measures to enforce security policies for a defined set of security issues.

This tool is simply called policy-enforcer and runs as a Cloud Dataflow job in streaming mode. Input events are ingested from three different sources:

● from a Cloud Pub/Sub topic which forwards filtered audit logs from a folder sink

● from the security monitoring system described above

● directly pushed to the system

After the policy-enforcer has checked the integrity of a message, it immediately performs actions to enforce the desired state and audits them. The whole process takes less than four seconds. An image of the Cloud Dataflow job is depicted below.

Dataflow pipeline job processing input events from different sources and performing actions to enforce specific policies.
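
The check-then-enforce step can be sketched as a plain Python function. This is not the policy-enforcer code, which runs inside Cloud Dataflow. In particular, the HMAC-based integrity check, the event layout, and the names `sign`, `handle_event`, and `remove_public_access` are assumptions made for illustration, and the remediation only returns a description instead of calling a cloud API.

```python
import hashlib
import hmac
import json


def sign(payload, key):
    """Compute an HMAC-SHA256 signature over a canonical JSON encoding."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def remove_public_access(payload):
    """Placeholder remediation: a real one would change the resource."""
    return f"removed public access from {payload['resource']}"


# Maps a violation type to the action that restores the desired state.
REMEDIATIONS = {"public_bucket": remove_public_access}


def handle_event(event, key, remediations):
    """Verify an event, enforce the matching policy, and return an audit entry."""
    expected = sign(event["payload"], key)
    if not hmac.compare_digest(expected, event.get("signature", "")):
        return {"status": "rejected", "reason": "integrity check failed"}
    payload = event["payload"]
    remediate = remediations.get(payload["violation"])
    if remediate is None:
        return {"status": "ignored", "resource": payload["resource"]}
    return {
        "status": "enforced",
        "resource": payload["resource"],
        "action": remediate(payload),
    }


# Hypothetical event, signed with a demo key.
key = b"demo-key"
payload = {"violation": "public_bucket", "resource": "gs://example-bucket"}
event = {"payload": payload, "signature": sign(payload, key)}
audit_entry = handle_event(event, key, REMEDIATIONS)
```

Returning an audit entry for every outcome, including rejected and ignored events, keeps the enforcement traceable.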

Pillar 4: Exception

This pillar covers the need to allow exceptions to security rules. Examples include granting a time-bound IAM role with higher privileges to access a system that needs to be debugged, or creating a resource outside of the EU because it isn’t yet available in the desired region.

It’s important for us to have a single source for exceptions and to be able to track the creation of each one. For these reasons, we decided to store exceptions in a git repository. Creating an exception follows the dual control principle. Once the exception is created, the whitelist rule is applied right away, which signs off all related, previously identified issues.

We use whitelisting to override the default IAM bindings assigned during the initial project creation. Through this whitelist process, we try to reduce the overall number of created issues, leaving only the most relevant for the product team to check.
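
The following Python sketch shows how exceptions might be matched against issues once they have been read from the repository. The record fields (`issue_type`, `resource`, `requested_by`, `approved_by`, `expires_on`), the wildcard matching, and the function names are hypothetical; the file format of the actual exceptions is not described here.

```python
from datetime import date
from fnmatch import fnmatchcase


def is_active(exception, today):
    """An exception counts only if two different people were involved
    (dual control) and it has not yet expired (time-bound)."""
    dual_control = exception["requested_by"] != exception["approved_by"]
    return dual_control and today <= exception["expires_on"]


def sign_off(issues, exceptions, today):
    """Split issues into those still open and those covered by an exception."""
    active = [e for e in exceptions if is_active(e, today)]
    open_issues, signed_off = [], []
    for issue in issues:
        covered = any(
            e["issue_type"] == issue["type"]
            and fnmatchcase(issue["resource"], e["resource"])
            for e in active
        )
        (signed_off if covered else open_issues).append(issue)
    return open_issues, signed_off


# Hypothetical exception: a resource is allowed outside the EU until year end.
exceptions = [
    {
        "issue_type": "non_eu_location",
        "resource": "projects/demo/instances/*",
        "requested_by": "alice",
        "approved_by": "bob",
        "expires_on": date(2020, 12, 31),
    }
]
issues = [
    {"type": "non_eu_location", "resource": "projects/demo/instances/vm-1"},
    {"type": "public_bucket", "resource": "projects/demo/buckets/data"},
]
open_issues, signed_off = sign_off(issues, exceptions, date(2020, 6, 1))
```

Because expiry is checked at evaluation time, an issue covered by a time-bound exception reappears automatically once the exception has lapsed.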

Pillar 5: Notification

The last layer is a reliable communication channel through which product team members can be notified directly of any issues. This channel can be used to report the current security posture and to send important alerts. For this functionality, we created a Slack app that can push messages to individual team channels and respond to actions taken on a message.

Our Slack app pushes out notifications on new issues. From here, users can jump directly to Security Monkey or GCP, or acknowledge the notification right away.
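
A notification like the one described could be assembled as a Slack Block Kit payload. The sketch below only builds the message and does not send it. The issue fields, channel name, action identifiers, and URLs are placeholders we made up, not details of our app. The result is shaped like the arguments of Slack’s `chat.postMessage` method.

```python
def build_issue_notification(issue, inventory_url, console_url):
    """Build a Slack Block Kit payload describing one new security issue."""
    summary = (
        f"*New security issue* in `{issue['project']}`\n"
        f"{issue['title']} (score {issue['score']})"
    )

    def link_button(action_id, label, url):
        # A button with a `url` opens that page when clicked.
        return {
            "type": "button",
            "action_id": action_id,
            "text": {"type": "plain_text", "text": label},
            "url": url,
        }

    return {
        "channel": issue["team_channel"],
        "text": f"New security issue: {issue['title']}",  # plain-text fallback
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
            {
                "type": "actions",
                "elements": [
                    link_button("open_inventory", "Open in Security Monkey", inventory_url),
                    link_button("open_console", "Open in GCP", console_url),
                    {
                        # Clicking this sends an interaction back to the app.
                        "type": "button",
                        "action_id": "acknowledge_issue",
                        "text": {"type": "plain_text", "text": "Acknowledge"},
                        "value": issue["id"],
                        "style": "primary",
                    },
                ],
            },
        ],
    }


# Hypothetical issue and placeholder URLs.
message = build_issue_notification(
    {
        "id": "issue-123",
        "project": "demo-project",
        "title": "Bucket is publicly readable",
        "score": 8,
        "team_channel": "#team-demo-security",
    },
    inventory_url="https://security-monkey.example.com/issues/issue-123",
    console_url="https://console.example.com/demo-project",
)
```

The two link buttons cover the jump to Security Monkey or GCP, while the third button carries the issue ID in its `value` so the app can tell which issue was acknowledged.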

Culture

Having this system in place has been extremely helpful in providing visibility into our cloud environment from a security standpoint. We have learned, however, that it is one thing to use such a system, but an entirely different thing to build a culture where team members take ownership of building and operating great, secure software. It’s an ongoing effort to build and sustain such a culture, especially in an organization with autonomous and agile working teams like ours.

One way we try to reach this goal is by promoting continuous improvement at the technical, skill, and organizational levels. We do this by regularly sharing knowledge and best practices in different formats, like our internal GCP-3D or Security Champions sessions. We also try to avoid knowledge silos by rotating daily tasks and responsibilities. This is supported by our belief in automation, reliability, and traceability, which we put into practice through heavy use of infrastructure-as-code. It’s also vitally important to design cloud architectures with security considerations upfront rather than trying to fix issues later.

Conclusion

Establishing a robust security system with agile working teams, especially in the cloud, is challenging: teams are responsible for the product they develop as well as for its operation and security. It takes continuous effort to build an environment where all these areas achieve excellence.

As a result, we’ve learned that establishing sustainable security awareness in teams has to be a permanent topic, not just addressed at the project level. Ignoring this will result in the same technical debt problems many software products face. In fact, we use the term “security debt” in our discussions. We’ve also learned that supporting teams with a security framework is a good baseline on which to build additional layers, including process-related and organizational topics. These five pillars have helped us immensely in our security journey, and we hope these concepts will help your organization as well.