I am excited to share that we are investing in additional detection capabilities as part of the SIRT mission. There are a number of existing detection efforts across Netflix security teams. This is an opportunity to further those efforts, while creating stronger alignment between detection and response. Of course this being Netflix, our culture and our tech stack loom large in our consideration of how to expand our detection program. We want to avoid traditional pitfalls and optimize for our novel security approach. The last thing we want is a bunch of lame alerts creating busy work for a large standing SOC.

Required reading for anyone interested in this area includes Ryan McGeehan’s Lessons Learned in Detection Engineering and the Alerting and Detection Strategy work from Palantir. I have borrowed liberally from their efforts. There are many ways to break this down, but I have settled on the following categories: Data Sources, Event Pipelines, Correlation Engine, and Response Orchestration.

Within these categories, this is what a mature program looks like, and some questions we are grappling with:

Data Sources

Defenders need to discover existing security-relevant data as well as create new data where gaps in visibility exist. This includes data discovery, deployment of instrumentation like eBPF or osquery, and working with application teams on logging best practices. What works for self-serve, or even automated, discovery and onboarding of security-relevant datasets? Do you enforce schemas, or is that too much friction?
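To make the schema question concrete, here is a minimal sketch of the kind of lightweight check that could run during dataset onboarding; the required fields and their types are hypothetical, not our actual schema:

```python
# Hypothetical required fields for an onboarded security dataset.
REQUIRED_FIELDS = {"timestamp": str, "host": str, "event_type": str}

def validate_event(event: dict) -> list:
    """Return a list of schema problems; an empty list means the event passes."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    return problems
```

A check like this sits in the middle ground: it rejects nothing at ingest, but surfaces friction early without forcing a rigid schema on every team.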

The creation and/or collection of data should be justified by a quantitative reduction in risk to the organization (in dollar terms), but that can be difficult to forecast accurately until you have a chance to explore the data - something a Hunt function is great for. How would you enable hunting? Where does it fit in your overall detection strategy?
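One hedged way to frame that dollar-terms forecast is as an expected-value calculation; the parameters below are illustrative placeholders, and in practice each is itself an uncertain forecast:

```python
def expected_annual_savings(incidents_per_year, loss_per_incident,
                            detection_probability, loss_avoided_when_detected):
    """Rough expected dollar reduction in annual loss from a proposed
    detection: incident frequency, times loss per incident, times the
    chance the detection fires, times the fraction of loss avoided by
    detecting early. All inputs are estimates."""
    return (incidents_per_year * loss_per_incident
            * detection_probability * loss_avoided_when_detected)
```

For example, 4 incidents a year at $50k each, a 60% chance of detection, and half the loss avoided when detected early yields a $60k expected annual savings - a number to compare against the cost of collecting and storing the data.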

Event Pipelines

Normalizing heterogeneous datasets to a common set of security relevant fields has proven difficult (see SIEM). Rather than attempt normalization upfront, our approach is to create pipelines for each data source using templates and reusable modules. Instead of a general data pipeline with a single rules engine, we consider it more like a workflow for each data type. In theory each data source requires a bit more effort to set up, but the templates and modules help there, and the benefit is far more flexibility.
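A per-source workflow built from reusable modules might look like the following sketch - the module names, ownership table, and SSH example are all hypothetical, and the point is the composition pattern rather than any specific step:

```python
from typing import Callable, Iterable, Optional

# A step transforms an event, or returns None to drop it from the flow.
Step = Callable[[dict], Optional[dict]]

def pipeline(*steps: Step) -> Callable[[Iterable[dict]], list]:
    """Compose reusable modules into a workflow for one data source."""
    def run(events):
        out = []
        for event in events:
            for step in steps:
                event = step(event)
                if event is None:
                    break  # a step dropped this event
            else:
                out.append(event)
        return out
    return run

# Reusable enrichment module (the ownership table is a placeholder).
OWNERS = {"billing-api": "billing-team"}

def enrich_owner(event):
    event["owner"] = OWNERS.get(event.get("app"), "unknown")
    return event

# Source-specific detection step for a hypothetical SSH log feed.
def flag_root_login(event):
    return event if event.get("user") == "root" else None

ssh_alerts = pipeline(enrich_owner, flag_root_login)
```

Each data source pays a small setup cost to assemble its own chain, but shared modules like `enrich_owner` amortize across sources, and no step has to fit a one-size-fits-all rules engine.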

The onboarding process for new data needs to be easy, ideally partially automated and largely self-serve. Any given engineering team should be able to get their data to a state where they can rapidly test hypotheses about a new detection. It should take less than a day to get a new data feed flowing, less than an hour to set up a basic alert module, and less than five minutes to iterate - from pushing code to seeing triggered events back. What sort of frameworks or opinionated development tools could help speed up this cycle?
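One piece of that fast inner loop could be a local harness that replays labeled sample events through a candidate rule - a sketch under the assumption that engineers keep small labeled fixtures next to their rules:

```python
def evaluate_rule(rule, labeled_events):
    """Replay (event, is_malicious) pairs through a candidate rule and
    tally true/false positives and negatives for quick local feedback."""
    tally = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for event, is_malicious in labeled_events:
        fired = bool(rule(event))
        if fired and is_malicious:
            tally["tp"] += 1
        elif fired:
            tally["fp"] += 1
        elif is_malicious:
            tally["fn"] += 1
        else:
            tally["tn"] += 1
    return tally
```

Running this against fixtures before deploying gives an engineer a rough false-positive picture in seconds rather than waiting on production data.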

Reliable plumbing is critical. Pipelines need to provide transport that balances cost against timeliness, and they need to scale to meet the demands of a growing organization and a growing number of data sources. We are moving towards a streaming model, where streams of data are enriched and inspected to create additional streams of events, although we still run batch jobs on tables for many flows. We need health checks for changes in data rate and format, along with runbooks on how to troubleshoot and repair flows when they go down.
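A data-rate health check can be as simple as comparing the latest interval against a trailing baseline; this is a sketch with illustrative, untuned thresholds:

```python
def rate_looks_unhealthy(hourly_counts, window=24, tolerance=0.5):
    """Flag a feed whose latest hourly event count deviates from the
    trailing mean by more than `tolerance` (a fraction of the baseline).
    The window and tolerance values here are placeholders, not tuned."""
    baseline = sum(hourly_counts[-(window + 1):-1]) / window
    latest = hourly_counts[-1]
    return abs(latest - baseline) > tolerance * baseline
```

A check like this catches the common silent failures - a feed that dries up or suddenly floods - and is the kind of alert that should page the pipeline owner with a runbook attached, not a detection responder.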

Working towards mature rules is key to our approach to minimizing alert overload. We must encourage a level of rigor in how a rule is implemented - minimize false positives, identify blind spots, ensure data health, and perhaps most importantly define response actions. An alert without a response plan is worse than no alert at all. This is where our version of Palantir ADS comes into play. How do you encourage good alerts without adding friction to the experimentation process?
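The rigor the ADS approach asks for can be encoded as required metadata on every rule. This sketch captures a hypothetical subset of the fields such a write-up might carry - it is not the actual ADS template:

```python
from dataclasses import dataclass, field

@dataclass
class DetectionStrategy:
    """Illustrative subset of an ADS-style write-up for one rule."""
    name: str
    goal: str
    blind_spots: list = field(default_factory=list)
    known_false_positives: list = field(default_factory=list)
    response_runbook: str = ""

    def ready_for_production(self) -> bool:
        # An alert without a response plan is worse than no alert at all.
        return bool(self.goal and self.response_runbook)
```

Gating deployment on `ready_for_production` adds friction only at promotion time, leaving the experimentation loop itself unencumbered - one possible answer to the friction question above.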

We want to enable the development of new rules as part of our standard security processes, like our post incident reviews. We also want to advocate for detections as primary security controls, alongside preventative controls, in work by other security teams. The detection team will not have a monopoly on rule ideas, and in an ideal state even non-security personnel will be writing rules on our platform, but we will have ultimate responsibility for the quality of alerting and a unique perspective to develop alerts that consider signals across the entire space. What abstractions are useful in this space? Is there a way to make rules portable across systems and companies?

Correlation Engine

When rules or models trigger they create events which may require additional enrichment, and evaluation by further rules. Every triggered rule should fire automation before it fires an alert to a human. When a human gets an alert, they should be the right person, and be provided the right context and the right set of options. In our culture the person with the best understanding of the system is the system owner / oncall, whether a security team or application team. This is what I mean by SOCless: decentralizing alert triage to system experts. Within security, that means you respond to the alerts you write. This aligns incentives so no one is offloading lazy alerts onto another team. How have folks avoided the pitfalls of alert fatigue and other SOC challenges?
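The automation-first, owner-routed flow above can be sketched as a small handler; the suppression action and ownership table below are hypothetical examples of what a real correlation engine would plug in:

```python
def handle_trigger(event, auto_actions, owner_lookup):
    """Run automation first; only page a human if nothing resolves the
    event - and page the system owner, not a central SOC. Falls back to
    the alert author when no owner is known. All names are illustrative."""
    for action in auto_actions:
        if action(event):
            return {"status": "auto-resolved", "event": event}
    oncall = owner_lookup.get(event.get("app"), "alert-author")
    return {"status": "paged", "oncall": oncall, "event": event}
```

Routing to `alert-author` as the fallback is the incentive alignment in code: if you cannot name an owner for your alert, you are the owner.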

The edgier bet is that we can bring non-security personnel up to speed on a security alert more easily than we can teach a security person the details of a production system they are not familiar with. The folks who build applications are experts in their domains, but likely not experts in security, so we need to provide them with context around what the alert means, and enrichment on the overall state of the system beyond the rule that triggered it, so that they can make a decision on what to do. This blends into response orchestration.

Response Orchestration

When an enriched alert reaches a human, it should contain a reasonable set of response actions. ‘Ignore’ would feed back into alert tuning. ‘Redeploy cluster’ might be another, or ‘collect additional data,’ or, if it cannot be resolved by the oncall, ‘escalate to alert author.’ For this to work we need really good alerts: well documented, enriched, high signal-to-noise, and paired with a reasonable set of actions. What role does automation play in your perfect state? When can we fully automate alert resolution, and when do we need a human? Which human?
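The action set attached to an alert might be dispatched like this - a sketch, not a full orchestrator, with the alert fields and action names taken from the examples above:

```python
def resolve(alert, choice, tuning_feedback):
    """Apply the oncall's chosen response action. 'ignore' feeds back
    into rule tuning; a choice outside the alert's offered actions
    escalates to the alert author."""
    if choice == "ignore":
        tuning_feedback.append(alert["rule"])
        return "recorded for tuning"
    if choice in alert["actions"]:
        return f"ran: {choice}"
    return f"escalated to {alert['author']}"
```

The important property is that every path produces a recorded outcome - even ‘ignore’ generates signal for tuning instead of vanishing.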

Each alert that fires is a chance to measure our forecasts about risk, so we need to capture feedback on outcomes through our orchestration platform. We also want to make sure our alerts are healthy through offensive unit testing (also referred to as atomic red team tests). We will need to stay highly aligned with the growing red team efforts to best leverage them for testing our overall risk impact and efficacy.
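An offensive unit test in this spirit can be expressed as a simple invariant: inject a synthetic attack event and assert the rule fires, while a benign sample stays quiet. The events here are hypothetical stand-ins for real injected telemetry:

```python
def offensive_unit_test(rule, synthetic_attack, benign_sample):
    """Atomic-red-team-style health check: the rule must fire on a
    synthetic attack event and stay quiet on a benign one."""
    return bool(rule(synthetic_attack)) and not rule(benign_sample)
```

Run on a schedule against production pipelines, a check like this catches both broken detections and broken plumbing - if the synthetic event never comes back, something upstream is down.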

Where do we go from here?

I am happy to announce the latest job opening on my team: Sr. Security Engineer - Detection. I need a technical leader in this role to own detection. I anticipate this role will involve a good bit of product leadership and architecture work, with possibly some coding to glue things together. Strong program management skills - to leverage existing investments, make buy/build decisions, and operate across teams - will be key. People management skills are a plus, as we will need to define and staff a team, but you don't need to be a manager. I am extremely excited about this subject and can’t wait to find someone to work with on it. If you want to chat about the approach, or the role, or have a referral, please reach out: amaestretti@netflix.com