CIM Data Model Optimizations

The Splunk community has rallied around the concept of data models, and why not? Normalizing data into common field sets helps you build use cases regardless of which vendor your data comes from. The Common Information Model (CIM) was created for this purpose and has become a staple of any Enterprise Security (ES) deployment. But is it efficient? Certainly not out of the box. A considerable amount of tuning is required to make CIM data models perform nearly as well as an equivalent raw search, and that tuning is critical to the success of any ES deployment.

Data Model Blues

CIM data models are composed of base searches that are generally built on tags. Tags rely on key-value pairs and/or event types to work, but filtering on tags and event types tends to consume a lot of resources because of the way Splunk search works. The docs state that in the search-time operations order, event types and tags are derived last, at seventh and eighth respectively. That means each event in every index searched must be decompressed, loaded into memory, and run through the full parsing process before it can be filtered out of a CIM search. This has improved over time as Splunk continues to optimize data model searches.
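As a concrete sketch of the chain CIM depends on, a tag is typically wired to an event type like this (the event type name and search string below are hypothetical examples, not part of Splunk_SA_CIM):

```
# eventtypes.conf (hypothetical example)
[example_vpn_authentication]
search = sourcetype="juniper:sslvpn" "Login succeeded"

# tags.conf
[eventtype=example_vpn_authentication]
authentication = enabled
```

Every event has to be fully parsed before Splunk can decide whether the event type, and therefore the tag, applies — which is why tag-based filtering is so expensive compared to an index or keyword filter.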

No indexes are specified in the CIM searches by default, which means every index is included. This is the opposite of an ideal search that has indexes, sourcetypes, and keywords specified to limit resource consumption. Don’t underestimate how much having no free memory can bog down your indexer tier.

Hardware Requirements

Splunk docs state that the indexer tier for any ES deployment should ingest between 40 and 100 GB per indexer per day. If you have a 400 GB/day deployment, that works out to between 4 and 10 indexers. The more indexers you have, the better performance you will see. We’ve found that 70 GB/day/indexer strikes a reasonable balance (depending on the number of concurrent users).
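The arithmetic is simple enough to script. A minimal sketch of the sizing calculation for the hypothetical 400 GB/day deployment above (integer math, rounding down since a partial indexer isn’t a thing):

```shell
# Hypothetical sizing calculation for a 400 GB/day deployment
DAILY_GB=400
echo "at 100 GB/day/indexer: $(( DAILY_GB / 100 )) indexers"   # 4  (Splunk's upper bound)
echo "at 40 GB/day/indexer:  $(( DAILY_GB / 40 )) indexers"    # 10 (Splunk's lower bound)
echo "at 70 GB/day/indexer:  $(( DAILY_GB / 70 )) indexers"    # 5-6 (our balanced target)
```

Swap in your own license volume for `DAILY_GB`; round up rather than down when the division isn’t clean.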

Don’t try to compensate for fewer indexer nodes with faster storage or more CPU cores; that’s not how Splunk works. CIM searches fall under “dense search,” which is primarily CPU-bound. Certain data models have acceleration enabled by default, and those will run an acceleration job at the indexer tier every five minutes for the past five minutes of data. Each search will consume a single CPU core until it’s done. If these jobs take longer than five minutes, your dataset may never be fully accelerated and your search results will be inaccurate. Ideally, each job should finish in under two minutes. Check your ES Data Model Audit dashboard to see how well your DM searches are running. Alternatively, run this search using the Line Chart visualization:

index=_internal sourcetype=scheduler component=SavedSplunker ACCELERATE NOT skipped run_time=*
| rex field=savedsearch_id "ACCELERATE_(?:[A-F0-9\-]{36}_)?(?<acceleration>.*?)_ACCELERATE"
| timechart span=5m max(run_time) AS run_time by acceleration

Remember that each standalone search head or search head cluster will run its own CIM acceleration jobs if CIM is installed and configured, so try to avoid acceleration redundancy across your environment.
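Acceleration is controlled per model in datamodels.conf. As a sketch (the model names and window values here are illustrative assumptions, not ES defaults), a local override on a search head that should not run its own acceleration jobs could look like:

```
# local/datamodels.conf on the redundant search head (illustrative values)
[Authentication]
acceleration = false

# ...while on the search head that owns acceleration, a bounded
# backfill window helps keep the five-minute jobs short:
[Network_Traffic]
acceleration = true
acceleration.earliest_time = -7d
acceleration.cron_schedule = */5 * * * *
```

Shrinking `acceleration.earliest_time` is one of the quickest ways to bring a chronically late acceleration job back under the two-minute target.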

Indexing Strategy

Splunk recommends splitting data into indexes based on two factors: security and retention. We recommend adding a third criterion: performance. Specifically, CIM is designed to search a given list of indexes using tags, and if those tags match only a small portion of the events in those indexes, you get slower searches and wasted resources. One example is splitting VPN device logs out of your firewall index into a separate index so they can be searched independently. The Authentication and Network Sessions data models can then be pointed at the new VPN index and completely bypass the massive set of firewall network traffic logs that would never match.
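One way to implement that split is index-time routing in props.conf and transforms.conf on the indexers or heavy forwarders. A sketch, assuming the `juniper:sslvpn` sourcetype from the examples later in this post and a `vpn` index that already exists in indexes.conf:

```
# props.conf
[juniper:sslvpn]
TRANSFORMS-route_vpn = route_vpn_to_index

# transforms.conf
[route_vpn_to_index]
REGEX = .
DEST_KEY = _MetaData:Index
FORMAT = vpn
```

If you control the inputs directly, setting `index = vpn` in inputs.conf on the forwarder is simpler than a routing transform.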

CIM Tuning

The Common Information Model can be tuned by adding indexes to your data model searches in two ways:

1. Through the CIM Setup dashboard within Enterprise Security.

2. By editing macros.conf in etc/apps/Splunk_SA_CIM/local directly.

Both of these options have the same end result. We recommend option #2, which provides more flexibility and better tuning options. Specifically, if you have indexes with a lot of non-matching data for a given data model, you can exclude or include specific sourcetypes within those indexes. This reduces the memory footprint and increases search speed by limiting what is searched. But how do you know which indexes and sourcetypes to add to each data model? Generally, you want to reference the Splunk docs for each Technology Add-on (TA) and sourcetype to identify the corresponding data models. See Source types for the Splunk Add-on for Cisco ASA for an example.
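For example, if an index holds a mix of matching and non-matching sourcetypes, the macro can carve out just what the model needs, or exclude what it doesn’t (the index and sourcetype names here are hypothetical):

```
# Include only the relevant sourcetypes from a mixed index...
[cim_Authentication_indexes]
definition = ( index="vpn" OR ( index="os" AND sourcetype="linux_secure" ) )

# ...or keep the whole index but exclude a noisy non-matching sourcetype:
# definition = ( index="os" AND sourcetype!="example_metrics" )
```

The include form is generally safer: a new non-matching sourcetype landing in the index later won’t silently creep into the data model.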

Tuning Automation

If you’re thinking that sounds tedious, you’re right. It’s too easy to miss something and end up with a compliance issue or missing data in your reports. That’s why we’re giving you tools to do this more easily. Run the following search over 24 hours for each of your CIM data models to automatically produce configurations you can apply to your macros.conf file.

| tstats count from datamodel=Authentication where index!="_*" by index sourcetype
| fields - count
| format
| fields search
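To cover every model in one pass, a loop can drive the same discovery search from the command line. This is a sketch only: the model list, splunk binary path, and authentication handling are assumptions you will need to adapt to your environment.

```
# Hypothetical wrapper around the discovery search; run as the splunk user
for dm in Authentication Network_Traffic Network_Sessions Web; do
    echo "[cim_${dm}_indexes]"
    /opt/splunk/bin/splunk search \
        "| tstats count from datamodel=${dm} where index!=\"_*\" by index sourcetype | fields - count | format | fields search" \
        -earliest_time -24h
done
```

Run it on a search head that can see all the relevant indexes, and paste each stanza’s output into the matching macro definition.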

Your output should look similar to the following:

( ( index="wineventlog" AND sourcetype="WinEventLog" ) OR ( index="wineventlog" AND sourcetype="XmlWinEventLog" ) OR ( index="vpn" AND sourcetype="juniper:sslvpn" ) OR ( index="os" AND sourcetype="linux_secure" ) OR ( index="rsa" AND sourcetype="rsa:securid:runtime:syslog" ) OR ( index="salesforce" AND sourcetype="sfdc:loginhistory" ) OR ( index="os" AND sourcetype="syslog" ) )

Next, log in to your Splunk server via SSH, sudo to the splunk user, and create your template for local/macros.conf:

$ cd /opt/splunk/etc/apps/Splunk_SA_CIM
$ mkdir local
$ grep -P -A2 '^\[cim_(?!Application_).*_indexes\]' default/macros.conf | grep -v '\-\-' > local/macros.conf
$ vi local/macros.conf

From here, change the definition of the Authentication macro to look like the following:

[cim_Authentication_indexes]
definition = ( ( index="wineventlog" AND sourcetype="WinEventLog" ) OR ( index="wineventlog" AND sourcetype="XmlWinEventLog" ) OR ( index="vpn" AND sourcetype="juniper:sslvpn" ) OR ( index="os" AND sourcetype="linux_secure" ) OR ( index="rsa" AND sourcetype="rsa:securid:runtime:syslog" ) OR ( index="salesforce" AND sourcetype="sfdc:loginhistory" ) OR ( index="os" AND sourcetype="syslog" ) )

Easy, right? Now rinse and repeat for the rest of your data models to make sure they’re optimally tuned. You’ll also want to repeat this procedure whenever you add new data sources to your environment. Note that the discovery search only produces complete results while the Splunk CIM macros still have no indexes specified; once a macro is constrained, tstats can only see the indexes the macro already includes.

We hope this helps to tune your Splunk Enterprise Security deployment or Common Information Model-based use case. We’d love to hear your feedback in the comments!