L1 Misses = event 0xD1, umask 0x08 = MEM_LOAD_RETIRED.L1_MISS = "Retired load instructions missed L1 cache as data sources." (Grammar in that manual can be atrocious at times.)

L2 Misses = event 0xD1, umask 0x10 = MEM_LOAD_RETIRED.L2_MISS = "Retired load instructions missed L2. Unknown data source excluded."

L3 Misses = event 0xD1, umask 0x20 = MEM_LOAD_RETIRED.L3_MISS = "Retired load instructions missed L3. Excludes unknown data sources."

L1i Miss = event 0x80, umask 0x02 = ICACHE_64B.IFTAG_MISS = "Instruction fetch tag lookups that miss in the instruction cache (L1i). Counts at 64-byte cache-line granularity"

L2 Code Miss = event 0x24, umask 0x24 = L2_RQSTS.CODE_RD_MISS = "L2 cache misses when fetching instructions"
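The event/umask pairs above aren't tied to VTune or PCM. As a minimal sketch, assuming you wanted to feed them to a tool that accepts raw event codes (Linux perf's `rUUEE` syntax is one example, and it wasn't used for the data in this post), the two bytes pack together as shown below. The `raw_event` helper and the `EVENTS` table are hypothetical names made up for illustration.

```python
# Hypothetical helper, not part of any tool mentioned in this post.
# On Intel CPUs the event select occupies bits 0-7 of the counter
# configuration and the unit mask (umask) occupies bits 8-15.
def raw_event(event: int, umask: int) -> str:
    """Pack an event select and umask into an 'rUUEE' style raw code."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in 8 bits")
    return f"r{(umask << 8) | event:04x}"


# Event/umask pairs copied from the list above.
EVENTS = {
    "MEM_LOAD_RETIRED.L1_MISS": (0xD1, 0x08),
    "MEM_LOAD_RETIRED.L2_MISS": (0xD1, 0x10),
    "MEM_LOAD_RETIRED.L3_MISS": (0xD1, 0x20),
    "ICACHE_64B.IFTAG_MISS": (0x80, 0x02),
    "L2_RQSTS.CODE_RD_MISS": (0x24, 0x24),
}

RAW_CODES = {name: raw_event(ev, um) for name, (ev, um) in EVENTS.items()}

if __name__ == "__main__":
    for name, code in RAW_CODES.items():
        print(f"{code}  {name}")
```

With perf, a code produced this way would be passed as something like `perf stat -e r08d1 ...`. Check the event list for your exact CPU model before trusting any of these on something other than Skylake.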

GTA V: some city driving, which tends to peg the CPU at times and hit some stutters

Overwatch: a game of quick play

ESO: running the Alik'r Desert dolmens, which have a mob of players farming them. Framerate often drops into the 30s or below. On a vet Halls of Fabrication raid run late last year, I recorded 0.88 IPC. Framerate was worse back then, but I don't raid anymore. Too many bugs...

Dark Souls III: continuously parry Gundyr (the first boss) and almost punch him to death before I get wrecked.

Let's dig into the factors affecting Skylake's performance in certain games and benchmarks. Skylake and other Intel CPUs have several performance counters that can be programmed with events, which are documented somewhat well in the Intel Software Developer's Manual (download the combined volume PDF from their site and start at page 3544 for Skylake/Kaby Lake/Coffee Lake events).

Starting with IPC as a quick overview:

Right away, we can see my i5-6600K achieves far lower IPC in games than in benchmarks. Ok, why? Let's look at how often load instructions cause cache misses:

I feel like misses per 1K instructions gives a better representation of how cache misses impact performance, because it accounts for how often load instructions occur in the instruction stream. You can have a really bad cache hit rate, but if you have very few load instructions, it doesn't really matter.

With all the benchmarks above except Geekbench's Dijkstra subtest, the i5-6600K's 6 MB L3 cache pretty much catches everything. Either the tests don't deal with big data sets, Skylake's prefetchers are really good, or both. The few games I tested, on the other hand, do miss L3 quite a bit.

Now a look at instruction cache misses:

That 32 KB L1 instruction cache is just not big enough for games, Cinebench, and a couple of tests in Geekbench (LLVM, SQLite). And instruction cache misses hurt more than data cache misses. With a data cache miss, you might have enough other instructions in flight to keep the execution units busy. An instruction cache miss reduces the amount of work the out-of-order execution engine can look at. GTA V and Overwatch really suffer here, with code fetches often missing L2 too.

Quick conclusion: Cache misses are a much larger problem in games than in some popular benchmarks. Games suffer heavily from both data and instruction cache misses. If we want to improve gaming performance, we need bigger caches and faster RAM to cover cases when we do miss in the last level cache.

I used both Intel's VTune software (available for free with a community license) and their open source PCM tool. PCM was used to get all the Geekbench data; I didn't use VTune there because it reported really bad mux reliability for Geekbench. VTune tries to collect data on a lot of performance events, but the CPU only has four programmable counters and three fixed function counters (instructions retired, unhalted cycles, reference cycles). So, VTune programs in a set of events, collects some data, reprograms the counters with another set of events, collects data, and so on. Somehow, it knows if this muxing isn't leading to good accuracy, and that seems to be the case for Geekbench. VTune was used to get data from Overwatch, ESO, GTA V, and Cinebench.

IPC is calculated as (instructions retired) / (core clock cycles when the logical processor is not in a halt state). That accounts for fluctuations in clock frequency and CPU idle time. If the OS doesn't schedule work on a CPU core, it's in a halted state.

Data cache misses are counted with the three MEM_LOAD_RETIRED events (L1_MISS, L2_MISS, L3_MISS) listed above. These events count retired instructions, which excludes instructions that are never retired (for example, those fetched after a mispredicted branch). They also don't give any info about how effective prefetches are. Another caveat: if an instruction requests data that results in a cache miss, subsequent instructions requesting data from the same 64-byte cache line will count as fill buffer hits, not as separate cache misses. Thus, the L2/L3 miss counters don't totally capture the impact of those cache misses. You could have a dozen instructions in close proximity requesting data that's not in L3. The first will count as an L3 miss, and the rest count as fill buffer hits. Ugh.

Instruction cache misses are counted with ICACHE_64B.IFTAG_MISS and L2_RQSTS.CODE_RD_MISS, also listed above. I'm still working on trying to count L3 code misses. There's OFFCORE_RESPONSE:request=DEMAND_CODE_READ:response=L3_MISS.ANY_SNOOP, but adding that and response=L3_HIT.ANY_SNOOP gives a count greater than L2_RQSTS.CODE_RD_MISS. That doesn't make any sense, especially when adding L2 code read misses and hits approximately equals L1i misses. These are also speculative events, meaning code fetches after a mispredicted branch that are later discarded are counted here.

What happened in games: see the per-game notes above.

Finally, an ask for reviewers (e.g. AnandTech): Can we get detailed analysis for benchmarked applications on different CPUs, so we can better understand how they stress different aspects of CPU microarchitectures? Tools are free from both Intel and AMD (CodeXL), and both publish docs listing performance monitoring events we can track.
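For anyone who wants to redo the math from raw counter totals, here's a rough sketch of the two derived metrics used in this post: IPC and misses per 1K instructions. The `Sample` class, the function names, and the numbers in the demo are all made up for illustration. They are placeholders, not measurements from any of the games or benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Raw counter totals for one run (hypothetical container)."""
    instructions_retired: int
    unhalted_core_cycles: int  # core cycles while the logical CPU is not halted
    l1_misses: int             # MEM_LOAD_RETIRED.L1_MISS
    l2_misses: int             # MEM_LOAD_RETIRED.L2_MISS
    l3_misses: int             # MEM_LOAD_RETIRED.L3_MISS


def ipc(s: Sample) -> float:
    """Instructions retired per unhalted core cycle."""
    return s.instructions_retired / s.unhalted_core_cycles


def mpki(misses: int, instructions: int) -> float:
    """Misses per 1,000 retired instructions."""
    return 1000.0 * misses / instructions


if __name__ == "__main__":
    # Placeholder counts chosen only to exercise the functions.
    demo = Sample(
        instructions_retired=2_000_000,
        unhalted_core_cycles=1_000_000,
        l1_misses=40_000,
        l2_misses=10_000,
        l3_misses=2_000,
    )
    print(f"IPC:     {ipc(demo):.2f}")
    print(f"L1 MPKI: {mpki(demo.l1_misses, demo.instructions_retired):.2f}")
    print(f"L2 MPKI: {mpki(demo.l2_misses, demo.instructions_retired):.2f}")
    print(f"L3 MPKI: {mpki(demo.l3_misses, demo.instructions_retired):.2f}")
```

Dividing by unhalted cycles rather than wall-clock time is what keeps idle time and clock frequency changes out of the IPC figure. Dividing misses by total retired instructions, rather than by loads, is what makes a workload with few loads show a low number even if its hit rate is poor.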