The following is a graphic I’ve used in the past to frame various software performance optimization techniques. It is not a comprehensive inventory of all software performance optimization techniques (or concerns) but I’ve found it serves a purpose in managing the amount of effort that, in general, should be spent on each technique outside of extreme cases such as trading platforms and profiling. The left side is your typical localized bottom up approach to speeding up code execution steps. Most developers involved in performance improvement efforts start on the left side because it feels less abstract and much more tangible than system dynamics or software adaptation, both of which require a much higher level of understanding of exactly what does happen outside of the code editor. On the right side, the focus is on the manipulation or mirroring of system execution behavior with the aim to globally control, circulate and conserve costs associated with coordination, contention, congestion, consumption, and coherence. An almost top down viewpoint. The left side addresses local execution efficiency; the right side more so global execution intelligence. But in the middle, we have execution heuristics which I myself view as the first and very basic step along a path towards making software smarter – a smartness that is somewhat independent of a particular benchmark environment. Software performance optimization heuristics once ascertained and adequately assessed, should feed back into many of the other performance optimization activities on either side.

It is easy to focus efforts on micro-benchmarking though in general the improvements reported rarely translate into measurable gains of the same magnitude within a real-world environment that is far more noisier than a micro-benchmark. In fact, it can prove counter-productive when it places more stress elsewhere in the pipeline processing. A developer can waste an enormous amount of time chopping off branches, here and there, and yet never acquire a real understanding of the forest that they or their applications operate within. Each chop off of a branch can indeed be rewarding but also frustrating when one takes a step back and sees the negligible impact in a much broader context. Unless the forest only holds a few trees, a micro-service or low latency service, it’s far too easy to lose all sense of perspective the longer one stays and stares down at The Underworld. Get in, get out quickly!

This Wikipedia definition places an emphasis on experience codified in rules of thumb and strategies though at times in the course of troubleshooting engagements it would seem developers rely far more on expectation (not formed by experience) than the actual experience (of some sort) in writing efficient code. I suspect this is because there is still not enough attention given to observing how software machines do actually operate; operate in the sense of motioned activity and not some disconnected metric created to feed an illusion of control held within the typical operations department – the dashboard.

If you take a look at many of the case studies put out by an application performance monitoring vendor you repeatedly come across the “surprise” that developers experience when the product solution throws up some data up into a dashboard that hints at an underlying performance or scalability issue. This is the big selling point here but I can’t help being saddened that as a supposed engineering profession we seem to be always surprised when we do learn how our creations actually execute and service others (man and machine). The love of creation needs to extend further into the lifecycle. I am convinced that it is because the software does not have memories, memories that can be recalled and perceived by a developer, operator or even the software itself for the purpose of offline/online training of behavior and prediction. The majority of software performance visualizations render aggregate (or resulting) data but not the motion of software activity at a level that lends well to how we ourselves process sensory data in the physical world. This is a topic I hope to come back to in a future article but for now, I would like to demonstrate how record and replay facilitate the production of software performance optimizations heuristics.

In Search of Heuristics

The first step in the creation of a heuristic is collecting a large amount of data, past experiences, to be mined for common patterns of behavior that our code could accommodate more efficiently and predictably. To obtain the experiential data for my particular need, in the optimizing of a high performance software activity metering engine, I ran up an Apache Cassandra server instance instrumented at runtime by Satoris and then pushed a large amount of workload to it using the distributed cassandra-stress tool. The Satoris agent, configured with the stenos extension enabled, created a 5GB file holding over 1.7 billion software execution events, a BEGIN and END event for each method invocation.

To simulate the playing back of the software execution behavior within the file I execute

> java -jar stenos.jar <stenos-recording-file>

It takes just over a minute to simulate the playback of the entire file, that is more than 25M events/sec, as the simulation by default runs at the fastest possible speed whilst still recreating all metered threads and having them simulate the execution of instrumented and measured method invocations as they occurred within the real application runtime. The simulation is a near replica of the application without the actual bytecode and heap state needing to be present.

In order to mine the recording for common patterns I need only create a ProbesInterceptor[Factory] class that plugs into the simulation and receives callbacks when a thread ( Context ) begins and ends a method ( Probe ) invocation. Here is the configuration pattern used.

j.s.p.interceptor.enabled=true j.s.p.interceptors=${interceptor},${interceptor},... j.s.p.interceptor.${interceptor}.factory.class=... j.s.p.interceptor.${interceptor}.factory.classpath=...

For my specific use case I wanted to understand the sequencing of calls within an execution scope, thread, stack or method, for the purpose of prediction and possible shortcutting of slow execution paths. For this I used a Cache class that maintained the last four method name lookups and in doing so updated both local and global statistics related to hits, misses and scans. The first strategy I investigated was one that held a single Cache instance per thread ( Context ). It is important to note that the interception code is only used to investigate (or simulate) the cache behavior. I don’t need to be concerned with the efficiency of the interception code only the effectiveness of cache strategy as the results will eventually be translated into some optimized behavior within the metering engine itself.

Note: The Cache.sync() call is used to propagate local cache instance statistics to a global statistics container.

The installation of the interception is achieved with the following configuration.

j.s.p.interceptor.enabled=true j.s.p.interceptors=tc j.s.p.interceptor.tc.factory.class=com.autoletics.probes.cache.ThreadCacheIntercept j.s.p.interceptor.tc.factory.classpath=./classes

Below are the cache statistics printed out following completion of the playback. 4 out of 10 cache lookups find a matching method name. On average it takes 2.9 cache field entry comparisons to find a hit. Note a scan for a miss will always be 4. The overall average scan cost for both hits and misses is 3.55. Not a terribly efficient approach but probably better than most would have expected which can be attributed to the adaptive hotspot strategy employed within the metering engine that dynamically disables and eliminates high-frequency low-cost methods from the measurement and in turn the recording. Being able to do this is extremely powerful with the interception code acting much like a query, and collector, executed against an active database of actual software execution behavior.

Instead of having a single Cache instance held per thread the next strategy employs a stack of Cache instances per thread. The assumption, or expectancy, being that a particular method will repeatedly appear at particular thread call stack depth(s).

Again the installation of the interception code is pretty straightforward to configure within the simulation.

j.s.p.interceptors=sc j.s.p.interceptor.sc.factory.class=com.autoletics.probes.cache.StackCacheIntercept j.s.p.interceptor.sc.factory.classpath=./classes

The results tabulated below show how effective this new stack based cache strategy is indeed – a near 100% cache hit rate. The average scan cost is between 2 and 3 entry fields. Again the cache strategy is helped immensely by the recording being intelligently filtered by the hotspot metering extension enabled within the Satoris agent during the actual real server benchmark.

The next, and final, cache strategy keeps track of a cache per probe (method). When a probe has started the cache of the caller probe (method) is searched. For root callers, a thread level cache is used. The stack structure maintained by each thread is used merely to keep track of the caller-and-callee chain for pushing and popping. There is no need to be concerned with the relatively expensive map access call because this would not be needed within the underlying metering engine as the probe implementation itself would offer direct access to the associated cache field entries. In all of the examples, I have kept to the Probes Open API available on Github.

Here is the configuration for the installation of the interception code.

j.s.p.interceptors=mc j.s.p.interceptor.mc.factory.class=com.autoletics.probes.cache.MethodCacheIntercept j.s.p.interceptor.mc.factory.classpath=./classes

The cache miss count has dropped down to an almost negligible amount and the scan cost per hit is now at 2.28. So a cache implementation with only 3 entries would suffice with this strategy. The strategy is most effective because of the typical degree of call fan-out in the code base as well as the intelligent disablement of irrelevant leaf callees at each caller node in the call tree by the hotspot metering extension, in both real and simulated environments.

The recording file would have contained some call event noise at the start as metering engine within the real application runtime observed and learned about the nature of the execution before deciding which methods to continue to measure and which methods to disable the instrumentation within. To factor this out from the results I ran the playback again but this time only activating the interception query code when a probe was labeled as a hotspot and unmanaged. This would reduce the overall call back invocations as interception would be deferred until later in the simulated benchmark playback.

j.s.p.interceptor.guard.name.labels=!managed

With the above configuration added the scan per cache hit is now below 2. With this strategy, a cache would only need to have two entry fields.

The slight discrepancies in the overall cache totals can be attributed to small differences in global ordering leading to the hotspot metering extension disabling probes earlier or later across playback executions. Strict ordering of metered call event processing is maintained per thread.

In a follow-up article, I will demonstrate how to approach this software performance optimization using self-adaptive software techniques employed online during both real and simulated benchmark runs.