Background

At the weekend I posted a video on Vimeo titled “A Time Machine for Augmenting Past Software Execution in the Cloud”. In the video I demonstrate how, using an Aspect Oriented Programming (AOP) style interface to the Satoris metering engine embedded within Simz and Stenos, you can intercept software execution behavior across space (machine and process) as well as time (online and offline).

Imagine writing a method invocation interceptor class that is called within a process when a thread invokes a method – something that has been possible for some time using frameworks such as Spring and JEE/CDI, and AOP technologies such as AspectJ. Now imagine that the same interceptor class can intercept method invocations across multiple Java runtimes and threads without being present within any of those runtimes, and without a single line of code change, because it runs within a mirrored runtime that is replicated down to the thread and call stack as well as some environment state. Now, let’s go even further and imagine the very same interceptor class receiving the same callbacks within the same mirrored threads from a past recording, which can be run repeatedly. Again there are no code changes, though the space and time aspects of the entire environment have changed. The power of this is not just in the distribution of the interception across multiple machines but in the ability to go back in time in the execution of software and augment it with future capabilities.

Demonstration

In the video, I walk through four main distributed application monitoring scenarios. In the first part of the video I instrument a DataStax Cassandra server with Satoris, configured to send the metering data measured by the instrumentation applied at runtime to a remote Simz server. A monitoring console is connected to the Simz server, completely oblivious to the fact that the runtime it is monitoring is merely a mirror of another runtime. I then configure the Simz runtime to install a custom interceptor implementation I had written to output the call profile history for any request exceeding a specified clock time threshold. I could just as easily have installed the interceptor in the actual Cassandra server runtime, but then I would have had to worry about possible slowdowns caused by the interception and its writing to System.out, as callbacks are invoked within the thread of execution that is being metered (measured).

In the second part of the video I enable the Stenos extension within the Simz runtime to create a binary metering recording, similar to what is transmitted over the wire, that can be played back when the Cassandra server is not actually running. Finally, using Stenos, I recreate the entire simulated environment and software execution behavior, but this time with the interceptor extension enabled. The interceptor, now in an offline mode, behaves exactly as it did when online within the Simz runtime. Importantly, this is achieved without having to query a database. Instead, the interceptor experiences the software execution behavior as if it were present within the runtime of the application when it initially occurred.
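The record-then-replay idea can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: the `Listener` interface, the `Recorder` and `Collector` classes and the byte layout are stand-ins invented for illustration, not the Probes Open API or the actual Stenos recording format. The point is only that the same listener code produces the same results whether it is driven by live events or by a binary recording.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReplaySketch {

    /** Hypothetical callback interface standing in for an interceptor. */
    public interface Listener {
        void onEvent(String thread, boolean begin, String name, long clock);
    }

    /** Appends every event it sees to an in-memory binary recording. */
    public static final class Recorder implements Listener {
        private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        private final DataOutputStream out = new DataOutputStream(bytes);

        @Override
        public void onEvent(String thread, boolean begin, String name, long clock) {
            try {
                out.writeUTF(thread);
                out.writeBoolean(begin);
                out.writeUTF(name);
                out.writeLong(clock);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        public byte[] recording() {
            return bytes.toByteArray();
        }
    }

    /** Example "interceptor": collects one readable line per callback. */
    public static final class Collector implements Listener {
        public final List<String> lines = new ArrayList<>();

        @Override
        public void onEvent(String thread, boolean begin, String name, long clock) {
            lines.add(thread + (begin ? " > " : " < ") + name + " @" + clock);
        }
    }

    /** Reads the recording back and delivers each event to the listener, in order. */
    public static void replay(byte[] recording, Listener listener) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(recording));
            while (in.available() > 0) {
                // Java evaluates arguments left to right, matching the write order.
                listener.onEvent(in.readUTF(), in.readBoolean(), in.readUTF(), in.readLong());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Delivers one live event to both the interceptor and the recorder. */
    private static void emit(Listener a, Listener b, String thread, boolean begin, String name, long clock) {
        a.onEvent(thread, begin, name, clock);
        b.onEvent(thread, begin, name, clock);
    }

    public static void main(String[] args) {
        Recorder recorder = new Recorder();
        Collector online = new Collector();
        emit(online, recorder, "main", true, "Server.start", 0L);
        emit(online, recorder, "main", false, "Server.start", 420L);

        // Later, with the original application no longer running:
        Collector offline = new Collector();
        replay(recorder.recording(), offline);
        System.out.println(online.lines.equals(offline.lines));
    }
}
```

Because the listener only ever sees callbacks, it has no way of telling whether they originate from a running application or from a recording, which is the property the offline playback relies on.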

Here are the two interfaces within the Probes Open API implemented by any interceptor wishing to become a time lord of the Java and JVM universe.

Implementation

Below is the actual source code of the extension I used in the above video demonstration. Most of the code is related to environment configuration and reporting. Basically, the InterceptorFactory creates an instance of the Interceptor class for each thread that is metered (and possibly simulated) within the runtime (real or not, online or offline). The Interceptor class then keeps track of whether a call trace has been started or ended for its associated thread. When it enters a trace it creates a SavePoint, and when it exits the trace it generates a ChangeSet using the previously stored SavePoint for the same thread Context. It then checks whether the delta for the clock time Reading exceeds a threshold and, if so, dumps the ChangePoint and Change instances within the ChangeSet.
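The flow just described can also be sketched independently of the Probes Open API. The following is a minimal, self-contained illustration and not the extension's source: `TraceInterceptor`, its `begin`/`end` callbacks and the injected clock are hypothetical stand-ins, with a plain clock reading playing the role of the SavePoint and a list of call names playing the role of the ChangeSet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

public class SlowTraceSketch {

    /** Hypothetical per-thread interceptor; not part of the Probes Open API. */
    public static final class TraceInterceptor {
        private final LongSupplier clock;   // assumed clock time source, in microseconds
        private final long threshold;       // report traces whose clock delta exceeds this
        private final List<String> calls = new ArrayList<>();
        private final List<String> reports = new ArrayList<>();
        private int depth;                  // current call depth within the trace
        private long savePoint;             // clock reading taken when the trace started

        public TraceInterceptor(LongSupplier clock, long threshold) {
            this.clock = clock;
            this.threshold = threshold;
        }

        /** Called when the metered thread enters a method. */
        public void begin(String name) {
            if (depth == 0) {               // outermost call: a new trace starts here
                savePoint = clock.getAsLong();
                calls.clear();
            }
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");          // indent by call depth
            }
            calls.add(line.append(name).toString());
            depth++;
        }

        /** Called when the metered thread exits a method. */
        public void end(String name) {
            depth--;
            if (depth == 0) {               // outermost call returned: the trace is complete
                long delta = clock.getAsLong() - savePoint;
                if (delta > threshold) {
                    reports.add(name + " took " + delta + "us\n" + String.join("\n", calls));
                }
            }
        }

        public List<String> reports() {
            return reports;
        }
    }

    public static void main(String[] args) {
        long[] now = {0};                   // fake clock so the run is deterministic
        TraceInterceptor interceptor = new TraceInterceptor(() -> now[0], 100);

        interceptor.begin("Server.handle");
        interceptor.begin("Store.read");
        now[0] += 250;                      // simulate 250us spent in Store.read
        interceptor.end("Store.read");
        interceptor.end("Server.handle");

        interceptor.reports().forEach(System.out::println);
    }
}
```

Injecting the clock rather than reading it directly is a deliberate choice in this sketch: it lets the same interceptor be driven by real measurements, by a recording, or by synthetic timings in a test.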

Here is a sample of the output produced by the above custom interceptor during the startup of the DataStax Cassandra server.

Below is a table I’ve found useful in explaining the underlying model of the metering engine and the mapping from concrete to abstract that allows both Simz and Stenos to simulate any software execution behavior, and not just within the Java runtime. It is not meant to be a comprehensive mapping, and the naming within the generic model is debatable, but hopefully it helps in making the above code appear less foreign and complicated when it is in fact very straightforward and natural.

Can’t a time lord travel, or see, into the future? Using the Probes Open API I can simulate the future execution of method invocations and the timing of such invocations before writing actual code – ideal for testing interceptors that look to detect faults that have yet to occur in production.

Considerations

Lately, there has been a renewed interest in playback debuggers, such as Apple’s Swift Playground, as well as dynamic trace toolkits, with the latest insane idea being the injection of JavaScript callbacks into native binaries and then mucking around with memory pointers to target objects and arguments. JavaScript-based interception is something I previously did for one particular customer using the Rhino library (which I should revisit now that Nashorn is available in Java 8), but allowing such a language to manipulate native memory, in-process, is crazy, if fun. Anyway, back in 2010, when I was developing the earlier versions of Stenos and Simz, I had a few key design requirements and thoughts that are worth keeping in mind if you are considering going crazy with JavaScript and dynamic tracing.

– Dynamic runtime adaptation of software execution behavior should be done by way of behavioral signals and signal aware software libraries. Probe interceptors and metering extensions should influence software execution behavior through the raising of and damping of adaptive signals.

– Developers and Operations should be able to decide the degree of isolation for any dynamic tracing or runtime augmentation. Later this evolved into the possibility to distribute probe interceptors and metering extensions across process and machine boundaries.

– To support distribution and isolation of interceptors or metering extensions the code cannot directly reference a class or method. Instead, the code uses the reflective namespace and conceptual runtime model provided by the Probes Open API.

– It should be possible to simulate and test interceptors and metering extensions in a sandbox. Later this turned into the ability to record calls into the Probes Open API and then play them back in a separate process, offline or online, with the entire application runtime mirrored including thread and call stack creation as well as meter readings.

– The performance impact on the runtime should be negligible, whether in-process or out-of-process. This is the most crucial success factor, and one that other tools have failed, and still fail to this day, to achieve. With the Satoris hotspot metering extension, I was able to instrument a significant amount of any codebase and then at runtime whittle this down using the self-aware and self-adaptive capabilities of both the instrumentation and measurement code.

– The probes interceptor or metering extension should not be concerned with the actual machine or process boundaries of the intercepted thread execution. The software execution behavior is paramount – leading me to envisage a kind of Matrix for the (Java Virtual) Machine in which the real-time behavior of multiple connected JVMs is streamed into a single observation and control plane, creating machine consciousness. I later came across research work into mirror neurons in the human mind.

When it came to the deployment of the technologies into a production environment with hundreds and thousands of machines and nodes, a few technical challenges arose, including:

– How to compress the streaming protocol between client and mirrored runtimes such that a single method entry or exit event is described in as little as 4 bytes.