The Architecture of A Machine Learning Framework

As soon as we knew a bit about the art of machine learning, we eagerly advanced to actually creating the framework. Because there are no guides for that, we resorted to reading the source code of established frameworks — all the parts relevant to us, many times, until we understood their internal structure and control flow. There is no special ingredient here; all it took was time and electricity. In the meantime, we had decided to use C# as our primary language — mostly because we were already very familiar with it and didn’t want to also have to learn a new language, but officially also because there were no proper neural network frameworks for .NET.

Alongside reading the source code of machine learning libraries (mainly Deeplearning4J, Brainstorm and Tensorflow) we sketched out how we wanted our own framework to be used. We felt like there was some unnecessary confusion in getting to know machine learning frameworks as an outsider, and we set out to design our API to avoid that. Note that just because our design makes sense to us doesn’t mean it makes more sense to other people than the existing ones, nor do we recommend that everyone wishing to use machine learning write their own framework, if only to spare their sanity.

How to Talk Machine Learning to a Framework

How do you make any framework do what you want it to do? How do you get it to train a specific model from some specific data using a specific optimiser on some specific hardware while visualising the outputs in some specific configuration? There are a great number of things a machine learning framework should be able to do, and all of them should be easily usable, configurable, interchangeable, and readable. This is not a problem unique to machine learning frameworks; all kinds of programming frameworks are supposed to be used in some specific way. Because everything depends on this user-facing side, it’s usually considered first, so that’s what we did too.

Many of the well established machine learning frameworks support the general workflow of defining either the computation graph directly or the model structure using layers (as with neural networks). We thought the latter was easier for newcomers, because you wouldn’t even have to know what a computation graph is, and adopted it for our design. Our envisioned workflow was inspired by our mostly object-oriented programming experience, as is evident from our first “official” code example draft:

// create a Sigma environment to contain and manage everything else
Sigma sigma = Sigma.Create("mnisttest");

// optionally add "monitors" to monitor the environment (e.g. in a GUI)
GUIMonitor gui = (GUIMonitor) sigma.AddMonitor(new GUIMonitor("Sigma GUI Demo"));
gui.AddTabs({"Overview", "Data", "Tests"});

// tell the monitors to get ready before adding trainers
sigma.Prepare();

// define a dataset to use with our data processing pipeline (ETL style)
DataSetSource inputSource = new MultiDataSetSource(new FileSource("mnist.inputs"), new CompressedFileSource(new FileSource("mnist.inputs.tar.gz"), new URLSource("http://….url…../mnist.inputs.tar.gz")));
DataSetSource targetSource = new MultiDataSetSource(new FileSource("mnist.targets"), new CompressedFileSource(new FileSource("mnist.targets.tar.gz"), new URLSource("http://….url…../mnist.targets.tar.gz")));
DataSet data = new DataSet(new ImageRecordReader(inputSource, {28, 28}).Extractor({ALL} => {inputs: {Extractor.BatchSize, 1, 28, 28}}).Preprocess(Normalisor()), new StringRecordReader(targetSource).Extractor({0} => {targets: {Extractor.BatchSize, 1}}));

// define a network architecture using neural network layers
Network network = new Network("mynetwork");
network.Architecture = Input(inputShape: {28, 28}) + 2 * FullyConnected(size: 1024) + SoftmaxCE() + Loss();

// create a trainer within the previously created environment
Trainer trainer = sigma.CreateTrainer("mytrainer");

// assign structural parameters to the trainer (network, initialiser, data)
trainer.SetNetwork(network);
trainer.SetInitialiser(new GaussianInitialiser(mean: 0.0, standardDeviation: 0.05));
trainer.SetTrainingDataIterator(MinibatchIterator(batchSize: 50, data["inputs"], data["targets"]));
trainer.AddNamedDataIterator("validation": MinibatchIterator(batchSize: 20, inputs: validationData["inputs"], targets: validationData["targets"]));

// assign behavioural parameters to the trainer (optimisers, hooks)
trainer.SetOptimiser(new SGDOptimiser(learningRate: 0.01));
trainer.AddActiveHook(EarlyStopper(patience: 3));
trainer.AddActiveHook(StopAfterEpoch(epoch: 2000));

// configure optional settings for monitors or other systems
gui.AccentColor["trainer1"] = Colors.DeepOrange;
gui.tabs["overview"].AddSubWindow(new LineChartWindow(name: "Error", sources: {"*<Trainer>.Training.Error"}));
gui.tabs["overview"].AddSubWindow(new LineChartWindow(name: "Accuracy", sources: {"*<Trainer>.Training.Accuracy"}));

// start the environment (that starts the trainers that start the operators
// that start the workers that start the actual training)
sigma.Run();

All in all, it was intended to look and feel more like a smart configuration file than actual programming, as we thought that would be the easiest to read, understand and write. Our naïve ideas of what a machine learning framework should look like were inspired by our C#/Java-based programming experience.



It should be noted that the final framework is very similar to what we envisioned early on in these code examples: after adjusting a few syntax tidbits and swapping in the exact names, the above example from about a year ago can be used 1:1 in our current framework. The jury is still out on whether that’s a sign of good or really bad design. Also note the Python-style keyword notation for layer constructor arguments, which was soon discarded in favour of something that actually compiles in C#.

Core Components of Our Machine Learning Framework

After all this research, the in-depth code examples, and the structure sketches we thought our framework needed, we finally arrived at the principal architecture of what we call “Sigma.Core”. Our overall architecture is divided into core components which represent individual namespaces (logically separate groups of functionality and code). Core components interact with each other through exposed interfaces and the lifecycle. While our lifecycle was designed upfront, most of the interfaces were defined and changed as needed.

Utils: Common Helpers, Observers, Exceptions… and Registries

Utils contains mostly boring and standard, well, utility stuff — but also registries, which represent an enhanced key-value store and are a key part of our architecture. Registries enable us to keep a global access-protected and type-protected data store across multiple threads and even processes. This originated from our desire to analyse and visualise everything in any way, which required a global way to access everything by identifier — a registry.

Our registry implementation is a classic key-value table (i.e. a “dictionary”) with a string key and a value of any type, which in itself may contain more registries. The type of a value may be restricted using a special data type table, which protects it from nasty errors (e.g. when a value changes from 0.4 to “banana”). Nested registries are resolved using registry resolvers in dot notation, like “network.layers.1-input”. Nested identifiers may also include fancy wildcards and type tags in angle brackets (e.g. “network.layers.d*&lt;fc&gt;.weights” for the weights of all layers tagged as “fc” whose names start with “d”).

// verbose usage example with two sub-registries
Registry root = new Registry(tags: "root");
Registry trainer1 = new Registry(root, tags: "trainer");
Registry trainer2 = new Registry(root, tags: "trainer");
RegistryResolver resolver = new RegistryResolver(root);
root["trainer1"] = trainer1;
root["trainer2"] = trainer2;

// declare parameters in registry
trainer1["accuracy"] = trainer2["accuracy"] = 0.0f;

// set "accuracy" in all sub-registries tagged "trainer"
resolver.ResolveSet("<trainer>.accuracy", 0.02f);

Besides registries, the Utils component also defines time-dependent variables and constants. These constants are used for timekeeping, to communicate about certain events happening, such as an optimisation iteration, a pause in execution, or a complete reset. All these events are what we call time scales — abstract units of recurring occurrences that we might want to time against. The timing itself is done through time steps, which are countdowns of a certain time scale event happening a certain number of times. This is particularly convenient for executing specific code when e.g. the optimisation algorithm has completed 10 iterations or the trainer was halted again.
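The countdown mechanic behind time steps can be sketched from scratch in a few lines. The names below mirror the text but are an illustrative reimplementation, not Sigma's actual code:

```csharp
using System;

// abstract units of recurring events we might want to time against
public enum TimeScale { Iteration, Epoch, Reset }

public class TimeStep
{
    public TimeScale Scale { get; }
    public int Interval { get; }
    private int _countdown;

    public TimeStep(TimeScale scale, int interval)
    {
        Scale = scale;
        Interval = interval;
        _countdown = interval;
    }

    // called whenever a time scale event occurs; returns true (and resets
    // the countdown) every time the matching event has occurred Interval times
    public bool Tick(TimeScale occurred)
    {
        if (occurred != Scale) return false;
        if (--_countdown > 0) return false;

        _countdown = Interval;
        return true;
    }
}
```

A hook registered with `new TimeStep(TimeScale.Iteration, 10)` would then fire on every 10th optimisation iteration, exactly the “every 10 iterations” case described above.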

Data: Datasets, Data Processing, Data Extraction, Data Sources

The data component is — very surprisingly — everything data. It contains

the actual datasets,

the data record blocks that make up datasets in various formats,

the data records that make up data blocks in various formats,

the data buffers that make up data records in various formats, and

the pipeline to load, extract, prepare and cache data blocks from disk, web, or wherever else, and make them available to datasets.

We support two kinds of datasets: extracted and raw. In contrast to extracted datasets, which are extracted from an external source, raw datasets are “manually” populated from code (useful for debugging and experimentation). Data record blocks are parts of a dataset and consist of many individual records, each representing one data row. To avoid loading the entirety of a potentially very large dataset into memory at once, we employ partial data record blocks, which are then further split up by data iterators before being fed to the model.
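The block-then-minibatch idea above can be sketched from scratch (this shows the general splitting scheme only, not Sigma's data iterator code): a loaded partial record block is cut into fixed-size minibatches, with a possibly smaller remainder batch at the end.

```csharp
using System;
using System.Collections.Generic;

public static class BlockSlicing
{
    // cut one loaded record block into minibatches of at most batchSize records
    public static IEnumerable<T[]> Minibatches<T>(T[] block, int batchSize)
    {
        for (int offset = 0; offset < block.Length; offset += batchSize)
        {
            int size = Math.Min(batchSize, block.Length - offset);
            T[] batch = new T[size];
            Array.Copy(block, offset, batch, 0, size);
            yield return batch; // lazily, so only one batch is materialised at a time
        }
    }
}
```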

In practice, the code for reading even moderately complex data streams into compliant record blocks turned out rather long and verbose. To balance out the need for detailed configuration in complex cases, we added simplified templates as well as ready-to-use datasets. For example, this is the full code for the processing pipeline of the popular MNIST images (28x28 monochrome digit images for classification):

// get from disk if already available, otherwise download remotely
IDataSource localInputSource = new FileSource("train-images-idx3-ubyte.gz");
IDataSource onlineFallbackInputSource = new UrlSource("<url>/train-images-idx3-ubyte.gz");

// then decompress automatically
IDataSource inputSource = new CompressedSource(new MultiSource(localInputSource, onlineFallbackInputSource));

// read source bytewise into 784-long (28x28) records, skip header
IRecordReader mnistInputReader = new ByteRecordReader(headerLengthBytes: 16, recordSizeBytes: 28 * 28, source: inputSource);

// extract entire record range (0-28 pixels along each dimension) into inputs sub-block
IRecordExtractor mnistInputExtractor = mnistInputReader.Extractor("inputs", new[] { 0L, 0L }, new[] { 28L, 28L });

// normalise 8-bit greyscale input values
mnistInputExtractor.Preprocess(new NormalisingPreprocessor(0, 255));

// one-hot encode targets (labels from 0 to 9 for each digit)
// (the target reader and extractor are defined analogously from the label source)
mnistTargetExtractor.Preprocess(new OneHotPreprocessor(minValue: 0, maxValue: 9));

Dataset dataset = new Dataset("mnist-training", Dataset.BlockSizeAuto, mnistInputExtractor, mnistTargetExtractor);

// use 80% of each block as training data, rest as validation
IDataset[] slices = dataset.SplitRecordwise(0.8, 0.2);
IDataset trainingData = slices[0];
IDataset validationData = slices[1];

// iterate ready-to-use record blocks with minibatch size of 1;
// the output of the data iterator is directly fed to the model
MinibatchIterator trainingIterator = new MinibatchIterator(1, trainingData);

Architecture: Abstract Model Layout Definitions

Abstract definitions for machine learning models made of layer constructs. Constructs are lightweight placeholder layers defining what a layer will look like before it’s fully instantiated: only behaviour and parameters, without the heavy memory footprint of a full layer. These layers may be in any order (though it’s advisable to put inputs first and outputs last) and connected to any number of other layers.

// verbose manual definition of layer constructs
// (the # represents automatic numbering by order)
LayerConstruct input = new LayerConstruct("#-input", typeof(InputLayer));
input.Parameters["shape"] = new int[] { 4 };

// simplified manual definition of layer constructs
LayerConstruct output = OutputLayer.Construct(3);

// manual direct linkage of two layer constructs
input.AddOutput(output);
output.AddInput(input);

In the above example, input and output constructs are defined and linked manually. Manual linkage and configuration are supported to facilitate arbitrarily linked network architectures beyond linear models. In contrast to these point-to-point models, linear models may be defined through a more intuitive, simplified “stack-via-plus” notation:

Network.Architecture = InputLayer.Construct(4)
    + FullyConnectedLayer.Construct(12)
    // multiplication (*) may be used to duplicate architecture
    + 2 * FullyConnectedLayer.Construct(3)
    + OutputLayer.Construct(3)
    + SoftMaxCrossEntropyCostLayer.Construct();

Layers: Neural Network Layer Implementations

“Layers” is an unfortunate misnomer, since the “Layers” component design includes all types of layered structures, not only neural network layers. We started out with just neural networks, but later expanded our architecture to all kinds of machine learning structures that can be divided into “layers”. Nevertheless, a layer in our implementation is for all intents and purposes a neural network layer. Analogous to neural network layers in theory, “our” layers are defined by their

size (in all dimensions),

other meta parameters (e.g. name, activation function),

trainable parameters (e.g. weights, biases), and

behaviour (in code, inferred from their instantiation type).

Note that the split into meta parameters and trainable parameters is a cosmetic one and not strictly necessary; it is implemented for usability. The layer-type-specific behaviour is implemented in each layer’s ILayer.Run function, which is called by the owning trainer in every iteration of the optimisation algorithm. Precisely, the to-us-mystical layer function is defined in code as:

void Run(ILayerBuffer buffer, IComputationHandler handler, bool trainingPass);

The layer buffer interface bundles all relevant transient parameters required for a single invocation of the run function; that is, all parameters, inputs from the previous layer and outputs to the next layer. It represents a data container without any special behaviour, merely used to reduce clutter when using the function. As the name IComputationHandler suggests, the computation handler is used to define computations on the parameters in the buffer. The less exciting “is training pass” flag is used to disable training features (such as randomisation) in production mode.
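To make this concrete, a fully connected layer's Run could look roughly like the following sketch. The buffer and handler member names (Inputs, Outputs, Parameters, Add, Sigmoid) are assumptions for illustration; only Dot appears verbatim elsewhere in this text:

```csharp
public void Run(ILayerBuffer buffer, IComputationHandler handler, bool trainingPass)
{
    // gather the transient state for this invocation from the layer buffer
    INDArray inputs = buffer.Inputs["default"];
    INDArray weights = buffer.Parameters["weights"];
    INDArray biases = buffer.Parameters["biases"];

    // define the computation through the handler; the raw data is never touched directly
    INDArray weighted = handler.Add(handler.Dot(inputs, weights), biases);
    buffer.Outputs["default"] = handler.Sigmoid(weighted);
}
```

Since all work goes through the handler, the same layer code runs unchanged on any backend.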

Math: Low-level Mathematical Variables and Relations

The math component is exactly what you would expect (or maybe not, our models can’t predict your expectations yet): mathematical and low-level computational definitions, i.e. mathematical variables and their relations. All mathematical variables are programming objects and define interfaces for other variables to interact with by means of operations in the computation handler. These objects can either be scalars (represented as INumbers) or n-dimensional arrays (e.g. vectors, matrices, all represented as INDArrays).

For further abstraction, the user is never presented with the live data, but rather with these abstract representations. Even when the data is requested, a copy is returned — the only way to modify the live data is through the given computation handler. This hassle of forcing every data manipulation through the computation handler is highly useful for asynchronous processing: the requested computations can be executed separately without having to synchronise data with the main thread all the time (enormously useful for multi-threading and GPU support). It also keeps the component cleaner by clearly separating the concerns of “what to do” and “how to do it”.

// create a new mathematical processor with 1 core, 32-bit precision
IComputationHandler handler = new CpuFloat32Handler();

// compute the matrix dot product of "a" and a new 3x4 matrix
INDArray c = handler.Dot(a, handler.NDArray(3, 4));

Beyond that, the heavy abstraction of mathematical objects neatly enables swapping and interchanging mathematical processing backends without disturbing the end user or the model developer. Want to use a single CPU core with 32-bit precision for development, but then deploy to your magic high-end multi-GPU server farm with 64-bit precision for optimal results? No problem: just change a line in the configuration (i.e. the trainer definition) and all your custom layers and models work exactly the same.

Training: Detailed Training Process Configuration

The largest component with many sub-components, all concerned with the actual training process. A training process is defined in a “trainer”, which is a container object that may specify the following components:

Initialisers define how model parameters are initialised, which can be configured with registry identifiers. For example,

trainer.AddInitialiser("layers.*.biases", new GaussianInitialiser(0.1, 0.0));

would initialise all parameters named “biases” with a Gaussian distribution scaled by 0.1 (mean 0). Similarly, weights and other parameters can be initialised to random (or other) distributions or custom constants.
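What such an initialiser does internally can be sketched from scratch (this is illustrative only, not Sigma's code): fill a flat parameter array with normally distributed values, here via the Box–Muller transform.

```csharp
using System;

public static class GaussianInit
{
    // fill parameters with samples from a normal distribution
    // with the given mean and standard deviation
    public static void Initialise(double[] parameters, double mean, double standardDeviation, Random random)
    {
        for (int i = 0; i < parameters.Length; i++)
        {
            // Box-Muller transform: two uniform samples -> one standard normal sample
            double u1 = 1.0 - random.NextDouble(); // shift to (0, 1] to avoid Log(0)
            double u2 = random.NextDouble();
            double standardNormal = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Sin(2.0 * Math.PI * u2);

            parameters[i] = mean + standardDeviation * standardNormal;
        }
    }
}
```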

Modifiers modify registry-identifiable parameters according to specific rules at runtime, for example to clip weights to a certain range. Modifiers are a feature we observed in another machine learning framework and deemed convenient for quick prototyping; as such, they were intended to be the simplest way of specifying rules for parameters. However, as we invested a lot of time into improving the usability of the substantially more powerful hook system with similar templates, the modifier system became obsolete.

Optimisers define how a model learns (e.g. gradient descent). Because we mainly considered neural networks, we only implemented gradient-based optimisers. However, since there are no algorithmic constraints on the optimiser, the interface theoretically supports any kind of optimisation algorithm, even randomised or genetic ones. For reference, the relevant method from the API, which defines a single optimisation step (i.e. iteration):

/// <summary>
/// Run a single iteration of the network (model) optimisation
/// process (e.g. backward pass only).
/// Note: The gradients are typically used to update the parameters
/// in a certain way to optimise the network.
/// </summary>
/// <param name="network">The network to optimise.</param>
/// <param name="handler">The computation handler to use.</param>
void Run(INetwork network, IComputationHandler handler);

Hooks “hook” into the training process at certain time steps and execute arbitrary code. Communication between hooks — albeit only rudimentary — is realised using a shared global registry. Using additional helper logic, hooks can be applied conditionally when certain criteria are met, e.g. if the parameter “error” hasn’t decreased for over 5 iterations.

Often, the kind of logic you would want to implement as a hook is very similar to a basic “if this, then that” system — if a new top score has been reached, print all metrics and sound a notification. Or if 1000 iterations are completed and the score hasn’t increased for 5 iterations, stop the training process and store the current network on disk.

The “if this” part is accomplished using the aforementioned criteria, which are used to form conditional pseudo-statements like

IF <parameter> INCREASES | DECREASES | REACHES <value> DO <...>

Such statements may also include a repeat specifier if the condition has to remain true for a certain number of time steps before the criterion is met (e.g. the score has to decrease 5 times in a row). Multiple criteria may also be combined into a new criterion using the classic Boolean operators (AND, OR, NOT).
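In code, composing such criteria might look like the following sketch. All type and parameter names here are hypothetical, chosen only to mirror the pseudo-statement above; they are not the actual API:

```csharp
// IF "error" has not decreased (5 time steps in a row)
// OR 2000 epochs are reached DO stop the training process
var stopCriteria = new ParameterNotDecreasedCriteria("error", repetitions: 5)
    .Or(new TimeScaleReachedCriteria(TimeScale.Epoch, 2000));

trainer.AddActiveHook(new StopTrainingHook(stopCriteria));
```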

The “do that” part can truly be anything, but there are a few common themes:

loading network state or parameters (mostly custom / inline),

storing network state or parameters (Saviors),

computing metrics based on network state and parameters (Processors),

scoring network performance using validation datasets (Scorers),

printing anything to console, file or network (Reporters),

for each of which there are multiple templates and base classes to use, or to extend where insufficient. Of course, with multiple hooks and multiple worker threads a problem quickly arises: how do we resolve dependencies? What happens when one hook requires the result of another hook?

Hook dependency management to the rescue! This unassuming sub-component turned out to be tricky due to a few unforeseen difficulties. The main reason for supporting managed dependencies was to move the burden of ensuring properly ordered execution of all hooks from the user to the framework. Thus, our system now has to figure out which hooks resolve to which dependencies, what to do with cyclic dependencies (hint: ban them), and so on. This part of the problem can be solved fairly easily using a dependency graph and by ordering certain hooks by priority (i.e. first hooks that get data, then hooks that process data, then hooks that print data).
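The ordering itself is a standard topological sort over the dependency graph. A from-scratch sketch (not Sigma's actual code) that orders hooks so that dependencies come first and bans cycles:

```csharp
using System;
using System.Collections.Generic;

public static class HookOrdering
{
    // order hooks so that every hook comes after the hooks it depends on;
    // cyclic dependencies are banned and reported as errors
    public static List<string> Order(Dictionary<string, string[]> dependsOn)
    {
        var ordered = new List<string>();
        var state = new Dictionary<string, int>(); // 1 = visiting, 2 = done

        void Visit(string hook)
        {
            if (state.TryGetValue(hook, out int s))
            {
                if (s == 1) throw new InvalidOperationException("cyclic hook dependency at " + hook);
                return; // already ordered
            }
            state[hook] = 1;
            if (dependsOn.TryGetValue(hook, out string[] deps))
                foreach (string dep in deps) Visit(dep);
            state[hook] = 2;
            ordered.Add(hook); // dependencies first, then the hook itself
        }

        foreach (string hook in dependsOn.Keys) Visit(hook);
        return ordered;
    }
}
```

With a "fetch → process → print" chain, the sort always schedules the data-fetching hook before the processing hook and the processing hook before the printing hook, exactly the priority ordering described above.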

For some reason we did not anticipate that correctly ordering the hooks wouldn’t solve the actual execution part when multiple threads are involved — which is always the case in our multithreaded operator/worker architecture. Firstly, the worker thread shouldn’t be “distracted” from its actual job (i.e. doing optimisation) for too long while executing these hooks. This can be countered by setting a limit on the amount of time a hook may take and offloading “slow” hooks to a separate worker thread. Naturally, this creates another set of difficulties, namely that this separate thread may not access the original data directly. We can’t just copy everything either, as that’s very slow, so we first have to figure out which part is actually needed, copy only that, and then dispatch.

Secondly, and much more painfully, there may be cross-region hooks and therefore cross-thread dependencies. As every person who has ever tried to do multiple difficult things at the same time knows, multithreading isn’t easy. It becomes even harder in performance-critical applications that need to exchange information (i.e. the parameters) and then execute conditional code on shared data based on that information. After lots of trial-and-error, our final solution was satisfyingly simple: do the same thing we did for the first problem, just with more hooks bundled together. We figure out which hooks need to be in such a “bucket execution thread” together by analysing their dependencies, owners and thread ids, which is a basic sorting problem. Tada.

Operators: Training Management and Work Delegation

Operators operate the training process. They delegate work to workers and then combine their results according to user-configured parameters. Further, operators are an essential design point, enabling the simple deployment of multi-core, multi-GPU or even multi-/cross-device processing during training. Key to this design is our separation of “global” and “local” processing: the global scope holds the most recent public version of all data in the operator, while the local scope is individual to each worker.

The global state is fetched by workers into their respective local scopes. The workers then proceed to duly do their work within their scope, handle events on their own, and report back with their results when they’re done with an iteration. A global time step event is emitted when all local workers have submitted their work for that time step (e.g. iteration), facilitating fine control in distributed learning (e.g. a notification when everyone is done).
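Put as pseudocode, each worker follows the fetch/work/report cycle described above (the member names here are illustrative, not the actual worker API):

```csharp
while (worker.IsRunning)
{
    worker.SyncWithGlobal();        // fetch the newest global parameters into the local scope
    worker.RunTrainingIteration();  // do the actual optimisation work locally
    worker.ReportToOperator();      // push results; once all workers have reported,
                                    // the operator emits the global time step event
}
```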

Handlers: Low-level Mathematical Processing

The direct low-level processing of mathematical operations is done in the Handlers component. Our backend handlers are specialised mathematical processors that execute mathematical operations for a certain system or device using a certain data type and precision. They apply the operations defined in the Math component, specified on placeholders of n-dimensional arrays and scalars, to the raw data. There is currently no lower bound imposed on the accuracy of mathematical operations, giving programmers and maintainers the freedom to favour speed over accuracy when implementing optimised routines.

Backend handlers may implement their processing in whatever way they like, as long as they observe two important restrictions:

They may not complain or otherwise act up when multiple threads simultaneously request the same operations on different data.

When the underlying data of a variable is requested, all operations concerning it must be finished before the data is returned.

This may sound trivial, but it actually requires the backend handler to keep tabs on all ongoing operations across all variables and to tidy up quickly when someone needs to peek under the hood. For reference, our CUDA (GPU) handler accomplishes this by duplicating all host operations and variables to the GPU and keeping the host memory version as a “shallow copy”. After initial synchronisation, transfers are only done when a result is requested.

Sigma: Global Environments For Trainers

The main component, which creates and manages Sigma environments. Sigma environments are containers and laid-back managers of all the action — they loosely connect trainers, monitors and operators and enable them to pass messages. A Sigma environment may contain multiple trainers, each of which may be attached to multiple independent monitors simultaneously. The only requirement a Sigma environment imposes is that all components’ lifecycles must end before it can itself shut down gracefully (it runs in its own thread).

Monitors: Talking to The Outside

Because monitors were meant to be separately usable components, they reside outside the core project. Nevertheless, monitors are important components that, when attached to a corresponding Sigma environment and trainer, can provide managed external access to the training process. Essentially, they are how you would typically interact with a Sigma trainer when you’re not a framework programmer — for example with graphical applications, monitoring websites, external logging and so on.

Monitors can fetch any kind of information from the global training data registry, e.g. for visualisation or logging. Special behaviour like shutdown can be injected using commands, a special form of hooks that are only invoked once. Due to their logical separation from other Sigma components, monitors can be used (almost) independently and can also be pretty much anything.