Introduction

TensorFlow is a machine learning system that Google developed internally and has since open sourced, and a considerable community has grown around its programming paradigm. I intend to cover the major building blocks and architecture of TensorFlow here.

TensorFlow is developed for heterogeneous environments ranging from powerful GPUs to lightweight mobile devices, and it can be used for both inference and training on such varied devices. TensorFlow is the second generation of ML infrastructure developed at Google, learning from the experiences of DistBelief, its predecessor.

Core improvements over DistBelief seem to be the following. Some of these might seem a bit abstract at this point, but will become clear as we dive into details.

- Describing computations in a dataflow-like paradigm
- The ability to train and run models on very heterogeneous systems, ranging from a mobile device to a powerful server farm
- Expressing parallelism in a better way, allowing many nodes to update shared parameters
- A unified interface, instead of large-scale deployments for training and separate small-scale ones for inference, which used to cause undesirable abstractions in the system
- Performance improvements

TensorFlow Programming Paradigm

TensorFlow programs are written in a frontend language of the programmer's choice, like Python or C++. These programs get converted into a directed graph. This graph represents a dataflow computation, with data flowing between nodes via the graph edges. Each node can have multiple inputs and outputs, and it represents a certain operation. Most of the data that flows through the graph is represented as Tensors. Tensors are typed arrays of arbitrary dimensions, a construct quite useful when you are dealing with matrices or systems of equations.
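To make the dataflow idea concrete, here is a toy sketch in plain Python (not TensorFlow itself; the node names and the run_graph helper are invented for illustration). Each node is an operation, edges carry values, and a node fires once all of its inputs have been produced:

```python
# Toy dataflow executor: repeatedly fire any node whose inputs are ready.
def run_graph(graph):
    values, pending = {}, dict(graph)
    while pending:
        for name, (op, inputs) in list(pending.items()):
            if all(i in values for i in inputs):  # all dependencies produced
                values[name] = op(*[values[i] for i in inputs])
                del pending[name]
    return values

# z = x*x + y/5 expressed as a graph of named nodes over 1-d "tensors"
graph = {
    "x":  (lambda: [10, 20, 30, 40], []),
    "y":  (lambda: [1, 2, 3, 4], []),
    "xx": (lambda a, b: [p * q for p, q in zip(a, b)], ["x", "x"]),
    "y5": (lambda a: [p / 5 for p in a], ["y"]),
    "z":  (lambda a, b: [p + q for p, q in zip(a, b)], ["xx", "y5"]),
}

print(run_graph(graph)["z"])
```

Independent nodes (here x and y) have no ordering between them, which is exactly what lets a real runtime execute them in parallel.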

There are also control and looping constructs that can be overlaid on the graph for better control. Let's define some fundamental elements of the TensorFlow system.

Operation: An operation represents an abstract computation, such as matrix multiplication, whose attributes are known at the time of graph construction.

Kernel: A kernel is a device-specific implementation of a given operation, e.g. an implementation specific to a GPU.

Session: Most programs create a session and use it to run the dataflow computation. The Run interface computes the graph dependencies and then places nodes on appropriate devices. The idea is to set up the graph once and then run data through it many times. One can also run a subgraph by setting up the appropriate inputs and outputs.

Variables: Variables can be used for storing data across multiple runs of the graph. Most tensors are temporary and get deallocated during a given run. Parameters of a model are good candidates for storing in Variables, as they are generally updated after every run of a subgraph.

Tensor: Tensors are typed arrays of arbitrary dimensions. They represent the data that flows through the system and act as inputs and outputs of graph nodes.

Devices: Devices execute the tasks given to them. A device gets identified using attributes such as its type, i.e. CPU vs. GPU. Devices manage their own memory and also schedule the tasks to be executed on them. Depending on the type of device, a special allocator can be used for allocating tensor memory.

TensorFlow architecture

In terms of system components, TensorFlow consists of a Master, Workers, and Clients for distributed coordination and execution. The Session interface described in the last section acts as the client that communicates with the master and worker processes. Workers arbitrate access to devices. The master plays an important role in coming up with the correct schemes for workers to execute. TensorFlow can execute in a completely local environment or in a distributed environment where all three of these entities reside on different machines.

Execution on Single or multiple Devices

LHS — Single GPU device. RHS — Heterogeneous devices each working on part of the subgraph

Executing on a single device is the simplest case: there is one worker and one device. Execution can leverage the dependencies built into the graph. Independent nodes can exploit the parallelism of the device, while a node that depends on other nodes waits until all of its dependencies have executed.

In the case of multiple devices, the problem becomes a scheduling problem. Which graph node should land on which device? What about all the data that needs to travel across device boundaries? One of the heuristics used in TensorFlow is to model the cost of a placement as a function of the input/output tensor sizes in bytes and the computation time. Computation time can be statically estimated or based on prior execution times. Using these, the scheduler runs a simulation starting with the input nodes. It considers all the feasible devices for a given node and then uses a greedy heuristic to place the node on a device, taking into account the cost of byte transfers as well as the computation time needed on the device. The device where the node's execution would finish earliest is selected.
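The greedy "finishes earliest" rule can be sketched as a small simulation. This is a toy model, not TensorFlow's scheduler: the device names, cost numbers, and the flat per-byte transfer cost are all invented for illustration.

```python
# Greedy placement sketch: for each node (in topological order), pick the
# device where it would finish earliest, given compute time per device and
# a transfer penalty when an input lives on a different device.
def place(nodes, devices, transfer_cost_per_byte=0.001):
    placement, finish = {}, {}
    for name, (compute_times, inputs, input_bytes) in nodes:
        best = None
        for dev in devices:
            ready = 0.0
            for inp in inputs:
                t = finish[inp]
                if placement[inp] != dev:          # input must cross devices
                    t += input_bytes * transfer_cost_per_byte
                ready = max(ready, t)
            done = ready + compute_times[dev]
            if best is None or done < best[0]:
                best = (done, dev)
        finish[name], placement[name] = best
    return placement

devices = ["cpu", "gpu"]
# (node, ({device: compute time}, input nodes, bytes per input tensor))
nodes = [
    ("input",   ({"cpu": 1.0,  "gpu": 1.0}, [], 0)),
    ("matmul",  ({"cpu": 50.0, "gpu": 5.0}, ["input"], 4000)),
    ("softmax", ({"cpu": 2.0,  "gpu": 1.0}, ["matmul"], 4000)),
]
print(place(nodes, devices))
```

Note how the matmul lands on the GPU despite the transfer penalty, and the softmax then stays on the GPU to avoid moving its input back.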

Generated nodes: Send, Receive, Save, Restore, Fetch and Feed

Once node placement has been established, some edges of the graph will have to cross device boundaries. This is accomplished by placing synthetic Send and Receive stub nodes in each device. (Users of the system don't have to worry about this; TensorFlow composes these nodes behind the scenes.) All cross-device communication is handled by these nodes, which take tensors as inputs.

Placing subgraphs on two devices, with tensors a and x crossing device boundaries using generated send and receive nodes

These communication-centric nodes help abstract away all the communication and enable further optimization. One such optimization is that the same tensor gets a single downstream Receive node per device, so as to avoid extra copies (in the diagram above, consider the recv node for tensor a on Device B). Another advantage of Send and Receive nodes is that they can be used for cross-device synchronization, and workers can then communicate with each other without having to go to the master for every scheduling dependency. Finally, these nodes can internally be optimized for the transport in use, such as TCP or RDMA.

Earlier we described Variables, which are used for saving state. Since there can be failures in the system, TensorFlow provides a checkpointing mechanism. This is accomplished using Save and Restore nodes, to which Variables get connected as the graph is being constructed. Every so often, values are persistently written to the file system. Conversely, a Restore node is used during the first iteration after a restart to repopulate the given Variable.
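The save/restore cycle amounts to periodically serializing variable values and reading them back after a restart. Here is a minimal sketch of that idea (the JSON file format and the save/restore helper names are invented; TensorFlow uses its own checkpoint format):

```python
import json
import os
import tempfile

def save(variables, path):
    """What a Save node does conceptually: persist variable values to disk."""
    with open(path, "w") as f:
        json.dump(variables, f)

def restore(path):
    """What a Restore node does conceptually: reload values after a restart."""
    with open(path) as f:
        return json.load(f)

variables = {"weights": [0.1, -0.3, 0.7], "step": 42}
ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt")

save(variables, ckpt)       # run every N iterations
recovered = restore(ckpt)   # run once, on the first iteration after restart
print(recovered == variables)
```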

In a data-heavy environment, it is often useful to execute subgraphs. TensorFlow allows for this by naming graph nodes and providing ports for inputs/outputs. A missing input can be filled in by having the user specify it at run time. Inputs to the subgraph are replaced by feed nodes and outputs by fetch nodes.
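A toy sketch of feed/fetch semantics (plain Python, not the TensorFlow API; the graph and the run helper are invented): fetch names select which outputs to compute, and fed values short-circuit the nodes they replace, so only the necessary part of the graph runs.

```python
# Evaluate only the fetched nodes; fed nodes are never recomputed.
def run(graph, fetches, feeds=None):
    values = dict(feeds or {})  # fed values pre-populate the result cache

    def evaluate(name):
        if name not in values:
            op, inputs = graph[name]
            values[name] = op(*[evaluate(i) for i in inputs])
        return values[name]

    return [evaluate(f) for f in fetches]

graph = {
    "a":    (lambda: 3, []),
    "b":    (lambda: 4, []),
    "sum":  (lambda x, y: x + y, ["a", "b"]),
    "prod": (lambda x, y: x * y, ["a", "b"]),
}

print(run(graph, ["sum"]))              # "prod" is never evaluated
print(run(graph, ["prod"], {"a": 10}))  # feed replaces node "a"
```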

Distributed Execution

Send and Receive nodes clearly facilitate the distributed execution of different subgraphs. If one of the nodes fails for some reason, the whole execution is aborted. For failure detection, the master performs periodic health checks on the workers. Send and Receive nodes can also detect communication errors and failures to respond; in such cases, they too can initiate an abort of the overall graph execution.

Advanced TensorFlow capabilities

We have covered the basic paradigm of TensorFlow so far. Here are some advanced capabilities that are useful in a machine learning context.

Gradient computations

One of the most common algorithms in ML is gradient descent. The central idea behind this algorithm is to take partial derivatives of the error (guess minus actual) with respect to the parameters, and then try to minimize that error. Derivatives, being slopes at given points, give the direction to move in to reduce the error. So TensorFlow provides a built-in way to compute gradients over the graph. The graph dataflow fits in very nicely with how derivatives are computed mathematically using the chain rule.

See the following example, in which we take the partial derivative of z with respect to x, i.e. dz/dx. Here z, x, and y are 1-d tensors. Given that the partial derivative of z w.r.t. x is 2x, the output of this program is [array([20, 40, 60, 80], dtype=int32)]. Similar gradient computation can be done for other tensors such as y.

import tensorflow as tf

# z = x*x + y/5, built as a graph of TF 1.x ops over int32 tensors
x = tf.constant([10, 20, 30, 40])
y = tf.constant([1, 2, 3, 4])
z = x * x + y / 5  # note: under Python 3 you may need y // 5 to keep integer dtypes

g = tf.gradients(z, x)  # builds the gradient subgraph for dz/dx

s = tf.Session()
print(s.run(g))  # [array([20, 40, 60, 80], dtype=int32)]

See the graph structure generated by TensorFlow. It created a tree of operations on the left-hand side to represent the function. It then also created a corresponding tree of gradient operations by exploring the dependency of z on x. This is done by tracing back from the tensor z, which depends on x, adding the corresponding gradient ops and computing the gradients along that path.

LHS is the graph for z = x*x + y/5 and RHS is the computation for the partial derivative dz/dx, i.e. tf.gradients(z, x). See how the gradient graph traverses in the opposite direction, just like the chain rule for composite functions.

Such gradient computation can be memory heavy. Beyond basic heuristics like releasing temporary variables quickly using careful ordering, ideas for the future revolve around moving memory from GPU to CPU, etc.
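The reverse traversal described above can be sketched in plain Python for the running example (this is a hand-unrolled illustration of the chain-rule walk, not TensorFlow's autodiff machinery; the function name is invented):

```python
# Reverse-mode sketch for z = x*x + y/5 over 1-d lists: do the forward
# pass, then walk the graph backwards, turning the gradient of each
# output into gradients of its inputs.
def grad_z_wrt_x(x, y):
    # Forward pass
    xx = [a * a for a in x]
    y5 = [b / 5 for b in y]
    z = [a + b for a, b in zip(xx, y5)]
    # Backward pass, mirroring the graph in reverse
    dz = [1] * len(z)                 # dz/dz = 1 at the output
    dxx = dz[:]                       # addition passes gradients through unchanged
    dx = [2 * a * g for a, g in zip(x, dxx)]  # d(x*x)/dx = 2x
    return dx

print(grad_z_wrt_x([10, 20, 30, 40], [1, 2, 3, 4]))  # [20, 40, 60, 80]
```

This matches the output of the tf.gradients example: the y/5 branch never feeds into dx, just as its subgraph sits outside the gradient path in the figure.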

Device Constraints

Nodes can be specified with constraints such as "run this node on a GPU". Another useful constraint is to place a Variable on the same device as a specific node for better data locality. This affects the node placement algorithm described above: the node-to-device placement algorithm first finds the feasible devices for each graph node, then a second pass figures out which nodes need to be colocated, and finally the intersection of these sets, along with the greedy heuristic described earlier, finalizes the placement of nodes on devices.

Control Flow Statements

We briefly mentioned earlier that the graph dataflow gets augmented with control statements. TensorFlow provides switch, break, and for-loop-like constructs. It is important to note that because the different nodes in a loop can be placed on different devices, a for-loop-like construct becomes a problem of distributed loop control and termination. To handle this, the graph construction includes control nodes, and these control nodes communicate with each other at the start and end of every iteration and on the eventual termination of the loop.

Queues

Queueing constructs are available in TensorFlow for asynchronous computation. When one graph node is done producing its output, it can enqueue that data, and the consumer node can dequeue it later when it is ready. Similarly, data can be prefetched into queues so that a device can begin working on it as soon as it is done with the previous computation. Queues can also be used for batching gradient computations. In addition to the standard FIFO queue, shuffling queues are useful for randomizing data, another common need in ML.
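The producer/consumer decoupling that queues provide can be sketched with the Python standard library (this is not TensorFlow's queue ops, just the same idea): the producer runs ahead and prefetches batches, and the consumer picks up the next one as soon as it finishes the current one.

```python
import queue
import threading

q = queue.Queue(maxsize=4)  # bounded, like a FIFO queue with fixed capacity

def producer():
    for batch in range(8):
        q.put(batch)        # enqueue; blocks if the queue is full
    q.put(None)             # sentinel: no more data

results = []
t = threading.Thread(target=producer)
t.start()
while True:
    batch = q.get()         # dequeue; blocks until data is available
    if batch is None:
        break
    results.append(batch * batch)  # stand-in for "work on the batch"
t.join()
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The bounded capacity is what makes prefetching safe: the producer can only run a few batches ahead before it blocks.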

Containers

Containers internally act as the backing store for persistent state such as that used by Variables. By default, the scope of a container is the lifetime of the process, but named containers can be used for longer-lived state and for sharing state across devices.

Optimizations

TensorFlow was approximately six times faster than DistBelief on certain models. The overall optimizations in TensorFlow focus on:

- Eliminating common subgraphs that perform the same operations on the same copies of data, and pointing their consumers at a single common node
- Communication-related improvements, such as delaying the execution of Receive nodes until they are ready to receive data
- Asynchronous kernels, which, similar to other asynchronous systems, invoke a callback when the computation completes instead of blocking until it finishes
- Using optimized libraries such as Eigen for linear-algebra computations
- Lossy compression of floats: many ML algorithms tolerate approximate floating-point arithmetic, so TensorFlow uses fewer mantissa bits when floating-point numbers are transmitted across device boundaries
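The lossy-compression idea can be demonstrated with the standard library (a toy sketch, not TensorFlow's wire format; the choice of keeping 7 mantissa bits is arbitrary here): zero out the low-order mantissa bits of a 32-bit IEEE 754 float before "transmitting" it.

```python
import struct

def truncate_mantissa(value, keep_bits=7):
    """Keep only the top keep_bits of the 23-bit float32 mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    mask = ~((1 << (23 - keep_bits)) - 1) & 0xFFFFFFFF  # zero low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

x = 3.14159265
approx = truncate_mantissa(x)
print(approx)                 # a coarser float that needs fewer bits on the wire
print(abs(x - approx))        # the error is tiny relative to the value
```

With 7 mantissa bits the relative error stays below about 1%, which is well within what many gradient-based training procedures tolerate.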

Conclusions

Overall I found the dataflow-based approach quite interesting. The graph structure and the gradient-computation insights that come with it are quite powerful IMO. This paradigm seems well suited to data-rich applications like those prevalent in ML.