Multiple remarks:

The “filesystem simulation” isn’t credible, but we’ll keep this behaviour consistent across all our tests, so we can ignore its impact.

We feed data to our model using the feed_dict system. This forces TF to copy the Python data into the session.

We are only using ~31% of our GPU throughout the whole training.

It takes ~18 seconds to train this NN.

One could think that this is all we can get on such a simple task, but one would be mistaken. Think about it:

Everything in this script is synchronous and single-threaded (you have to wait for each Python call to finish before moving on to the next one)

We keep moving back and forth, between Python and the underlying C++ wrapper.

How can we avoid all those pitfalls?

The solution is TF’s queue system. You can think of it as designing your input pipeline beforehand, right inside the graph, and no longer doing everything in Python! In fact, we will try to remove any Python dependency from the input pipeline.

This will also give us the nice properties of multi-threading, asynchronicity and memory optimisation thanks to the removal of the feed_dict system (which is very cool, because if you plan to train your model later on a distributed infrastructure, TF will shine right out of the box).

But first, let’s explore queues in TF with simple examples. Again, read the comments to follow my thoughts:

Hanging example of a flawed queue system

What happened here? Why are we hanging in the void like that?

Well, this is how TF has been implemented: dequeue operations make the whole graph wait for more data when the queue is empty. This behaviour only happens if you use the queue manually like that, and it is clearly cumbersome and totally useless here, since we are still in a single thread calling both the enqueue and dequeue operations.
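This blocking behaviour is not specific to TF: any blocking queue works the same way. Here is a minimal sketch using Python’s standard-library `queue` module (deliberately not TF code, so it stays self-contained): a `get()` on an empty queue blocks until data arrives, which in a single thread means blocking forever, exactly like our hanging dequeue op.

```python
import queue

q = queue.Queue(maxsize=3)
q.put(1)
q.put(2)

print(q.get())  # -> 1
print(q.get())  # -> 2

# A third plain get() on the now-empty queue would block this single
# thread forever -- the same hang we just saw with TF's dequeue.
# Using a timeout lets us observe the behaviour without hanging:
try:
    q.get(timeout=0.1)
except queue.Empty:
    print("queue is empty: a plain get() would block indefinitely")
```

Nothing else will ever call `put()` here, so the blocked `get()` could never return: that is the “hanging in the void” from above.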

Note: To be asynchronous, they have to be in their own thread, not the main one. As my French grandma used to say: if many cooks have to share the same knife to make a meal, they won’t be faster than only one cook…

To solve this, let me introduce the QueueRunner and the Coordinator, whose sole purpose is to handle queues in their own threads and to ensure synchronisation (starting, enqueuing, dequeuing, stopping, etc.).

The QueueRunner needs 2 things:

A queue

Some enqueue operations (you can have multiple enqueue operations for one queue)

The Coordinator needs nothing: it is a handy high-level API under the “tf.train” namespace for handling queues. If you create a custom queue, as we did, and add a QueueRunner to handle it, then as long as you don’t forget to add the QueueRunner to TF’s QUEUE_RUNNERS collection, you can use the high-level API safely.
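The division of labour can be sketched with the standard library (again, an analogy rather than TF code: the worker threads play the role of the QueueRunner’s enqueue threads, and a `threading.Event` plays the role of the Coordinator’s stop signal):

```python
import queue
import threading

q = queue.Queue(maxsize=10)
stop = threading.Event()  # stand-in for the Coordinator's should_stop flag


def enqueue_worker():
    # Each worker is one "enqueue operation" running in its own thread,
    # which is what a QueueRunner would spawn for us.
    while not stop.is_set():
        try:
            q.put(42, timeout=0.1)
        except queue.Full:
            pass  # queue is full, just retry until asked to stop


threads = [threading.Thread(target=enqueue_worker) for _ in range(2)]
for t in threads:
    t.start()

# The main thread can now dequeue without ever hanging:
# the workers keep the queue filled asynchronously.
batch = [q.get() for _ in range(5)]
print(batch)  # -> [42, 42, 42, 42, 42]

stop.set()        # ask the workers to stop (like coord.request_stop())
for t in threads:
    t.join()      # wait for them to finish (like coord.join(threads))
```

The key point is the one from the grandma quote: the enqueue work happens in its own threads, so the main thread’s dequeues no longer starve.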

Let’s take the previous example and make the changes needed to handle queues in their own threads:

Little thought exercise: before you look at the resulting log, can you figure out how many times tf.random_normal has been called?

Spoiler, here is a dump of the result: