Most beginner TensorFlow tutorials introduce the reader to the feed_dict method of loading data into your model, where data is passed to TensorFlow through the tf.Session.run() or tf.Tensor.eval() function calls. There is, however, a much better and almost easier way of doing this: using the tf.data API you can create high-performance data pipelines in just a few lines of code. In a naive feed_dict pipeline the GPU always sits idle whenever it has to wait for the CPU to provide it with the next batch of data. A tf.data pipeline, however, can prefetch the next batches asynchronously to minimize the total idle time. You can further speed up the pipeline by parallelizing the loading and preprocessing operations.

Implementing a minimal image pipeline in 5 minutes

To build a simple data pipeline you need two objects: a tf.data.Dataset stores your dataset, and a tf.data.Iterator allows you to extract items from your dataset one by one. A tf.data.Dataset for an image pipeline could (schematically) look like this:

    [ [Tensor(image), Tensor(label)], [Tensor(image), Tensor(label)], ... ]

You then use a tf.data.Iterator to retrieve image-label pairs one by one. In practice, multiple image-label pairs would be batched together so that the iterator pulls out one entire batch at a time.

A dataset can be created either from a source (like a list of filenames in Python) or by applying a transformation to an existing dataset. Here are some examples of possible transformations:

    Dataset(*list of image files*) → Dataset(*actual images*)
    Dataset(*6400 images*) → Dataset(*64 batches with 100 images each*)
    Dataset(*list of audio files*) → Dataset(*shuffled list of audio files*)

Defining the computation graph

A minimal data pipeline for images could look like this. First create a tensor from a list of files:

    # define list of files
    files = ['a.png', 'b.png', 'c.png', 'd.png']
    # create a dataset from filenames
    dataset = tf.data.Dataset.from_tensor_slices(files)

Next, write a function that loads an image from its file path and use tf.data.Dataset.map() to apply this function to all elements (file paths) in the dataset. You can also add a num_parallel_calls=n argument to map() to parallelize the function calls.

Then use tf.data.Dataset.batch() to create batches:

    # Create batches of 64 images each
    dataset = dataset.batch(64)

Finally, append tf.data.Dataset.prefetch(buffer_size) to the end of your pipeline. This ensures that the next batch is always immediately available to the GPU and reduces GPU starvation as mentioned above. buffer_size is the number of batches that should be prefetched. buffer_size=1 is usually sufficient, although in some cases, especially when the processing time per batch varies, it can help to increase it.

    dataset = dataset.prefetch(buffer_size=1)

To extract batches from the dataset, create an iterator:

    iterator = dataset.make_initializable_iterator()

Use tf.data.Iterator.get_next() to create a placeholder tensor that TensorFlow fills with the next batch of images every time it is evaluated. If you are switching from a feed_dict pipeline, batch_of_images replaces your previous placeholder variable.

    batch_of_images = iterator.get_next()

Running the session

Now run your model as usual, but make sure to evaluate the iterator.initializer op at the start of every epoch and catch the tf.errors.OutOfRangeError exception at the end of every epoch:

    with tf.Session() as session:
        for i in range(epochs):
            session.run(iterator.initializer)
            try:
                # Go through the entire dataset
                while True:
                    image_batch = session.run(batch_of_images)
            except tf.errors.OutOfRangeError:
                print('End of Epoch.')
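The speedup from prefetching comes from overlapping data loading with computation. As a rough illustration of the idea (a plain-Python sketch using a background thread and a bounded queue, not the actual tf.data implementation; load_batch and prefetching_batches are hypothetical names), the following shows how the next batch can be prepared while the current one is being consumed:

```python
import queue
import threading

def load_batch(i):
    # stand-in for CPU-side loading/preprocessing of batch i
    return [i] * 4

def prefetching_batches(num_batches, buffer_size=1):
    """Yield batches while a background thread prepares the next ones."""
    buf = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for i in range(num_batches):
            # blocks once buffer_size prepared batches are waiting
            buf.put(load_batch(i))
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buf.get()
        if batch is sentinel:
            return
        yield batch

# consume all batches, as the training loop would
batches = list(prefetching_batches(num_batches=8, buffer_size=1))
print(len(batches))  # prints 8
```

With buffer_size=1 the producer always stays exactly one batch ahead of the consumer, which mirrors why prefetch(buffer_size=1) is usually enough to keep the GPU fed.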

The program nvidia-smi allows you to monitor your GPU utilization and can help you understand bottlenecks in your data pipeline. The average GPU utilization should usually be above 70-80%.

A more complete data pipeline

Shuffle

Use tf.data.Dataset.shuffle() to shuffle the filenames. The argument specifies how many elements should be shuffled at a time. In general, it is recommended to shuffle the entire list at once:

    dataset = tf.data.Dataset.from_tensor_slices(files)
    dataset = dataset.shuffle(len(files))

Data augmentation

You can use the functions tf.image.random_flip_left_right(), tf.image.random_brightness() and tf.image.random_saturation() to perform simple data augmentation on your images.

Labels

To load labels (or other metadata) along with your images, simply include them when creating the initial dataset:

    # files is a python list of image filenames
    # labels is a numpy array with label data for each image
    dataset = tf.data.Dataset.from_tensor_slices((files, labels))

Then make sure that any .map() calls allow the label data to pass through:

    def load_image(path, label):
        # load image
        return image, label

    dataset = dataset.map(load_image)
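To see why shuffling the entire list at once is recommended: shuffle(buffer_size) does not shuffle the whole dataset globally, it keeps a buffer of buffer_size elements and emits a random element from that buffer at each step. The following pure-Python sketch (an illustration of the buffer idea, not the tf.data source; buffered_shuffle is a hypothetical helper) shows that a small buffer only shuffles locally:

```python
import random

def buffered_shuffle(items, buffer_size, rng=random):
    """Sketch of buffer-based shuffling: keep a buffer of
    buffer_size elements and emit a random one at each step."""
    buf, out = [], []
    for item in items:
        buf.append(item)
        if len(buf) > buffer_size:
            # emit a random buffered element; the newest stays behind
            out.append(buf.pop(rng.randrange(len(buf))))
    rng.shuffle(buf)  # drain the remaining buffer in random order
    out.extend(buf)
    return out

files = ['img_%03d.png' % i for i in range(10)]
# buffer_size=2: elements can only move a short distance from
# their original position, so the order is only locally shuffled
print(buffered_shuffle(files, buffer_size=2))
# buffer_size=len(files): a full shuffle of the whole list
print(buffered_shuffle(files, buffer_size=len(files)))
```

With a small buffer, the first emitted element is always drawn from the first few inputs, so a nearly-sorted dataset stays nearly sorted; passing len(files) avoids this.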