In this post, we will read multiple .csv files into TensorFlow using generators. The method is general enough to work for other file formats as well. We will demonstrate the procedure using 500 .csv files filled with random numbers; each file contains a single column of 1024 values. The method extends easily to huge datasets involving thousands of files. As the number of files grows, we can no longer load all the data into memory, so we have to work with chunks of it. Generators let us do just that conveniently, and in this post we will read multiple files using a custom generator.
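As a toy illustration of the idea (not part of the data pipeline itself), a generator yields one chunk at a time instead of materializing everything at once; the chunk size and data below are made up purely for demonstration:

```python
def chunked(items, chunk_size):
    """Yield successive chunks of `items`, one at a time, on demand."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

# Only one chunk lives in memory at any moment.
for chunk in chunked(list(range(10)), chunk_size=4):
    print(chunk)
# → [0, 1, 2, 3]
#   [4, 5, 6, 7]
#   [8, 9]
```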

This post is self-contained: readers don't have to download any data from anywhere. Just run the code cells below sequentially. First, a folder named "random_data" will be created in the current working directory and the .csv files will be saved into it. Subsequently, files will be read from that folder and processed. Just make sure that your current working directory doesn't already contain an old folder named "random_data", then run the code cells below.

We will use TensorFlow 2 to run our deep learning model. TensorFlow is very flexible, and a given task can usually be accomplished in several ways; the method shown here is not the only one, and readers are encouraged to explore alternatives. Below is an outline of the three tasks covered in this post.

Outline:

1. Create 500 .csv files and save them in the folder "random_data" in the current directory.
2. Write a generator that reads data from the folder in chunks and preprocesses it.
3. Feed the chunks of data to a CNN model and train it for several epochs.

1. Create 500 .csv files of random data

As we intend to train a CNN model for classification, we will generate data for 5 different classes. Each .csv file will have one column of data with 1024 entries.
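The zero-padded numbering used in the file names can be produced with Python's format specification; a quick sketch (the class name here is just an example):

```python
# "{0:03}" pads the integer with zeros to width 3, giving 001, 002, ..., 100.
fault_class = "Fault_1"
names = [fault_class + "_" + "{0:03}".format(i + 1) + ".csv" for i in range(100)]
print(names[0])   # → Fault_1_001.csv
print(names[-1])  # → Fault_1_100.csv
```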

Each file name starts with one of five class prefixes (Fault_1, Fault_2, Fault_3, Fault_4, Fault_5). The dataset is balanced, meaning each category has the same number of observations. Files in the "Fault_1" category are named "Fault_1_001.csv", "Fault_1_002.csv", ..., "Fault_1_100.csv", and similarly for the other classes.

```python
import numpy as np
import os
import glob

np.random.seed(1111)
```

First create a function that will generate the random files.

```python
def create_random_csv_files(fault_classes, number_of_files_in_each_class):
    os.mkdir("./random_data/")  # Make a directory to save created files.
    for fault_class in fault_classes:
        for i in range(number_of_files_in_each_class):
            data = np.random.rand(1024,)
            file_name = "./random_data/" + fault_class + "_" + "{0:03}".format(i + 1) + ".csv"
            np.savetxt(file_name, data, delimiter=",", header="V1", comments="")
        print(str(number_of_files_in_each_class) + " " + fault_class + " files created.")
```

Now use the function to create 100 files for each of the five fault types.

```python
create_random_csv_files(["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"],
                        number_of_files_in_each_class=100)
```

```
100 Fault_1 files created.
100 Fault_2 files created.
100 Fault_3 files created.
100 Fault_4 files created.
100 Fault_5 files created.
```

```python
files = glob.glob("./random_data/*")
print("Total number of files: ", len(files))
print("Showing first 10 files...")
files[:10]
```

```
Total number of files:  500
Showing first 10 files...
['./random_data/Fault_1_001.csv',
 './random_data/Fault_1_002.csv',
 './random_data/Fault_1_003.csv',
 './random_data/Fault_1_004.csv',
 './random_data/Fault_1_005.csv',
 './random_data/Fault_1_006.csv',
 './random_data/Fault_1_007.csv',
 './random_data/Fault_1_008.csv',
 './random_data/Fault_1_009.csv',
 './random_data/Fault_1_010.csv']
```

To extract labels from a file name, take the part of the name that corresponds to the fault type.
```python
print(files[0])
```

```
./random_data/Fault_1_001.csv
```

```python
print(files[0][14:21])
```

```
Fault_1
```

Now that the data have been created, we move to the next step: define a generator that preprocesses each time-series-like file into a matrix shape that a 2-D CNN can ingest.
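The reshaping is straightforward because 1024 = 32 × 32; a minimal sketch, with a synthetic column of numbers standing in for one .csv file's contents:

```python
import numpy as np

series = np.random.rand(1024)      # One file's worth of data: a single column of 1024 values.
image = series.reshape(32, 32, 1)  # Matrix form with one channel, as a 2-D CNN expects.
print(image.shape)  # → (32, 32, 1)
```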

2. Write a generator that reads data in chunks and preprocesses it

Generators are similar to functions, with one important difference: while a function produces all its output at once, a generator yields its outputs one at a time, and only when asked. The yield keyword is what turns a function into a generator. Generators can run for a fixed number of iterations or indefinitely, depending on the loop structure inside them. For our application, we will use a generator that runs indefinitely.

The following generator takes a list of file names as its first argument. The second argument, batch_size, determines how many files we process in one go, which in turn is determined by how much memory we have. If all the data fit into memory, there is no need for generators; when the data are huge, we process chunks instead.

As we are solving a classification problem, we have to assign a label to each raw data file. We will use the following mapping for convenience:

| Class   | Label |
|---------|-------|
| Fault_1 | 0     |
| Fault_2 | 1     |
| Fault_3 | 2     |
| Fault_4 | 3     |
| Fault_5 | 4     |

The generator will yield both data and labels.

```python
import pandas as pd
import re  # To match regular expressions for extracting labels

def data_generator(file_list, batch_size = 20):
    i = 0
    while True:  # This loop runs the generator indefinitely.
        if i * batch_size >= len(file_list):
            i = 0
            np.random.shuffle(file_list)
        else:
            file_chunk = file_list[i * batch_size:(i + 1) * batch_size]
            data = []
            labels = []
            label_classes = ["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]
            for file in file_chunk:
                temp = pd.read_csv(open(file, 'r'))  # Change this line to read any other type of file
                data.append(temp.values.reshape(32, 32, 1))  # Convert column data to matrix-like data with one channel
                pattern = "^" + file[14:21]  # Pattern extracted from the file name
                for j in range(len(label_classes)):
                    if re.match(pattern, label_classes[j]):  # Pattern is matched against the label classes
                        labels.append(j)
            data = np.asarray(data).reshape(-1, 32, 32, 1)
            labels = np.asarray(labels)
            yield data, labels
            i = i + 1
```

To read any other file format, change the line inside the generator that reads files; this lets us handle .txt, .npz, or any other format. Preprocessing different from what we have done in this post can likewise be carried out inside the generator loop.

Now let's check whether the generator works as intended. We will set batch_size to 10, meaning files will be read and processed in chunks of 10. The list from which those 10 files are drawn can be ordered or shuffled; if the files are not shuffled, np.random.shuffle(file_list) can be used to shuffle them. For this demonstration we will read files from an ordered list, which makes it easier to spot errors in the code.
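The slice file[14:21] works only because every path here has the fixed form "./random_data/Fault_X_NNN.csv"; if the folder name or layout changes, the slice silently breaks. A more robust alternative (a sketch of one option, not what this post's generator uses) extracts the class prefix from the base name instead:

```python
import os
import re

def extract_label(path, label_classes):
    """Return the index of the class whose name prefixes the file's base name."""
    base = os.path.basename(path)  # e.g. "Fault_1_001.csv", regardless of folder depth
    for j, cls in enumerate(label_classes):
        if re.match("^" + cls, base):
            return j
    raise ValueError("No known class prefix in " + path)

classes = ["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]
print(extract_label("./random_data/Fault_3_042.csv", classes))          # → 2
print(extract_label("./random_data/Fault_3/Fault_3_042.csv", classes))  # → 2
```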
```python
generated_data = data_generator(files, batch_size = 10)

num = 0
for data, labels in generated_data:
    print(data.shape, labels.shape)
    print(labels, "<--Labels")  # Just to see the labels
    print()
    num = num + 1
    if num > 5:
        break
```

```
(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels
```

Run the above cell multiple times to observe different labels. Label 1 appears only after all the files corresponding to "Fault_1" have been read: there are 100 "Fault_1" files and batch_size is 10, yet in the cell above we iterate over the generator only 6 times. Once the number of iterations exceeds 10, we see label 1 and subsequently the other labels. This pattern holds only because our initial file list is not shuffled; with a shuffled list we would get random labels.

Now we will create a TensorFlow dataset using the generator. TensorFlow datasets can conveniently be used to train TensorFlow models, and can be created from numpy arrays or from generators; here, we will create one from a generator. Using the previous generator as-is inside a TensorFlow dataset does not work (readers can verify this). The reason is that regular expressions cannot compare a string with a byte string, and TensorFlow passes the generator's arguments as byte strings by default. As a workaround, we make small modifications to the earlier generator and use that version with the TensorFlow dataset. Note that only three lines change; each modified line is marked with a comment beside it.
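To see the str-versus-bytes problem concretely, note that Python's re module refuses to mix a str pattern with a bytes target. A minimal illustration in plain Python (no TensorFlow needed):

```python
import re

# A str pattern matches a str target...
assert re.match("^Fault_1", "Fault_1_001.csv") is not None

# ...but mixing str and bytes raises a TypeError.
try:
    re.match("^Fault_1", b"Fault_1_001.csv")
except TypeError as err:
    print("Mixing str and bytes fails:", err)

# Matching works again once pattern and target are both bytes,
# which is why the modified generator below compares .numpy() values
# (byte strings) on both sides.
assert re.match(b"^Fault_1", b"Fault_1_001.csv") is not None
```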
```python
import tensorflow as tf
print(tf.__version__)
```

```
2.2.0
```

```python
def tf_data_generator(file_list, batch_size = 20):
    i = 0
    while True:
        if i * batch_size >= len(file_list):
            i = 0
            np.random.shuffle(file_list)
        else:
            file_chunk = file_list[i * batch_size:(i + 1) * batch_size]
            data = []
            labels = []
            label_classes = tf.constant(["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"])  # This line has changed.
            for file in file_chunk:
                temp = pd.read_csv(open(file, 'r'))
                data.append(temp.values.reshape(32, 32, 1))
                pattern = tf.constant(file[14:21])  # This line has changed.
                for j in range(len(label_classes)):
                    if re.match(pattern.numpy(), label_classes[j].numpy()):  # This line has changed.
                        labels.append(j)
            data = np.asarray(data).reshape(-1, 32, 32, 1)
            labels = np.asarray(labels)
            yield data, labels
            i = i + 1
```

Test whether the modified generator works:

```python
check_data = tf_data_generator(files, batch_size = 10)
num = 0
for data, labels in check_data:
    print(data.shape, labels.shape)
    print(labels, "<--Labels")
    print()
    num = num + 1
    if num > 5:
        break
```

```
(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels

(10, 32, 32, 1) (10,)
[0 0 0 0 0 0 0 0 0 0] <--Labels
```

The new generator, built with a few TensorFlow commands, works just like the previous one, and can now be integrated into a TensorFlow dataset.

```python
batch_size = 15
dataset = tf.data.Dataset.from_generator(tf_data_generator, args = [files, batch_size],
                                         output_types = (tf.float32, tf.float32),
                                         output_shapes = ((None, 32, 32, 1), (None,)))
```

Check whether the dataset works:

```python
num = 0
for data, labels in dataset:
    print(data.shape, labels.shape)
    print(labels)
    print()
    num = num + 1
    if num > 7:
        break
```

```
(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1.], shape=(15,), dtype=float32)

(15, 32, 32, 1) (15,)
tf.Tensor([1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.], shape=(15,), dtype=float32)
```

This also works fine. Now we will train a full CNN model using the generator. As in any modeling exercise, we first shuffle the data files and split them into training, validation, and test sets. Using tf_data_generator we then create three TensorFlow datasets corresponding to the training, validation, and test data. Finally, we create a simple CNN model, train it on the training dataset, monitor its performance on the validation dataset, and obtain predictions on the test dataset. Keep in mind that our aim is not to maximize model performance; as the data are random, don't expect good results. The aim is only to build the pipeline.

3. Building the data pipeline and training the CNN model

Before building the data pipeline, we first move the files of each fault class into its own folder. This makes it convenient to split the data into training, validation, and test sets while keeping the dataset balanced.

```python
import shutil
```

Create five different folders:

```python
fault_folders = ["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]
for folder_name in fault_folders:
    os.mkdir(os.path.join("./random_data", folder_name))
```

Move the files into those folders:

```python
for file in files:
    pattern = "^" + file[14:21]
    for j in range(len(fault_folders)):
        if re.match(pattern, fault_folders[j]):
            dest = os.path.join("./random_data/", fault_folders[j])
            shutil.move(file, dest)
```

```python
glob.glob("./random_data/*")
```

```
['./random_data/Fault_1',
 './random_data/Fault_2',
 './random_data/Fault_3',
 './random_data/Fault_4',
 './random_data/Fault_5']
```

```python
glob.glob("./random_data/Fault_1/*")[:10]  # Showing first 10 files of the Fault_1 folder
```

```
['./random_data/Fault_1/Fault_1_001.csv',
 './random_data/Fault_1/Fault_1_002.csv',
 './random_data/Fault_1/Fault_1_003.csv',
 './random_data/Fault_1/Fault_1_004.csv',
 './random_data/Fault_1/Fault_1_005.csv',
 './random_data/Fault_1/Fault_1_006.csv',
 './random_data/Fault_1/Fault_1_007.csv',
 './random_data/Fault_1/Fault_1_008.csv',
 './random_data/Fault_1/Fault_1_009.csv',
 './random_data/Fault_1/Fault_1_010.csv']
```

```python
glob.glob("./random_data/Fault_3/*")[:10]  # Showing first 10 files of the Fault_3 folder
```

```
['./random_data/Fault_3/Fault_3_001.csv',
 './random_data/Fault_3/Fault_3_002.csv',
 './random_data/Fault_3/Fault_3_003.csv',
 './random_data/Fault_3/Fault_3_004.csv',
 './random_data/Fault_3/Fault_3_005.csv',
 './random_data/Fault_3/Fault_3_006.csv',
 './random_data/Fault_3/Fault_3_007.csv',
 './random_data/Fault_3/Fault_3_008.csv',
 './random_data/Fault_3/Fault_3_009.csv',
 './random_data/Fault_3/Fault_3_010.csv']
```

Next, prepare the data for the training set, validation set, and test set.
For each fault type, we keep 70 files for training, 10 for validation, and 20 for testing.

```python
fault_1_files = glob.glob("./random_data/Fault_1/*")
fault_2_files = glob.glob("./random_data/Fault_2/*")
fault_3_files = glob.glob("./random_data/Fault_3/*")
fault_4_files = glob.glob("./random_data/Fault_4/*")
fault_5_files = glob.glob("./random_data/Fault_5/*")

from sklearn.model_selection import train_test_split

fault_1_train, fault_1_test = train_test_split(fault_1_files, test_size = 20, random_state = 5)
fault_2_train, fault_2_test = train_test_split(fault_2_files, test_size = 20, random_state = 54)
fault_3_train, fault_3_test = train_test_split(fault_3_files, test_size = 20, random_state = 543)
fault_4_train, fault_4_test = train_test_split(fault_4_files, test_size = 20, random_state = 5432)
fault_5_train, fault_5_test = train_test_split(fault_5_files, test_size = 20, random_state = 54321)

fault_1_train, fault_1_val = train_test_split(fault_1_train, test_size = 10, random_state = 1)
fault_2_train, fault_2_val = train_test_split(fault_2_train, test_size = 10, random_state = 12)
fault_3_train, fault_3_val = train_test_split(fault_3_train, test_size = 10, random_state = 123)
fault_4_train, fault_4_val = train_test_split(fault_4_train, test_size = 10, random_state = 1234)
fault_5_train, fault_5_val = train_test_split(fault_5_train, test_size = 10, random_state = 12345)

train_file_names = fault_1_train + fault_2_train + fault_3_train + fault_4_train + fault_5_train
validation_file_names = fault_1_val + fault_2_val + fault_3_val + fault_4_val + fault_5_val
test_file_names = fault_1_test + fault_2_test + fault_3_test + fault_4_test + fault_5_test

# Shuffle the training files (validation and test data need not be shuffled)
np.random.shuffle(train_file_names)

print("Number of train_files:", len(train_file_names))
print("Number of validation_files:", len(validation_file_names))
print("Number of test_files:", len(test_file_names))
```

```
Number of train_files: 350
Number of validation_files: 50
Number of test_files: 100
```

```python
batch_size = 10
train_dataset = tf.data.Dataset.from_generator(tf_data_generator, args = [train_file_names, batch_size],
                                               output_shapes = ((None, 32, 32, 1), (None,)),
                                               output_types = (tf.float32, tf.float32))

validation_dataset = tf.data.Dataset.from_generator(tf_data_generator, args = [validation_file_names, batch_size],
                                                    output_shapes = ((None, 32, 32, 1), (None,)),
                                                    output_types = (tf.float32, tf.float32))

test_dataset = tf.data.Dataset.from_generator(tf_data_generator, args = [test_file_names, batch_size],
                                              output_shapes = ((None, 32, 32, 1), (None,)),
                                              output_types = (tf.float32, tf.float32))
```

Now create the model:

```python
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(16, 3, activation = "relu", input_shape = (32, 32, 1)),
    layers.MaxPool2D(2),
    layers.Conv2D(32, 3, activation = "relu"),
    layers.MaxPool2D(2),
    layers.Flatten(),
    layers.Dense(16, activation = "relu"),
    layers.Dense(5, activation = "softmax")
])
model.summary()
```

```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 30, 30, 16)        160
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 15, 15, 16)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 13, 13, 32)        4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 6, 6, 32)          0
_________________________________________________________________
flatten (Flatten)            (None, 1152)              0
_________________________________________________________________
dense (Dense)                (None, 16)                18448
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 85
=================================================================
Total params: 23,333
Trainable params: 23,333
Non-trainable params: 0
_________________________________________________________________
```
Compile the model:

```python
model.compile(loss = "sparse_categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])
```

Before fitting the model, we have to do one important calculation. Remember that our generators are infinite loops: given no stopping criterion, they run indefinitely. But we want the model to train for, say, 10 epochs, so the generator should loop over the data files exactly 10 times and no more. This is achieved by setting the arguments steps_per_epoch and validation_steps to the desired numbers in model.fit(), and the argument steps in model.evaluate(). There are 350 files in the training set and batch_size is 10, so 35 generator steps correspond to one epoch. Therefore, we set steps_per_epoch to 35; similarly, validation_steps = 5, and steps = 10 in model.evaluate().

```python
steps_per_epoch = int(np.ceil(len(train_file_names) / batch_size))
validation_steps = int(np.ceil(len(validation_file_names) / batch_size))
steps = int(np.ceil(len(test_file_names) / batch_size))
print("steps_per_epoch = ", steps_per_epoch)
print("validation_steps = ", validation_steps)
print("steps = ", steps)
```

```
steps_per_epoch =  35
validation_steps =  5
steps =  10
```

```python
model.fit(train_dataset, validation_data = validation_dataset,
          steps_per_epoch = steps_per_epoch,
          validation_steps = validation_steps, epochs = 10)
```

```
Epoch 1/10
35/35 [==============================] - 1s 40ms/step - loss: 1.6268 - accuracy: 0.2029 - val_loss: 1.6111 - val_accuracy: 0.2000
Epoch 2/10
35/35 [==============================] - 1s 36ms/step - loss: 1.6101 - accuracy: 0.2114 - val_loss: 1.6079 - val_accuracy: 0.2600
Epoch 3/10
35/35 [==============================] - 1s 35ms/step - loss: 1.6066 - accuracy: 0.2343 - val_loss: 1.6076 - val_accuracy: 0.2000
Epoch 4/10
35/35 [==============================] - 1s 34ms/step - loss: 1.5993 - accuracy: 0.2143 - val_loss: 1.6085 - val_accuracy: 0.2400
Epoch 5/10
35/35 [==============================] - 1s 34ms/step - loss: 1.5861 - accuracy: 0.2657 - val_loss: 1.6243 - val_accuracy: 0.2000
Epoch 6/10
35/35 [==============================] - 1s 35ms/step - loss: 1.5620 - accuracy: 0.3514 - val_loss: 1.6363 - val_accuracy: 0.2000
Epoch 7/10
35/35 [==============================] - 1s 36ms/step - loss: 1.5370 - accuracy: 0.2857 - val_loss: 1.6171 - val_accuracy: 0.2600
Epoch 8/10
35/35 [==============================] - 1s 35ms/step - loss: 1.5015 - accuracy: 0.4057 - val_loss: 1.6577 - val_accuracy: 0.2000
Epoch 9/10
35/35 [==============================] - 1s 35ms/step - loss: 1.4415 - accuracy: 0.5086 - val_loss: 1.6484 - val_accuracy: 0.1400
Epoch 10/10
35/35 [==============================] - 1s 36ms/step - loss: 1.3363 - accuracy: 0.6143 - val_loss: 1.6672 - val_accuracy: 0.2200

<tensorflow.python.keras.callbacks.History at 0x7fcab40f6150>
```

```python
test_loss, test_accuracy = model.evaluate(test_dataset, steps = 10)
```

```
10/10 [==============================] - 0s 25ms/step - loss: 1.6974 - accuracy: 0.1500
```

```python
print("Test loss: ", test_loss)
print("Test accuracy:", test_accuracy)
```

```
Test loss:  1.6973648071289062
Test accuracy: 0.15000000596046448
```

As expected, the model performs terribly on random data; the point of this exercise was the pipeline, not the accuracy.