In part 1, we looked at a simple example of how we can split a file into several smaller files based on the value of a date for a given field. In this part 2, we are going to expand on those ideas and improve the solution.

Scenario

So, unlike Part 1, where we had a small, known range of dates in our file, let's imagine we have the same 16GB CSV file with many millions of rows of data, but this time with an unknown range of dates. How would we go about splitting the file into smaller files based on the date field? For testing purposes and to explain the new solution, I am going to continue using the sample CSV file I used in Part 1.

Method

The second row of components in this job is the same as in Part 1; however, the tForeach_1 has been replaced by the top row of components.

Why the need for four components on that top row instead of just one? Well, rather than being told which dates to use to split the file up, the job works them out itself.

tFileInputDelimited_2

The tFileInputDelimited component is going to read the rows from my CSV file into the job.

tMap_1

In the tMap, I ignore the ID field as it is not needed at this stage. I have two output columns: yearmonth and date. The yearmonth column takes the value from the date column and turns it into a year-month string in the format yyyyMM.
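The yearmonth expression in the tMap is essentially a date-to-string format operation. As a rough sketch in plain Java (using SimpleDateFormat, which is what Talend's TalendDate.formatDate wraps; the method and column names here are illustrative, not taken from the job):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class YearMonthDemo {

    // Roughly what a tMap expression like TalendDate.formatDate("yyyyMM", row1.date)
    // produces for the yearmonth output column.
    static String toYearMonth(Date date) {
        return new SimpleDateFormat("yyyyMM").format(date);
    }

    public static void main(String[] args) throws Exception {
        Date d = new SimpleDateFormat("yyyy-MM-dd").parse("2016-03-15");
        System.out.println(toYearMonth(d)); // prints 201603
    }
}
```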

tUniqRow_1

I select the yearmonth column as the key attribute in my tUniqRow component. The set of rows that are output by this component will be a distinct list of all the months (in yyyyMM format) that exist in the data.

tFlowToIterate

The rows of yearmonths are sent to the tFlowToIterate component, at which point they are stored as global variables. It is now appropriate to cast our minds back to Part 1, where we had a tForeach component with a list of dates defined by me. We then iterated through this list, writing to a new file each time based on the current date in the list.

We are still going to iterate through a list of dates, however this time the list has been built and stored as global variables by the top row of components.
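Conceptually, tFlowToIterate puts each incoming row's column values into Talend's globalMap under keys like "rowName.column" before firing each iteration, and downstream components read them back out. A simplified stand-in (the "row4" prefix matches the expression shown later in this article; the yearmonth column name is an assumption):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlowToIterateDemo {

    // Simulates one iteration cycle: store the current row's value in a
    // globalMap-style map, then read it back as a downstream component would.
    static List<String> iterate(List<String> yearmonths) {
        Map<String, Object> globalMap = new HashMap<>();
        List<String> seenByDownstream = new ArrayList<>();
        for (String ym : yearmonths) {
            globalMap.put("row4.yearmonth", ym);                     // tFlowToIterate stores the value
            seenByDownstream.add((String) globalMap.get("row4.yearmonth")); // downstream reads it
        }
        return seenByDownstream;
    }

    public static void main(String[] args) {
        System.out.println(iterate(Arrays.asList("201601", "201602", "201603")));
    }
}
```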

tFileInputDelimited_1

As in Part 1, the tFileInputDelimited_1 component is going to read the rows from my CSV file into the job.

tFilterRow_1

The purpose of the tFilterRow is identical to Part 1; however, the code is slightly different, as the dates now come from the tFlowToIterate instead of the tForeach.
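The filter condition boils down to: keep a row when its date falls in the month currently held in the globalMap for this iteration. A hedged sketch of that check in plain Java (the method name is mine, not from the job):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class FilterRowDemo {

    // True when both dates format to the same yyyyMM string, i.e. the row's
    // date belongs to the month being processed in the current iteration.
    static boolean sameMonth(Date rowDate, Date iterationDate) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMM");
        return fmt.format(rowDate).equals(fmt.format(iterationDate));
    }

    public static void main(String[] args) throws ParseException {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(sameMonth(in.parse("2016-03-01"), in.parse("2016-03-31"))); // true
        System.out.println(sameMonth(in.parse("2016-03-01"), in.parse("2016-04-01"))); // false
    }
}
```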

tFileOutputDelimited_1

As in Part 1, the tFileOutputDelimited_1 component writes the rows to a delimited file. However, the code in the file name field is again slightly different.

TalendDate.formatDate("yyyyMM", ((java.util.Date) globalMap.get("row4.date")))

Output

I ran this job with the same data set as in Part 1 to test it; the output folder is shown in the screenshot below.

In Part 1 there was an output file for each date defined in the tForeach, which led to some empty output files in the output folder. See the screenshot below for what the output folder from Part 1 looked like.

So, the new solution makes two improvements on the first job: it can identify the range of dates in the file without human intervention, and it only outputs files that contain data.