The small coding contest some weeks ago drew many comments, so it is worth summarizing the solutions in the different languages. Which language is easier to write, uses less memory or performs better?

To clarify the question:

Remove duplicate lines from a file

We have a text file with 13 Million entries, all numbers in the range from 1.000.000 to 100.000.000. Each line has only one number. The file is unordered and we have about 1% duplicate entries. Remove the duplicates.

Be efficient with the memory usage and/or find a fast solution. The script should be started with the filename and the number of lines as arguments and print the result to stdout.



benchmark preparation

First I generate the random numbers as test files with a simple python script. All tests are started by bench.py, which runs every solution over all test files. See the source code on github.com.
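A minimal sketch of such a generator, assuming the parameters from the problem description (range 1.000.000 to 100.000.000, about 1% duplicates). The function names generate_numbers and write_numbers are hypothetical and not taken from the repository script.

```python
import random


def generate_numbers(count, low=1000000, high=100000000, dup_ratio=0.01, seed=None):
    """Return `count` shuffled numbers in [low, high], about `dup_ratio` of them duplicates."""
    rng = random.Random(seed)
    n_dups = int(count * dup_ratio)
    # distinct values first ...
    unique = rng.sample(range(low, high + 1), count - n_dups)
    # ... then repeat some of them to create the duplicates
    dups = [rng.choice(unique) for _ in range(n_dups)]
    numbers = unique + dups
    rng.shuffle(numbers)
    return numbers


def write_numbers(path, numbers):
    """Write one number per line."""
    with open(path, "w") as fp:
        for n in numbers:
            fp.write("%d\n" % n)
```

Calling write_numbers("rand_numbers.txt", generate_numbers(13000000)) would produce a file of the size used in the contest. Building the full list in memory is a simplification that keeps the sketch short.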

The command line tool /usr/bin/time reports the CPU time and memory consumption of a process. Each solution is started under it as a subprocess. After each run the user time, system time and maximum memory usage are saved. The results are stored in a JavaScript object for the highchart. You can see the charts in the middle of this article.
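The same three values can also be read without /usr/bin/time. The following is only a sketch of a different technique (the resource module of the python standard library on Unix), not the way bench.py works. The function name measure is hypothetical.

```python
import os
import resource
import subprocess
import sys


def measure(cmd):
    """Run cmd and return (user_seconds, system_seconds, max_rss) for child processes."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    with open(os.devnull, "w") as devnull:
        subprocess.call(cmd, stdout=devnull)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # CPU times accumulate over all children, so take the difference.
    # ru_maxrss is a high-water mark over all children waited for so far
    # (kilobytes on Linux), so it is only exact for the largest run.
    return (after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime,
            after.ru_maxrss)
```

For example, measure(["sort", "-u", "-n", "rand_numbers.txt"]) returns the three values for one run of the sort command.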

Feel free to add new solutions and make some improvements.

first solution: sorting the numbers

The first solution in the comments was sorting all numbers. The expected memory usage is 13.000.000 * 4 bytes (100.000.000 fits into a 32-bit integer), provided the sorting algorithm does not use an extra array for swapping. The average-case cost of the sorting algorithm (quicksort or merge sort) is O(n*log n).
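As a rough cross-check of these numbers, the following sketch computes the size of the raw array and an n*log2(n) estimate for the number of comparisons. The function name sort_estimate is hypothetical, and the comparison count is only an order of magnitude, not a measurement.

```python
import math


def sort_estimate(n, item_size=4):
    """Return (bytes for the raw array, rough n*log2(n) comparison count)."""
    array_bytes = n * item_size
    comparisons = n * math.log(n, 2)
    return array_bytes, comparisons


array_bytes, comparisons = sort_estimate(13000000)
print("array: %d bytes (%.1f MiB)" % (array_bytes, array_bytes / (1024.0 * 1024.0)))
print("comparisons: about %.0f million" % (comparisons / 1e6))
```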

The command line tool sort can sort and remove the duplicates in one step:

sort -u rand_numbers.txt > unique_numbers.txt

A small optimization with sort is comparing the lines numerically (-n) instead of as strings. It will use less memory:

sort -u -n rand_numbers.txt > unique_numbers.txt

The same solution in c should be comparable with the sort command. But I could not find out which sort algorithm is behind the function qsort. The implementation of the standard qsort function could be merge sort or quicksort (also known as partition-exchange sort), and the memory usage can be higher than the pure memory amount for an array of integers.

solve_qsort.c

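#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* qsort callback (not shown in the original listing): ascending order */
static int compare(const void *a, const void *b) {
    int32_t x = *(const int32_t *)a, y = *(const int32_t *)b;
    return (x > y) - (x < y);
}

int main(int argc, char *argv[]) {
    char line[100];
    int i = 0, last;
    FILE *fp = fopen(argv[1], "r");
    int count = atoi(argv[2]);

    /* allocate room for count values */
    int32_t *digits = (int32_t *)malloc(count * sizeof(int32_t));

    /* read each line, convert it to an int, store it in the array */
    while (i < count && fgets(line, sizeof line, fp)) {
        digits[i++] = (int32_t)atoi(line);
    }
    fclose(fp);
    count = i; /* number of values actually read */

    /* sort the complete array */
    qsort(digits, count, sizeof(int32_t), compare);

    /* print all entries, skipping duplicates */
    last = -1;
    for (i = 0; i < count; i++) {
        if (last != digits[i]) printf("%d\n", digits[i]);
        last = digits[i];
    }
    free(digits);
    return 0;
}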

This example is simple and not optimized. The same solution in python holds no surprises, but it is shorter:

solve_number_sort.py

import sys

last = ""
for n in sorted(open(sys.argv[1])):
    if last != n:
        sys.stdout.write(n)
    last = n

This simple python version sorts the lines of the input file as strings and prints the values (sys.stdout.write is faster than print!). Converting the input into an integer array will reduce the memory usage. The result files of the two variants will have a large diff because sorting as numbers and sorting as strings give a different order.
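The different order is easy to reproduce with three values from the allowed range. This is only an illustration and not part of the benchmark.

```python
lines = ["9999999\n", "10000000\n", "1000000\n"]

# compared character by character: "1..." sorts before "9..."
as_strings = sorted(lines)

# compared by numeric value: 7-digit numbers come before 8-digit numbers
as_numbers = sorted(lines, key=int)

print(as_strings)
print(as_numbers)
```

Both variants still print every value exactly once. Only the order of the output lines differs.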

solve_number_sort.py

import sys

values = [int(line) for line in open(sys.argv[1])]
values.sort()
last = 0
for n in values:
    if last != n:
        sys.stdout.write("%d\n" % n)
    last = n

benchmark result of sorting

In the charts you can see the O(n*log n) execution time. I separated the user and system time to clarify the algorithm calculation time vs. the system reading time.

The non-linear computing time is a result of the randomized entries in the test files.

limiting the memory usage

The sort command can split the input into small chunks (for sorting), save them to disk and merge them with limited memory usage. The shell builtin ulimit -d <n> limits the memory for all processes started from that shell, so that sort uses temporary files. But I chose the built-in parameter of the sort command, the --buffer-size=SIZE option (short form -S).

sort -u -n -S 20M rand_numbers.txt

After successful testing I built a memory-limited sorting variant in python as a second example. The magic merge component can be found in the module heapq. Its function merge takes a list of iterators over the pre-sorted chunks and merges them.

merge_sort.py

import sys, tempfile, heapq

limit = 40000

def sortedFileIterator(digits):
    # write one sorted chunk to a temp file and read it back as integers
    fp = tempfile.TemporaryFile(mode="w+")
    digits.sort()
    fp.write("".join("%d\n" % d for d in digits))
    fp.seek(0)
    return (int(line) for line in fp)

iters = []
digits = []
for line in open(sys.argv[1]):
    digits.append(int(line))
    if len(digits) == limit:
        iters.append(sortedFileIterator(digits))
        digits = []
iters.append(sortedFileIterator(digits))

# merge all sorted chunks and filter duplicates;
# the merge has to compare integers, not strings
oldItem = -1
for sortItem in heapq.merge(*iters):
    if oldItem != sortItem:
        sys.stdout.write("%d\n" % sortItem)
    oldItem = sortItem

The two flat lines in the chart are the two examples with constant memory usage.
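The merge step on its own can be tried with small in-memory lists instead of temporary files. A minimal sketch, where the helper name merge_unique is hypothetical:

```python
import heapq


def merge_unique(*sorted_chunks):
    """Yield each value once from several individually sorted iterables."""
    last = None
    # heapq.merge only looks at the current head of every chunk,
    # so it needs memory for one item per chunk
    for item in heapq.merge(*sorted_chunks):
        if item != last:
            yield item
            last = item


print(list(merge_unique([1, 4, 9], [2, 4, 10], [9, 11])))
```

The inputs have to be sorted by the same order that the merge compares with, here numerically.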

remove duplicates with hashmap

The second solution simply puts all entries into a hashmap. The values are used as keys and the data structure removes the duplicate entries automatically. This can be seen in the perl example.

Some languages offer a set: this data structure stores only the unique keys without a value. The point of interest while benchmarking will be the memory usage of the “easy to use” built-in data structure for millions of integers.

perl command line

perl -lne'exists $h{$_} or print $h{$_}=$_'

solve_set.py

import sys

for n in set(open(sys.argv[1])):
    sys.stdout.write(n)

solve_set.lua

local set = {}
for n in io.lines(arg[1]) do
  if not set[n] then
    print(n)
    set[n] = true
  end
end

All solutions can be written in a short time and will work. But the memory usage is terrible! The solutions need about 10 times the memory of the raw integer array. And the usual cost estimate of O(n) for a hashmap (set and get for n million entries in this case) only holds as an average: the “brute force inserting” of keys into a hashmap triggers the reorganisation of its bucket tables again and again as it grows. You can see this in the memory usage chart.
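The overhead can be made visible on a small scale. The following sketch compares the hash table of a python set with a packed integer array. The function name container_sizes is hypothetical, and the exact numbers depend on the interpreter version and platform.

```python
import sys
from array import array


def container_sizes(n):
    """Return (bytes of the set's hash table, bytes of a packed int array) for n values."""
    values = list(range(1000000, 1000000 + n))
    as_set = set(values)
    as_array = array("i", values)
    # sys.getsizeof(as_set) counts only the table of the set itself,
    # not the integer objects it points to, so the real gap is larger
    return sys.getsizeof(as_set), as_array.itemsize * len(as_array)


set_bytes, array_bytes = container_sizes(100000)
print("set: %d bytes, array: %d bytes" % (set_bytes, array_bytes))
```

The set in solve_set.py stores whole lines as strings, which costs more again than the integers used here.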

Using a bitarray instead of a hashmap

The limits in the problem description offer a more elegant solution. The raw memory usage for 13 million 32-bit integers is 52.000.000 bytes (~49MB). Mapping all integers from 1 million to 100 million to a bit position (in linear memory) will use ~12MB (99 million bits / 8). So the memory usage for the bitarray will be lower and constant. And the computation of the very simple map function will be fast.

solve_bittarray.c

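#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    char line[100];
    const int minValue = 1000000;
    const int maxValue = 100000000;
    /* one bit per possible value; calloc zeroes the memory */
    unsigned char *bitarray = (unsigned char *)calloc((maxValue - minValue) / 8 + 1, 1);
    FILE *fp = fopen(argv[1], "r");
    int pos, bit;

    while (fgets(line, sizeof line, fp)) {
        pos = atoi(line);
        bit = pos - minValue;
        /* print the number only when its bit is not set yet */
        if (!(bitarray[bit >> 3] & (1 << (bit & 7)))) {
            printf("%d\n", pos);
            bitarray[bit >> 3] |= (1 << (bit & 7));
        }
    }
    fclose(fp);
    free(bitarray);
    return 0;
}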

We got a python solution with the module bitarray. If you don't have the module bitarray you have to install it. On the default ubuntu installation you need the python-dev and the setuptools packages. The bitarray module is available with easy_install after the successful package installation.

sudo apt-get install python-dev python-setuptools
sudo easy_install bitarray

The python solution is short, but the execution time is much longer than the c variant.

solve_bitarray.py

import sys, bitarray

minValue = 1000000
maxValue = 100000000
bits = bitarray.bitarray(maxValue - minValue + 1)
bits.setall(0)  # bitarray(n) may start with uninitialized bits

for line in open(sys.argv[1]):
    i = int(line)
    if not bits[i - minValue]:
        bits[i - minValue] = 1
        sys.stdout.write(line)
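Without the extra module the same idea can be sketched with the built-in bytearray and the bit arithmetic of the c variant. This variant was not part of the benchmark, and the function name unique_lines is hypothetical.

```python
def unique_lines(lines, min_value=1000000, max_value=100000000):
    """Yield every line whose number was not seen before; input order is kept."""
    size = max_value - min_value + 1
    bits = bytearray((size + 7) // 8)  # one bit per possible value, all zero
    for line in lines:
        offset = int(line) - min_value  # assumes min_value <= number <= max_value
        index, mask = offset >> 3, 1 << (offset & 7)
        if not bits[index] & mask:
            bits[index] |= mask
            yield line
```

It could be used as: for line in unique_lines(open(sys.argv[1])): sys.stdout.write(line). The bytearray for the full range needs about 12MB, regardless of the number of input lines.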

Using mmap instead of normal fileio

A co-worker considered the file access more expensive than the computation and offered a variant with mmap. The mmap function maps the input file into memory and you can iterate over it as a byte array. Because parsing the integers is easy, the variant should be comparable with the bitarray c variant.

https://github.com/ChristianHarms/uc_sniplets/blob/master/no_duplicates/solve_nmap_bitmap.c

The memory usage for the bitarray-mmap variant will increase because the command line tool counts the mapped memory as process memory usage.
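The linked variant is written in c. The same access pattern can be sketched with the mmap module of the python standard library. This is only an illustration of the technique and not a port of the linked file. The function name mmap_lines is hypothetical.

```python
import mmap


def mmap_lines(path):
    """Yield the lines of a file as bytes, read through a read-only memory map."""
    with open(path, "rb") as fp:
        # length 0 maps the whole file; an empty file cannot be mapped
        mapped = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # readline returns b"" at the end of the map
            for line in iter(mapped.readline, b""):
                yield line
        finally:
            mapped.close()
```

Each yielded line can be passed to int() directly, so the generator can feed any of the solutions above.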

conclusion

The bitarray solution has constant, low memory usage and a fast execution time. It only works because the available numbers are limited in range.

Using hashmap / set solutions can result in massive memory usage.

Sorting the input entries is the only solution that can work with limited memory.









Find more about language performance with the "c++ / go / java / scala" language performance benchmark by google on readwriteweb.com.