BigBash It! - Convert SQL queries into bash scripts

Converts your SQL SELECT queries into a self-contained Bash one-liner that can be executed on almost any *nix device, letting you run quick analyses or crunch gigabytes of log files in CSV format. Perfectly suited for big-data tasks on your local machine. The generated scripts perform surprisingly well - thanks to highly optimized Unix tools such as sort and (m)awk - and for larger data sets they are often even faster than database queries. BigBash is open source; you can find more information, for instance details on the supported SQL syntax, on the GitHub page.



Getting started

1. Create a table

Every set of CSV files that you want to query using SELECT needs to be defined as a table, just as in a normal SQL database. Create a table with CREATE TABLE so that it matches the structure of your CSV file(s). For example, if your CSV file "persons.csv" has 5 columns (id, name, street, city, country), use

CREATE TABLE persons (id, name, street, city, country);

You can add data types such as INT or TEXT, as well as the UNIQUE attribute, which can speed up the resulting bash script and ensure correct sorting.
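For instance, the persons table above could be declared with explicit types (a sketch; which columns deserve INT, TEXT, or UNIQUE depends on your actual data):

```sql
CREATE TABLE persons (id INT UNIQUE, name TEXT, street TEXT, city TEXT, country TEXT);
```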



2. Map table to files

After creating a table, use the special MAP command to define the file(s) that should be mapped to it. You can select a range of files using wildcard operators, or even supply an arbitrary bash command whose output lines are used as table rows. For instance, to map a set of gzip-compressed CSV files to our persons table, write

MAP persons TO 'persons_*.csv.gz' DELIMITER ',' TYPE 'GZ' REMOVEHEADER;

REMOVEHEADER denotes that the first line in every file is a header and should be ignored.
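For a gzip'ed mapping like the one above, the generated script essentially decompresses each matching file and drops its first line. A minimal sketch of that stage (the sample file and temporary directory are invented for illustration):

```shell
# Emulate the TYPE 'GZ' + REMOVEHEADER stage: build a sample gzip'ed CSV
# with a header line, then decompress every matching file and skip line 1.
tmpdir=$(mktemp -d)
printf 'id,name,street,city,country\n1,Alice,Main St,Berlin,DE\n' \
  | gzip > "$tmpdir/persons_1.csv.gz"
find "$tmpdir" -name 'persons_*.csv.gz' -print0 \
  | xargs -0 -I{} sh -c 'gzip -dc "{}" | tail -n +2'
# prints:
# 1,Alice,Main St,Berlin,DE
```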



3. The select query

You query this table using a standard SQL SELECT command. To get persons living in Berlin sorted by name, enter

SELECT name, street FROM persons WHERE city = 'Berlin' ORDER BY name;



4. Compile to a bash script

If you have put these three lines in the editor and hit "Create Bash script", you should get a bash script like the following, which you can run directly in your command line to get the query result:

(trap "kill 0" SIGINT; export LC_ALL=C; find persons_*.csv.gz -print0 | xargs -0 -i sh -c "gzip -dc {} | tail -n +2"|tr $',' $';'|cut -d $';' -f2,3,4 |awk -F ';' '($3 == "Berlin") {print}'|sort -t$';' -k 1,1|awk -F ';' '{print $1";"$2}')

More

Look at the examples below for more complex queries with joins and groups. You can find more documentation on the GitHub page.
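To see what the individual stages of such a one-liner do, here is its filter-sort-project core run on a few inline sample rows (the data is made up; after the cut -f2,3,4 projection the fields are name, street, city):

```shell
# Filter rows by city, sort by name, then project the name and street columns,
# mirroring the awk/sort/awk tail of the generated script.
printf 'Bob;Main St;Berlin\nAmy;Elm St;Berlin\nAl;Oak Ave;Paris\n' \
  | awk -F ';' '($3 == "Berlin") {print}' \
  | sort -t ';' -k 1,1 \
  | awk -F ';' '{print $1";"$2}'
# prints:
# Amy;Elm St
# Bob;Main St
```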

SQL Input

--This example uses the cab data set from 'http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml'
CREATE TABLE yellowCab (VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count INT, trip_distance REAL, pickup_longitude, pickup_latitude, RatecodeID, store_and_fwd_flag, dropoff_longitude, dropoff_latitude, payment_type, fare_amount REAL, extra, mta_tax REAL, tip_amount REAL, tolls_amount REAL, improvement_surcharge, total_amount REAL);
--We map the table to the first million lines of the cab trips from 2015/12 that we download on-the-fly.
--If you want to make more queries, download the data set first (~1GB) and use the following map command:
--MAP yellowCab TO 'yellow_tripdata_2015-12.csv' DELIMITER ',' REMOVEHEADER;
MAP yellowCab TO 'curl -s -N "https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-12.csv" | tail -n+2 | head -n 1000000' DELIMITER ',' TYPE 'RAW';
--The select query that will be compiled to a bash one-liner: get the average passenger count grouped by trip_distance
SELECT int(trip_distance + 0.5), SUM(passenger_count)/COUNT(*) FROM yellowCab GROUP BY int(trip_distance + 0.5) HAVING count(*) >= 100 ORDER BY int(trip_distance + 0.5);

--This example uses the MovieLens ml-100k set from 'http://grouplens.org/datasets/movielens/'
CREATE TABLE ratings (userId INT, itemId INT, rating INT, timestamp);
CREATE TABLE movies (movieId UNIQUE, movieTitle, releaseDate, videoReleaseDate, IMDbURL, genre_unknown, genre_Action, genre_Adventure, genre_Animation, genre_Children, genre_Comedy, genre_Crime, genre_Documentary, genre_Drama, genre_Fantasy, genre_FilmNoir, genre_Horror, genre_Musical, genre_Mystery, genre_Romance, genre_SciFi, genre_Thriller, genre_War, genre_Western);
--We map the tables to files directly downloaded from GroupLens
MAP ratings TO 'curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.data"' DELIMITER '\t' TYPE 'RAW';
MAP movies TO 'curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.item"' DELIMITER '|' TYPE 'RAW';
--Calculate the avg. rating and output the joined movie titles
SELECT movieTitle, SUM(rating)/COUNT(*) FROM ratings HASH JOIN movies ON movies.movieId = ratings.itemId GROUP BY itemId HAVING count(*) >= 10 ORDER BY SUM(rating)/COUNT(*) desc LIMIT 20;

--This example uses the MovieLens ml-100k set from 'http://grouplens.org/datasets/movielens/'
CREATE TABLE ratings (userId INT, itemId INT, rating INT, timestamp);
CREATE TABLE movies (movieId UNIQUE, movieTitle, releaseDate, videoReleaseDate, IMDbURL, genre_unknown, genre_Action, genre_Adventure, genre_Animation, genre_Children, genre_Comedy, genre_Crime, genre_Documentary, genre_Drama, genre_Fantasy, genre_FilmNoir, genre_Horror, genre_Musical, genre_Mystery, genre_Romance, genre_SciFi, genre_Thriller, genre_War, genre_Western);
--We map the tables to files directly downloaded from GroupLens
MAP ratings TO 'curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.data"' DELIMITER '\t' TYPE 'RAW';
MAP movies TO 'curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.item"' DELIMITER '|' TYPE 'RAW';
--The select query that will be compiled to a bash one-liner:
--Self-join the ratings to get the cooccurrence count
SELECT movies.movieTitle, movies2.movieTitle, COUNT(*) FROM ratings HASH JOIN movies ON movies.movieId = ratings.itemId JOIN ratings AS ratings2 ON ratings.userId = ratings2.userId HASH JOIN movies AS movies2 ON movies2.movieId = ratings2.itemId WHERE ratings.itemId > ratings2.itemId GROUP BY ratings.itemId, ratings2.itemId ORDER BY COUNT(*) desc LIMIT 10;

--Wikipedia page statistics. IMPORTANT: set the correct data type on the last two columns, otherwise the output will be wrong
CREATE TABLE Wikipages (project_name, page, nr_of_requests INT, page_size INT);
--Map the table to the pagecount file from 2016-06-24 8am that is downloaded on the fly (~85MB)
MAP Wikipages TO 'curl -s -N https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-06/pagecounts-20160624-080000.gz | gzip -dc' DELIMITER ' ' TYPE 'RAW';
--Select the top-10 pages from the English Wikipedia, excluding special pages like user accounts
SELECT page, nr_of_requests FROM Wikipages WHERE project_name='en' and index(page, 'Special:') != 1 ORDER BY nr_of_requests DESC Limit 10;

--Create your tables using the standard create table command
CREATE TABLE movies (id INT UNIQUE, title TEXT, genres TEXT);
--Map every table to a file or even to an output pipe
MAP movies TO 'movies.dat.gz' DELIMITER '::' TYPE 'GZ' REMOVEHEADER;
--The select query that will be compiled to a bash one-liner.
--You can use joins and groups, but no subselects
SELECT title FROM movies ORDER BY title LIMIT 10;

Output separator

Important: choose an output delimiter that is not contained in any column of the CSV files; otherwise this can lead to subtle (or not so subtle) errors in the output.

Use sort-based aggregation

If disabled, aggregations (e.g. SUM over groups) are done in memory. This can lead to out-of-memory problems if the number of groups is very large.

If enabled, aggregations use an additional sorting step which makes the query slower.
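The difference between the two modes can be sketched with plain awk (the sample rows are invented for illustration): in-memory aggregation keeps one associative-array entry per group, while sort-based aggregation sorts by key first and then sums each run of equal keys in a single streaming pass.

```shell
# Sample rows: key;value
data='a;1
b;2
a;3'

# In-memory aggregation: the awk array holds every group at once,
# so memory grows with the number of distinct keys
printf '%s\n' "$data" \
  | awk -F ';' '{s[$1] += $2} END {for (k in s) print k";"s[k]}' | sort

# Sort-based aggregation: sort by key, then sum consecutive runs;
# only one group's running total is kept in memory at a time
printf '%s\n' "$data" | sort -t ';' -k 1,1 \
  | awk -F ';' '$1 != k {if (NR > 1) print k";"s; k = $1; s = 0} {s += $2}
                END {print k";"s}'
# both print:
# a;4
# b;2
```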




Disclaimer: In no event do we take any responsibility for any damages arising out of the use of the generated queries (including, but not limited to, loss of data and inaccuracies in the results sustained by you or third parties).

Disclaimer 2: This is kind of a hack project and is surely full of bugs. Please open an issue on the GitHub page if you encounter any.




