SQL Server is Microsoft's enterprise relational database offering. It was first released in 1989 and has been supported on various Windows and OS/2 platforms since its release. In October 2017, Microsoft released SQL Server 2017 for Linux. To date, Ubuntu 16.04, Red Hat Enterprise Linux 7.3 and 7.4 as well as SUSE Linux Enterprise Server v12 are supported.

Though the Linux release is missing some features found in the Windows offering, the result is a very useful, feature-rich database that fits in well in a UNIX environment.

In this post I'll walk through setting up SQL Server 2017, performing basic data import and export tasks as well as building reports via Jupyter Notebook and automating tasks using Apache Airflow.

Installing SQL Server 2017 on Ubuntu

To start I'll add the public key and package listings from Microsoft to my apt repository. The following was run on a fresh install of Ubuntu 16.04.2 LTS.

```shell
$ wget -q https://packages.microsoft.com/keys/microsoft.asc -O - \
    | sudo apt-key add -
$ wget -q https://packages.microsoft.com/config/ubuntu/16.04/mssql-server-2017.list -O - \
    | sudo tee /etc/apt/sources.list.d/mssql-server-2017.list
$ wget -q https://packages.microsoft.com/config/ubuntu/16.04/prod.list -O - \
    | sudo tee /etc/apt/sources.list.d/msprod.list
$ sudo apt update
```

I'll then install SQL Server, one of its tooling packages, git and Python.

```shell
$ sudo apt install \
    git \
    mssql-server \
    mssql-tools \
    python-pip \
    python-virtualenv
```

The following will allow you to configure the SQL Server license as well as set the SQL Server system administrator password.

```shell
$ sudo /opt/mssql/bin/mssql-conf setup
```

Below are the various licensing options available.

```
Choose an edition of SQL Server:
  1) Evaluation (free, no production use rights, 180-day limit)
  2) Developer (free, no production use rights)
  3) Express (free)
  4) Web (PAID)
  5) Standard (PAID)
  6) Enterprise (PAID)
  7) Enterprise Core (PAID)
  8) I bought a license through a retail sales channel and have a product key to enter.

Details about editions can be found at
https://go.microsoft.com/fwlink/?LinkId=852748&clcid=0x409

Use of PAID editions of this software requires separate licensing through a
Microsoft Volume Licensing program.
By choosing a PAID edition, you are verifying that you have the appropriate
number of licenses in place to install and run this software.

Enter your edition(1-8):
```

I'll add the tools binaries folder to my PATH environment variable.

```shell
$ echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc
$ source ~/.bashrc
```
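As a quick sanity check, a few lines of Python can confirm the tools folder made it onto PATH. This is just my own sketch; the `dir_on_path` helper is not part of any Microsoft tooling.

```python
import os


def dir_on_path(directory, path_env=None):
    """Return True if `directory` is one of the entries in a PATH string."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    return directory in path_env.split(os.pathsep)


# After sourcing ~/.bashrc the tools folder should show up.
print(dir_on_path("/opt/mssql-tools/bin"))
```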

MSSQL-CLI Up & Running

Most SQL Server practitioners will be familiar with sqlcmd, the tool you most often come across when interacting with the server via the command line. But in late 2017, Microsoft released a new CLI tool. It's open source, written in Python and feels a lot more like the CLI tools that accompany Presto and PostgreSQL.

The following will create a Python virtual environment and install the CLI tool.

```shell
$ virtualenv ~/.ms
$ source ~/.ms/bin/activate
$ pip install mssql-cli
```

The CLI picks up on various environment variables. The following will set the username and password for the CLI so that they don't have to be entered each time the tool is launched. Set the user to SA and type in its password below.

```shell
$ read MSSQL_CLI_USER
$ read MSSQL_CLI_PASSWORD
$ export MSSQL_CLI_USER
$ export MSSQL_CLI_PASSWORD
```

Microsoft has included optional telemetry collection in this tool. The following will opt out of monitoring.

```shell
$ export MSSQL_CLI_TELEMETRY_OPTOUT=True
```

By default the CLI tool expects a SQL command to sit on a single line, delimited from other commands by a carriage return. To execute one or more commands, a GO command is issued alone on a single line. This aligns the behaviour of this tool with sqlcmd. The following will change this so that SQL commands can span multiple lines and execute once a semi-colon followed by a carriage return is entered. This is referred to as "multi-line mode".

```shell
$ mkdir -p ~/.config/mssqlcli/
$ vi ~/.config/mssqlcli/config
```

```
[main]
multi_line = True
```
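The same config file can also be written programmatically, which is handy when provisioning several machines. A minimal sketch using Python 3's configparser; the `write_mssqlcli_config` helper name is my own invention:

```python
import configparser
import os


def write_mssqlcli_config(config_dir, multi_line=True):
    """Write a minimal mssql-cli config file enabling multi-line mode."""
    os.makedirs(config_dir, exist_ok=True)
    config = configparser.ConfigParser()
    config["main"] = {"multi_line": str(multi_line)}
    path = os.path.join(config_dir, "config")
    with open(path, "w") as handle:
        config.write(handle)
    return path


write_mssqlcli_config(os.path.expanduser("~/.config/mssqlcli"))
```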

Bulk Loading

I'll clone fivethirtyeight's data repository so that I can import a 2014 FIFA World Cup tournament results prediction.

```shell
$ git clone https://github.com/fivethirtyeight/data.git
```

I'll then launch the CLI tool. It will pick up on the various environment variables and connect to the local SQL Server installation.

```shell
$ mssql-cli
```

I'll create a WorldCup2014 database and create a Predictions table in it.

```sql
CREATE DATABASE WorldCup2014;
USE WorldCup2014;

CREATE TABLE Predictions (
    country     NVARCHAR(50),
    country_id  NVARCHAR(3),
    group_      NVARCHAR(1),
    spi         NUMERIC(4, 2),
    spi_offense NUMERIC(5, 4),
    spi_defense NUMERIC(5, 4),
    win_group   NUMERIC(16, 15),
    sixteen     NUMERIC(16, 15),
    quarter     NUMERIC(16, 15),
    semi        NUMERIC(16, 15),
    cup         NUMERIC(16, 15),
    win         NUMERIC(16, 15));
```

I'll then load in the predictions CSV data. Note that I'm using the full path of the CSV file. The FIRSTROW setting skips the header row in the CSV.

```sql
BULK INSERT Predictions
FROM '/home/mark/data/world-cup-predictions/wc-20140611-132709.csv'
WITH (FIRSTROW = 2,
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\n',
      MAXERRORS = 0,
      KEEPIDENTITY);
```
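With MAXERRORS = 0, a single malformed row aborts the whole load, so it can pay to pre-screen the file first. Below is a rough pre-flight check in Python; the column counts mirror the Predictions table above, but the sample rows and the `bad_rows` helper are made up for illustration:

```python
import csv
import io

# Columns in the Predictions table: the first three are text, the rest numeric.
TEXT_COLUMNS = 3
TOTAL_COLUMNS = 12


def bad_rows(csv_text):
    """Yield (line_number, row) for rows BULK INSERT would likely reject:
    wrong column count, or non-numeric values in the numeric columns."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # FIRSTROW = 2 skips the header
    for number, row in enumerate(reader, start=2):
        if len(row) != TOTAL_COLUMNS:
            yield number, row
            continue
        try:
            for value in row[TEXT_COLUMNS:]:
                float(value)
        except ValueError:
            yield number, row


sample = "\n".join([
    "country,country_id,group,spi,spi_offense,spi_defense,"
    "win_group,sixteen,quarter,semi,cup,win",
    "Brazil,BRA,A,88.08,2.4461,0.5077,0.825,0.996,0.877,0.722,0.556,0.453",
    "Oops,XXX,B,not-a-number,1,1,1,1,1,1,1,1",
])
print(list(bad_rows(sample)))
```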

Exporting Data / Backups

The Linux version of SQL Server allows you to back up and restore databases using the .bak file format. These are binary files that hold a very complete picture of the schemas, stored procedures and data as well as the server's state, configuration and logs.

To demonstrate, I'll create a db.bak file in my home folder and change its group ownership to mssql so the server has permission to work with the file.

```shell
$ touch db.bak
$ sudo chown mark:mssql db.bak
```

I'll then launch the CLI and issue a command to back up the WorldCup2014 database to the file I just created.

```shell
$ mssql-cli
```

```sql
BACKUP DATABASE WorldCup2014
TO DISK = '/home/mark/db.bak';
```

To restore the database I'll first pick out the logical names for the schemas, data and log files. Below you can see they're WorldCup2014 and WorldCup2014_log respectively.

```sql
RESTORE FILELISTONLY
FROM DISK = '/home/mark/db.bak';
```

```
-[ RECORD 1 ]-------------------------
LogicalName          | WorldCup2014
PhysicalName         | /var/opt/mssql/data/WorldCup2014.mdf
Type                 | D
FileGroupName        | PRIMARY
Size                 | 8388608
MaxSize              | 35184372080640
FileId               | 1
CreateLSN            | 0
DropLSN              | 0
UniqueId             | cd3f2e59-0848-4e62-adcd-5555a95021f4
ReadOnlyLSN          | 0
ReadWriteLSN         | 0
BackupSizeInBytes    | 2883584
SourceBlockSize      | 4096
FileGroupId          | 1
LogGroupGUID         | NULL
DifferentialBaseLSN  | 0
DifferentialBaseGUID | 00000000-0000-0000-0000-000000000000
IsReadOnly           | 0
IsPresent            | 1
TDEThumbprint        | NULL
SnapshotUrl          | NULL
-[ RECORD 2 ]-------------------------
LogicalName          | WorldCup2014_log
PhysicalName         | /var/opt/mssql/data/WorldCup2014_log.ldf
Type                 | L
FileGroupName        | NULL
Size                 | 8388608
MaxSize              | 2199023255552
FileId               | 2
CreateLSN            | 0
DropLSN              | 0
UniqueId             | f9909495-16b1-4d42-85d9-ac56f85cf611
ReadOnlyLSN          | 0
ReadWriteLSN         | 0
BackupSizeInBytes    | 0
SourceBlockSize      | 4096
FileGroupId          | 0
LogGroupGUID         | NULL
DifferentialBaseLSN  | 0
DifferentialBaseGUID | 00000000-0000-0000-0000-000000000000
IsReadOnly           | 0
IsPresent            | 1
TDEThumbprint        | NULL
SnapshotUrl          | NULL
```

I will create two target files in my home folder to restore the schemas, data and logs to, and assign their group ownership to the mssql group.

```shell
$ touch ~/restored.{mdf,ldf}
$ sudo chown mark:mssql ~/restored.{mdf,ldf}
```

I will then launch the CLI, create a new database called restored, switch it into single-user mode and restore the backup to the two restored files in my home directory.

```shell
$ mssql-cli
```

```sql
CREATE DATABASE restored;

ALTER DATABASE restored
SET SINGLE_USER
WITH ROLLBACK IMMEDIATE;

RESTORE DATABASE restored
FROM DISK = '/home/mark/db.bak'
WITH REPLACE, RECOVERY,
     MOVE 'WorldCup2014' TO '/home/mark/restored.mdf',
     MOVE 'WorldCup2014_log' TO '/home/mark/restored.ldf';
```

With the database recovered I can switch to the restored database and select out the three countries the model predicted were most likely to win the 2014 World Cup, along with their probabilities of winning.

```sql
USE restored;

SELECT TOP 3 country, win
FROM Predictions
ORDER BY win DESC;
```

```
+-----------+-------------------+
| country   | win               |
|-----------+-------------------|
| Brazil    | 0.453437174610376 |
| Argentina | 0.127798660739543 |
| Germany   | 0.106981061567489 |
+-----------+-------------------+
```
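Since the MOVE clauses have to repeat the logical names reported by RESTORE FILELISTONLY, I find it handy to generate the statement rather than type it. A small sketch; `build_restore_sql` is my own helper and does no quoting or validation, so don't feed it untrusted input:

```python
def build_restore_sql(database, backup_path, moves):
    """Build a RESTORE DATABASE statement relocating each logical file.

    `moves` maps logical names (from RESTORE FILELISTONLY) to the new
    physical paths they should be restored to."""
    move_clauses = ",\n     ".join(
        "MOVE '%s' TO '%s'" % (logical, physical)
        for logical, physical in moves.items()
    )
    return (
        "RESTORE DATABASE %s\n"
        "FROM DISK = '%s'\n"
        "WITH REPLACE, RECOVERY,\n"
        "     %s;" % (database, backup_path, move_clauses)
    )


print(build_restore_sql(
    "restored",
    "/home/mark/db.bak",
    {"WorldCup2014": "/home/mark/restored.mdf",
     "WorldCup2014_log": "/home/mark/restored.ldf"},
))
```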

Export to CSV

If you want to dump the output of a SELECT statement to CSV files and the output is made up of nothing more than integers, fractional numbers and strings that don't require escaping, then sqlcmd can be of good use.

The following will set the environment variable holding the database password for sqlcmd.

```shell
$ export SQLCMDPASSWORD=$MSSQL_CLI_PASSWORD
```

The following will dump the predictions to a CSV file.

```shell
$ sqlcmd \
    -S localhost \
    -U SA \
    -d WorldCup2014 \
    -s ',' \
    -W \
    -h-1 \
    -Q "SET NOCOUNT ON; SELECT * FROM Predictions" \
    > out.csv
```
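If your strings do contain commas or quotes, sqlcmd's naive comma-joined output will produce a corrupt file. Python's csv module handles the escaping correctly; a minimal sketch of the difference:

```python
import csv
import io


def to_csv(rows):
    """Serialise rows to RFC 4180-style CSV, quoting only where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# A comma inside a value would break a naive join; csv.writer quotes it.
print(to_csv([["Korea, Republic of", 0.01], ["Brazil", 0.45]]))
```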

Export to SQL

In early 2017, work began on mssql-scripter, a Python-based CLI tool that can export SQL Server databases to plain-text SQL files. The following will install the tool and set the environment variable for the database password so you don't need to enter it each time you use the tool.

```shell
$ pip install mssql-scripter
$ export MSSQL_SCRIPTER_PASSWORD=$MSSQL_CLI_PASSWORD
```

The following will dump out the World Cup 2014 database.

```shell
$ mssql-scripter \
    --server localhost \
    --database WorldCup2014 \
    --user SA \
    --schema-and-data \
    > dump.sql
```

A lot of effort has gone into ensuring as much of the database's state as possible is preserved in the .sql dump file. These are the first 30 lines of the above command's output.

```shell
$ head -n30 dump.sql
```

```sql
USE [master]
GO
/****** Object:  Database [WorldCup2014]    Script Date: 8/12/18 7:37:35 AM ******/
CREATE DATABASE [WorldCup2014]
 CONTAINMENT = NONE
 ON PRIMARY
( NAME = N'WorldCup2014', FILENAME = N'/var/opt/mssql/data/WorldCup2014.mdf', SIZE = 8192KB, MAXSIZE = UNLIMITED, FILEGROWTH = 65536KB )
 LOG ON
( NAME = N'WorldCup2014_log', FILENAME = N'/var/opt/mssql/data/WorldCup2014_log.ldf', SIZE = 8192KB, MAXSIZE = 2048GB, FILEGROWTH = 65536KB )
GO
IF (1 = FULLTEXTSERVICEPROPERTY('IsFullTextInstalled'))
begin
EXEC [WorldCup2014].[dbo].[sp_fulltext_database] @action = 'enable'
end
GO
ALTER DATABASE [WorldCup2014] SET ANSI_NULL_DEFAULT OFF
GO
ALTER DATABASE [WorldCup2014] SET ANSI_NULLS OFF
GO
ALTER DATABASE [WorldCup2014] SET ANSI_PADDING OFF
GO
ALTER DATABASE [WorldCup2014] SET ANSI_WARNINGS OFF
GO
ALTER DATABASE [WorldCup2014] SET ARITHABORT OFF
GO
ALTER DATABASE [WorldCup2014] SET AUTO_CLOSE OFF
GO
ALTER DATABASE [WorldCup2014] SET AUTO_SHRINK OFF
GO
ALTER DATABASE [WorldCup2014] SET AUTO_UPDATE_STATISTICS ON
```
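One thing to keep in mind when replaying dump.sql outside of sqlcmd: GO is a client-side batch separator, not T-SQL, so database drivers such as pymssql will reject it. A rough sketch of splitting a dump into executable batches (my own helper; it doesn't handle GO appearing inside string literals or comments):

```python
def split_batches(script):
    """Split a T-SQL script on lines containing only the GO separator."""
    batches, current = [], []
    for line in script.splitlines():
        if line.strip().upper() == "GO":
            if current:
                batches.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        batches.append("\n".join(current))
    return batches


dump = "USE [master]\nGO\nCREATE DATABASE [WorldCup2014]\nGO\n"
print(len(split_batches(dump)))
```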

Plotting Data with Jupyter Notebook

You can source SQL Server-stored data for analysis in Jupyter Notebook via the pymssql library. The following will install Pandas, Jupyter Notebook, Matplotlib as well as pymssql.

```shell
$ pip install \
    jupyter \
    matplotlib \
    pandas \
    pymssql
```

I'll set a password for the Notebook server and then launch it.

```shell
$ mkdir ~/.jupyter
$ jupyter-notebook password
$ jupyter-notebook \
    --no-browser \
    --ip=0.0.0.0 \
    --NotebookApp.iopub_data_rate_limit=100000000
```

With the server running I'll open http://127.0.0.1:8888 in a web browser. When the Web UI loads, select "New" in the top right corner and then "Python 2". You should now see a new, blank notebook.

The following plots the relationship between the probabilities of the various competing countries reaching the quarter and semi finals of the 2014 World Cup. I usually keep my imports in the top cell.

```python
import os

import pandas as pd
import pymssql
```

The code to generate the scatter plot lives in the second cell.

```python
with pymssql.connect(server="127.0.0.1",
                     user="SA",
                     password=os.environ['MSSQL_CLI_PASSWORD'],
                     database="WorldCup2014") as conn:
    df = pd.read_sql("SELECT quarter, semi FROM Predictions", conn)

df.plot.scatter(x='quarter', y='semi', c='DarkBlue')
```
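As a side note, the quarter and semi columns also let you back out each team's implied chance of winning its quarter-final, since P(semi) = P(quarter) × P(win the quarter-final). A quick sketch with made-up numbers standing in for a live query:

```python
# Hypothetical rows shaped like the query above: (quarter, semi) probabilities.
rows = [
    (0.88, 0.72),
    (0.55, 0.31),
    (0.40, 0.18),
]

# Implied chance of winning the quarter-final, conditional on reaching it.
conditional = [round(semi / quarter, 2) for quarter, semi in rows]
print(conditional)
```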