MySQL 5.7 introduces both a new native JSON data type and a set of SQL functions for manipulating and searching JSON data in a very natural way on the server side. Today I wanted to show a simple example of these features in action, using sample data from SF OpenData.

Importing Sample Data

Having good sample data is useful, because it helps you self-validate that results are accurate. It also helps provide good data distribution, which is important when adding indexes.

My chosen data set from SF OpenData is the most popular item under “Geographic Locations and Boundaries” and contains approximately 200K city lots. The first step is to download and import it into MySQL:

# First downloading the data
# To save a step, I'm going to fetch them pre-formatted in JSON
cd /tmp
curl -O https://raw.githubusercontent.com/zemirco/sf-city-lots-json/master/citylots.json
grep "^{ \"type" citylots.json > features.json

# Importing into MySQL
CREATE TABLE features (
  id INT NOT NULL AUTO_INCREMENT,
  feature JSON NOT NULL,
  PRIMARY KEY (id)
);

LOAD DATA INFILE '/tmp/features.json' INTO TABLE features (feature);

Here is an example of what each of the features (a parcel of land) looks like:

{
  "type": "Feature",
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [-122.42200352825247, 37.80848009696725, 0],
        [-122.42207601332528, 37.808835019815085, 0],
        [-122.42110217434865, 37.808803534992904, 0],
        [-122.42106256906727, 37.80860105681814, 0],
        [-122.42200352825247, 37.80848009696725, 0]
      ]
    ]
  },
  "properties": {
    "TO_ST": "0",
    "BLKLOT": "0001001",
    "STREET": "UNKNOWN",
    "FROM_ST": "0",
    "LOT_NUM": "001",
    "ST_TYPE": null,
    "ODD_EVEN": "E",
    "BLOCK_NUM": "0001",
    "MAPBLKLOT": "0001001"
  }
}

In this case all 200K documents do follow a common format, but I should point out that this is not a requirement. JSON is schema-less 🙂
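Even though the column itself is schema-less, it is easy to check what structure the documents actually have from within SQL. As a quick sketch, the JSON_KEYS and JSON_TYPE functions can inspect a document (note that MySQL normalizes native JSON documents, so the key order shown may differ from the input file):

```sql
-- Inspect one document: its top-level keys and its JSON type
SELECT JSON_KEYS(feature) AS top_level_keys,
       JSON_TYPE(feature) AS doc_type
FROM features
LIMIT 1;
-- top_level_keys is a JSON array such as ["type", "geometry", "properties"]
-- doc_type is OBJECT for every row in this data set
```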

Example Queries

Query #1: Find a parcel of land on Market street, one of the main streets in San Francisco:

SELECT * FROM features WHERE feature->"$.properties.STREET" = 'MARKET' LIMIT 1\G
************************* 1. row *************************
     id: 12250
feature: {"type": "Feature", "geometry": {"type": "Polygon", "coordinates": [[[-122.39836263491878, 37.79189388899312, 0], [-122.39845248797837, 37.79233030084018, 0], [-122.39768507706792, 37.7924280850133, 0], [-122.39836263491878, 37.79189388899312, 0]]]}, "properties": {"TO_ST": "388", "BLKLOT": "0265003", "STREET": "MARKET", "FROM_ST": "388", "LOT_NUM": "003", "ST_TYPE": "ST", "ODD_EVEN": "E", "BLOCK_NUM": "0265", "MAPBLKLOT": "0265003"}}
1 row in set (0.02 sec)

Using the shorthand JSON_EXTRACT operator (->), I can query into a JSON column in a very natural way. The syntax "$.properties.STREET" is what we call a JSON path, and for those familiar with JavaScript I like to compare it to a CSS selector, similar to what you would use with jQuery.

To learn more about the JSON path syntax, I recommend checking out our manual page, or this blog post by Roland Bouman.
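To give a flavor of the path syntax, here are a few variations on the query above. These are sketches against the same features table; the -> operator is simply shorthand for JSON_EXTRACT(), array elements are addressed with [n], and recent 5.7 releases (5.7.13+) also offer ->> for unquoting:

```sql
-- These two queries are equivalent:
SELECT JSON_EXTRACT(feature, '$.properties.STREET') FROM features LIMIT 1;
SELECT feature->'$.properties.STREET' FROM features LIMIT 1;

-- Array elements use [n]; this grabs the first coordinate of the polygon:
SELECT feature->'$.geometry.coordinates[0][0]' FROM features LIMIT 1;

-- ->> is shorthand for JSON_UNQUOTE(JSON_EXTRACT(...)):
-- it returns MARKET rather than the JSON string "MARKET"
SELECT feature->>'$.properties.STREET' FROM features WHERE id = 12250;
```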

Query #2: Find any parcels of land that do not specify a street:

SELECT * FROM features WHERE feature->"$.properties.STREET" IS NULL LIMIT 1\G
Empty set (0.39 sec)

With JSON being schemaless, this is how you can find documents that do not have the expected structure. In this example we can see that every document has $.properties.STREET specified, and thus the query returns zero results.
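A subtle point worth noting: extracting a path that does not exist returns SQL NULL, while a stored JSON null (like ST_TYPE in the sample document above) does not. To distinguish the two cases explicitly, JSON_CONTAINS_PATH and JSON_TYPE can be used; a sketch:

```sql
-- Documents where the path is missing entirely:
SELECT COUNT(*) FROM features
WHERE NOT JSON_CONTAINS_PATH(feature, 'one', '$.properties.STREET');

-- Documents where the path exists but holds a JSON null:
SELECT COUNT(*) FROM features
WHERE JSON_CONTAINS_PATH(feature, 'one', '$.properties.ST_TYPE')
  AND JSON_TYPE(feature->'$.properties.ST_TYPE') = 'NULL';
```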

Comparing the JSON type to TEXT

In this example I am running a query which deliberately needs to access all 200K JSON documents. This could be considered a micro-benchmark, as it does not quite reflect what you would experience in production, where you will often have indexes:

# as JSON type
SELECT DISTINCT feature->"$.type" as json_extract FROM features;
+--------------+
| json_extract |
+--------------+
| "Feature"    |
+--------------+
1 row in set (1.25 sec)

# as TEXT type
ALTER TABLE features CHANGE feature feature LONGTEXT NOT NULL;
SELECT DISTINCT feature->"$.type" as json_extract FROM features;
+--------------+
| json_extract |
+--------------+
| "Feature"    |
+--------------+
1 row in set (12.85 sec)

To explain what is happening here in more detail:

For simplicity, I’ve ensured that in both examples the dataset fits in memory.

The JSON functions, including the shorthand JSON_EXTRACT operator (->), work on both the native JSON data type and TEXT/BLOB/VARCHAR data types. This is very useful because it provides a nice upgrade path for users who were already storing JSON as text prior to MySQL 5.7.

We can see that the native JSON data type is indeed about 10x faster than TEXT – 1.25 seconds versus 12.85 seconds. This can be explained because the native type does not have to do any parsing or validation of the data at query time, and it can retrieve elements of a JSON document very efficiently.
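The flip side of this is where the parsing and validation happen. With text columns, every JSON function call must parse its input; with the native type, documents are validated once, on insert, and stored in an optimized binary format. A sketch of both behaviors:

```sql
-- JSON functions accept string input, parsing it on every call:
SELECT JSON_EXTRACT('{"type": "Feature"}', '$.type');
-- returns the JSON string "Feature"

-- The native JSON type validates up front: an INSERT of malformed
-- JSON into the features table is rejected immediately, e.g.
--   INSERT INTO features (feature) VALUES ('not valid json');
-- fails with an "Invalid JSON text" error rather than producing
-- a document that breaks later queries.
```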

Conclusion

Hopefully this serves as a useful example of importing sample JSON data and running a few sample queries. In my next post I will take this a step further by showing how you can index JSON data using virtual columns.
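As a quick preview, the general shape of that technique is to materialize a JSON path as a generated column and index it. This is only a sketch (the column name street and its VARCHAR(30) size are illustrative choices, not from the data set's documentation):

```sql
-- Expose the street name as a virtual generated column:
ALTER TABLE features ADD COLUMN street VARCHAR(30)
  GENERATED ALWAYS AS (JSON_UNQUOTE(feature->'$.properties.STREET')) VIRTUAL;

-- A regular secondary index can then be built on it:
ALTER TABLE features ADD INDEX (street);

-- The optimizer can now use the index instead of scanning all 200K documents:
SELECT COUNT(*) FROM features WHERE street = 'MARKET';
```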