Happy 4/20!

Today is April 20th, the unofficial holiday of marijuana afficionados the world over. Happy 4/20! Or, as we in the data business like to say, Happy 20%!

Recreational marijuana has been legalized in a dozen US states, medical use is legal in dozens more, and two-thirds of Americans support the drug's full legalization. But despite marijuana's increasing acceptance in law and culture, it remains illegal at the federal level, and is still seen as taboo in a way that alcohol (a much more dangerous drug) is not. That's why I can use Bitmoji to render this delightful image of alcohol abuse:

But if I want a cartoon version of myself enjoying another legal substance in my home state of California, the best I get is this oblique reference to a cloud of smoke.

Here at Liquidata, we don't shy away from dealing with controversial datasets and analysis. That's why we've chosen to mark the holiday by bringing you 28 grams of high-potency marijuana-related datasets to peruse. (This is just under the legal limit on marijuana data in most states.)

Sections

This is a long blog, so we've summarized it for you. Feel free to jump to the topic that most interests you!

Why are there no pictures of weed in ImageNet?

ImageNet is the world's largest open dataset of labeled images for machine learning. Liquidata has a queryable version, which lets you find all known synonyms for weed:

% dolt sql - q "select * from words_synsets where word='marijuana'" + | word | lex_id | synset_id | synset_type | word_num | + | marijuana | 0 | 02949691 | n | 2 | | marijuana | 0 | 12397210 | n | 1 | + % dolt sql - q "select * from words_synsets where synset_type='n' and synset_id='02949691'" + | word | lex_id | synset_id | synset_type | word_num | + | cannabis | 0 | 02949691 | n | 1 | | ganja | 0 | 02949691 | n | 4 | | marihuana | 0 | 02949691 | n | 3 | | marijuana | 0 | 02949691 | n | 2 | + % dolt sql - q "select * from words_synsets where synset_type='n' and synset_id='03600806'" + | word | lex_id | synset_id | synset_type | word_num | + | joint | 0 | 03600806 | n | 1 | | marijuana cigarette | 0 | 03600806 | n | 2 | | reefer | 0 | 03600806 | n | 3 | | spliff | 0 | 03600806 | n | 5 | | stick | 2 | 03600806 | n | 4 | +

You can try these synonym queries yourself on the web here.

Now that we know the synset_id of marijuana and joints, we can use them to find some pictures of these terms:

% dolt sql - q "select * from images_synsets where synset_type='n' and synset_id='03600806'" + | synset_type | synset_id | image_id | image_url | + + % dolt sql - q "select * from images_synsets where synset_type='n' and synset_id='02949691'" + | synset_type | synset_id | image_id | image_url | + +

What?! There are no images of joints or weed known to ImageNet! This is a grave omission. So let's fix that. I'll start by checking out a new branch, then inserting some weed and joint images I found with Google image search:

% dolt checkout - b zachmu / weed Switched to branch 'zachmu/weed' % dolt sql image_net > insert into images_synsets values - > ( 'n' , '03600806' , 1 , 'https://d3atagt0rnqk7k.cloudfront.net/wp-content/uploads/2016/11/15161632/journey-of-the-joint.jpg' ) , - > ( 'n' , '03600806' , 2 , 'https://cdn.mos.cms.futurecdn.net/HYFGvd6wdFHKS8XgwBMQec-970-80.jpg' ) ; Query OK , 2 rows affected image_net > insert into images_synsets values - > ( 'n' , '02949691' , 1 , 'https://files.nccih.nih.gov/files/marijuana-GettyImages-116036237-square.jpg' ) , - > ( 'n' , '02949691' , 2 , 'https://cbs4indy.com/wp-content/uploads/sites/22/2019/11/gettyimages-1148304954.jpg' ) ; Query OK , 2 rows affected image_net > exit Bye

I'll make sure the diff looks good, then push these changes back to DoltHub:

% dolt diff diff --dolt a/images_synsets b/images_synsets diff --dolt a/images_synsets b/images_synsets --- a/images_synsets @ i2cqdt24f72vc74kgifff9von0q7uq39 +++ b/images_synsets @ hd6oe18u2o0s95e3i53iqaf3j8t4shmn +-----+-------------+-----------+----------+----------------------------------------------------------------------------------------------------+ | | synset_type | synset_id | image_id | image_url | +-----+-------------+-----------+----------+----------------------------------------------------------------------------------------------------+ | + | n | 02949691 | 1 | https://files.nccih.nih.gov/files/marijuana-GettyImages-116036237-square.jpg | | + | n | 02949691 | 2 | https://cbs4indy.com/wp-content/uploads/sites/22/2019/11/gettyimages-1148304954.jpg | | + | n | 03600806 | 1 | https://d3atagt0rnqk7k.cloudfront.net/wp-content/uploads/2016/11/15161632/journey-of-the-joint.jpg | | + | n | 03600806 | 2 | https://cdn.mos.cms.futurecdn.net/HYFGvd6wdFHKS8XgwBMQec-970-80.jpg | +-----+-------------+-----------+----------+----------------------------------------------------------------------------------------------------+ % dolt push origin zachmu/weed Successfully uploaded 1 of 1 file ( s ) .

Now that I've pushed to DoltHub, I can open a pull request to get the data updated. Let's hope it's accepted!

Cannabis testing in Washington state

We previously reported on the availability of this dataset while announcing the launch of saved queries, but in case you missed it: researchers have analyzed the lab testing results of over 200,000 tests of commercial recreational marijuana products in Washington state. This lets us find the strongest weed available in Washington state:

doltsql > select test_strain , avg ( thc_max ) from tests where inventory_type = 'Flower Lot' group by 1 order by 2 desc limit 20 ; + | test_strain | AVG ( tests . thc_max ) | + | Liberty Haze 00016 | 37.80834 | | Liberty Haze 00027 | 37.300564 | | Liberty Haze 00022 | 36.47355 | | Liberty Haze 00025 | 35.974540000000005 | | BD_7_16 - 11 - 02 | 35.4394 | | Kosher Tangie | Liberty Haze 00020 | 34.18342 | | WIFI_8_16 - 09 - 02 | 33.5982 | | Liberty Haze 00026 | 33.49863833333333 | | Grape OG | OREG_T6_16 . 04.27 | 32.62886666666667 | | Tangie Land | 32.5347 | | Bubba Kush 00010 | 32.485366666666664 | | Fruity Pebbles OG | 32.4271 | | Kosher Kush 00005 | 32.263335 | | CANTELOPE | 31.7811 | | Liberty Haze 00021 | 31.76712857142857 | | Liberty Haze 00013 | 31.7639 | | GGLU_T03_16 . 12.26 | 31.749850000000002 | | WIFI25 x ACHEM | 31.7303 | +

Or you can look up the highest CBD variety for the medical crowd. There's lots of other interesting nuggets to dig up there, so get querying.

Connecticut medical marijuana brand registry

Connecticut publishes a list of all of its medical marijuana products with extensive lab testing information, much more in-depth than any other source I've seen. In addition to the standard measurements of THC, THC-A, and CBD, Connecticut measures a pharmacopia of other cannaboids. Check it out:

marijuana_data> desc ct_medical_brand_registry; +--------------------------------+--------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +--------------------------------+--------------+------+-----+---------+-------+ | brand_name | VARCHAR(255) | NO | PRI | | | | dosage_form | VARCHAR(255) | NO | | | | | producer | VARCHAR(255) | NO | | | | | product_image | VARCHAR(255) | NO | | | | | label_image | VARCHAR(255) | NO | | | | | approval_date | VARCHAR(80) | NO | | | | | registration_number | VARCHAR(255) | NO | | | | | tetrahydrocannabinol_thc | VARCHAR(255) | YES | | | | | tetrahydrocannabinol_acid_thca | VARCHAR(255) | YES | | | | | cannabidiols_cbd | VARCHAR(255) | YES | | | | | cannabidiol_acid_cbda | VARCHAR(255) | YES | | | | | a_pinene | VARCHAR(255) | YES | | | | | b_myrcene | VARCHAR(255) | YES | | | | | b_caryophyllene | VARCHAR(255) | YES | | | | | b_pinene | VARCHAR(255) | YES | | | | | limonene | VARCHAR(255) | YES | | | | | ocimene | VARCHAR(255) | YES | | | | | linalool_lin | VARCHAR(255) | YES | | | | | humulene_hum | VARCHAR(255) | YES | | | | | cbg | VARCHAR(255) | YES | | | | | cbg_a | VARCHAR(255) | YES | | | | | cannabavarin_cbdv | VARCHAR(255) | YES | | | | | cannabichromene_cbc | VARCHAR(255) | YES | | | | | cannbinol_cbn | VARCHAR(255) | YES | | | | | tetrahydrocannabivarin_thcv | VARCHAR(255) | YES | | | | | a_bisabolol | VARCHAR(255) | YES | | | | | a_phellandrene | DOUBLE | YES | | | | | a_terpinene | VARCHAR(255) | YES | | | | | b_eudesmol | VARCHAR(255) | YES | | | | | b_terpinene | DOUBLE | YES | | | | | fenchone | DOUBLE | YES | | | | | pulegol | DOUBLE | YES | | | | | borneol | VARCHAR(255) | YES | | | | | isopulegol | DOUBLE | YES | | | | | carene | VARCHAR(255) | YES | | | | | camphene | VARCHAR(255) | YES | | | | | camphor | DOUBLE | YES | | | | | caryophyllene_oxide | VARCHAR(255) | YES | | | | | cedrol | DOUBLE | YES | | | | | eucalyptol | VARCHAR(255) | YES | | | | | geraniol | DOUBLE | YES | | | | | guaiol | VARCHAR(255) | YES | | | | | geranyl_acetate | DOUBLE | YES | | | | | isoborneol | DOUBLE | YES | | | | | menthol | VARCHAR(255) | YES | | | | | l_fenchone | DOUBLE | YES | | | | | nerol | DOUBLE | YES | | | | | sabinene | DOUBLE | YES | | | | | terpineol | DOUBLE | YES | | | | | terpinolene | VARCHAR(255) | YES | | | | | trans_b_farnesene | VARCHAR(255) | YES | | | | | valencene | VARCHAR(255) | YES | | | | | a_cedrene | VARCHAR(255) | YES | | | | | a_farnesene | VARCHAR(255) | YES | | | | | b_farnesene | DOUBLE | YES | | | | | cis_nerolidol | DOUBLE | YES | | | | | fenchol | VARCHAR(255) | YES | | | | | trans_nerolidol | VARCHAR(255) | YES | | | | +--------------------------------+--------------+------+-----+---------+-------+

I'll be honest: I don't know what isopulegol , valencene , or most of those compounds are. Probably some of you do, though!

Annoyingly, this dataset uses strings for many of its compound measurements, mainly so it can encode values like "<.10" but probably also so that it's easier to import the data without cleaning it first. So let's add a numeric THC column that we can use to find the strongest medical cannabis in Connecticut. Then we'll remove the '<' and '%' characters from the data to make them into floating points. In general, this kind of cleanup is easier to do in "post-production" rather than before importing the dataset. Write queries, not code!

marijuana_data > alter table ct_medical_brand_registry - > add column thc_percent float null after tetrahydrocannabinol_thc ; marijuana_data > update ct_medical_brand_registry - > set thc_percent = replace ( replace ( tetrahydrocannabinol_thc , '%' , '' ) , '<' , '' ) ; Query OK , 7078 rows affected

Now we can find the strongest medical cannabis in Connecticut:

marijuana_data > select brand_name , thc_percent from ct_medical_brand_registry order by thc_percent desc limit 10 ; + | brand_name | thc_percent | + | Lexikan Distillate T96 % S 8353 | 95.51 | | Lexikan Isolate T95 S 7837 | 95.12 | | Garnet T95 % TD 7484 | 94.81 | | Garnet T95 % TD 7491 | 94.81 | | Opal T95 % TD 7494 | 94.72 | | Fioraden C Distillate T94 % I 8366 | 94.15 | | Scarlett T94 % TD 7939 | 93.55 | | Onyx T93 % TD 7914 | 93.01 | | Onyx T93 % TD 7944 | 93.01 | | Citron T93 % TD 8533 | 92.8 | +

While we're in here, let's examine how many products are available by producer and by dosage type:

marijuana_data > select producer , count ( * ) from ct_medical_brand_registry group by 1 order by 2 desc limit 10 ; + | producer | COUNT ( * ) | + | ADVANCED GROW LABS LLC | 2623 | | THERAPLANT LLC | 1717 | | CURALEAF LLC | 1586 | | CONNECTICUT PHARMACEUTICAL SOLUTIONS LLC | 1152 | + marijuana_data > select dosage_form , count ( * ) from ct_medical_brand_registry group by 1 order by 2 desc limit 10 ; + | dosage_form | COUNT ( * ) | + | Extracts , sprays , tinctures or oils | 1299 | | Raw material ( flower ) | 1288 | | Flower | 416 | | Cigarettes ( pre - rolled cones ) | 399 | | Raw Flower | 309 | | Concentrate | 303 | | Vaporizer Cartridge | 216 | | Capsules or pills | 215 | | Baked goods | 177 | | 3.5 g Flower | 176 | +

Surprisingly, there are only four producers of medical marijuana products known to the state of Connecticut! Also, Connecticut is sloppy with its input: look how many ways they keyed in "flower" in their database:

marijuana_data > select dosage_form , count ( * ) from ct_medical_brand_registry - > where lower ( dosage_form ) like '%flower%' - > group by 1 order by 2 desc limit 10 ; + | dosage_form | COUNT ( * ) | + | Raw material ( flower ) | 1288 | | Flower | 416 | | Raw Flower | 309 | | 3.5 g Flower | 176 | | 3.5 g Raw Flower | 92 | | flower | 40 | | Flower SC | 35 | | Raw FLower | 19 | | raw flower | 9 | | Raw flower | 3 | +

We're not saying that Connecticut data entry specialists were getting high on their own supply, but we're not not saying that.

Leafly catalog

Leafly is widely considered the most comprehensive source of marijuana strain information on the internet, with thousands of user reviews for thousands of kinds of flower. A kind soul known only as kingburrito666 uploaded the entire leafly strains database to kaggle, and we imported it to DoltHub, where it can be queried.

Leafly's major contribution to the marijuana dataset genre is subjective evaluations on the effects of different strains and how they taste. So, if you are in the mood for an indica to compose some poetry to, and you want it to taste like lemons, you can ask:

marijuana_data > select strain , rating , effects , flavor from leafly - > where type = 'indica' and effects like '%Creative%' - > and flavor like '%Citrus%' - > order by rating desc limit 10 ; + | strain | rating | effects | flavor | + | Blood - Orange - Kush | 5 | Sleepy , Uplifted , Euphoric , Happy , Creative | Diesel , Citrus , Orange | | Bright - Fire | 5 | Hungry , Relaxed , Uplifted , Creative , Euphoric | Pungent , Earthy , Citrus | | California - Hash - Plant | 5 | Uplifted , Creative , Energetic , Giggly , Happy | Citrus | | Negra - 44 | 5 | Creative , Energetic , Relaxed , Hungry , Talkative | Sweet , Earthy , Citrus | | Durga - Mata | 4.9 | Relaxed , Happy , Giggly , Uplifted , Creative | Minty , Earthy , Citrus | | Golden - State - Banana | 4.8 | Euphoric , Relaxed , Sleepy , Creative , Happy | Citrus , Sweet , Tropical | | Northern - Wreck | 4.8 | Relaxed , Euphoric , Happy , Tingly , Creative | Earthy , Woody , Citrus | | White - Tahoe - Cookies | 4.7 | Aroused , Sleepy , Uplifted , Creative , Euphoric | Spicy / Herbal , Citrus , Tree , Fruit | | Flowerbomb - Kush | 4.6 | Happy , Relaxed , Creative , Hungry , Euphoric | Sweet , Earthy , Citrus | | Forbidden - Fruit | 4.6 | Relaxed , Happy , Euphoric , Sleepy , Creative | Citrus , Grapefruit , Tropical | +

What could be easier? Leafly also provides a description of all strains. Let's find the ones that mention "smile":

marijuana_data > select strain , type , rating , effects , flavor - > from leafly where description like '%smile%' - > order by rating desc ; + | strain | type | rating | effects | flavor | + | Blush | hybrid | 5 | Talkative , Creative , Happy , Hungry , Relaxed | Pepper , Honey , Earthy | | Congo | sativa | 5 | Uplifted , Energetic , Euphoric , Happy , Talkative | Ammonia , Diesel , Sweet | | Monolith | indica | 5 | Relaxed , Sleepy , Tingly , Euphoric , Focused | Pungent , Earthy , Pine | | Purple - Swish | indica | 5 | Happy , Energetic , Sleepy , Giggly , Hungry | Berry , Sweet , Grape | | Lemonhead - Og | hybrid | 4.9 | Relaxed , Uplifted , Happy , Euphoric , Creative | Lemon , Citrus , Sweet | | Blue - Frost | hybrid | 4.5 | Uplifted , Happy , Relaxed , Euphoric , Creative | Blueberry , Cheese , Menthol | | La - Kush | hybrid | 4.5 | Relaxed , Happy , Creative , Uplifted , Euphoric | Menthol , Pungent , Pine | | Lemonder | hybrid | 4.5 | Happy , Uplifted , Giggly , Energetic , Euphoric | Citrus , Lemon , Earthy | | Super - Sour - Skunk | hybrid | 4.5 | Happy , Euphoric , Talkative , Sleepy , Uplifted | Citrus , Earthy , Diesel | | Amnesia - Haze | sativa | 4.3 | Happy , Euphoric , Uplifted , Energetic , Creative | Earthy , Citrus , Lemon | | Strawberry - Cough | sativa | 4.3 | Happy , Uplifted , Euphoric , Energetic , Relaxed | Strawberry , Sweet , Berry | | Ogiesel | hybrid | 4.2 | Euphoric , Energetic , Creative , Relaxed , Happy | Earthy , Pine , Diesel | | A - Train | hybrid | 4.1 | Creative , Euphoric , Relaxed , Happy , Hungry | Earthy , Woody , Citrus | | Soul - Shine | hybrid | 4.1 | Euphoric , Happy , Creative , Hungry , Uplifted | | | Fruitylicious | indica | 4 | Relaxed , Happy , Euphoric , Uplifted , Aroused | Sweet , Blueberry , Minty | +

The possibility for creative analysis of strains here is endless. And we think you'll agree that querying is better than searching for results with a web form or writing a python script.

Kushy app dump

Like Leafly, Kushy has a large data set on strains, including concentrations of various cannabinoids. It also goes farther, and provides comprehensive information on brand names, branded products, and shops. This lets us write some interesting queries. For example, which cities have the most weed shops?

select city , count ( * ) from kushy_shops group by 1 order by 2 desc limit 10 ; + | city | COUNT ( * ) | + | Los Angeles | 139 | | Portland | 130 | | Denver | 105 | | Las Vegas | 59 | | Colorado Springs | 57 | | Seattle | 57 | | < NULL > | 41 | | San Diego | 28 | | Eugene | 26 | | Chicago | 25 | +

I'm a little surprised to see Portland beat out Seattle here.

We can also find out which state has the most brands of weed products:

select location , count ( * ) from kushy_brands group by 1 order by 2 desc limit 10 ; + | location | COUNT ( * ) | + | CA | 1115 | | CO | 91 | | WA | 81 | | MI | 69 | | OR | 68 | | Ca | 64 | | Colorado | 44 | | California | 33 | | AZ | 31 | | Oregon | 30 | +

Not surpisingly, CA dominates in number of brands, but I'm surprised by the degree of this dominance. California is, by leaps and bounds, the state to launch a branded marijuana product company. Also note that Kushy has some data quality issues of their own that need cleaning up.

DEA funding for cannabis eradication in legal states

As we mentioned at the beginning, marijuana remains illegal at the federal level, and is still an active target of the Drug Enforcement Administration. The DEA still funds a program called the Domestic Cannabis Suppression / Eradication Program, which mainly involves subsidies to state and local law enforcement agencies to shut down grow operations. I was able to find how much money has been spent on this program between 2012 and 2016 via a FOIA request published by the DEA. Unfortunately, the DEA is only one step above the famed "print out the data, place on a wooden table, take a picture, and upload the picture" school of web development. Yes, they publish information on the DCE/SP as a single JPEG, almost as if they don't anyone to easily consume it.

Fortunately, online OCR tools are pretty decent at this point, so it didn't take too long to get this data cleaned up and imported into DoltHub.

Joining this data with a dataset on state marijuana legalization laws lets us discover how much of your tax dollars the federal government is spending to exterminate a drug that state residents voted to make legal in states where it's medical as well as recreational:

marijuana_data > select sum ( funding_2016 ) from dea_cannabis_eradication_funding d - > join state_legal_status s on lower ( s . state ) = lower ( d . state ) - > where s . medical ; + | SUM ( dea_cannabis_eradication_funding . funding_2016 ) | + | 2077390 | + marijuana_data > select sum ( funding_2016 ) from dea_cannabis_eradication_funding d - > join state_legal_status s on lower ( s . state ) = lower ( d . state ) - > where s . recreational ; + | SUM ( dea_cannabis_eradication_funding . funding_2016 ) | + | 5465828 | +

So in 2016, the federal government spent $5.6 million to eradicate a legal drug in recreational states, and $2 million in medical states. To be fair, California received $4MM in funding from the program in 2016 and didn't legalize recreational weed until the end of 2016, but it was medically legal for two decades before that in the Golden State.

Conclusion

We hope you have enjoyed this compendium of marijuana data. We believe that Dolt and DoltHub are the best way to share and collaborate on data on the internet. Now that we've done the hard part of cleaning up this data and making it available in a structured format, you can help us by submitting pull requests to keep it up to date. Download Dolt today to try it out yourself!