Cassandra Conclusion: “One way that Cassandra deviates from Mongo is that it offers much more control on how it’s data is laid out. Consider a scenario where we are interested in laying out large quantities of data that are related, like a friend’s list. Storing this in MongoDB can be a bit tricky – it’s not great at storing lists that are continuously growing. If you don’t store the friends in a single document, you end up risking pulling data from several different locations on disk (or on different servers) which can slow down your entire application. Under heavy load this will impact other queries being performed concurrently.”[1]

If you have a mature project that stores a lot of consecutive data, and you will want to read that data back later without jumping around to different disks, Cassandra looks like a strong candidate for queries such as:

Show last 50 items for “TheMostIntrestingPersonInTheWorld”: item1, item2, .. item3000 ..

Show me last comments on “TheLucasMovie”: comment1, comment2, comment3

Show water level in Louisiana RiverIoT: level at 8am, level at 8:01am, level at 8:02am, x 100-1000 locations

Great if you already have a data structure set up and it fits the model above. [2][3]
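The access patterns above share one shape: rows are grouped by a key (a person, a movie, a sensor) and kept in order inside that group, so "show the last N" becomes one contiguous read. The sketch below is a toy in-memory model of that layout in plain Python. It is not the Cassandra driver and not how Cassandra stores data internally; the class name `WideRowStore`, its methods, and the sample readings are all made up for illustration.

```python
from bisect import insort
from collections import defaultdict


class WideRowStore:
    """Toy model: one sorted list of (clustering_key, value) per partition key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        # insort keeps each partition ordered by clustering key at write time
        insort(self._partitions[partition_key], (clustering_key, value))

    def last(self, partition_key, n):
        # Newest first: one contiguous slice of a single partition
        if n <= 0:
            return []
        rows = self._partitions[partition_key][-n:]
        return [value for _, value in reversed(rows)]


store = WideRowStore()
# Writes arrive out of order; the partition stays sorted by time
store.insert("RiverIoT", "08:01", 4.3)
store.insert("RiverIoT", "08:00", 4.2)
store.insert("RiverIoT", "08:02", 4.5)
store.insert("TheLucasMovie", "08:00", "comment1")

latest_two = store.last("RiverIoT", 2)
print(latest_two)
```

In a real Cassandra table the same two roles are played by the partition key and the clustering columns of the PRIMARY KEY.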

MongoDB Conclusion: No structure. import mongodb, mydb = db.myawesomedatabase, mydb.insert(start adding data). Done.

You have a project and you are not sure how NoSQL will handle it but you want to try it. [4]

You have a working process but it has grown to a point where a traditional RDBMS can’t handle the IO load. [5]

You don’t have time to create table structures just now, you just want to get going, and see what happens.

You want to find Python documentation fast, and benefit from the examples of a large community.





Installation:

    # Add this line to /etc/apt/sources.list to enable the Cassandra repo:
    #   deb http://www.apache.org/dist/cassandra/debian 37x main
    sudo apt-get update
    update-alternatives --config java   # pick openjdk 8
    sudo apt-get install cassandra

    # status
    nodetool status
    nodetool info
    nodetool tpstats

    # python
    virtualenv -p python3 env_py3
    source env_py3/bin/activate
    pip install cassandra-driver

Python:

    import glob

    from cassandra.cluster import Cluster

    # Connect to the local node (check it first with: nodetool status / info / tpstats)
    cluster = Cluster()
    session = cluster.connect()

    # https://github.com/dkoepke/cassandra-python-driver/blob/master/example.py
    # IF NOT EXISTS lets the script be re-run without errors
    session.execute("CREATE KEYSPACE IF NOT EXISTS vindata WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '1' }")
    session.execute("use vindata")

    # http://www.slideshare.net/ebenhewitt/cassandra-datamodel-4985524 slide 23
    session.execute("""
        CREATE TABLE IF NOT EXISTS emissions (
            vin text,
            make text,
            year text,
            zip_code_of_station text,
            co2 text,
            year_month_key int,
            PRIMARY KEY (vin)
        )
    """)

    # https://www.youtube.com/watch?v=97VBdgIgcCU
    # Load mydata
    print(glob.glob("./data/*.dat"))
    for datafile in glob.glob("./data/*.dat"):
        # Year and month come from the file name; the column is an int, so convert
        ymk = int('20' + datafile[-12:-8])
        with open(datafile, 'r') as f:
            for row in f:
                # Fixed-width record: slice each field out by column position
                data = {}
                data['vin'] = row[:20].strip()
                data['make'] = row[20:24].strip()
                data['year'] = row[24:28].strip()
                data['zip_code_of_station'] = row[42:47].strip()
                data['co2'] = row[47:48].strip()
                data['year_month_key'] = ymk
                session.execute(
                    """
                    INSERT INTO emissions (vin, make, year, zip_code_of_station, co2, year_month_key)
                    VALUES (%s, %s, %s, %s, %s, %s)
                    """,
                    (data['vin'], data['make'], data['year'], data['zip_code_of_station'], data['co2'], data['year_month_key'])
                )

    # Look up one vehicle by its primary key
    future = session.execute_async("SELECT * FROM emissions where vin='1B4GP33R9TB205257'")
    rows = future.result()
    for row in rows:
        print(row)
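The column slicing in the loader can be checked without a running database by pulling it into a small function. The sketch below reuses the offsets from the loader; the helper names `parse_row` and `year_month_key`, the sample line, and the file name `emis1608test.dat` are invented for this example, so real files may look different.

```python
# Column offsets copied from the loader: (field name, start, end)
FIELDS = [
    ("vin", 0, 20),
    ("make", 20, 24),
    ("year", 24, 28),
    ("zip_code_of_station", 42, 47),
    ("co2", 47, 48),
]


def parse_row(row):
    """Slice one fixed-width line into a dict of stripped strings."""
    return {name: row[start:end].strip() for name, start, end in FIELDS}


def year_month_key(path):
    """'20' + characters [-12:-8] of the path, as an int to match the int column."""
    return int("20" + path[-12:-8])


# Synthetic line built to match the offsets above (hypothetical data)
sample = "1B4GP33R9TB205257".ljust(20) + "DODG" + "1996" + " " * 14 + "60601" + "P"
record = parse_row(sample)
record["year_month_key"] = year_month_key("./data/emis1608test.dat")
print(record)
```

Returning an int from `year_month_key` matters because the emissions table declares that column as int, and the driver would otherwise send the value as quoted text.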

Installation:

    sudo aptitude install mongodb
    /etc/init.d/mongodb start

    # python
    virtualenv -p python3 env_py3
    source env_py3/bin/activate
    pip install pymongo

Python:

    # http://api.mongodb.com/python/current/tutorial.html
    import glob

    from pymongo import MongoClient

    client = MongoClient('mongodb://localhost:27017/')
    # create database
    db = client.vindata
    # create collection/table
    emissions = db.emissions

    # Load data from mydata
    print(glob.glob("./data/*.dat"))
    for datafile in glob.glob("./data/*.dat"):
        # Year and month come from the file name; stored as a string, e.g. '201608'
        ymk = '20' + datafile[-12:-8]
        with open(datafile, 'r') as f:
            for row in f:
                # Fixed-width record: slice each field out by column position
                data = {}
                data['vin'] = row[:20].strip()
                data['make'] = row[20:24].strip()
                data['year'] = row[24:28].strip()
                data['zip_code_of_station'] = row[42:47].strip()
                data['co2'] = row[47:48].strip()
                data['year_month_key'] = ymk
                # insert() was removed in PyMongo 4; insert_one() replaces it
                emissions.insert_one(data)

    # count() was removed in PyMongo 4; count_documents({}) counts every document
    print(emissions.count_documents({}))
    print(emissions.find_one())
    print(emissions.find_one({"vin": "1B4GP33R9TB205257"}))
    # http://altons.github.io/python/2013/01/21/gentle-introduction-to-mongodb-using-pymongo/
    # https://www.youtube.com/watch?v=f7l8PTjQ160&index=4&list=PLGOsbT2r-igmFK9IKEGAnBaklqtuW7l8W
    # https://www.youtube.com/watch?v=FVyIxdxsyok

    # -------BONUS--------------
    import pandas

    cursor = emissions.find({"year_month_key": "201608"})
    result = pandas.DataFrame(list(cursor))
    print(result.describe())
    print(result.columns)
    # http://lucasmanual.com/mywiki/Pandas
    # later http://alexgaudio.com/2012/07/07/monarymongopandas.html
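The loader sends one document per round trip. If load time becomes a problem, one option is to group the parsed rows and hand each group to PyMongo's `insert_many`. The sketch below shows only the grouping step, so it runs without a database; the helper name `batched`, the batch size of 1000, and the stand-in documents are assumptions made for illustration.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in documents; in the loader these would be the parsed rows
docs = ({"vin": "VIN%05d" % i, "year_month_key": "201608"} for i in range(2500))

batch_sizes = []
for batch in batched(docs, 1000):
    # With a live collection this line would be: emissions.insert_many(batch)
    batch_sizes.append(len(batch))

print(batch_sizes)
```

Whether batching helps, and which batch size works best, depends on the data and the server, so it is worth measuring before adopting.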

Sources:

1. https://academy.datastax.com/mongodb-to-cassandra-migration

2. http://www.slideshare.net/nkorla1share/cass-summit-3?qid=f85a27f7-a560-48bb-9d64-6eaa91c39f24&v=&b=&from_search=8

3. https://www.youtube.com/watch?v=tg6eIht-00M

4. https://www.mongodb.com/customers/city-of-chicago

5. https://www.youtube.com/watch?v=FVyIxdxsyok