After lots of ground-breaking work led by the UC Berkeley AMP Lab, Spark was developed to utilize distributed, in-memory data structures to improve data processing speeds over Hadoop for most workloads. In this post, we're going to cover the architecture of Spark and basic transformations and actions using a real dataset. If you want to write and run your own Spark code, check out the interactive version of this post on Dataquest.

The core data structure in Spark is an RDD, or resilient distributed dataset. As the name suggests, an RDD is Spark's representation of a dataset that is distributed across the RAM, or memory, of lots of machines. An RDD object is essentially a collection of elements that you can use to hold lists of tuples, dictionaries, lists, etc. Similar to DataFrames in Pandas, you load a dataset into an RDD and can then run any of the methods accessible to that object.

While Spark is written in Scala, a language that compiles down to bytecode for the JVM, the open source community has developed a wonderful toolkit called PySpark that allows you to interface with RDD's in Python. Thanks to a library called Py4J, Python can interface with JVM objects (in our case, RDD's), and this library is one of the tools that makes PySpark work.

To start off, we'll load the dataset containing all of the Daily Show guests into an RDD. We are using the TSV version of FiveThirtyEight's dataset. TSV files are separated, or delimited, by a tab character (`"\t"`) instead of a comma (`","`) like in a CSV file.

```python
raw_data = sc.textFile("daily_show.tsv")
raw_data.take(5)
```

```
['YEAR\tGoogleKnowlege_Occupation\tShow\tGroup \tRaw_Guest_List',
 '1999\tactor\t1/11/99\tActing\tMichael J. Fox',
 '1999\tComedian\t1/12/99\tComedy\tSandra Bernhard',
 '1999\ttelevision actress\t1/13/99\tActing \tTracey Ullman',
 '1999\tfilm actress\t1/14/99\tActing\tGillian Anderson']
```
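Since each element of `raw_data` is just the raw text of one line, a natural next step is to split each line on the tab delimiter so we can work with individual fields. Here's a minimal sketch of that idea using the RDD `map()` transformation; the variable name `daily_show` is purely illustrative and isn't defined anywhere else in this post.

```python
# Split every line on the tab character, turning each element into a
# list of fields: [YEAR, occupation, show date, group, guest name].
daily_show = raw_data.map(lambda line: line.split("\t"))

# Peek at the first two parsed records (the first is the header row).
daily_show.take(2)
```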

`SparkContext` is the object that manages the connection to the clusters in Spark and coordinates running processes on the clusters themselves. SparkContext connects to cluster managers, which manage the actual executors that run the specific computations. The Spark documentation includes a diagram of this architecture that's worth a look. The SparkContext object is usually referenced as the variable `sc`.
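In environments like the PySpark shell or the Dataquest interface, `sc` is created for you automatically. If you were running a standalone Python script instead, you would typically build it yourself. Here's a minimal sketch, assuming a local PySpark installation; the application name `"daily_show"` is just an illustrative choice.

```python
from pyspark import SparkConf, SparkContext

# Run Spark locally, using as many worker threads as there are CPU cores.
conf = SparkConf().setMaster("local[*]").setAppName("daily_show")
sc = SparkContext(conf=conf)
```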

We then run:

```python
raw_data = sc.textFile("daily_show.tsv")
```

to read the TSV dataset into the RDD object `raw_data`. The RDD object `raw_data` closely resembles a list of String objects, one object for each line in the dataset.

We then use the `take()` method to print the first 5 elements of the RDD:

```python
raw_data.take(5)
```

To explore the other methods an RDD object has access to, check out the PySpark documentation. `take(n)` will return the first `n` elements of the RDD.
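For a quick sense of what else is available, here's a short sketch of two other common RDD actions; both are standard PySpark RDD methods.

```python
# first() returns the first element of the RDD (here, the header line).
raw_data.first()

# count() returns the number of elements (one per line in the file).
raw_data.count()
```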

One question you may have is: if an RDD resembles a Python list, why not just use bracket notation to access elements in the RDD? Because RDD objects are distributed across lots of partitions, we can't rely on the standard implementation of a list; the RDD object was developed specifically to handle the distributed nature of the data. One advantage of the RDD abstraction is the ability to run Spark locally on your own computer. When running locally, Spark simulates distributing your calculations over lots of machines by slicing your computer's memory into partitions, with no tweaking or changes to the code you wrote.
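You can see this partitioning directly. The sketch below uses the standard `getNumPartitions()` method; the exact numbers you get back depend on your local configuration.

```python
# How many partitions did Spark split the text file into?
raw_data.getNumPartitions()

# When building an RDD from a Python collection, you can control the
# number of partitions explicitly with the numSlices argument.
numbers = sc.parallelize(range(1000), numSlices=8)
numbers.getNumPartitions()   # 8
```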

Another advantage of Spark's RDD implementation is the ability to lazily evaluate code, postponing running a calculation until absolutely necessary. In the code above, Spark didn't wait to load the TSV file into an RDD until `raw_data.take(5)` was run. When `raw_data = sc.textFile("daily_show.tsv")` was called, a pointer to the file was created, but only when `raw_data.take(5)` needed the file to run its logic was the text file actually read into `raw_data`. We will see more examples of this lazy evaluation in this lesson and in future lessons.
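To make the laziness concrete, here's a small sketch: transformations like `map()` and `filter()` return immediately because no data is read yet, and the work only happens once an action such as `take()` is called. The column index used in the filter assumes the layout of the dataset shown above.

```python
# Transformations are lazy: these two lines only build up a plan.
fields = raw_data.map(lambda line: line.split("\t"))
actors = fields.filter(lambda row: row[1] == "actor")

# Only this action forces Spark to read the file and run the pipeline.
actors.take(3)
```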