
...

Create a new text file in your home directory on the cluster using nano (or your favorite text editor):
Code Block
nano sparktest.txt
Put a few lines of text into the file and save it, for example:
Code Block
This is line one
This is line two
This is line three
This is line four
Load the file into an RDD as follows:
Code Block
rdd = sc.textFile("sparktest.txt")
Note that you can use the type() function to verify that rdd is indeed a PySpark RDD.
Count the number of lines in the rdd:
Code Block
lines = rdd.count()
Now you can use the split() and flatMap() functions to count the number of individual words:
Code Block
words = rdd.flatMap(lambda x: x.split()).count()
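To sanity-check the numbers you get back, it can help to reproduce the same logic without Spark. The following is a minimal plain-Python sketch, not PySpark code: it assumes the sample file contains the four example lines on separate lines, and the names sample_lines, flat_words, and nested_words are made up for illustration. It shows why flatMap() yields a word count, whereas a map() with the same lambda would still have one element per line.

```python
from itertools import chain

# Assumed contents of sparktest.txt, one string per line. This mirrors
# what sc.textFile() produces: one RDD element per line of the file.
sample_lines = [
    "This is line one",
    "This is line two",
    "This is line three",
    "This is line four",
]

# Plain-Python analogue of rdd.count(): the number of elements (lines).
line_count = len(sample_lines)

# Analogue of rdd.flatMap(lambda x: x.split()): split each line on
# whitespace, then flatten the per-line lists into one sequence of words.
flat_words = list(chain.from_iterable(line.split() for line in sample_lines))
word_count = len(flat_words)

# For contrast, the analogue of rdd.map(lambda x: x.split()) keeps one
# list per line, so counting its elements gives the number of lines.
nested_words = [line.split() for line in sample_lines]

print(line_count)         # 4
print(word_count)         # 16
print(len(nested_words))  # 4
```

With this sample text, lines should be 4 and words should be 16 in the PySpark session as well. If your file has different contents, the same reasoning applies with your own numbers.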

For a more detailed overview of Spark (and Big Data in general) with examples, you can view the slides from the recent XSEDE Big Data workshop (additional sessions will be held in the future): https://www.psc.edu/index.php/hpc-workshop-series/big-data-november-2016
