...

  1. Create a new text file in your home directory on the cluster using nano (or your favorite text editor):

    Code Block
    nano sparktest.txt
  2. Put a few lines of text into the file and save it, for example:

    Code Block
    This is line one
    This is line two
    This is line three
    This is line four
  3. Load the file into an RDD as follows:

    Code Block
    rdd = sc.textFile("sparktest.txt")

    Note that you can use the type() command to verify that rdd is indeed a PySpark RDD.
      
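    Conceptually, sc.textFile() produces a collection with one element per line of the file. A plain-Python sketch of that behavior, using the sample lines from step 2 (no Spark session required; the in-place file creation here is just for illustration):

    ```python
    # Create sparktest.txt with the sample lines from step 2, then read it
    # back as a list of lines -- a plain-Python analogue of the one-element-
    # per-line collection that sc.textFile() yields.
    sample = ("This is line one\n"
              "This is line two\n"
              "This is line three\n"
              "This is line four\n")
    with open("sparktest.txt", "w") as f:
        f.write(sample)

    with open("sparktest.txt") as f:
        lines = [ln.rstrip("\n") for ln in f]

    print(lines[0])  # This is line one
    ```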

  4. Count the number of lines in the rdd:

    Code Block
    lines = rdd.count()
  5. Now you can use the split() and flatMap() functions to count the number of individual words:

    Code Block
    words = rdd.flatMap(lambda x: x.split()).count()
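    To see why flatMap() is needed here, the same computation can be sketched in plain Python using the four sample lines from step 2: split() turns each line into a list of words, and flatMap() flattens those lists into a single collection before counting (no Spark needed for this illustration):

    ```python
    # The four sample lines from step 2
    lines = [
        "This is line one",
        "This is line two",
        "This is line three",
        "This is line four",
    ]

    # Equivalent of rdd.count() in step 4: one element per line
    line_count = len(lines)   # 4

    # Equivalent of rdd.flatMap(lambda x: x.split()).count():
    # split each line into words, flatten into one list, then count
    words = [word for line in lines for word in line.split()]
    word_count = len(words)   # 16

    print(line_count, word_count)  # 4 16
    ```

    With map() instead of flatMap(), you would get four lists of words rather than sixteen individual words, so count() would again return 4.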

...