...

  1. Create a new text file in your home directory on the cluster using nano (or your favorite text editor):

    Code Block
    nano sparktest.txt
  2. Put a few lines of text into the file and save it, for example:

    Code Block
    This is line one
    This is line two
    This is line three
    This is line four
  3. Load the file into an RDD as follows:

    Code Block
    rdd = sc.textFile("sparktest.txt")

    Note that you can use the type() command to verify that rdd is indeed a PySpark RDD.
      
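    Conceptually, sc.textFile() produces a collection with one element per line of the file. A plain-Python sketch of that behavior, using the sample lines from step 2 (no Spark session required; the in-place file creation here is just for illustration):

    ```python
    # Create sparktest.txt with the sample lines from step 2, then read it
    # back as a list of lines -- a plain-Python analogue of the one-element-
    # per-line collection that sc.textFile() yields.
    sample = ("This is line one\n"
              "This is line two\n"
              "This is line three\n"
              "This is line four\n")
    with open("sparktest.txt", "w") as f:
        f.write(sample)

    with open("sparktest.txt") as f:
        lines = [ln.rstrip("\n") for ln in f]

    print(lines[0])  # This is line one
    ```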

  4. Count the number of lines in the rdd:

    Code Block
    lines = rdd.count()
  5. Now you can use the split() and flatMap() functions to count the number of individual words:

    Code Block
    words = rdd.flatMap(lambda x: x.split()).count()
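    To see why flatMap() is needed here, the same computation can be sketched in plain Python using the four sample lines from step 2: split() turns each line into a list of words, and flatMap() flattens those lists into a single collection before counting (no Spark needed for this illustration):

    ```python
    # The four sample lines from step 2
    lines = [
        "This is line one",
        "This is line two",
        "This is line three",
        "This is line four",
    ]

    # Equivalent of rdd.count() in step 4: one element per line
    line_count = len(lines)   # 4

    # Equivalent of rdd.flatMap(lambda x: x.split()).count():
    # split each line into words, flatten into one list, then count
    words = [word for line in lines for word in line.split()]
    word_count = len(words)   # 16

    print(line_count, word_count)  # 4 16
    ```

    With map() instead of flatMap(), you would get four lists of words rather than sixteen individual words, so count() would again return 4.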

...