...

  1. Connect to the Tufts High Performance Compute Cluster. See the Connecting/Access guide for detailed instructions.

  2. Load the Spark module with the following command:

    Code Block
    module load spark

    Note that you can see a list of all available modules (potentially including different versions of Spark) by typing:

    Code Block
    module avail

    You can load a specific version of Spark with the module load command, or use the generic module name (spark) to load the latest version.
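
    For example, to request a particular version explicitly (the version string below is hypothetical; use one that module avail actually lists):

    Code Block
    module load spark/2.1.0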
      

  3. Start a PySpark session by typing:

    Code Block
    pyspark
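
    The interactive shell pre-creates a SparkContext bound to the variable sc (Spark 2.x and later also provide a SparkSession named spark). As a quick check that the session is live, you can print the running Spark version:

    Code Block
    sc.version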

...

  1. Create a new text file in your home directory on the cluster using nano (or your favorite text editor):

    Code Block
    nano sparktest.txt
  2. Put a few lines of text into the file and save it, for example:

    Code Block
    This is line one
    This is line two
    This is line three
    This is line four
  3. Load the file into an RDD as follows:

    Code Block
    rdd = sc.textFile("sparktest.txt")

    Note that you can use the type() function to verify that rdd is indeed a PySpark RDD; for example:
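
    Code Block
    type(rdd)
    # <class 'pyspark.rdd.RDD'>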

  4. Count the number of lines in the rdd:

    Code Block
    lines = rdd.count()
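    # with the four-line example file above, count() returns 4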
  5. Now you can use the split() and flatMap() functions to count the number of individual words:

    Code Block
    words = rdd.flatMap(lambda x: x.split()).count()
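
    Building on the same pattern, a classic word-frequency count needs only core RDD operations (flatMap, map, reduceByKey, collect); this is a minimal sketch, and the variable names are illustrative:

    Code Block
    counts = (rdd.flatMap(lambda x: x.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))
    counts.collect()
    # e.g. [('This', 4), ('is', 4), ('line', 4), ('one', 1), ('two', 1), ('three', 1), ('four', 1)]
    # (pair ordering may vary)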

  

For a more detailed overview of Spark (and Big Data in general) with examples, you can view the slides from the recent XSEDE Big Data workshop (additional sessions will be held in the future): https://www.psc.edu/index.php/hpc-workshop-series/big-data-november-2016. Please contact tts-research@tufts.edu for information regarding future workshops.