...

  1. Connect to the Tufts High Performance Compute Cluster. See Connecting Access for a detailed guide.

  2. Load the Python module with the following command:

    Code Block
    module load python

    Note that you can see a list of all available modules (potentially including different versions of Python) by typing:

    Code Block
    module avail

    You can request a specific version of Python with the module load command, or use the generic module name (python) to load the latest version.

  3. Start a Python session by typing:

    Code Block
    python

    At the Python prompt (>>>), run a quick test to confirm that the interpreter works:

    Code Block
    print("Hello, World!")
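Because the generic module name can resolve to different Python versions over time, it may help to confirm which interpreter the session is actually running. A minimal sketch using only the standard library (the variable names are illustrative):

```python
import sys

# Full version string of the running interpreter
print(sys.version)

# Major and minor version numbers, useful for a quick compatibility check
major, minor = sys.version_info[0], sys.version_info[1]
print("Running Python {}.{}".format(major, minor))
```

If the reported version is not the one you expected, exit the session and load a versioned module name from the module avail listing instead.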

A Simple Test

To make sure that Spark and PySpark are working properly, let's load a file into an RDD and run a simple operation on it. In this example, we will create a text file with a few lines and use PySpark to count both the number of lines and the number of words.

  1. Create a new text file in your home directory on the cluster using nano (or your favorite text editor):

    Code Block
    nano sparktest.txt
  2. Put a few lines of text into the file and save it, for example:

    Code Block
    This is line one
    This is line two
    This is line three
    This is line four
  3. Load the file into an RDD as follows:

    Code Block
    rdd = sc.textFile("sparktest.txt")

    Note that you can use the type() function to verify that rdd is indeed a PySpark RDD.

  4. Count the number of lines in the RDD:

    Code Block
    lines = rdd.count()
  5. Now you can use the split() and flatMap() functions to count the number of individual words:

    Code Block
    words = rdd.flatMap(lambda x: x.split()).count()
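
For the four-line sample file shown in step 2, the expected results are 4 lines and 16 words. The sketch below reproduces both counts in plain Python, without Spark, as a cross-check. It also illustrates why flatMap() is used rather than map(): splitting each line yields one list per line, and flattening those lists gives one element per word. The sample text is embedded as a string here, and the variable names are illustrative, not part of PySpark.

```python
from itertools import chain

# The sample text from step 2, embedded so the sketch is self-contained
text = """This is line one
This is line two
This is line three
This is line four"""

lines = text.splitlines()

# Equivalent of rdd.count(): one element per line
line_count = len(lines)

# Equivalent of rdd.map(lambda x: x.split()): one list of words per line
per_line = [line.split() for line in lines]

# Equivalent of rdd.flatMap(lambda x: x.split()): the per-line lists flattened
words = list(chain.from_iterable(per_line))
word_count = len(words)

print(line_count)     # 4
print(len(per_line))  # 4, since map() alone still yields one element per line
print(word_count)     # 16
```

If the PySpark counts differ from these, check that the file name passed to sc.textFile() matches the file you created and that the file has no extra blank lines.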

...

 

Python related:

How can I verify if a particular Python package is installed?
Add-on packages such as numpy and scipy are already installed. Other packages, if installed, are located under the install tree at:
/opt/shared/python/
in the version-specific site-packages directory. Another approach is to use pip:

> module load python/2.7.6
> pip list
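
A package can also be checked from inside a Python session by attempting to import it. The following is a minimal sketch, not a cluster-provided tool: the helper name is_installed is hypothetical, and it relies only on importlib.import_module from the standard library, which is available in both Python 2.7 and Python 3.

```python
import importlib

def is_installed(package_name):
    """Return True if package_name can be imported by this interpreter."""
    try:
        importlib.import_module(package_name)
        return True
    except ImportError:
        return False

# numpy and scipy are the packages named above; any importable name works
for name in ("numpy", "scipy"):
    status = "installed" if is_installed(name) else "not found"
    print("{}: {}".format(name, status))
```

Note that the import name of a package can differ from the name shown by pip list, so a "not found" result is worth confirming against the pip output.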

 

For a more detailed overview of Python and how it relates to Big Data or High Performance Computing (HPC), please contact tts-research@tufts.edu for information regarding future workshops.

...