...
- Connect to the Tufts High Performance Compute Cluster. See Connecting for a detailed guide.
Load the Python module with the following command:
Code Block module load python
Note that you can see a list of all available modules (potentially including different versions of Python) by typing:
Code Block module avail
You can load a specific version of Python by naming it in the module load command, or use the generic module name (python) to load the latest version.
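For example, if module avail lists a versioned Python module, you can name it explicitly. The version shown below is only an illustration; substitute one that actually appears in the module list on the cluster:
Code Block
module avail python          # list only the Python-related modules
module load python/3.8       # hypothetical versioned module name; use one shown by module avail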
Start a Python session by typing:
Code Block python
Confirm that the interpreter is working by printing a message:
Code Block print("Hello, World!")
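A minimal interactive check, assuming the module loaded cleanly, looks like this (type these lines at the Python prompt):
Code Block
print("Hello, World!")    # the interpreter should respond with: Hello, World!
exit()                    # leave the Python session when you are done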
A Simple Test
To make sure that Spark and PySpark are working properly, let's load an RDD and perform a simple operation on it. In this example, we will create a text file with a few lines and use PySpark to count both the number of lines and the number of words.
Create a new text file in your home directory on the cluster using nano (or your favorite text editor):
Code Block nano test.txt
Put a few lines of text into the file and save it, for example:
Code Block
This is line one
This is line two
This is line three
This is line four
Load the file into an RDD as follows:
Code Block rdd = sc.textFile("test.txt")
Note that you can use the type() command to verify that rdd is indeed a PySpark RDD.
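For instance, assuming you are in a PySpark session where sc is the SparkContext, the check might look like this:
Code Block
type(rdd)    # should report a PySpark RDD type, e.g. <class 'pyspark.rdd.RDD'>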
Count the number of lines in the rdd:
Code Block lines = rdd.count()
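For the four-line example file above, the result should be 4:
Code Block
print(lines)    # prints 4 for the four-line example file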
Now you can use the split() and flatMap() functions to count the number of individual words:
Code Block words = rdd.flatMap(lambda x: x.split()).count()
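For the example file, words should be 16 (four lines of four words each). As a further sketch that goes beyond the original example, the same pattern can be extended with map() and reduceByKey() to tally how often each individual word appears:
Code Block
# Pair each word with 1, then sum the counts per word
word_counts = (rdd.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b))
print(word_counts.collect())    # e.g. [('This', 4), ('is', 4), ('line', 4), ('one', 1), ...]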
For a more detailed overview of Python and how it relates to Big Data or High Performance Computing (HPC), please contact tts-research@tufts.edu for information regarding future workshops.
...