PySpark

The Tufts High Performance Compute (HPC) cluster delivers 35,845,920 CPU hours and 59,427,840 GPU hours of free compute time per year to the user community.

Teraflops: 60+ (60+ trillion floating point operations per second)
CPU: 4000 cores
GPU: 6784 cores
Interconnect: 40GB low-latency Ethernet

For additional information, please contact Research Technology Services at tts-research@tufts.edu


PySpark

PySpark is the Python interface to Apache Spark, a fast, general-purpose, open-source cluster computing framework. Spark provides programmers with an interface centered on the Resilient Distributed Dataset (RDD), a data structure that is distributed over a cluster of computers and maintained in a fault-tolerant way.
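As a quick illustration, here is a minimal sketch of working with an RDD in an interactive PySpark session, assuming the shell has already created the SparkContext object sc (as the pyspark shell does automatically):

    # sc is the SparkContext that the pyspark shell creates automatically
    nums = sc.parallelize([1, 2, 3, 4])    # distribute a small Python list as an RDD
    squares = nums.map(lambda x: x * x)    # transformations are lazy; nothing runs yet
    squares.collect()                      # action: returns [1, 4, 9, 16]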

For more information about Spark and PySpark, you can visit the following resources:

https://en.wikipedia.org/wiki/Apache_Spark

https://spark.apache.org/docs/latest/index.html

Getting Started with PySpark

You can access and start using PySpark with the following steps:

  1. Connect to the Tufts High Performance Compute Cluster. See Access for a detailed guide.

  2. Load the Spark module with the following command:

    module load spark

    Note that you can see a list of all available modules (potentially including different versions of Spark) by typing:

    module avail

    You can load a specific version of Spark with the module load command or use the generic module name (spark) to load the latest version.
      

  3. Start an interactive PySpark session by typing (a non-interactive alternative is sketched after these steps):

    pyspark
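
The interactive shell is convenient for experimentation, but the Spark distribution also includes spark-submit for running a script as a batch job. The exact setup may vary with how the module is configured on the cluster; the following is a minimal sketch, where myjob.py is a hypothetical script name:

    # myjob.py -- hypothetical standalone PySpark script
    from pyspark import SparkContext

    sc = SparkContext(appName="MyJob")    # in a script, you create the context yourself
    rdd = sc.parallelize(range(100))
    print(rdd.sum())                      # 4950
    sc.stop()

The script could then be run with:

    spark-submit myjob.py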

A Simple Test

To make sure that Spark and PySpark are working properly, let's load an RDD and perform a few simple operations on it. In this example, we will create a text file with a few lines and use PySpark to count both the number of lines and the number of words.

  1. Create a new text file in your home directory on the cluster using nano (or your favorite text editor):

    nano sparktest.txt
  2. Put a few lines of text into the file and save it, for example:

    This is line one
    This is line two
    This is line three
    This is line four
  3. Load the file into an RDD as follows:

    rdd = sc.textFile("sparktest.txt")

    Note that you can use the type() function to verify that rdd is indeed a PySpark RDD.

  4. Count the number of lines in the RDD:

    lines = rdd.count()
  5. Now you can use the split() and flatMap() functions to count the number of individual words:

    words = rdd.flatMap(lambda x: x.split()).count()
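
Putting the steps above together, the whole test looks like this in a single PySpark session (assuming sparktest.txt is in the directory where you started pyspark). With the four-line example file above, the counts come out to 4 lines and 16 words:

    rdd = sc.textFile("sparktest.txt")                   # RDD with one element per line
    lines = rdd.count()                                  # 4 with the example file above
    words = rdd.flatMap(lambda x: x.split()).count()     # 16 with the example file above
    print(lines, words)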

  

For a more detailed overview of Spark (and Big Data in general) with examples, you can view the slides from the recent XSEDE Big Data workshop. Please contact tts-research@tufts.edu for information regarding future workshops.