PySpark is the Python interface to Apache Spark, an open-source framework for fast, general-purpose cluster computing. Spark's programming model centers on the Resilient Distributed Dataset (RDD), a collection of data that is partitioned across the machines of a cluster and maintained in a fault-tolerant way.
For more information about Spark and PySpark, see:
https://en.wikipedia.org/wiki/Apache_Spark
https://spark.apache.org/docs/latest/index.html
Getting Started with PySpark
You can access and start using PySpark with the following steps:
- Connect to the Tufts High Performance Compute Cluster. See Connecting for a detailed guide.
- Load the Spark module by typing:
module load spark
Note that you can list all available modules (including any additional versions of Spark) by typing:
module avail
You can load a specific version of Spark with the module load command, or use the generic module name (spark) to load the latest version.
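For example, if the module listing shows a versioned entry such as spark/3.5.0 (a hypothetical version here; the versions actually installed on the cluster may differ), you can load it explicitly:
module load spark/3.5.0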
- Start PySpark by typing:
pyspark
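Once the shell starts, it provides a ready-made SparkContext named sc (and, in Spark 2.0 and later, a SparkSession named spark). As a minimal sketch of working with an RDD, you could run something like the following inside the pyspark shell:
# Distribute a small Python list across the cluster as an RDD
rdd = sc.parallelize(range(100))
# Transformations (e.g., filter) are lazy; actions (e.g., count, sum) trigger computation
evens = rdd.filter(lambda x: x % 2 == 0)
print(evens.count())   # 50
print(evens.sum())     # 2450
To leave the shell, type exit() or press Ctrl-D.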