The Tufts High Performance Compute (HPC) cluster delivers 35,845,920 CPU hours and 59,427,840 GPU hours of free compute time per year to the user community.

Teraflops: 60+ (more than 60 trillion floating point operations per second)
CPU: 4,000 cores
GPU: 6,784 cores
Interconnect: 40Gb low-latency Ethernet

For additional information, please contact Research Technology Services at tts-research@tufts.edu



PySpark is the Python interface to Apache Spark, a fast, general-purpose, open-source cluster computing framework. Spark's programming interface centers on the Resilient Distributed Dataset (RDD), a data structure that is distributed across a cluster of machines and maintained in a fault-tolerant way.
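As an informal illustration of the RDD model (a minimal sketch, not taken from the official documentation; the application name is arbitrary), the following standalone script distributes a Python list across the cluster, transforms it, and reduces it to a single value:

    from pyspark import SparkContext

    # Create a SparkContext (inside the pyspark shell one already exists as sc)
    sc = SparkContext(appName="rdd-sketch")

    # Distribute a local Python collection across the cluster as an RDD
    numbers = sc.parallelize(range(1, 101))

    # Transformations such as map() are lazy; actions such as reduce() trigger computation
    total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)

    print(total)  # 338350, the sum of the first 100 squares

    sc.stop()

A script like this can be run with spark-submit, or the same calls (minus the SparkContext creation) can be entered interactively in the pyspark shell described in the steps below.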

For more information about Spark and PySpark, you can visit the following resources:

https://en.wikipedia.org/wiki/Apache_Spark

https://spark.apache.org/docs/latest/index.html

Getting Started with PySpark

You can access and start using PySpark with the following steps:

  1. Connect to the Tufts High Performance Compute Cluster. See Connecting for a detailed guide.

  2. Load the Spark module with the following command:

    module load spark

    Note that you can see a list of all available modules (potentially including different versions of Spark) by typing:

    module avail

    You can request a particular version of Spark with the module load command, or use the generic module name (spark) to load the latest version.
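    For example, if module avail lists a versioned Spark module, you can load it explicitly (the version number below is only illustrative; check the actual module avail output on the cluster):

    module load spark/2.1.0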

  3. Start a PySpark session by typing:

    pyspark
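
    This launches an interactive Python shell with Spark already initialized: a SparkContext is available as the variable sc (newer Spark versions also expose a SparkSession as spark). As a quick, informal sanity check (a minimal sketch, not an official test), you can try:

    >>> rdd = sc.parallelize(range(1000))
    >>> rdd.count()
    1000
    >>> exit()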