The Tufts High Performance Compute (HPC) cluster delivers 35,845,920 CPU hours and 59,427,840 GPU hours of free compute time per year to the user community.

Teraflops: 60+ (60+ trillion floating point operations per second)
CPU: 4,000 cores
GPU: 6,784 cores
Interconnect: 40Gb low-latency Ethernet

For additional information, please contact Research Technology Services at tts-research@tufts.edu



PySpark is the Python interface to Apache Spark, a powerful open-source cluster computing framework. Spark is a fast, general-purpose cluster computing system that provides programmers with an interface centered on the Resilient Distributed Dataset (RDD). The RDD is a data structure that is distributed over a cluster of computers and maintained in a fault-tolerant way.
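
For example, inside an interactive PySpark session (where the SparkContext is already available as sc), a Python list can be distributed as an RDD and operated on in parallel. The following is a minimal illustrative sketch; the variable names are arbitrary:

    nums = sc.parallelize([1, 2, 3, 4, 5])      # distribute a local list as an RDD
    squares = nums.map(lambda x: x * x)          # transformations are lazy
    total = squares.reduce(lambda a, b: a + b)   # actions trigger the computation
    print(total)                                 # prints 55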

For more information about Spark and PySpark, you can visit the following resources:

https://en.wikipedia.org/wiki/Apache_Spark

https://spark.apache.org/docs/latest/index.html

Getting Started with PySpark

You can access and start using PySpark with the following steps:

  1. Connect to the Tufts High Performance Compute Cluster. See Connecting for a detailed guide.

  2. Load the Spark module with the following command:

    module load spark

    Note that you can see a list of all available modules (potentially including different versions of Spark) by typing:

    module avail

    You can load a specific version of Spark with the module load command or use the generic module name (spark) to load the latest version; a combined sketch of steps 2 and 3 appears after this list.
      

  3. Start a PySpark session by typing:

    pyspark
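
Taken together, steps 2 and 3 look like the following at a shell prompt on the cluster. The version number shown in the commented line is hypothetical; run module avail to see which versions are actually installed:

    module avail                   # list all available modules
    module load spark              # load the default Spark module
    # module load spark/2.1.0      # or pin a specific version (hypothetical version number)
    pyspark                        # start the interactive PySpark shell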

A Simple Test

To make sure that Spark and PySpark are working properly, let's load an RDD and perform a simple operation on it. In this example, we will create a text file with a few lines and use PySpark to count both the number of lines and the number of words; a combined transcript of these steps appears after the numbered list.

  1. Create a new text file in your home directory on the cluster using nano (or your favorite text editor):

    nano sparktest.txt
  2. Put a few lines of text into the file and save it, for example:

    This is line one
    This is line two
    This is line three
    This is line four
  3. Load the file into an RDD as follows:

    rdd = sc.textFile("sparktest.txt")

    Note that you can use the type() command to verify that rdd is indeed a PySpark RDD.

  4. Count the number of lines in the RDD:

    lines = rdd.count()
  5. Now you can use the split() and flatMap() functions to count the number of individual words:

    words = rdd.flatMap(lambda x: x.split()).count()
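
Putting the steps together, a complete transcript of this test inside the pyspark shell (where sc is predefined) looks roughly like the following; the counts in the comments assume the four-line sample file created above:

    rdd = sc.textFile("sparktest.txt")                   # load the file as an RDD of lines
    type(rdd)                                            # <class 'pyspark.rdd.RDD'>
    lines = rdd.count()                                  # 4 lines
    words = rdd.flatMap(lambda x: x.split()).count()     # 16 words
    print(lines, words)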

  

For a more detailed overview of Spark (and Big Data in general) with examples, you can view the slides from the recent XSEDE Big Data workshop. Please contact tts-research@tufts.edu for information regarding future workshops.

 

 

 

 
