PySpark is the Python interface to Apache Spark, a powerful open source cluster computing framework. Spark is a fast and general-purpose cluster computing system that provides programmers with an interface centered on the Resilient Distributed Dataset (RDD). The RDD is a data structure that is distributed over a cluster of computers and is maintained in a fault-tolerant way.
For more information about Spark and PySpark, you can visit the following resources:
https://en.wikipedia.org/wiki/Apache_Spark
https://spark.apache.org/docs/latest/index.html
Getting Started with PySpark
You can access and start using PySpark with the following steps:
- Connect to the Tufts High Performance Compute Cluster. See Connecting for a detailed guide.
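For example, from a terminal you can connect with ssh (the hostname below is illustrative; see the Connecting guide for the current login address and your username):
Code Block ssh your_utln@login.cluster.tufts.edu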
Load the Spark module with the following command:
Code Block module load spark
Note that you can see a list of all available modules (potentially including different versions of Spark) by typing:
Code Block module avail
You can specify a specific version of Spark with the module load command or use the generic module name (spark) to load the latest version.
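For instance, if module avail lists a versioned module such as spark/2.1.0 (the version number here is hypothetical), you can load that version explicitly:
Code Block module load spark/2.1.0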
Start PySpark by typing:
Code Block pyspark
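Once the shell starts, you will be at an interactive Python prompt. PySpark automatically creates a SparkContext named sc, which is used below to load data; you can confirm that it is defined by typing:
Code Block sc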
A Simple Test
To make sure that Spark and PySpark are working properly, let's load an RDD and perform a simple function on it. In this example, we will create a text file with a few lines and use PySpark to count both the number of lines and the number of words.
Create a new text file in your home directory on the cluster using nano (or your favorite text editor):
Code Block nano sparktest.txt
Put a few lines of text into the file and save it, for example:
Code Block This is line one
This is line two
This is line three
This is line four
Load the file into an RDD as follows:
Code Block rdd = sc.textFile("sparktest.txt")
Note that you can use the type() command to verify that rdd is indeed a PySpark RDD.
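For example:
Code Block type(rdd)
This should report a class from the pyspark.rdd module, such as <class 'pyspark.rdd.RDD'> (the exact class name may vary by Spark version).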
Count the number of lines in the rdd:
Code Block lines = rdd.count()
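With the four-line example file above, lines will be 4. You can display it with:
Code Block print(lines)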
Now you can use the split() and flatMap() functions to count the number of individual words:
Code Block words = rdd.flatMap(lambda x: x.split()).count()
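Here split() breaks each line into a list of words, flatMap() flattens those lists into a single RDD of words, and count() returns the total. You can inspect the first few elements to see how the transformation works:
Code Block rdd.flatMap(lambda x: x.split()).take(4)
With the example file above, this should return ['This', 'is', 'line', 'one'], and words will be 16 (four words on each of the four lines).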
...
For a more detailed overview of Spark (and Big Data in general) with examples, you can view the slides from the recent XSEDE Big Data workshop. Please contact tts-research@tufts.edu for information regarding future workshops.
...