Spark
- Description
- Spark is a data-processing library and cluster management software that allows parallel, in-memory processing of data-processing tasks.
- Modulefile
spark
- Documentation
- Spark homepage
- Usage
- To submit a Spark job, use the following submission script as an example:
#$ -cwd
#$ -j yes
#$ -l h_rt=1:00:0
#$ -pe spark 16
#$ -N spark-test

module list

${SPARK_HOME}/bin/spark-submit --executor-memory 1g --master `cat spark-logs-$JOB_ID/master_url` ./wordcount.py generated.txt wordcounts
Unlike most other software at ACENET, you must load the appropriate environment modules before you submit the job:
$ module load java/8u45 spark
$ qsub spark-test.sh
Note also the use of the "spark" parallel environment. This environment grants only whole nodes, since Spark by default uses all the resources on a given host. You must therefore request cores in multiples of 16 (16, 32, 48, etc.) on the Placentia, Fundy, and Glooscap clusters, and in multiples of 4 (4, 8, 12, etc.) at Mahone. The final part of the spark-submit line,

./wordcount.py generated.txt wordcounts

specifies the Python script and the command-line arguments to be run with Spark. An example wordcount.py file is given below to help get you started with Spark:
from pyspark import SparkContext, SparkConf
import sys
import os

# Input file: first command-line argument, or ./generated.txt by default.
if len(sys.argv) > 1:
    inputFileName = sys.argv[1]
else:
    inputFileName = "./generated.txt"
assert(os.path.exists(inputFileName))

# Output directory: second command-line argument, or ./wordcounts by default.
# Spark will not overwrite an existing directory, so it must not exist yet.
if len(sys.argv) > 2:
    outputDirectoryName = sys.argv[2]
else:
    outputDirectoryName = "./wordcounts"
assert(not os.path.exists(outputDirectoryName))

conf = SparkConf().setAppName("wordCount")
sc = SparkContext(conf=conf)

# Split each line into words, pair each word with a count of 1,
# then sum the counts for each word.
file = sc.textFile(inputFileName)
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(outputDirectoryName)
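For readers new to Spark, the flatMap/map/reduceByKey chain above is simply a distributed word count. The short plain-Python sketch below does the equivalent work in a single process, purely to illustrate the logic; it is not part of the job, and the file name is only an assumption matching the example above.

# Plain-Python illustration of the Spark pipeline above (not part of the job).
# Spark spreads this work across executors; here everything runs in one process.
from collections import defaultdict

counts = defaultdict(int)
with open("generated.txt") as f:                      # assumed local copy of the input
    for line in f:                                    # textFile: one record per line
        for word in line.rstrip("\n").split(" "):     # flatMap: split each line into words
            counts[word] += 1                         # map to (word, 1), then reduceByKey: sum

for pair in counts.items():
    print(pair)                                       # (word, count) pairs, like Spark's output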
The input file generated.txt for the wordcount.py Spark script is a plain ASCII text file consisting of words separated by spaces and spread over multiple lines.
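If you do not already have a suitable input file, a small one can be created with a few lines of Python. The sketch below is only an example; the word list, line count, and the script name generate_input.py are arbitrary choices for testing.

# generate_input.py -- write a simple space-separated, multi-line text file
# for testing wordcount.py. Words and sizes are arbitrary.
import random

words = ["spark", "cluster", "data", "node", "memory", "count"]
with open("generated.txt", "w") as f:
    for _ in range(1000):                             # 1000 lines of 10 words each
        f.write(" ".join(random.choice(words) for _ in range(10)) + "\n")

Running python generate_input.py once before submitting the job will produce generated.txt in the working directory.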