This page collects exercises, most of which were presented or at least sketched in class.
Count Unusual Values
The problem is to count how many values in the eegsample dataset are not within two standard deviations of the mean. A program for this was developed in class; it consists essentially of the following steps.
rawRDD = sc.textFile("/user/hadoop/eegsample")
wordsRDD = rawRDD.flatMap(lambda line: line.split())
numRDD = wordsRDD.map(lambda word: float(word))
mu = numRDD.mean()
sigma = numRDD.stdev()
wantedRDD = numRDD.filter(lambda e: e < mu - 2*sigma or e > mu + 2*sigma)
print(wantedRDD.count())
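The filter logic can be checked outside Spark on a small in-memory list. This is just a plain-Python sketch of the same computation (the sample values here are made up for illustration); note that Spark's stdev() is the population standard deviation, which corresponds to statistics.pstdev.

```python
import statistics

# Made-up sample values; 50.0 is an obvious outlier.
values = [1.0, 2.0, 2.5, 3.0, 2.0, 50.0, 2.2, 1.8, 2.1, 2.4]

mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population stdev, like RDD.stdev()

# Keep only values more than two standard deviations from the mean.
unusual = [e for e in values if e < mu - 2*sigma or e > mu + 2*sigma]
print(len(unusual))  # the single outlier, 50.0
```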
This code took around 30 seconds to run. To get something faster, sampling was suggested: changing the initial line to the following made the program run much faster, since it processes a sample of just 1% of the dataset.
rawRDD = sc.textFile("/user/hadoop/eegsample").sample(False,0.01)
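The call sample(False, 0.01) performs Bernoulli sampling without replacement: each element is kept independently with probability 0.01, so the sample size is roughly 1% of the dataset, not exactly 1%. A plain-Python illustration of this behavior (the function name and data here are invented for the sketch):

```python
import random

def bernoulli_sample(data, fraction, seed=42):
    # Mimic RDD.sample(False, fraction): keep each element
    # independently with probability `fraction`.
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]

data = list(range(100_000))
sample = bernoulli_sample(data, 0.01)
print(len(sample))  # close to 1,000, but not exactly
```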
Another demonstration for this problem put all the code into a file named wanted.py, along with a few initial lines to create the Spark Context variable (sc). This file was copied to the practice machine using the scp command, and then the following command was demonstrated:
$ spark-submit wanted.py 2> log > hold
This command runs wanted.py using spark-submit, but redirects the many informational (and error) messages to a file named "log" while the printed output goes to a file named "hold". This can be handy because scp can then be used to copy log and/or hold back to your local computer for study.
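A minimal sketch of what wanted.py might contain, assuming the standard way of creating a Spark Context in a script (the interactive pyspark shell provides sc automatically, but a script run with spark-submit must create it; the appName string here is an invented example):

```python
from pyspark import SparkContext

# In a spark-submit script, sc must be created explicitly.
sc = SparkContext(appName="wanted")

rawRDD = sc.textFile("/user/hadoop/eegsample")
wordsRDD = rawRDD.flatMap(lambda line: line.split())
numRDD = wordsRDD.map(lambda word: float(word))
mu = numRDD.mean()
sigma = numRDD.stdev()
wantedRDD = numRDD.filter(lambda e: e < mu - 2*sigma or e > mu + 2*sigma)
print(wantedRDD.count())

sc.stop()
```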