Contents
Hadoop is a package or framework, freely available, to bring some ideas used in the infrastructure of cloud computing onto a small cluster of machines. The problem with researching this topic is the very large amount of available information. You may wish to start with the Wikipedia entry, then visit the Apache Page, and look at some presentations. There are many presentations captured on video and mp3, which you easily search out on the web. The important thing is to manage your time, and focus your presentation or report on the basic ideas.
Example 1: Calculate Pi Approximation
The example here is a terrible way to approximate Pi: throw random points in a square, and see how many fall inside the largest circle that fits into the square (i.e., twice the circle's radius is the length of the square's edge). The proportion of points that end up in the circle give a quick approximation to Pi, using the ratio of the formula for a circle's area to the area of the square. To make it parallel, divide up the square into vertical strips and let mappers throw random points inside of the strips and see if they fall with the circle. The reducer adds up the mapper results to calculate Pi (there's just one reducer).
Programs
Here's a simple example, showing how to calculate Pi. Here's the mapper:
1 #!/usr/bin/env python 2 3 import sys 4 import random 5 6 M = 10 # number of times to test within an interval 7 8 def sliv(a,b): 9 # a is the start of an interval and b is the end 10 # of that interval, i.e., the interval is [a,b) 11 x = random.uniform(a,b) 12 y = random.uniform(-1.0,1.0) 13 if x*x + y*y <= 1.0: return True 14 return False 15 16 # one mapper gets a chunk of lines 17 for line in sys.stdin: 18 line = line.strip() 19 f = line.split() 20 if len(f) != 2: continue 21 s,t = float(f[0]),float(f[1]) 22 # process one interval M times 23 i = 0 24 p = 0 25 while i<M: 26 i += 1 27 if sliv(s,t): p += 1 28 print "1\t(%d,%d,%f,%f)" % (p, M, s, t)
I saved this as mapper.py, and made the file executable with the chmod command. And here's the reducer:
1 #!/usr/bin/env python 2 3 import sys 4 5 # reducer gets all lines 6 S = 0 7 C = 0 8 for line in sys.stdin: 9 line = line.strip() 10 f = line.split() 11 if len(f) != 2: continue 12 f = f[1].replace("(","") 13 f = f.replace(")","") 14 f = f.split(",") 15 if len(f) != 4: continue 16 C += 1 17 p = int(f[0]) # should be count for one interval 18 M = int(f[1]) # number of trials for that interval 19 S += p 20 21 # at end, S is count of hits within circle 22 # and C is number of intervals, and 23 # M is number of trials per interval 24 p = float(S)/float(C*M) 25 p *= 4.0 26 27 print "Pi is approximately", p
This was saved as reducer.py and also marked executable. One more program was needed to create an input file for the mappers to consume:
1 # Create a list of N intervals subdividing [-1.0,1] 2 N = 1e5 # change as you feel appropriate 3 n = 2.0 / float(N) # size of each interval 4 s = -1.0 5 i = 0 6 while i<N: 7 i += 1 8 t = s + n 9 print s,t 10 s = t
This was put into a MakeIntvl.py file.
Testing and Executing on Hadoop
Then, I made a file called seqTest containing the following line:
python MakeIntvl.py | python mapper.py | sort | python reducer.py
At the Unix command prompt, the command sh seqTest runs this, as a non-Hadoop, simple test, for debugging and timing purposes.
But, finally, we need to run it on Hadoop. After many manual trials, I put the following into a file called hadoopTest, and marked that executable:
#!/bin/sh # setup paths etc export PATH=$PATH:/usr/local/hadoop/bin # # First, setup the input file and copy it to HDFS, # and also clean up any old output directory # rm approxPi.txt rm intervals.txt hadoop dfs -rm intervals.txt python MakeIntvl.py > intervals.txt hadoop dfs -copyFromLocal /home/therman/picalc/intervals.txt intervals.txt hadoop dfs -rmr pi # # Now run the MapReduce job # hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.18.3-streaming.jar \ -mapper mapper.py \ -file mapper.py \ -reducer reducer.py \ -file reducer.py \ -input intervals.txt \ -output pi \ -numReduceTasks 1 # # Copy the output to local directory and clean up # hadoop dfs -copyToLocal pi/part-00000 /home/therman/picalc/approxPi.txt rm intervals.txt hadoop dfs -rm intervals.txt
Now, just by running the command "./hadoopTest" the MapReduce job runs.
Example 2: Calculate Better Pi Approximation
This program takes a standard "BPP" formula for Pi and implements that using Python's "decimal" module, which allows any specified precision. To assign mappers, a first program generates numbers (0, 1, 2, ...) and puts these in a file: each mapper gets some integers which specify what terms in the formula should be computed. Here again, there is one reducer to add up mapper results, though the overhead of Python's decimal module is significant, and likely things would be faster using multiple reducers.
Programs
Here's the program to make input for mappers:
1 # Create a list of N terms 2 N = 1e3 # change as you feel appropriate 3 i = 0 4 while i<N: 5 print i, i 6 i += 1
And here's the mapper.py program:
1 #!/usr/bin/env python 2 import sys 3 from decimal import * 4 getcontext().prec = 1000 # specify 1000 digits of precision 5 # one mapper gets a chunk of lines 6 for line in sys.stdin: 7 line = line.strip() 8 f = line.split() 9 if len(f) != 2: continue 10 s = int(f[0]) 11 # 12 # The following is a BBP formula for a term of pi 13 # 14 s8k = "%d" % (8*s) 15 s8k = Decimal(8*s) 16 t = Decimal("4.0")/(s8k + Decimal("1.0")) 17 t = t - (Decimal("2.0")/(s8k + Decimal("4.0"))) 18 t = t - (Decimal("1.0")/(s8k + Decimal("5.0"))) 19 t = t - (Decimal("1.0")/(s8k + Decimal("6.0"))) 20 t = t * (Decimal("16.0") ** Decimal("%d" % -s)) 21 print "%d\t%s" % (s,t)
The reducer just adds up the terms:
1 #!/usr/bin/env python 2 import sys 3 from decimal import * 4 getcontext().prec = 1000 # specify 1000 digits of precision 5 # reducer gets all lines 6 S = Decimal("0.0") 7 for line in sys.stdin: 8 line = line.strip() 9 f = line.split() 10 if len(f) != 2: continue 11 p = Decimal(f[1]) 12 S = S + p 13 print "Pi is approximately %s" , S
For testing, I also entered a thousand digits of Pi into the program to verify that the result was accurate -- it works.
Of course, I used almost the same script (hadoopTest) from Example 1 above, to execute the MapReduce job.
Example 3: MapReduce Tutorial
This is an adaptation of the Hadoop tutorial on MapReduce (here). Here's how to run it on our Hadoop cluster:
Grab a copy of the word count java program: WordCount.java
Save the file "demjava": demjava.txt (save it as "demjava"); it has the following script/commands:
1 #!/bin/sh 2 # NOTE: because of Java version mixup in the Linux/Hadoop 3 # installation, I am using "sjavac" and "sjar" below instead 4 # of using "javac" and "jar" as the tutorial would recommend. 5 # 6 sjavac=/usr/lib/jvm/java-6-sun/bin/javac 7 sjar=/usr/lib/jvm/java-6-sun/bin/jar 8 9 # FIRST step: compile 10 # destroy/create a directory for the java classes 11 rm -rf wordcount_classes 12 mkdir wordcount_classes 13 $sjavac -classpath /usr/local/hadoop/hadoop-0.18.3-core.jar -d wordcount_classes WordCount.java 14 # now make a jar file from the directory 15 $sjar -cvf wordcount.jar -C wordcount_classes/ . 16 17 # SECOND step: create input/remove output for job 18 hadoop dfs -rmr input 19 hadoop dfs -mkdir input 20 hadoop dfs -copyFromLocal wssnt10.txt input/wssnt10.txt 21 hadoop dfs -rmr output 22 23 # THIRD step: run job 24 hadoop jar wordcount.jar org.myorg.WordCount input output
- Do "chmod +x demjava" to make it executable.
- Find some big text file and name it "wssnt10.txt" (I searched for a text file containing all of Shakespeare's sonnets, then used the "scp" command to copy it over to the neighbor cluster).
- Run it, "./demjava" and it should do the mapreduce job.
