Hadoop is a freely available framework that brings some of the ideas used in cloud-computing infrastructure onto a small cluster of machines. The difficulty in researching this topic is the very large amount of available information. You may wish to start with the Wikipedia entry, then visit the Apache page and look at some presentations. Many presentations are captured on video and mp3, which you can easily find on the web. The important thing is to manage your time and focus your presentation or report on the basic ideas.

Example 1: Calculate Pi Approximation

The example here is a terrible way to approximate Pi: throw random points into a square and see how many fall inside the largest circle that fits into the square (i.e., the circle's diameter equals the length of the square's edge). The proportion of points that end up in the circle gives a quick approximation to Pi: the circle's area is pi*r^2 and the square's is (2r)^2 = 4r^2, so the fraction of points landing inside the circle approximates pi/4, and multiplying by 4 estimates Pi. To make it parallel, divide the square into vertical strips and let mappers throw random points inside the strips, checking whether they fall within the circle. The reducer adds up the mapper results to calculate Pi (there's just one reducer).

Programs

Here's a simple example showing how to calculate Pi. First, the mapper:
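The original listing isn't reproduced on this page; the following is a minimal sketch of a Hadoop Streaming mapper along the lines described, where the input format (one "x_lo x_hi npoints" strip description per line) is an assumption to keep the sketch self-contained:

    #!/usr/bin/env python
    # Sketch of a Streaming mapper for the Monte Carlo Pi estimate.
    # Assumed input: one vertical strip per line, "x_lo x_hi npoints",
    # describing a slice of the square [-1,1] x [-1,1].
    import random
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) != 3:
            continue
        x_lo, x_hi = float(fields[0]), float(fields[1])
        npoints = int(fields[2])
        inside = 0
        for _ in range(npoints):
            x = random.uniform(x_lo, x_hi)
            y = random.uniform(-1.0, 1.0)
            if x * x + y * y <= 1.0:
                inside += 1
        # Streaming expects "key<TAB>value" lines on stdout
        print("counts\t%d %d" % (inside, npoints))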

I saved this as mapper.py, and made the file executable with the chmod command. And here's the reducer:
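Again a sketch rather than the original listing; it assumes the "counts<TAB>inside total" output format used in the mapper sketch above and simply totals the counts:

    #!/usr/bin/env python
    # Sketch of the reducer: sum the per-strip counts and report Pi.
    import sys

    inside_total = 0
    points_total = 0
    for line in sys.stdin:
        try:
            key, value = line.split("\t", 1)
            inside, npoints = value.split()
        except ValueError:
            continue
        inside_total += int(inside)
        points_total += int(npoints)
    if points_total > 0:
        # fraction inside the circle ~ pi/4, so multiply by 4
        print(4.0 * inside_total / points_total)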

This was saved as reducer.py and also marked executable. One more program was needed to create an input file for the mappers to consume:
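A sketch of what such a generator might look like; the number of strips and points per strip are illustrative values, not taken from the original:

    #!/usr/bin/env python
    # Sketch of MakeIntvl.py: emit one "x_lo x_hi npoints" line per
    # vertical strip of the square [-1,1] x [-1,1]; redirect stdout
    # to a file to create the mappers' input.
    nstrips = 10
    points_per_strip = 100000
    width = 2.0 / nstrips
    for i in range(nstrips):
        x_lo = -1.0 + i * width
        print("%f %f %d" % (x_lo, x_lo + width, points_per_strip))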

This was put into a MakeIntvl.py file.

Testing and Executing on Hadoop

Then, I made a file called seqTest containing the following line:
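The exact line isn't shown here; a plausible reconstruction simply pipes the three programs together, standing in for the map, sort, and reduce stages:

    python MakeIntvl.py | ./mapper.py | sort | ./reducer.py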

At the Unix command prompt, the command "sh seqTest" runs this as a simple, non-Hadoop test for debugging and timing purposes.

But finally, we need to run it on Hadoop. After many manual trials, I put the following into a file called hadoopTest and marked it executable:
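The original script isn't reproduced here; below is a sketch of a Streaming job along these lines. The streaming jar path and the HDFS directory names are assumptions, based on the Hadoop 0.18.3 installation paths that appear in Example 3:

    #!/bin/sh
    # Sketch of hadoopTest: run mapper.py/reducer.py as a Streaming job.
    # The streaming jar path below is an assumption.
    STREAMJAR=/usr/local/hadoop/contrib/streaming/hadoop-0.18.3-streaming.jar

    # create the input file and push it into HDFS
    python MakeIntvl.py > intervals.txt
    hadoop dfs -rmr pi_input
    hadoop dfs -mkdir pi_input
    hadoop dfs -copyFromLocal intervals.txt pi_input/intervals.txt
    hadoop dfs -rmr pi_output

    # run the Streaming job with a single reducer
    hadoop jar $STREAMJAR \
        -input pi_input -output pi_output \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py \
        -jobconf mapred.reduce.tasks=1

    # show the result
    hadoop dfs -cat 'pi_output/part-*'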

Now, just running the command "./hadoopTest" starts the MapReduce job.

Example 2: Calculate Better Pi Approximation

This program takes the standard "BBP" (Bailey-Borwein-Plouffe) formula for Pi and implements it using Python's "decimal" module, which allows any specified precision. To assign work to mappers, a first program generates numbers (0, 1, 2, ...) and puts these in a file: each mapper gets some integers specifying which terms of the formula it should compute. Here again, there is one reducer to add up the mapper results, though the overhead of Python's decimal module is significant, and things would likely be faster with multiple reducers.

Programs

Here's the program to make input for mappers:
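A sketch of such a generator (the number of terms is an illustrative value); redirecting its output to a file gives the mappers their input:

    #!/usr/bin/env python
    # Sketch: write the term indices 0, 1, 2, ... one per line, so each
    # mapper receives some subset of BBP terms to compute.
    nterms = 1000
    for k in range(nterms):
        print(k)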

And here's the mapper.py program:
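A sketch of such a mapper: each term of the BBP series is (1/16^k) * (4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6)), and the precision setting below is an illustrative value (enough for the thousand-digit check mentioned later):

    #!/usr/bin/env python
    # Sketch of the Example 2 mapper: compute one BBP term per input index.
    import sys
    from decimal import Decimal, getcontext

    getcontext().prec = 1010   # illustrative precision, in decimal digits

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        k = int(line)
        term = (Decimal(4) / (8 * k + 1)
                - Decimal(2) / (8 * k + 4)
                - Decimal(1) / (8 * k + 5)
                - Decimal(1) / (8 * k + 6)) / Decimal(16) ** k
        print("term\t" + str(term))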

The reducer just adds up the terms:
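A sketch of the reducer, assuming the "term<TAB>value" lines produced by the mapper sketch above; its precision setting must match the mapper's:

    #!/usr/bin/env python
    # Sketch of the Example 2 reducer: add up all the BBP terms.
    import sys
    from decimal import Decimal, getcontext

    getcontext().prec = 1010   # keep in step with the mapper's precision

    total = Decimal(0)
    for line in sys.stdin:
        try:
            key, value = line.split("\t", 1)
            total += Decimal(value.strip())
        except (ValueError, ArithmeticError):
            continue
    print(total)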

For testing, I also entered a thousand digits of Pi into the program to verify that the result was accurate -- it works.

Of course, I used almost the same script (hadoopTest) from Example 1 above to execute the MapReduce job.

Example 3: MapReduce Tutorial

This is an adaptation of the Hadoop tutorial on MapReduce (here). Here's how to run it on our Hadoop cluster:

  1. Grab a copy of the word count java program: WordCount.java

  2. Save the file demjava.txt as "demjava"; it has the following script/commands:

       #!/bin/sh
       # NOTE:  because of Java version mixup in the Linux/Hadoop
       # installation, I am using "sjavac" and "sjar" below instead
       # of using "javac" and "jar" as the tutorial would recommend.
       #
       sjavac=/usr/lib/jvm/java-6-sun/bin/javac
       sjar=/usr/lib/jvm/java-6-sun/bin/jar

       # FIRST step:  compile
       # destroy/create a directory for the java classes
       rm -rf wordcount_classes
       mkdir wordcount_classes
       $sjavac -classpath /usr/local/hadoop/hadoop-0.18.3-core.jar -d wordcount_classes WordCount.java
       # now make a jar file from the directory
       $sjar -cvf wordcount.jar -C wordcount_classes/ .

       # SECOND step:  create input/remove output for job
       hadoop dfs -rmr input
       hadoop dfs -mkdir input
       hadoop dfs -copyFromLocal wssnt10.txt input/wssnt10.txt
       hadoop dfs -rmr output

       # THIRD step:  run job
       hadoop jar wordcount.jar org.myorg.WordCount input output

  3. Do "chmod +x demjava" to make it executable.
  4. Find some big text file and name it "wssnt10.txt" (I searched for a text file containing all of Shakespeare's sonnets, then used the "scp" command to copy it over to the neighbor cluster).
  5. Run it with "./demjava", and it should do the MapReduce job.

More Examples

In Java
