Students will present several ideas about how the MapReduce paradigm can be programmed, and how multiple MapReduce instances can be controlled and related to one another (Pig, Sawzall, etc). As another example in this genre, this page briefly covers the Cascading Project (see the Cascading Website for more detail).

Cascading is implemented on top of Hadoop, so we could actually install and try it in our lab, though it would likely be overkill for our purposes. One interesting aspect of Cascading (and of other general planners or orchestrators of MapReduce instances) is that a simple FIFO queue, the partition and shuffle/sort step of MapReduce, and even database operations are all treated as generalized pipes, in the sense of Pipes (computing): the pipes found in Unix scripts, which pass data from the standard output of one command to the standard input of the next. In the Cascading generalization, tuples are not merely passed from one end to the other: they can be grouped by key (which is what the MapReduce partition/sort phase does), filtered, transformed, and more. Such a pipe can even perform Join (SQL) operations on the tuples that pass through it.
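
To make the idea of a generalized pipe concrete, here is a minimal sketch in plain Python. It is not the Cascading API and does not use Hadoop: the helper names (each_filter, each_map, group_by, inner_join) and the sample data are hypothetical, chosen only to illustrate how tuples flowing through a pipe can be filtered, transformed, grouped by key, and joined. Everything runs in memory on one machine.

```python
from collections import defaultdict

# NOTE: illustrative helpers only; these names are NOT part of Cascading.

def each_filter(pipe, keep):
    """Pass along only the tuples for which keep(t) is true."""
    return (t for t in pipe if keep(t))

def each_map(pipe, fn):
    """Transform every tuple that flows through the pipe."""
    return (fn(t) for t in pipe)

def group_by(pipe, key):
    """Group tuples by key, the role played by MapReduce partition/sort."""
    groups = defaultdict(list)
    for t in pipe:
        groups[key(t)].append(t)
    for k in sorted(groups):
        yield k, groups[k]

def inner_join(left, right, left_key, right_key):
    """SQL-style inner join of two tuple streams on matching keys."""
    index = defaultdict(list)
    for r in right:
        index[right_key(r)].append(r)
    for l in left:
        for r in index.get(left_key(l), ()):
            yield l + r

# Hypothetical input: two lines of text and a small lookup table.
lines = ["the pipe feeds the tap", "a tap and a pipe"]
kinds = [("pipe", "noun"), ("tap", "noun"), ("flows", "verb")]
stop_words = {"the", "a", "and"}

# Source -> transform -> filter -> group -> aggregate, like a Unix pipeline.
words = (w for line in lines for w in line.split())
pairs = each_map(words, lambda w: (w.lower(), 1))
kept = each_filter(pairs, lambda t: t[0] not in stop_words)
counts = [(k, sum(n for _, n in ts))
          for k, ts in group_by(kept, key=lambda t: t[0])]

# A pipe can also join the tuples passing through it with another stream.
joined = list(inner_join(counts, kinds,
                         left_key=lambda t: t[0],
                         right_key=lambda t: t[0]))

print(counts)  # [('feeds', 1), ('pipe', 2), ('tap', 2)]
print(joined)  # [('pipe', 2, 'pipe', 'noun'), ('tap', 2, 'tap', 'noun')]
```

Each stage takes a stream of tuples and returns a stream of tuples, so stages compose the way commands do in a Unix pipeline. In a real system the grouping and join steps would presumably be planned onto one or more MapReduce jobs rather than held in memory as they are here.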

The attached presentation is extracted from the Cascading User Guide.

Cascading (last edited 2014-05-25 18:20:09 by localhost)