Continue with informal student presentations from 30Apr.pdf. A brief mention of some future directions for tools like MapReduce and Spark: Apache Tez, Apache Flink, Apache Zeppelin (and, for the sake of comparison, an outsider project idea). What we see from all of these is a convergence on general trends: a directed graph of tasks, the need for a distributed file system, wanting to exploit RAM, fault tolerance, and better support for interactive queries and visualization. Bonus link: GraphLab.
A bit about how to work with matrices on Spark (general problem solving, such as how to multiply two matrices), then a little about recommender systems. Recommendation algorithms can be set-based (using the Jaccard index), model-based (Alternating Least Squares), or memory-based (k-Nearest Neighbors with Cosine Similarity). Three student presentations on the basic part of the Final Project, from 30Apr.pdf.
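One way to think about multiplying two matrices in a MapReduce/Spark setting is as a join on the shared index. Below is a minimal pure-Python analogue (the variable names and 2x2 values are illustrative, not from the lecture) of the keyed join-and-reduce one would express with Spark's `join` and `reduceByKey`:

```python
from collections import defaultdict

# Sparse matrices stored as {(row, col): value}, the way one might
# key RDD records in Spark. These 2x2 values are just a demo.
A = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
B = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}

# "Map": key A's entries by column and B's entries by row, so the
# entries that must be multiplied end up under the same join key k.
a_by_k, b_by_k = defaultdict(list), defaultdict(list)
for (i, k), v in A.items():
    a_by_k[k].append((i, v))
for (k, j), v in B.items():
    b_by_k[k].append((j, v))

# "Join" + "reduceByKey": multiply matching pairs, sum into C[(i, j)].
C = defaultdict(float)
for k in a_by_k:
    for i, va in a_by_k[k]:
        for j, vb in b_by_k[k]:
            C[(i, j)] += va * vb

# C now holds A @ B, i.e. [[19, 22], [43, 50]].
```

In Spark the two keyed collections would be RDDs, the nested loops would be a `join` on k, and the final accumulation a `reduceByKey` over (i, j).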
Summary of techniques in scalability: Approximation, Sampling, Fast Algorithms (for streaming), Hierarchical Architectures, trading consistency for availability (e.g. Shard (database architecture)), and aggressive use of in-memory data structures. Illustrative topics in more detail: Bloom Filters and Facebook TAO.
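As a concrete illustration of the Bloom filter idea, here is a minimal sketch (the class name and parameters are made up for the example, and production filters choose m and k from the expected item count and target error rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-slot array.
    May return false positives, but never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k positions by salting a single hash function.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
print(bf.might_contain("alice"))    # True
print(bf.might_contain("mallory"))  # almost certainly False
```

The appeal for scalability is the fixed memory footprint: membership of any number of items is answered from m bits, at the cost of a tunable false-positive rate.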
Walkthrough of Spark architecture, partitions, estimating memory and storage needs for RDDs, including Performance Tuning and Profiling of Spark. Considerations for the size of a new cluster for the Final Project. At the end of the lecture, a quiz over previous topics.
Tuning Spark: how to adjust memory and storage preferences. Sampling in Spark using the .sample() method (related: Reservoir sampling). Discussion: what is a good size of cluster for the Final Project? Student presentation on counting triangles in large graphs. Time permitting, GraphX vs GraphLab.
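The triangle-counting problem from the presentation can be sketched on a single machine with adjacency sets (a pure-Python illustration of the common neighbors idea; the function name and the small example graph are made up, and distributed versions partition this same work across the cluster):

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected graph: for each edge (u, v),
    count common neighbors w, keeping only u < v < w so that each
    triangle is counted exactly once."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    count = 0
    for u, v in edges:
        if u > v:
            u, v = v, u
        count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count

# A 4-cycle with one diagonal contains exactly 2 triangles.
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
print(count_triangles(edges))  # 2
```

The set intersection per edge is the expensive step; at large-graph scale this is exactly what systems like GraphX distribute.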
There will be a bit more discussion of the Final Project (mainly for any questions). The main topics of the day are (1) some brief further material on database transactions, and (2) an introduction to approximation methods for big data. Spark has a sample transform that can produce a small, sampled RDD from a large RDD; a technical basis for sampling is Reservoir sampling. Most time will be devoted (including a student presentation) to approximate counting, such as using Bloom filters and HyperLogLog techniques. Bonus link: many data analysis discussions/examples.
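Reservoir sampling itself fits in a few lines. Here is a sketch of the classic Algorithm R (a pure-Python illustration of the technique, independent of how Spark implements its sample transform):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length, using only O(k) memory."""
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Keep the new item with probability k / (n + 1),
            # evicting a uniformly chosen reservoir slot.
            j = rng.randrange(n + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 5)
print(sample)  # 5 items, each position of the stream equally likely
```

The key property is that the stream is consumed once and its length need not be known in advance, which is what makes the method suitable for big-data streams.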
Discussion of the Public Datasets now in S3 buckets for the Final Project: what they are, and how to use them. A student presentation on the Three-phase commit protocol, then some discussion of problematic issues of stable storage (journaling of transactions), checkpointing consistent states, how to set timeouts, and the need for quorums of more than half of the nodes to guarantee safety. These issues are present in Paxos (computer science), and can be understood by viewing this Paxos Lecture. Bonus link: Extracting Data from Recipes
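The quorum requirement can be made concrete in one line: any quorum of more than half the nodes must overlap with any other such quorum, so two conflicting decisions can never both gather a quorum. A tiny sketch (the function name is illustrative):

```python
def quorum_size(n):
    """Smallest quorum guaranteeing any two quorums share a node."""
    return n // 2 + 1

# Overlap check: two quorums of size q drawn from n nodes must
# intersect whenever 2*q > n (pigeonhole argument).
for n in range(1, 10):
    q = quorum_size(n)
    assert 2 * q > n

print(quorum_size(5))  # 3 of 5 nodes
```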
A student presentation on the Tachyon File System. Then an introduction to Message queuing service, which leads to looking at asynchronous use of shared memory for the Producer-consumer problem. The programming difficulty, with concurrency, is how to avoid losing data or damaging a queue when multiple writers compete to make updates to the shared queue. There are low-level tricks like locks, semaphores, even test-and-set operations. More relevant to Cloud Computing is the high-level view: the queue's history is a sequence of state transitions, from one queue state to the next. Processes working on the queue can use a distributed agreement algorithm so that all competing processes agree on the next state of the queue (which simplifies how the queue is updated). At the end of the lecture, the problem of Consensus (computer science) was presented to the class.
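The multiple-writers hazard can be sketched with threads and a lock (a minimal single-machine illustration, not a message-queuing service; in CPython the GIL happens to make a lone `deque.append` atomic, but the lock makes the "one state transition at a time" intent explicit):

```python
import threading
from collections import deque

# A shared queue protected by a lock: without mutual exclusion, two
# producers updating a more complex structure concurrently could
# interleave their steps and lose or corrupt updates.
queue = deque()
lock = threading.Lock()

def producer(items):
    for item in items:
        with lock:  # serialize: one queue state transition at a time
            queue.append(item)

threads = [
    threading.Thread(target=producer, args=(range(i * 100, (i + 1) * 100),))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(queue))  # 400: no updates lost
```

The distributed analogue replaces the lock with an agreement protocol: rather than excluding each other, the competing processes agree on the next queue state.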
Homework Four progress? Some slides from this presentation. Initial description of the Final Project, based on Public Datasets. How addition of n numbers can be done in O(log n) time on a Parallel random-access machine.
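The O(log n) PRAM addition can be sketched as pairwise reduction. In this pure-Python illustration the rounds run sequentially, but the additions within each round are independent, so n/2 processors could do each round in constant time:

```python
def parallel_sum(xs):
    """Pairwise summation: each round halves the number of values,
    so the number of rounds is ceil(log2 n)."""
    xs = list(xs)
    while len(xs) > 1:
        if len(xs) % 2:
            xs.append(0)  # pad so the values pair up evenly
        # All additions in this round are independent of one another.
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0] if xs else 0

print(parallel_sum(range(1, 9)))  # 36, reached in 3 rounds: 8 -> 4 -> 2 -> 1
```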
Continuing Spark, with in-class demonstrations.
Introducing Spark (there will be at least one homework using Spark).
Overview of the Google File System: racks, servers, master, chunkservers, metadata, and writing chunks through replicas. A video showed how Hadoop has adapted most of these ideas to the Hadoop File System. At the end of the lecture, a brief discussion of Homework Three.
Demonstration of running a Java MapReduce job on Amazon. See Launching Hadoop on Amazon for a written example, but we can try other things in class. The lecture also covered the ssh and scp commands, demonstrating remote login and secure copy to and from the master machine of an Amazon EMR cluster. Another topic, to be continued on Thursday, is the Google File System.
Some discussion of Parts 4-6 of Homework One -- what will be the due dates? Then, Higher-Level Software for Hadoop: Hive, Pig, Cascading and more. Demonstrations of AWS; see Elastic MapReduce using Python and MRJob, but the lecture will also look at how to launch EC2 instances, how to launch EMR, how to connect and interact with an instance using SSH, and possibly how to run a Java JAR in Hadoop on EMR.
How to critically read technical papers: Advice on Reading and Presenting which is useful for Homework Two (CS grad students only). Lecture topic is to present the original MapReduce paper. Time permitting, the lecture may start coverage of Amazon's Web Services.
No Class -- use this time for Homework One.
No Class -- use this time for Homework One.
Cover, in detail, the wordcount example using MRJob -- see the page Python MapReduce for how this can be installed. The example followed in the lecture is in topwords.zip. See PythonWordCount.pdf for a listing. The two programs wordcount.py and freqwords.py have no comments explaining how things work: that's so you can document them during class, taking notes and asking questions. Some of the Python syntax is more advanced than beginners to the language will have seen. It's important to understand these programs well, since the first homework builds on this knowledge. Discussion of Homework One.
Introduction to MapReduce, Version 0 in MapReduce page.