Lectures

5May

Continue with the informal student presentations from 30Apr.pdf. A brief mention of some future directions with tools like MapReduce and Spark: Apache Tez, Apache Flink, Apache Zeppelin (and, for the sake of comparison, an outsider project idea). What we see from all of these is a convergence on general trends (directed graph of tasks, need for a distributed file system, wanting to exploit RAM, fault tolerance, better support for interactive queries and visualization). Bonus link: GraphLab.

30Apr

A bit about how to work with matrices on Spark (just general problem solving, such as how to multiply two matrices), then a little about recommender systems. Recommendation algorithms can be set-based (like using the Jaccard index), use Alternating Least Squares, or even use k-Nearest Neighbors with Cosine Similarity. Three student presentations on the basic part of the Final Project, from 30Apr.pdf.
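
As an illustration of the set-based approach, here is a minimal sketch in plain Python that scores users by the Jaccard index of their item sets (the users and items are made up for illustration):

{{{#!python
# Jaccard index: |A intersect B| / |A union B|, a set-based similarity score.
def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return float(len(a & b)) / len(a | b)

# Hypothetical user -> liked-items data, just for illustration.
likes = {
    "alice": {"item1", "item2", "item3"},
    "bob":   {"item2", "item3", "item4"},
    "carol": {"item5"},
}

# Recommend to "alice" whatever the most similar other user liked that she has not seen.
scores = {u: jaccard(likes["alice"], items)
          for u, items in likes.items() if u != "alice"}
best = max(scores, key=scores.get)
print(best, scores[best], likes[best] - likes["alice"])   # bob 0.5 {'item4'}
}}}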

28Apr

Summary of techniques for scalability: Approximation, Sampling, Fast Algorithms (for streaming), Hierarchical Architectures, trading consistency for availability (e.g. Shard (database architecture)), and aggressive use of in-memory data structures. Illustrative topics in more detail: Bloom Filters and Facebook TAO.
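
To make the Bloom filter idea concrete, here is a toy sketch in plain Python (the bit-array size, number of hash functions, and use of hashlib are choices made for illustration, not taken from any production system):

{{{#!python
import hashlib

class BloomFilter(object):
    """Toy Bloom filter: k hash positions in an m-bit array.
    Membership tests may give false positives, never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k positions by hashing the item with k different salts.
        for i in range(self.k):
            h = hashlib.sha1(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("spark")
print("spark" in bf, "hadoop" in bf)   # True, almost certainly False
}}}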

23Apr

Walkthrough of Spark architecture, partitions, and estimating memory and storage needs for RDDs, including Performance Tuning and Profiling of Spark. Considerations for the size of a new cluster for the Final Project. At the end of the lecture, a quiz over previous topics.
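
One rough way to estimate memory needs from PySpark itself is to cache an RDD and extrapolate from a small sample of records; in this sketch the input path is hypothetical, pyspark is assumed to be installed, and the Storage tab of the Spark web UI is where the actual cached size shows up:

{{{#!python
import sys
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-size-estimate")

rdd = sc.textFile("somefile.txt")            # hypothetical input path
rdd.persist(StorageLevel.MEMORY_ONLY)

n = rdd.count()                              # action: materializes and caches the RDD
sample = rdd.take(1000)                      # small sample of records
avg_bytes = sum(sys.getsizeof(x) for x in sample) / float(len(sample))
print("records: %d, rough in-memory size: %.1f MB" % (n, n * avg_bytes / 1e6))
}}}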

21Apr

Tuning Spark: how to adjust memory and storage preferences. Sampling in Spark using the .sample() method (related is Reservoir sampling). Discussion: what is a good cluster size for the Final Project? Student presentation on counting triangles in large graphs. Time permitting, GraphX vs GraphLab.
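
A minimal example of the .sample() transformation in PySpark (assumes pyspark is installed; the fraction gives an expected, not exact, sample size):

{{{#!python
from pyspark import SparkContext

sc = SparkContext(appName="sampling-demo")

rdd = sc.parallelize(range(1000000))
# sample(withReplacement, fraction, seed): returns a new, smaller RDD.
small = rdd.sample(False, 0.001, seed=42)
print(small.count())                         # roughly 1000 elements
}}}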

16Apr

There will be a bit more discussion of the Final Project (mainly for any questions). The main topics of the day are (1) some brief further material on database transactions, and (2) an introduction to approximation methods for big data. Spark has a sample transform that can get a small, sampled RDD from a large RDD; a technical basis for sampling is Reservoir sampling. Most of the time will be devoted (including a student presentation) to approximate counting, such as using Bloom filters and HyperLogLog techniques. Bonus link: many data analysis discussions/examples
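
In the spirit of HyperLogLog (but much cruder), here is a toy Flajolet-Martin style estimator in plain Python; the hash choice and single-estimator design are illustrative only, since real HyperLogLog averages many such estimators to reduce the variance:

{{{#!python
import hashlib

def trailing_zeros(n):
    """Number of trailing zero bits in a positive integer n."""
    t = 0
    while n > 0 and n % 2 == 0:
        n //= 2
        t += 1
    return t

def estimate_distinct(items):
    """Estimate the number of distinct items in a stream: track the maximum
    number of trailing zero bits seen in any hash value, and return 2**max."""
    max_tz = 0
    for x in items:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
        if h != 0:
            max_tz = max(max_tz, trailing_zeros(h))
    return 2 ** max_tz

# Stream with many repeats but only 5000 distinct values.
stream = [i % 5000 for i in range(200000)]
print(estimate_distinct(stream))   # a rough power-of-two estimate near 5000
}}}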

14Apr

Discussion of the Public Datasets now in S3 buckets for the Final Project, what they are, and how to use them. A student presentation on the Three-phase commit protocol, then some discussion of the problematic issues of stable storage (journaling of transactions), checkpointing consistent states, how to set timeouts, and the need for quorums of more than half of the nodes to guarantee safety. These issues are present in Paxos (computer science), and can be understood by viewing this Paxos Lecture. Bonus link: Extracting Data from Recipes
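
The quorum requirement can be checked mechanically for a small cluster; this little script (plain Python, cluster size chosen arbitrarily for illustration) confirms that any two majority quorums share at least one node, which is why a majority is enough to prevent two conflicting decisions:

{{{#!python
from itertools import combinations

def majority_quorums(n):
    """All subsets of n nodes that contain strictly more than half of them."""
    need = n // 2 + 1
    return [set(q) for size in range(need, n + 1)
                   for q in combinations(range(n), size)]

quorums = majority_quorums(5)
# Every pair of majority quorums intersects.
print(all(a & b for a in quorums for b in quorums))   # True
}}}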

9Apr

A student presentation on one "NoSQL" Key-Value storage system. Discussion on where to store data for the Final Project. Introduction to Virtualization.

7Apr

A student presentation on Apache ZooKeeper. The connection between ZooKeeper and the problem of consensus in a distributed system. Idea suggestions for the Final Project.

2Apr

A student presentation on the Tachyon File System. Then an introduction to Message queuing service, which leads to looking at asynchronous use of shared memory for the Producer-consumer problem. The programming difficulty, with concurrency, is how to avoid losing data or damaging a queue when multiple writers compete to make updates to the shared queue. There are low-level tricks like locks, semaphores, even test-and-set operations. More relevant to Cloud Computing is the high-level view: the queue's history is a sequence of state transitions, from one queue state to the next. Processes working on the queue can use a distributed agreement algorithm so that all competing processes agree on the next state of the queue (which simplifies how the queue is updated). At the end of the lecture, the problem of Consensus (computer science) was presented to the class.
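
A single-machine sketch of the producer-consumer pattern in Python, where the thread-safe queue.Queue does the low-level locking so no updates are lost (the item count and the sentinel convention are illustrative choices):

{{{#!python
import threading
try:
    import queue             # Python 3
except ImportError:
    import Queue as queue    # Python 2

q = queue.Queue()            # thread-safe: an internal lock guards every update

def producer(n):
    for i in range(n):
        q.put(i)             # enqueue an item
    q.put(None)              # sentinel: tell the consumer we are done

def consumer():
    total = 0
    while True:
        item = q.get()       # blocks until an item is available
        if item is None:
            break
        total += item
    print("consumed sum:", total)

t1 = threading.Thread(target=producer, args=(1000,))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
}}}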

31Mar

Dropboxes for Final Project. Brief introduction to Message Passing Interface and Linda (coordination language).
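
A minimal point-to-point example with the mpi4py binding (this assumes the mpi4py package and an MPI runtime are installed, and would be launched with something like mpiexec -n 2 python hello_mpi.py):

{{{#!python
from mpi4py import MPI       # assumes mpi4py and an MPI runtime are installed

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Process 0 sends a Python object to process 1.
    comm.send({"msg": "hello from rank 0"}, dest=1, tag=0)
elif rank == 1:
    data = comm.recv(source=0, tag=0)
    print("rank 1 received:", data)
}}}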

26Mar

More discussion of Final Project. Algorithms in the PRAM architecture, the Bulk synchronous parallel model and asynchronous shared memory.

24Mar

Homework Four progress? Some slides from this presentation. Initial description of the Final Project, based on Public Datasets. How addition of n numbers can be done in O(log n) time on a Parallel random-access machine.
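
A sequential Python simulation of the pairwise-addition idea: in each round every pair of adjacent values is combined as if by a separate processor, so the list halves each round and the sum of n numbers is reached after about log2(n) rounds:

{{{#!python
import math

def parallel_sum_rounds(xs):
    """Simulate PRAM-style pairwise summation and count the rounds."""
    xs = list(xs)
    rounds = 0
    while len(xs) > 1:
        nxt = []
        for i in range(0, len(xs) - 1, 2):
            nxt.append(xs[i] + xs[i + 1])   # one addition per 'processor'
        if len(xs) % 2 == 1:                # an odd leftover carries over
            nxt.append(xs[-1])
        xs = nxt
        rounds += 1
    return xs[0], rounds

total, rounds = parallel_sum_rounds(range(1, 1025))
print(total, rounds, int(math.ceil(math.log(1024, 2))))   # 524800 10 10
}}}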

12Mar

One more talk by student, mention of Homework Four and Public Datasets, quiz on Spark.

10Mar

Two talks by students on Apache Mesos, see Presentations schedule.

5Mar

Two talks by students on YARN (part of Apache Hadoop), see Presentations schedule. Homework due today.

3Mar

Continuing Spark, with in-class demonstrations.

26Feb

Introducing Spark (there will be at least one homework using Spark).
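
A first taste of the RDD interface in PySpark (assumes pyspark is installed): transformations like map and filter are lazy, and only an action such as collect triggers the computation:

{{{#!python
from pyspark import SparkContext

sc = SparkContext(appName="spark-intro")

nums = sc.parallelize(range(10))                 # distribute a small dataset
squares = nums.map(lambda x: x * x)              # transformation: lazy
evens = squares.filter(lambda x: x % 2 == 0)     # another lazy transformation
print(evens.collect())                           # action: runs the job
# [0, 4, 16, 36, 64]
}}}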

24Feb

Discussion of Presentations. Some wrap-up on Homework One, plus discussion of Homework Three (in progress), and how Secondary Sort can be helpful.
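
One way to get a secondary sort with MRJob is the SORT_VALUES option, which makes each key's values arrive at the reducer already sorted; this sketch assumes a hypothetical input format of "user timestamp event" lines, with fixed-width timestamp strings so that lexicographic order matches time order:

{{{#!python
from mrjob.job import MRJob

class MREventsByTime(MRJob):
    # Ask MRJob/Hadoop to sort each key's values before the reducer sees them.
    SORT_VALUES = True

    def mapper(self, _, line):
        # hypothetical input: "user timestamp event" per line
        user, timestamp, event = line.split()
        yield user, (timestamp, event)

    def reducer(self, user, timed_events):
        # values arrive ordered by timestamp, thanks to SORT_VALUES
        yield user, list(timed_events)

if __name__ == '__main__':
    MREventsByTime.run()
}}}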

19Feb

Overview of the Google File System, racks, servers, master, chunkservers, metadata, and writing chunks through replicas. A video showed how Hadoop has adapted most of these ideas to the Hadoop File System. At end of lecture, brief discussion of Homework Three.

17Feb

Demonstration of running a Java MapReduce job on Amazon. See Launching Hadoop on Amazon for a written example, but we can try other things in class. The lecture also covered ssh and scp commands, demonstrating remote login and secure copy to and from the master machine of an Amazon EMR cluster. Another topic, to be continued on Thursday, is the Google File System.

12Feb

Some discussion of Parts 4-6 of Homework One -- what will be the due dates? Then, Higher-Level Software for Hadoop: Hive, Pig, Cascading and more. Demonstrations of AWS (see Elastic MapReduce using Python and MRJob), but the lecture will also look at how to launch EC2 instances, how to launch EMR, how to connect and interact with an instance using SSH, and possibly how to run a Java JAR in Hadoop on EMR.

10Feb

How to critically read technical papers: Advice on Reading and Presenting, which is useful for Homework Two (CS grad students only). The main lecture topic is a presentation of the original MapReduce paper. Time permitting, the lecture may start coverage of Amazon's Web Services.

5Feb

No Class -- use this time for Homework One.

3Feb

No Class -- use this time for Homework One.

29Jan

Another detailed look at wordcount, this time in Java, based on wordcountjava.zip. See JavaWordCount.pdf for a listing. More discussion of Homework One.

27Jan

Cover, in detail, the wordcount example using MRJob -- see the page Python MapReduce for how this can be installed. The example followed in the lecture is in topwords.zip. See PythonWordCount.pdf for a listing. The two programs wordcount.py and freqwords.py have no comments explaining how things work: that's so you can document them during class, taking notes and asking questions. Some of the Python syntax is more advanced than beginners to the language will have seen. It's important to understand these programs well, since the first homework builds on this knowledge. Discussion of Homework One.
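
For reference after class, here is a minimal, commented MRJob word count in the same spirit (not the course's wordcount.py itself, and the regular expression for splitting words is an illustrative choice):

{{{#!python
import re
from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every word on the input line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        # optional local aggregation before the shuffle
        yield word, sum(counts)

    def reducer(self, word, counts):
        # total occurrences of each word across the whole input
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
}}}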

22Jan

Versions 1 and 2 in MapReduce page, plus a brief introduction to MRJob.

20Jan

Introduction to MapReduce, Version 0 in MapReduce page.
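
The course's Version 0 is on the MapReduce page; as a generic illustration of the model (not necessarily the same as Version 0), here is a plain-Python simulation of the map, shuffle/group, and reduce phases for word count:

{{{#!python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word on every line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/group: collect all the values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values, here by summing the counts."""
    return dict((key, sum(values)) for key, values in groups.items())

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'the': 3, 'fox': 2, 'quick': 1, 'brown': 1, 'lazy': 1, 'dog': 1}
}}}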
