A significant part of this course is reading papers and presenting them (either in class or in written notes). A presentation is typically done with slides or prepared material for the blackboard (or whiteboard) that shows the paper's main concepts, results, research questions, and relation to the field. Before presenting a paper, one should read it carefully. See the Advice page for suggestions on how to read a technical paper.
Presenter and Scribe
Each paper will be read carefully by two students, the presenter and the scribe. The presenter will spend twenty to thirty minutes covering the content of the paper; the scribe will write a report on that content. Initially the presenter and scribe work independently, though the scribe may revise the report after the presentation as a result of asking questions and refining understanding. On the assigned day of the presentation (before class), the scribe will send the instructor a first draft of the report. The presenter should send slides or notes to the instructor at least one day before class, to get feedback and suggestions for improvement so that the presentation goes well.
Other graduate students attending the presentations should also take a look at the paper, so that they are in a good position to ask questions. However, the audience won't be as fully conversant with the results as the presenter and scribe. Selected students (those with no presenter or scribe duties in the coming two weeks) will be asked, by email, to write questions about the paper: see the notes on an "ideal paper" on the Advice page for hints about what would be reasonable questions from a critical reader.
Student PIN Numbers
So that student identities are anonymous on this course website, PIN numbers (00-99) will be assigned to students in class. These PIN numbers appear in the schedule below.
The Papers and Dates
Here is a schedule of papers and dates for presentation.
- 3 Apr
Professor Presents, Student 19 Scribe (questions by Student 11) NoSQL including Redis/Memcached. Paper: Can the elephants handle the NoSQL onslaught? and Cache craftiness for fast multicore key-value storage
Professor's Comment: Watch this nice presentation for some other insights in the SQL vs NoSQL debate.
- 3 Apr
Student 4 Presents, Student 3 Scribe (questions by Students 13, 14) NoSQL databases, including MongoDB (with evidence of controversial views pro and con). Paper: Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis
Professor's Comment: To further understand the pluses and minuses of MongoDB, you can search for MongoDB Gotchas, MongoDB Haters, MongoDB Tuning, SQL vs MongoDB, MongoDB Complexity, and so on. There is an interesting six-part blog sequence about consistency here.
- 8 Apr
Student 6 Presents, Student 5 Scribe (questions by Students 15, 16) The stack of Twitter, which uses (or used to use) Gizzard (a Scala framework) among many other components (like Redis, Memcached, etc.). A few slides from this presentation might be good to show. A paper to present is WTF: the who to follow service at Twitter.
Professor's Comment: There doesn't seem to be an academic paper about the Twitter stack, which is evolving rapidly. This presentation on the Real Time Delivery Architecture is informative, if outdated by now. The paper presented in class shows the tension between doing a good job of recommendation based on thorough graph analysis (approximation of personal pagerank) and the need for new users to get quick recommendations.
- 8 Apr
Student 8 Presents, Student 7 Scribe (questions by Students 17, 18) The stack of Pinterest, which may use ELB as one part of the stack. A recent paper exploring "elastic" properties in systems (especially paging subsystems) is Towards Elastic Operating Systems.
Professor's Comment: This paper owes much to one of its references (see the bibliography), which observes that it can be faster to fetch memory pages from another machine, across a local network, than to read the same amount of memory from local disk. That observation is potentially a game-changer for distributed systems, inviting us to rethink how best to use the combined memory of a cluster of machines. This paper advocates modifying virtual memory, fetching remote pages across a network, and using process migration for elasticity.
- 10 Apr
Student 10 Presents, Student 9 Scribe (questions by Students 19, 01) Queue services are used in many big data systems: Amazon SQS, RabbitMQ, Kafka, Beanstalkd, HornetQ, and three Apache projects (ActiveMQ, Qpid, Apollo) are examples of queue services. Look at AMQP, and survey a bit of RabbitMQ to explain what these services provide. A paper to present would be Performance evaluation of RESTful web services and AMQP protocol
Professor's Comment: Queue services are particularly important in systems with high velocity (typically streams of incoming data that cannot be slowed down). Social networks are large and generate lots of data, but generally the velocity can be controlled by apps and user interfaces; not so with sensors, real-time traffic monitors, and domains like High-frequency trading. Relatively little research has been published on queue services, though there are many implementations.
- 10 Apr
Student 12 Presents, Student 11 Scribe (questions by Students 02, 03) Tumblr's stack includes software to "accelerate" web server performance. Although caches like Memcached or in-memory databases such as Redis can greatly improve server performance, another aspect is load balancing. An influential example geared to web server speedup is HAProxy for load balancing (also see Haproxy). A paper like Toward a Cloud-Ready Dynamic Load Balancer Based on the Apache Web Server would be relevant.
Professor's Comment: Though the idea of load balancing is intuitive, there are interesting surprises. One way to think about load balancing that adjusts dynamically (including features for elasticity, which may grow and shrink resource deployment automatically) is the viewpoint of Feedback control. Load balancing that samples and adjusts (feedback) could be susceptible to oscillations, overreactions, and instability. It is hard to optimally balance load if information about the current load is stale or if future load is highly dynamic (unpredictable). Also, delays in changing load make feedback control tricky to achieve.
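The feedback-control view above can be made concrete with a toy simulation (entirely illustrative; the two servers, gain, and delay are invented for this sketch, not taken from any paper). A controller splits traffic between two servers of unequal capacity using a load measurement that is one step stale; a gentle gain settles near the balanced split, while an aggressive gain overreacts and oscillates.

```python
# Illustrative sketch (not from the paper): a controller splits traffic
# between two servers using a load measurement that is one step stale.
# With an aggressive gain, the split oscillates instead of settling.

def simulate(gain, delay=1, steps=12):
    split = 0.5               # fraction of traffic sent to server A
    history = [split]
    # server A is twice as fast, so the balanced split is 2/3 to A
    for t in range(steps):
        observed = history[max(0, len(history) - 1 - delay)]  # stale reading
        load_a = observed / 2.0        # load = traffic / capacity
        load_b = (1 - observed) / 1.0
        # feedback: shift traffic toward the less-loaded server
        split = min(1.0, max(0.0, split + gain * (load_b - load_a)))
        history.append(split)
    return history

gentle = simulate(gain=0.3)      # converges near the 2/3 balance point
aggressive = simulate(gain=1.5)  # overreacts and swings between extremes
print("gentle:    ", [round(x, 2) for x in gentle])
print("aggressive:", [round(x, 2) for x in aggressive])
```

The instability here comes only from the gain and the one-step staleness, which is exactly the feedback-control intuition in the comment above.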
- 15 Apr
Student 14 Presents, Student 13 Scribe (questions by Students 04, 05) Storm is an Apache project (see Storm (event processor)) originating from Backtype/Twitter, then made available to others (Groupon, Alibaba are two of many users). Storm addresses the high velocity aspect of big data, where streams of information need to be filtered and queried for events of interest. A paper to read is Adaptive Online Scheduling in Storm, which proposes improvements in scheduling for Storm.
- 15 Apr
Student 16 Presents, Student 15 Scribe (questions by Students 06, 07, 12) Apache Spark is an analytics subsystem sitting on HDFS (like Hadoop). Spark can be compared to Hadoop, but claims to be much faster for most work. Spark Streaming is an add-on, addressing high velocity data; Spark also has support for Hive's query language (native Hive would run Hadoop jobs). Groups at Berkeley have many papers about Spark (see AmpLab list). A paper to present could be GraphX: A Resilient Distributed Graph System on Spark.
- 17 Apr
Student 18 Presents, Student 17 Scribe (questions by Students 08, 09, 10) Virtual machines are crucial technology for cloud computing, with Amazon EC2 being one familiar example. However, virtual machines can have more overhead than running applications on a native operating system. An alternative approach is Operating system level virtualization usually called containers for applications. One popular example of the concept is LXC (Linux containers). Implementing containers at scale motivates other software, like Docker (software), which extends Linux Containers. For some fluffy press about Docker, see this article; for a paper to read, consider A Performance Comparison of Container-based Virtualization Systems for MapReduce Clusters which compares different container implementations with respect to MapReduce.
- 22 Apr
Professor presents Apache S4. The paper is S4: Distributed Stream Computing Platform, which motivates looking into stream system performance. Student 2 Presents (questions by Students 13, 14, 15, 16) Earlier presentations (Blazes, Storm) have emphasized the high-velocity issue of big data. The paper here is Naiad: A Timely Dataflow System. For further background you can look at this "Scribe Report" and view this talk.
- 20 Feb
Student 01 Presents, Student 02 Scribe The Google File System (Ghemawat et al, 2003) Scribe summary: scribe-gfs.pdf
Professor's Comment: The design of GFS has several motivations. First, Google search needs a data repository for web crawlers that collect page data and URL data, and existing databases aren't sufficient for the job. Second, Google discovered that it is less expensive to use low-cost disks and many machines than to use high-cost machines and traditional reliable storage based on RAID and so forth. Third, the design of GFS takes advantage of large memory sizes (it is possible to have machines with 48GB of memory and more), so that all the information tracking a file system can effectively be in memory, which enables a master machine to respond very fast to clients. This third reason influences many other things, including the weakened semantics of file operations, the way that storage is allocated in 64MB chunks, and more.
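The third point, keeping all file-system metadata in the master's memory, can be pictured as a couple of dictionaries (a toy sketch with invented names, not GFS's actual data structures):

```python
# Illustrative sketch (names are invented, not GFS's actual structures):
# the master keeps all file-to-chunk metadata in RAM, so lookups are
# dictionary accesses rather than disk reads.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS allocates storage in 64 MB chunks

class Master:
    def __init__(self):
        self.file_chunks = {}      # path -> list of chunk handles
        self.chunk_locations = {}  # chunk handle -> list of chunkservers
        self.next_handle = 0

    def create(self, path):
        self.file_chunks[path] = []

    def allocate_chunk(self, path, servers):
        handle = self.next_handle
        self.next_handle += 1
        self.file_chunks[path].append(handle)
        self.chunk_locations[handle] = list(servers)
        return handle

    def lookup(self, path, offset):
        # translate a byte offset into (chunk handle, replica locations)
        handle = self.file_chunks[path][offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]

m = Master()
m.create("/crawl/pages-0001")
m.allocate_chunk("/crawl/pages-0001", ["cs1", "cs2", "cs3"])
m.allocate_chunk("/crawl/pages-0001", ["cs2", "cs4", "cs5"])
handle, replicas = m.lookup("/crawl/pages-0001", 70 * 1024 * 1024)
print(handle, replicas)   # byte 70 MB falls in the second chunk
```

Because the whole map fits in RAM, a client's lookup costs the master a few dictionary operations, which is why the master can answer so quickly.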
- 20 Feb
Student 03 Presents, Student 04 Scribe: Dynamo: Amazon's highly available key-value store (DeCandia et al, 2007) Scribe summary: scribe-dynamo.pdf
Professor's Comment: The paper has a clever, somewhat algorithmic idea on how a key-value store with high availability (a target of a 300ms response time for 99.9% of operations) can be built using principles from Peer-to-peer networks. Keys are hashed to locations in a virtual ring (though the presentation is somewhat confusing on all the details). A quorum-based design lets each application using Dynamo create an instance (ring) according to design parameters (R,W,N), which control some tradeoffs influencing the latency of operations. An interesting point is that consistency is not guaranteed by the system, and applications may have to reconcile inconsistent versions of concurrently-modified objects.
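The ring-and-quorum idea can be sketched in a few lines (an illustrative toy with invented names and parameters, not Dynamo's implementation):

```python
import hashlib
from bisect import bisect

# Illustrative sketch of Dynamo-style placement (much simplified):
# keys hash onto a ring of virtual-node positions, and a key's N
# replicas are the next N distinct physical nodes clockwise.

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8, n=3):
        self.n = n
        # each physical node owns several positions (virtual nodes)
        self.ring = sorted((h(f"{node}#{i}"), node)
                           for node in nodes for i in range(vnodes))

    def preference_list(self, key):
        start = bisect(self.ring, (h(key), ""))
        replicas = []
        for pos in range(len(self.ring)):   # walk clockwise
            _, node = self.ring[(start + pos) % len(self.ring)]
            if node not in replicas:
                replicas.append(node)
            if len(replicas) == self.n:
                return replicas

ring = Ring(["A", "B", "C", "D", "E"], n=3)
replicas = ring.preference_list("user:42")
print(replicas)

# quorum intuition: with R + W > N, every read quorum intersects
# every write quorum, so a read contacts at least one replica that
# saw the latest write
N, R, W = 3, 2, 2
assert R + W > N
```

The (R,W,N) parameters are the tradeoff knobs the comment mentions: lowering W speeds up writes at the cost of weaker read guarantees, and vice versa.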
- 25 Feb
Student 05 Presents, Student 06 Scribe (Questions by Students 17, 18). Bigtable: A Distributed Storage System for Structured Data (Chang et al, 2006) Scribe summary: scribe-bigtable.pdf
Professor's Comment: Bigtable addresses the reality of application needs on system performance. For many applications querying and updating large data, abstractions of mapping (key-value indexing and more general ways of finding data) fall into fewer patterns than one might initially imagine. Therefore, it can make sense to specialize the way data is stored and replicated so that the system performs better on the common patterns. Bigtable organizes data into groups of rows (tablets) and families of columns, making these the units upon which servers are based. Once this organization of data was decided, issues of fault tolerance, locality, load balancing, scaling, and consistency had to be addressed. The paper sketches how these issues were dealt with in Bigtable. The most important lesson learned (or so it is claimed) is the value of simple designs. One wonders, however, how simple Bigtable is by any absolute measure of simplicity. Also see Apache Accumulo.
- 25 Feb
Student 07 Presents, Student 08 Scribe (Questions by Student 19). Spanner: Google’s Globally-Distributed Database (Corbett et al, 2012) Scribe summary: scribe-spanner.pdf
Professor's Comment: Spanner does make Bigtable seem almost simple. In this paper we learn that GFS was replaced by Colossus (unpublished), that Bigtable only guarantees eventual consistency (if we consider all of Google's datacenters as one universal data store), but that Spanner is the first planetary-scale, multi-version, synchronously replicated database. One surprise is that, at this immense scale, the choice of model was more a relational, SQL-like treatment of data than the structured data models chosen by other similar enterprises. This paper echoes many classical themes of transactional concurrency control, using Paxos, transaction managers, two-phase commit, and multiversion objects based on timestamps. A novel timestamp abstraction is TrueTime (TT), which consists of a (start,end) interval that includes clock uncertainty based on low-level clock reads/synchronizations (GPS, atomic clocks, and timely resynchronization operations). The TT abstraction can be simpler to use than heavier causal abstractions like vector clocks. The paper has many nuggets of rationale for choices in the design (how to colocate directories that would be accessed together, how to derive the timing of commit wait).
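The TrueTime interval and the commit-wait rule can be sketched as a toy (an illustrative approximation, not Spanner's API; EPSILON stands in for the uncertainty bound that Spanner derives from GPS and atomic clocks):

```python
import time
from collections import namedtuple

# Illustrative sketch of the TrueTime idea: now() returns an interval
# [earliest, latest] guaranteed (in the real system) to contain true
# time; here EPSILON is a made-up fixed uncertainty bound.

TTInterval = namedtuple("TTInterval", "earliest latest")
EPSILON = 0.004  # pretend 4 ms clock uncertainty

def tt_now():
    t = time.monotonic()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit(apply_writes):
    s = tt_now().latest          # choose the commit timestamp
    apply_writes()
    # commit wait: don't expose the commit until s is certainly in the
    # past on every clock, so timestamp order matches real-time order
    while tt_now().earliest <= s:
        time.sleep(EPSILON / 4)
    return s

s = commit(lambda: None)
assert tt_now().earliest > s     # s is now definitely in the past
```

The point of the wait is exactly the "timing of commit wait" rationale mentioned above: after the wait, any later transaction anywhere must get a strictly larger timestamp.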
- 27 Feb
Student 09 Presents, Student 10 Scribe (Questions by Students 01, 02). Blazes: Coordination Analysis for Distributed Programs (Alvaro et al, 2013) Scribe summary: scribe-blazes.pdf
Professor's Comment: Whereas Dynamo optimizes data storage for writes (and sacrifices some consistency), and TAO (below) optimizes data storage for reads (availability is more important than consistency for Facebook), this paper goes in other directions. First, it is not about data storage so much as the velocity aspect of big data (streams of incoming data). Second, because of replication and imperfect streams (out-of-order messages, duplication of messages), replicated views can be inconsistent (not exactly database inconsistency). Third, the inconsistencies can be dealt with by standard mechanisms (consensus or atomic broadcast), but these are expensive. The point of the paper is that if application semantics are well understood, then a compiler-like tool can automatically add the minimal synchronization needed to ensure consistency, thus lowering latency over always using consensus. So, just as Amazon and Facebook "understood" the customer needs, that understanding can be automated using semantic information (e.g. annotations or specialized languages) to understand the client on a case-by-case basis. It's unclear how practical such an approach can be, but the idea is interesting.
- 27 Feb
Student 11 Presents, Student 12 Scribe (Questions by Students 03, 04). TAO: Facebook’s Distributed Data Store for the Social Graph (Bronson et al, 2013) Scribe summary: scribe-tao.pdf
Professor's Comment: Facebook's TAO teaches us that it is possible to build large scale infrastructure from non-scalable tools (MySQL and PHP) by the combination of caching, replication and failover, and divide-and-conquer for storing petabytes of social graph. This is possible by making the choice, in several points of the design, to emphasize read availability over consistency. The social graph is sharded into many databases, each replicated but with a single master at any time. A two-tier cache system (of immense scale) brings a copy of the data that clients need close to them, with some attention to load-balancing.
- 4 Mar
Student 13 Presents, Student 14 Scribe (Questions by Students 05, 06). Cassandra: a decentralized structured storage system (Lakshman and Malik, 2010, only six pages); also read about Apache Cassandra (documentation at Apache) to prepare the presentation/report. Scribe summary: scribe-cassandra.pdf
Professor's Comment: Apache Cassandra is a data store engineered to support high write throughput and efficient reading, at the expense of full-blown relational database features (such as join operations, and with less than full ACID guarantees). Originally developed at Facebook (though one designer had been part of Dynamo), it offers users numerous parameters to control performance and resources, such as how data is arranged, how it is replicated, and the level of consistency. The schema for data is quite similar to Bigtable's, with some extension for recursive column families (in fact, column data is used to index documents in one application). To make updating robust, Cassandra uses a dedicated disk for the commit log. Further, Cassandra morphs all writes into batched sequential writes. A Dynamo-style ring structure is used to replicate, and reads are normally directed to the closest replica (though, at higher consistency needs, some quorum reading is possible). Cassandra is especially notable for wide adoption, and there have been a few recent Cassandra summits (industry meetings) with many applications.
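The write path described above (sequential commit log for durability, in-memory table, flush to sorted immutable files) can be sketched as a toy (invented names and sizes, not Cassandra's code):

```python
# Illustrative sketch of a log-structured write path: every write is
# appended to a sequential commit log, then applied to an in-memory
# table; a full memtable is flushed as a sorted, immutable file
# (an SSTable), so the disk only ever sees sequential writes.

class WritePath:
    def __init__(self, flush_threshold=3):
        self.commit_log = []          # stands in for an append-only file
        self.memtable = {}
        self.sstables = []            # each is a sorted list of items
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # sequential append only
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable
            for k, v in table:
                if k == key:
                    return v
        return None

db = WritePath()
for k, v in [("c", 1), ("a", 2), ("b", 3), ("a", 9)]:
    db.write(k, v)
print(db.read("a"), len(db.sstables))
```

Random-access updates never touch the sorted files in place, which is how all writes become batched sequential writes; reads pay for this by consulting the memtable and then the SSTables from newest to oldest.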
- 4 Mar
Student 15 Presents, Student 16 Scribe (Questions by Students 07, 08). Sailfish: a framework for large scale data processing (Rao et al, 2012) Scribe summary: scribe-sailfish.pdf
Professor's Comment: Sailfish is just one of many developments to improve upon the state of the art for MapReduce computing. The approach of the paper looks beyond the shortcomings of older Hadoop and HDFS systems, asking the question: what are the fundamental bottlenecks? Among the many factors, network latency/bandwidth and disk seeks are identified as points for attention. To that end, the paper advocates a new way to store intermediate data (output by mappers, headed to reducers). Their contribution is a new file structure for intermediate data, the I-File, which tries to limit disk seeks when writing/merging mapper output. The I-File is also indexed, which has the benefit of enabling dynamic estimation of key ranges so as to balance load among reducers, and even of automatically tuning the number of reducers for better performance. There's an interesting tradeoff in the results: mapping can actually take longer, while the total overall time can be shorter. The results vary depending on the type of job, and Sailfish is not always a clear win.
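The key-range estimation idea can be illustrated with a toy quantile computation (not Sailfish's actual algorithm; the sample and boundaries are invented): given a sample of the intermediate keys, pick split points so each reducer gets roughly equal work.

```python
# Illustrative sketch: choose reducer key-range boundaries from a
# sample of intermediate keys, at equal quantiles of the sampled
# distribution, so records divide roughly evenly among reducers.

def reducer_boundaries(sampled_keys, num_reducers):
    ks = sorted(sampled_keys)
    # split points at equal quantiles of the sampled keys
    return [ks[len(ks) * i // num_reducers]
            for i in range(1, num_reducers)]

sample = [17, 3, 99, 42, 8, 64, 23, 5, 71, 36, 88, 12]
bounds = reducer_boundaries(sample, num_reducers=3)
print(bounds)   # keys below bounds[0] go to reducer 0, and so on
```

Static partitioning (e.g. hashing keys into fixed buckets) can leave some reducers overloaded when keys are skewed; estimating ranges from observed data is what lets the system balance reducers and even pick their number dynamically.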
- 6 Mar
Student 17 Presents, Student 18 Scribe (Questions by Students 09, 10). Apache Hadoop YARN: yet another resource negotiator (Vavilapalli et al, 2013) Scribe summary: scribe-yarn.pdf
Professor's Comment: YARN is Apache's newest generation of fault-tolerant job scheduling, replacing the older mechanism Hadoop used to manage MapReduce tasks. YARN's ambition goes beyond MapReduce, supporting many parallel computing frameworks. In some ways, YARN's goals resemble traditional multiprocessor scheduling of shared resources, though the hosted applications are presumably isolated from each other. The YARN paper is quite well written in how it establishes the requirements and goals. The three layers of abstraction (Resource Manager, Application Master, Node Manager) might have been more clearly separated in the narrative, to better show the consequences of multi-tenancy, failures, and cross-layer interactions. An interesting technique is the use of periodic heartbeats between components to reliably share state (and the status of requests, responses, and resources). This seems quite sensible, since heartbeats let a system crudely emulate shared memory. The most recent version of Hadoop ships with YARN as the high-level control mechanism.
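The heartbeat pattern can be sketched as a toy (invented names and numbers, not YARN's protocol): each node periodically piggybacks its state onto a heartbeat, and a manager treats any node whose last report is too old as failed.

```python
# Illustrative sketch of the heartbeat pattern: nodes report status
# periodically; the manager both learns cluster state from the
# reports and detects failures from their absence.

class Manager:
    def __init__(self, timeout=3):
        self.timeout = timeout
        self.last_seen = {}   # node -> tick of last heartbeat
        self.state = {}       # node -> resources reported in heartbeat

    def heartbeat(self, node, tick, free_resources):
        self.last_seen[node] = tick
        self.state[node] = free_resources   # state rides the heartbeat

    def live_nodes(self, now):
        return [n for n, t in self.last_seen.items()
                if now - t <= self.timeout]

mgr = Manager()
mgr.heartbeat("node1", tick=0, free_resources={"mem_gb": 8})
mgr.heartbeat("node2", tick=0, free_resources={"mem_gb": 4})
mgr.heartbeat("node1", tick=5, free_resources={"mem_gb": 2})
print(mgr.live_nodes(now=6))   # node2's last heartbeat is stale
```

This is the "crude emulation of shared memory" point: the manager's `state` dictionary is a slightly stale copy of every node's local state, refreshed once per heartbeat interval.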
- 6 Mar
Student 19 Presents, Student 01 Scribe (Questions by Students 11, 12). Large graph processing in the cloud (Chen et al, 2010); this is a short paper, so also look at some Large Graph Software that is available or the Pregel paper. Scribe summary: scribe-largegraph.pdf
Professor's Comment: For many applications, MapReduce is a reasonable paradigm. However, for (distributed and parallel) graph algorithms, especially on sparse graphs, MapReduce tends to consume too much bandwidth and incur extra latency by shuffling (key,value) tuples around excessively. Papers like this "Surfer" paper and Pregel offer new paradigms closer to the design of many graph algorithms. There are operations to propagate information locally (where local is with respect to the graph), operations to aggregate information from other vertices, and even more advanced operations like condensation and graph mutation. These can result in considerable speedup over what MapReduce would do. However, the gains are highly problem- and instance-dependent; moreover, the best results may depend on a good graph partition, which is not easy to obtain.
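The vertex-centric style can be illustrated with a toy superstep loop (a Pregel-like sketch, much simplified; the function and names are invented): in each superstep a vertex combines incoming messages, updates its value, and messages its neighbors. Here each vertex propagates the minimum vertex id it has seen, which labels connected components.

```python
# Illustrative sketch of vertex-centric computation: supersteps of
# message exchange, with vertices that don't change going inactive.
# Labels connected components by propagating the minimum vertex id.

def components(edges, vertices):
    neighbors = {v: [] for v in vertices}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    label = {v: v for v in vertices}       # start with own id
    active = set(vertices)
    while active:                          # one superstep per iteration
        inbox = {v: [] for v in vertices}
        for u in active:                   # active vertices send labels
            for v in neighbors[u]:
                inbox[v].append(label[u])
        active = set()
        for v in vertices:                 # keep the smallest label seen
            if inbox[v] and min(inbox[v]) < label[v]:
                label[v] = min(inbox[v])
                active.add(v)              # changed, so stay active
    return label

print(components([(1, 2), (2, 3), (5, 6)], [1, 2, 3, 5, 6]))
```

Information moves only along edges, so the communication cost tracks the graph structure instead of reshuffling all tuples globally each round, which is the advantage over MapReduce that the comment describes.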