These are not really meant to be lecture notes; the list just gives a rough idea of what went on each day.
- 3 May
Moving on to BigTable, !HBase, !NoSQL, and the future of Hadoop and Big Data.
- 1 May
Responding to a class request, this lecture looks at the Hadoop File System, the Google File System, and even the file system of the Disco Project. These are interesting not only because they support things like MapReduce, but also because they show how to engineer distributed file systems for large data that doesn't fit traditional database models.
- 26 Apr
Upon class request, most of the lecture was about Erlang (the programming language) and why it is generating excitement for highly concurrent computing. I summarized, from memory, some of the features/lessons from this presentation. Along the way, we briefly glanced at the Disco Project, which is built on Erlang, plus an alternative to the Hadoop File System called the Disco File System, and a Python interface (it could be an interesting alternative to Hadoop). Practically-oriented students might also want to check out the tools that Instagram uses.
- 24 Apr
About Cascading. A comparison.
- 19 Apr
- 17 Apr
- Covering the impossibility of 1-fault resiliency in tasks other than consensus.
- 12 Apr
- Showed a rough equivalence between consensus and atomic broadcast, which provides another way to understand the limits on scale and fault tolerance for consensus problems. One direction of the reduction is sketched below.
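Here is a toy Python sketch of that direction (my own illustration, not the lecture's proof, with atomic broadcast faked by one shared, totally ordered log -- the part a real fault-tolerant protocol has to implement): every process broadcasts its proposal and decides on the first value delivered, so the total order immediately gives agreement.

```python
# Toy reduction: consensus from atomic broadcast. The AtomicBroadcast
# class is a stand-in, NOT a real protocol: a single shared list plays
# the role of the totally ordered delivery log.

class AtomicBroadcast:
    def __init__(self):
        self.log = []                   # same delivery order for everyone

    def broadcast(self, msg):
        self.log.append(msg)            # a real protocol does the hard work here

    def delivered(self):
        return list(self.log)

def consensus(abcast, my_value):
    abcast.broadcast(my_value)          # propose
    return abcast.delivered()[0]        # decide the first value delivered

abcast = AtomicBroadcast()
decisions = [consensus(abcast, v) for v in ["a", "b", "c"]]
assert len(set(decisions)) == 1         # agreement: everyone decides "a"
```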
- 5 Apr
- 3 Apr
- 29 Mar
Shared memory topics, such as read/write consistency and atomicity. A small demonstration of why atomicity matters is sketched below.
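As a concrete illustration (my own example, not from the lecture), two Python threads perform a non-atomic read-then-write on a shared counter, and updates get lost when the scheduler interleaves the two halves. The time.sleep(0) is only there to force a yield so the bad interleaving is easy to reproduce.

```python
# Illustration only: a deliberately non-atomic read/write on shared memory.
import threading
import time

counter = 0

def non_atomic_increment(n):
    global counter
    for _ in range(n):
        tmp = counter        # read ...
        time.sleep(0)        # ... another thread may run right here ...
        counter = tmp + 1    # ... then write: read+write is NOT one atomic step

threads = [threading.Thread(target=non_atomic_increment, args=(1000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)               # almost always well below 2000: lost updates
```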
- 27 Mar
- 22 Mar
- 20 Mar
Shortest Path in MapReduce was the main topic; a small sketch follows below.
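For the record, here is a single-machine Python sketch of the idea (my own illustration, not Hadoop code or the class handout). Each record pairs a node with its tentative distance and adjacency list; the map step relaxes edges, the shuffle groups values by node, and the reduce step keeps the minimum distance -- one MapReduce round per relaxation step, iterated until nothing changes.

```python
# Sketch for illustration; the record format and names are my own.
from collections import defaultdict

INF = float("inf")

def mapper(node, dist, adj):
    yield node, ("graph", adj)               # pass the structure through
    yield node, ("dist", dist)               # keep the current distance
    if dist < INF:
        for neighbor, w in adj:              # relax every outgoing edge
            yield neighbor, ("dist", dist + w)

def reducer(node, values):
    adj, best = [], INF
    for tag, v in values:
        if tag == "graph":
            adj = v
        else:
            best = min(best, v)              # keep the smallest distance seen
    return node, best, adj

def one_round(graph):
    """graph maps node -> (distance, [(neighbor, weight), ...])."""
    shuffled = defaultdict(list)             # the "shuffle" phase
    for node, (dist, adj) in graph.items():
        for key, value in mapper(node, dist, adj):
            shuffled[key].append(value)
    return {n: (d, a) for n, d, a in
            (reducer(k, vs) for k, vs in shuffled.items())}

g = {"s": (0, [("a", 1), ("b", 4)]),
     "a": (INF, [("b", 2)]),
     "b": (INF, [("c", 1)]),
     "c": (INF, [])}
prev = None
while g != prev:                             # iterate to a fixed point
    prev, g = g, one_round(g)
print({n: d for n, (d, _) in g.items()})     # {'s': 0, 'a': 1, 'b': 3, 'c': 4}
```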
- 8 Mar
- How to parallelize a shortest-path computation in graphs (networks), in both the PRAM and MapReduce models (see the MapReduce sketch above, after the 20 Mar entry).
- 6 Mar
- 1 Mar
- Going over solution to homework; second Quiz.
- 28 Feb
- More on finding the maximum in O(1) time. Also, the start of a 3-coloring algorithm.
- 23 Feb
- Applying the roots-of-a-forest algorithm to parallel prefix. Also, an O(1)-time parallel algorithm for the maximum of an array, but only if the model permits concurrent writes of the same value to a given location -- otherwise it seems we can only get an O(lg n)-time algorithm. A simulation of the O(1) algorithm is sketched below.
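As an illustration, here is a sequential Python simulation (my sketch, not class code) of that O(1) CRCW maximum algorithm: imagine one processor per pair (i, j), all running at once; every element that loses some comparison gets its flag set, and since every concurrent write stores the same value True, the common-write assumption above is respected. The unflagged element is the maximum.

```python
# Sequential simulation of the O(1)-time CRCW maximum; names are my own.
def crcw_max(a):
    n = len(a)
    loses = [False] * n                  # shared memory: one flag per element
    # "parallel" step 1: one virtual processor per pair (i, j)
    for i in range(n):
        for j in range(n):
            if a[i] < a[j]:
                loses[i] = True          # many processors may write True here
    # "parallel" step 2: one virtual processor per element reads its flag
    return next(a[i] for i in range(n) if not loses[i])

print(crcw_max([3, 1, 4, 1, 5, 9, 2, 6]))   # 9
```

With n^2 processors both steps take constant time; the sequential loops above only simulate them.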
- 21 Feb
- In more depth, the current homework (details on the framework); then a look at a PRAM algorithm for finding the roots of a forest (a sketch follows below).
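The classic PRAM technique here is pointer jumping; I am assuming that is the algorithm meant, so here is a sequential Python simulation of it (my sketch). In each synchronous round every vertex, in parallel, replaces its parent pointer with its grandparent's pointer, so the distance to the root roughly halves each round and every vertex reaches its root in O(log n) rounds.

```python
# Pointer-jumping sketch; the synchronous PRAM round is modeled by
# computing all new pointers before writing any of them.
def find_roots(parent):
    """parent[v] is v's parent; roots point to themselves."""
    parent = list(parent)
    changed = True
    while changed:
        new = [parent[parent[v]] for v in range(len(parent))]  # one round
        changed = new != parent
        parent = new
    return parent

# forest with two trees: 0 <- 1 <- 2 <- 3 and 4 <- 5
print(find_roots([0, 0, 1, 2, 4, 4]))   # [0, 0, 0, 0, 4, 4]
```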
- 16 Feb
- 14 Feb
- 9 Feb
- 7 Feb
- 2 Feb
- Pipelining of addition -- computing a series of summations: the input is a sequence of vectors, the output is the sequence of their sums; a new vector enters the pipeline at each consecutive time unit, and after a log(n) delay the corresponding sum emerges from the pipeline. A toy simulation follows below.
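Here is a toy Python simulation of that pipeline (my sketch, with n a power of two and the names my own): a binary tree of adders with registers between levels. On every clock tick each level halves the vector it holds, a new input vector enters level 0, and a finished sum leaves the deepest level -- the first one after log2(n) ticks, then one per tick.

```python
# Toy pipeline simulation; structure and names are my own illustration.
from math import log2

def pipelined_sums(vectors, n):
    depth = int(log2(n))                     # levels in the addition tree
    stage = [None] * depth                   # registers after each level
    out = []
    for t in range(len(vectors) + depth):
        if stage[depth - 1] is not None:
            out.append(stage[depth - 1][0])  # a finished sum emerges
        # clock tick: move data one level deeper, halving as it goes
        for d in range(depth - 1, 0, -1):
            prev = stage[d - 1]
            stage[d] = (None if prev is None else
                        [prev[2*i] + prev[2*i + 1] for i in range(len(prev) // 2)])
        v = vectors[t] if t < len(vectors) else None
        stage[0] = (None if v is None else
                    [v[2*i] + v[2*i + 1] for i in range(n // 2)])
    return out

print(pipelined_sums([[1]*8, list(range(8)), [2]*8], 8))   # [8, 28, 16]
```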
- 2 Feb
- Quiz over Linda.
- 31 Jan
- No Class.
- 26 Jan
Going over notes on Linda Trapezoid, then a look at systolic array computation of a matrix product C=AB; also a very brief look at convolution on a systolic array, as used in engineering a Finite Impulse Response (FIR) filter. A simulation of the systolic matrix product follows below.
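For reference, here is a cycle-by-cycle Python simulation of the systolic matrix product (my own sketch of the standard construction, which I believe matches the one presented). Cell (i, j) keeps an accumulator; row i of A streams in from the left and column j of B from the top, each row/column skewed by one time step, so that A[i][k] and B[k][j] meet at cell (i, j) exactly at time i + j + k.

```python
# Simulation of an n x n systolic array for C = A*B; names are my own.
def systolic_matmul(A, B):
    n = len(A)
    c = [[0] * n for _ in range(n)]          # per-cell accumulators
    a_reg = [[0] * n for _ in range(n)]      # A value currently held in each cell
    b_reg = [[0] * n for _ in range(n)]      # B value currently held in each cell
    for t in range(3 * n - 2):               # enough ticks for every meeting
        a_in = [[0] * n for _ in range(n)]
        b_in = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j > 0:
                    a_in[i][j] = a_reg[i][j - 1]   # from the left neighbor
                elif 0 <= t - i < n:
                    a_in[i][j] = A[i][t - i]       # skewed input stream
                if i > 0:
                    b_in[i][j] = b_reg[i - 1][j]   # from the top neighbor
                elif 0 <= t - j < n:
                    b_in[i][j] = B[t - j][j]       # skewed input stream
                c[i][j] += a_in[i][j] * b_in[i][j] # multiply-accumulate
        a_reg, b_reg = a_in, b_in            # data moves one cell per tick
    return c

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

A similar array, with the filter coefficients held in place while samples stream through, gives the convolution/FIR arrangement mentioned above.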
- 25 Jan
Not a lecture summary, but a useful example: Linda Trapezoid.
- 24 Jan
- Continuing with MPI; starting to look at dataflow and systolic array models of processing.
- 19 Jan
Aside: some background for MPI
- Processes in Unix/Linux/Mac
- Creating new processes: fork() call
- Each process has its own "address space"
- Processes can be monitored (each has an id)
- Processes communicate with each other ... how? NO RECOMMENDED WAY TO DO THIS! (Can use files, "shared memory segments", pipes, networking -- see the fork/pipe sketch after this list)
- Processes are "heavyweight" (consume a lot of resources)
- Maximum practical number of processes on one computer is therefore small.
- Process scheduling is part of OS kernel, which can take advantage of multicore for parallel running of processes
- POSIX Threads are a lighter-weight idea than processes
- supported by "pthreads" library (not system kernel)
- can still take advantage of multi-processor/multi-core hardware
- multiple threads can share same address space
- threads library has methods for synchronization (locks, queues, etc) -- see the threading sketch after this list
- built-in to Java, library in Python, etc
- note: thread idea doesn't extend to MULTIPLE computers (say, in a LAN) -- only per machine
- PVM and MPI are older libraries for parallel processing (but still being used!)
- initially for processes (but later, can do more than that)
- processes communicate by messages (sent & received by library calls) -- see the mpi4py sketch after this list
- libraries exist for supercomputer hardware
- well-developed for numerical problems (Fortran, BLAS, etc)
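To make the process bullets above concrete, here is a minimal fork-plus-pipe sketch in Python (my example, Unix/Linux/Mac only): the parent creates a pipe, forks a child with its own copy of the address space, and the two communicate through the pipe -- one of the mechanisms listed above.

```python
# fork()/pipe sketch; illustration only, Unix-like systems.
import os

r, w = os.pipe()                 # a pipe: read end and write end
pid = os.fork()                  # from here on, two processes are running

if pid == 0:                     # child: separate address space, own id
    os.close(r)
    os.write(w, b"hello from child %d" % os.getpid())
    os.close(w)
    os._exit(0)
else:                            # parent: can monitor the child by its pid
    os.close(w)
    print(os.read(r, 100).decode())
    os.close(r)
    os.waitpid(pid, 0)           # wait for (reap) the child
```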
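A matching sketch of the threads idea, using Python's threading library (again my example, not course code): all threads share one address space, so the shared counter below is guarded by a lock, one of the synchronization tools mentioned above.

```python
# Threads-with-a-lock sketch; illustration only.
import threading

counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:               # without the lock, read-modify-write can race
            counter += 1

threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # 40000: every increment accounted for
```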
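Finally, a taste of MPI-style message passing, using the mpi4py Python bindings (my choice for illustration -- the course may well use the C or Fortran interfaces instead). Processes are identified by rank within the job and communicate only through send/receive library calls; run it with something like `mpiexec -n 2 python ping.py` (filename hypothetical).

```python
# mpi4py point-to-point sketch; assumes the mpi4py package is installed.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()           # this process's id within the MPI job

if rank == 0:
    comm.send({"msg": "hello"}, dest=1, tag=0)   # message out via library call
    print("rank 0 sent a message")
elif rank == 1:
    data = comm.recv(source=0, tag=0)            # message in via library call
    print("rank 1 got:", data)
```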
- 17 Jan
Introduction to parallel computing as a topic and how it is motivated by current trends in graphics (GPUs), multicore architecture, Cloud Computing, and Big Data. Motivations for Big Data include scientific sources (like astronomy, genomics, etc), social networks, search engines, business analytics, electronic commerce with data warehousing, and military and surveillance applications. Many traditional forms of parallel computing are relevant and we will cover a few of them. For exercises, we propose to use Hadoop and MapReduce; later in the semester we can look at the limits of distributed and parallel systems, for instance the !CAP theorem. Notes: 17jan.pdf