- 4 May
Continued slides from student projects. Reminder that the final report is due. What about the AWS clusters?
- 2 May
Slides from student projects. A very brief mention of Google's Spanner database.
- 27 Apr
Discussion: project status, presentations next week. The lecture topic is orchestration of containers, mainly looking at Kubernetes.
- 25 Apr
A look at orchestration. Related to orchestration is the scripted maintenance and management of configurations, as seen in packages like Chef, Puppet, and Ansible. A very brief view of Facebook's configuration management. The last part of class is a quiz.
- 20 Apr
Solution to puzzle. A brief look at recent research in Cloud Computing. Announcements: upcoming Assignment and Quiz, description of Final Report on Project.
- 18 Apr
A puzzle on how to parallelize using Spark. Also, a first look at AWS Lambda.
- 13 Apr
- 11 Apr
Some announcements about clusters and projects. Then a look at Management of Parallelism.
- 6 Apr
- 4 Apr
Project status (some students have not progressed beyond the extract, transform, load phase). Heading toward some typical cloud data software concepts, the lecture reviews head-of-line blocking, locks, database transactions, ACID, and the two-phase commit protocol. These topics build up to the infrastructure of parallel and distributed databases, typically thought of as SQL databases. Because the scalability of these techniques remains an open performance question, some people advocate a "No SQL" way to organize data: the NoSQL kind of data repository.
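As a rough illustration of the two-phase commit protocol mentioned above, here is a minimal single-process sketch. The `Participant` class, its `can_commit` flag, and the `two_phase_commit` function are hypothetical names invented for this example; a real implementation would also need durable logging, timeouts, and recovery, none of which is shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Participant:
    """Hypothetical participant holding one part of a distributed transaction."""
    name: str
    can_commit: bool = True  # assumption: stands in for local lock and validation checks
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local work could be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Coordinator logic: commit everywhere or abort everywhere."""
    # Phase 1 (voting): ask every participant to prepare and record its vote.
    votes = [p.prepare() for p in participants]
    # Phase 2 (completion): commit only on a unanimous yes.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

The all-or-nothing outcome is what supplies atomicity (the "A" in ACID) across machines; the cost is that every participant waits on the slowest vote, which is one source of the scalability concern noted above.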
- 30 Mar
Brief discussion of Quiz. Performance studies (RDD vs Dataframe). Upcoming next assignment. Expanded office hours. Bucket permissions. Launching cluster for the class. Some traditional cloud topics at a glance.
- 28 Mar
Performance studies. The importance of structure in data (for efficiency). Quiz.
- 23 Mar
Continue with aspects of the new assignment, plus cover some missed topics on the Spark page. Also, some ideas about the direction of the project after getting familiar with the data.
- 21 Mar
New assignment: each student starts the project this week by collecting and understanding the data. Some Public Datasets are distributed as snapshots and may reside in a different AWS region than the EMR cluster, so it is necessary to create an EC2 instance and mount the snapshot as a volume. Some ideas are on the Spark Setup page. Instead of using AWS, some students will be using Google Cloud resources. Here's a page, Google Cloud, about using gcloud.
- 9 Mar
A quick look at Data Analytics, which relates to the project assignment (now posted on ICON; each student chooses some "large" dataset, possibly from Public Datasets). Then a brief tour of Spark's ML package, added to the Spark page. News: Google Compute now has a free tier, so AWS isn't the only place you can go for practice in the cloud.
- 7 Mar
- 2 Mar
- 28 Feb
Continue with Spark, illustrated through some in-class exercises. A new assignment using Spark on the practice machine is announced on ICON, due 8th March.
- 23 Feb
Starting on Spark: exploring the Spark API, a few examples, and an interactive demonstration.
- 21 Feb
Finish up the topic of Google's original MapReduce. Then, in preparation for the next topic (Spark), a digression: the influence of functional programming on cloud computing and parallel computing styles. The lecture will sample a few historical topics and programming languages that may have influenced Spark's design. Has everyone finished the current homework?
- 16 Feb
- 14 Feb
Announcements (engineering career fair today, upcoming UICC'17, next EMR cluster). Explaining the Hadoop File System (HDFS) by looking at the Google File System; some background motivation is machine and datacenter architecture, including "numbers every programmer should know". New assignment for today on ICON: submit notes on this lecture (see 14FebNotes assignment on ICON). Notes by student: 14Feb.pdf.
- 9 Feb
New homework due 21 February; see the announcement on ICON. The lecture demonstrates use of Amazon AWS, the dashboard, Amazon S3 Buckets, and launching and running an EMR cluster to do Hadoop jobs (with luck; it will depend on network connectivity, Amazon AWS, etc.).
- 7 Feb
Continuing with asynchronous parallel computing (e.g., barriers, multicast, the Message Passing Interface). Looking at the low-level nature of how MPI works, with some examples of MPI programs. Ideas from systolic arrays even influence the structure of MPI programs, e.g., the algorithms in these notes. Whereas MPI is "lower level" (the programmer has to specify which messages are sent and received, and handle process control), some "higher level" libraries focus on data patterns and leave process control to the system (for instance, the Linda coordination language).
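To make the barrier idea concrete, here is a small sketch using Python threads and the standard library's `threading.Barrier`. It is only an analogy for the `MPI_Barrier` collective, not MPI itself; the function `run_phases` and the two-phase structure are assumptions made for this example.

```python
import threading
from typing import List, Tuple


def run_phases(num_workers: int = 4) -> List[Tuple[str, int]]:
    """Run two phases on several threads, separated by a barrier."""
    barrier = threading.Barrier(num_workers)
    log: List[Tuple[str, int]] = []
    log_lock = threading.Lock()  # protects the shared log

    def worker(rank: int) -> None:
        with log_lock:
            log.append(("phase1", rank))
        # No worker passes this point until all workers have reached it.
        barrier.wait()
        with log_lock:
            log.append(("phase2", rank))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

The order of ranks within a phase varies from run to run, but every `phase1` entry is logged before any `phase2` entry, which is the guarantee a barrier provides.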
- 2 Feb
Answering questions on the current homework. Then, the main theme of the lecture is going from synchronous PRAM models to network models and asynchronous computation. 2Feb.pdf
- 31 Jan
- 26 Jan
Continuing remote login notes (see 24Jan.pdf, which has been revised) and running Hadoop. Some explanation of MapReduce concepts. First MapReduce homework assignment, also a reading assignment. Notes: 26Jan.pdf
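The MapReduce concepts can be sketched in plain Python with the classic word-count example. This is a single-machine illustration, not Hadoop code and not necessarily the homework problem; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    # Map: emit a (key, value) pair for every word in one input record.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all values by key (the framework does this between phases).
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Reduce: combine all values that share a key.
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())
```

In a real cluster the map calls run in parallel on different input splits and the reduce calls run in parallel on different keys; only the shuffle requires moving data between machines.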
- 24 Jan
- 19 Jan
PRAM details (CRCW vs CREW), some example algorithms. 19Jan.pdf
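As one example of a PRAM-style algorithm, here is a sequential simulation of a parallel prefix sum (the Hillis-Steele scan). It may not be one of the algorithms covered in lecture; the function name `crew_prefix_sum` is an assumption for this sketch. In each round, cell `i` is read by both processor `i` and processor `i + step` but written only by processor `i`, so the access pattern fits the CREW model (concurrent read, exclusive write).

```python
from typing import List, Sequence, Tuple


def crew_prefix_sum(values: Sequence[int]) -> Tuple[List[int], int]:
    """Simulate a synchronous PRAM prefix sum; return (result, rounds used)."""
    x = list(values)
    n = len(x)
    step = 1
    rounds = 0
    while step < n:
        # Snapshot models the synchronous step: all reads happen before any write.
        old = x[:]
        # Conceptually, processor i performs this update in parallel with the others.
        x = [old[i] + old[i - step] if i >= step else old[i] for i in range(n)]
        step *= 2
        rounds += 1
    return x, rounds
```

For example, `crew_prefix_sum([1, 2, 3, 4, 5])` returns `([1, 3, 6, 10, 15], 3)`: with one processor per element the number of rounds grows as the logarithm of the input length rather than linearly.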
- 17 Jan