4 May

Continued slides from student projects. Reminder of final report due. What about the AWS clusters?

2 May

Slides from student projects. A very brief mention of Google's Spanner database.

27 Apr

Discussion: project status, presentations next week. The lecture topic is orchestration of containers, mainly looking at Kubernetes.

25 Apr

A look at Orchestration (computing). Related to orchestration is the scripted maintenance and management of configurations, as seen in packages like Chef (software), Puppet (software), and Ansible (software). A very brief view of Facebook's configuration management. The last part of class is for a Quiz.

20 Apr

Solution to puzzle. A brief look at recent research in Cloud Computing. Announcements: upcoming Assignment and Quiz, description of Final Report on Project.

18 Apr

A puzzle on how to parallelize using Spark. Also, a first look at AWS Lambda.

13 Apr

The Virtual World of Clouds

11 Apr

Some announcements about clusters and projects. Then a look at Management of Parallelism.

6 Apr

More about data in the cloud, introducing the concepts of Shard (database architecture), Wide column store, NoSQL, CAP theorem, and some of the software systems that use these ideas in practice.
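As an illustration of the Shard idea mentioned above, here is a minimal sketch of hash-based sharding. The function names `shard_for` and `partition` are hypothetical, chosen only for this example; real systems typically add replication and rebalancing, which are not shown.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a string key to a shard index in the range [0, num_shards)."""
    # A cryptographic digest gives the same result on every machine and run,
    # unlike Python's built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Distribute keys into num_shards lists according to shard_for."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards
```

One consequence of this simple modulo scheme is that changing `num_shards` moves most keys to a different shard, which is one reason practical systems often prefer other placement schemes.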

4 Apr

Project status (some students have not progressed beyond the Extract, transform, load phase). Heading toward some typical cloud data software concepts, the lecture reviews Head-of-line blocking, Lock (computer science), Database transaction, ACID, and the Two-phase commit protocol. These topics build up to the infrastructure of parallel and distributed databases typically thought of as SQL databases. How well these techniques scale remains an open performance question, so some people advocate a "No SQL" way to organize data: the NoSQL kind of data repository.
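The Two-phase commit protocol reviewed in this lecture can be sketched in a few lines. This is a single-process toy, not a fault-tolerant implementation: the `Participant` class and `two_phase_commit` function are hypothetical names for this example, and timeouts, logging to stable storage, and coordinator failure are all left out.

```python
class Participant:
    """A resource manager that votes in phase 1 and obeys the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes (and promise to commit if asked) or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run both phases as the coordinator and return the global decision."""
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]

    # Phase 2 (decision): commit only if every vote was yes.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

A single "no" vote aborts the transaction everywhere, which is what gives the protocol its all-or-nothing behavior, and the extra round of messages is part of the scalability cost.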

30 Mar

Brief discussion of Quiz. Performance studies (RDD vs Dataframe). Upcoming next assignment. Expanded office hours. Bucket permissions. Launching cluster for the class. Some traditional cloud topics at a glance.

28 Mar

Performance studies. The importance of structure in data (for efficiency). Quiz.

23 Mar

Continue with aspects of the new assignment, plus cover some missed topics on the Spark page. Also, some ideas about direction of the project after getting familiar with the data.

21 Mar

New assignment: each student starts the project this week by collecting and understanding the data. Some Public Datasets are in snapshots, and may be in AWS regions different from that of an EMR cluster, so it is necessary to create an EC2 instance and mount the snapshot as a volume. Some ideas are on the Spark Setup page. Instead of using AWS, some students will be using Google Cloud resources. Here's a page, Google Cloud, about using gcloud.

9 Mar

A quick look at Data Analytics, which relates to the project assignment (now posted on ICON; each student chooses some "large" dataset, possibly from Public Datasets). Then a brief tour of Spark's ML package, added to the Spark page. News: Google Compute now has a free tier, so AWS isn't the only place you can go for practice in the cloud.

7 Mar

Initial description of Spark dataframes and more about Public Datasets. Another example was added to SparkExamples for dataframes.

2 Mar

Continue with Spark, and with SparkExamples. Upcoming assignment will involve Public Datasets.

28 Feb

Continue with Spark, illustrated through some in-class exercises. A new assignment using Spark on the practice machine is announced on ICON, due 8th March.

23 Feb

Starting on Spark, exploring the Spark API, a few examples, and interactive demonstration.

21 Feb

Finish up topic of Google's original MapReduce. Then, in preparation for the next topic (Spark), a digression: the influence of functional programming on cloud computing and parallel computing styles. The lecture will sample a few historical topics and programming languages, which may have influenced Spark's design. Has everyone finished the current homework?

16 Feb

The architecture of the original MapReduce built upon Google File System; how MapReduce can do sorting, how it is fault-tolerant and can be used in multistep jobs.
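To make the map, shuffle, and reduce steps concrete, here is a single-process sketch of word count. It is an illustration under simplifying assumptions, not Google's implementation: there is no partitioning across machines and no fault tolerance, and all function names are hypothetical. The sorted output comes from the shuffle step, which groups values by key and orders the keys.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer to each (key, list of values) group."""
    return [reducer(key, values) for key, values in groups]


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in the line."""
    return [(word, 1) for word in line.split()]


def word_count_reducer(word, counts):
    """Sum the counts emitted for one word."""
    return (word, sum(counts))


def run_word_count(lines):
    """Chain the three phases into one job."""
    pairs = map_phase(lines, word_count_mapper)
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, word_count_reducer)
```

A multistep job can be sketched the same way by feeding the output of one `reduce_phase` call into the `map_phase` of the next.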

14 Feb

Announcements (engineering career fair today, upcoming UICC'17, next EMR cluster). Explaining the Hadoop File System (HDFS) by looking at the Google File System; some background motivation is machine and datacenter architecture, including "numbers every programmer should know". New assignment for today on ICON: submit notes on this lecture (see 14FebNotes assignment on ICON). Notes by student: 14Feb.pdf.

9 Feb

New homework due 21 February, see announcement on ICON. The lecture demonstrates use of Amazon AWS, the dashboard, Amazon S3 Buckets, launching and running an EMR cluster to do Hadoop jobs (with luck; it will depend on network connectivity, Amazon AWS, etc.).

7 Feb

Continuing with asynchronous parallel computing (e.g. Barrier (computer science), multicast, Message Passing Interface). Looking at the low-level nature of how MPI works. Some examples of MPI programs. Ideas from Systolic array even influence the structure of MPI programs, e.g. algorithms in these notes. Whereas MPI is "lower level" (the programmer has to specify which messages are sent and received, and handle process control), some "higher level" libraries focus on data patterns and leave process control to the system (for instance, the Linda (coordination language)).
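The barrier idea can be illustrated without an MPI installation. The sketch below uses Python threads and the standard library's `threading.Barrier` as a stand-in for MPI processes, so it shows the synchronization pattern only, not message passing; `run_with_barrier` is a hypothetical name for this example.

```python
import threading


def run_with_barrier(values):
    """Each worker squares its own value, waits at a barrier, then sums all squares."""
    n = len(values)
    barrier = threading.Barrier(n)
    partial = [None] * n  # written in step 1, one slot per worker
    totals = [None] * n   # written in step 2, one slot per worker

    def worker(rank):
        # Step 1: local computation on this worker's own data.
        partial[rank] = values[rank] * values[rank]
        # No worker continues until every worker has finished step 1.
        barrier.wait()
        # Step 2: safe to read results produced by the other workers.
        totals[rank] = sum(partial)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

Without the barrier, a fast worker could read `partial` while some slots are still unset; the barrier is what separates the asynchronous steps.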

2 Feb

Answering questions on the current homework. Then, the main theme of the lecture is going from synchronous PRAM models to network models and asynchronous computation. 2Feb.pdf

31 Jan

No Class

26 Jan

Continuing remote login notes (see 24Jan.pdf; it has been revised) and running Hadoop. Some explanation of MapReduce concepts. First MapReduce homework assignment, also a reading assignment. Notes: 26Jan.pdf

24 Jan

Some very practical information about commands and remote login (24Jan.pdf), in preparation for starting coverage of MapReduce.

19 Jan

PRAM details (CRCW vs CREW), some example algorithms. 19Jan.pdf

17 Jan

Course on ICON and Syllabus here. First assignment on ICON, due 18 January. Lecture topic will be introduction to models of parallel computing. One student's notes: 17Jan.pdf

Lectures (last edited 2017-05-04 15:31:04 by Ted Herman)