Homework 4

From 118wiki

Jump to: navigation, search

Contents

Homework 4

Homework 4 is a programming task to illustrate concurrent (parallel) programming using threads and Linda. The homework has two parts (there are two programs to write).

Instructions

This homework is due by Midnight (11:59pm) on Thursday 2 November, 2006.

Submit your solution using the submit command on one of the department's Linux machines to the course directory. (Note: if you don't already have an account on the departmental machines, I'll grant you an account for this course; please contact me to get your account.) Here are instructions on how to submit:

  1. In your directory, rename, move, or delete any directory named homework4
  2. Download media:Homework4.zip, and then execute "unzip Homework4.zip". This should create a directory called Homework4. You will modify files in that directory to complete your homework.
  3. When your work is ready, run the submit script from your home directory by typing submit at the command prompt and hitting return/enter.
  4. When prompted for the File/directory name, enter homework4
  5. When prompted for the course put c118
  6. Next you will see a list of possible submission directories. For this assignment, reply homework4
  7. Done! For further help, please review the submit proceedure provided by CSG: submit.html
  8. Note: case does matter.
  9. Note: submitting the same homework assignment multiple times will overwrite the previously submitted homework with the same name and submit location.

Threaded Linda Programs

The assignment asks you to solve two problems using Linda as the communication method between threads. Essentially, the assignment simulates network hosts in that each thread represents a network host. Correct programs for the assignment will have these characteristics:

  • The program will work with any reasonable number of threads. That means you can test with 4 threads, but then change the number of threads to some larger number (like 11 threads) and it will also work, without having to make any changes in the rest of the logic of your code.
  • Threads do not perform redundant work; so if thread 5 contributes some calculation to the end result, another thread does not also do the same contribution.
  • Threads only communicate with each other using the tuplespace. Thus, no thread gets any special "initialization" data when it is created, except perhaps a handle to the tuplespace. Further, no thread uses Python's global statement, except perhaps to get access to the tuplespace. No sockets or other ways of communication are permitted: threads only communicate using Linda operations.
  • Every thread participates in the computation. It is "cheating" to let one thread do the entire program and define all other threads to do nothing. You can, however, designate that some threads do a different kind of work than other threads (this is implicit in specialist approaches to parallelism, or even the master-worker pattern). You can specify what kind of work a thread should do when your code creates that thread (for instance, as a parameter to "__init__").

The Homework4 directory that you download contains Linda.py and testLinda.py.

Advice on Learning/Debugging

Try the testLinda.py program, and try some simple modifications of this program, such as reducing the amount of numbers from 1,000 to 20. The "T.Dump()" method is useful for debugging if the tuplespace only contains a few tuples. You can also customize the Dump method, or create a "Dump2" method that returns selected information useful to your own debugging.

Study testLinda.py to learn how the Out, In, and Rd operations work. Copy parts of this to build upon for your own solutions.

Parallel Pattern Matching

The motivation for this problem is finding all the places that a given string (representing a string of proteins) matches a longer string (representing a DNA sequence). The original input to the problem therefore consists of two strings, for example:

ttaaggaccccatgccctcgaataggcttgagcttgccaattaacgcgcacgggct
cggat

The question is how many times the string "cggat" occurs in the longer, first input string. Probably this is a simple task in Python using the regular expression module, but remember we are trying to simulate parallel processing where the length of the input string could be millions of characters.

Included in Homework4 is a program "inputDNA.py" that represents two input strings as tuples, where each tuple has just one character. You can use inputDNA.py to create sample inputs for your program (here's a demo that it works):

 >python
 Python 2.4.1 (#1, Apr 10 2005, 22:30:36) 
 [GCC 3.3.5] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> from inputDNA import *
 >>> print TuplesB
 [('mat', 0, 'g'), ('mat', 1, 'g'), ('mat', 2, 'a')]

The program inputDNA.py creates TuplesA and TuplesB, which you can combine as part of the initial contents of the tuplespace for the assignment.

Cluster Analysis

The second problem is to take some "relationship" data, which is a list of persons and how they are related (by kinship, by marriage) and calculate the number of independent clans in the input. To simplify the input, each item of relationship data is a pair: the tuple ("related","bob","joe") indicates that "bob" is related to "joe". Two individuals are in the same clan if they are (transitively) related.

The file inputClans.py creates a list of relationships you can use for input:

 >python
 >>> from inputClans import *
 >>> print RTuples
 [('related', 'AEthelwulf', 'Athelstan'), ('related', 'AEthelwulf',
 'Osburga'), ('related', 'Athelstan', 'Osburga'), ('related', 'AElflaeda',
 'Edwin'), ('related', 'EdwardElder', 'Edwin'), ... ]
Personal tools