Parallel Programming

From 118wiki

Parallel Program Design

Why Parallel?

  • programs that run faster
  • some problems are too large to fit in one computer, so the best way to solve them is to use multiple computers
  • sometimes problem data are scattered over many hosts, so it is natural to put computation close to where the data are

How to Program for Parallel Computation?

  • could use a special-purpose language and compiler (not very common)
  • use ordinary languages plus a parallel programming API and library

Machinery for Parallel Computation

  • high-performance supercomputers (SIMD machines)
  • clusters of workstations on a LAN
  • some graphics cards (GPUs, graphics processing units) have instructions for parallel operations

Taxonomy

  • synchronous parallel computation: all the participating computers run at identical speeds
  • asynchronous parallel computation: the participating computers can run at different speeds, even changing speeds during the computation

Paradigms for Parallel Programs

There are many patterns, or strategies, in the design of parallel programs. Here, just three paradigms are mentioned.

Result-Oriented Design

The idea is to work backward from what is being computed. For instance, to compute

f(g(x),h(y,x)) where
  g(x) = r(x,m(x,log(x)))
  h(y,x) = a(e(y),t(x,y))

the program has a structure such as

               f
              / \
             /   \
            /     \
           /       \
          /         \
         g           h
        /            | 
       r             a 
      / \           / \
     /   \         /   \
    x     m       e     t
         / \      |    / \
        x   log   y   x   y
             |
             x

For parallel computation, the idea is to start at the leaves of this tree (different leaves can be processed in parallel) and work upward as the computation proceeds. In the example above, opportunities for parallelism are greater at the leaves than near the root.

Standard scientific and graphics operations (with vectors and matrices) can be solved in a similar way. When the problem is to calculate a batch of results (for example, multiply many matrices by the same vector), result-oriented parallel computation can sometimes be pipelined (see pipeline (software)). (Note: the idea of pipelining also occurs in HTTP, where a client can send multiple requests to the server without waiting for responses between the requests.)
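
To make this concrete, here is a small Python 3 sketch that evaluates the tree above bottom-up, running the independent leaf subtrees in parallel with a thread pool. The bodies of m, r, e, t, a, and f are placeholder stand-ins, since the original example leaves them abstract:

```python
import concurrent.futures
import math

# Placeholder definitions for the abstract functions in the example.
def m(x, logx): return x + logx
def r(x, mx):   return x * mx
def e(y):       return y + 1
def t(x, y):    return x - y
def a(ey, txy): return ey * txy
def f(gx, hyx): return gx + hyx

def parallel_eval(x, y):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        # Independent subtrees start at the leaves and run in parallel.
        fut_m = pool.submit(m, x, math.log(x))
        fut_e = pool.submit(e, y)
        fut_t = pool.submit(t, x, y)
        gx  = r(x, fut_m.result())               # g(x) = r(x, m(x, log(x)))
        hyx = a(fut_e.result(), fut_t.result())  # h(y,x) = a(e(y), t(x,y))
        return f(gx, hyx)
```

Note that g's subtree and h's subtree proceed independently; only the final call to f must wait for both, which matches the observation that parallelism is greatest at the leaves.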

Specialist-Oriented Design

Think of a crew building a house. Some of the crew work on framing, some on plumbing, some on electrical systems, and so on. These specialists work in parallel when the opportunity arises. Sometimes one specialist waits on another, sometimes a specialist creates work for another specialist.

A similar idea can work in a multithreaded or multihost parallel program. Different threads can have specialized functions in the design, and shared data structures and/or messages are used to communicate work needed, or results computed.
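
As a rough illustration (the two specialties and the queue layout are hypothetical, not from the original), here is a Python 3 sketch in which a "framer" thread creates work that a "plumber" thread consumes through a shared queue:

```python
import queue
import threading

frames = queue.Queue()   # work handed from the framer to the plumber
done = []

def framer(jobs):
    for job in jobs:
        frames.put("framed " + job)   # framing creates work for the plumber
    frames.put(None)                  # signal: no more framing work

def plumber():
    while True:
        item = frames.get()
        if item is None:              # the framer has retired
            break
        done.append(item + ", plumbed")

t1 = threading.Thread(target=framer, args=(["house A", "house B"],))
t2 = threading.Thread(target=plumber)
t1.start(); t2.start()
t1.join(); t2.join()
```

The queue is the "shared data structure" of the pattern: the plumber waits when the framer has produced nothing yet, and each specialist's code knows nothing about the other's internals.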

Agenda-Oriented Design

Think of the generic job of tailoring 500 suits for many customers. The steps in tailoring a suit include measuring a customer, cutting cloth, sewing, testing a mock-up on the customer, then possibly again cutting, sewing, testing, and a final assembly and test. Now imagine there is a basket with slips of paper, and on each slip of paper is written some task to be done. Initially, there is one slip of paper with the task "tailor 500 suits". A first step could be to remove the slip of paper and replace it with 500 slips of paper, each of the form "tailor suit i" (for i in range(500)). A subsequent step removes "tailor suit 37", for example, and replaces that with "measure customer for suit 37". This process continues; sometimes a slip removed from the basket does not result in a new slip being added; other times a slip removed results in two or more new slips being added.

Now think of how parallel threads could take the role of tailors in the example. The life of each thread is to grab any slip from the basket, do the work specified, and when finished, possibly put new slip(s) in the basket; then the thread looks for another slip.

Notice that the number of threads is not important; no thread is specialized; if you add more threads, the computation may run faster.
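
The basket of slips maps naturally onto a shared work queue. Here is a Python 3 sketch (the slip names and the count of 10 suits are made up for illustration) in which identical worker threads repeatedly grab a slip, do the work, and possibly add new slips:

```python
import queue
import threading

agenda = queue.Queue()      # the basket of slips
results = []
res_lock = threading.Lock()

def tailor():
    while True:
        kind, i = agenda.get()
        if kind == "tailor suit":            # this slip creates a new slip
            agenda.put(("measure customer", i))
        else:                                # "measure customer": do the work
            with res_lock:
                results.append(i)
        agenda.task_done()

for _ in range(3):                           # any number of identical workers
    threading.Thread(target=tailor, daemon=True).start()
for i in range(10):
    agenda.put(("tailor suit", i))
agenda.join()                                # wait until the basket is empty
```

Changing `range(3)` to any other worker count leaves the result unchanged, which is the point of the pattern: no thread is specialized, and more threads may simply finish the agenda sooner.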

Example Problem and Solution

The problem is to print out all the prime numbers between 2 and 999 (inclusive).

Recall that a number greater than 1 is prime if no smaller number (except 1) divides it with no remainder. One method for determining whether a number k is prime would be brute force:

 def isPrime(k):
   for i in range(2,k):
     if k%i == 0: return False
   return True

But there are some ways to improve this. First, we don't need to try all numbers smaller than k in the division test; we only need to try numbers up to (approximately) the square root of k. Second, it is enough to test division by prime numbers only, since any composite divisor would imply a smaller prime divisor (the code below assumes a list Primes of the primes found so far).

 import math
 def isPrime(k):
   limit = math.sqrt(float(k))
   limit = int(math.ceil(limit))
   limit += 1
   for i in range(2,limit):
     if i not in Primes: continue  # only test division by known primes
     if k%i==0:  return False
   return True

Now we have the basic idea for testing whether a number is prime. To generate the primes between 2 and 999, the following is a sequential (non-parallel) program:

 # subroutine tests one number
 import math
 def isPrime(x): 
   global Primes
   limit = math.sqrt(float(x))
   limit = int(math.ceil(limit))
   limit += 1
   for i in range(2,limit):
     if i not in Primes: continue
     if x>i and x%i==0:  return False
   return True
 # main program tests all numbers
 Primes = []
 for j in range(2,1000):
   if isPrime(j):  Primes.append(j)
 print Primes

How can this be parallelized? Let's not worry about that yet; instead, let's ask how to express this using the agenda-oriented pattern and Linda's tuplespace operations. We will have a main thread (the "Master") and some other threads (the "workers"). The Master will initialize the tuplespace with some work (we don't have any input for this problem). We can represent work to be done with a tuple

 ("testPrime",i)

where i is the number to be tested. A worker will test a number to see if it is prime. We also need some other tuples to represent the output from a worker. Let

 ("prime",k,True) 

be a tuple that a worker adds to the tuplespace if k is prime; otherwise the worker adds

 ("prime",k,False)

to the tuplespace. Now, the basic form of the Master is:

 import Linda
 import threading
 T = Linda.ThreadingTupleSpace()  # initially empty tuplespace
 for i in range(2,1000):  T.Out( ("testPrime",i) )
 ... here would be some code to start worker threads
 for i in range(2,1000):
    a,b,c = T.Rd( ("prime",i,None) )
    if c:  print b, " is prime "
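
The Linda module used here is course-provided rather than a standard library. As a rough Python 3 sketch of what a compatible ThreadingTupleSpace might look like (an assumption inferred from the Out, Rd, and In calls in this example, with None acting as a wildcard):

```python
import threading

class ThreadingTupleSpace:
    def __init__(self):
        self.tuples = []
        self.cond = threading.Condition()

    def _match(self, pattern):
        # None in a pattern matches any value in that position.
        for t in self.tuples:
            if len(t) == len(pattern) and all(
                    p is None or p == v for p, v in zip(pattern, t)):
                return t
        return None

    def Out(self, tup):          # add a tuple and wake blocked readers
        with self.cond:
            self.tuples.append(tup)
            self.cond.notify_all()

    def Rd(self, pattern):       # block until a match; leave it in the space
        with self.cond:
            while self._match(pattern) is None:
                self.cond.wait()
            return self._match(pattern)

    def In(self, pattern):       # block until a match; remove it
        with self.cond:
            while self._match(pattern) is None:
                self.cond.wait()
            t = self._match(pattern)
            self.tuples.remove(t)
            return t
```

Rd blocks until a matching tuple exists and leaves it in place, which is exactly why the Master's loop above implicitly waits on the workers.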

Notice that, implicitly, the Master waits on workers to finish, because "T.Rd( ... )" will block, waiting for a matching tuple to be in the tuplespace. Now let's consider what a worker does in the run method (of the thread):

 def run(self):  
   global T
   m,n = T.In( ("testPrime",None) )
   r = isPrime(n)
   T.Out( ("prime",n,r) )

That looks easy, but it won't work because the way isPrime() is defined above depends on a list Primes, which we don't have. That list Primes has to be replaced by tuplespace operations.

 import math
 def isPrime(x): 
   global T
   if x<=2: return True
   limit = math.sqrt(float(x))
   limit = int(math.ceil(limit))
   limit += 1
   for i in range(2,limit):
     v = T.Rd( ("prime",i,None) )
     if not v[2]: continue
     if x>i and x%i==0:  return False
   return True

How does this all work? Consider a worker testing 103 to see if that is a prime number. It calls isPrime(103), which then tries to read ("prime",2,None), then later tries to read ("prime",3,None), and so on. This worker will be stuck until other workers have put the required tuples into the tuplespace. It's not too difficult to show that if the Master creates enough worker threads, say 998 workers, then the computation will work correctly.

However, there is a potential bug: if the Master creates only 20 workers (for example), and if these workers all happen to start by getting ("testPrime",d) for d greater than 2, then none of the workers will ever finish. They will all be waiting for ("prime",2,None), which will never be added to the tuplespace.

Can we correct this? The problem is that the Master created all the testPrime tuples at the very beginning, before there was any opportunity for parallel computing. Here's a simple change to correct that:

 import Linda
 import threading
 T = Linda.ThreadingTupleSpace()  # initially empty tuplespace
 T.Out( ("testPrime",2) )
 ... here would be some code to start worker threads
 for i in range(2,1000):
    a,b,c = T.Rd( ("prime",i,None) )
    if c:  print b, " is prime "

Notice now that the Master only creates one initial agenda item. That means that the workers themselves need to create the other agenda items. Here is a revised worker program:

 def run(self):  
   global T
   while True:
     m,n = T.In( ("testPrime",None) )
     new = n+1
     if new<1000:  T.Out( ("testPrime",new) )
     r = isPrime(n)
     T.Out( ("prime",n,r) )

With this revision, ("testPrime",d) appears in the tuplespace only after some worker has started working on d-1 (and, by induction, after workers have started or finished work on each of 2,...,d-1).

Let's see how it looks, putting everything together; I added some print statements for debugging, and added a time.sleep() call just to see more mixing of the threads during execution:

 import Linda 
 import threading
 import math
 import time
 def isPrime(x): 
    global T
    if x <= 2: return True
    limit = math.sqrt(float(x))
    limit = int(math.ceil(limit))
    limit += 1
    for i in range(2,limit):
      v = T.Rd( ("prime",i,None) )
      if not v[2]: continue
      if x>i and x%i==0:  return False 
    return True
 class worker(threading.Thread):
   def __init__(self):
      threading.Thread.__init__(self)  
   def run(self):
      global T
      print "thread ", self.getName(), " started"
      while True: 
        m,n = T.In( ("testPrime",None) )
        print "thread ", self.getName(), " testing ", str(n)
        new = n+1 
        if new<1000:  
          print "thread ", self.getName(), " adding testPrime ", str(new)
          T.Out( ("testPrime",new) )
        time.sleep(0.01) # wait 1/100 second
        r = isPrime(n)
        print "thread ", self.getName(), " adding prime ", str(n), str(r)
        T.Out( ("prime",n,r) )
 T = Linda.ThreadingTupleSpace()  
 T.Out( ("testPrime",2) )
 for j in range(5):
   t = worker()
   t.start()
 for i in range(2,1000):
   a,b,c = T.Rd( ("prime",i,None) )
   if c:  print b, " is prime "

What happens when this is executed?

thread  Thread-1  started
thread  Thread-1  testing  2
thread  Thread-1  adding testPrime  3
thread  Thread-2  started
thread  Thread-2  testing  3
thread  Thread-2  adding testPrime  4
thread  Thread-3  started
thread  Thread-3  testing  4
thread  Thread-3  adding testPrime  5
thread  Thread-4  started
thread  Thread-4  testing  5
thread  Thread-4  adding testPrime  6
thread  Thread-5  started
thread  Thread-5  testing  6
thread  Thread-5  adding testPrime  7
thread  Thread-1  adding prime  2 True
thread  Thread-1  testing  7
thread  Thread-1  adding testPrime  8
2  is prime 
thread  Thread-2  adding prime  3 True
thread  Thread-2  testing  8
thread  Thread-2  adding testPrime  9
thread  Thread-3  adding prime  4 False

If you'd like to run the whole program, you can download media:Prime.txt. However:

Warning: the program above does not terminate its worker threads; I find that it is difficult to stop the program, even after all numbers have been tested. This happens because the workers block forever inside T.In() waiting for more work, and Python waits for non-daemon threads to finish before the process can exit (usually I need to open another window and use the command "killall python" to force termination).
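
One common workaround is to create the workers as daemon threads, so the interpreter does not wait for them at exit, and to give them a stop flag to poll. Here is a Python 3 sketch (the Event-based polling loop is an assumption about how the workers could be restructured, since the original workers block inside T.In()):

```python
import threading
import time

stop = threading.Event()

def worker():
    # Poll a shared flag instead of blocking forever waiting for work.
    while not stop.is_set():
        time.sleep(0.001)        # stand-in for grabbing and doing a task

# daemon=True means Python will not wait for these threads at exit
threads = [threading.Thread(target=worker, daemon=True) for _ in range(5)]
for t in threads:
    t.start()
# ... the Master's result-reading loop would go here ...
stop.set()                       # tell the workers to quit
for t in threads:
    t.join(timeout=1.0)
```

With daemon threads alone the process can already exit; the explicit stop flag additionally lets the workers shut down cleanly rather than being killed mid-task.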

Other Patterns

During lectures, I covered two common patterns used in parallel or concurrent thread situations.

Fair Mutual Exclusion

Using a lock, you can arrange concurrent thread behavior so that no two threads have access to a resource at the same time; this is classically called the problem of mutual exclusion, and one solution to the problem is to define a lock for the resource you'd like to control. Since a system can have many resources, you can define many locks. Further, there can be specialized kinds of locks (see lock (computer science)).

Some standard requirements of any implementation of a locking mechanism are:

  1. safety -- at most one thread holds the lock at any time; and
  2. fairness -- once a thread requests a lock, it eventually is granted the lock

When fairness does not hold, a thread can "starve", meaning that it could be blocked forever at the lock request, even though other threads acquire and release the lock many times.

We saw in class that safety is easy to express using Linda tuplespace (In and Out operations); fairness is harder to implement (one way to do it is to employ a lock manager thread).
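
The safety half can be sketched without the Linda library: a single "lock" token in a shared queue plays the role of a ("lock",) tuple, with get acting like In and put acting like Out. A Python 3 sketch (the counter workload is made up for illustration):

```python
import queue
import threading

space = queue.Queue()
space.put(("lock",))             # the one and only lock tuple

counter = 0

def worker():
    global counter
    for _ in range(1000):
        token = space.get()      # like In(("lock",)): blocks until available
        counter += 1             # critical section: at most one holder
        space.put(token)         # like Out(("lock",)): release

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because only one token exists, at most one thread is ever inside the critical section (safety). Fairness depends on the order in which blocked get() calls are served, which the pattern itself does not guarantee; that is why a lock manager thread is one way to enforce it.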

When there are multiple resources and multiple locks, threads can get into a situation of deadlock (see the same reference for the term livelock). Two approaches to dealing with deadlock are:

  1. deadlock avoidance -- devise some protocol in the requesting of locks so that a deadlock situation won't ever arise; and
  2. deadlock detection -- add a "monitoring" thread that independently watches the other threads, detecting when they get into a deadlock situation; the monitor can then abort and restart some thread to break the deadlock.

Barrier Synchronization

Barrier synchronization is a term often associated with parallel computations that are arranged in a sequence of phases (in each phase there is parallel activity, but the phases themselves occur in sequential order); see barrier (computer science) and synchronization (computer science). We saw in class that there are several ways to express barrier synchronization using Linda tuplespace operations.
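
Outside of tuplespaces, Python's standard library offers threading.Barrier for exactly this phase structure. A small Python 3 sketch (the three phases and the logging are made up for illustration):

```python
import threading

N = 4
barrier = threading.Barrier(N)      # all N threads must arrive before any proceeds
log = []
log_lock = threading.Lock()

def worker(name):
    for phase in range(3):
        with log_lock:
            log.append((phase, name))   # the "parallel work" of this phase
        barrier.wait()                  # no thread enters phase+1 early

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
phases = [p for p, _ in log]
print(phases == sorted(phases))     # True: no thread ran ahead of a phase
```

Within a phase the four workers interleave in any order, but the barrier guarantees every phase-p entry in the log precedes every phase-(p+1) entry.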
