Databases

From 118wiki

Jump to: navigation, search

Contents

Databases

A course on this topic will explore data modelling, representation, access, relational theories, views, schemas, and lots more. Here we are mainly interested in how network applications use databases and how databases use networking.

The starting point is that we assume the database architecture is given, such as a Database System (DBMS) in which there are entities that control and manage the database, log transactions to the database, support data locking, rollback for incomplete transactions, and so on. Once we understand how DBMS and database transaction work, and the ACID properties of a transaction, we can think about a distributed database implemented via networking.

Transactions

The fundamental unit of state change in a database is a transaction; it exists because a database is recorded on many sectors of a device or many devices, not all of which can be changed in one instant, and because performance motivates concurrent access to a database. Hence, transactions look like this pseudo code:

 Begin Transaction Examp2:
   request read-lock(A);
   request write-lock(B);
   x = read(A);
   write(B = x-1);
   commit();
 End Transaction Examp2;

In the example, the transaction locks two items from the database, A and B; the transaction doesn't explicitly unlock because the commit action does this. The "read lock" allows multiple transactions to concurrently share the lock so long as they are all reading (see Read write lock pattern).

No actual writing to the database occurs until the commit action is executed; the DBMS's transaction manager saves all writes in a cache, deciding whether or not the transaction satisfies the ACID properties before actually writing to the underlying data devices. If a transaction cannot be committed, it is either aborted or restarted, depending on the implementation and other parameters: this helps guarantee that the database state is always consistent.

Two Phase Locking

Transactions can still get inconsistent views of what is in the database due to concurrent updates by other transactions, because the order in which items are read could allow concurrent updates between reads. To prevent this, transactions can follow a protocol for locking, reading and writing. The technique of two-phase locking (2PL), with the "growing" and "shrinking" phases of lock ownership, can guarantee consistent views of the database, in spite of some concurrency. Study Strict two-phase locking to see how this works.

Deadlock Detection

Even 2PL does not prevent deadlocks. The transaction manager can detect deadlocks using a wait-for graph (WFG), in which vertices of the graph are transactions, directed edges represent a "waiting for" relationship between transactions. A cycle in this graph corresponds to a deadlock.

Distributed Transactions

In a networked world, databases do not live in isolation. It becomes reasonable to think that different databases could be logically synchronized, that is, that the content of multiple databases could satisfy some "global" consistency property (think, for example, of the task of transferring electronic money from one bank/database to another bank/database, where consistency implies, among other things, that no money gets lost or magically appears).

To think about this problem, we allow transactions to be split up into subtransactions, one for each site in a distributed database. Individually, each database has its own transaction manager, processing lock requests and commit actions, however now these managers are dealing with subtransactions. The most interesting question is this:

When can a subtransaction safely be committed?

Ideally, we would like to have all the subtransactions of a given transaction instantaneously commit (at the same time, so that the overall transaction satisfies ACID properties). But networks use messages that can be delayed, have unpredicatable latency, and so on, making the ideal of instantaneous global commit unrealistic.

Here is the outline of a solution (assume that we are dealing with a transaction T, with subtransactions T1, T2, and T3):

  1. database site Si runs subtransaction Ti
  2. there is a new process called the coordinator for T
  3. each site Si has an agent for Ti
  4. the coordinator communicates with the agents
  5. database transaction managers have a new interface and a new state during transaction processing called "ready to commit"
  6. when subtransaction Ti executes commit, the manager at Si decides whether Ti can be commited, and tells Si's agent (if the answer is yes, Ti is not yet committed, but Si knows Ti is ready to commit)
  7. when agent at Si learns the outcome of Ti's commit, then Si sends the coordinator a message
  8. the coordinator collects messages from all subtransaction agents
  9. if the coordinator gets a "fail" (failure to commit) message from any agent, then the coordinator sends message to all other agents telling them to cancel their subtransactions
  10. if the coordinator gets "ready" messages from all the agents (a unanimous decision), then the coordinator sends a "do commit" message to all the agents
  11. when an agent gets a "do commit" message, it notifies the transaction manager at that site to go ahead and commit the subtransaction

If all goes well, the scheme outlined above does consistently update even multiple databases. However, if one site crashes (and there are tricky cases, such as cashing during commit processing), or if the coordinator crashes, then the scheme may not work properly. Many research papers study this problem and propose various solutions.

Distributed Deadlock

When transactions involve subtransactions at different databases, deadlock is possible. What's worse is that we can't just use 2PL and ordered locks to avoid deadlock, and there is no single lock manager that can observe a wait-for graph to detect the deadlock. See edge chasing for one idea on how to deal with distributed deadlock.

SQLite

The department has sqlite, a lightweight SQL implementation you can use for simple things. Follow this tutorial and you learn how to create a database, tables, add data and query it. The tutorial uses the interactive, shell interface to the database server. There is also a Python interface, via the pysqlite package. To use sqlite in Python, the command
from pysqlite2 import dbapi2 as sqlite
will enable you to use a Python statement such as
sqlite.connect(database="mydb", timeout=10.0)
as described in the pysqlite manual.

Patterns of Usage

SQL has many powerful facilities for querying data, but our needs are modest in this course. At least we need to know about the basic Create, read, update and delete or "CRUD" pattern, which is the most common pattern for using databases.

Personal tools