NetBSD-SoC: Improved Writing to FileSystem Using Congestion Control

What is it?

Write congestion occurs when the kernel or userland processes generate write requests faster than the underlying filesystem (FS) or disk can service them. It has several undesirable effects: (i) unequal sharing of disk bandwidth between processes, (ii) delayed writes (because many requests pile up in the buffer), and (iii) read requests suffering from the onslaught of write requests. The primary aim of this project is therefore to provide mechanisms that enforce "fairness" and "rate control" among competing processes that issue such write requests.

Status

2006-05-23: Google publishes accepted/rejected projects. Go!
2006-06-26: Google solicits mid-program mentor evaluations of student progress
2006-06-30: All mid-program evaluations of student progress due by 17.00 Pacific Standard Time
2006-08-21: All student projects due by 08.00 Pacific Standard Time
2006-09-05: All mentor and student evaluations due by 08.00 Pacific Standard Time

Deliverables

Mandatory (must-have) components:

Congestion Control Algorithm implemented inside UVM.
Benchmarking and validation of the approach.
Documentation highlighting the algorithm and effectiveness of the approach.

Optional (would-be-nice) components:

Currently, process information is kept on a linked list, which has O(N) traversal. We would like to have a splay tree version in which the most frequently accessed nodes always stay close to the root.
As of now, the algorithm does not receive any feedback from the underlying I/O system. We would like to create a version of the CCA that tunes itself based on information from the underlying layer(s).

Technical Details

The Congestion Control Algorithm (CCA) is available as a hook inside UVM (uvm_cca_update_node(curproc)). Specifically, we trap page requests inside the genfs_putpages() function call. Every process that invokes this function has its requests recorded by the congestion control module. As of now, all statistical information about processes is kept in a linked list. This means that as the number of processes issuing write requests grows, we pay for a linear traversal of the list. In the future, we envision a more efficient data structure, such as a splay tree, so that processes issuing more frequent writes stay close to the root of the tree.
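The bookkeeping described above can be illustrated with a minimal userland sketch. It is not the code behind uvm_cca_update_node(); the names struct cca_node, struct cca_list, cca_lookup(), cca_record_write() and cca_free() are hypothetical and exist only for this example. It shows why the lookup cost grows linearly with the number of tracked processes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-process record; the fields are illustrative only. */
struct cca_node {
	int pid;                /* process identifier */
	unsigned long writes;   /* write requests recorded so far */
	struct cca_node *next;  /* next record in the list */
};

/* Hypothetical list head holding one record per writing process. */
struct cca_list {
	struct cca_node *head;
	size_t len;
};

/* Linear search: O(N) in the number of tracked processes. */
static struct cca_node *
cca_lookup(const struct cca_list *l, int pid)
{
	assert(l != NULL);
	for (struct cca_node *n = l->head; n != NULL; n = n->next)
		if (n->pid == pid)
			return n;
	return NULL;
}

/*
 * Record one write request for pid, creating its record on first use.
 * Returns the record, or NULL if the allocation fails.
 */
static struct cca_node *
cca_record_write(struct cca_list *l, int pid)
{
	struct cca_node *n = cca_lookup(l, pid);

	if (n == NULL) {
		n = calloc(1, sizeof(*n));
		if (n == NULL)
			return NULL;
		n->pid = pid;
		n->next = l->head;  /* insert at the head of the list */
		l->head = n;
		l->len++;
	}
	n->writes++;
	return n;
}

/* Release every record and reset the list to empty. */
static void
cca_free(struct cca_list *l)
{
	struct cca_node *n = l->head;

	while (n != NULL) {
		struct cca_node *next = n->next;
		free(n);
		n = next;
	}
	l->head = NULL;
	l->len = 0;
}
```

Every call to cca_record_write() walks the list before it can update a counter, which is the cost a splay tree would reduce for processes that write often.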
The CCA is based on coupling two observable behaviors of processes: (i) the entropy, or amount of randomness, present in the process (this treats the processes as independent random variables), and (ii) the rate of growth of the writes issued by a process. To reduce computation overhead, we record these parameters every 2 seconds of process activity. A process is marked as being in a "congestion inducing phase" if the predicted rate of growth of writes is less than the rate actually observed. In that case, the process is put to sleep for a variable period of time. (In the current release, this period is hard-coded to 8 seconds while we tune various aspects of the algorithm.) Typical trace behavior of such an activity can be found here.
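The marking rule can be sketched in userland C as follows. This is an illustration under stated assumptions, not the released algorithm: the text does not say how the predicted rate is computed, so the sketch assumes an exponentially weighted moving average with weight 0.5, and it leaves out the entropy term entirely. The names struct cca_rate and cca_sample() are hypothetical; only the 2 second sampling interval and the 8 second sleep come from the description above. Floating point is used for readability and would normally be avoided in kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CCA_SAMPLE_SECS 2  /* sampling interval stated in the text */
#define CCA_SLEEP_SECS  8  /* hard-coded sleep stated in the text */

/* Hypothetical per-process rate state for this sketch. */
struct cca_rate {
	unsigned long last_writes;  /* cumulative writes at previous sample */
	double predicted;           /* predicted writes per second (assumed EWMA) */
	bool primed;                /* false until the first sample is taken */
};

/*
 * Take one sample. writes_now is the cumulative write count and is
 * assumed never to decrease. Returns the number of seconds the process
 * should sleep: CCA_SLEEP_SECS if the predicted rate is below the
 * observed rate, otherwise 0.
 */
static int
cca_sample(struct cca_rate *r, unsigned long writes_now)
{
	assert(r != NULL);

	double observed =
	    (double)(writes_now - r->last_writes) / CCA_SAMPLE_SECS;
	bool congesting = false;

	if (!r->primed) {
		/* First sample only seeds the prediction. */
		r->predicted = observed;
		r->primed = true;
	} else {
		/* Rule from the text: predicted rate less than observed. */
		congesting = r->predicted < observed;
		/* Assumed predictor: EWMA with weight 0.5. */
		r->predicted = 0.5 * r->predicted + 0.5 * observed;
	}
	r->last_writes = writes_now;
	return congesting ? CCA_SLEEP_SECS : 0;
}
```

With this predictor, a process that writes at a steady rate is never put to sleep, while a sudden burst exceeds the prediction and triggers the sleep.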

Documentation

Caveat: The released code has not been exhaustively tested, though reasonable care has been taken to ensure that the implementation will not cause undesirable effects. In some instances during heavy write events, the CCA has been observed to make the system unresponsive to remote login via ssh, and in two instances it caused the ioflush daemon to pause for long intervals. Also, in the current release, slowing down the generation of write requests affects the rate of read requests by processes as well. All of these issues are under investigation and will hopefully be addressed in future releases.

Instructions on how to run the code can be found here. Initial results using the approach can be found here.

References

Interactive responsiveness under heavy I/O load, http://mail-index.netbsd.org/tech-perform/2004/01/26/0000.html.
Pluggable disk scheduler for FreeBSD, http://wikitest.freebsd.org/moin.cgi/Hybrid.
An Implementation of Scheduler Activations on the NetBSD Operating System, http://web.mit.edu/nathanw/www/usenix/freenix-sa/freenix-sa.html.
NetBSD Internals, http://www.netbsd.org/~jmmv/guide/.
The Design of the Unix Operating System, Maurice J. Bach.

###########

Sumantra R. Kundu <sumantra@gmail.com>

$Id: index.html,v 1.1.1.1 2006/05/23 10:44:18 hubertf Exp $