
NetBSD-SoC: Pass-to-Userspace Framework FileSystem

Upupdate 2006

puffs has been integrated into NetBSD as of 20061022. Information on this page is no longer kept up to date (most of it was already out of date at the time of writing). Refer to the NetBSD puffs page for current information.

Update 2006

I've updated the code to run against a -current source tree. See the README and TODO files for more information. If you are interested in the recent progress, the CHANGES file should track that quite closely now. Among other things, it is now possible to do the following:

jojonaru# mount
/dev/wd0a on / type ffs (local)
puffs:detrempe on /puffs type puffs
jojonaru# cd /usr/share
jojonaru# pax -rw doc /puffs
jojonaru# diff -r doc /puffs/doc
jojonaru# pax -rw zoneinfo /puffs
jojonaru# df
Filesystem     1K-blocks      Used     Avail Capacity  Mounted on
/dev/wd0a         254079    173851     67525    72%    /
puffs:detrempe       248       248         0   100%    /puffs
jojonaru# cd /puffs
jojonaru# ls
vi        zoneinfo
jojonaru# rm -rf *
jojonaru# df .
Filesystem     1K-blocks      Used     Avail Capacity  Mounted on
puffs:detrempe         0         0         0   100%    /puffs
jojonaru# ls
jojonaru# umount /puffs
jojonaru# mount
/dev/wd0a on / type ffs (local)

.. and

jojonaru# touch pink
jojonaru# ls -l
total 0
-rw-rw-rw-  1 root  wheel  0 Oct  9 22:10 pink
jojonaru# ln pink floyd
jojonaru# echo 'the lunatic is on the grass' > pink
jojonaru# rm pink
jojonaru# cat floyd
the lunatic is on the grass
jojonaru# ls -l floyd
-rw-rw-rw-  1 root  wheel  28 Oct  9 22:10 floyd

.. and

jojonaru# echo 'the bug' > straits
jojonaru# ln -s straits dire
jojonaru# ls -l
total 0
lrwxr-xr-x  1 root  wheel  7 Oct  9 23:47 dire -> straits
-rw-rw-rw-  1 root  wheel  8 Oct  9 23:47 straits
jojonaru# cat straits
the bug
jojonaru# rm straits
jojonaru# cat dire
cat: dire: No such file or directory

.. and (directory containing hard links and symlinks)

jojonaru# cd /usr/
jojonaru# pax -rw bin /puffs
jojonaru# diff -r bin /puffs/bin
jojonaru# df -i /puffs
Filesystem     1K-blocks      Used     Avail Capacity  iused    ifree  %iused  Mounted on
puffs:detrempe     19328     19328         0   100%      735        0   100%   /puffs

This means that detrempefs (a simple in-memory file system written on top of libpuffs) survives having a directory tree copied onto it and preserves the original structure and contents. puffs is now getting fairly close to integration into NetBSD.

What is it?

The aim is to create a general-purpose framework for attaching filesystems running in userspace. The framework can then be used for various applications, such as writing new filesystems in userspace to test them, or for "novelty" uses such as a filesystem for user account administration.

On a more technical level, the work consists of writing a pass-through layer which attaches to the current virtual filesystem layer in the kernel, and creating a communication infrastructure so that the userspace filesystem can receive commands from the kernel and respond to them once it has completed the task. In addition, at least some pretended effort must be put into thinking about the interface to which the userspace implementation will attach.

The flow of control will be somewhat like the following: application (e.g. cat file) -> kernel (syscall, vfs ..) -> kernel puffs -> userspace puffs -> fs implementation (userspace) -> userspace puffs -> kernel puffs -> application

What's in a name?

I was reading a cookbook when beginning the project and was at a chapter on puff pastry. Since the acronym almost fits the purpose, and since the framework can be imagined to increase the volume of the operating system, the name was (unwisely) chosen.

And a détrempe is of course a flour and water paste, which is the first stage in making puff pastry.

Requirements

Following the above call-flow, it can be said that the project is divided into four separate parts.

VFS integration

For incoming calls, the puffs framework needs to be integrated into the existing virtual filesystem framework. This obviously needs to work for anything else to work. Luckily, this part is just legwork, some parts of which are documented less well than others, but it is suitable work to be done before having coffee in the morning (no, not really ... ;).

Userspace communication: technology

Another fairly obvious goal is to have a pipe between userspace and the kernel. This is harder to accomplish than I initially thought. For some reason, every way I think of (that can, with a straight face, be called a general solution for a generic filesystem and efficient'ish) seems to be a dead end.

For the SoC project it is acceptable to just have some working method (well, maybe not the "operator types requests in manually" method ..); it can be improved later. The choice is not going to show to the user of the framework, so filesystems can freely be developed. Of course, they need to be relinked against the puffs lib after I finally figure out what the best way to communicate is.

Userspace communication: interface

This requirement consists of specifying what the various vfsops and vnodeops should look like when transmitted to userspace. It's manual labour for all operations, but once a few are figured out, the rest should be doable at a fairly quick pace. The interface needs to be fairly well in place, although perhaps not perfectly thought out, by the end of the SoC project.

Userspace link-level interface

This specifies what the C-level linkage to the framework should look like. I'll probably have some kind of library available for filesystem development here, but it may not be the ultimate killer interface yet by the end of the SoC project.

summa summarum

The mandatory requirement for the SoC project will be a prototype implementation of puffs. I will continue perfecting it afterwards until it is working according to the NetBSD definition.

Deliverables

The deliverables shall meet the levels of quality specified in the Requirements section. They are as follows:

A test filesystem will naturally be implemented, but it may be just suitable for testing the framework, not useful for anything standalone, and therefore cannot be considered to be a deliverable.

Status

Future:

Current:

Documentation

Implementing a userspace filesystem with puffs consists of filling out a few operation vectors (well, structs actually) and calling the mount function. Sounds simple, yes? Yes.. yes ... um... no.

What needs to be done to have a working filesystem implementation behind puffs is to use the routines provided by libpuffs, link your implementation against libpuffs and run the resulting application.

background on in-kernel filesystems

In-kernel filesystems attach to filesystem abstractions at two basic interfaces: the filesystem itself (vfs - virtual filesystem) and the filesystem nodes (vnode - virtual node). Each filesystem usually implements some subset of the operations in the two interfaces to achieve what it wants to achieve. Some features are mandatory to implement (such as mount); others can be implemented or left unimplemented depending on the filesystem in question.

what's kernel got to do with it?

The level of abstraction that the userspace implementation needs to bite into had to be decided. The existing readily available abstraction was the natural choice. Of course the operations cannot be passed directly to userspace as such (think about e.g. kernel memory space references), but on a basic level the operations are the same in the kernel and userspace.

vnodes

Vnodes live and dwell in the kernel. They are pooled together when unused and are recycled between all filesystem types in the system. Basically their usage is highly optimized, since they come and go a lot in the daily operation of a normal healthy kernel.

However, in userspace we are not so concerned with performance. We cannot even pool nodes together between different mounts of puffs: they are separate processes and have no way of knowing about each other's nodes. Even if we could, we probably would not want to, since pooling vnodes together adds much complexity.

But what we must be able to do is to map vnodes to userspace nodes and back. This is because the vnode operations are done on specific vnodes, and the userspace implementation must know which node we are operating on currently. The information between the kernel and userspace is passed back and forth as cookies. Most operations pass the information from the kernel to userspace, but some operations where node creation is involved pass this information from userspace back to the kernel (and the kernel then uses this information to pass the cookie values for operations on those created nodes).

The userspace implementation must be able to map cookie values to userspace nodes. The easiest scheme is to pass the virtual memory addresses of the node structures back. The kernel is okay with anything that is unique for every node (at any given point in time).
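
The address-as-cookie scheme can be sketched roughly as follows. This is only an illustration under assumptions: struct mynode, cookie_t, node_create() and cookie_to_node() are hypothetical names invented here, not part of libpuffs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace node; not a libpuffs structure. */
struct mynode {
	char name[32];
	int  refs;
};

/* The cookie is opaque to the kernel; here it is simply the node address. */
typedef void *cookie_t;

/* Create a node and return the value that would be handed to the kernel. */
static cookie_t
node_create(const char *name)
{
	struct mynode *n = calloc(1, sizeof(*n));

	if (n == NULL)
		return NULL;
	/* calloc zeroed the buffer, so the name stays NUL-terminated. */
	strncpy(n->name, name, sizeof(n->name) - 1);
	return n;
}

/* Map a cookie received from the kernel back to the userspace node. */
static struct mynode *
cookie_to_node(cookie_t c)
{
	assert(c != NULL);
	return c;
}
```

Since two live nodes never share an address, the uniqueness requirement comes for free; the cost is that the filesystem has to trust the value it gets back.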

the interface

The interface is defined in two files, the kernel portion in miscfs/puffs/puffs.h and the userspace portion in puffif.h.

miscfs/puffs/puffs.h

The information in this file relevant to implementing a puffs filesystem is the argument structures to each vfs or vnode call (puffs_vfsreq_xxx and puffs_vnreq_yyy). In addition to the vnode cookies discussed above, common information in calls is the operation caller credentials: the structure uucred (userspace representation of struct ucred) and the calling process id. The rest of the fields usually relate closely to the operation in question and can be spied on from the vfsops.9 and vnodeops.9 manual pages. (Yes, I will add better descriptions at a later date.)

The other interesting pieces of information located here are struct puffs_sizeop and struct puffs_cn. I will discuss the inner truths of puffs_sizeop further down. puffs_cn is the translated userspace representation of the kernel struct componentname. It cannot be transferred with a 1:1 copy since it contains pointers to the kernel memory space.

puffif.h

This file contains the definitions for the structures whose callbacks must be filled out for the filesystem to work. All callbacks can either be filled in manually, or puffs_dummyops() can be called to fill in all the operations with dummy functions. This file also contains the library public interface (which is bound to change in the future for several reasons, so don't take it too seriously, ok?).

puffs_sizeop

For almost all operations it is enough to define a single simple callback, which does everything required for the operation. But as fate usually mandates, there are a few exceptions to this simple rule. These exceptions are operations which require an arbitrary-size buffer for completion (read/write/ioctl/fcntl). What happens now is that the first operation, e.g. read1(), is called and the userspace program must reserve enough memory for the operation (usually arg->resid) and pass the buffer location and size to the kernel in sizeop->userbuf and sizeop->bufsize, respectively. In the second call (e.g. read2()) the buffer is freed. Of course read actually reads into the buffer in read1, while in a write operation the userspace filesystem code would read *from* the buffer in write2 and dedicate write1 to simply reserving the buffer. Confusing? I'll draw a picture some day ;)
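
The two-phase read can be sketched roughly like this. The field names userbuf, bufsize and resid come from the description above; everything else (struct sizeop, struct readarg, my_read1(), my_read2(), the file contents, and the ENOMEM return value) is an assumption made for the example, not the real puffs interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; the real structures live in miscfs/puffs/puffs.h. */
struct sizeop {
	void   *userbuf;	/* buffer location reported to the kernel */
	size_t  bufsize;	/* buffer size reported to the kernel */
};

struct readarg {
	size_t offset;		/* where in the file to start reading */
	size_t resid;		/* how many bytes the caller wants */
};

static const char filedata[] = "the lunatic is on the grass\n";

/* Phase 1: reserve a buffer, read into it, report location and size. */
static int
my_read1(const struct readarg *arg, struct sizeop *sop)
{
	size_t len = sizeof(filedata) - 1;
	size_t avail = arg->offset < len ? len - arg->offset : 0;
	size_t n = arg->resid < avail ? arg->resid : avail;

	assert(sop != NULL);
	sop->userbuf = malloc(n > 0 ? n : 1);
	if (sop->userbuf == NULL)
		return ENOMEM;
	if (n > 0)
		memcpy(sop->userbuf, filedata + arg->offset, n);
	sop->bufsize = n;
	return 0;
}

/* Phase 2: the kernel has copied the data out, so release the buffer. */
static void
my_read2(struct sizeop *sop)
{
	free(sop->userbuf);
	sop->userbuf = NULL;
	sop->bufsize = 0;
}
```

A write would mirror this: the first call only reserves the buffer, and the second call consumes what the kernel copied into it before freeing it.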

XXX-note-to-self: investigate calling uvm_mmap() for the handling process vm_map to reserve memory without a bounce-call. does this introduce unwanted side-effects (such as having to register the process doing the handling). still does not solve the problem with ioctl and fcntl.

simple example: hardcodefs

A really simple (or perhaps even simpler than that) filesystem was created just to test out the framework and act as an example for people wanting to use the framework (ok, it might have been smarter to set a good example by writing good code, but the world isn't always aligned with one's wishes).

The filesystem consists of a flat layer of files where a single file is defined by struct hcfsfile in hcfs.h. As can be seen, the file only contains its name, attributes and a fixed-length memory storage buffer for storing information. While currently the data storage area is in anonymous memory, nothing would prevent one from e.g. doing a simple mmap() on a file in another filesystem and thereby gaining non-volatile storage with just a few additional lines of code.

File creation is not possible at runtime: all file nodes must be listed in the code, and they are created when the filesystem starts up. There is no real reason besides laziness for this.
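
The mmap() idea could look roughly like the following. This is a sketch using only standard POSIX calls; storage_map(), storage_unmap() and STORAGE_SIZE are hypothetical names and are not taken from hardcodefs.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define STORAGE_SIZE 4096	/* hypothetical fixed per-file storage size */

/*
 * Map STORAGE_SIZE bytes of the file at path as shared memory, creating
 * the file and setting its length if needed.  Returns NULL on failure.
 */
static void *
storage_map(const char *path)
{
	void *p;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd == -1)
		return NULL;
	if (ftruncate(fd, STORAGE_SIZE) == -1) {
		close(fd);
		return NULL;
	}
	p = mmap(NULL, STORAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the descriptor is closed */
	return p == MAP_FAILED ? NULL : p;
}

/* Drop the mapping; the data written through it remains in the file. */
static void
storage_unmap(void *p)
{
	munmap(p, STORAGE_SIZE);
}
```

With MAP_SHARED, stores into the returned buffer end up in the backing file, so the fixed-length storage area would survive a restart of the filesystem process.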

hc.c

This is the main entry point for the filesystem. As can be seen, the operation vectors are filled with dummy ops and then overwritten by the few real operations that hardcodefs supports. After this, mount is called, and internally it creates a new execution context. The calling context could hang around for whatever purpose, but it can also just exit, so that's what it does. (In the future I will probably change the interface so that the actual execution need not be passed into the hands of the mount function. Currently the interface is poor because all requests must be handled synchronously. More on this topic will follow at a later date.)
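
The "dummies first, then overrides" pattern can be sketched as below. The names here (struct myvnops, my_dummyops(), hc_getattr(), hc_setup()) are hypothetical, and returning EOPNOTSUPP from the dummies is an assumption; the real vectors are declared in puffif.h and what puffs_dummyops() installs may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical operation vector; the real one is declared in puffif.h. */
struct myvnops {
	int (*lookup)(const char *name);
	int (*getattr)(void);
	int (*read)(void);
	int (*write)(void);
};

/* Dummy operations simply report that the call is unsupported. */
static int
dummy_lookup(const char *name)
{
	(void)name;
	return EOPNOTSUPP;
}

static int
dummy_noarg(void)
{
	return EOPNOTSUPP;
}

/* Fill every slot so that no callback is left as a NULL pointer. */
static void
my_dummyops(struct myvnops *ops)
{
	ops->lookup = dummy_lookup;
	ops->getattr = dummy_noarg;
	ops->read = dummy_noarg;
	ops->write = dummy_noarg;
}

/* A "real" operation supplied by the filesystem. */
static int
hc_getattr(void)
{
	return 0;
}

/* Set up the vector: dummies everywhere, then the supported ops on top. */
static void
hc_setup(struct myvnops *ops)
{
	assert(ops != NULL);
	my_dummyops(ops);
	ops->getattr = hc_getattr;
}
```

Filling the whole vector first means that an operation the filesystem does not support fails cleanly instead of jumping through a NULL pointer.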

files.c

The creation of files is handled here. It is accomplished in creatfiles() by going over the pre-determined list, allocating space for nodes, allocating a predefined attributes structure for them, telling them their names and registering the nodes to the filesystem. As a convenience, a routine for returning the nth file in the filesystem is provided (purely an artifact of the "design" of the filesystem).
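
A rough sketch of that creation step follows. The node layout, the file names and the functions creatfiles_sketch() and nthfile() are assumptions for the example; the real definitions are in hcfs.h and files.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout; the real struct hcfsfile is in hcfs.h. */
struct file {
	char name[32];
	unsigned mode;		/* stand-in for the attribute structure */
	struct file *next;
};

/* The pre-determined list of files (names invented for the example). */
static const char *const filenames[] = { "pink", "floyd", "straits" };
#define NFILES (sizeof(filenames) / sizeof(filenames[0]))

static struct file *filelist;	/* all registered nodes, in creation order */

/* Walk the fixed list, allocate each node and register it. */
static int
creatfiles_sketch(void)
{
	struct file **tail = &filelist;
	size_t i;

	assert(filelist == NULL);	/* only meant to be called once */
	for (i = 0; i < NFILES; i++) {
		struct file *f = calloc(1, sizeof(*f));

		if (f == NULL)
			return -1;
		strncpy(f->name, filenames[i], sizeof(f->name) - 1);
		f->mode = 0666;
		*tail = f;		/* append to the end of the list */
		tail = &f->next;
	}
	return 0;
}

/* Return the nth registered file (counting from 0), or NULL past the end. */
static struct file *
nthfile(size_t n)
{
	struct file *f = filelist;

	while (f != NULL && n-- > 0)
		f = f->next;
	return f;
}
```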

vfsops.c

The vfs operations implemented for this filesystem are contained here. No black magic present.

vnops.c

Finally, the vnode op implementations are here. Only the following are currently supported: lookup, readdir, getattr, read, write. Lookup simply goes through the list of flat nodes and finds the one that matches the given name in the filesystem and returns the address of the respective structure in a cookie. Readdir simply always returns the nth directory entry. Getattr is the easiest of all: a memcpy() from the file structure to the argument structure. The read and write calls work as described above in the paragraph on puffs_sizeop.
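
Lookup, readdir and getattr as described above might look roughly like this. The structures, the file table and the *_sketch() function names are hypothetical and much simpler than the real argument structures in miscfs/puffs/puffs.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute and file structures; not the real puffs types. */
struct attr {
	unsigned mode;
	size_t   size;
};

struct file {
	const char *name;
	struct attr attr;
};

static struct file files[] = {
	{ "pink",    { 0666, 28 } },
	{ "straits", { 0666,  8 } },
};
#define NFILES (sizeof(files) / sizeof(files[0]))

/* Lookup: scan the flat list; the cookie is the address of the match. */
static void *
lookup_sketch(const char *name)
{
	size_t i;

	for (i = 0; i < NFILES; i++)
		if (strcmp(files[i].name, name) == 0)
			return &files[i];
	return NULL;	/* a real implementation would report ENOENT */
}

/* Readdir: return the name of the nth entry, or NULL past the end. */
static const char *
readdir_sketch(size_t n)
{
	return n < NFILES ? files[n].name : NULL;
}

/* Getattr: copy the stored attributes into the caller's structure. */
static void
getattr_sketch(void *cookie, struct attr *out)
{
	struct file *f = cookie;

	assert(cookie != NULL && out != NULL);
	memcpy(out, &f->attr, sizeof(*out));
}
```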

review

Easy 1-2-3 to implement a filesystem: And:

Technical Details

Useful documentation and "documentation" concerning this project:


Antti Kantee <antti.kantee@hut.fi>
$Id: index.html,v 1.24 2007/03/14 10:07:24 thepooka Exp $