[NetBSD logo]    &    [Google logo]

NetBSD-SoC: Improving the mbuf API and implementation

What is it?

The BSD network stack uses chains of mbuf structures to describe packets stored in memory in a noncontiguous manner, to minimize memory usage and eliminate copying where possible (see [1] for details). It is unfortunately easy to write code which makes incorrect assumptions about the mbufs it uses, easier than writing a code which is correct. Worse, the incorrect code will work most of the time, so detecting such bugs is hard. This project aims at providing better APIs for using the mbufs, which would make writing correct code an easy task. It also aims at adding more consistency checks, so that code abusing the mbuf structures fails more reliably.

The goal of the project is explicitely not adding new user-visible features, but to improve the current implementation and make the network stack more reliable.


See the separate status report.


[What exactly is intended to be provide by the project. This list will be used to judge success of the project!]

Mandatory (must-have) components:

Optional (would-be-nice) components:


To compile and test the code, one needs a snapshot of NetBSD-current from 26.05.2006. Replace the directories under src/sys by corresponding directories from CVS under mbuf/src/sys. There are also several files outside this area. src/sys/kern/uipc_mbuf*.c should be replaced by corresponding files under mbuf/userland/mbuf_impl and src/sys/sys/mbuf.h by mbuf/userland/mbuf_impl/sys/mbuf.h. (It is possible to just nonrecursively checkout mbuf/userland/mbuf_impl to src/sys/kern and mbuf/userland/mbuf_impl/sys to src/sys/sys, CVS will ignore the extra files.)

Finally, two patches need to be applied: files.patch and netccitt.patch.

There is also an optional xen.patch, suggested by Manuel Bouyer, which is useful for debugging read-only mbufs under NetBSD/Xen 2 unprivileged domain. If READONLY_MBUF is defined, the xennet driver maps the external storage read-only, so any code attempting to write there crashes the kernel.

For testing the new routines there is a possibility to compile them in an userland application, which is much easier to debug than kernel. See a separate page for details.

Documentation of the API is provided in the form of an updated mbuf(9) man page (search for mptr, m_datarange and M_SETCONTIGSIZE).

Technical Details

The main focus of my work will be on replacing the mtod() macro. The problem with mtod() is that it returns a pointer to the data in the mbuf without checking the length and writability. The caller is supposed to do it itself. Ensuring writability can be done by a call to m_makewritable() or m_copyback_cow() functions. Ensuring the correct length is done by a call to the m_pullup() function. The result looks like:
if (m->m_len < sizeof(struct ether_header)) {
	m = m_pullup(m, sizeof(struct ether_header));
	if (m == NULL)
		return NULL;
eh = mtod(m, struct ether_header *);

Given that this idiom is very common, this should be replaced by something simpler. More importantly, the mtod() should be replaced by a macro which does check the length and writability to make the code which does not use m_pullup() to reliably fail if the assumptions are incorrect. Currently, such code mostly works, leading to hard-to-repeat bugs.

With suggestions from my mentors, we determined a possible way to achieve this goal. One would introduce a macro, say mptr(), which would take as argument the type of the structure whose pointer is to be obtained, unlike the current mtod, which takes the type of the pointer itself. It would apply sizeof to the type and ensure (with m_pullup) that the mbuf data are contiguous in a sufficient length to contain the whole structure. The returned pointer would be const.

There would be also a variant to return a writable pointer (which would make the mbuf writable automatically). And also a variant to skip the m_pullup if the caller somehow knows that the data are already contiguous. This would optionally check and panic if this assumption is incorrect.

This involves modifying lots of places, but the modifications should be quite straightforward ant the not-yet-modified code would continue to work.

To avoid calls to m_pullup or its equivalent, the mbufs should be made contiguous as soon as possible, preferably when enetring an encapsulation layer (like IP or Ethernet). At every entry to such encapsulation layer, the mbuf would be made contiguous to a size sufficient for headers of this layer. A simplest solution would be to use a size of 512 bytes, which should be enough. Then, all places in this layer where one can be sure that this is enough can stop using m_pullup or an equivalent. They should be converted to the above-proposed variant of mptr, which does not do the m_pullup, but optionally checks and panics.

A variant of the above would be to use some knowledge about the current layer's needs, to pullup only the necessary amount and not an arbitrary length as 512 bytes.

There will be of course places where one can't assume that a fixed-length is enough. An example are IPv6 header chains. In such occasions, one would have to still use an equivalent of m_pullup (such as the above-proposed mptr), but this should be rare.

Another idea considered was to deprecate the m_pullup call completely. Instead of it, the code would be forced to always copy data from the mbuf to a local structure, typically allocated on stack (using some convenience macro which would replace mtod). This macro could skip copying in the case of contiguous data, but this would be transparent to the caller, because it would return a pointer either to the local variable or to the mbuf data. A problem arises when the mbuf needs to be writable. In that case, the caller would have to copy the structure back.

Another inconvenience arising from this approach is that if the same structure is repeatedly accessed from different functions, it needs to be copied (or at least checked for the need of copying) in all of the functions. Currently the mbuf can be pulled up in the first one in the call chain and if the others are always called with an already contiguous mbuf, they can skip the m_pullup call. All such code would have to be modified, which is very inconvenient for 3rd party code such as pf, where local modifications are undesirable. It would also probably have negative performance impact.

In any case, the code which writes to mbufs should be minimized. In the case of NAT, updating TTLs and checksums it is unavoidable, but there is code which writes to the mbuf to convert on-wire data to some in-kernel representation, usually to convert to the host byteorder. This would have negative performance impact because of triggering a copy in case of read-only mbufs. It is also quite ugly.

A secondary goal is to rework the m_tag API. It is used to tag packets (represeted as mbufs) with arbitrary metadata. Currently it is used by e.g. the hardware VLAN acceleration code or IPSEC, and it is supposed to have more uses in the future (such as ALTQ or MPLS). The iplementation is done using a linked list of dynamically allocated structures, and unfortunately the API assumes a linear list. It would be good to replace the API with a new one which would hide the implementation, so that some optimizations are possible, such as replacing the list by a hash table or embedding constant-size tags in the mbuf header. There will probably be two kinds of tags: simple tags, of constant size and only one tag of a given type allowed on a packet, and a more general, variable sized tag, with a possibility of appearing multiple times on one packet. The API would guarantee ordering only between tags of the same type.

First of all, I'll try to document current state of the code, how it uses the API which will be changed.

Use of m_pullup and mtod

Writing to mbufs

The book [2] describes severlal cases in the TCP/IP input path where received mbufs are modified, with possibly bad impacts on mbuf storage sharing (either bugs, if the code does not check for shared storage, or reduced performance if it does).
  1. TCP and IP reassembly uses some fields in the IP header for storing pointers to form linked lists.
  2. In ip_input, numbers in the IP header are converted from network to host byte order directly in the packet.
  3. In udp_input and tcp_input, the ipovly structure overlays the original IP header and some fields are changed for the purpose of veryfying checksums. Also, the ip_stripoptions function is called, which removes IP options from the middle of the packet and rearranges te packet to remain contiguous.
  4. In tcp_input, fields in the TCP header are converted from network to host byte order directly in the packet.

Fortunately most of those cases do not exist anymore.

  1. was converted to external lists in 1995 by Chris G. Demetriou when NetBSD was ported to Alpha. (64-bit pointers would not fit to the fields previously used.)
  2. was changed in 2002 by Jun-ichiro itojun Hagino.
  3. was changed in 1999 with KAME import. ip_stripoptions is now unused (and likely buggy, as it calls bcopy for overlapping memory copy).
Only 4. remains, in the form of the TCP_FIELDS_TO_HOST macro. There was an attempt to make this correct for read-only mbuf storage (which NetBSD/Xen could optionally use) - rev. 1.241 of tcp_input.c by Manuel Bouyer. It caused obscure crashes and was backed out. (The likely cause was that it called m_makewritable with the length of the struct tcphdr. But the TCP header is actually variable-sized because of options, this structure is just the bare header without them. There is code in tcp_input.c to make sure that the whole TCP header is contiguous, and this call to m_makewritable destroyed this property.)


There are several functions of the m_tag group which imply strict ordering between mbuf tags.


The above cited manual pages are a reference of the mbuf API. But for an easy to understand introduction, I would recommend the chapters about networking in the classical "daemonic book":

[1] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, John S. Quarterman: The Design and Implementation of the 4.4 BSD Operating System. Addison Wesley, 1996.

The following book provides a very detailed description of networking in the 4.4BSD-Lite kernel.

[2] Gary R. Wright, W. Richard Stevens: TCP/IP Illustrated, volume 2: the implementation. Addison Wesley, 1995.

SourceForge.net Logo
Pavel Cahyna <pavel@netbsd.org>
$Id: index.html,v 1.14 2006/08/21 13:24:40 pcahyna Exp $