![]() |
& | ![]() |
The goal of the project is explicitely not adding new user-visible features, but to improve the current implementation and make the network stack more reliable.
Mandatory (must-have) components:
To ensure correct handling of read-only storage, the "const" qualifier should be used to leverage the compiler-provided checking.
Converting legacy protocols such as XNS, ISO, or Appletalk is optional.
Code which hasn't been converted because it did not match the new API well should be documented to provide basis for future improvements. Also there should be documentation of the assumptions made by current code, especially code that we can't change easily because it is shared with other projects.
Optional (would-be-nice) components:
Finally, two patches need to be applied: files.patch and netccitt.patch.
There is also an optional xen.patch, suggested by Manuel Bouyer, which is useful for debugging read-only mbufs under NetBSD/Xen 2 unprivileged domain. If READONLY_MBUF is defined, the xennet driver maps the external storage read-only, so any code attempting to write there crashes the kernel.
For testing the new routines there is a possibility to compile them in an userland application, which is much easier to debug than kernel. See a separate page for details.
Documentation of the API is provided in the form of an updated mbuf(9) man page (search for mptr, m_datarange and M_SETCONTIGSIZE).
if (m->m_len < sizeof(struct ether_header)) {
m = m_pullup(m, sizeof(struct ether_header));
if (m == NULL)
return NULL;
}
eh = mtod(m, struct ether_header *);
Given that this idiom is very common, this should be replaced by something simpler. More importantly, the mtod() should be replaced by a macro which does check the length and writability to make the code which does not use m_pullup() to reliably fail if the assumptions are incorrect. Currently, such code mostly works, leading to hard-to-repeat bugs.
With suggestions from my mentors, we determined a possible way to achieve this goal. One would introduce a macro, say mptr(), which would take as argument the type of the structure whose pointer is to be obtained, unlike the current mtod, which takes the type of the pointer itself. It would apply sizeof to the type and ensure (with m_pullup) that the mbuf data are contiguous in a sufficient length to contain the whole structure. The returned pointer would be const.
There would be also a variant to return a writable pointer (which would make the mbuf writable automatically). And also a variant to skip the m_pullup if the caller somehow knows that the data are already contiguous. This would optionally check and panic if this assumption is incorrect.
This involves modifying lots of places, but the modifications should be quite straightforward ant the not-yet-modified code would continue to work.
To avoid calls to m_pullup or its equivalent, the mbufs should be made contiguous as soon as possible, preferably when enetring an encapsulation layer (like IP or Ethernet). At every entry to such encapsulation layer, the mbuf would be made contiguous to a size sufficient for headers of this layer. A simplest solution would be to use a size of 512 bytes, which should be enough. Then, all places in this layer where one can be sure that this is enough can stop using m_pullup or an equivalent. They should be converted to the above-proposed variant of mptr, which does not do the m_pullup, but optionally checks and panics.
A variant of the above would be to use some knowledge about the current layer's needs, to pullup only the necessary amount and not an arbitrary length as 512 bytes.
There will be of course places where one can't assume that a fixed-length is enough. An example are IPv6 header chains. In such occasions, one would have to still use an equivalent of m_pullup (such as the above-proposed mptr), but this should be rare.
Another idea considered was to deprecate the m_pullup call completely. Instead of it, the code would be forced to always copy data from the mbuf to a local structure, typically allocated on stack (using some convenience macro which would replace mtod). This macro could skip copying in the case of contiguous data, but this would be transparent to the caller, because it would return a pointer either to the local variable or to the mbuf data. A problem arises when the mbuf needs to be writable. In that case, the caller would have to copy the structure back.
Another inconvenience arising from this approach is that if the same structure is repeatedly accessed from different functions, it needs to be copied (or at least checked for the need of copying) in all of the functions. Currently the mbuf can be pulled up in the first one in the call chain and if the others are always called with an already contiguous mbuf, they can skip the m_pullup call. All such code would have to be modified, which is very inconvenient for 3rd party code such as pf, where local modifications are undesirable. It would also probably have negative performance impact.
In any case, the code which writes to mbufs should be minimized. In the case of NAT, updating TTLs and checksums it is unavoidable, but there is code which writes to the mbuf to convert on-wire data to some in-kernel representation, usually to convert to the host byteorder. This would have negative performance impact because of triggering a copy in case of read-only mbufs. It is also quite ugly.
A secondary goal is to rework the m_tag API. It is used to tag packets (represeted as mbufs) with arbitrary metadata. Currently it is used by e.g. the hardware VLAN acceleration code or IPSEC, and it is supposed to have more uses in the future (such as ALTQ or MPLS). The iplementation is done using a linked list of dynamically allocated structures, and unfortunately the API assumes a linear list. It would be good to replace the API with a new one which would hide the implementation, so that some optimizations are possible, such as replacing the list by a hash table or embedding constant-size tags in the mbuf header. There will probably be two kinds of tags: simple tags, of constant size and only one tag of a given type allowed on a packet, and a more general, variable sized tag, with a possibility of appearing multiple times on one packet. The API would guarantee ordering only between tags of the same type.
First of all, I'll try to document current state of the code, how it uses the API which will be changed.
pf does not use m_pullup at all. In the case of the IP header, it simply uses mtod to obtain a pointer to the beginning of the mbuf chain as an IP header. This looks suspicious, but it is correct. All callers of pfil_run_hooks, where pf is called from (the pf_test function) do a m_pullup before.
For the L4 headers, it uses a pf_pull_hdr function, which copies the header to a temporary structure.
For ALTQ, the address of the IP header is included in an ALTQ tag:
mtag = m_tag_get(PACKET_TAG_PF_QID, sizeof(*atag), M_NOWAIT);
if (mtag != NULL) {
atag = (struct altq_tag *)(mtag + 1);
atag->qid = r->qid;
/* add hints for ecn */
atag->af = af;
atag->hdr = mtod(m0, struct ip *);
m_tag_prepend(m0, mtag);
This could cause problems if the mbuf is re-arranged between the
pfil_run_hooks
call and the consumer of the ALTQ tag, for example if another pfil
hook is run (which can free the mbuf and copy the data to another
one).
IPFilter uses compatibility macros defined in ip_compat.h to hide differences between the various kernels it was ported to. Those macros present an interface mostly identical to the mbuf API to the rest of IPFilter code.
The IPFilter functions obtain a pointer to a fr_info_t structure which describes the packet. It has the pointer to the mbuf, and also pointers to the IP header and L4 header inside the mbuf. This is potentially dangerous, because if the mbuf chain is rearranged, those pointers will become invalid. There is only one instance of m_pullup in IPFilter, which is in the fr_pullup wrapper function. This wrapper takes care of updating the pointers in the fr_info_t structure. If another mbuf-rearranging function is used in IPFilter, a similar wrapper should be provided, instead of calling such function directly.
The fr_pullup function has a bug. If the area to be made contiguous is larger than MHLEN or MLEN, it calls m_pulldown instead, as m_pullup can't make such contiguous area. But it discards the return value from m_pulldown, which is the mbuf containing the data, and instead blindly references the first mbuf, which, while still valid, can be zero-sized if the data don't fit into it. The simplest fix would be to change m_pullup to cope with mbuf clusters and eliminate the call to m_pulldown.
Fortunately most of those cases do not exist anymore.
This deletes all tags after a given tag. If invoked with NULL, it deletes all tags attached to the mbuf. Fortunately all code invoke it with NULL, so there is no dependency on tag ordering.
Finds the first tag with a given type after a given tag. If invoked with NULL, searches from the beginning.
The only places where it is called with non-NULL argument are in netipsec/xform_ah.c and netipsec/xform_esp.c, where is an iteration over all tags of type PACKET_TAG_IPSEC_IN_CRYPTO_DONE. This does care about tag ordering, but only between tags of the same type.
Is called at only one place: ip_output(). There is an iteration over all tags of type PACKET_TAG_IPSEC_OUT_DONE or PACKET_TAG_IPSEC_OUT_CRYPTO_NEEDED. The result obviously depends on which comes first. The question now is if both can be attached to a same mbuf.
[1] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, John S. Quarterman: The Design and Implementation of the 4.4 BSD Operating System. Addison Wesley, 1996.
The following book provides a very detailed description of networking in the 4.4BSD-Lite kernel.
[2] Gary R. Wright, W. Richard Stevens: TCP/IP Illustrated, volume 2: the implementation. Addison Wesley, 1995.
|
|
|