
NetBSD-SoC: Wide Character Support in NetBSD curses Library

What is it?

This project is part of the Google Summer of Code program, which promotes open source software development. I am very glad that my proposal was approved, and I thank Google's Summer of Code program for the opportunity. I will make every effort to make this project a success.

NetBSD is "a free, secure, and highly portable Unix-like Open Source operating system available for many platforms, from 64-bit Opteron machines and desktop systems to handheld and embedded devices. Its clean design and advanced features make it excellent in both production and research environments, and it is user-supported with complete source." For more information, please visit the official NetBSD web site.

Status

Projected milestones:

I will post progress reports and related discussion on my blog. Please leave your comments and suggestions there; I appreciate them.

Project Deliverables

Documentation

Technical Details

Current NetBSD native curses library

The lack of wide character support in the NetBSD implementation of the curses library limits its support for internationalized character sets, and thus limits the use of NetBSD in countries that use wide character sets. In particular, the problem is that the internal character storage of NetBSD curses and the related routines assume an 8-bit character in each position. To add wide character support, we need to add a new storage structure and a new set of wide character routines as specified in the X/Open Reference.

Proposed changes

Modifying storage structure

In order to add support for wide characters in the NetBSD curses library, the curses internal storage of characters and attributes (as defined in $SRC/lib/libcurses/curses_private.h) needs to be modified. In particular, one possible data structure (__ldata) describing each character position could contain the following:

    typedef struct nschar_t nschar_t;
    struct nschar_t {
        wchar_t   ch;   /* Non-spacing character */
        nschar_t *next; /* Next non-spacing character */
    };
    struct __ldata {
        wchar_t   ch;         /* Character */
        attr_t    attr;       /* Attributes */
        wchar_t   bch;        /* Background character */
        attr_t    battr;      /* Background attributes */
        nschar_t *nsp;        /* Foreground non-spacing character pointer */
        nschar_t *bnsp;       /* Background non-spacing character pointer */
    };

In this storage structure, both the character value and the attribute are 32 bits. Such a data structure is generic enough to handle all wide characters as well as non-spacing characters. In addition, it should align nicely on 32-bit and 64-bit machines.

To save memory, we do not add a separate character width field, because the width needs only a few bits. Instead, we use part of the attribute to specify the width. In particular, we currently use the bits in 0x03ffff00 for the standard attributes, so we could use the top bits (0xfc000000) to store the width. The narrow character routines could mask off the wide attribute part (input/output) and force the width to be one (input).

On each line, there is one storage cell for each column. For an m-column wide character, only the first storage cell holds the width of the character; the remaining m-1 storage cells hold position information in their width fields. For example, if a 4-column character is added to the screen, 4 storage cells are changed and their width fields get the contents (4, -1, -2, -3). If we later add a character at the 2nd, 3rd or 4th position, we know we are in the middle of a multi-column character and can easily clear the other cells.

In addition to non-spacing characters in the foreground, the X/Open reference indicates that background characters can also include non-spacing characters, so we must have two linked lists of non-spacing characters in each cell.
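The width encoding and the (4, -1, -2, -3) cell scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the names `struct cell`, `cell_width`, `set_cell_width`, `clear_wide_at` and `put_wide` are hypothetical, the cell is reduced to a character and an attribute, and the top six bits of the attribute are treated as a signed field so that the negative position markers fit.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

typedef uint32_t attr_t;        /* assumption: 32-bit attribute word */

#define WCOL_MASK  0xfc000000u  /* top 6 bits hold the width field */
#define WCOL_SHIFT 26

/* Simplified cell: the real __ldata has more members. */
struct cell {
    wchar_t ch;
    attr_t  attr;
};

/* Read the width field as a signed 6-bit value (-32..31). */
static int cell_width(attr_t a)
{
    int w = (int)((a & WCOL_MASK) >> WCOL_SHIFT);
    return (w & 0x20) ? w - 64 : w;
}

/* Return a with its width field replaced by w; other bits are kept. */
static attr_t set_cell_width(attr_t a, int w)
{
    assert(w >= -32 && w <= 31);
    return (a & ~WCOL_MASK) | (((attr_t)w & 0x3fu) << WCOL_SHIFT);
}

/* Blank every cell of the character that covers column x. */
static void clear_wide_at(struct cell *line, int ncols, int x)
{
    int w = cell_width(line[x].attr);
    int start = (w < 0) ? x + w : x;   /* -k means k columns past the start */
    int n = cell_width(line[start].attr);

    for (int i = start; i < start + n && i < ncols; i++) {
        line[i].ch = L' ';
        line[i].attr = set_cell_width(line[i].attr, 1);
    }
}

/* Store a character that occupies `width` columns starting at column x. */
static int put_wide(struct cell *line, int ncols, int x, wchar_t ch, int width)
{
    if (x < 0 || width < 1 || width > 31 || x + width > ncols)
        return -1;
    for (int i = x; i < x + width; i++)
        clear_wide_at(line, ncols, i);  /* remove anything we overlap */
    for (int i = 0; i < width; i++) {
        line[x + i].ch = ch;
        line[x + i].attr =
            set_cell_width(line[x + i].attr, i == 0 ? width : -i);
    }
    return 0;
}
```

With this layout, writing a one-column character into the third cell of a four-column character first blanks all four cells and then stores the new character, while the standard attribute bits of each cell stay untouched.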

There are concerns that this new storage structure would more than quadruple memory use. First of all, the memory increase is not that large, because the current character storage already uses wchar_t and an integer type attr_t. The only possible additional memory comes from the linked list of non-spacing characters, which occur infrequently, and whose length can be limited by an implementation to as low as five. So, the memory increase is at most 2.75 times the current space in the worst case, assuming every character has five non-spacing characters associated with it and 32-bit pointers are used. In the most common cases, memory use increases by only 25%. On the other hand, such an increase is an inevitable price to pay if we want to enable wide character support in the curses library. We can provide storage structures for both narrow and wide characters, and make the choice a compile time option ("HAVE_WCHAR") so that curses program developers can decide whether to use wide character support depending on their memory constraints.

In addition, we need to define the complex character data structure cchar_t, required by the X/Open reference to be used in functions such as in_wchstr(). It includes a string of up to 8 wide characters and its length, an attribute, and a color-pair.

#define CURSES_CCHAR_MAX 8
struct cchar_t {
    attr_t    attributes;             /* character attributes */
    unsigned  elements;               /* number of wide characters in vals[] */
    wchar_t   vals[CURSES_CCHAR_MAX]; /* wide characters including non-spacing */
};

Note that we don't define a separate color-pair member, because the color pair is the __COLOR part of the attribute, and its number can be extracted with the PAIR_NUMBER() macro.
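As a rough sketch of how a complex character could be filled in and how the color pair can be recovered from the attribute alone, consider the following. The names `struct demo_cchar`, `demo_setcchar` and `demo_pair_number` are hypothetical, and the color field position (`X_COLOR`, `X_COLOR_SHIFT`) is an assumption chosen to fit inside the 0x03ffff00 attribute range; it is not taken from the actual headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

typedef uint32_t attr_t;          /* assumption: 32-bit attribute word */

#define CURSES_CCHAR_MAX 8
#define X_COLOR       0x03fe0000u /* assumed position of the color field */
#define X_COLOR_SHIFT 17

/* Same layout as the cchar_t proposed above, under a demo name. */
struct demo_cchar {
    attr_t   attributes;             /* character attributes */
    unsigned elements;               /* number of wide characters in vals[] */
    wchar_t  vals[CURSES_CCHAR_MAX]; /* wide characters including non-spacing */
};

/*
 * Fill cc from a NUL-terminated wide string (one spacing character
 * followed by optional non-spacing characters), an attribute and a
 * color pair number.  Returns 0 on success, -1 on invalid input.
 */
static int demo_setcchar(struct demo_cchar *cc, const wchar_t *wch,
                         attr_t attrs, unsigned pair)
{
    size_t n = wcslen(wch);

    if (n == 0 || n > CURSES_CCHAR_MAX ||
        pair > (X_COLOR >> X_COLOR_SHIFT))
        return -1;
    memset(cc, 0, sizeof(*cc));
    memcpy(cc->vals, wch, n * sizeof(wchar_t));
    cc->elements = (unsigned)n;
    /* The pair number is folded into the color bits of the attribute. */
    cc->attributes = (attrs & ~X_COLOR) | ((attr_t)pair << X_COLOR_SHIFT);
    return 0;
}

/* Recover the color pair number from the attribute. */
static unsigned demo_pair_number(const struct demo_cchar *cc)
{
    return (cc->attributes & X_COLOR) >> X_COLOR_SHIFT;
}
```

Because the pair number is stored in the attribute, no additional member is needed in the structure, which is the point made in the note above.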

In order to handle wide character input from the terminal in get_wch(), we need to add a circular buffer to the screen structure that holds the input characters, so that both wide and narrow characters can be returned correctly by get_wch().

struct __screen {
    ...
#define MAX_CBUF_SIZE MB_LEN_MAX
    int       cbuf_head;      /* head index of the circular input character buffer */
    int       cbuf_tail;      /* tail index of the circular input character buffer */
    int       cbuf_cur;       /* index of the current character in the buffer */
    mbstate_t sp;             /* wide character input processing state */
    int       cbuf[ MAX_CBUF_SIZE ]; /* input character buffer */
};
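The circular buffer itself could behave roughly as in the sketch below. It is a minimal stand-alone version, not the code in the library: `struct demo_cbuf`, `cbuf_put` and `cbuf_get` are hypothetical names, and the `count` member is an assumption added here to tell a full buffer from an empty one (the proposed structure uses a separate current index instead).

```c
#include <assert.h>
#include <limits.h>

#define MAX_CBUF_SIZE MB_LEN_MAX

/* Stand-alone version of the buffer members shown above. */
struct demo_cbuf {
    int head;                /* index of the oldest stored byte */
    int tail;                /* index where the next byte is stored */
    int count;               /* number of bytes currently stored */
    int buf[MAX_CBUF_SIZE];  /* input bytes, stored as non-negative ints */
};

/* Append one input byte; returns 0 on success, -1 if the buffer is full. */
static int cbuf_put(struct demo_cbuf *cb, int c)
{
    assert(c >= 0);
    if (cb->count == MAX_CBUF_SIZE)
        return -1;
    cb->buf[cb->tail] = c;
    cb->tail = (cb->tail + 1) % MAX_CBUF_SIZE;  /* wrap around */
    cb->count++;
    return 0;
}

/* Remove and return the oldest byte, or -1 if the buffer is empty. */
static int cbuf_get(struct demo_cbuf *cb)
{
    int c;

    if (cb->count == 0)
        return -1;
    c = cb->buf[cb->head];
    cb->head = (cb->head + 1) % MAX_CBUF_SIZE;  /* wrap around */
    cb->count--;
    return c;
}
```

An input routine built on top of this would presumably feed the buffered bytes to mbrtowc() together with the saved mbstate_t until a complete wide character is produced. Sizing the buffer to MB_LEN_MAX means it can always hold one full multibyte sequence.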

Adding wide character support routines

With the modified internal character storage, the new wide character routines must be written, and the existing routines that assume narrow characters and use the old storage structure must be rewritten, so that add, insert, input, delete, refresh, and cursor movement operations work properly. The major change is to (re)write the code to use the character width instead of assuming one character per position. All the wide character routines return an error if "HAVE_WCHAR" is not defined; similarly, all narrow character routines return an error when "HAVE_WCHAR" is defined. The good news is that the underlying support routines already exist in current NetBSD and are declared in $SRC/include/wchar.h, although some of them do not yet have a man page.

  1. Add (Overwrite)
  2. Insert
  3. Input (Read back from window)
  4. Delete
  5. Complex character processing
  6. Refresh
  7. Window related

Performance goal

I will try to make the implementation as fast as possible, so that it will be usable on machines like the VAX and m68k. I will also try to make it use as little memory as possible on these types of machines. For example, I will try to make line comparison as efficient as possible. We will extend the hash in __line and refresh() to include non-spacing characters. As an additional goal, I will try to make wide character support a compile time option, so that it can be omitted on boot media for small memory systems.
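One way to read the remark about the line hash is that the per-line hash used for comparison has to cover the non-spacing characters as well, so that two lines differing only in a combining mark are not treated as equal. The sketch below illustrates this with an FNV-1a hash, which is a choice made for this example and not necessarily the hash used by the library; `struct demo_cell`, `struct demo_nschar`, `fnv1a_u32` and `demo_line_hash` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

typedef uint32_t attr_t;   /* assumption: 32-bit attribute word */

struct demo_nschar {
    wchar_t             ch;    /* non-spacing character */
    struct demo_nschar *next;  /* next non-spacing character */
};

/* Simplified cell: foreground character, attribute, non-spacing list. */
struct demo_cell {
    wchar_t             ch;
    attr_t              attr;
    struct demo_nschar *nsp;
};

/* Mix one 32-bit value into an FNV-1a hash, low byte first. */
static uint32_t fnv1a_u32(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;            /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Hash a whole line, including every non-spacing character. */
static uint32_t demo_line_hash(const struct demo_cell *line, size_t ncols)
{
    uint32_t h = 2166136261u;      /* FNV-1a 32-bit offset basis */

    for (size_t x = 0; x < ncols; x++) {
        const struct demo_nschar *np;

        h = fnv1a_u32(h, (uint32_t)line[x].ch);
        h = fnv1a_u32(h, line[x].attr);
        for (np = line[x].nsp; np != NULL; np = np->next)
            h = fnv1a_u32(h, (uint32_t)np->ch);
    }
    return h;
}
```

A refresh routine could compare the hashes of the old and new line first and fall back to a cell-by-cell comparison only when they match, which keeps the common case cheap on slow machines.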

Testing

We will test our new library with a simple file viewer. The test program is borrowed from the ncurses test suite (view.c), with some modifications made specifically to test the new wide character functions. We will check that it works properly with wide character files.

Manual pages changed (if we can not reuse SUS)

Source code

Added new files Modified files

You are welcome to check out the current source code. Just don't forget to send me your bug reports (if any) and comments, either by email or on my blog. To check out the current source code:

  1. cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/netbsd-soc login (Password: just press ENTER)
  2. cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/netbsd-soc co -P wcurses
  3. See the README.NetBSD file for building and usage instructions
The source code can also be viewed using a web interface.


Ruibiao Qiu <ruibiao@arl.wustl.edu>
$Id: index.html,v 1.29 2005/09/21 14:51:00 ruibiao Exp $