Curses is a terminal control library for Unix-like systems. It enables a programmer to build text-based applications with a text user interface (TUI)[1]. The current native NetBSD curses library has no wide character support, which limits the use of NetBSD in countries with wide character locales. To promote the use of NetBSD in more countries, we propose adding wide character support to the native NetBSD curses library so that it complies with the X/Open Curses Reference[2], for better internationalization and localization.
The X/Open Curses Reference defines all the requirements and interfaces for supporting wide characters, including multi-column characters and non-spacing characters. We start with the curses library source in the NetBSD current source and add the necessary wide character specific interfaces defined in the X/Open Curses Reference. In addition, narrow character specific functions should work alongside wide character specific functions in the same application.
The obstacle to wide character support in the current NetBSD curses lies in its internal character storage data structure and the related functions. They all assume an 8-bit character in each display cell on the screen. To add wide character support, we must adopt a new character storage data structure and add a new set of wide character specific routines as specified in the X/Open Reference.
The current internal character storage data structure is like this:
struct __ldata {
wchar_t ch; /* Character */
attr_t attr; /* Attributes */
wchar_t bch; /* Background character */
attr_t battr; /* Background attributes */
};
This storage structure was initially designed for narrow characters. Although
it uses the 32-bit wchar_t and attr_t data types, it does not support wide
characters, because it assumes one character per position and uses only the
least significant 8 bits, selected with a mask. The higher bits are used for
some attribute fields. There is no character width field to support
multi-column characters, and non-spacing characters are not supported either.
The current curses routines assume a character is 8-bit and takes one position on a screen. New wide character specific interfaces defined in the X/Open Reference must be added.
We modified the internal character storage data structure as follows:
typedef struct nschar_t {
wchar_t ch; /* Non-spacing character */
struct nschar_t *next; /* Next non-spacing character */
} nschar_t;
struct __ldata {
wchar_t ch; /* Character */
attr_t attr; /* Attributes */
wchar_t bch; /* Background character */
attr_t battr; /* Background attributes */
nschar_t *nsp; /* Foreground non-spacing character pointer */
nschar_t *bnsp; /* Background non-spacing character pointer */
};
In the new data structure, the character and attribute fields are 32-bit, and
should align nicely on 32-bit and 64-bit machines. To save memory, we do not
add a separate character width field, because the width needs at most 3 bits.
Instead, we use part of the attributes field to specify the width: the top
bits (0xfc000000) above the attributes mask (0x03ffffff) store the width. The
narrow character routines simply mask off the wide attributes part (for input
and output) and assume a width of one (for input). On each line, there is
still one storage cell for each column. For an m-column wide character, only
the first storage cell holds the width of the character, and the remaining
m-1 storage cells hold position information in their width fields. For
example, if a 4-column character is added to the screen, 4 storage cells are
changed and their width fields get the values (4, -1, -2, -3). If we later
add a character at the 2nd, 3rd or 4th position, we know that we are in the
middle of a multi-column character and can easily clear the other cells.
Besides spacing characters, each cell has two new linked lists of non-spacing
characters, one for the foreground spacing character and one for the
background spacing character.
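The width encoding described above can be sketched as follows. This is a minimal illustration, not the library's code: the struct cell type, the macro names and the helper functions are hypothetical, attr_t is replaced by a 32-bit stand-in, and the width is assumed to be stored as a signed 6-bit value in the bits selected by 0xfc000000.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

typedef uint32_t attr_t;            /* stand-in for the curses attr_t */

#define ATTR_MASK   0x03ffffffu     /* attribute bits, as given in the text */
#define WIDTH_MASK  0xfc000000u     /* top 6 bits hold the width field */
#define WIDTH_SHIFT 26

struct cell {                       /* simplified, hypothetical __ldata */
	wchar_t ch;
	attr_t  attr;
};

/* Store a signed width (-32..31) in the top 6 bits, keeping the attributes. */
static void set_width(struct cell *c, int w)
{
	assert(w >= -32 && w <= 31);
	c->attr = (c->attr & ATTR_MASK) |
	    (((uint32_t)w << WIDTH_SHIFT) & WIDTH_MASK);
}

/* Read the width back, sign-extending the 6-bit field. */
static int get_width(const struct cell *c)
{
	int w = (int)(c->attr >> WIDTH_SHIFT);   /* 0..63 */
	return w >= 32 ? w - 64 : w;
}

/* Put an m-column character at column x: widths become m, -1, ..., -(m-1). */
static void put_wide(struct cell *line, int x, wchar_t wc, int m)
{
	for (int i = 0; i < m; i++) {
		line[x + i].ch = wc;
		set_width(&line[x + i], i == 0 ? m : -i);
	}
}

/* Column at which the character covering column x starts. */
static int start_column(const struct cell *line, int x)
{
	int w = get_width(&line[x]);
	return w < 0 ? x + w : x;
}
```

With this layout, a negative width in any cell leads directly back to the first column of the character, which is what makes it cheap to detect that a new character lands in the middle of a multi-column one.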
In addition to the storage structure, we added a new data structure for complex characters (cchar), which holds the character and its attributes along with a list of the non-spacing characters associated with it.
#define CURSES_CCHAR_MAX 8
struct cchar_t {
attr_t attributes; /* character attributes */
unsigned elements; /* number of characters in vals */
wchar_t vals[CURSES_CCHAR_MAX]; /* characters including non-spacing */
};
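As an illustration of how such a complex character might be filled in, the sketch below copies a spacing character followed by its non-spacing characters into the vals array. The helper cchar_fill() is hypothetical and is not one of the library routines (X/Open defines setcchar() for this job, with a different argument list); attr_t is replaced by a stand-in type, and the structure is written as a typedef here so that the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

typedef unsigned int attr_t;     /* stand-in for the curses attr_t */

#define CURSES_CCHAR_MAX 8

typedef struct {
	attr_t   attributes;              /* character attributes */
	unsigned elements;                /* number of characters in vals */
	wchar_t  vals[CURSES_CCHAR_MAX];  /* characters including non-spacing */
} cchar_t;

/*
 * Hypothetical helper: fill a cchar_t from a NUL-terminated wide string
 * whose first character is the spacing character and whose remaining
 * characters are non-spacing. Returns 0 on success, or -1 if the string
 * is empty or has more than CURSES_CCHAR_MAX characters.
 */
static int cchar_fill(cchar_t *cc, const wchar_t *wcs, attr_t attrs)
{
	size_t n;

	assert(cc != NULL && wcs != NULL);
	n = wcslen(wcs);
	if (n == 0 || n > CURSES_CCHAR_MAX)
		return -1;
	cc->attributes = attrs;
	cc->elements = (unsigned)n;
	for (size_t i = 0; i < n; i++)
		cc->vals[i] = wcs[i];
	return 0;
}
```

For example, passing the two-character string U+0065 U+0301 ("e" followed by a combining acute accent) yields a cchar_t with elements equal to 2.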
To handle wide character input as well as function key sequences in keypad mode, we also added a circular input character buffer for parsing.
struct __screen {
...
#define MAX_CBUF_SIZE MB_LEN_MAX
int cbuf_head; /* header pointer */
int cbuf_tail; /* tail pointer */
int cbuf_cur; /* pointer to the current character */
mbstate_t sp; /* wide character input processing state */
int cbuf[ MAX_CBUF_SIZE ]; /* input character buffer */
};
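A circular buffer of this kind can be sketched as below. The struct cbuf type and the cbuf_*() functions are hypothetical, the cbuf_cur parsing position is left out, and the convention of keeping one slot free to tell a full buffer from an empty one is an assumption for illustration, not necessarily what the library does.

```c
#include <assert.h>
#include <limits.h>

#define MAX_CBUF_SIZE MB_LEN_MAX

struct cbuf {                     /* hypothetical stand-alone buffer */
	int head;                 /* index of the oldest stored byte */
	int tail;                 /* index where the next byte is stored */
	int data[MAX_CBUF_SIZE];  /* input bytes */
};

static int cbuf_empty(const struct cbuf *b)
{
	return b->head == b->tail;
}

static int cbuf_full(const struct cbuf *b)
{
	/* One slot stays unused so that full and empty are distinguishable. */
	return (b->tail + 1) % MAX_CBUF_SIZE == b->head;
}

/* Append one input byte; returns 0 on success, -1 if the buffer is full. */
static int cbuf_push(struct cbuf *b, int c)
{
	assert(b->tail >= 0 && b->tail < MAX_CBUF_SIZE);
	if (cbuf_full(b))
		return -1;
	b->data[b->tail] = c;
	b->tail = (b->tail + 1) % MAX_CBUF_SIZE;
	return 0;
}

/* Remove the oldest byte into *c; returns 0 on success, -1 if empty. */
static int cbuf_pop(struct cbuf *b, int *c)
{
	if (cbuf_empty(b))
		return -1;
	*c = b->data[b->head];
	b->head = (b->head + 1) % MAX_CBUF_SIZE;
	return 0;
}
```

Because the indices wrap around, bytes of a partially read multibyte character or key sequence can stay in the buffer while later bytes arrive, without any copying.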
The new wide character specific routines fall into four categories: add, insert, input, and miscellaneous. When adding or inserting a wide character, we must carefully count the number of columns to fill and set the attributes (including the character width) accordingly. To get a wide character from the input file descriptor associated with a screen, we must be able to distinguish between a function key sequence and a wide character sequence. We reuse the keymap routines of the narrow character input routine getch(), and use the stateful wide character conversion routines defined in wchar.h.
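The stateful conversion step can be illustrated with the standard mbrtowc() function from wchar.h, fed one byte at a time. The wrapper feed_byte() is a hypothetical name and a sketch only; it does not attempt the keymap lookup that separates function key sequences from character data.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Feed one input byte to the stateful converter.
 * Returns 1 and stores the character in *out when a character is
 * complete, 0 when more bytes are needed, and -1 on an invalid
 * sequence (the state is then reset so that parsing can resume).
 */
static int feed_byte(mbstate_t *st, unsigned char byte, wchar_t *out)
{
	char c = (char)byte;
	size_t n;

	assert(st != NULL && out != NULL);
	n = mbrtowc(out, &c, 1, st);
	if (n == (size_t)-2)
		return 0;                   /* incomplete multibyte sequence */
	if (n == (size_t)-1) {
		memset(st, 0, sizeof(*st)); /* back to the initial state */
		return -1;                  /* invalid sequence */
	}
	return 1;                           /* complete character (or NUL) */
}
```

The mbstate_t object must start out zeroed, and the program needs to call setlocale(LC_CTYPE, "") beforehand for multibyte input to be decoded according to the user's locale; in the default "C" locale only single-byte characters are guaranteed to convert.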
Some existing narrow character specific routines were modified so that they can work with wide characters on a screen. The changes fall mostly into two areas: refreshing the screen and changing characters.
First, the new storage data structure makes the screen refreshing code more complicated. The curses library tries to make minimal changes to the screen for better performance. It compares the contents of the current screen (curscr) with those of the virtual screen (__virtscr) line by line, using a hash function to quickly determine whether a line needs to be updated. For wide character support, the hash function must include the non-spacing characters as well, to capture the changes in rendition, so __hash_more() is called for the non-spacing characters associated with each spacing character. Similarly, line comparison and copying become more complicated, because all non-spacing characters must be checked.
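The idea of folding non-spacing characters into the line hash can be sketched as follows. The types and the hash_more() and hash_line() functions are hypothetical simplifications: FNV-1a is used here as a stand-in hash, and hash_more() does not reproduce the signature or algorithm of the library's __hash_more().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ns {                      /* simplified nschar_t */
	uint32_t ch;
	struct ns *next;
};

struct cell {                    /* simplified __ldata */
	uint32_t ch;
	uint32_t attr;
	struct ns *nsp;
};

/* Fold one 32-bit value into hash h, byte by byte (FNV-1a). */
static uint32_t hash_more(uint32_t v, uint32_t h)
{
	for (int i = 0; i < 4; i++) {
		h ^= (v >> (8 * i)) & 0xffu;
		h *= 16777619u;
	}
	return h;
}

/* Hash a line, including every non-spacing character of every cell. */
static uint32_t hash_line(const struct cell *line, size_t ncols)
{
	uint32_t h = 2166136261u;

	assert(line != NULL);
	for (size_t x = 0; x < ncols; x++) {
		h = hash_more(line[x].ch, h);
		h = hash_more(line[x].attr, h);
		for (const struct ns *n = line[x].nsp; n != NULL; n = n->next)
			h = hash_more(n->ch, h);
	}
	return h;
}
```

Two lines that hold the same spacing characters but differ in a combining mark now produce different hash values, so the refresh code does not skip a line whose rendition has changed.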
The second issue with narrow character specific routines is the changing of characters. When a narrow character is added or deleted at a location, we must check whether a partial wide character results and wipe it out if necessary. In addition, characters must be moved carefully, because a character may occupy multiple columns.
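The check for a partial wide character can be sketched like this. For readability the example uses a hypothetical cell type with an explicit width field (m in the first column of an m-column character, negative offsets in the following columns) instead of packing the width into the attributes; wipe_wide() and put_narrow() are illustrative names, not library routines.

```c
#include <assert.h>
#include <wchar.h>

struct cell {          /* simplified cell with an explicit width field */
	wchar_t ch;
	int width;     /* m in the first column, -1..-(m-1) in the rest */
};

/* Blank every column of the multi-column character covering column x. */
static void wipe_wide(struct cell *line, int x)
{
	int start = line[x].width < 0 ? x + line[x].width : x;
	int m;

	assert(start >= 0);
	m = line[start].width;
	for (int i = 0; i < m; i++) {
		line[start + i].ch = L' ';
		line[start + i].width = 1;
	}
}

/* Add a one-column character at column x, clearing any wide character
 * that would otherwise be left partially on the screen. */
static void put_narrow(struct cell *line, int x, wchar_t ch)
{
	if (line[x].width != 1)
		wipe_wide(line, x);
	line[x].ch = ch;
	line[x].width = 1;
}
```

Writing a narrow character onto the second column of a 2-column character therefore blanks the first column as well, rather than leaving half of the wide character behind.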
We tested the new curses library with three wide character locales using a simple file viewer: Simplified Chinese, Traditional Chinese, and Japanese. The screenshots of these tests can be found on our project page.
To analyze the performance of the new library, we also compared the memory usage of the file viewer linked against different libraries, with both narrow and wide characters. The results can be found on our project page. In summary, our test results show that we generally need about twice the memory to support wide characters, and that the native NetBSD curses library consumes more memory up front than the GNU ncurses library.
Using the simple file viewer, I compared the memory footprint of different curses libraries: the new NetBSD curses library with wide character support (wcurses), the traditional NetBSD curses library, and the ncurses library with and without wide character support. I used ps(1) to make simple relative comparisons of the same file viewer code linked against the different curses libraries. The tests use the same file viewer source code, which can call either the narrow character functions as a narrow character viewer (ccview and ncview) or the wide character functions as a wide character viewer (wcview and nwview). For the wide character tests, the two viewers open the same Chinese locale text, which spans multiple pages; for the narrow character tests, the viewers open their own source code (view.c). The tests were run on an i386 machine running NetBSD 2.0. Here are the results:
| | wcview | nwview |
|---|---|---|
| SIZE | 2152K | 1308K |
| RES | 2984K | 2164K |

| | wcview | nwview | tcview | ncview |
|---|---|---|---|---|
| SIZE | 1440K | 1128K | 504K | 328K |
| RES | 2496K | 1964K | 1108K | 1128K |
In the future, we want to test our library with more wide character locales. In addition, we are discussing ways to reduce the memory footprint of the native NetBSD curses library by moving the background character information out of each storage cell.
We would like to thank Google for its Summer of Code program, which made this project possible.