& |
This project is part of the Google Summer of Code program, which promotes open source software development. I am very glad that my proposal was approved, and I thank Google's Summer of Code program for the opportunity. I will make every effort to make this project a success.
NetBSD is "a free, secure, and highly portable Unix-like Open Source operating system available for many platforms, from 64-bit Opteron machines and desktop systems to handheld and embedded devices. Its clean design and advanced features make it excellent in both production and research environments, and it is user-supported with complete source." For more information, please visit the official NetBSD web site.
Projected milestones:
Status: Finished documentation.
To do: More tests with other wide character locales.
Final report converted as a longer article for
Daemon News.
The curses wide character support comes as patches to the NetBSD-current distribution. The patches were generated against a baseline version of the NetBSD-current source checked out on June 28, 2005. To integrate them, run the patch(1) command to merge the changes into your current source tree.
If we can get permission from The Open Group to use the Single Unix Specification (SUS) in NetBSD, that would be ideal, because we would not have to write a whole new set of man pages from scratch.
If we can't get that permission, the changes to the curses manual pages should still be minor: simple changes to the existing curses(3) man pages, indicating the availability of wide character support. Some newly added functions specified in the X/Open Reference will be included in individual man pages, such as [mv][w]add_wch() in curses_addch(3). In addition, a new option may be added to the "SYNOPSIS" section to enable or disable wide character support.
I will start with unit tests for the individual functions that are added or
modified. Then we will use a simple file viewer borrowed from the
ncurses package test suite, with some modifications to suit our
needs, to see whether it works properly with wide character files.
Screenshots of the latest tests can be found here:
Simplified Chinese,
Traditional Chinese,
Japanese, and more are coming soon...
Memory usage:
Using the simple file viewer, I compared the memory footprint of different
curses libraries: the new NetBSD curses library with wide character
support (wcurses), the traditional NetBSD curses library, and the ncurses library
with and without wide character support. I used ps(1) to make simple
relative comparisons of the same file viewer code linked against the different
curses libraries. The tests use the same file viewer source code, which
can call either narrow character functions as a narrow character viewer
(ccview and ncview) or wide character functions as a wide character
viewer (wcview and nwview). For the wide character tests, the two
viewers open the same Chinese locale text that spans multiple pages; for
the narrow character tests, the two viewers open their own source code
(view.c). The tests were run on an i386 machine running NetBSD 2.0.
Here are the results:
| | wcview | nwview |
|---|---|---|
| SIZE | 2152K | 1308K |
| RES | 2984K | 2164K |

| | wcview | nwview | tcview | ncview |
|---|---|---|---|---|
| SIZE | 1440K | 1128K | 504K | 328K |
| RES | 2496K | 1964K | 1108K | 1128K |
The lack of wide character support in the NetBSD implementation of the curses library limits its support for internationalized character sets, and thus limits the use of NetBSD in countries that use wide character sets. In particular, the problem is that the NetBSD curses internal character storage and the related routines assume an 8-bit character in each position. To add wide character support, we need to add a new storage structure and a new set of wide character routines as specified in the X/Open Reference.

    struct __ldata {
        wchar_t ch;      /* Character */
        attr_t attr;     /* Attributes */
        wchar_t bch;     /* Background character */
        attr_t battr;    /* Background attributes */
    };

This storage structure was initially designed for narrow characters. Although it uses the 32-bit wchar_t and attr_t data types, it does not support wide characters, because it assumes one character per position and has no character width field. In addition, it does not support non-spacing characters as specified in the X/Open Reference. Therefore, the storage structure must be changed to enable wide characters as well as non-spacing characters.
In addition to these newly added functions, some existing curses routines are shared by both wide and narrow characters, but assume an 8-bit character per location. Moreover, some functions directly access the storage cell when they add, insert, or delete, instead of using addch(), delch() or insch(). These functions should be identified and modified so that they work well with the wide character routines. These functions are listed in the following categories:
In order to add support for wide characters in the NetBSD curses library, the curses internal storage of characters and attributes (as defined in $SRC/lib/libcurses/curses_private.h) needs to be modified. In particular, one possible data structure (__ldata) describing each character position could contain the following:

    typedef struct nschar_t {
        wchar_t ch;             /* Non-spacing character */
        struct nschar_t *next;  /* Next non-spacing character */
    } nschar_t;

    struct __ldata {
        wchar_t ch;      /* Character */
        attr_t attr;     /* Attributes */
        wchar_t bch;     /* Background character */
        attr_t battr;    /* Background attributes */
        nschar_t *nsp;   /* Foreground non-spacing character pointer */
        nschar_t *bnsp;  /* Background non-spacing character pointer */
    };

In this storage structure, both the character value and the attribute are 32 bits. Such a data structure is generic enough to handle all wide characters as well as non-spacing characters. Besides, it should align nicely on 32-bit and 64-bit machines.

To save some memory, we don't have a separate character width field, because the width needs at most 3 bits. Instead, we use part of the attribute to specify the width. In particular, we currently use values in 0x03ffff00 as the standard attributes, so we could use the top bits (0xfc000000) to store the width. The narrow character routines could mask off the wide attributes part (input/output) and force the width to be one (input).

On each line, there is one storage cell for each column. For an m-column wide character, only the first storage cell holds the width of the character, and the remaining m-1 storage cells hold position information in their width fields. For example, if a 4-column character is added to the screen, 4 storage cells will be changed and their width fields will get the contents (4, -1, -2, -3). Then, if we later add a character at the 2nd, 3rd or 4th position, we know we are in the middle of a multi-column character and can easily clear the other cells.

In addition to non-spacing characters in the foreground, the X/Open reference indicates that background characters can also include non-spacing characters, so we must have two linked lists of non-spacing characters in each cell.
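
To make the width-in-attribute idea concrete, here is a minimal sketch in C. Only the two masks (0x03ffff00 and 0xfc000000) come from the text; the names `STD_ATTR_MASK`, `WCOL_MASK`, `WCOL_SHIFT`, `set_wcol()`, `get_wcol()` and `mark_cells()` are hypothetical, and the sketch assumes a 32-bit attribute word whose top six bits are read as a signed value so that the negative position markers survive a round trip.

```c
#include <stdint.h>

typedef uint32_t attr_t;            /* assumption: 32-bit attribute word */

#define STD_ATTR_MASK 0x03ffff00u   /* standard attributes (from the text) */
#define WCOL_MASK     0xfc000000u   /* top six bits hold the width (from the text) */
#define WCOL_SHIFT    26            /* hypothetical name for the field offset */

/* Store a signed width (-32..31) in the top six bits of an attribute. */
attr_t set_wcol(attr_t a, int w)
{
    return (a & ~WCOL_MASK) | (((uint32_t)w << WCOL_SHIFT) & WCOL_MASK);
}

/* Read the width back, sign-extending the six-bit field. */
int get_wcol(attr_t a)
{
    int w = (int)((a & WCOL_MASK) >> WCOL_SHIFT);   /* 0..63 */
    return (w >= 32) ? w - 64 : w;
}

/* Mark `width` consecutive cells as one multi-column character:
 * the first cell gets the width, the others get -1, -2, ... */
void mark_cells(attr_t *cells, int width)
{
    int i;

    for (i = 0; i < width; i++)
        cells[i] = set_wcol(cells[i], i == 0 ? width : -i);
}
```

Calling `mark_cells()` on four cells with a width of 4 leaves the width fields reading 4, -1, -2, -3, while the standard attribute bits of each cell are left untouched.
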
There are concerns that this new storage structure would more than quadruple memory use. First of all, the memory increase is not that large, because the current character storage already uses wchar_t and an integer type attr_t. The only possible additional memory comes from the linked list of non-spacing characters, which occur infrequently and whose number can be limited by an implementation to as few as five. So the memory increase is at most 2.75 times the current space in the worst case, assuming every character has five non-spacing characters attached and 32-bit pointers are used. In the most common cases, memory use increases by only 25%. On the other hand, such an increase is an inevitable price to pay if we want to enable wide character support in the curses library. We can provide storage structures for both narrow and wide characters, and make the choice a compile time option ("HAVE_WCHAR"), so that curses program developers can decide whether to use wide character support depending on their memory constraints.
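
The 25% and 2.75 figures can be reproduced with a little arithmetic. The sketch below is only one reading of the numbers: it assumes every field and pointer is 4 bytes wide and counts a single list pointer per cell, neither of which is stated explicitly in the text. The names `cell_bytes()` and `growth_percent()` are hypothetical.

```c
/* Sizes in bytes; all of them are assumptions for a 32-bit machine. */
enum {
    FIELD_BYTES  = 4,                /* wchar_t, attr_t and pointers */
    OLD_CELL     = 4 * FIELD_BYTES,  /* ch, attr, bch, battr */
    NSCHAR_BYTES = 2 * FIELD_BYTES   /* one list node: ch + next */
};

/* Bytes used by one cell that has `lists` list-head pointers and
 * `nsc` non-spacing characters attached to it. */
int cell_bytes(int lists, int nsc)
{
    return OLD_CELL + lists * FIELD_BYTES + nsc * NSCHAR_BYTES;
}

/* Growth over the old 16-byte cell, in percent. */
int growth_percent(int lists, int nsc)
{
    return (cell_bytes(lists, nsc) - OLD_CELL) * 100 / OLD_CELL;
}
```

Under these assumptions `growth_percent(1, 0)` gives 25 and `growth_percent(1, 5)` gives 275, that is, an increase of 2.75 times the old cell size. Passing `lists = 2` shows the effect of counting both the foreground and background pointers.
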
In addition, we need to define the complex character data structure cchar_t, required by the X/Open reference to be used in functions such as in_wchstr(). It includes a string of up to 8 wide characters and its length, an attribute, and a color-pair.

    #define CURSES_CCHAR_MAX 8

    struct cchar_t {
        attr_t attributes;               /* character attributes */
        unsigned elements;               /* number of wide characters in vals[] */
        wchar_t vals[CURSES_CCHAR_MAX];  /* wide characters including non-spacing */
    };

Note that we don't define the color-pair separately, because it is the __COLOR part of the attribute and can be extracted with the PAIR_NUMBER() macro.
In order to handle wide character input from the terminal in get_wch(), we need to add a circular buffer to the screen structure to keep the array of input characters, so that wide as well as narrow characters can be returned correctly by get_wch().

    #define MAX_CBUF_SIZE MB_LEN_MAX

    struct __screen {
        ...
        int cbuf_head;   /* head pointer to the circular input character buffer */
        int cbuf_tail;   /* tail pointer to the circular input character buffer */
        int cbuf_cur;    /* pointer to the current character in the buffer */
        mbstate_t sp;    /* wide character input processing state */
        int cbuf[ MAX_CBUF_SIZE ];   /* input character buffer */
    };

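
As an illustration of how such a circular buffer behaves, here is a minimal stand-alone sketch. It is not the library code: the real fields live in struct __screen, the `count` member replaces the `cbuf_cur` bookkeeping for simplicity, and the names `struct cbuf`, `cbuf_init()`, `cbuf_put()` and `cbuf_get()` are hypothetical.

```c
#include <limits.h>

#define MAX_CBUF_SIZE MB_LEN_MAX   /* as in the structure above */

/* Hypothetical stand-alone ring buffer for raw input characters. */
struct cbuf {
    int head;                  /* index of the oldest stored character */
    int tail;                  /* index where the next character goes */
    int count;                 /* number of characters currently stored */
    int data[MAX_CBUF_SIZE];
};

void cbuf_init(struct cbuf *b)
{
    b->head = b->tail = b->count = 0;
}

/* Append c; returns 0 on success, -1 if the buffer is full. */
int cbuf_put(struct cbuf *b, int c)
{
    if (b->count == MAX_CBUF_SIZE)
        return -1;
    b->data[b->tail] = c;
    b->tail = (b->tail + 1) % MAX_CBUF_SIZE;
    b->count++;
    return 0;
}

/* Remove and return the oldest character, or -1 if the buffer is empty. */
int cbuf_get(struct cbuf *b)
{
    int c;

    if (b->count == 0)
        return -1;
    c = b->data[b->head];
    b->head = (b->head + 1) % MAX_CBUF_SIZE;
    b->count--;
    return c;
}
```

Sizing the buffer with MB_LEN_MAX means it can always hold the bytes of one complete multibyte character in the current locale.
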
With the modified internal character storage data structure, the new wide character routines, as well as the routines that assume narrow characters and use the old storage structure, must be written or rewritten so that add, insert, input, delete, refresh, and cursor movement operations work properly. The major change is to (re)write the code to use the character width instead of assuming one character per position. All the wide character routines return an error if "HAVE_WCHAR" is not defined; similarly, all narrow character routines return an error when "HAVE_WCHAR" is defined. The good news is that all the underlying support routines already exist in the current NetBSD. They are declared in $SRC/include/wchar.h, although some of them do not yet have a man page.

    if wcwidth( ch ) == 0
        locate the current storage struct *lp
        add ch to the non-spacing characters list of lp
        return
    locate the current storage struct *lp
    clear the columns before current cursor location
    if wcwidth( ch ) > remaining space on the line
        clear to the end of line
        lp = struct of the first character of next line
    add ch in wcwidth( ch ) columns starting at lp
    clear the remaining columns of the 2nd character overwritten
    advance the cursor accordingly

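
The "clear the columns" steps in the pseudocode above rely on the (4, -1, -2, -3) width markers. The following sketch shows one way to blank a whole multi-column character when the cursor lands in the middle of it. The simplified `struct cell` with a separate `wcol` member and the function `clear_covering()` are hypothetical, chosen only to keep the example short.

```c
#include <wchar.h>

/* Hypothetical simplified cell: a character plus its width marker. */
struct cell {
    wchar_t ch;
    int wcol;   /* width in the first column, -1, -2, ... in the others */
};

/* Blank every cell of the character that covers column x.
 * Returns the first column of that character, or -1 on bad input. */
int clear_covering(struct cell *line, int ncols, int x)
{
    int start, w, i;

    if (x < 0 || x >= ncols)
        return -1;
    start = x;
    if (line[x].wcol < 0)
        start = x + line[x].wcol;   /* -k means k columns past the start */
    if (start < 0)
        return -1;
    w = line[start].wcol < 1 ? 1 : line[start].wcol;
    for (i = start; i < start + w && i < ncols; i++) {
        line[i].ch = L' ';
        line[i].wcol = 1;
    }
    return start;
}
```

Because each continuation cell records its distance from the first column, finding the start of the character is a single addition rather than a backward scan.
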
    compute number of spacing characters in str (len)
    if n != -1 && len > n
        truncate it to keep just the first n characters
        len = n
    locate the current storage struct *lp
    if str has non-spacing characters only ( len == 0 )
        add the non-spacing characters to the current spacing character under cursor
        return OK
    skip leading non-spacing characters
    clear the columns of the partial character before the cursor
    while there are characters in str
        c = current character
        if ( wcwidth( c ) == 0 )
            add c to the non-spacing character list
        else
            if wcwidth( c ) > remaining space on the line
                clear to the end of line
                return OK
            update *lp and the next wcwidth( c ) - 1 columns
            move lp to the next position
        move to the next character in str

    compute number of spacing characters in str (len)
    if ( n != -1 && len > n ) {
        truncate it to keep just the first n characters
        len = n
    }
    while there are characters in str {
        c = current character
        create a single wide char string wc with c
        setcchar( &cc, wc )
        add_wch( cc )
        move to the next character in str
    }

    add_wch( ch )
    refresh()

    if ch is a non-spacing character
        return add_wch( ch )
    cw = wcwidth( ch )
    locate the current storage struct (s)
    rem = partial character columns to the end of current character
    right shift all columns from (s + rem) to end-of-line by (cw - rem) columns
    clear the partial character column at the end-of-line
    update *s and the following (cw - 1) columns

    if string starts with a non-spacing character
        return ERR
    compute total width (w) and total number of spacing characters (len) of str
    if ( n > 0 && len > n )
        truncate the string to n wide characters
        len = n
    locate the current storage struct (s)
    rem = partial character columns to the end of current character
    right shift all columns from (s + rem) to end-of-line by (w - rem) columns
    clear the partial character column at the end-of-line
    update *s and the following (w - 1) columns with str

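
The "right shift" and "clear the partial character column at the end-of-line" steps of the two insert routines above can be sketched as follows. This is a simplified illustration, not the library code: `struct cell` and `open_gap()` are hypothetical, and the handling of a partial character at the insertion point (the `rem` adjustment) is left out.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical simplified cell: a character plus its width marker. */
struct cell {
    wchar_t ch;
    int wcol;   /* width in the first column, -1, -2, ... in the others */
};

/* Open a gap of cw blank columns at column x by shifting the rest of
 * the line right. Cells pushed past the right edge are dropped.
 * Returns 0 on success, -1 if the gap does not fit on the line. */
int open_gap(struct cell *line, int ncols, int x, int cw)
{
    int i, last, start;

    if (x < 0 || cw < 1 || x + cw > ncols)
        return -1;
    memmove(&line[x + cw], &line[x],
        (size_t)(ncols - x - cw) * sizeof line[0]);
    for (i = x; i < x + cw; i++) {
        line[i].ch = L' ';
        line[i].wcol = 1;
    }
    /* A multi-column character cut off by the right edge cannot be
     * displayed, so blank whatever is left of it. */
    last = ncols - 1;
    start = line[last].wcol < 0 ? last + line[last].wcol : last;
    if (start >= x + cw && start + line[start].wcol > ncols) {
        for (i = start; i < ncols; i++) {
            line[i].ch = L' ';
            line[i].wcol = 1;
        }
    }
    return 0;
}
```

After `open_gap()` returns, the caller would fill the gap with the new character and its width markers.
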
    for ( ;; ) {
        switch ( state ) {
        case NORM:
            read a character into cbuf[ cbuf_tail ]
            cbuf_cur = cbuf_tail = ( cbuf_tail + 1 ) % MAX_CBUF_SIZE
            state = ASSEMBLING
            break
        case BACKOUT:
            get the character from cbuf[ cbuf_cur ]
            cbuf_cur = ( cbuf_cur + 1 ) % MAX_CBUF_SIZE
            if no more character in cbuf
                state = ASSEMBLING
            break
        case ASSEMBLING:
            read a character c
            if ( EOF )
                if cbuf is empty
                    continue
                else
                    get the character from cbuf[ cbuf_cur ]
                    state = TIMEOUT
            else
                put c in cbuf[ cbuf_cur ]
                cbuf_tail = cbuf_cur = ( cbuf_cur + 1 ) % MAX_CBUF_SIZE
            break
        case WC_ASSEMBLING:
            read a character
            if ( EOF )
                if cbuf is empty
                    continue
                else
                    return the first known character
                    if cbuf is empty
                        state = NORM
                    else
                        state = BACKOUT
            else
                check for possible wide character sequence with mbrtowc()
                if it is a possible sequence
                    continue
                else
                    return the first known character/key
                    if cbuf is empty
                        state = NORM
                    else
                        state = BACKOUT
        default:
            return ERR;
        }
        if state == TIMEOUT or there is no matching
            mb_len = cbuf_tail < cbuf_cur ? MAX_CBUF_SIZE - cbuf_cur : cbuf_tail - cbuf_cur
            ret = mbrtowc( &wc, cbuf[ cbuf_cur ], mb_len, &sp )
            switch ( ret ) {
            case >= 0
                remove the wide character sequence from cbuf
                break
            case -1
                return the first known character
                break
            case -2
                cbuf_cur = ( cbuf_cur + mb_len ) % MAX_CBUF_SIZE
                state = WC_ASSEMBLING
            if cbuf is empty
                state = NORM
            else
                state = BACKOUT
            }
        else if key_entry[ c ] is not a leaf
            move to the next key_entry
        else
            return the function key
            if cbuf is empty
                state = NORM
            else
                state = BACKOUT
            wc = key_entry value
            return KEY_CODE_YES
    }
    if echoing is enabled {
        if wc is [LEFT] or [BS] or [DEL]
            move cursor back
            delch()
        else
            add_wch( wc )
    }
    if the window has been moved or modified
        refresh()
    return wc

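
The heart of the WC_ASSEMBLING state is that mbrtowc() can be fed a partial sequence and will report whether more bytes are needed. The sketch below isolates that one step. The function `feed_byte()` is hypothetical and takes one byte at a time, which is simpler than the buffer-based call in the pseudocode above.

```c
#include <string.h>
#include <wchar.h>

/* Feed one input byte to the multibyte decoder.
 * Returns  1 when *out holds a complete wide character,
 *          0 when the sequence is still incomplete,
 *         -1 when the bytes seen so far are invalid (state is reset). */
int feed_byte(mbstate_t *st, unsigned char byte, wchar_t *out)
{
    char c = (char)byte;
    size_t r = mbrtowc(out, &c, 1, st);

    if (r == (size_t)-2)
        return 0;                    /* need more bytes: keep assembling */
    if (r == (size_t)-1) {
        memset(st, 0, sizeof *st);   /* state is undefined after an error */
        return -1;
    }
    return 1;                        /* r is 0 (null character) or 1 */
}
```

A caller would zero an mbstate_t once, call setlocale(LC_CTYPE, "") so that the user's multibyte encoding is in effect, and then pass each byte read from the terminal to `feed_byte()` until it returns a non-zero value.
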
    n = 0
    while ( ( ret = get_wch( wc ) ) != WEOF ) {
        if ret == KEY_CODE_YES
            restore terminal settings
            continue
        if wc is an end-of-line or a newline character
            wstr[ n ] = L'\0'
            return wc
        if wc is an erase or kill character
            n = n > 0 ? n - 1 : 0
        else
            wstr[ n++ ] = wc
    }

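
The line-collection loop above can be tried out without a terminal. The sketch below is an assumption-laden stand-in: it reads from a wide string instead of calling get_wch(), handles only a single erase character, and the name `collect_line()` is hypothetical. It also adds a capacity check that the pseudocode does not show.

```c
#include <stddef.h>
#include <wchar.h>

/* Copy characters from `in` to `out` until a newline, carriage return
 * or the end of the input. An erase character removes the previously
 * stored character. `cap` is the capacity of `out`, terminator
 * included. Returns the number of characters stored. */
size_t collect_line(const wchar_t *in, wchar_t *out, size_t cap, wchar_t erase)
{
    size_t n = 0;

    if (cap == 0)
        return 0;
    for (; *in != L'\0' && *in != L'\n' && *in != L'\r'; in++) {
        if (*in == erase) {
            if (n > 0)
                n--;                 /* never step back past the start */
        } else if (n + 1 < cap) {
            out[n++] = *in;          /* extra input is silently dropped */
        }
    }
    out[n] = L'\0';
    return n;
}
```

With a backspace as the erase character, the input "abX", backspace, "c", newline is collected as "abc".
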
    locate the current storage struct lp
    return lp->ch

    locate the storage struct of the current location (s)
    locate the storage struct of the window edge (e)
    check the length of string from s to e
    if n > 0 and the length is longer than n
        e = s + n
    fp = s, cp = chstr
    while ( fp != e ) {
        *cp = fp->ch
        cp++, fp++
    }
    *cp = NULL

    locate the storage struct of the current location (s)
    locate the storage struct of the window edge (e)
    check the length of string from s to e
    if n > 0 and the length is longer than n
        e = s + n
    fp = s, cp = chstr
    while ( fp != e ) {
        *cp = fp->ch
        *cp |= attributes | color-pair
        cp++, fp++
    }
    *cp = NULL

ungetwc( wc, _cursesi_screen->infd )

    locate the storage struct of the current location (e)
    t2 = e, t1 = e + 1
    while ( t1 is not past the last character on the line ) {
        copy *t1 to *t2
        t1++, t2++
    }
    fill the rest of line with a blank character string

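
With multi-column characters, deleting means removing every column of the character under the cursor and shifting the rest of the line left by that many columns. Here is a minimal sketch of that idea, using a hypothetical simplified `struct cell` and a hypothetical `delete_at()` function rather than the real storage structure.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical simplified cell: a character plus its width marker. */
struct cell {
    wchar_t ch;
    int wcol;   /* width in the first column, -1, -2, ... in the others */
};

/* Delete the character that covers column x, shift the remainder of
 * the line left and blank the freed columns at the right edge.
 * Returns the number of columns removed, or -1 on bad input. */
int delete_at(struct cell *line, int ncols, int x)
{
    int w, i;

    if (x < 0 || x >= ncols)
        return -1;
    if (line[x].wcol < 0)
        x += line[x].wcol;           /* step back to the first column */
    if (x < 0)
        return -1;
    w = line[x].wcol < 1 ? 1 : line[x].wcol;
    if (x + w > ncols)
        w = ncols - x;
    memmove(&line[x], &line[x + w],
        (size_t)(ncols - x - w) * sizeof line[0]);
    for (i = ncols - w; i < ncols; i++) {
        line[i].ch = L' ';
        line[i].wcol = 1;
    }
    return w;
}
```

Using memmove() for the shift keeps the copy correct even though the source and destination ranges overlap.
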
    get character value ch
    get attributes a
    get color-pair c from attribute
    assemble a complex character from (ch, a, c)

    get character value ch
    get attributes a
    add color-pair c to attribute
    assemble a wide character with (ch, a)

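
The two conversions above amount to folding the color-pair into the attribute word and reading it back out. The sketch below shows this with a local copy of the complex character layout. The bit positions `PAIR_SHIFT` and `PAIR_MASK` are assumptions made for the example, and `struct cchar`, `pack_cchar()` and `cchar_pair()` are hypothetical names, not the X/Open setcchar() and getcchar() interfaces.

```c
#include <stdint.h>
#include <wchar.h>

#define CCHAR_MAX  8             /* CURSES_CCHAR_MAX in the text */
#define PAIR_SHIFT 17            /* assumed position of the color field */
#define PAIR_MASK  0x03fe0000u   /* assumed color bits within the attribute */

typedef uint32_t attr_t;         /* assumption: 32-bit attribute word */

struct cchar {
    attr_t attributes;           /* attributes with the color-pair folded in */
    unsigned elements;           /* number of wide characters in vals[] */
    wchar_t vals[CCHAR_MAX];     /* spacing character plus non-spacing ones */
};

/* Build a complex character from a wide string, attributes and a
 * color-pair number. Returns 0 on success, -1 if the string is empty
 * or too long. */
int pack_cchar(struct cchar *cc, const wchar_t *wch, attr_t attrs, unsigned pair)
{
    size_t len = wcslen(wch);

    if (len == 0 || len > CCHAR_MAX)
        return -1;
    cc->attributes = (attrs & ~PAIR_MASK) |
        (((attr_t)pair << PAIR_SHIFT) & PAIR_MASK);
    cc->elements = (unsigned)len;
    wmemcpy(cc->vals, wch, len);
    return 0;
}

/* Extract the color-pair number from a complex character. */
unsigned cchar_pair(const struct cchar *cc)
{
    return (cc->attributes & PAIR_MASK) >> PAIR_SHIFT;
}
```

Keeping the color-pair inside the attribute word is what lets the structure omit a separate color field.
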
I will try to make the implementation as fast as possible, so that it is usable on machines such as vax and m68k. I will also try to keep its memory use as small as possible for these types of machines. For example, I will try to make line comparison as efficient as possible. We will increase the hash size in __line and refresh() to include non-spacing characters. As an additional goal, I will try to make wide character support a compile time option, so that it can be omitted on boot media for small memory systems.
We will test our new library with a simple file viewer. The test program is borrowed from the ncurses test suite (view.c), with some modifications so that it specifically tests the new wide character functions. We will see whether it works properly with wide character files.
You are welcome to check out the current source code. Just don't forget to send me your bug reports (if any) and comments, either by email or on my blog. To check out the current source code,
|