Efficient wide character regular expressions

NetBSD Google Summer of Code

Description

Student:Matthias-Christian Ott
Mentor:Alistair Crooks

During this year’s Google Summer of Code I want to improve the performance of NetBSD’s regular expression library and add support to it for wide characters.

See also: Implementing efficient wide character regular expressions

Status

We have chosen TRE because of Ville Laurikari’s convincing master thesis and Russ Cox’s article on regular expressions. Ville agreed to release TRE under a 2-clause BSD licence and we already ported TRE to NetBSD. TRE and agrep will integrated into the NetBSD src repository soon.

At the moment only a performance benchmark which is based on realistic regular expressions and data to ensure that TRE is verifiably efficient is missing.

Timeline

20 April 2009 Project accepted and officially announced
26 April 2009 Published project page
6 May 2009 Added an overview of regular expression engines
23 May 2009 Initial import of the regular expression fuzzer
8 June 2009 Initial import of TRE
13 June 2009 Initial import of AT&T’s regression tests
18 June 2009 Added termcap and recursion support to agrep
30 June 2009 TRE builds as part of NetBSD’s source TRE
3 August 2009 Code posted for public review

See also: Google Summer of Code 2009 Timeline

Documentation

Overview of Regular Expression Engines

Name Licence Syntax Algorithm wchar_t support
Curie MIT POSIX (subset) non-backtracking NFA no (UTF-8 support)
dietlibc GNU GPL, version 2 POSIX (subset) non-backtracking NFA (?) no
glibc GNU LGPL, version 2.1 POSIX DFA yes
Hackerlab C Library GNU GPL, version 2 POSIX DFA (?) no (t_uchar support)
libregexp9 MIT or Lucent Public License regexp non-backtracking NFA no (UTF-8 support)
microregex BSD (3-clause) Perl non-backtracking NFA no
NetBSD libc BSD (2-clause) POSIX DFA no
Oniguruma BSD (2-clause) POSIX, Ruby ? no (support for various encodings)
PCRE BSD (3-clause) Perl backtracking NFA no
TRE GNU LGPL, version 2.1 POSIX with extensions backtracking, parallel and approximate parallel TNFA yes
T-Rex zlib/libpng T-Rex backtracking NFA yes (only UTF-8)
utf8regex GNU LGPL, version 2.1 POSIX DFA no (UTF-8 support)

See also:

Testing

I have developed a fuzzer for regular expressions, which randomly generates strings from POSIX-compliant regular expressions and tests whether the regular expression library accepts the string. Moreover, AT&T’s regession tests are used to test POSIX-conformance.

Further Reading

:wq


$Id: index.html,v 1.14 2009/08/19 18:21:03 hyperyl Exp $

Get NetBSD Summer of Code projects at SourceForge.net. Fast, secure and Free Open Source software downloads