NetBSD-SoC: Apropos replacement based on mandoc and SQLite's FTS: Final Status Report
The main objective of the project was to develop a replacement
for apropos(1) that provides a better search experience. We often
face a problem whose solution is documented
in some man page, but for lack of a good search tool we either turn
to Google or seek the advice of an expert. The aim of this project was to
develop such a search tool, one that points the user towards the solution.
Deliverables Proposed & Delivered
- A utility for parsing and indexing the man pages. (makemandb.c)
- A utility for searching the index thus created. (apropos.c)
- A ranking algorithm to find more relevant results.
- A mechanism to update the index when new man pages are installed or old
ones are removed.
- Using the database to manage the man page aliases.
- A library-like interface to build applications on top of it.
- Documentation in the form of man pages.
Deliverables Proposed & Not Delivered
- I proposed to provide line numbers or references to specific sections of
the man pages in the search results, but at the time of implementation it
did not seem trivial to do.
- A CGI based interface: I did not have enough time left at the end to try
this out. However, the groundwork has been done in the form
of a library-like interface and a function run_query_html(), which provides
the search results in the form of an HTML fragment. So it should be trivial
to write a CGI application to perform the searches from a web browser.
Details About The Deliverables Produced
There are two command line utilities, 'makemandb' and 'apropos'. You
first need to build the Full Text Search (FTS) index using makemandb(1), and then
you can use apropos(1) (the one provided by this project) to perform searches.
- Simply running makemandb will build the FTS index and tell you
the number of pages indexed. Some pages might fail to be indexed along
the way, which is indicated by error messages on the screen, but
that is nothing to worry about.
NOTE: The default behavior of makemandb is to update the index incrementally. That is to
say, it adds only those pages to the index which it did not
have previously, and it removes those pages from the index which
are no longer on the file system. Of course, if there is no existing index,
it will build it from scratch.
makemandb supports the following options:
[-f]: The option 'f' will tell makemandb(1) to prune the existing index
(if there exists one) and rebuild the database from scratch.
[-l]: The option 'l' will tell makemandb(1) to limit the indexing to only
the NAME section of the man pages. This option can be used to mimic the
behavior of the "classical apropos" although with improved search
capabilities. This option might be useful if you want to save a few MB of
disk space.
[-o]: The option 'o' is for optimizing the index. makemandb(1) will try
to optimize the FTS index for faster search performance and also
reorganize the stored data to reduce disk space usage.
makemandb also builds and maintains an aliases table for managing the man
page aliases which are scattered through the file system in the form of
symlinks or hardlinks. I have provided a patch to man.c so that man(1)
looks up this table to identify the target page which it needs to render.
Thus, it should be possible to get rid of these symlinks and hardlinks.
- Once you have built the database, you can run apropos(1) and
pass it a query to do a search. For example:
$ apropos "add a new user"
apropos supports the following options:
[-1234569]: You can pass section numbers as options to apropos, which
restricts the search to the specified set of sections.
[-p]: By default apropos(1) will display the top 10 ranked results on
stdout. If you would like to see more results, use the 'p' option. It
makes apropos(1) display all the results and pipe them
to a pager (more(1)).
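The incremental update behavior of makemandb described in the NOTE above can be pictured as a set comparison between the pages already in the index and the pages currently on the file system. The following Python sketch only illustrates that idea and is not the code in makemandb.c; the function name plan_incremental_update and the use of file paths as page identifiers are assumptions made for the example.

```python
def plan_incremental_update(indexed_pages, pages_on_disk):
    """Decide which pages to add to and remove from the index.

    Both arguments are iterables of page identifiers. File paths are
    used here, which is an assumption of this sketch and not
    necessarily how makemandb identifies a page.
    """
    indexed = set(indexed_pages)
    on_disk = set(pages_on_disk)
    to_add = sorted(on_disk - indexed)     # on disk, but not indexed yet
    to_remove = sorted(indexed - on_disk)  # indexed, but gone from disk
    return to_add, to_remove


if __name__ == "__main__":
    # Hypothetical example data.
    indexed = ["/usr/share/man/man1/ls.1", "/usr/share/man/man1/old.1"]
    on_disk = ["/usr/share/man/man1/ls.1", "/usr/share/man/man8/useradd.8"]
    to_add, to_remove = plan_incremental_update(indexed, on_disk)
    print("add:", to_add)
    print("remove:", to_remove)
```

With an empty index every page on disk ends up in the list of pages to add, which corresponds to building the index from scratch.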
A Library Like Interface
Besides the two command line tools, I have also developed a very small
library that allows building search applications on top of the FTS index built
by makemandb. For more detailed documentation, you can read the man pages of the individual
functions. It has the following public functions:
init_db(): To initialize a connection to the database. It takes care of
registering some custom functions with the connection, and it will also
create the database schema if the database file does not exist and
you passed the right flags.
run_query(): To run a query as entered by the user and process the rows
obtained in a callback function (apropos.c uses it).
run_query_html(): Similar to run_query() but it formats the results
obtained in the form of an HTML fragment. This can be used to build a CGI
application to do searches from a browser.
run_query_pager(): Similar to run_query_html() but it formats the results
so that the matching text appears highlighted when piped to a pager.
apropos.c uses it when the -p option is specified.
close_db(): To close the database connection and release any resources.
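To give a feel for what "results in the form of an HTML fragment" could look like, here is a small Python sketch. It is not run_query_html() and makes no claim about the markup that function actually produces; the function name results_to_html and the (name, section, snippet) tuple layout are hypothetical and chosen only for illustration.

```python
import html


def results_to_html(results):
    """Format search results as an HTML fragment (an unordered list).

    `results` is an iterable of (name, section, snippet) string tuples.
    This layout is an assumption of the sketch, not the library's format.
    """
    items = []
    for name, section, snippet in results:
        # Escape everything so that text taken from a man page cannot
        # break the surrounding markup.
        items.append(
            "<li><b>{}({})</b>: {}</li>".format(
                html.escape(name), html.escape(section), html.escape(snippet)
            )
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    # Hypothetical result row.
    print(results_to_html([("useradd", "8", "add a user to the system")]))
```

A CGI application would embed such a fragment into a complete page before sending it to the browser.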
Requirements For Building & Running
The following are required for building and running it on NetBSD:
- -CURRENT version of NetBSD (or at least -CURRENT man pages and -CURRENT
version of man(1) ).
- libmandoc from mdocml.
Some Personal Remarks
- makemandb is able to index the 7683 manual pages installed on my machine in under 40 seconds. (I am running NetBSD -CURRENT
on an i686 machine with 3GB RAM.)
- The size of the generated database obviously depends on the number of pages being indexed.
With 7683 pages and the optimization option (-o) of makemandb enabled, the size of the database was 30 MB.
- Without the optimization option enabled, the size is around the 45 MB mark.
- If you use the -l option of makemandb, then only the NAME section is indexed, which
results in a database of around 3 MB (7683 pages).
- The quality of the search results depends highly on the keywords used for the search.
Manual pages contain technical documentation and thus use a specialized vocabulary, so it would be wrong to
expect the search quality of Google/Yahoo/Bing etc. It might be worth investigating in the coming days the possibility of adding
support for a list of synonyms.
I owe a big chunk of the success to my mentor Joerg Sonnenberger, who was always
there to answer my questions, offer advice and review the code. I have learnt
a great deal from him and I am sure I have improved as a programmer. The best
thing about working with him was that he never really disclosed the solution;
instead he gently guided me in the direction of the solution, so I never
lost a learning opportunity :-)
David Young also offered valuable guidance during the project. He provided some
clever insights and tips to improve the search and ranking of the results.
It was his idea that led me to decompose the database into more columns,
based on the different sections in a man page.
Thanks to Kristaps Dzonsons as well who is responsible for the mdocml project.
He also reviewed the code related to parsing of the pages and pointed out bugs
in the code. I implemented makemandb based on his utility "mandocdb", so that
was also a huge help.
Special thanks go to Thomas Klausner for reviewing the man pages I wrote
and also providing patches for the mistakes I had made in them.
I must also thank Julio Merino, Jan Schaumann, Jukka Ruohonen and S.P.Zeidler
for the interest they showed in the project and the help they offered throughout :-)
And thanks to lots of other people in the community as well whose names I
forgot to mention. It was encouraging to see responses to each status report
I made, and they kept me excited.
I thoroughly enjoyed my experience while working on this project. I
would definitely like to continue working in the NetBSD community, in fact I
was discussing with Joerg about some of the projects I could work on. I have
interest in systems programming but not enough knowledge, but I don't mind
| Abhinav Upadhyay <er.abhinav.upadhyay at gmail dot com> |
| $Id: final-report.html,v 1.3 2011/08/25 20:41:41 abhinavupadhyay Exp $ |