NetBSD-SoC: Apropos replacement based on mandoc and SQLite's FTS
What is it?
Unix systems have a long-standing culture and philosophy, and adherence to it is one of the main reasons for their lasting success. Part of this culture is to provide documentation for each component of the operating system, whether it is a command-line utility, a system call, a library function, a configuration file, or anything else that should be documented to make the end user's life easier. This documentation ships with the base system in the form of manual pages (man pages for short), which can be accessed using the 'man' command.
A couple of utilities are also provided to search the documentation. apropos(1) searches for man pages, and the way it works is very simple: the NAME section of each man page is indexed in a file (typically named whatis.db), and apropos(1) searches this file for the keywords specified by the user.
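As a rough illustration of the classic behaviour described above, the following sketch searches a tiny in-memory stand-in for whatis.db. The index lines and the function name `classic_apropos` are made up for the example; the real file format and matching rules may differ.

```python
# Hypothetical stand-in for whatis.db: one "name (section) - description"
# line per man page. The entries are invented for this example.
WHATIS_LINES = [
    "ls (1) - list directory contents",
    "useradd (8) - add a user to the system",
    "open (2) - open or create a file for reading or writing",
]


def classic_apropos(keyword, lines):
    """Return the index lines that contain the keyword, ignoring case."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


# A keyword that appears in a NAME line is found ...
print(classic_apropos("user", WHATIS_LINES))
# ... but a word that only occurs in the body of a page is not,
# because the body was never indexed.
print(classic_apropos("account", WHATIS_LINES))
```

The second call returns nothing even though a page about user accounts exists, which is the limitation this project sets out to address.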
apropos(1) was designed with the resources (both hardware and software) of the early days in mind, but things have changed drastically since then. We now have far more resources available, and in the Google era it behooves us to rethink the design and implementation of apropos(1). It is now possible to implement apropos(1) so that it allows more extensive and flexible searches over the complete content of the man pages, rather than only the NAME section. More often than not we are unsure of the exact keywords to search for, and apropos(1) doesn't give us the right results (or gives no results at all), in which case we turn to Google.
The idea behind this project is to fix this problem by reimplementing apropos(1) with full text search capabilities, enhancing and modifying other man utilities in the process as required. We have decided to use the FTS engine of SQLite for this purpose.
Project Repository: The project is currently hosted on github: https://github.com/abhinav-upadhyay/apropos_replacement
Weekly Report 1: I know we have entered the third week of the coding period and I am late, but in my defense I was busy with exams during the first week and only started work on the 1st of June. The first interesting bit was to build the list of directories where man pages are stored on the file system. Joerg suggested adding this as a new option to man(1). It has a twofold benefit for our project:
- Most importantly, we need this information to build the man page index in the database.
- It also allows users to print the search path on their systems for debugging purposes.
Joerg and David were very helpful: they patiently reviewed different versions of the patch and answered my questions. I have sent the final version of the patch to Joerg, who will be committing it very soon. I have made a more detailed post on my blog about this.
Weekly Report 2: This was a more productive week. I worked on makemandb, which is responsible for traversing the set of directories returned by calling 'man -p' and parsing each of the man pages using libmandoc. It creates a new database in the present directory with the name 'apropos.db', and stores the parsed data in an FTS virtual table in it. Currently, it only parses the name of the man page, the one line description from the NAME section and the complete DESCRIPTION section and stores them in 3 columns in the table.
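The three-column layout described above can be sketched with Python's sqlite3 module. This is only an illustration, assuming the underlying SQLite build has FTS4 enabled; the table name `mandb`, the column names, and the sample rows are invented here and are not taken from makemandb.

```python
import sqlite3

# In-memory database for the sketch; makemandb writes to apropos.db instead.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one FTS virtual table with three text columns.
conn.execute(
    "CREATE VIRTUAL TABLE mandb USING fts4(name, name_desc, description)"
)

# Invented sample data standing in for the output of the man page parser.
pages = [
    ("ls", "list directory contents",
     "For each operand that names a file, ls displays its name."),
    ("useradd", "add a user to the system",
     "The useradd utility adds a new user account and creates a home directory."),
]
conn.executemany(
    "INSERT INTO mandb(name, name_desc, description) VALUES (?, ?, ?)", pages
)


def search(query):
    """Return (name, name_desc) for pages matching the full text query."""
    rows = conn.execute(
        "SELECT name, name_desc FROM mandb WHERE mandb MATCH ?", (query,)
    )
    return list(rows)


# Both words must occur somewhere in the row; they are found in the
# description column, which a NAME-only index would never have seen.
print(search("home directory"))
```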
There are some issues that have come up during the parsing, but for the moment we can ignore them and focus on getting the initial prototype of the project ready.
For more information, you may read my blog post: Weekly Report 2
Weekly Report 3: Fixed many bugs and added a basic ranking algorithm. This report showed some sample runs of the new apropos.
Midterm Report: Fixed some more bugs, improved performance, improved the ranking algorithm, and discussed some upcoming features.
Project Update 5: Lots of changes and improvements, and some regressions, since the last update report. Notable changes are:
- Indexing speed improvement: Indexing 7000+ pages in under a minute!
- Parsing man(7) pages as well (previously we only had support for mdoc(7))
- Compressing the database using zlib(3)
- Stopword tokenizer: A custom tokenizer to prevent any stopwords from being indexed.
- New Feature: New option to apropos to search within specific sections.
So now you can perform a search like this:
$ apropos -18 "add new user"
and apropos will search only in sections 1 and 8 :)
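To make the section option concrete, here is a small sketch of how a combined flag such as -18 could be split into individual sections and used to filter results. The function names and the sample hits are hypothetical, and this is not the argument parsing that apropos itself uses.

```python
def parse_section_flags(args):
    """Split arguments into section digits (from flags like -18) and the query."""
    sections = set()
    query_words = []
    for arg in args:
        digits = arg[1:]
        # Treat "-" followed only by the digits 1-9 as a list of sections.
        if arg.startswith("-") and digits and all(c in "123456789" for c in digits):
            sections.update(digits)
        else:
            query_words.append(arg)
    return sorted(sections), " ".join(query_words)


def filter_by_section(results, sections):
    """Keep only (name, section) results whose section was requested."""
    if not sections:
        return list(results)
    return [(name, sec) for name, sec in results if sec in sections]


sections, query = parse_section_flags(["-18", "add new user"])
# Invented search hits as (page name, section) pairs.
hits = [("useradd", "8"), ("getpwent", "3"), ("passwd", "1")]
print(sections, query)
print(filter_by_section(hits, sections))
```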
You may also be interested in reading the report I posted on the mailing list which is more detailed than this: http://mail-index.netbsd.org/tech-userlevel/2011/07/31/msg005310.html
Final Status Report: A final status report of the project: final-report.html
I have also uploaded man pages of the project in HTML format:
A blog post as well: http://abhinav-upadhyay.blogspot.com/2011/08/final-report-netbsd-gsoc-2011-apropos.html
- April 25, 2011: Community Bonding Period -- Students get to know mentors, read documentation, get up to speed to begin working on their projects.
- May 23, 2011: Students begin coding for their GSoC projects; Google begins issuing initial student payments
- July 11, 2011: Mentors and students can begin submitting mid-term evaluations.
- July 15, 2011: Mid-term evaluation deadline; Google begins issuing mid-term student payments provided passing student survey is on file.
- August 15, 2011: Suggested 'pencils down' date. Take a week to scrub code, write tests, improve documentation, etc.
- August 22, 2011: Firm 'pencils down' date. Mentors, students and organization administrators can begin submitting final evaluations.
- August 26, 2011: Final evaluation deadline; Google begins issuing student and mentoring organization payments provided forms and evaluations are on file.
Mandatory (must-have) components:
- A basic implementation of full text search by simply indexing the complete output of mandoc(1) as a single column in the FTS virtual table of sqlite3. This will get things started; further improvements can be made later on.
- Snippets of the matched results
- Better language support using the Porter Stemming Tokenizer.
- Integration with the existing man(1).
- A utility to update the index whenever new man pages are installed.
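Two of the items above, snippets and the Porter stemming tokenizer, can be tried out together in a short sketch. It assumes an SQLite build with FTS4 enabled; the table name `pages`, its columns, and the sample row are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# tokenize=porter makes FTS index word stems, so "users" and "user"
# are treated as the same term.
conn.execute("CREATE VIRTUAL TABLE pages USING fts4(name, body, tokenize=porter)")
conn.execute(
    "INSERT INTO pages(name, body) VALUES (?, ?)",
    ("useradd", "The useradd utility adds a new user to the system."),
)

# snippet() returns a fragment of the matching text with the matched
# terms wrapped in the given start and end markers.
row = conn.execute(
    "SELECT name, snippet(pages, '[', ']', '...') FROM pages WHERE pages MATCH ?",
    ("users",),
).fetchone()
print(row)
```

The query uses the plural "users" while the page only contains "user"; with the default tokenizer this search would not match.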
Should-have components:
- A ranking algorithm to improve the quality of results
- The algorithm might require more information about the man pages which can be obtained by using the mandoc(3) parser to parse the different sections of the man pages and store them in separate columns as required by the algorithm.
For example: We can parse the man pages and store the name and description sections in two separate columns. More weight can be given to matches found in the name section. The "See Also" section can also be put to good use.
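The idea of weighting matches by section can be sketched as follows. The weights, the function names, and the sample pages are assumptions made for illustration; this is not the ranking algorithm used by the project.

```python
# Assumed weights: a match in the name section counts twice as much
# as a match in the description section.
NAME_WEIGHT = 2.0
DESC_WEIGHT = 1.0


def score(page, terms):
    """Sum weighted term counts over the name and description sections."""
    name_words = page["name_desc"].lower().split()
    desc_words = page["description"].lower().split()
    total = 0.0
    for term in terms:
        total += NAME_WEIGHT * name_words.count(term)
        total += DESC_WEIGHT * desc_words.count(term)
    return total


def rank(pages, terms):
    """Return page names with a non-zero score, best match first."""
    terms = [t.lower() for t in terms]
    scored = [(score(p, terms), p["name"]) for p in pages]
    scored.sort(key=lambda item: -item[0])
    return [name for s, name in scored if s > 0]


# Invented sample pages.
pages = [
    {"name": "passwd", "name_desc": "modify a password",
     "description": "changes the password of a user"},
    {"name": "useradd", "name_desc": "add a user to the system",
     "description": "creates a new account"},
    {"name": "ls", "name_desc": "list directory contents",
     "description": "displays file names"},
]
print(rank(pages, ["user"]))
```

Here useradd ranks above passwd because its match is in the name section, while ls is dropped because it does not match at all.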
Optional (would-be-nice) components:
- Use the database to directly manage man page aliases by extracting the .Nm macros from the man pages.
- Support for synonyms list to improve the search.
- A web based interface through CGI
SQLite's FTS Engine Documentation
My GitHub Profile
My Complete Proposal
| Abhinav Upadhyay <er.abhinav.upadhyay at gmail dot com> |
| $Id: index.html,v 1.8 2011/08/24 17:18:01 abhinavupadhyay Exp $ |