Review

The web provides access to many searchable text databases (e.g. the 1988 Wall Street Journal). Given a term query, the user faces the problem of deciding which database to search. This question can be answered by constructing a term-frequency index (which the paper calls a "language model") of each database. This is simply a list of the terms that occur in the database, each associated with a frequency (the number of documents in which it occurs). The paper lists a variety of reasons why a database may not provide its term-frequency index to users.
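To make the notion of a term-frequency index concrete, here is a minimal sketch in Python. The function name `build_df_index`, the whitespace tokenizer, and the three toy documents are my own illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def build_df_index(documents):
    """Map each term to the number of documents in which it occurs."""
    index = Counter()
    for text in documents:
        # Assumption: lowercase whitespace tokenization. Using a set means
        # a term is counted once per document, however often it repeats.
        index.update(set(text.lower().split()))
    return index


# Toy data, for illustration only.
docs = [
    "stocks fall as rates rise",
    "rates rise again",
    "bond prices fall",
]
df_index = build_df_index(docs)
print(df_index.most_common(3))
```

With such an index for each database, a query term can be looked up in every index and the databases compared by how many of their documents contain it.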

The paper presents a method of building a term-frequency representation of a document database. The method is based on sampling. Specifically, it constructs a term-frequency index from a subdatabase that is obtained by repeatedly presenting single-term queries; to each query the system responds with the top n documents. The method involves choices such as how the single-term queries are selected, how many documents are retrieved for each term, and which stopping criterion is used. These parameters are indeed important; in other words, the paper examines the right aspects of the problem. The method is evaluated and shown to work in an experimental environment that uses three real and very different document databases. I got the sense that at least some of the questions addressed in the paper could have been answered, or at least verified, by statistical analysis. However, the paper does not do so.
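The sampling loop described above might be sketched as follows. This is only an illustration under stated assumptions: the `search` function is a hypothetical stand-in for a database's query interface, and choosing the next query term at random from the vocabulary learned so far is one possible selection policy, not necessarily the one the paper settles on. The parameters `n`, `max_docs`, and `max_queries` are likewise hypothetical names for the number of documents per query and the stopping criteria.

```python
import random
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization.
    return text.lower().split()


def search(database, term, n):
    """Hypothetical query interface: ids of the first n documents containing term."""
    hits = [i for i, doc in enumerate(database) if term in tokenize(doc)]
    return hits[:n]


def sample_language_model(database, seed_term, n=4, max_docs=300,
                          max_queries=1000, rng=None):
    """Estimate document frequencies from documents returned by single-term queries."""
    rng = rng or random.Random(0)
    seen = set()       # ids of the documents sampled so far
    model = Counter()  # term -> number of sampled documents containing it
    query = seed_term
    for _ in range(max_queries):
        for doc_id in search(database, query, n):
            if doc_id not in seen:
                seen.add(doc_id)
                model.update(set(tokenize(database[doc_id])))
        # Stop once enough documents are sampled, or if nothing was learned.
        if len(seen) >= max_docs or not model:
            break
        # Assumed policy: draw the next query term from the learned vocabulary.
        query = rng.choice(sorted(model))
    return model, seen


# Toy data, for illustration only.
database = [
    "stocks fall as rates rise",
    "rates rise again",
    "bond prices fall",
]
model, seen = sample_language_model(database, seed_term="rates", n=2, max_docs=3)
```

Because the model is built only from sampled documents, each estimated frequency is a lower bound on the true document frequency; how quickly the estimates become useful depends on exactly the parameters the paper studies.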

Overall, I liked the problem addressed by the paper, and I found the paper interesting and well written.

