Using IR Techniques for Text Classification in Document Analysis.

Rainer Hoch: Using IR Techniques for Text Classification in Document Analysis. SIGIR 1994: 31-40
@inproceedings{DBLP:conf/sigir/Hoch94,
  author    = {Rainer Hoch},
  editor    = {W. Bruce Croft and
               C. J. van Rijsbergen},
  title     = {Using IR Techniques for Text Classification in Document Analysis},
  booktitle = {Proceedings of the 17th Annual International ACM-SIGIR Conference
               on Research and Development in Information Retrieval. Dublin,
               Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum)},
  publisher = {ACM/Springer},
  year      = {1994},
  isbn      = {3-540-19889-X},
  pages     = {31-40},
  ee        = {db/conf/sigir/Hoch94.html},
  crossref  = {DBLP:conf/sigir/94},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

This paper presents the INFOCLAS system, which applies statistical methods of information retrieval to the classification of German business letters into corresponding message types such as order, offer, enclosure, etc. INFOCLAS is a first step towards the understanding of documents proceeding to a classification-driven extraction of information. The system is composed of two main modules: the central indexer (extraction and weighting of indexing terms) and the classifier (classification of business letters into given types). The system employs several knowledge sources, including a letter database, word frequency statistics for German, lists of message type specific words, morphological knowledge, and the underlying document structure. As output, the system evaluates a set of weighted hypotheses about the type of the actual letter. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
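The abstract does not state the weighting formula or decision rule used by INFOCLAS, so the following is only a minimal sketch of the general indexer/classifier split it describes, not the system's actual method. It assumes a generic smoothed tf-idf weighting and cosine similarity, and it leaves out the morphological and document-structure knowledge sources. The toy letters (English stand-ins for German text) and the names `TRAINING`, `build_index`, and `classify` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy "letter database": message type -> example letters.
TRAINING = {
    "order": ["we hereby order ten printers", "please deliver the order by friday"],
    "offer": ["we offer you a discount on printers", "our offer is valid until friday"],
}


def tokenize(text):
    """Lowercase whitespace tokenizer (no morphological analysis)."""
    return text.lower().split()


def build_index(training):
    """Indexer: extract terms and weight them per message type.

    Assumption: smoothed tf-idf, where each message type counts as one document.
    """
    term_counts = {
        label: Counter(term for doc in docs for term in tokenize(doc))
        for label, docs in training.items()
    }
    n_types = len(term_counts)
    # Number of message types in which each term occurs.
    type_freq = Counter(term for counts in term_counts.values() for term in counts)
    return {
        label: {
            term: tf * (math.log((1 + n_types) / (1 + type_freq[term])) + 1.0)
            for term, tf in counts.items()
        }
        for label, counts in term_counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(letter, index):
    """Classifier: return weighted (message type, score) hypotheses, best first."""
    query = Counter(tokenize(letter))
    scores = [(label, cosine(query, vector)) for label, vector in index.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


index = build_index(TRAINING)
hypotheses = classify("we would like to order two printers", index)
```

Here `hypotheses` is a ranked list of message types with similarity scores, which mirrors the "set of weighted hypotheses" mentioned in the abstract. Returning the full ranking rather than a single label keeps lower-ranked types available to a later analysis step.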

Copyright © 1994 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.


Printed Edition

W. Bruce Croft, C. J. van Rijsbergen (Eds.): Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum). ACM/Springer 1994, ISBN 3-540-19889-X

Online Edition: ACM Digital Library

Referenced by

  1. Markus Tresch, Neal Palmer, Allen Luniewski: Type Classification of Semi-Structured Documents. VLDB 1995: 263-274
  2. Markus Tresch, Allen Luniewski: An Extensible Classifier for Semi-Structured Documents. CIKM 1995: 226-233

Last update Fri May 25 08:37:46 2012 CET by the DBLP Team. This material is Open Data released under the ODC-BY 1.0 license.