Intellexer Preformator

Intellexer Preformator extracts plain text and layout information from documents in various formats (doc, pdf, rtf, html, etc.). It also automatically determines the document's structure (patent, news or scientific article, review, etc.) and its theme (economics, law, sports, etc.).

Preformator combines several natural language processing algorithms for document language detection and for structure and theme recognition. Its linguistic rules allow all of these elements to be recognized and the textual information to be processed.

The language detection algorithm is based on the vector space model. In this model, documents are represented as vectors in a multidimensional space whose components are N-grams with associated frequencies. The similarity between a document and a language category model is measured as the cosine of the angle between the corresponding vectors. Preformator determines not only the language of the whole document, but also the language of each text fragment.
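The idea can be illustrated with a minimal Python sketch. It is not Intellexer's implementation: the choice of character trigrams, the toy language profiles, and the names ngram_vector, cosine and detect_language are all assumptions made for illustration. A real system would build its profiles from large corpora.

```python
from collections import Counter
from math import sqrt

def ngram_vector(text, n=3):
    """Count character n-grams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy profiles built from one sentence per language.
PROFILES = {
    "english": ngram_vector(
        "the quick brown fox jumps over the lazy dog and the cat is on the table"
    ),
    "german": ngram_vector(
        "der schnelle braune fuchs jagt den faulen hund und die katze ist auf dem tisch"
    ),
}

def detect_language(text, profiles=PROFILES):
    """Return the profile whose vector is closest to the text's vector."""
    vec = ngram_vector(text)
    return max(profiles, key=lambda lang: cosine(vec, profiles[lang]))
```

Calling detect_language on each paragraph or sentence separately, rather than on the whole text, gives a per-fragment result in the same way.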

After the language recognition stage, Preformator detects the structure type and the theme of the document.

The structure detection algorithms combine statistical and rule-based approaches. The statistical part is based on the k-Nearest Neighbors algorithm (kNN). In kNN classification, the output is a class membership: an object is classified by a majority vote of its neighbors and is assigned to the class most common among its k nearest neighbors. The rule-based part uses a set of complex linguistic rules to detect additional structure patterns.
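The majority vote can be sketched in a few lines of Python. This is a generic kNN illustration, not Preformator's code: the term-count vectors, the labels, the use of cosine similarity as the nearness measure, and the names TRAINING and knn_classify are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-count dictionaries."""
    dot = sum(value * b.get(term, 0) for term, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical labeled examples: term counts per document, plus a structure type.
TRAINING = [
    ({"claim": 5, "embodiment": 3, "invention": 4}, "patent"),
    ({"claim": 3, "invention": 6, "apparatus": 2}, "patent"),
    ({"reported": 4, "yesterday": 2, "officials": 3}, "news"),
    ({"reported": 2, "officials": 4, "spokesman": 2}, "news"),
    ({"abstract": 2, "methodology": 3, "results": 4}, "research"),
    ({"results": 3, "hypothesis": 2, "methodology": 2}, "research"),
]

def knn_classify(query, training, k=3):
    """Assign the class most common among the k most similar examples."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    # On a tied vote, most_common keeps the label that was seen first,
    # which is the label of the nearer neighbor.
    return votes.most_common(1)[0][0]
```

For example, knn_classify({"invention": 2, "claim": 1, "results": 1}, TRAINING) finds two patent examples and one research example among the three nearest neighbors, so the vote goes to "patent".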

The rules are based on custom-built semantic thesauri that experts created for each structure type (patent, news or scientific article, review, etc.). The elements of these thesauri are also used as vector components in the kNN classification model.

For the evaluation experiments, we created a dataset of more than 30,000 documents. Some of the Preformator structure detection results are shown in the table below:

                Fiction Book   News Article   Patent   Research Paper   Review
Precision (%)   93             96             98       93               89
Recall (%)      80             97             95       84               97

Preformator uses a wide range of text features for document theme classification: elements from custom-developed linguistic dictionaries (covering more than 50 predefined topics) and the semantic relations between them, as detected by the Intellexer Linguistic Processor. On several widely used test collections (e.g., TREC Corpora, CoNLL test dataset, Reuters-21578), we achieved theme classification accuracy of over 87%.
