Text summarization can be defined as a process that takes a document (Intellexer Summarizer) or a set of documents (Intellexer Multi-Document Summarizer) as input and outputs a shorter document (a summary) containing the most important content and main ideas of the input.

Intellexer Summarizer’s unique feature is the ability to create different kinds of summaries:

  • Theme-oriented: the output summary includes the sentences most relevant to a given topic (e.g. politics, economics, sports);
  • Structure-oriented: the summary content depends on the structure of the input document (e.g. scientific article, patent, news article);
  • Concept-oriented: the importance of each sentence is determined with respect to a number of user-defined concepts.
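
To make the concept-oriented mode more concrete, here is a minimal sketch that ranks sentences by how many user-defined concepts they mention. It is an illustration only and not Intellexer's implementation: the function names `concept_score` and `concept_summary`, the exact-word matching, and the sample sentences are all assumptions made for this example.

```python
import re

def concept_score(sentence, concepts):
    """Count how many of the given concepts appear as words in the sentence.

    Hypothetical scoring: plain word matching, no semantic analysis.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return sum(1 for c in concepts if c.lower() in words)

def concept_summary(sentences, concepts, k=1):
    """Return the k highest-scoring sentences, kept in their original order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: concept_score(sentences[i], concepts),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]

sentences = [
    "The central bank raised interest rates on Tuesday.",
    "The home team won the final match.",
    "Inflation and interest rates remain a concern for markets.",
]
print(concept_summary(sentences, ["inflation", "rates"], k=1))
```

With the concepts "inflation" and "rates", the third sentence mentions both and is selected first. A real system would match concepts semantically rather than by exact word.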

Intellexer Summarizer receives a source document and passes it to the Intellexer Preformator, which extracts plain text (along with text formatting information such as headers and links) and detects the document structure and language (using Intellexer Language Recognizer). The extracted text is passed to the Intellexer Linguistic Processor, which performs syntactic and semantic processing. Once the analysis is complete, the extracted information is passed back to the Intellexer Summarizer for summary generation. Drawing on the extracted facts and the deep semantic relations between them, summarization rules assign a value to each sentence of the original text. This value defines the importance of the sentence with respect to the main idea of the text.
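
The general idea of assigning a value to each sentence and keeping the highest-valued ones can be sketched in a few lines. The example below is a simplified stand-in and not the Intellexer pipeline: it replaces the linguistic and semantic analysis with plain word-frequency scoring, and every name in it (`split_sentences`, `tokenize`, `score_sentences`, `summarize`, `STOPWORDS`) is hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a real linguistic resource).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "on", "for"}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def score_sentences(sentences):
    """Score each sentence by the average document-wide frequency of its words."""
    freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        tokens = tokenize(s)
        scores.append(sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0)
    return scores

def summarize(text, max_sentences=1):
    """Keep the top-scoring sentences and return them in document order."""
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:max_sentences]
    return " ".join(sentences[i] for i in sorted(top))

text = "Summarization shortens text. Cats sleep. Summarization keeps important text."
print(summarize(text, max_sentences=2))
```

Frequency scoring is used here only because it fits in a short example. The rule-based semantic scoring described above would replace `score_sentences`, while the selection step would stay the same.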

For evaluation experiments, we created a dataset of more than 5000 documents, each paired with a manually written summary. On this collection, we achieved summarization precision and recall of 81% to 92% (typical competitors’ results: 75-89% in terms of precision and recall).
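
For readers unfamiliar with these metrics, the sketch below shows one common way to compute precision and recall for an extractive summary: compare the set of sentences chosen by the system with the set chosen by a human. The text above does not say at what level the metrics were measured, so the sentence-level comparison, the function name `precision_recall`, and the sample indices are all assumptions.

```python
def precision_recall(selected, reference):
    """Compare system-selected sentence indices with human-selected ones.

    precision = shared sentences / sentences the system selected
    recall    = shared sentences / sentences the human selected
    """
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    precision = overlap / len(selected) if selected else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical example: the system picked 4 sentences, the human picked 5, and 3 are shared.
p, r = precision_recall({0, 2, 5, 7}, {0, 2, 3, 5, 9})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```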

