Intellexer SDK incorporates natural language processing tools for semantic analysis of unstructured text data.
Intellexer SDK has a modular, open architecture. It provides C/C++ and .NET programming interfaces and can be easily integrated into document and knowledge management systems.
Intellexer SDK modules are easily modifiable, and the linguistic knowledge base is scalable. This makes it possible to create flexible, customized solutions that can handle information in different data domains.
Intellexer Linguistic Database uses a custom indexing structure based on a deterministic finite automaton, which allows text to be processed more efficiently than with existing methods.
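The idea behind automaton-based indexing can be illustrated with a minimal sketch. The Python code below is not Intellexer's implementation, whose internals are not described here. It assumes the simplest possible case, a trie, which is a deterministic finite automaton that accepts a fixed set of words. The class `LexiconDFA`, its methods and the sample tags are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Optional


class LexiconDFA:
    """A trie: a deterministic finite automaton over characters.

    Hypothetical illustration only, not the Intellexer data structure.
    """

    def __init__(self) -> None:
        # transitions[state][char] -> next state; state 0 is the start state.
        self.transitions: List[Dict[str, int]] = [{}]
        # Accepting states carry a payload (here, a made-up tag string).
        self.payloads: Dict[int, str] = {}

    def add(self, word: str, payload: str) -> None:
        """Insert a word, creating new states only where no transition exists."""
        state = 0
        for ch in word:
            nxt = self.transitions[state].get(ch)
            if nxt is None:
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][ch] = nxt
            state = nxt
        self.payloads[state] = payload

    def lookup(self, word: str) -> Optional[str]:
        """Follow one transition per character; return the payload or None."""
        state: Optional[int] = 0
        for ch in word:
            state = self.transitions[state].get(ch)
            if state is None:
                return None  # no transition: the word is not in the lexicon
        return self.payloads.get(state)


lexicon = LexiconDFA()
lexicon.add("run", "VERB")
lexicon.add("runner", "NOUN")
print(lexicon.lookup("runner"))
print(lexicon.lookup("runn"))
```

In this sketch a lookup takes one transition per character, so its cost depends on the length of the word rather than on the number of entries in the lexicon, and words that share a prefix share states.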
Intellexer SDK combines different natural language processing algorithms to achieve the best results. Our hybrid approach to text analysis relies not only on linguistic and statistical information but also on a set of complex semantic rules. On several widely used test collections (e.g., TREC corpora, the CoNLL test dataset, Reuters-21578, and the Brown and Lancaster-Oslo/Bergen corpora), we achieved text processing accuracy of over 90% (typical competitors' results: 79-88% accuracy).
Intellexer SDK consists of the following modules:
- Linguistic Processor is the core of Intellexer SDK. It performs morphological, lexical, syntactic and semantic analysis of a text in three main stages: segmentation, tagging and parsing. The segmenter splits the text into sentences and words. The tagger annotates words with part-of-speech tags (verbs, nouns, adjectives, adverbs, etc.). The parser performs syntactic analysis of the document and extracts more than 90 types of semantic relations, for example Color, Direction, Degree, Effectiveness, Materials, Location, Temperature and Structure.
- Sentiment Analyzer is a powerful and efficient solution that automatically extracts sentiments (positivity/negativity), opinion objects (e.g., product features with associated sentiment phrases) and emotions (liking, anger, disgust, etc.) from unstructured text data.
- Named Entity Recognizer identifies and classifies elements in text into predefined categories such as personal names, names of organizations, position/occupation, nationality, geographical location, date, age, duration and names of events. This module also identifies relations between named entities.
- Summarizer automatically generates a summary (short description) of a document that conveys its main ideas. A unique feature of Intellexer Summarizer is the ability to create different kinds of summaries: theme-oriented (e.g., politics, economics, sports), structure-oriented (e.g., scientific article, patent, news article) and concept-oriented.
- Multi-Document Summarizer with Related Facts automatically generates a single summary (short description) of multiple documents that conveys their main ideas. It also detects the most important facts related to the selected document concepts (this feature is called Related Facts).
- Comparator accurately compares documents of any format and determines the degree of similarity between them.
- Categorizer classifies documents into user-defined categories.
- Clusterizer arranges documents, or an array of terms from the given documents, into a hierarchy of clusters.
- Question-Answering System lets users ask questions in natural language and extracts the answers from a database of documents.
- Question Comparison Tool compares two questions on the syntactic-semantic level and returns a measure of their proximity.
- Natural Language Interface transforms natural language queries into Boolean queries, expanding them with synonyms, paraphrases and alternative ways of combining the terms.
- Preformator extracts plain text and information about the text layout from documents of different formats (doc, pdf, rtf, html, etc.). Preformator can also automatically identify the structure (patent, news or scientific article, review, etc.) and topic of input documents, choosing from more than 50 predefined themes from various data domains (for example, chemistry, ecology and information technology).
- Language Recognizer identifies the language and character encoding of incoming documents. It supports more than 30 languages, covering major European and Asian languages.
- Spellchecker automatically corrects spelling errors using statistical and linguistic rules and resources, including: rules for context-dependent misspellings; rules for evaluating the probability of possible corrections; rules for handling misspellings caused by the different ways that letters of the alphabet can represent sounds; and dictionaries of correct spellings.
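
To make the Natural Language Interface item above more concrete, here is a minimal sketch of turning a natural language query into a Boolean query with synonym expansion. It is only an assumption about how such a transformation might look and does not use or reproduce the Intellexer API. The function `to_boolean_query`, the synonym table and the stop-word list are all hypothetical, and paraphrase handling is left out.

```python
from typing import Dict, List, Set

# Hypothetical resources; a real system would draw on a much larger
# linguistic knowledge base.
SYNONYMS: Dict[str, List[str]] = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}
STOP_WORDS: Set[str] = {"a", "an", "the", "find", "me", "for", "of", "in", "to"}


def to_boolean_query(question: str) -> str:
    """Build an AND of content terms, each expanded into an OR of synonyms."""
    groups: List[str] = []
    for raw in question.split():
        term = raw.strip("?.,!").lower()
        if not term or term in STOP_WORDS:
            continue  # drop punctuation-only tokens and function words
        alternatives = [term] + SYNONYMS.get(term, [])
        if len(alternatives) > 1:
            groups.append("(" + " OR ".join(alternatives) + ")")
        else:
            groups.append(term)
    return " AND ".join(groups)


print(to_boolean_query("Find me a cheap car"))
# (cheap OR inexpensive OR affordable) AND (car OR automobile OR vehicle)
```

Joining synonym groups with AND keeps every content word of the query mandatory while letting each one match any of its alternatives, which is one common way to widen recall without dropping query terms.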