Intellexer Linguistic Processor

Intellexer Linguistic Processor parses the input text and extracts several kinds of relations: syntactic (noun phrases, verb phrases, adjectival and adverbial phrases, etc.) and semantic (subject-verb-object; Color, Direction, Degree, Effectiveness, etc.). The output of Linguistic Processor is a semantic tree in which the links between sentence elements are labeled with semantic relation types.

Below is an example of the semantic tree for the source sentence: "The company offers innovative computer-based testing and Internet-based testing solutions."

To achieve the best results, Intellexer Linguistic Processor combines three main text processing stages: segmentation, tagging and parsing.

The segmentation process consists of two parts: word boundary disambiguation and sentence boundary disambiguation. Both parts are based on regular expression rules. Our rule-based algorithms achieve word boundary disambiguation performance above 99% and sentence boundary disambiguation accuracy above 92% on different test corpora.
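To make the idea of regular-expression segmentation concrete, here is a minimal sketch in Python. The rules, the abbreviation list and the function names are hypothetical and chosen for illustration only; they are not Intellexer's actual rule set, which is not published here.

```python
import re

# Hypothetical abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc.", "inc."}

# Word rule: word characters with optional internal hyphens or apostrophes
# ("computer-based", "don't"), or any single punctuation mark.
WORD_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

# Candidate sentence boundary: '.', '!' or '?' followed by whitespace
# and an uppercase letter.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text at candidate boundaries that do not follow an abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY_RE.finditer(text):
        candidate = text[start:match.start()]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def split_words(sentence):
    """Return the word and punctuation tokens of one sentence."""
    return WORD_RE.findall(sentence)
```

A production rule set would need many more cases (numbers, URLs, quotations, lowercase sentence starts); this sketch only shows how the two disambiguation steps can be separated.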

The tagger annotates each word in the sentence with a single part-of-speech (POS) tag. The tagger uses a hybrid architecture that combines statistical and rule-based approaches.

The statistical tagger is based on a modified supervised second-order Markov model, trained on a part-of-speech annotated corpus of patents, technical articles and web pages. The model contains statistics of unigram, bigram, and trigram co-occurrences of tags and words, as well as suffix statistics that are used to predict tags for unknown words. The training corpus was created by our experts and contains more than 20 million token-tag pairs.
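The kind of statistics described above can be sketched as follows. This is a simplified illustration, not the actual model: it only counts tag n-grams, word-tag pairs and suffix-tag pairs from a toy corpus, and the function names, the `<s>` padding symbol, the suffix length limit and the default tag are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def collect_statistics(tagged_sentences, max_suffix=3):
    """Count tag unigrams/bigrams/trigrams, word-tag pairs and suffix-tag pairs."""
    tag_ngrams = {1: Counter(), 2: Counter(), 3: Counter()}
    word_tag = Counter()
    suffix_tag = defaultdict(Counter)
    for sentence in tagged_sentences:
        # Pad with two start symbols so every tag has a two-tag history.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for i in range(2, len(tags)):
            tag_ngrams[1][tags[i]] += 1
            tag_ngrams[2][(tags[i - 1], tags[i])] += 1
            tag_ngrams[3][(tags[i - 2], tags[i - 1], tags[i])] += 1
        for word, tag in sentence:
            word = word.lower()
            word_tag[(word, tag)] += 1
            for n in range(1, min(max_suffix, len(word)) + 1):
                suffix_tag[word[-n:]][tag] += 1
    return tag_ngrams, word_tag, suffix_tag


def guess_tag(word, suffix_tag, max_suffix=3, default="NN"):
    """Guess a tag for an unknown word from its longest known suffix."""
    word = word.lower()
    for n in range(min(max_suffix, len(word)), 0, -1):
        counts = suffix_tag.get(word[-n:])
        if counts:
            return counts.most_common(1)[0][0]
    return default
```

In a full second-order model these counts would be turned into smoothed probabilities and combined by a Viterbi-style search over tag sequences; that step is omitted here.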

The rule-based tagger corrector module is used to improve the statistical tagger's results. It contains more than 1000 linguistic rules designed specifically to work with tagged text.
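As an illustration of how a corrector can operate on already tagged text, the sketch below applies context-sensitive retagging rules in the style of transformation-based tagging. The two rules, the `Rule` class and the helper names are hypothetical examples, not rules taken from the Intellexer rule base.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]  # list of (word, tag) pairs


@dataclass
class Rule:
    from_tag: str                            # tag the rule may replace
    to_tag: str                              # tag it is replaced with
    condition: Callable[[Tagged, int], bool]  # context test at position i


def prev_tag_is(tag):
    """Build a condition that checks the tag of the preceding word."""
    return lambda sent, i: i > 0 and sent[i - 1][1] == tag


# Hypothetical example rules.
RULES = [
    Rule("VB", "NN", prev_tag_is("DT")),  # a base verb after a determiner is a noun
    Rule("NN", "VB", prev_tag_is("TO")),  # a noun after "to" is a base verb
]


def correct(sentence, rules=RULES):
    """Apply each rule, in order, to every position of the tagged sentence."""
    result = list(sentence)
    for rule in rules:
        for i, (word, tag) in enumerate(result):
            if tag == rule.from_tag and rule.condition(result, i):
                result[i] = (word, rule.to_tag)
    return result
```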

To evaluate the effectiveness of the tagger algorithms, we carried out experiments on the Brown Corpus and the Lancaster-Oslo/Bergen Corpus. On these corpora the tagger achieved accuracy above 97%.

After tagging, the annotated sentence is passed to the parser, which builds a parse tree for the sentence and extracts relations. The parser uses a complex model based on semantic rules, statistically collected data and the WordNet lexical database. Parsing is performed bottom-up via finite-state cascades, for which a set of rules is written by linguists. Statistical and semantic modules are called on demand wherever the corresponding disambiguation is required.
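Bottom-up parsing with a finite-state cascade can be sketched as a sequence of levels, each of which groups adjacent nodes into a larger phrase using a regular expression over their labels. The three-level grammar below, the node representation and the function names are assumptions made for this example; the real rule set, the statistical and semantic modules and the WordNet lookups are not modeled.

```python
import re

# Each level is (phrase label, regex over labels, each followed by one space).
# A hypothetical toy grammar, applied from the lowest level upward.
CASCADE = [
    ("NP", re.compile(r"(?:DT )?(?:JJ )*(?:NN )+")),
    ("VP", re.compile(r"VBZ NP ")),
    ("S", re.compile(r"NP VP ")),
]


def apply_level(nodes, label, pattern):
    """Group the longest matching run of nodes at each position into a phrase."""
    labels = [node[0] for node in nodes]
    out, i = [], 0
    while i < len(nodes):
        encoded = "".join(l + " " for l in labels[i:])
        match = pattern.match(encoded)  # anchored at a node boundary
        if match and match.end() > 0:
            width = match.group(0).count(" ")  # number of nodes consumed
            out.append((label, nodes[i:i + width]))
            i += width
        else:
            out.append(nodes[i])
            i += 1
    return out


def parse(tagged_sentence):
    """Run every cascade level over a list of (word, tag) pairs."""
    # Leaves are (tag, word); phrases are (label, [children]).
    nodes = [(tag, word) for word, tag in tagged_sentence]
    for label, pattern in CASCADE:
        nodes = apply_level(nodes, label, pattern)
    return nodes
```

Because each level only sees the output of the level below it, no recursion or backtracking across levels is needed, which is what keeps cascade parsers of this kind fast.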

The semantic part of Linguistic Processor consists of more than 1500 custom-developed rules. For the evaluation experiments, we created a dataset manually parsed by our experts. On this collection we achieved syntactic parser precision and recall above 94% and semantic relation extractor precision and recall above 79% (typical competitors' results: syntactic parser 85-89% and semantic relation extractor 70-75% in terms of precision and recall).
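For readers unfamiliar with the metrics, the sketch below shows how precision and recall can be computed by comparing extracted relations against a manually annotated gold set. Representing a relation as a (head, dependent, type) tuple is an assumption of this example, as are the relation type names; it does not describe the actual evaluation setup.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted relations against gold relations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations and extractor output for one sentence.
gold = {
    ("offers", "company", "subject"),
    ("offers", "solutions", "object"),
    ("solutions", "innovative", "attribute"),
    ("solutions", "testing", "attribute"),
}
predicted = {
    ("offers", "company", "subject"),
    ("offers", "solutions", "object"),
}
precision, recall = precision_recall(predicted, gold)
```

Here every extracted relation is correct, but only half of the gold relations were found, so precision is 1.0 and recall is 0.5.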