Linguistic Processor

Intellexer Linguistic Processor parses the input text and extracts multiple kinds of relations: syntactic (noun phrases, verb phrases, adjectival and adverbial phrases, etc.) and semantic (subject-verb-object; Color, Direction, Degree, Effectiveness, etc.). The output of Linguistic Processor is a semantic tree in which the links between sentence elements are labeled with semantic relation types.

Below is an example of the semantic tree for the source sentence: The company offers innovative computer-based testing and Internet-based testing solutions.

To achieve the best results, Intellexer Linguistic Processor combines three main text processing stages: segmentation, tagging and parsing.

The segmentation process consists of two parts: word boundary disambiguation and sentence boundary disambiguation. Both parts are based on regular expression rules. On different test corpora, our rule-based algorithms achieve word boundary disambiguation performance over 99% and sentence boundary disambiguation accuracy over 92%.
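To illustrate what a regular expression rule for sentence boundary disambiguation can look like, here is a minimal C++ sketch. It is not Intellexer's rule set: the function name `splitSentences` and the single rule it uses (terminal punctuation followed by whitespace and an uppercase letter) are assumptions made for this example, and a production rule set would also need to handle abbreviations, quotations and similar cases.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split text into sentences with one regex rule.
// A boundary is a run of '.', '!' or '?' that is followed by whitespace
// and an uppercase letter. Punctuation inside numbers ("3.5") is skipped
// because no whitespace follows it.
std::vector<std::string> splitSentences(const std::string& text)
{
    static const std::regex boundary(R"([.!?]+(?=\s+[A-Z]))");
    std::vector<std::string> sentences;
    std::size_t start = 0;

    for (std::sregex_iterator it(text.begin(), text.end(), boundary), end; it != end; ++it)
    {
        // Cut right after the punctuation that closes the sentence.
        const std::size_t stop = static_cast<std::size_t>(it->position() + it->length());
        sentences.push_back(text.substr(start, stop - start));

        // Skip the whitespace that separates two sentences.
        start = text.find_first_not_of(" \t\r\n", stop);
        if (start == std::string::npos)
            start = text.size();
    }

    // Whatever remains after the last boundary is the final sentence.
    if (start < text.size())
        sentences.push_back(text.substr(start));
    return sentences;
}
```

For the input "The cover has an opening. It costs 3.5 dollars. Is it done?" this rule yields three sentences, and the period inside 3.5 is not treated as a boundary.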

The tagger annotates each word in the sentence with a single part-of-speech (POS) tag. The tagger uses a hybrid architecture that combines statistical and rule-based approaches.

The statistical tagger is based on a supervised second-order Markov model with modifications, trained on a part-of-speech annotated corpus of patents, technical articles and web pages. The model contains statistics of unigram, bigram and trigram co-occurrences of tags and words, as well as suffix statistics that are used to predict tags for unknown words. The training corpus was created by our experts and contains more than 20 million token-tag pairs.

The rule-based tagger corrector module is used to improve the statistical tagger results. It contains more than 1000 linguistic rules designed specifically to work with tagged text.
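The sketch below shows one simple form such a correction rule could take: retag a token when the previous token carries a given tag. The `Token` and `CorrectionRule` types, the `applyRules` function and the sample rule are all hypothetical and only illustrate the idea of post-processing tagged text; the actual Intellexer rules are not described here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A word together with the tag assigned by the statistical tagger.
struct Token
{
    std::string word;
    std::string tag;
};

// Hypothetical contextual rule: change tag `from` to `to`
// when the preceding token is tagged `prevTag`.
struct CorrectionRule
{
    std::string from;
    std::string to;
    std::string prevTag;
};

// Apply every rule, in order, to the whole sentence.
void applyRules(std::vector<Token>& sentence, const std::vector<CorrectionRule>& rules)
{
    for (const CorrectionRule& rule : rules)
    {
        for (std::size_t i = 1; i < sentence.size(); ++i)
        {
            if (sentence[i].tag == rule.from && sentence[i - 1].tag == rule.prevTag)
                sentence[i].tag = rule.to;
        }
    }
}
```

With the single rule {"NN", "VB", "TO"}, the sequence to/TO book/NN a/DT flight/NN becomes to/TO book/VB a/DT flight/NN: only the noun that directly follows "to" is retagged as a verb.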

To evaluate the effectiveness of the tagger algorithms, we carried out experiments on the Brown and Lancaster-Oslo/Bergen corpora. On these corpora the tagger achieved accuracy over 97%.

After tagging, the annotated sentence is passed to the parser, which builds a parse tree for the sentence and extracts relations. The parser uses a complex model based on semantic rules, statistically collected data and the WordNet lexical database. Parsing is performed bottom-up via finite-state cascades, whose rules are written by linguists. Statistical and semantic modules are called on demand wherever disambiguation is required.
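Bottom-up parsing with a finite-state cascade can be pictured as a series of levels, each of which rewrites the output of the previous level into larger units. The following sketch applies three toy levels to a sequence of POS tags. The `Level` type, the `runCascade` and `toyGrammar` functions and the patterns themselves are assumptions made for this example and are far simpler than a linguist-written grammar; the sketch also tracks only category symbols, not the tree or the relations.

```cpp
#include <regex>
#include <string>
#include <vector>

// One cascade level: a pattern over category symbols and the symbol
// that replaces every match.
struct Level
{
    std::regex pattern;
    std::string replacement;
};

// Apply the levels in order and record the symbol sequence after each one.
// Convention used here: every symbol is followed by exactly one space.
std::vector<std::string> runCascade(std::string symbols, const std::vector<Level>& levels)
{
    std::vector<std::string> stages;
    for (const Level& level : levels)
    {
        symbols = std::regex_replace(symbols, level.pattern, level.replacement);
        stages.push_back(symbols);
    }
    return stages;
}

// Hypothetical three-level grammar: noun phrases, then verb phrases,
// then a clause.
std::vector<Level> toyGrammar()
{
    return {
        {std::regex(R"(\b(?:DT )?(?:JJ )*(?:NN )+)"), "NP "},
        {std::regex(R"(\bVB (?:NP )?)"), "VP "},
        {std::regex(R"(\bNP VP )"), "S "},
    };
}
```

For the tag sequence "DT JJ NN VB DT NN " the stages are "NP VB NP ", then "NP VP ", then "S ", so smaller phrases are recognized first and combined at later levels.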

The semantic part of Linguistic Processor consists of more than 1500 custom-developed rules. For evaluation, we created a dataset manually parsed by our experts. On this collection the syntactic parser achieved precision and recall over 94%, and the semantic relation extractor achieved precision and recall over 79% (typical competitors' results: 85-89% for the syntactic parser and 70-75% for the semantic relation extractor, in terms of precision and recall).

For developers and integrators

Use Case

Intellexer Linguistic Processor can be easily integrated into custom document and knowledge management systems using C/C++ and C#. Our SDK contains all the include files and import libraries needed to link user applications with the Intellexer Linguistic Processor module.

Here is a C++ example of how to add Intellexer Linguistic Processor to your application:


#include <iostream>
#include <fstream>
#include <LPXml.h>

using std::ofstream;
using std::cout;
using std::cerr;

using namespace NsSemSDK;

int main()
{
	try
	{
		// provide path to license file
		SetLPXMLLicensePath("../../ISDK_License.xml");
		
		// Load the database that is required to create an instance of LPXml.
		// It may be shared among several instances.
		CInterfacePtr pDB(LoadLPXmlDB("../../LDB", "../../LPlugins"));		
		
		// Create instance of ILPXml that will process documents and generate XML output.
		CInterfacePtr pLPXml(CreateLPXml(*pDB)); 

		// Process text buffer.
		// Document path is NULL, but it may be set to a file name, e.g. "abstract.txt".
		// In either case, if the buffer is not NULL it will be used instead of the path.
		// Size is omitted because the buffer is zero-terminated.
		// Buffers containing binary data (e.g. a Word document loaded into memory)
		// may be processed as well, but in this case the size must be specified explicitly.
		CInterfacePtr pXMLOutput = pLPXml->Process(NULL,
			"<HTML><HEAD></HEAD><BODY>"
			"A mobile phone battery includes a cover being provided at a "
			"predetermined position with an opening, positive and negative "
			"contacts located behind the cover and accessible the opening "
			"of the cover for electrically connecting to a rectangular dry cell, "
			"and step-down means provided at an inner side of the cover and "
			"having a power output electrically connected to power input of "
			"the mobile phone battery. An emergency charging of the mobile phone "
			"battery is possible simply by attaching a rectangular dry cell to the "
			"positive and the negative contacts of the mobile phone battery without "
			"using other battery charger."
			"</BODY></HTML>");
		ofstream ofs1("1.xml");
		ofs1 << pXMLOutput->GetXml();
		cout << "Resulting XML is saved to 1.xml.\n"; 

		// Process file.
		ofstream ofs2("2.xml");

		pXMLOutput = pLPXml->Process("../Data/LPXml/Test.htm");
		ofs2 << pXMLOutput->GetXml();
		cout << "Resulting XML is saved to 2.xml.\n";
	}
	catch (const CSemBaseException& x)
	{
		// Handle exceptions: report the error and signal failure to the caller.
		cerr << x.what() << "\n";
		return 1;
	}
	return 0;
}

As a result, you get the Linguistic Processor output for every processing stage in the form of XML files.
