Intellexer Preformator

Intellexer Preformator extracts plain text and layout information from documents in various formats (doc, pdf, rtf, html, etc.). It also automatically determines the document's structure (patent, news or scientific article, review, etc.) and its theme (economics, law, sports, etc.).

Preformator combines several natural language processing algorithms for document language detection and for structure and theme recognition. Together with Intellexer Preformator's linguistic rules, these algorithms recognize all of the elements mentioned above and extract the textual information.

The language detection algorithm is based on the vector space model. In this model, documents are represented as vectors in a multidimensional space whose components are N-grams with associated frequencies. The similarity between a document and a language category model is the cosine of the angle between the corresponding vectors. Preformator determines not only the language of the whole document but also the language of each text fragment.
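
The vector space approach described above can be sketched in a few lines of C++. This is a minimal illustration, not Preformator's actual implementation: the names `NGramVector`, `BuildNGramVector`, `CosineSimilarity` and `DetectLanguage` are hypothetical, the N-gram size of 3 is an assumption, and N-grams are taken over bytes rather than Unicode characters.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Sparse vector: N-gram -> frequency (hypothetical type, for illustration).
using NGramVector = std::map<std::string, double>;

// Count overlapping byte N-grams in a text (n = 3 is an assumed default).
NGramVector BuildNGramVector(const std::string& text, std::size_t n = 3)
{
    NGramVector v;
    if (n == 0 || text.size() < n)
        return v;
    for (std::size_t i = 0; i + n <= text.size(); ++i)
        v[text.substr(i, n)] += 1.0;
    return v;
}

// Cosine of the angle between two sparse vectors; 0 if either is empty.
double CosineSimilarity(const NGramVector& a, const NGramVector& b)
{
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& kv : a)
    {
        normA += kv.second * kv.second;
        const auto it = b.find(kv.first);
        if (it != b.end())
            dot += kv.second * it->second;
    }
    for (const auto& kv : b)
        normB += kv.second * kv.second;
    if (normA == 0.0 || normB == 0.0)
        return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Return the label of the language profile most similar to the text,
// or an empty string if no profiles are given.
std::string DetectLanguage(const std::string& text,
                           const std::map<std::string, NGramVector>& profiles)
{
    const NGramVector doc = BuildNGramVector(text);
    std::string best;
    double bestScore = -1.0;
    for (const auto& p : profiles)
    {
        const double score = CosineSimilarity(doc, p.second);
        if (score > bestScore)
        {
            bestScore = score;
            best = p.first;
        }
    }
    return best;
}
```

To detect the language of each fragment, the same `DetectLanguage` call could be applied to every text fragment separately. Real language profiles would be built from large corpora rather than from single sentences.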

After the language recognition stage, Preformator detects the structure type and the document theme.

Structure detection combines statistical and rule-based approaches. The statistical part is based on the k-Nearest Neighbors (kNN) algorithm. In kNN classification, the output is a class membership: an object is classified by a majority vote of its neighbors and assigned to the class most common among its k nearest neighbors. The rule-based part uses a set of complex linguistic rules to detect additional structure patterns.
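
The majority vote at the heart of kNN can be sketched as follows. This is an illustrative sketch only: `LabeledSample`, `EuclideanDistance` and `ClassifyKnn` are hypothetical names, and the use of Euclidean distance and of alphabetical tie-breaking are assumptions, since the distance measure and tie handling used by Preformator are not stated here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One training document: a feature vector plus its known class label.
struct LabeledSample
{
    std::vector<double> features;  // e.g. weights of thesaurus terms (assumed)
    std::string label;             // e.g. "patent", "news", "review"
};

// Euclidean distance over the shared leading dimensions of two vectors.
double EuclideanDistance(const std::vector<double>& a,
                         const std::vector<double>& b)
{
    const std::size_t n = std::min(a.size(), b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Assign the query to the class most common among its k nearest neighbors.
// Returns an empty string if there is no training data or k is 0.
std::string ClassifyKnn(const std::vector<LabeledSample>& training,
                        const std::vector<double>& query, std::size_t k)
{
    if (training.empty() || k == 0)
        return std::string();

    std::vector<std::pair<double, std::string>> byDistance;
    byDistance.reserve(training.size());
    for (const auto& s : training)
        byDistance.emplace_back(EuclideanDistance(s.features, query), s.label);

    // Move the k closest samples to the front, ordered by distance.
    k = std::min(k, byDistance.size());
    std::partial_sort(byDistance.begin(),
                      byDistance.begin() + static_cast<std::ptrdiff_t>(k),
                      byDistance.end());

    // Count votes per label among the k nearest neighbors.
    std::map<std::string, std::size_t> votes;
    for (std::size_t i = 0; i < k; ++i)
        ++votes[byDistance[i].second];

    // Pick the label with the most votes; ties go to the alphabetically
    // first label because std::map iterates in key order.
    std::string best;
    std::size_t bestVotes = 0;
    for (const auto& v : votes)
    {
        if (v.second > bestVotes)
        {
            bestVotes = v.second;
            best = v.first;
        }
    }
    return best;
}
```

An odd k is a common choice for two-class problems because it avoids tied votes; with more classes, ties remain possible and need an explicit rule such as the one assumed above.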

The rules are based on custom-built semantic thesauri that experts created for each structure type (patent, news or scientific article, review, etc.). The thesaurus elements serve as vector components in the kNN classification model.

For evaluation, we created a dataset of more than 30,000 documents. Some of Preformator's structure detection results are shown in the table below:

	Fiction Book	News Article	Patent	Research Paper	Review
Precision (%)	93	96	98	93	89
Recall (%)	80	97	95	84	97

Preformator uses a wide range of text features for document theme classification: elements from custom-developed linguistic dictionaries (covering more than 50 predefined topics) and the semantic relations between them detected by Intellexer Linguistic Processor. On several widely used test collections (e.g., TREC corpora, the CoNLL test dataset, Reuters-21578), we achieved theme classification accuracy of over 87%.

For developers and integrators

Use Case

Intellexer Preformator can be easily integrated into custom document and knowledge management systems using C/C++ and C#. The SDK contains all the include files and import libraries needed to link user applications with the Intellexer Preformator module.

Here is a C++ example of how to add Intellexer Preformator to your application:

							
#include <iostream>
#include <string>
#include <PrefCore.h>

using std::cout;
using std::cerr;
using std::endl;
using std::string;

using namespace NsSemSDK;

///    Print preformatted document
void PrintPrefDocument(const IPrefDocument* pPrefDocument)
{
	CPrefCoreDef oPrefCore;
	const char* pszDocStructure = NULL;
	const char* pszDocTopic = NULL;
	const IPrefEnumTopic* pPrefTopics = NULL;
	int nTextFragmentCount = 0;
	const IPrefTextFragment* pPrefTextFragment = NULL;

	// get document structure and topic
	pPrefDocument->GetDocStructure(&pszDocStructure);
	pPrefDocument->GetTopic(&pPrefTopics);
	pPrefTopics->Reset();

	cout << "Document Structure: " << pszDocStructure << endl;
	cout << "Document Topic: ";
	while (pPrefTopics->Next(&pszDocTopic))
		cout << pszDocTopic << "\t";
	cout << endl << endl;

	// get document text fragment count
	pPrefDocument->GetTextFragmentCount(&nTextFragmentCount);
	for (int i = 0; i < nTextFragmentCount; ++i)
	{
		EPrefFragmentSemLabel eSemLabel = EPrefFrSemLblFinish;
		const char* pszText = NULL;
		
		// get text fragment
		pPrefDocument->GetTextFragment(i, &pPrefTextFragment);
		
		// get semantic label
		pPrefTextFragment->GetSemLabel(&eSemLabel);
		
		// get text
		pPrefTextFragment->GetText(&pszText);

		// get semantic label string representation
		cout << oPrefCore.GetSemLabelA(eSemLabel) << ":" << endl;
		cout << pszText << endl << endl;
	}
}

int main(int argc, char* argv[])
{
	string sFileName("../Data/Test.pdf");      // path to source document
	string sDBPath("../../LDB");               // path to ldb
	string sLPluginsPath("../../LPlugins");    // path to plugins

	if (argc > 1)
		sFileName = argv[1]; // path to source document
	try
	{
		if (argc > 3)
		{
			sDBPath = argv[2];
			sLPluginsPath = argv[3];
		}

		// the license is not needed for Preformator
		cerr << "Initializing\t...";
		
		// create Preformator database interface
		CInterfacePtr<IPrefDB> pPreformatorDB(CreatePrefDB());

		// create Preformator interface
		CInterfacePtr<IPreformator> pPreformator(CreatePreformator());
		const IPrefDocument* pPrefDocument = NULL;
		IPrefPreferences* pPrefPreferences = NULL;

		// initialize Preformator database interface
		pPreformatorDB->Setup(sDBPath.c_str(), sLPluginsPath.c_str());

		// initialize Preformator interface
		pPreformator->Setup(pPreformatorDB.Get());
		cerr << "Done" << endl;

		// get Preformator preferences
		pPreformator->GetPreferences(&pPrefPreferences);

		// set Preformator to ignore links extraction
		pPrefPreferences->SetIgnoreLinks(true);

		// set Preformator to ignore alts extraction
		pPrefPreferences->SetIgnoreAlts(true);

		// process file
		pPreformator->ProcessFile(sFileName.c_str());

		// get preformatted document
		pPreformator->GetDocument(&pPrefDocument);
		PrintPrefDocument(pPrefDocument);
	}
	catch (const CSemBaseException& x)
	{
		// report the error and exit with a non-zero status
		cerr << x.what() << endl;
		return 1;
	}
	return 0;
}

As a result, you get all of Preformator's output: the document's plain text, its structure, and its theme.
