In the global Internet environment, processing information in multiple languages is of great importance. Intellexer Language Recognizer identifies the language and character encoding of incoming documents. It supports more than 30 languages, covering major European and Asian languages.
Intellexer Language Recognizer can be successfully used:
- as a pre-filtering step to improve the quality of input text data (most natural language processing algorithms deal with monolingual texts, and the inclusion of other languages can decrease the performance of document management systems);
- in mining bilingual texts for machine translation from online resources;
- for retrieving, grouping and understanding relevant information (user texts, emails, etc.) in a multilingual environment.
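The pre-filtering use can be sketched as follows. This is a minimal illustration, not the Intellexer API: the `LanguageDetector` callback type and the `KeepLanguage` helper are hypothetical names introduced here, and any detector that maps a text to a language label could be plugged in.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical detector signature (an assumption for this sketch):
// takes a document and returns a language label such as "en" or "fr".
using LanguageDetector = std::function<std::string(const std::string&)>;

// Keep only the documents whose detected language matches `wanted`,
// so that a monolingual downstream pipeline never sees other languages.
std::vector<std::string> KeepLanguage(const std::vector<std::string>& docs,
                                      const std::string& wanted,
                                      const LanguageDetector& detect)
{
    std::vector<std::string> kept;
    for (const std::string& doc : docs)
    {
        if (detect(doc) == wanted)
            kept.push_back(doc);
    }
    return kept;
}
```

Passing the detector as a callback keeps the filter independent of any particular recognition engine.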
Intellexer Language Recognizer combines statistical and linguistic technologies to obtain the highest recognition accuracy. Our language detection algorithm is based on a vector space model. It scans the document contents, counts N-gram frequencies, and represents them as vectors in a multidimensional space. The algorithm then compares the positions of these vectors in that space to determine their similarity. In addition, special linguistic rules developed by our experts correct the results of the statistical algorithm.
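The general idea of N-gram vectors compared in a vector space can be sketched as below. This is a generic textbook illustration, not the Intellexer implementation: it assumes byte trigrams, raw counts as vector components, and cosine similarity as the measure, and all names (`Profile`, `BuildProfile`, `CosineSimilarity`, `ClosestLanguage`) are hypothetical. The linguistic correction rules are not modeled.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Sparse frequency vector: N-gram -> number of occurrences.
using Profile = std::map<std::string, double>;

// Count every overlapping byte N-gram of length n in the text.
Profile BuildProfile(const std::string& text, std::size_t n = 3)
{
    Profile profile;
    for (std::size_t i = 0; i + n <= text.size(); ++i)
        profile[text.substr(i, n)] += 1.0;
    return profile;
}

// Cosine of the angle between two sparse vectors (0 if either is empty).
double CosineSimilarity(const Profile& a, const Profile& b)
{
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Profile::const_iterator it = a.begin(); it != a.end(); ++it)
    {
        normA += it->second * it->second;
        Profile::const_iterator other = b.find(it->first);
        if (other != b.end())
            dot += it->second * other->second;
    }
    for (Profile::const_iterator it = b.begin(); it != b.end(); ++it)
        normB += it->second * it->second;
    if (normA == 0.0 || normB == 0.0)
        return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Return the label of the reference profile most similar to the text,
// or "unknown" if no reference shares any N-gram with it.
std::string ClosestLanguage(const std::string& text,
                            const std::map<std::string, Profile>& references)
{
    const Profile profile = BuildProfile(text);
    std::string best = "unknown";
    double bestScore = 0.0;
    for (std::map<std::string, Profile>::const_iterator it = references.begin();
         it != references.end(); ++it)
    {
        const double score = CosineSimilarity(profile, it->second);
        if (score > bestScore)
        {
            bestScore = score;
            best = it->first;
        }
    }
    return best;
}
```

In practice the reference profiles would be built from large per-language corpora; the short strings one might use in a quick experiment only demonstrate the mechanics.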
Intellexer Language Recognizer accurately determines not only the language of the whole document, but also the language of each text fragment.
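Fragment-level detection can be pictured with the sketch below. It is an illustration under stated assumptions, not the product's method: fragments are taken to be sentences ending in '.', '!' or '?', and `LanguageDetector`, `SplitFragments` and `LabelFragments` are hypothetical names introduced for this example.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical detector signature (an assumption for this sketch).
using LanguageDetector = std::function<std::string(const std::string&)>;

// Split text into fragments that end at '.', '!' or '?'.
// Spaces between fragments are dropped; a trailing unterminated
// fragment is kept as is.
std::vector<std::string> SplitFragments(const std::string& text)
{
    std::vector<std::string> fragments;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        const char c = text[i];
        if (current.empty() && c == ' ')
            continue;
        current += c;
        if (c == '.' || c == '!' || c == '?')
        {
            fragments.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        fragments.push_back(current);
    return fragments;
}

// Pair every fragment with the language label the detector assigns to it.
std::vector<std::pair<std::string, std::string> >
LabelFragments(const std::string& text, const LanguageDetector& detect)
{
    std::vector<std::pair<std::string, std::string> > labeled;
    const std::vector<std::string> fragments = SplitFragments(text);
    for (std::vector<std::string>::size_type i = 0; i < fragments.size(); ++i)
        labeled.push_back(std::make_pair(fragments[i], detect(fragments[i])));
    return labeled;
}
```

Very short fragments carry little statistical evidence, so a real system would likely merge or smooth neighbouring labels; that step is left out here.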
For evaluation experiments, we created a dataset of more than 1,300,000 documents in different languages (English, German, French, Spanish, Japanese, Chinese, etc.). On this collection we achieved language identification accuracy of 95% to 99% (typical competitors’ results: 86% to 96%). The average processing speed was over 8,000 KB/s.
For developers and integrators
Intellexer Language Recognizer can be easily integrated into custom natural language processing systems written in C/C++ or C#. Our SDK contains all the include files and import libraries needed to bind user applications to the Intellexer Language Recognizer module.
Here is a C++ example of how to add Intellexer Language Recognizer to your application:
#include <cstring>   // strlen
#include <iostream>
#include "LRCore.h"

using namespace std;
using namespace NsSemSDK;

int main(int argc, char* argv[])
{
    try
    {
        // path to the language database (LDB)
        char szDBPath[] = "../../LDB";
        // sample text
        char pszText[] = "Ces droits et tous les autres, ne sont que des manisfestations concrète du droit général a l'existence et à l'acquisition de sa fin. Ces déterminations doivent provenir du fait.";

        // provide path to the license file
        SetLanguageRecognizerLicensePath("../../ISDK_License.xml");

        // create database interface
        CInterfacePtr<ILanguageRecognizerLDB> pDB(CreateLanguageRecognizerLDB());
        // create extractor interface
        CInterfacePtr<ILanguageRecognizer> pLRExtractor(CreateLanguageRecognizer());

        // initialize database interface
        pDB->Setup(szDBPath);
        // initialize extractor interface
        pLRExtractor->Setup(pDB.Get());

        // process sample text
        CInterfacePtr<IEnumLanguage> piLanguages;
        pLRExtractor->RecognizeLanguage(pszText, strlen(pszText), &piLanguages);

        // print each candidate: language name, encoding name, weight
        unsigned int nLanguageID;
        unsigned int nEncodingID;
        double fWeight;
        while (piLanguages->Next(&nLanguageID, &nEncodingID, &fWeight))
        {
            cout << "LangEncoding:\t" << pDB->GetLanguageName(nLanguageID) << "\t"
                 << pDB->GetEncodingName(nEncodingID) << "\t" << fWeight << endl;
        }
    }
    catch (const CSemBaseException& x)
    {
        // report SDK errors
        cerr << x.what() << endl;
    }
    return 0;
}
As a result, you get the full Intellexer Language Recognizer output: each detected language of the document, its encoding, and the associated weight.