Intellexer Language Recognizer

In the global Internet environment, processing information in multiple languages is of great importance. Intellexer Language Recognizer identifies the language and character encoding of incoming documents. It supports more than 30 languages, including the major European and Asian languages.

Intellexer Language Recognizer can be successfully used:

- as a pre-filtering step to improve the quality of input text data (most natural language processing algorithms work with monolingual texts, and the inclusion of other languages can degrade the performance of document management systems);

- for mining bilingual texts from online resources for machine translation;

- for retrieving, grouping and understanding relevant information (users’ texts, emails, etc.) in a multilingual environment.

Intellexer Language Recognizer combines statistical and linguistic technologies to achieve the highest recognition accuracy. Our language detection algorithm is based on a robust mathematical model, the vector space model. It scans the document contents, calculates N-gram frequencies, and represents them as vectors in a multidimensional space. The algorithm then analyzes the positions of these vectors in the space to determine their similarity. In addition, special linguistic rules developed by our experts are used to correct the results of the statistical algorithm.
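The general idea of N-gram vectors compared in a vector space can be illustrated with a minimal sketch. This is not Intellexer's implementation: the character trigrams, the cosine similarity measure, the tiny sample sentences, and the names `ngram_vector`, `cosine_similarity`, `PROFILES` and `detect_language` are all assumptions made for illustration, and the linguistic correction rules are not modeled.

```python
import math
from collections import Counter

def ngram_vector(text, n=3):
    """Return a sparse frequency vector of character n-grams."""
    # Normalize case and whitespace; pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical, deliberately tiny training samples; a real system would
# build each language profile from a large corpus.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and this is the language of the text",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und das ist die sprache des textes",
    "french": "le renard brun rapide saute par-dessus le chien paresseux "
              "et ceci est la langue du texte",
}
PROFILES = {lang: ngram_vector(sample) for lang, sample in SAMPLES.items()}

def detect_language(text):
    """Pick the language whose profile vector is closest to the text vector."""
    vec = ngram_vector(text)
    return max(PROFILES, key=lambda lang: cosine_similarity(vec, PROFILES[lang]))

if __name__ == "__main__":
    for sentence in ("the dog is in the house",
                     "das ist der hund",
                     "le chien est paresseux"):
        print(sentence, "->", detect_language(sentence))
```

With profiles this small the sketch is only reliable on inputs that share vocabulary with the samples; the point is the shape of the computation, not its accuracy.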

Intellexer Language Recognizer accurately determines not only the language of the whole document, but also the language of each text fragment.
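One possible way to label individual fragments is to split the text into sentence-like pieces, classify each piece, and merge neighbouring pieces that share a label. The sketch below assumes this approach and does not describe how Intellexer segments text. The names `split_fragments`, `label_fragments` and `toy_classify` are hypothetical, and `toy_classify` is only a stand-in that guesses by script (Cyrillic or not); any real language classifier could be passed in its place.

```python
import re
from itertools import groupby

def split_fragments(text):
    """Split text into sentence-like fragments after ., ! or ?."""
    return [f.strip() for f in re.split(r"(?<=[.!?])\s+", text) if f.strip()]

def label_fragments(text, classify):
    """Label each fragment, then merge adjacent fragments sharing a label."""
    labelled = [(classify(fragment), fragment) for fragment in split_fragments(text)]
    return [
        (label, " ".join(fragment for _, fragment in group))
        for label, group in groupby(labelled, key=lambda pair: pair[0])
    ]

def toy_classify(fragment):
    """Stand-in classifier: guesses by script only, NOT a real detector."""
    cyrillic = sum("\u0400" <= ch <= "\u04ff" for ch in fragment)
    letters = sum(ch.isalpha() for ch in fragment)
    # Call the fragment Russian when most of its letters are Cyrillic.
    return "russian" if letters and cyrillic * 2 > letters else "english"

if __name__ == "__main__":
    mixed = "Hello there. How are you? Привет, как дела? Fine."
    for label, fragment in label_fragments(mixed, toy_classify):
        print(label, ":", fragment)
```

Keeping the classifier as a parameter separates segmentation from detection, so the same merging logic works with any per-fragment detector.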

For evaluation experiments, we created a dataset of more than 1,300,000 documents in different languages (English, German, French, Spanish, Japanese, Chinese, etc.). On this collection we achieved language identification accuracy of 95% to 99% (typical competitors’ results: 86% to 96%). The average processing speed was over 8000 KB/s.
