One of our clients asked the Intellexer team to help create a database of biographies from unstructured data. The client had only hard copies of the «Who is Who» book series and wanted to extract information about each person: name, date and place of birth and death, nationality, education, activities and achievements.
Our R&D team developed an automated system that integrates unstructured text into a relational database using proprietary Intellexer semantic software.
As a first step, we used an optical character recognition (OCR) program to convert the printed books into electronic form. Because the recognized text contained many spelling mistakes, we used Intellexer Spellchecker at the preprocessing stage to correct them. Intellexer Spellchecker corrects spelling mistakes automatically using a set of linguistic rules and resources, including: rules for context-dependent misspellings; rules for evaluating the probability of possible corrections; rules for handling mistakes caused by the same sound being represented by different letters and letter combinations; and dictionaries of correct spelling variants (more than 424 000 word forms for English).
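The idea of ranking possible corrections for OCR output can be illustrated with a small sketch. This is not the Intellexer Spellchecker implementation: the tiny dictionary, the frequency values, the list of OCR confusion pairs and the similarity threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical toy dictionary: word -> relative frequency (illustrative values only).
DICTIONARY = {"born": 50, "burn": 5, "modern": 20, "london": 30, "educated": 15}

# Hypothetical OCR confusion pairs (misread text -> intended text).
OCR_CONFUSIONS = [("rn", "m"), ("m", "rn"), ("0", "o"), ("1", "l"), ("vv", "w")]


def candidates(token):
    """Return the token plus variants that undo one OCR confusion at one position."""
    variants = {token}
    for wrong, right in OCR_CONFUSIONS:
        start = token.find(wrong)
        while start != -1:
            variants.add(token[:start] + right + token[start + len(wrong):])
            start = token.find(wrong, start + 1)
    return variants


def correct(token):
    """Return a corrected lowercase word, or the original token if nothing fits."""
    word = token.lower()  # case is dropped for simplicity in this sketch
    if word in DICTIONARY:
        return word
    # Prefer dictionary words reachable through a known OCR confusion,
    # choosing the most frequent one as the most probable correction.
    known = [c for c in candidates(word) if c in DICTIONARY]
    if known:
        return max(known, key=DICTIONARY.get)
    # Otherwise fall back to string similarity, with frequency as a tie-breaker.
    scored = [(SequenceMatcher(None, word, w).ratio(), freq, w)
              for w, freq in DICTIONARY.items()]
    ratio, _, best = max(scored)
    return best if ratio >= 0.75 else token


for raw in ["b0rn", "modem", "educatcd", "xyz"]:
    print(raw, "->", correct(raw))
```

A production spellchecker would also use the surrounding context and phonetic rules, as described above; this sketch only covers dictionary lookup and a simple ranking of candidates.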
To find and classify the articles that contain information about a person, we applied the Intellexer Preformator and Named Entity Recognizer tools. Intellexer Preformator extracted the text from the input documents and represented its structure along with the retrieved parts (for example, titles, subtitles and paragraphs). Intellexer Named Entity Recognizer was then used to detect the titles that contain personal information.
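As a rough illustration of this step, the sketch below filters structured text blocks down to the titles that look like person entries. It does not use the Intellexer tools: the `Block` type is hypothetical, and a simple regular expression stands in for named entity recognition, assuming headings of the form "SURNAME, Given Names" (a convention assumed here, not stated in the project description).

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # "title", "subtitle" or "paragraph"
    text: str


# Assumed heading convention: "SURNAME, Given Names" (illustrative only).
PERSON_TITLE = re.compile(r"^[A-Z][A-Z'\-]+, (?:[A-Z][a-z]+\.? ?)+$")


def person_titles(blocks):
    """Return the text of title blocks that match the person-heading pattern."""
    return [b.text for b in blocks
            if b.kind == "title" and PERSON_TITLE.match(b.text)]


blocks = [
    Block("title", "SMITH, John Edward"),
    Block("paragraph", "Born 1901 in Leeds."),
    Block("title", "ABBREVIATIONS"),
    Block("title", "O'NEIL, Mary"),
]
print(person_titles(blocks))
```

A real named entity recognizer would handle far more heading variants than a single pattern can; the point here is only the flow from structured blocks to person-related titles.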
Intellexer Event and Relation Extractor was used to identify relations between named entities. The Event and Relation Extractor module is based on our custom Linguistic Processor. It determines the main types of relations: person - age, person - nationality, person - position, person - organization, organization - location, and person - date.
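To show how extracted relations might end up in a relational database, here is a minimal sketch. It is not the Event and Relation Extractor: a few hypothetical surface patterns stand in for the linguistic processor, and the table name, column names and sample sentence are assumptions made for the example.

```python
import re
import sqlite3

# Hypothetical surface patterns standing in for a real linguistic processor.
PATTERNS = {
    "date_of_birth": re.compile(r"\bborn (?:on )?(\d{1,2} [A-Z][a-z]+ \d{4})"),
    "nationality": re.compile(r"\b(British|French|German|American)\b"),
    "organization": re.compile(
        r"\bat (?:the )?((?:[A-Z][a-z]+ )*(?:University|Institute|Company))"
    ),
}


def extract_relations(person, text):
    """Return (person, relation type, value) rows found in the text."""
    rows = []
    for relation, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            rows.append((person, relation, match.group(1)))
    return rows


def store(conn, rows):
    """Insert relation rows into an assumed three-column table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relation (person TEXT, type TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", rows)
    conn.commit()


text = "British physicist, born 4 March 1901 in Leeds; professor at Leeds University."
rows = extract_relations("SMITH, John", text)

conn = sqlite3.connect(":memory:")
store(conn, rows)
for row in conn.execute("SELECT type, value FROM relation WHERE person = ?",
                        ("SMITH, John",)):
    print(row)
```

Storing each relation as its own row keeps the schema simple and lets new relation types be added without altering the table, at the cost of less strict typing than one column per attribute.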
Our information extraction technique achieves over 85% accuracy and can be readily applied in the client's services.