Attention Makers

CATEGORY: SOFTWARE

Detecting a Written Text's Language

MAKERS: Taruna

Detecting the language of a written text is one of the most basic tasks in Natural Language Processing (NLP), and one of the easier challenges the field has to offer. Before any language-dependent processing of an unknown text can begin, we first need to know which language the text is written in. The idea is that every language has a characteristic set of character occurrences and co-occurrences. The first step is to collect those statistics for every language that should be detectable. The difficulty lies in gathering a large set of plain-text data for each language that contains only that language and is not domain specific. For each language, all statistics are ordered and ranked by number of occurrences, and the ranked tables form that language's model. Within the demo application, all models can be studied in detail.

Classification of an unknown text is straightforward. The text is tokenized and the three statistics tables are generated for it. The resulting tables are compared with the tables of every language model, and a distance is calculated for each. The language whose model has the smallest distance to the unknown text is the most likely language of the text.
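The ranking and distance steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not the project's actual implementation. It assumes that the three statistics tables are character 1-grams, 2-grams and 3-grams, and that the distance is a rank-based "out-of-place" measure; the text specifies neither. All function names, the `top_k` parameter and the tiny training samples are hypothetical.

```python
from collections import Counter

def ngram_ranks(text, n, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Sort by descending count; break ties alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ordered)}

def build_profile(text, top_k=300):
    """Build three ranked tables (assumed here: 1-, 2- and 3-grams)."""
    return {n: ngram_ranks(text, n, top_k) for n in (1, 2, 3)}

def out_of_place(unknown, reference, top_k=300):
    """Sum of rank differences; n-grams missing from the model cost top_k."""
    total = 0
    for gram, rank in unknown.items():
        total += abs(rank - reference[gram]) if gram in reference else top_k
    return total

def profile_distance(unknown, reference, top_k=300):
    """Add up the distances of the three tables."""
    return sum(out_of_place(unknown[n], reference[n], top_k) for n in unknown)

def detect_language(text, models, top_k=300):
    """Return the key of the model with the smallest distance to the text."""
    profile = build_profile(text, top_k)
    return min(models,
               key=lambda lang: profile_distance(profile, models[lang], top_k))

# Toy training data (hypothetical); real models need large, general corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog "
          "sleeps in the sun while the fox runs through the forest",
    "de": "der schnelle braune fuchs springt in den wald und dann ruht der "
          "hund in der sonne und der fuchs rennt durch den wald",
}
MODELS = {lang: build_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(detect_language("the dog runs over the fox", MODELS))
    print(detect_language("der hund rennt durch den wald", MODELS))
```

With training samples this small the result is only indicative. As noted above, a usable model needs a large, single-language corpus that is not domain specific.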
