Mark said:
Does anyone have a method to determine the language (e.g. en-US, fr-FR,
zh-TW etc) of a body of text?
(In my case the body of text is an email received).
Naturally the text, if not English, may have a little bit of English.
I'm just trying to get the main language used.
I thought of the following: take a dictionary of the common words of
each language you are interested in. Then, for each word in the text,
count how many times it occurs. But this needs several forms of each
word, for example "word" and "words". In some languages this is not
practical, since a single word can have so many variations.
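
A rough sketch of that word-counting idea in Python. The word lists
here are tiny and made up for illustration (real ones would need to be
much longer), and the names COMMON_WORDS and guess_language are mine,
not from any library:

```python
import re

# Illustrative, hand-picked word lists (an assumption, far too short
# for real use). Keys are just labels for the languages of interest.
COMMON_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "fr": {"le", "la", "et", "est", "de", "les", "un", "une", "que"},
    "it": {"il", "la", "e", "di", "che", "un", "una", "per", "non"},
}

def guess_language(text):
    """Return (best_language, scores) by counting common-word hits."""
    # Lowercase and split into word tokens.
    words = re.findall(r"\w+", text.lower())
    # For each language, count how many tokens are in its word list.
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    # The language with the most hits wins.
    return max(scores, key=scores.get), scores
```

Calling guess_language(body_of_email) gives the label with the most
hits plus the raw counts, so a bit of English inside a French mail
should not tip the result as long as French words dominate.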
But check the article "Language Trees and Zipping" by Dario Benedetto,
Emanuele Caglioti and Vittorio Loreto, downloadable from
http://xxx.uni-augsburg.de/format/cond-mat/0108530 . It seems there is
also a Perl implementation of the algorithm:
code.activestate.com/recipes/355807 . If I understood it right, the
zip archiver learns the patterns in its input as it goes, and the more
it has learned (i.e. the bigger the text), the better it compresses.
So when you teach zip with English text and then give it two texts, A
and B, where A is English and B is Italian, A compresses better than B.
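
As far as I understand the trick, it can be tried out with Python's
zlib module. This is only my sketch of the idea, not the code from the
paper or the recipe: compress a reference text alone, then the
reference text with the unknown text appended, and see how many extra
bytes the unknown text costs. The function names and the references
dictionary are my own assumptions:

```python
import zlib

def extra_bytes(reference, text):
    """Extra compressed bytes needed for text once reference is known."""
    ref = reference.encode("utf-8")
    both = ref + b" " + text.encode("utf-8")
    # Size of (reference + text) minus size of reference alone.
    return len(zlib.compress(both, 9)) - len(zlib.compress(ref, 9))

def guess_language(text, references):
    """references maps a language label to a sample text in that language.

    Returns (best_language, costs); the lowest extra cost wins.
    """
    costs = {lang: extra_bytes(ref, text) for lang, ref in references.items()}
    return min(costs, key=costs.get), costs
```

You would fill references with a reasonably long sample per language,
e.g. {"en": english_sample, "it": italian_sample}. Note that zlib only
looks back 32 KB, so a reference much longer than that gains nothing,
and very short samples give noisy results.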