Cor Ligthert said:
You make a good point, Cor. Sometimes key text is buried in images,
non-standard binaries, internal file compression, and encryption, which can
be really frustrating. However, your point touches on the most interesting
part of the question.
The starting point for pulling text out of these other formats in a generic
way is Natural Language Processing, or NLP, because language has a
mathematical signature that corresponds to the myriad rules of spelling and
grammar. In spite of all the spectacular claims, no-one has NLP - not yet.
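That signature is easy to demonstrate, though. Here is a minimal sketch in
Python (the thresholds and the scoring rule are my own illustrative choices,
not an established algorithm) of how a scanner might spot natural-language
text buried inside an opaque binary purely by its statistical shape:

import string

# ASCII bytes that could plausibly appear in ordinary text.
PRINTABLE = set(bytes(string.printable, "ascii"))

def looks_like_text(chunk: bytes) -> bool:
    """Crude signature test: mostly printable bytes, with roughly the
    letter and space proportions natural language tends to have."""
    if not chunk:
        return False
    printable = sum(b in PRINTABLE for b in chunk) / len(chunk)
    letters = sum(chr(b).isalpha() for b in chunk) / len(chunk)
    spaces = chunk.count(32) / len(chunk)  # byte 32 is the space character
    return printable > 0.95 and letters > 0.5 and 0.05 < spaces < 0.3

def scan(blob: bytes, window: int = 64):
    """Slide a window over a binary blob and yield the offsets whose
    contents match the text signature."""
    for i in range(0, len(blob) - window, window):
        if looks_like_text(blob[i:i + window]):
            yield i

# Noise, then a run of English, then more noise.
blob = (bytes(range(256))
        + b"Key text can hide in the middle of a binary, "
        + b"but its statistical shape still gives it away. "
        + bytes(64))
for offset in scan(blob):
    print(offset, blob[offset:offset + 64])

Crude as it is, a test like this picks out the English run in the sample
blob and skips the surrounding noise; real extraction would need far richer
models of the same signature.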
The foundation of NLP is contextualisation, which has been the focus of
languages such as XML. However, as the folks at Brown University soon
discovered, there are also issues of core structure versus extensible
features of language that vary from node type to node type in the
structural hierarchy of communication. Did I mention that language is not
compatible with well-formed hierarchies, owing to the frequency of two-way
ambiguity in word meaning (and often word function)? Thus context is drawn
from structure, yet the structure itself could be any one of a number of
possibilities that cannot always be resolved. Consider the meaning of the
word "green" in the following examples:
1. The green recruit
2. The green passenger
3. The green corporation
4. The green thumb
In each case the meaning of "green" depends on the definition of the
applicable noun.
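To make that dependence concrete, here is a minimal sketch in Python (the
sense table and the function are invented for illustration, not drawn from
any real lexicon or NLP library): resolving "green" reduces to a lookup
keyed on the neighbouring noun, and a noun outside the table forces a
fallback guess, which is the ambiguity problem in miniature.

SENSES = {
    # noun that "green" modifies -> intended sense of "green"
    "recruit":     "inexperienced",
    "passenger":   "nauseous",
    "corporation": "environmentally conscious",
    "thumb":       "skilled at gardening",
}

def resolve_green(noun: str) -> str:
    """Return the sense of 'green' given the noun it modifies.

    Falls back to the literal colour when the lexicon has no entry,
    which is itself a guess: real disambiguation needs wider context
    than the single adjacent noun.
    """
    return SENSES.get(noun, "the colour green")

for phrase in ["green recruit", "green passenger",
               "green corporation", "green thumb", "green door"]:
    _, noun = phrase.split()
    print(f"the {phrase}: 'green' means {resolve_green(noun)}")

Notice that nothing in the syntax changes from one phrase to the next; the
structure is identical, and only lexical knowledge separates the senses.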
Nobody is clear on a complete system, and when you consider the
effectiveness of .NET as a language unto itself - it emerges that there may
well be some errors in the conventional academic perception of linguistic
structure. Linguists hold the verb, for example, to be a classification
equal to the noun when considering parts of speech - but in the Microsoft
class system, a verb is merely a sub-part of the noun: a method belongs to
a class. The Microsoft system works very well, so perhaps the engineering
proves they got something right in this department...?
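The subordination is easy to picture in code. A minimal sketch, in Python
rather than a .NET language since the same principle holds in any
class-based object system (the class itself is invented for illustration):

class Recruit:                      # the noun
    def __init__(self, name: str) -> None:
        self.name = name
        self.experienced = False

    def train(self) -> None:        # the verb, declared inside the noun
        self.experienced = True

recruit = Recruit("Jones")          # the verb cannot be uttered alone...
recruit.train()                     # ...it is invoked through a noun
print(recruit.experienced)          # True

The verb has no free-standing existence here; it is a member of the noun
and only ever acts through an instance of it.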
In any case, we have a long way to go, even if the data and analyses being
accumulated are fascinating.