Detect encoding of a text file

  • Thread starter: Marc Scheuner [MVP ADSI]

Marc Scheuner [MVP ADSI]

Folks,

I have a number of text files in a directory, and I'd like to know
what type of encoding they're in.

I thought I could just open a StreamReader for each file one at a
time, and let .NET determine the encoding (Default, UTF-8, Unicode) -
there's a nice constructor for StreamReader which takes a file name,
and a boolean "detectEncodingFromByteOrderMarks" which I figured would
do exactly what I want - open the file and see if it's a Default
(ISO-8859-1), UTF-8 or Unicode (UTF-16) file.

Here's my function:

public string DetermineFileType(string aFileName)
{
    string sEncoding = string.Empty;

    StreamReader oSR = new StreamReader(aFileName, true);
    sEncoding = oSR.CurrentEncoding.EncodingName;

    return sEncoding;
}

But that doesn't seem to work - in this case, *all* files are being
labelled as "UTF-8", which I *KNOW* is *NOT* true....

Is there any easy way in C# to let it determine the file's encoding
reliably? Do I really need to "manually" look at the first three bytes
of each file?

Any ideas??

Marc
================================================================
Marc Scheuner May The Source Be With You!
Bern, Switzerland m.scheuner(at)inova.ch
 
> Is there any easy way in C# to let it determine the file's encoding
> reliably? Do I really need to "manually" look at the first three bytes
> of each file?

There *is* no way of determining it reliably. Something that starts
with "abc" could be in UTF-8, ASCII, UCS-2 without any BOM, Cp1252,
etc...
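
To make the ambiguity concrete, here is a minimal sketch that decodes the
same bytes under three encodings. The class and method names (AmbiguityDemo,
DecodeAll) are made up for illustration, and ISO-8859-1 stands in for the
"Default" code page mentioned above. For pure ASCII input such as "abc", every
decoder produces the same text, so the bytes alone cannot tell you which
encoding was used.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class AmbiguityDemo
{
    // Decodes the same bytes as ASCII, UTF-8 and ISO-8859-1, in that order.
    public static string[] DecodeAll(byte[] data)
    {
        return new[]
        {
            Encoding.ASCII.GetString(data),
            Encoding.UTF8.GetString(data),
            Encoding.GetEncoding("iso-8859-1").GetString(data)
        };
    }
}
```

Calling DecodeAll(new byte[] { 0x61, 0x62, 0x63 }) yields "abc" three times.
The results only start to differ once the data contains bytes above 0x7F.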

If you know that all your files are going to be *either* UCS-2 with a
BOM or UTF-8, that makes things a lot simpler - but you'll still
basically have to look at the first few bytes.
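
Looking at the first few bytes could be sketched roughly as follows. This
assumes only the UTF-8 and UTF-16 byte order marks matter; UTF-32 is
deliberately ignored (its little-endian BOM also starts with FF FE), and the
names BomSniffer, DetectBom and DetectBomInFile are hypothetical. A null
result means "no BOM found", which says nothing about the actual encoding.

```csharp
using System;
using System.IO;

// Hypothetical helper, for illustration only.
public static class BomSniffer
{
    // Returns a label for the BOM at the start of the data, or null if none is present.
    public static string DetectBom(byte[] b)
    {
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        return null; // no BOM: could be ASCII, a BOM-less UTF-8 file, Cp1252, ...
    }

    // Reads at most the first three bytes of the file and checks them for a BOM.
    public static string DetectBomInFile(string path)
    {
        byte[] head = new byte[3];
        using (FileStream fs = File.OpenRead(path))
        {
            int n = fs.Read(head, 0, head.Length);
            Array.Resize(ref head, n); // the file may be shorter than three bytes
        }
        return DetectBom(head);
    }
}
```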
 
Hi Marc,

You need to actually read from the file before the encoding can be detected
from the byte order mark (BOM).
Your sample never reads from the file, so the BOM has not been examined yet
when you query CurrentEncoding.

I modified your code as shown below.

public string DetermineFileType(string aFileName)
{
    string sEncoding = string.Empty;

    // Dispose the reader so the file handle is released.
    using (StreamReader oSR = new StreamReader(aFileName, true))
    {
        oSR.ReadToEnd(); // Add this line to read the file.
        sEncoding = oSR.CurrentEncoding.EncodingName;
    }

    return sEncoding;
}



Thanks,
Tadao Machida [MS]
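
One caveat with the approach above: when a file has no BOM, this
StreamReader constructor falls back to UTF-8, so BOM-less files are still
reported as UTF-8. A possible variation, sketched here under the assumption
that BOM-less files should be treated as ISO-8859-1, is to pass an explicit
fallback encoding and read a single character instead of the whole file. The
class and method names (EncodingProbe, DetectWithFallback) are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class EncodingProbe
{
    // Returns the encoding chosen from the BOM, or the fallback when the file has no BOM.
    public static Encoding DetectWithFallback(string path, Encoding fallback)
    {
        using (StreamReader reader = new StreamReader(path, fallback, true))
        {
            reader.Read(); // the BOM is only examined on the first read
            return reader.CurrentEncoding;
        }
    }
}
```

For example, DetectWithFallback(path, Encoding.GetEncoding("iso-8859-1"))
reports UTF-8 or UTF-16 for files that carry the matching BOM, and the
fallback for everything else. The fallback remains a guess, not a detection.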

--------------------
 