Detect encoding of a text file

Marc Scheuner [MVP ADSI] · Jan 21, 2004

Folks,

I have a number of text files in a directory, and I'd like to know
what type of encoding they're in.

I thought I could just open a StreamReader for each file one at a
time, and let .NET determine the encoding (Default, UTF-8, Unicode) -
there's a nice constructor for StreamReader which takes a file name,
and a boolean "detectEncodingFromByteOrderMarks" which I figured would
do exactly what I want - open the file and see if it's a Default
(ISO-8859-1), UTF-8 or Unicode (UTF-16) file.

Here's my function:

    public string DetermineFileType(string aFileName)
    {
        string sEncoding = string.Empty;

        StreamReader oSR = new StreamReader(aFileName, true);
        sEncoding = oSR.CurrentEncoding.EncodingName;

        return sEncoding;
    }

But that doesn't seem to work - in this case, *all* files are being
labelled as "UTF-8", which I *KNOW* is *NOT* true....

Is there any easy way in C# to let it determine the file's encoding
reliably? Do I really need to "manually" look at the first three bytes
of each file?

Any ideas??

Marc
================================================================
Marc Scheuner May The Source Be With You!
Bern, Switzerland m.scheuner(at)inova.ch

Jon Skeet [C# MVP] · Jan 21, 2004

> Is there any easy way in C# to let it determine the file's encoding
> reliably? Do I really need to "manually" look at the first three bytes
> of each file?

There *is* no way of determining it reliably. Something that starts
with "abc" could be in UTF-8, ASCII, UCS-2 without any BOM, Cp1252,
etc...
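
To illustrate the ambiguity, here is a minimal sketch (the class and method names are made up for this example) that checks whether a string encodes to the same bytes under several encodings. It uses ISO-8859-1 as a stand-in for Cp1252, on the assumption that Cp1252 may not be available without registering an extra encoding provider on newer runtimes.

```csharp
using System;
using System.Linq;
using System.Text;

public static class AmbiguityDemo
{
    // Hypothetical helper: returns true when every given encoding
    // produces exactly the same byte sequence for the text.
    public static bool SameBytesInAll(string text, params Encoding[] encodings)
    {
        byte[] first = encodings[0].GetBytes(text);
        return encodings.All(e => e.GetBytes(text).SequenceEqual(first));
    }
}
```

For "abc", ASCII, UTF-8 and ISO-8859-1 all yield the bytes 61 62 63, so nothing in the file content says which one the author intended. The results only diverge once a non-ASCII character such as "é" appears.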

If you know that all your files are going to be *either* UCS-2 with a
BOM or UTF-8, that makes things a lot simpler - but you'll still
basically have to look at the first few bytes.
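
Looking at the first few bytes by hand could be sketched roughly as follows. The class and method names are hypothetical, and the sketch only recognises the UTF-8 and UTF-16 byte order marks; it deliberately returns null when there is no BOM rather than guessing.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding implied by a leading byte order mark,
    // or null when the bytes do not start with a recognised BOM.
    // UTF-32 BOMs are not handled in this sketch.
    public static Encoding DetectFromBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16, little-endian
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16, big-endian
        return null;
    }

    // Reads at most the first three bytes of the file and inspects them.
    public static Encoding DetectFromBom(string path)
    {
        byte[] head = new byte[3];
        int count = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            int read;
            while (count < head.Length &&
                   (read = fs.Read(head, count, head.Length - count)) > 0)
            {
                count += read;
            }
        }
        Array.Resize(ref head, count);
        return DetectFromBom(head);
    }
}
```

A null result means "no BOM found", which is not the same as "not UTF-8": plenty of UTF-8 files are written without one.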

Tadao Machida [MS] · Jan 22, 2004

Hi Marc,

You need to actually read from the file before the encoding can be
detected from the byte order mark.
Your sample never reads the file, which means that the BOM has not been
read yet.

I modified your code as shown below.

    public string DetermineFileType(string aFileName)
    {
        string sEncoding = string.Empty;

        // The using block closes the file when we are done with it.
        using (StreamReader oSR = new StreamReader(aFileName, true))
        {
            oSR.ReadToEnd(); // Add this line to read the file.
            sEncoding = oSR.CurrentEncoding.EncodingName;
        }

        return sEncoding;
    }

Thanks,
Tadao Machida [MS]
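
As a possible variation on the fix above: since the detection happens on the first read, reading a single character should be enough, which avoids pulling a large file into memory. This is a sketch only, and the class name is made up for the example.

```csharp
using System.IO;
using System.Text;

public static class EncodingProbe
{
    public static string DetermineFileType(string aFileName)
    {
        using (StreamReader oSR = new StreamReader(aFileName, true))
        {
            // One Read() fills the reader's buffer, at which point
            // any byte order mark at the start of the file is examined.
            oSR.Read();
            return oSR.CurrentEncoding.EncodingName;
        }
    }
}
```

Note that a file without a BOM will still be reported as UTF-8, because that is the encoding this StreamReader constructor starts with when nothing else is detected. So "UTF-8" here can mean either "UTF-8 BOM found" or "no BOM at all".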

