What is the fastest way to count lines in a text file?

Guest

I want to very quickly count the number of lines in text files without having
to read each line and increment a counter. I am working in VB.NET and C#.
Does anyone have a very fast example on how to do this?

Thanks,

Matt
 
Mesterak,

Various tests in these newsgroups have shown that simply looping through
the file contents as a Char array (comparing single chars, not strings)
and testing for the line break char is usually the fastest method.

I hope this helps,

Cor
 
Maybe using a regular expression can be a fast solution (for large text
files).
You would count matches for \r\n or \n.
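
For reference, a minimal sketch of what counting with a regular expression could look like in C#. It assumes the text is already in memory as a string, and the class and method names are made up for illustration. It makes no claim about speed.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

static class RegexLineCount
{
    // Counts line terminators (\r\n or a bare \n) in text that is already in memory.
    public static int CountLineBreaks(string text)
    {
        return Regex.Matches(text, @"\r?\n").Count;
    }

    // Hypothetical helper: reads the whole file into a string first,
    // so memory use grows with the size of the file.
    public static int CountLineBreaksInFile(string path)
    {
        return CountLineBreaks(File.ReadAllText(path));
    }
}
```

Note that this counts terminators, so a last line without a trailing line break is not included in the result.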
 
I tried the following which did not seem to work:

strContents = Regex.Replace(strContents, "\r{0,}\n+", vbCrLf)
myArrayList.AddRange(strContents.Split(CType(vbCrLf, Char)))
 
Vadym Stetsyak said:
Maybe using a regular expression can be a fast solution (for large text
files).

That's very unlikely, IMO.
You would count matches for \r\n or \n.

And how will you provide the text for the regular expression to match?
As far as I'm aware, you can't provide regular expressions with
TextReaders - you have to provide them with strings.
 
Mesterak,

The messages I mentioned show that using Split and Regex is by far the
slowest method to count lines.

Cor
 
So how can I count the lines of the file without loading the whole file into
memory as a string and counting lines?
 
mesterak said:
So how can I count the lines of the file without loading the whole file into
memory as a string and counting lines?

By reading chunks at a time (using StreamReader) and counting '\n'
occurrences.

Here's some sample code:

using System;
using System.IO;

class Test
{
    static int CountLines (TextReader reader)
    {
        char[] buffer = new char[32*1024]; // Read 32K chars at a time

        int total=1; // All files have at least one line!

        int read;
        while ( (read=reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i=0; i < read; i++)
            {
                if (buffer[i]=='\n')
                {
                    total++;
                }
            }
        }
        return total;
    }

    static void Main(string[] args)
    {
        foreach (string file in args)
        {
            using (StreamReader reader = new StreamReader(file))
            {
                Console.WriteLine ("{0}: {1} lines", file,
                                   CountLines(reader));
            }
        }
    }
}
 
Thanks, that works perfectly!!!

I wrote the following which apparently works but does require that the
entire file be read into memory (your code is better):

Public Function GetLineCount(ByVal FileName As String) As Integer

    If File.Exists(FileName) Then
        Dim LogReader As StreamReader
        LogReader = New StreamReader(New FileStream(FileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        Dim strContents As String = LogReader.ReadToEnd
        LogReader.Close()
        LogReader = Nothing
        Dim r As New Regex(Chr(10))
        Dim LineCount As Integer = r.Matches(strContents).Count
        r = Nothing
        Return LineCount
    End If

End Function


 
Ok, I used your baseline code to rewrite my VB.NET function. It is very fast
and efficient. The only thing I needed to add was a check to see if the
last character was a LF and increment the total if not; I get the correct
number of lines every time! Processing ~200MB of log files (209 files)
occurs extremely fast (only added 2 seconds overall to the date/time indexing
functions I was already performing.)

Thanks a million!!!

Here's my new VB.NET function to benefit anyone else needing to count lines
in a file in VB.NET:

Public Function GetLineCount(ByVal FileName As String) As Integer
    Dim total As Integer = 0

    If File.Exists(FileName) Then
        Dim buffer(32 * 1024) As Char
        Dim lastChar As Char = Chr(10) ' An empty file has no unterminated last line
        Dim read As Integer

        Dim reader As TextReader = File.OpenText(FileName)
        read = reader.Read(buffer, 0, buffer.Length)

        While (read > 0)
            For i As Integer = 0 To read - 1
                If buffer(i) = Chr(10) Then
                    total += 1
                End If
            Next

            ' Remember the last char of this chunk for the check after the loop
            lastChar = buffer(read - 1)
            read = reader.Read(buffer, 0, buffer.Length)
        End While

        reader.Close()

        ' Count a final line that does not end with LF
        If lastChar <> Chr(10) Then
            total += 1
        End If

    End If

    Return total
End Function

 
Jon,

I have seen you talk about multithreading many times in the past, and in my
opinion this is a perfect situation for multithreading.

An IO operation always has (IO) waits in it, so it is perfectly suited to
run in parallel with the counting thread.

Just my opinion.

Cor
 
Cor Ligthert said:
I have seen you talk about multithreading many times in the past, and in my
opinion this is a perfect situation for multithreading.

An IO operation always has (IO) waits in it, so it is perfectly suited to
run in parallel with the counting thread.

It's certainly *possible* that it would speed things up. I wouldn't
suggest that it's worth doing unless the performance of doing it in a
single thread is a problem though. Assuming the IO performance
dominates the time taken, you'd only be able to shave off the time
taken for the scanning, which I suspect would be absolutely minute.
Compare this with the development cost/risk of turning a simple bit of
single-threaded code into multi-threaded code, and I'd certainly need
to see concrete figures before taking that risk.
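
One way to get such figures for the single-threaded case is to time a pass over a real file. The following is only a sketch: the class and helper names are invented for illustration, and it assumes System.Diagnostics.Stopwatch is available (.NET 2.0 or later).

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class LineCountTiming
{
    // Chunked scan as discussed in this thread; counts '\n' characters only.
    public static int CountNewlines(TextReader reader)
    {
        char[] buffer = new char[32 * 1024];
        int total = 0;
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] == '\n')
                {
                    total++;
                }
            }
        }
        return total;
    }

    // Hypothetical helper: times one single-threaded pass over a file.
    public static TimeSpan TimeCount(string path, out int newlines)
    {
        Stopwatch watch = Stopwatch.StartNew();
        using (StreamReader reader = new StreamReader(path))
        {
            newlines = CountNewlines(reader);
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Repeated runs over the same file are likely to be served from the OS file cache, so the first run and later runs may differ a lot.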
 
The 2 VB.NET functions I created based on your code example are pretty darn
fast. I counted a total of several million lines across about 200+ files in
a matter of a few seconds. If someone has issues with this speed to require
multi-threading, then something's just wrong!

However, one of my new line counting functions is used in a separate thread
after my app initially counts the lines and partially indexes the files'
entries by date/time (to get a time reference per file so I only parse parts
of files applicable to the date/time window of interest.) The line counter
that runs in a separate thread goes back over all of the files and determines
the actual byte position per chr(10) detected. This enables the user of my
log viewer to quickly jump to a particular line and also speeds content
paging (for viewability performance.) So to answer Cor, yes it is good to
use in a separate thread when there are extended purposes at play which you
may not want your app (or user) to wait on to complete.
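
A rough sketch of that kind of line index in C# (not my actual code; the names are made up for illustration). It reads raw bytes and records the byte offset where each line starts, assuming an encoding such as ASCII or UTF-8 where the byte 0x0A only ever means LF. It would not be correct for UTF-16 files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LineIndex
{
    // Returns the byte offset at which each line starts.
    // Assumption: the byte 0x0A only ever means LF (ASCII, UTF-8).
    public static List<long> BuildLineOffsets(Stream stream)
    {
        List<long> offsets = new List<long>();
        offsets.Add(0); // the first line starts at byte 0

        byte[] buffer = new byte[32 * 1024];
        long position = 0; // offset of buffer[0] within the stream
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] == (byte)'\n')
                {
                    offsets.Add(position + i + 1); // next line starts after the LF
                }
            }
            position += read;
        }

        // No line starts at end of file (also covers the empty file).
        if (offsets[offsets.Count - 1] == position)
        {
            offsets.RemoveAt(offsets.Count - 1);
        }
        return offsets;
    }
}
```

To jump to line n (zero-based), seek the stream to offsets[n] with Seek(offsets[n], SeekOrigin.Begin) and read from there. offsets.Count is also the line count, with an unterminated last line included.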

-Matt
 