slow linq query

  • Thread starter Thread starter Eps
  • Start date Start date
E

Eps

Hi there,

I am doing the following, this is a List of audio files.

this.Where(p => p.Album == AnAudioFileObject.Album).Select(s =>
s.Artist).Distinct().Count() > 1;

The aim is to determine whether AnAudioFileObject is from an album that
has various artists on it or just one artist.

If I load several thousand audio files into the list it becomes very
slow, can anyone think of a way I could speed this up ?.

Any help appreciated.
 
Eps said:
I am doing the following, this is a List of audio files.

this.Where(p => p.Album == AnAudioFileObject.Album).Select(s =>
s.Artist).Distinct().Count() > 1;

The aim is to determine whether AnAudioFileObject is from an album that
has various artists on it or just one artist.

If I load several thousand audio files into the list it becomes very
slow, can anyone think of a way I could speed this up ?.

Any help appreciated.

Could you give definite figures for "several thousand" and "very slow"?
All of those should be linear operations as far as I'm aware, so it's
possible something else is going on.

Could you post a short but complete program which demonstrates the
problem?

See http://www.pobox.com/~skeet/csharp/complete.html for details of
what I mean by that.
 
Anders said:
How about

if (this.Any(p => p.Album == AnAudioFileObject.Album)) { .. }

? :-)

hmmm, do half of it with it linq and the other half programmatically ?.

I thought that linq and programmatic iteration were roughly about as
fast as each other, whats the advantage of mixing the two ?.

Just after I posted I realized I could do this...

this.Where(p => p.Album == AnAudioFileObject.Album).Take(2).Select(s =>
s.Artist).Distinct().Count() > 1;

The Take(2) does seem to have a significant impact on the speed, its not
ideal since there are albums with various artists that do feature the
same artist twice (or more times) but for my purposes I think it will be ok.
 
Jon said:
Could you give definite figures for "several thousand" and "very slow"?
All of those should be linear operations as far as I'm aware, so it's
possible something else is going on.

Could you post a short but complete program which demonstrates the
problem?

See http://www.pobox.com/~skeet/csharp/complete.html for details of
what I mean by that.

ok, lets say 6000 audio files, each is an object that will read in data
and metadata directly from the file on the disk, so its IO bound to a
certain extent. In terms of time I can't be exact but the time taken to
process a file jumped from about 30 - 40 seconds to over a couple of
minutes.

I think I kind of have a solution now so I won't bother trying to
replicate the problem in a demonstration program.

Thanks for your help though.
 
Eps said:
I am doing the following, this is a List of audio files.

this.Where(p => p.Album == AnAudioFileObject.Album).Select(s =>
s.Artist).Distinct().Count() > 1;

The aim is to determine whether AnAudioFileObject is from an album that
has various artists on it or just one artist.

If I load several thousand audio files into the list it becomes very
slow, can anyone think of a way I could speed this up ?.

I've just thought of something which could theoretically help. Instead
of calling Count(), call Take(1).Any(). That way as soon as Distinct()
yields its second element, you can finish.

That way you can deal with *huge* sequences, or even infinite ones (so
long as they aren't an infinite sequence repeating a single element).
For example:

using System;
using System.Linq;

public static class Test
{
static void Main()
{
var allPositiveInts = Enumerable.Range(0, int.MaxValue);
bool quick = allPositiveInts.Distinct().Take(1).Any();
Console.WriteLine("quick = " + quick);
bool slow = allPositiveInts.Distinct().Count() > 1;
Console.WriteLine ("slow = " + slow);
}
}

The result is:

quick = True

Unhandled Exception: OutOfMemoryException.



Having said all of this, I strongly suspect that you won't gain much
from this. Try printing out

this.Where(p => p.Album == AnAudioFileObject.Album)
.Select(s => s.Artist)
.Count()

just to see how many artist entries we're talking about as the input to
Distinct() to start with.
 
Eps said:
ok, lets say 6000 audio files, each is an object that will read in data
and metadata directly from the file on the disk, so its IO bound to a
certain extent. In terms of time I can't be exact but the time taken to
process a file jumped from about 30 - 40 seconds to over a couple of
minutes.

I think I kind of have a solution now so I won't bother trying to
replicate the problem in a demonstration program.

I strongly suspect the problem wasn't in the code you showed then -
because that part should be very fast. 6000 entries is nothing. I
suspect if you load them all into memory to start with, the query will
execute pretty much instantly.
 
Eps said:
Just after I posted I realized I could do this...

this.Where(p => p.Album == AnAudioFileObject.Album).Take(2).Select(s =>
s.Artist).Distinct().Count() > 1;

The Take(2) does seem to have a significant impact on the speed, its not
ideal since there are albums with various artists that do feature the
same artist twice (or more times) but for my purposes I think it will be ok.

See my other post for an alternative which will be correct but still
fast.
 
Jon said:
Having said all of this, I strongly suspect that you won't gain much
from this. Try printing out

this.Where(p => p.Album == AnAudioFileObject.Album)
.Select(s => s.Artist)
.Count()

just to see how many artist entries we're talking about as the input to
Distinct() to start with.

Hmmm, I haven't had a chance to test this yet but I think I know whats
going on.

Where the album is not set (String.IsNullOrEmpty) I put in "Unknown
Album". Obviously the query considers all these audio files (and there
will be a significant number of them) as a part of the same album.

I could combine your code with mine, do something like....

this.Where(p => p.Album == AnAudioFileObject.Album)
.Select(s => s.Artist)
.Distinct().Take(10).Any()

This still isn't foolproof, there could be an album with 10 tracks by
one artist and 1 track (or more) by another. But this should be good
enough, I will test it and post the results.
 
I'm sorry, I misread your original query - sorry about that (I promise I'll
head straight to bed) ;-)
 
Eps said:
Hmmm, I haven't had a chance to test this yet but I think I know whats
going on.

Where the album is not set (String.IsNullOrEmpty) I put in "Unknown
Album". Obviously the query considers all these audio files (and there
will be a significant number of them) as a part of the same album.

Even so it shouldn't be a problem.

Here's a complete program which generates 100,000 tracks and one album
with 1,000 tracks including 5 artists. It counts those 5 distinct
artists in 67ms on my box. That's pretty clear evidence to me that
something is going on beyond what you're aware of. It would be worth
getting to the bottom of that, as it may bite you later on - right now
you've got (I would imagine) a reasonably straightforward situation to
diagnose. The next symptom of the same problem could be much harder to
track down.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

public class Track
{
public string Artist { get; set; }
public Album Album { get; set; }

public static Track GenerateRandom()
{
return new Track
{
Album = new Album(),
Artist = GenerateRandomString()
};
}

static Random rng = new Random();
static string GenerateRandomString()
{
StringBuilder builder = new StringBuilder(15);
for (int i=0; i < 15; i++)
{
builder.Append((char)(rng.Next(26)+'A'));
}
return builder.ToString();
}
}

public class Album
{
}

public static class Test
{
static void Main()
{
// First generate lots of random tracks
List<Track> tracks = new List<Track>();
for (int i=0; i < 100000; i++)
{
tracks.Add(Track.GenerateRandom());
}

string[] artists = { "Mike Rutherfield", "Phil Collins",
"Tony Banks", "Peter Gabriel", "Steve Hackett" };

// Now make 1,000 (or thereabouts) of them belong to one album.
// Give each of them one of our 5 artists
Album picked = new Album();
Random rng = new Random();
for (int i = 0; i < 1000; i++)
{
Track track = tracks[rng.Next(tracks.Count)];
track.Album = picked;
track.Artist = artists[rng.Next(artists.Length)];
}

Console.WriteLine("Finding the distinct count...");
Stopwatch sw = Stopwatch.StartNew();
int count = tracks.Where(track => track.Album == picked)
.Select(track => track.Artist)
.Distinct()
.Count();
sw.Stop();
Console.WriteLine("Found {0} distinct artists in {1}ms",
count, sw.ElapsedMilliseconds);
}
}
I could combine your code with mine, do something like....

this.Where(p => p.Album == AnAudioFileObject.Album)
.Select(s => s.Artist)
.Distinct().Take(10).Any()

This still isn't foolproof, there could be an album with 10 tracks by
one artist and 1 track (or more) by another. But this should be good
enough, I will test it and post the results.

No, it was already foolproof beforehand. You only need Take(1), because
Take(1).Any() is the equivalent of Count() > 1. The important point is
that I put the Take(1) *after* the Distinct() whereas you put it
*before*.
 
Jon said:
No, it was already foolproof beforehand. You only need Take(1), because
Take(1).Any() is the equivalent of Count() > 1. The important point is
that I put the Take(1) *after* the Distinct() whereas you put it
*before*.

Yep, your completely right, I am now using the line below to determine
if a track is from an various artists album.

this.Where(p => p.Album == af.Album).Select(s =>
s.Artist).Distinct().Take(1).Any();

This seems to be very quick, thanks for all your help Jon, your examples
are as ever very informative.
 
Eps said:
Yep, your completely right, I am now using the line below to determine
if a track is from an various artists album.

this.Where(p => p.Album == af.Album).Select(s =>
s.Artist).Distinct().Take(1).Any();

This seems to be very quick, thanks for all your help Jon, your examples
are as ever very informative.

Thanks :) I'm still intrigued as to why it was taking a long time in
the first place though. It really, really shouldn't... if you're able
to send me some sample code, I'd be very interested to have a look. I
understand if you can't do that though.
 
I do not understand this terminology. Without contributing, may I ask
the stimulated minds if they can speed up this query any further:
Dim Names As IEnumerable(Of String) = _
From QryRow As QueriesRow _
In DTblQueries _
Where QryRow.Query = Me.sQuery _
Select QryRow.PIName

Dim DetailsWithImage As IEnumerable(Of DetailWithImage) = _
From _
TagRow As TagsRow In DTblTags _
Order By TagRow.Designator _
Join _
Name As String In Names _
On _
TagRow.Name Equals Name _
Join _
SeriesChkBx As QueriesRow In DTblSeriesChkBx_Session
_
On _
SeriesChkBx.Name Equals Name _
Select New DetailWithImage With { _
.TagRow = TagRow, .ImagePath = SeriesChkBx.Query _
}

Basically, 3 tables
1) Table DTblTags - from which rows are needed in ascending order of
column "Designator"
2) Table DTblQueries - which tells us which rows have to be picked from
DTblTags
3) Table DTblSeriesChkBx_Session from which a value ("Query") is needed
for every row shortlisted from the join of the first two tables.

Since we OPTION STRICT ON, the selection was done by making and using a
dummy class ("DetailWithImage"), which is here:
Private Class PITagDetailWithImage
Public TagRow As TagsRow
Public ImagePath As String
End Class

Thanks for the help. If I happen to strike something, will surely post
 
Back
Top