IEnumerable<T> caching technique

kndg

Dear all,

If I have a data sequence that is computationally expensive, what is
your recommended strategy for caching it? My current approach is the
sample code below, so any subsequent foreach over the sequence would
be as fast as possible. Is this a good approach? Or does .NET already
have a built-in class for this purpose?

using System;
using System.Collections;
using System.Collections.Generic;
using System.Threading;

public class TestSequenceCaching : IEnumerable<int>
{
    private const int MAX_DATA = 10;
    private readonly List<int> dataCache = new List<int>(MAX_DATA);

    public int Length
    {
        get { return MAX_DATA; }
    }

    public int this[int index]
    {
        get
        {
            if (index < 0 || index >= MAX_DATA) throw new IndexOutOfRangeException();

            if (index >= dataCache.Count)
            {
                var enumerator = GetEnumerator();

                // enumerate (and thereby cache) until the requested index is reached
                while (enumerator.MoveNext() && dataCache.Count <= index) { }
                return enumerator.Current;
            }

            return dataCache[index];
        }
    }

    public IEnumerator<int> GetEnumerator()
    {
        for (int i = 0; i < MAX_DATA; i++)
        {
            if (dataCache.Count <= i)
            {
                int result = GetData(i);
                dataCache.Add(result);
                yield return result;
            }
            else
            {
                yield return dataCache[i];
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    private static int GetData(int i)
    {
        Thread.Sleep(2000); // simulate expensive computation

        return i; // just return the input, for illustration purposes
    }
}

Regards.
 
kndg said:
Dear all,

If I have a data sequence that is computationally expensive, what is
your recommended strategy for caching it? My current approach is the
sample code below, so any subsequent foreach over the sequence would
be as fast as possible. Is this a good approach? Or does .NET already
have a built-in class for this purpose?

Caching is too scenario-specific for general-purpose advice to be
terribly valuable, IMHO. But, as a general approach: I would implement the
enumeration as IEnumerable<T>, and then if caching is needed, just pass
an instance of the IEnumerable<T> to the List<T> constructor that takes one.
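
For instance (illustrative; "expensiveSequence" stands in for whatever
IEnumerable<int> you actually have):

// Materialize the expensive sequence exactly once; from here on,
// indexing into the list and re-enumerating it are both cheap.
List<int> cache = new List<int>(expensiveSequence);

int first = cache[0];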

I don't think there's anything wrong with the idea of the code you
posted per se, but it appears that your class conflates the data
generation and caching duties. I'd think it would make more sense to
generalize the caching; even if you've got an indexable data source, you
can create a general-purpose cache class that can be reused with a
variety of indexable data sources.

As far as the specifics of your implementation go, a couple of observations:

– your indexer could be changed to eliminate the "return
enumerator.Current". After all, the whole point of the loop is to
enumerate until your cache includes the value of interest. So you know
that value is available in the cache, which the last statement in the
indexer can return.

– One big efficiency issue you seem to have is that the indexer
enumerates repeatedly. A theoretically common usage for your class
might look like this:

TestSequenceCaching tsc = new TestSequenceCaching();

for (int i = 0; i < tsc.Length; i++)
{
    Console.WriteLine(tsc[i]);
}

Of course, each time you call the indexer, you start a new
enumeration, iterating until you reach the value you want. This is
awful algorithmically, even though you are caching earlier values: it
turns what ought to be O(n) into O(n^2).

If you've got an enumeration that does have to be restarted from scratch
each time, then it would be best (most efficient) to run through the
entire thing all at once, storing all the results. If the enumeration
is really just a wrapper around some kind of indexed retrieval, as in
your example, then the caching should be more intelligent. That is,
being able to skip to a specific value of interest and/or extending the
cache in an efficient way.

At the very least, when a value is requested that's outside the cache,
the cache ought to be extended by some minimum number of elements, even
if the one being requested is just past the end of the cache.
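
For example (uncompiled and untested, and the chunk size is arbitrary),
a helper along these lines inside your class:

private const int CHUNK = 16; // arbitrary minimum extension

private void ExtendCacheTo(int index)
{
    // Extend at least through 'index', but always by at least CHUNK
    // elements, so a request just past the end of the cache doesn't
    // trigger a single-element fetch.
    int target = Math.Max(index, dataCache.Count + CHUNK - 1);

    for (int i = dataCache.Count; i <= target && i < MAX_DATA; i++)
    {
        dataCache.Add(GetData(i));
    }
}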

For an example of a more robust caching scheme, take a look at the
System.Windows.Forms.ListView caching API. In virtual mode, it has a
way to interact with the client code to cache sections of the data of
interest. It's not a great general-purpose caching scheme, but for that
particular application it can work very well.

Pete
 
[...]
Caching is too scenario-specific for general-purpose advice to be
terribly valuable, IMHO. But, as a general approach: I would implement the
enumeration as IEnumerable<T>, and then if caching is needed, just pass
an instance of the IEnumerable<T> to the List<T> constructor that takes
one.

For a general-purpose approach, yes, I think it would be feasible. But my
target is to get the first item as quickly as possible. If I pass the
sequence to List<T>, I would have to wait until all the items had been
enumerated just to get the first item (in that case, I would just use an
array instead).
[...]
As far as the specifics of your implementation go, a couple of
observations:

– your indexer could be changed to eliminate the "return
enumerator.Current". After all, the whole point of the loop is to
enumerate until your cache includes the value of interest. So you know
that value is available in the cache, which the last statement in the
indexer can return.

Good point!
– One big efficiency issue you seem to have is that the indexer
enumerates repeatedly. A theoretically common usage for your class might
look like this:

TestSequenceCaching tsc = new TestSequenceCaching();

for (int i = 0; i < tsc.Length; i++)
{
    Console.WriteLine(tsc[i]);
}

Of course, each time you call the indexer, you start a new
enumeration, iterating until you reach the value you want. This is awful
algorithmically, even though you are caching earlier values: it turns
what ought to be O(n) into O(n^2).


I'm having a hard time seeing why it would become an O(n^2) operation.
Perhaps you can elaborate? You may be referring to the calls to
Enumerator.MoveNext(), but since I have an if-block that returns the
cached data instead of doing the actual computation, I think it would
not be a noticeable performance problem. The purpose of that loop is
just to trigger the caching operation if the data is not in the cache.
If you've got an enumeration that does have to be restarted from scratch
each time, then it would be best (most efficient) to run through the
entire thing all at once, storing all the results. If the enumeration is
really just a wrapper around some kind of indexed retrieval, as in your
example, then the caching should be more intelligent. That is, being
able to skip to a specific value of interest and/or extending the cache
in an efficient way.

Perhaps my example is flawed. I should not have included the indexer in
the first place. It is just for illustration purposes: for example, when
the user calls it using the indexer, e.g. tsc[4], then when they call
tsc[5], it shouldn't take long to get the result (in this example,
2 seconds). My point is to have data that, although it is very
computationally expensive, gives fast access to the first item, and once
all the data has been cached, any subsequent access will be fast as
well.
At the very least, when a value is requested that's outside the cache,
the cache ought to be extended by some minimum number of elements, even
if the one being requested is just past the end of the cache.

For an example of a more robust caching scheme, take a look at the
System.Windows.Forms.ListView caching API. In virtual mode, it has a way
to interact with the client code to cache sections of the data of
interest. It's not a great general-purpose caching scheme, but for that
particular application it can work very well.

I will take a look at this API. Pretty big class, though.

Thanks for your input.
You have always been very helpful.

Regards.
 
[...]
Caching is too scenario-specific for general-purpose advice to be
terribly valuable, IMHO. But, as a general approach: I would implement the
enumeration as IEnumerable<T>, and then if caching is needed, just pass
an instance of the IEnumerable<T> to the List<T> constructor that takes
one.

For a general-purpose approach, yes, I think it would be feasible. But my
target is to get the first item as quickly as possible. If I pass the
sequence to List<T>, I would have to wait until all the items had been
enumerated just to get the first item (in that case, I would just use an
array instead).

Okay, I think general-purpose caching would probably look like this
(at first I thought to use threading...):

// ExpensiveSequence is a class that implements IEnumerable<int>
ExpensiveSequence expensiveSequence = new ExpensiveSequence();
List<int> expensiveSequenceCache = new List<int>();

foreach (int data in expensiveSequence)
{
    // do something with data
    expensiveSequenceCache.Add(data);
}

// from here onward, when we need the data again, just use the cache
int data3 = expensiveSequenceCache[3];

Though we have two objects, I think it is simple enough for
general-purpose caching. Any thoughts?
 
kndg said:
[...]
Caching is too scenario-specific for general-purpose advice to be
terribly valuable, IMHO. But, as a general approach: I would implement the
enumeration as IEnumerable<T>, and then if caching is needed, just pass
an instance of the IEnumerable<T> to the List<T> constructor that takes
one.

For a general-purpose approach, yes, I think it would be feasible. But my
target is to get the first item as quickly as possible. If I pass the
sequence to List<T>, I would have to wait until all the items had been
enumerated just to get the first item (in that case, I would just use an
array instead).

It just depends on your needs. Getting to the first item as quickly as
possible may or may not be a valid design goal. IMHO, while it _can_ be
valid, its importance is overrated as compared to the question of
keeping code simple.

Also, if that _does_ turn out to be a justifiably important design goal,
one may want to consider using background processing to implement the
enumeration/caching logic. Harlan Messinger just recently started at
least a couple of topics here in which he did just that for an
enumeration of files at a specific path. His techniques may be of
interest to you. The threads can be found here:
http://groups.google.com/group/micr...read/thread/1fc00380b4518144/1791120f0648cd6c
http://groups.google.com/group/micr...read/thread/2df9edeeb1735786/f8a9b39230484179

(I think the first thread is more relevant to this discussion, but
the second might have some useful information for you too).

All that said, note that in the example you posted, you have the
_ability_ to retrieve the value asked for _immediately_, and yet will
search the entire enumerable data set from index 0 up to the index asked
for. This seems contrary to your stipulation that you want to "get the
first item as quickly as possible".
[...]
TestSequenceCaching tsc = new TestSequenceCaching();

for (int i = 0; i < tsc.Length; i++)
{
    Console.WriteLine(tsc[i]);
}

Of course, each time you call the indexer, you start a new
enumeration, iterating until you reach the value you want. This is awful
algorithmically, even though you are caching earlier values: it turns
what ought to be O(n) into O(n^2).


I'm having a hard time seeing why it would become an O(n^2) operation.
Perhaps you can elaborate? You may be referring to the calls to
Enumerator.MoveNext(), but since I have an if-block that returns the
cached data instead of doing the actual computation, I think it would
not be a noticeable performance problem. The purpose of that loop is
just to trigger the caching operation if the data is not in the cache.


I fear you are looking only at the "expensive" operation to actually
fetch data.

You are correct that subsequent iterations through the cached portion of
the data set are much less expensive. But for them to exist at all
still creates an O(n^2) situation: retrieving element i via your indexer
costs i + 1 calls to MoveNext(), so walking all of MAX_DATA = 10 elements
that way takes 1 + 2 + ... + 10 = 55 MoveNext() calls instead of 10, and
that overhead grows quadratically with n.

Why go to the trouble to implement caching, but then do so in a
sub-optimal way?

It's like saying that you'll use a bubble sort to order data, just
because retrieving the data from magnetic tape is so slow, even a bubble
sort is relatively instantaneous once everything's in memory. Just
because the bubble sort is fast as compared to the i/o, that doesn't
make it the right sorting algorithm to use.
If you've got an enumeration that does have to be restarted from scratch
each time, then it would be best (most efficient) to run through the
entire thing all at once, storing all the results. If the enumeration is
really just a wrapper around some kind of indexed retrieval, as in your
example, then the caching should be more intelligent. That is, being
able to skip to a specific value of interest and/or extending the cache
in an efficient way.

Perhaps my example is flawed. I should not have included the indexer in
the first place. It is just for illustration purposes: for example, when
the user calls it using the indexer, e.g. tsc[4], then when they call
tsc[5], it shouldn't take long to get the result (in this example,
2 seconds). My point is to have data that, although it is very
computationally expensive, gives fast access to the first item, and once
all the data has been cached, any subsequent access will be fast as
well.

Nevertheless, a) it seems reasonable to expect that if you are going to
the trouble to cache data, you might as well include _some_ API to
provide random access into the cached enumeration, and b) the given
implementation is inefficient.

Personally, I would implement your example such that the entire
enumeration is cached on first access. You can still return the value
of interest as soon as you arrive at it, or even (as in the example
you've provided) get that value first, and then go back and cache the
rest as a background task (similar to the stuff Harlan was doing).

But at the very least, the implementation should not have to restart the
enumeration every time the client code asks for a new value. For
example, you could save the IEnumerator instance as part of your cache
data, and simply resume the previous enumeration when you need to fetch
more values.
[...]
For an example of a more robust caching scheme, take a look at the
System.Windows.Forms.ListView caching API. In virtual mode, it has a way
to interact with the client code to cache sections of the data of
interest. It's not a great general-purpose caching scheme, but for that
particular application it can work very well.

I will take a look at this API. Pretty big class, though.

It's huge. Just focus on the high-level description of the class and
the virtual mode stuff. The rest is irrelevant to this discussion.

Pete
 
kndg said:
Okay, I think general-purpose caching would probably look like this
(at first I thought to use threading...):

// ExpensiveSequence is a class that implements IEnumerable<int>
ExpensiveSequence expensiveSequence = new ExpensiveSequence();
List<int> expensiveSequenceCache = new List<int>();

foreach (int data in expensiveSequence)
{
    // do something with data
    expensiveSequenceCache.Add(data);
}

// from here onward, when we need the data again, just use the cache
int data3 = expensiveSequenceCache[3];

Though we have two objects, I think it is simple enough for
general-purpose caching. Any thoughts?

That may be appropriate for simple, one-off code where you can mix the
caching with the client code.

Otherwise, I would do something more general purpose. For example,
here's a non-threaded alternative to your original example (warning:
uncompiled, untested):

class Cache<T> : IEnumerable<T>
{
    private readonly List<T> _list = new List<T>();
    private IEnumerator<T> _enumerator;
    private int? _count;

    public Cache(IEnumerable<T> source)
    {
        _enumerator = source.GetEnumerator();

        // non-generic ICollection (System.Collections): gives us the
        // count up front when the source happens to be a collection
        ICollection collection = source as ICollection;

        if (collection != null)
        {
            _count = collection.Count;
        }
    }

    public int Count
    {
        get
        {
            // drive the enumeration to its end, caching as we go
            while (_count == null)
            {
                T t;

                GetElementAt(_list.Count, out t);
            }

            return _count.Value;
        }
    }

    public T this[int index]
    {
        get
        {
            T t;

            if (index < 0 || (_count != null && _count.Value <= index) ||
                !GetElementAt(index, out t))
            {
                throw new IndexOutOfRangeException();
            }

            return t;
        }
    }

    private bool GetElementAt(int index, out T t)
    {
        while (_list.Count <= index)
        {
            if (!_enumerator.MoveNext())
            {
                _count = _list.Count;
                _enumerator.Dispose();
                _enumerator = null;
                t = default(T);
                return false;
            }

            _list.Add(_enumerator.Current);
        }

        t = _list[index];
        return true;
    }

    public IEnumerator<T> GetEnumerator()
    {
        T t;

        for (int i = 0; GetElementAt(i, out t); i++)
        {
            yield return t;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

With a class like that, you can throw _any_ IEnumerable<T>
implementation at it, and it will cache the results in a reasonably
efficient way. Similar to your original example, but without having to
repeat work.
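
For instance, usage might look something like this (illustrative;
ExpensiveSequence is the IEnumerable<int> class from your earlier post):

Cache<int> cache = new Cache<int>(new ExpensiveSequence());

Console.WriteLine(cache[3]); // enumerates and caches elements 0 through 3
Console.WriteLine(cache[2]); // served straight from the cache

foreach (int value in cache) // resumes the one underlying enumeration
{
    Console.WriteLine(value);
}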

The above can be modified, with some effort, to perform the underlying
enumeration as a background task. This would be similar to the work
Harlan was doing in his LazyFileList class.

Note that I make no representation that the above is necessarily the
best fit for _all_ caching needs (or even the best possible
implementation for the basic technique, for that matter). As I
mentioned before, caching can be very scenario-specific. The above
simply happens to be a reasonably efficient, but general purpose
implementation of one caching scheme.

With respect to the above caveat, it's important to keep in mind that
this sort of approach is really only appropriate for caching data sets
that are not exceptionally large. At some point, one will want a cache
that behaves in a more traditional way, maintaining only some subset of
the original data at a time, discarding old data and populating the
cache with new as the client code moves through or around the actual
data set.

But if it's reasonable to eventually keep the entire data set in memory,
and you expect a linear, from-the-beginning access pattern (something
that is implicitly enforced in my example by requiring the data source
to implement IEnumerable<T> :) ), the above code should work reasonably
well.

I have some ideas for a similar cache, but one which works with a data
source that supports random access. Lacking time, I won't post a full
implementation, but the basic idea (sketched in skeleton form after the
list below) would be:

– Rather than a List<T>, use a T[] array and manage growth of the
array explicitly as source data are fetched

– Include a bitmap to indicate which data elements have actually been
retrieved

– The data source would implement an interface that stipulates the
ability to retrieve a given element for a given index. This can either
be IList<T>, or you can define a simpler custom interface that just
includes a method for fetching an element given an index and skips the
other features of IList<T>.
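
In skeleton form, the pieces might fit together like this (uncompiled
and untested; the names are made up, and for brevity this sketch
preallocates the array rather than growing it):

interface IIndexedSource<T>
{
    int Count { get; }
    T FetchAt(int index); // the expensive, index-based retrieval
}

class RandomAccessCache<T>
{
    private readonly IIndexedSource<T> _source;
    private readonly T[] _items;
    private readonly bool[] _present; // the "bitmap": which elements are cached

    public RandomAccessCache(IIndexedSource<T> source)
    {
        _source = source;
        _items = new T[source.Count];
        _present = new bool[source.Count];
    }

    public T this[int index]
    {
        get
        {
            if (!_present[index])
            {
                // fetch on demand, skipping directly to the index of interest
                _items[index] = _source.FetchAt(index);
                _present[index] = true;
            }

            return _items[index];
        }
    }
}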

Hope that helps.

Pete
 
Peter said:
[...]
Otherwise, I would do something more general purpose. For example,
here's a non-threaded alternative to your original example (warning:
uncompiled, untested):

As is sometimes the case, the above warning is especially important for
the code I posted here. :)

Here's a slightly better version of the "GetElementAt()" method. The
previous one had a potential null-reference exception. Note the new
helper method included with this version:

private bool GetElementAt(int index, out T t)
{
    while (_list.Count <= index)
    {
        if (_enumerator == null || !MoveNext())
        {
            t = default(T);
            return false;
        }

        _list.Add(_enumerator.Current);
    }

    t = _list[index];
    return true;
}

private bool MoveNext()
{
    if (!_enumerator.MoveNext())
    {
        _count = _list.Count;
        _enumerator.Dispose();
        _enumerator = null;
        return false;
    }

    return true;
}

With the above changes, the indexer could be a little simpler too:

public T this[int index]
{
    get
    {
        T t;

        if (index < 0 || !GetElementAt(index, out t))
        {
            throw new IndexOutOfRangeException();
        }

        return t;
    }
}

Sorry for any inconvenience. That's what I get for trying to be clever. :)

Pete
 
Hi Pete,

As I was reading your previous post and Harlan's posts, you had already
posted this new message (along with your implementation code). You're too
fast. Thanks for your input. I will study it, and I think I get the idea
of what my caching class will look like. You have been a great help.

Thanks again. And as somebody has said in this newsgroup, you are truly
priceless. Regards.
 
kndg said:
Hi Pete,

As I was reading your previous post and Harlan's posts, you had already
posted this new message (along with your implementation code). You're too
fast. Thanks for your input. I will study it, and I think I get the idea
of what my caching class will look like. You have been a great help.

Glad you find the input useful.
Thanks again. And as somebody has said in this newsgroup, you are truly
priceless. Regards.

<blush>

Thanks for the very kind comment. I would be remiss if I failed to
point out that, if you consider me "priceless", I think there are a
number of other very helpful folks who also post to this newsgroup who
are also priceless. :)

For that matter, I think your high praise of me risks glossing over your
own contributions to the newsgroup!

I'm very gratified to hear my posts are considered useful. It makes the
time spent writing a reply all worthwhile, and of course is great for my
ego. :) But I think it's worth pointing out that I don't carry this
newsgroup alone. I could stop participating tomorrow, and there would
still be plenty of smart, helpful people to make the newsgroup as useful
as it ever was.

In fact, one of the reasons I continue to participate here is that this
newsgroup was so helpful to me when I was first learning C# and .NET! I
owe quite a lot to the other helpful people who participate in this
newsgroup.

All that said…I certainly don't mind being complimented. Thanks very
much! :)

Pete
 
Peter said:
[...]
Otherwise, I would do something more general purpose. For example,
here's a non-threaded alternative to your original example (warning:
uncompiled, untested):
[...]

Hi Pete,

Based on your idea, below is the threaded version (though I have
over-simplified it a bit). I have to admit that I'm pretty dumb at
threading, so I would highly appreciate any advice on the code below.
If I understand threading correctly, the objects that get modified in
this code are cache, enumerator, and count, so I have to protect these
objects from multiple access. The only place this state gets modified
is in the GetElementAt() method, so I think it is enough if I place
the lock in this method. Am I correct? (Though I still feel uneasy...)

using System;
using System.Collections;
using System.Collections.Generic;
using System.Threading;

namespace Nizetto.Collections
{
    public class CachedEnumerable<T> : IEnumerable<T>
    {
        private readonly List<T> cache = new List<T>();
        private IEnumerator<T> enumerator;
        private int count = -1;
        private readonly object locker = new object();

        public CachedEnumerable(IEnumerable<T> source)
            : this(source, false)
        {
        }

        public CachedEnumerable(IEnumerable<T> source, bool threadedCaching)
        {
            enumerator = source.GetEnumerator();

            if (threadedCaching)
            {
                new Thread(ThrottleCache).Start();
            }
        }

        public int Count
        {
            get
            {
                ThrottleCache();

                return count;
            }
        }

        public T this[int index]
        {
            get
            {
                if (index < 0 || !GetElementAt(index))
                {
                    throw new IndexOutOfRangeException();
                }

                return cache[index];
            }
        }

        private void ThrottleCache()
        {
            while (count < 0)
            {
                GetElementAt(cache.Count);
            }
        }

        private bool GetElementAt(int index)
        {
            while (cache.Count <= index)
            {
                lock (locker)
                {
                    if (enumerator == null) return false;

                    if (!enumerator.MoveNext()) // no next item -> enumeration complete
                    {
                        count = cache.Count;
                        enumerator.Dispose();
                        enumerator = null;

                        return false;
                    }

                    cache.Add(enumerator.Current);
                }
            }

            return true;
        }

        public IEnumerator<T> GetEnumerator()
        {
            for (int i = 0; GetElementAt(i); i++)
            {
                yield return cache[i];
            }
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }
}
 
kndg said:
Peter said:
[...]
Otherwise, I would do something more general purpose. For example,
here's a non-threaded alternative to your original example (warning:
uncompiled, untested):
[...]

Hi Pete,

Based on your idea, below is the threaded version (though I have
over-simplified it a bit). I have to admit that I'm pretty dumb at
threading, so I would highly appreciate any advice on the code below.
If I understand threading correctly, the objects that get modified in
this code are cache, enumerator, and count, so I have to protect these
objects from multiple access. The only place this state gets modified
is in the GetElementAt() method, so I think it is enough if I place
the lock in this method. Am I correct? (Though I still feel uneasy...)

It's late, and I haven't really had time to look through the code
closely. But a couple of things stand out:

– You need to lock around _all_ data structure access in
GetElementAt(). There's no guarantee about the order in which the
List<T>.Count property is updated as compared to actually copying the
relevant value to the internal storage. For the same reason, you need
to lock around the actual access to the List<T> item. (A sketch of both
fixes follows these two points.)

– It is IMHO a poor idea to have the possibility of multiple threads
all trying to fill the cache at once. To some extent this problem would
be mitigated with broader locking; a big issue in your current
implementation is that the GetElementAt() method acquires and releases
the lock over and over again, providing lots of opportunity for
contention. But even if that's fixed, IMHO it would be better for
threads accessing the data structure to coordinate with each other, so
that if someone (for example) calls Count while the background thread is
working to fill the cache, it just waits until that's done and then uses
the count that the other thread determined, rather than simultaneously
trying to do the same work itself.
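
To make both points concrete, here is one possible shape for those
fixes (uncompiled and untested, and deliberately simplified; the
Monitor-based Count assumes the background fill thread is actually
running):

private bool GetElementAt(int index)
{
    lock (locker)
    {
        // The cache.Count checks, MoveNext(), Add(), and the completion
        // bookkeeping all happen under a single lock acquisition now.
        while (cache.Count <= index)
        {
            if (enumerator == null) return false;

            if (!enumerator.MoveNext())
            {
                count = cache.Count;
                enumerator.Dispose();
                enumerator = null;

                // wake any thread waiting for the enumeration to finish
                Monitor.PulseAll(locker);
                return false;
            }

            cache.Add(enumerator.Current);
        }

        return true;
    }
}

public int Count
{
    get
    {
        lock (locker)
        {
            // Wait for the background fill to announce completion rather
            // than racing it to enumerate the same data. (In the
            // non-threaded case, Count would still have to drive the
            // enumeration itself, as in your version.)
            while (count < 0)
            {
                Monitor.Wait(locker);
            }

            return count;
        }
    }
}

Correspondingly, the indexer's read of cache[index] would also need to
happen under the same lock, rather than after GetElementAt() returns.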

Of course, in addressing the first issue, you'll also create a situation
where the lock is acquired and held for much longer at once than it is
in your implementation, making it a bit trickier to allow client code to
access already-cached values while the background thread is still working.

This is addressable, and ironically involves reintroducing a slightly
different variation on the "repeatedly acquire and release the lock"
issue I mentioned earlier. The difference being that, in the fixed
version, you wouldn't have threads fighting for the lock (managing the
lock is much more efficient when there's not actually contention going on).

Anyway, beyond that rough sketch, I don't have time at the moment to go
through and produce a complete alternative I'm confident would work
better. Hopefully the above can steer you in the right direction.

(I do like the refactoring you did of the GetElementAt() method, with
respect to eliminating the "out" parameter; it just means that for the
threaded version, you need locking elsewhere too, not just in that method).

Pete
 
Hi Pete,

Thank you very much for your response.
I think I have to step back and re-study threading.
(I don't have any idea how to put your suggestions into code.)
Anyway, it is not an urgent project, and at the moment I can still live
with the non-threaded version (though a threaded version would be a very
nice feature to have).

Thanks for taking the time to reply (even late at night, though you
should take better care of your health).
Regards.
 