Indexing - How does/should it work? - Debate


Guest

I read often hereabouts that the index does not update in real time, to which
the obvious question is: why not?

The indexer receives notifications of events that relate to finding & use
(such as create and delete - I don't know what the whole list is...) so it
seems to me logical that it should cache recent events.

Searches for files should be masked with the results of the cache: if the main
index says the file exists but the cache says it doesn't then don't show that
result; if a file is moved then redirect the result to the new location using
the cache; etc. etc.; if the file is only in the cache then... well, there's
no conflict with the main index. (NB I do hope that when a file is moved the
Indexer doesn't delete and rebuild for that file - I hope it's clever enough
to save that effort...)
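
To make the masking idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the function name `apply_overlay`, the event tuples, the `matches` predicate); it only illustrates the suggestion above and says nothing about how the real indexer is built.

```python
def apply_overlay(index_hits, events, matches):
    """Mask main-index results with a cache of recent file events.

    index_hits: paths the (possibly stale) main index returned for a query.
    events:     recent events in chronological order, as tuples:
                ("create", path), ("delete", path), ("move", old, new).
    matches:    predicate testing a path against the query using cheap
                attributes only (here: the name), for files not yet indexed.
    """
    results = list(dict.fromkeys(index_hits))  # drop duplicates, keep order
    for event in events:
        kind = event[0]
        if kind == "delete":
            # The index still lists the file, but the cache says it is gone.
            results = [p for p in results if p != event[1]]
        elif kind == "move":
            old, new = event[1], event[2]
            if old in results:
                # Redirect the stale hit to the new location.
                results = [new if p == old else p for p in results]
            elif matches(new):
                results.append(new)
        elif kind == "create" and matches(event[1]):
            # Only in the cache: no conflict with the main index.
            results.append(event[1])
    return list(dict.fromkeys(results))


hits = ["C:/docs/report.txt", "C:/docs/old.txt"]
events = [
    ("delete", "C:/docs/old.txt"),
    ("move", "C:/docs/report.txt", "C:/archive/report.txt"),
    ("create", "C:/docs/report2.txt"),
    ("create", "C:/docs/notes.md"),
]
masked = apply_overlay(hits, events, lambda p: "report" in p)
```

The main index is never touched here; the overlay is applied at query time, so the expensive re-indexing can still happen later.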

As for the cache itself... well, if indexing responds to various events at
some later time, it must be keeping a record of them somewhere.

Now, I can appreciate that users create few files manually, but can delete
very large numbers - and could create very large numbers programmatically.

It would therefore make sense to separate file attributes from content as
far as indexing and searching are concerned - to a certain degree... you could
separate search results for new files based on basic OS attributes only from
searches based on the content of those files (highlight them the way new apps
get highlighted after install?)

No one can reasonably expect an arbitrarily large amount of data to be
indexed by content in an arbitrarily short space of time - but if the file
can be created/deleted/moved and not lost in the process one can reasonably
expect the Search function to know about basic attributes such as the
filename at the very least.
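
A rough Python sketch of that split, with the filename recorded at once and the content indexed later. The class `TwoTierIndex` and its methods are invented for illustration and are not a description of the actual Windows Search design.

```python
from collections import deque


class TwoTierIndex:
    """Filenames are searchable immediately; content is indexed lazily."""

    def __init__(self):
        self.names = {}         # path -> lowercased filename (cheap, instant)
        self.content = {}       # path -> set of words (expensive, deferred)
        self.pending = deque()  # paths waiting for content indexing

    def on_create(self, path):
        self.names[path] = path.rsplit("/", 1)[-1].lower()
        self.pending.append(path)

    def on_delete(self, path):
        self.names.pop(path, None)
        self.content.pop(path, None)

    def index_pending(self, read_text, limit=1):
        """Index the content of up to `limit` queued files."""
        done = 0
        while self.pending and done < limit:
            path = self.pending.popleft()
            if path in self.names:  # skip files deleted while queued
                self.content[path] = set(read_text(path).lower().split())
                done += 1

    def search(self, term):
        """Return (path, "name" | "content") pairs so the UI can tell them apart."""
        term = term.lower()
        hits = []
        for path, name in self.names.items():
            if term in name:
                hits.append((path, "name"))
            elif term in self.content.get(path, ()):
                hits.append((path, "content"))
        return hits
```

A new file shows up in `search` by name as soon as `on_create` runs; a content hit only appears after `index_pending` has got round to it, and the second element of each pair is what a UI could use to highlight name-only results.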

And with regard to indexing removable media, if the medium is r/w why not
store the index on the medium - space/bandwidth permitting. You could even
give the user the choice:... no index, index of filenames only, full index,
etc. especially since at the filename only level it surely wouldn't take too
long compared to the time to update the directory anyway - would it?
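
For what it's worth, the "index on the medium" idea could be sketched like this in Python. The file name `.search_index.json`, the three mode names and the restriction of "full" to `.txt` files are all assumptions made for the example.

```python
import json
import os

INDEX_NAME = ".search_index.json"  # hypothetical name for the on-medium index


def build_index(root, mode="names"):
    """Write an index file at the root of the medium.

    mode: "none"  - do nothing,
          "names" - filenames only,
          "full"  - filenames plus the words of .txt files.
    """
    if mode == "none":
        return None
    entries = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name == INDEX_NAME:
                continue  # do not index the index itself
            path = os.path.join(folder, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            entry = {"name": name}
            if mode == "full" and name.lower().endswith(".txt"):
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    entry["words"] = sorted(set(fh.read().lower().split()))
            entries[rel] = entry
    with open(os.path.join(root, INDEX_NAME), "w", encoding="utf-8") as fh:
        json.dump({"mode": mode, "files": entries}, fh)
    return entries
```

Paths are stored relative to the root, so the index stays valid whatever drive letter the medium gets on the next machine.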

[And BTW - since searching for shortcuts produces oddities... what does the
Save Shortcut Properties indexing filter do?]

I'd be interested to know more about how and why it works (gremlins and
their offspring excepted) ... and other users' opinions.

Julian
 
Dear Julian,

Microsoft is not an open source company, yet; thus I won't be able to
satisfy your curiosity in full. Here are some vague and hand-waving answers
though. We do listen to USN journal change notifications, as a matter of
fact we do pretty much everything you've suggested and more, except for
supporting multiple catalogs. The previous incarnation of the indexer (the
one Yellow Dog of XP told you to turn on) could do that, maybe we'll
resurrect that later, or maybe we won't.

The reason why we still do not update stuff in real time is quite simple -
you don't want us to. First thing any XP optimization guide suggests is
turning off CISVC. Indexing is an expensive hobby, since it is very heavy on
disk IO. One of our main objectives is to stay out of the way and let you
get stuff done while the indexer is up and running, so that you wouldn't
turn it off to begin with. We haven't achieved the perfect balance yet, but
don't expect the index to ever update instantaneously.

If you want to index millions of files without significant perf impact, make
sure your data and your catalog (and preferably your swap file) are on
different physical HDDs, use a multicore CPU, 2 GB of RAM or more, and throw
some ReadyBoost in for good measure. Then you can go and set values
starting with DisableBackOff under
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows Search\Gathering Manager to
various interesting numbers, and see the indexer behave quite differently.
Purely hypothetically speaking, of course...

Thanks,
Ilia

 
Hi Ilia

Thanks for the reply, [I notice it's only Ms/Mr Briefcase at MS who is
keeping their head down <g> ]... it's extremely good to get even hand-wavings
from the source and it all seemed very sensible... Yes, indexing was one of
the first things to turn off in XP and I didn't miss it.

I don't think I was as clear as I might have been re "instant" update of
indexing - I tried to acknowledge the impracticability of RT content
indexing... I meant to emphasise masking any type of result from the full
index with simple file property information from the "cache" (USN (Update
Sequence Number) Journal?), specifically to avoid issues such as "I
deleted that file but the indexer still lists it".

Or was that what you meant when you said "we do most of that...and more" -
on re-read I think perhaps you did...

[I did just create a txt file on the desktop - instantly there from Start
Search, instantly gone when I renamed it, but I have read of other users'
issues and have been puzzling over indexing's operations for a while
recently.]

Whether or not multiple indices make a return, indexability of removable
media would be a big plus.

I do have a dual core CPU and 2GB RAM, readyboost soon maybe - but don't
think the laptop will be getting a second disc, so very interesting as the
hypothesis is (thanks for the provocative thoughts!) it won't be tested for a
while :)

And finally, emphasising that I'm not being ironic or aggressively critical,
whilst it is unlikely that MS will become Open Source soon (<g> how long
before a rumour starts from a random hit for "MS" "Open Source"), easier
access to functional specifications might be an interesting idea... I bet all
the good stuff gets patented ASAP so the implementations are protected. (Not
meaning to start a long debate about that either: I can already hear it in my
head)

Thank you again... when someone from MS hears the question, the answers are
worth listening to.

(An equally forthcoming response to the more immediate Briefcase issues
would be appreciated... but the deafening silence isn't your fault!)

Julian
 