writing a search text program

  • Thread starter: Edward

Edward

Hi,
I want to write a program that searches for a given word in a large text
file (it could have any format), such as a book, and highlights the matching
words. A regular search would be too slow because the file might be several
hundred pages long. I was thinking about some sort of indexing, but I wasn't
sure. Any suggestions?
 
Edward,

It seems that you want to write a program that nobody has yet been able to
write, and you are asking here how to do that?

Cor
 
Timothy Casey said:
Far from impossible, this has been done before. Somewhere, I have a 16kb
standalone (no dependencies) executable I picked up years ago and it was
the fastest bulk search I'd seen until I discovered the speed of the Get
statement.
it could have any format
 
Edward said:
Hi,
I want to write a program that searches for a given word in a large text
file (it could have any format), such as a book, and highlights the matching
words. A regular search would be too slow because the file might be several
hundred pages long. I was thinking about some sort of indexing, but I wasn't
sure. Any suggestions?

The Indexing Service is one option, otherwise you have to make your own. See
here:

Programming with Visual Basic
http://msdn.microsoft.com/en-us/library/ms692251(VS.85).aspx

I am not sure whether the samples are for .NET or for classic VB. Here is an
ASP.NET version:

How to use an ASP.NET application to query an Indexing Service catalog by
using Visual Basic .NET
http://support.microsoft.com/kb/820105
 
Edward said:
Hi,
I want to write a program that searches for a given word in a large text
file (it could have any format), such as a book, and highlights the matching
words. A regular search would be too slow because the file might be several
hundred pages long. I was thinking about some sort of indexing, but I wasn't
sure. Any suggestions?

Far from impossible, this has been done before. Somewhere, I have a 16kb
standalone (no dependencies) executable I picked up years ago and it was the
fastest bulk search I'd seen until I discovered the speed of the Get
statement.

My world scripture collection tops 150 MB of raw ASCII text - that's 25
million words or the equivalent of 50,000 pages (given an average page size
of 500 words per page). In 1998 it took several hours for the fastest search
algorithms to scan the collection. The three-way bottleneck created when the
Windows page file shares the same disk platters with user data and with the
operating system and applications still has a profound effect on speed, but
faster CPUs, the SATA II protocol, and the introduction of the Honeywell
memory access system ("DDR") to the general public have sped things up a
great deal.

In any case, Windows search has always been one of the fastest, and now
incorporates an index generated by the Search service. I presume that if
.NET offers access to Windows Search, this would save you the trouble of
writing your own index - although I'd suggest the Search service needs to be
house-trained so that, like the disc defragmenter, it only runs when asked
or as part of a user initiated maintenance procedure.

This leaves us with the question of how to invoke the Windows Search API
(the one that utilises the Windows search index, and preferably in a .NET
namespace), to return file locations and file access points (byte number for
the start of the search string). As these come in, your program could
assemble context statements with the search string highlighted within.
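
A minimal Python sketch of that last step, assembling context statements
with the search string highlighted. It does not call the Windows Search API;
`find_offsets`, `context_snippet`, and the `**` highlight marker are
hypothetical names chosen for illustration, and the offsets are character
positions, which equal byte positions only for plain ASCII text.

```python
import re


def find_offsets(text, word):
    """Return the start offset of every whole-word, case-insensitive match."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]


def context_snippet(text, start, length, width=30, mark="**"):
    """Return the match with up to `width` characters of context on each
    side, wrapping the matched word in `mark` to highlight it."""
    left = max(0, start - width)
    right = min(len(text), start + length + width)
    return (text[left:start]
            + mark + text[start:start + length] + mark
            + text[start + length:right])


# Made-up sample text standing in for the contents of a large file.
sample = "In the beginning was the Word, and the Word was with God."
offsets = find_offsets(sample, "word")
snippets = [context_snippet(sample, o, len("word"), width=10) for o in offsets]

for offset, snippet in zip(offsets, snippets):
    print(offset, snippet)
```

In a real program the offsets would come from whatever search or index
layer is used, and the marker would be replaced by the highlighting the user
interface supports.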

Good luck...
 
Timothy,

Your reply made me think of this; you will often see it in my replies.

http://en.wikipedia.org/wiki/Dunglish

Be aware that all the people spoken about there are probably fluent in at
least English, French and German besides Dutch.

Cor
 
Thanks for all the great responses. There are several websites that offer
different types of word search, for example BibleGateway.com, where you type
a word and it almost immediately brings up all the verses that include that
word as hyperlinks. I'm trying to write this type of application, which is
certainly doable, but I don't know what the best approach would be to achieve
the best speed.
 
Edward said:
Thanks for all the great responses. There are several websites that offer
different types of word search, for example BibleGateway.com, where you type
a word and it almost immediately brings up all the verses that include that
word as hyperlinks. I'm trying to write this type of application, which is
certainly doable, but I don't know what the best approach would be to achieve
the best speed.

Start here (copy and paste the URL):

http://en.wikipedia.org/wiki/Index_(search_engine)
 
Cor Ligthert said:
it could have any format

You make a good point, Cor. Sometimes key text is buried in images,
non-standard binaries, internal file compression, and encryption. This can
be really frustrating. However, your point alludes to the most interesting
part of the question.

The beginning for pulling text out of these other formats in a generic way
falls to Natural Language Processing or NLP because language has a
mathematical signature that corresponds to the myriad of rules that apply to
spelling and grammar. In spite of all the spectacular claims, no-one has
NLP - not yet. The foundation of NLP is contextualisation, which has been
the focus of languages such as XML. However, as the folks at Brown
University soon discovered, there are also issues of core structure versus
extensible features of language that vary from node type to node type in
the structural hierarchy of communication. Did I mention that language is
not compatible with well-formed hierarchies, due to the frequency of two-way
ambiguity in word meanings (and often function)? Thus context is drawn from
structure, which itself could be any one of a number of possibilities that
cannot always be resolved from structure. Consider the meaning of the word
"green" in the following examples:

1. The green recruit
2. The green passenger
3. The green corporation
4. The green thumb

In each case the meaning of green depends on the definition of the
applicable noun.

Nobody's clear on a system, and when you compare the effectiveness of .NET
as a language unto itself, it emerges that there may well be some errors in
the conventional academic perception of linguistic structure. Linguists hold
the verb, for example, as an equal classification to the noun when
considering parts of speech - but in the Microsoft class system, a verb is
merely a sub-part of the noun. The Microsoft system works very well, so
perhaps the engineering proves they got something right in this
department...?

In any case, we have a long way to go, even if the data and analyses being
accumulated are fascinating.
 
Edward said:
Thanks for all the great responses. There are several websites that offer
different types of word search, for example BibleGateway.com, where you type
a word and it almost immediately brings up all the verses that include that
word as hyperlinks. I'm trying to write this type of application, which is
certainly doable, but I don't know what the best approach would be to achieve
the best speed.

When you see an app like the one you describe, remember that the search has
actually taken place before the user asks. Indexes created off-line are
already present, and they let the app present the data quickly. Remember
this: index once, and benefit from the speed for the rest of time. If the
data is dynamic (not an already written text) then you have to weigh the
indexing time against the search time. Never an easy thing to do.
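
The "index once" idea can be sketched in Python as follows. This is an
illustrative in-memory inverted index, not how any particular site
implements it; `build_index`, `lookup`, and the sample text are assumptions
made for the example.

```python
import re
from collections import defaultdict

# A "word" here is simply a run of letters, digits, or apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")


def build_index(text):
    """Scan the text once and map each lower-cased word to its start offsets."""
    index = defaultdict(list)
    for match in WORD.finditer(text):
        index[match.group().lower()].append(match.start())
    return dict(index)


def lookup(index, word):
    """Answer a query with a dictionary lookup instead of rescanning the text."""
    return index.get(word.lower(), [])


book = "The quick brown fox. The lazy dog. A quick reply."
index = build_index(book)       # done once, off-line
hits = lookup(index, "Quick")   # done for every query
print(hits)
```

For a real book the index would be built once and saved to disk, so that
each later query costs only a lookup. If the text changes, the index has to
be rebuilt or updated, which is the trade-off mentioned above.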

LS
 
All that is quite correct, but it's not relevant to the problem, and the
task as stated is definitely, as you state, far from impossible. Although
the OP used the term "any format", he also used the terms "large text file"
and "given word". So he is not considering language information represented
in anything other than plain text, and an indexer does not need to
comprehend the file in order to find a match between a 'given word' and some
portion of a text file. The respondent is choosing to interpret the question
in a way that enables him to avoid addressing the real issue.
 
Timothy said:
Which brings us back to Windows Search and the attached index provided
by the Search Service. Two questions: does anyone know

1. The .NET Namespace necessary to tap into Windows Search?

In general, this requires using the Content Indexing Service and COM
API. Let's see what the .NET wrappers are......

There is a bunch of stuff. I found this example:

How to use an ASP.NET application to query an Indexing
Service catalog by using Visual Basic .NET

http://support.microsoft.com/kb/820105

2. The range of house-training options available to the Windows Registry
for the search service?

I too would like to know.

Not sure what that means. Don't you ideally want to eliminate any direct
Windows Registry usage?

--
 
Off topic: Please check your computer date and time zone; your posts seem to
be ahead by a few hours...
 
Nobody said:
Off topic: Please check your computer date and time zone; your posts seem
to be ahead by a few hours...

Sorry, the message was meant for Timothy Casey.

Thank you
 
James Hahn said:
All that is quite correct, but it's not relevant to the problem, and the
task as stated is definitely, as you state, far from impossible.
Although OP used the term "any format" he also used the terms "large text
file" and "given word". So he is not considering language information
represented in anything other than plain text, and an indexer does not
need to comprehend the file in order to find a match between a 'given
word' and some portion of a text file. The respondent is choosing to
interpret the question in a way that enables him to avoid addressing the
real issue.

Which brings us back to Windows Search and the attached index provided by
the Search Service. Two questions: does anyone know

1. The .NET Namespace necessary to tap into Windows Search?
2. The range of house-training options available to the Windows Registry for
the search service?

I too would like to know.

Thanks in Advance...
 
Timothy said:
The Search service comes on often when it's not wanted. Plug in a USB
(thumb) drive for a 27 second backup and the system tells you that the
drive is in use for the next 30 minutes because the Search service has
decided to index the drive for the Nth time.

I have noticed odd-ball locks during some development work where I was
creating folders, sub-trees, and files, and had a batch file that first
deleted them before each retest. Sometimes the delete failed, saying the
folder was in use, and/or there were delays. My solution was to turn off
indexing either at the drive or at the test root folder. So if the created
folder was C:\ROOT\XYZ, I turned off indexing for C:\ROOT.

Hmmmm, now that I think about it, I might have used a test root folder on a
USB drive! I don't recall whether I noticed that aspect of it while seeking
a solution. I thought I was having drive problems at first. But it turned
out that turning off indexing for disk areas where you are rapidly creating
and deleting files solved it. It makes sense too. That's a lot of CJ (Change
Journal) notifications, caching and buffering going on.

Try turning off indexing for the USB device drive via the My Computer drive
properties.

--
 
Timothy said:
How would this be applied to a desktop application in VB2005?
Also, is there a way to get the program to initiate the building of the
catalogue without the user having to know that...?

Hi Timothy,

First, a small side note. Check your system date, or the mail client you are
using, so that the localized date (or Zulu/GMT, or whatever you use) is set
properly. Your mail is showing up post-dated, which skews the threading and
date sort order of incoming mail. It is very annoying, but I also recommend
fixing it because some AVS filtering systems look for incorrectly dated or
post-dated mail as a marker of spammers. Just a side note.

I didn't expect the complete example to be useful in itself; the point was
rather to show how to access the indexing COM API.

--
 
Mike said:
In general, this requires using the Content Indexing Service and COM API.
Let's see what the .NET wrappers are......

There is a bunch of stuff. I found this example:

How to use an ASP.NET application to query an Indexing
Service catalog by using Visual Basic .NET

http://support.microsoft.com/kb/820105

Thanks

Not sure what that means. Don't you ideally want to eliminate any direct
Windows Registry usage?

The Search service often starts up when it's not wanted. Plug in a USB
(thumb) drive for a 27 second backup and the system tells you that the drive
is in use for the next 30 minutes because the Search service has decided to
index the drive for the Nth time. This is not good when you are in a hurry,
have somewhere else to go, and were not planning to wait 30 minutes for
Windows to release the thumb drive. Other bad habits include hogging
resources needed on demand by other program launches (which sometimes leads
to a freeze). A search program with an easy means of regulating indexing and
giving the user more control would ultimately be a better product. If this
can be done through a namespace I'm all ears - otherwise it falls to
registry settings, does it not?
 
Mike said:
In general, this requires using the Content Indexing Service and COM API.
Let's see what the .NET wrappers are......

There is a bunch of stuff. I found this example:

How to use an ASP.NET application to query an Indexing
Service catalog by using Visual Basic .NET

http://support.microsoft.com/kb/820105

How would this be applied to a desktop application in VB2005?
Also, is there a way to get the program to initiate the building of the
catalogue without the user having to know that...?
 