Strange Idea

  • Thread starter: Peace

Peace

I have been accumulating newsgroup messages from here for about 9 months
now, almost since I started getting into .NET. With the new year here, I
decided to do some organizing.

The idea occurred to me to write an application that stores the title,
author, and contents of each message in a SQL database, which can then be
searched by author and by subject matter, either in the subject line or in
the body of the message. Then of course, because my SQL Server is public, I
would share the application with my many loyal friends on this ng and we
would all rejoice. (LOL)

Well, maybe not :). Most of you guys who taught me are probably too far
advanced to need access to something like that, but I have found going back
and searching old messages to be extremely helpful. If anyone is interested,
let me know. I will be happy to share the app when it is complete.

If nothing else it will be a good exercise for me.


The database, datasets, and such are set and I have done practice runs with
single messages to see how it would work and it worked okay.
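
As a rough illustration of the storage and search idea described above, here
is a minimal sketch in Python. It uses the built-in sqlite3 module as a
stand-in for SQL Server, and the table name `messages`, its columns, and the
helper functions are all assumptions made for the example, not the actual
schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a message store. SQLite stands in for SQL Server here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY, author TEXT, subject TEXT, body TEXT)"
    )
    return conn


def add_message(conn, author, subject, body):
    """Store one message; parameters are bound, never pasted into the SQL."""
    conn.execute(
        "INSERT INTO messages (author, subject, body) VALUES (?, ?, ?)",
        (author, subject, body),
    )
    conn.commit()


def search(conn, text="", author=""):
    """Find messages by author and by text in the subject line or the body."""
    author_pattern = f"%{author}%"
    text_pattern = f"%{text}%"
    sql = (
        "SELECT author, subject FROM messages "
        "WHERE author LIKE ? AND (subject LIKE ? OR body LIKE ?) "
        "ORDER BY id"
    )
    return conn.execute(sql, (author_pattern, text_pattern, text_pattern)).fetchall()
```

An empty `text` or `author` matches everything, so the same function covers
"search by author", "search by subject matter", or both at once.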


As it turned out, the easy part was writing the content to the database in
individual runs. The hard part was what you would think would be the
easiest: looping through the folder and reading the files.

The first thing I need to do is write a routine that loops through all the
.nws files in the folder where I have stored these messages and gets the
contents of each file. But with .nws files it is not behaving as expected. I
know how to do this with text files. What do I need to change for .nws files
in order to loop through the folder and read each file?

(By the way, I am using Outlook Express as my newsreader.)
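
For reference, the loop being asked about might look roughly like the sketch
below. It is written in Python rather than .NET, and it rests on an
assumption that is not confirmed anywhere in this thread: that each .nws
file holds a single message in the standard plain-text Internet message
layout (headers, a blank line, then the body). If the files turn out to be
in some other format, this would not work as written. The function name
`read_nws_folder` is made up for the example.

```python
from email import message_from_bytes, policy
from pathlib import Path


def read_nws_folder(folder):
    """Yield (author, subject, body) for every .nws file in the folder.

    Assumes each file is one plain-text message: headers, blank line, body.
    """
    for path in sorted(Path(folder).glob("*.nws")):
        # Read bytes, not text, so the parser can handle the declared charset.
        msg = message_from_bytes(path.read_bytes(), policy=policy.default)
        # Prefer the plain-text part if the message has several parts.
        part = msg.get_body(preferencelist=("plain",))
        body = part.get_content() if part is not None else ""
        yield str(msg["From"] or ""), str(msg["Subject"] or ""), body
```

Each tuple it yields maps directly onto the author, title, and contents
that the application is meant to store.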
 
Peace said:
I have been accumulating newsgroup messages from here for about 9 months
now, almost since I started getting into .NET. With the new year here, I
decided to do some organizing.

Hi,
Were you aware that Google has been doing this too (and for some years)?

The first thing I need to do is write a routine that loops through all the
.nws files in the folder where I have stored these messages and gets the
contents of each file. But with .nws files it is not behaving as expected. I
know how to do this with text files. What do I need to change for .nws files
in order to loop through the folder and read each file?

The .nws format is some sort of binary format and apparently MS doesn't
publish the format. You have a few choices. Outlook Express can save the
messages as text files (it's an option), but the header information is lost
that way. There is also a utility at
http://www.oehelp.com/DBXtract/Default.aspx that might be of help. It
sounds like the guy figured out the .nws format and his program will convert
the files to text.

Now, what would really be useful, rather than duplicating what Google is
doing (and doing well), would be to distill the data into "information."
There is way too much noise and far too little signal. This happens when
the same questions are asked 300 times, when entire messages are quoted, and
when lengthy arguments about meaningless topics take place. Also note that
Google (and others) store everything, which means every wrong answer is
available for search too. You have to believe that from time to time
somebody finds the wrong answer (but not the correction) and goes off trying
that :-)

You should also add a "concept" search. So for instance if there was a
bunch of source code in the message but the words "source code" didn't
appear in the message it would still be found when somebody typed in "vb.net
source code" as a search criteria. You'll have a lot of work to do...
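
One very rough way to approximate that kind of "concept" tagging is
sketched below in Python. The idea is to attach the tag "source code" to a
message when enough of its lines look like VB.NET statements, so a search
for the tag finds the message even though the phrase never appears in it.
The patterns, the threshold, and the function name `concept_tags` are all
assumptions chosen for illustration; a real filter would need far more
care.

```python
import re

# Hypothetical patterns: VB.NET constructs that rarely start a line of prose.
# Matching is case-sensitive on purpose, to avoid tagging ordinary sentences.
_CODE_LINE = re.compile(
    r"^\s*(?:"
    r"Dim\s+\w+\s+As\b"
    r"|(?:Public|Private|Friend)\s+(?:Sub|Function|Class)\b"
    r"|End\s+(?:Sub|Function|If|Class)\b"
    r"|Imports\s+\w"
    r")"
)


def concept_tags(body, min_code_lines=2):
    """Return a set of concept tags inferred from the message body."""
    code_lines = sum(1 for line in body.splitlines() if _CODE_LINE.match(line))
    tags = set()
    if code_lines >= min_code_lines:
        tags.add("source code")
    return tags
```

The tags would be stored alongside each message and searched together with
the subject and body.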

Tom
 
Hi Peace,

I read Tom's notes and he's certainly correct about a lot of things, but,
hey, I've been doing what you've been doing for a bit more than a year now!
It would be nice to use your device, even if a similar mechanism exists on
Google; besides, it represents a great exercise for you.

Go for it!

Bernie Yaeger
 
I would be interested in seeing what you have done or helping out if you
need it (which you probably don't now). I would certainly be interested in
your database if nothing else.

Regards - OHM
 
Hi,
Were you aware that Google has been doing this too (and for some years)?

Right, and I have not been all that happy with their engine, pretty much for
the reasons you give below.

There is also a utility at
http://www.oehelp.com/DBXtract/Default.aspx that might be of help. It
sounds like the guy figured out the .nws format and his program will convert
the files to text.

Perfect. Thank you for the link.

Now, what would really be useful, rather than duplicating what Google is
doing (and doing well), would be to distill the data into "information."
There is way too much noise and far too little signal. This happens when
the same questions are asked 300 times, when entire messages are quoted, and
when lengthy arguments about meaningless topics take place. Also note that
Google (and others) store everything, which means every wrong answer is
available for search too. You have to believe that from time to time
somebody finds the wrong answer (but not the correction) and goes off trying
that :-)

You should also add a "concept" search. So for instance if there was a
bunch of source code in the message but the words "source code" didn't
appear in the message it would still be found when somebody typed in "vb.net
source code" as a search criteria. You'll have a lot of work to do...

Yes, I do have a lot of work. What got me started on this was Mr.
IAmIronMan's assault on the group, when I realized his messages were being
given as much credit as legit topics.

Thank you for your thoughts though. Developing the filter will be the
toughest part. I was thinking about using what I have seen Herfired quote
(I believe it to be some sort of etiquette rules) as a "filter" for the
group; messages that violate those rules would be removed from the database.
Of course, anything marked OT would automatically be dumped.
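
The "anything marked OT" rule is the easy part of such a filter. A minimal
sketch in Python follows; it assumes that off-topic threads carry "OT" or
"[OT]" at the start of the subject line, possibly after one or more "Re:"
prefixes. That convention and the function name `is_off_topic` are
assumptions for the example, and the etiquette-based rules are not
attempted here.

```python
import re

# One or more leading "Re:" prefixes, in any letter case.
_REPLY_PREFIX = re.compile(r"^\s*(?:re:\s*)+", re.IGNORECASE)
# "OT" or "[OT]" at the start, not followed by a letter or digit,
# so that words such as "Other" or "Otto" are not treated as markers.
_OT_MARK = re.compile(r"^\[?OT\]?(?![A-Za-z0-9])", re.IGNORECASE)


def is_off_topic(subject):
    """Return True if the subject line is marked as off-topic."""
    stripped = _REPLY_PREFIX.sub("", subject).lstrip()
    return bool(_OT_MARK.match(stripped))
```

Messages for which this returns True would simply be skipped before they
are written to the database.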
 
Peace said:
Thank you for your thoughts though. Developing the filter will be the
toughest part. I was thinking about using what I have seen Herfired quote
(I believe it to be some sort of etiquette rules) as a "filter" for the
group; messages that violate those rules would be removed from the database.
Of course, anything marked OT would automatically be dumped.

Google has an API which might be interesting for you to look at:
http://www.google.com/apis/

And, just a thought but I wouldn't arbitrarily dismiss messages that didn't
meet some etiquette rule. Not to belabor the point but consider that there
can be two answers posted to a question. One is somewhat rude and abrasive
but contains 25 lines of code illustrating the solution, the other is a link
to a web page which may or may not be available today. Few people would
trade the answer for a dead link.

That's why I often use the terms "information" and "data." There is a lot
of data in the world, it's produced all the time but information
(particularly useful information) is often hard to come by.

Good luck,
Tom
 