find similar content

  • Thread starter: Tem

Tem

I'm writing a forum app. I would like to know
what are some ways to find posts with similar content to a selected post
(matching certain important keywords, but not all words).

What is the easiest way to do this? And
what is the most accurate way to do this?

Can someone point me in the right direction?

Thank you.

Tem
 
You need to clearly define what "similar" means to you. If you're using SQL
Server as a back end, its full-text search functionality might help you meet
your requirements.
 
I'm not sure how to define it programmatically, but you see it in forums and
blogs all the time, under names such as "related topics", "similar posts",
and "suggested reading".
 
A lot of the blog functionality for "related topics" is based on the tags
assigned to it by the blogger at publication time. You could have writers
assign tags to posts as they are created, but for a forum app you would
probably be better off parsing the content being posted, extracting key
words from it, and storing that in a database. Then you could calculate a
"relatedness" score for pairs of posts based on the number of keywords they share.
There are a few different methods that come to mind, but the starting point
is clearly defining the problem. For instance, if you expect a high volume
of posts you might want to precalculate the similarity scores for posts and
limit the number of "hits". For smaller volumes it might make more sense to
do it on the fly. It all depends on a lot of factors...
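
To make the keyword idea concrete, here is a minimal Python sketch of scoring posts by shared keywords. The stop-word list, the function names, and the "count of shared keywords" score are all assumptions for illustration, not a definitive method; a real forum would need a longer stop-word list and probably some weighting.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it",
              "for", "on", "with", "my", "i", "how", "do"}

def extract_keywords(text):
    """Return the set of lowercase words that are not stop words or very short."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOP_WORDS and len(w) > 2}

def similarity(keywords_a, keywords_b):
    """Score two posts by the number of keywords they share."""
    return len(keywords_a & keywords_b)

def most_similar(selected_id, posts, limit=5):
    """Rank the other posts by shared-keyword count, dropping zero scores."""
    keywords = {pid: extract_keywords(body) for pid, body in posts.items()}
    scored = [(similarity(keywords[selected_id], kw), pid)
              for pid, kw in keywords.items() if pid != selected_id]
    scored = [(score, pid) for score, pid in scored if score > 0]
    # Highest score first; ties broken by post id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [pid for _, pid in scored[:limit]]
```

Here `posts` is assumed to be a dict mapping a post id to its body text. This is the "on the fly" variant, which re-extracts keywords on every call, so it only suits small volumes.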
 
Yeah I talked to some coworkers about that contest a while back, and we
ended up with all kinds of interesting ideas :) Many of them involved trying
to figure out what types of entertainment people might be interested in (in
the aggregate) based on patterns in freely available information. The
discussion took us from analyzing the stock market to automatically scanning
newspaper articles and political blogs to glean information to predict the
future :)

For what this poster wants though it might be a bit easier to narrow down
the solution, assuming his forum app will be used to set up forums that
host a slightly narrower range of topics than are presented by the
entertainment industry as a whole :) For what the OP wants, assuming he can
come up with specifics, keyword scanning and analysis of his posts would
probably do the trick. Depending on the platform he's using as a back end
he might be able to take advantage of specific functionality also, but again
it depends on his specifics.
 
I use SQL Server 2005 and ASP.NET.
How can I utilize the full-text search feature to do what I need?
I'm not sure how to formulate the queries.
 
FTS queries are performed based on keywords, so the first thing you would
need to do to go down the FTS route is set up a method of extracting
relevant keywords from posts. If you want to go with FTS, you're basically
looking to automate FTS query formulation. So the first step is to grab all
the relevant keywords out of your posts; then you need to create a full-text
index in the database. After that, creating FTS queries is pretty simple.
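With the CONTAINS predicate or CONTAINSTABLE function, for instance, you can
use a search condition like the one below. (FREETEXT and FREETEXTTABLE take a
plain string of words instead and do not treat OR as an operator.)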

keyword1 OR keyword2 OR keyword3 OR keyword4 ...
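
As a rough sketch of automating that query formulation, the Python below joins a post's stored keywords into an OR-separated search condition. The OR syntax is the form that CONTAINS and CONTAINSTABLE interpret; the function name and the `Posts`/`Body` table and column names are hypothetical.

```python
def build_search_condition(keywords):
    """Join keywords into an OR-separated full-text search condition."""
    cleaned = []
    for kw in keywords:
        # Drop embedded double quotes so each term can be safely quoted.
        kw = kw.replace('"', "").strip()
        if kw:
            cleaned.append('"' + kw + '"')
    return " OR ".join(cleaned)

# Hypothetical table and column names. Pass the condition as a query
# parameter rather than concatenating it into the SQL text.
QUERY = "SELECT PostId FROM Posts WHERE CONTAINS(Body, @condition)"
```

For example, `build_search_condition(["index", "sql"])` gives `"index" OR "sql"`, which would be bound to `@condition`.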

The hardest part of the whole exercise would be the initial step of
stripping out relevant keywords, which FTS cannot do for you. And there are
some design decisions you'll have to make - for instance, do you want to
store all keywords found in posts in the database or do you want to grab
them dynamically every time the post is displayed? Do you want to determine
which posts are related in a scheduled batch process and store those
results, or will this be performed dynamically at post viewing time also?
This will probably be determined by your requirements--i.e., what's more
important to you: speed, storage efficiency, dynamism/currency of results?

Personally I'd probably go with a method other than FTS for this particular
task, but only because that's not how I normally use it. I'm definitely
interested to find out how well FTS works for you on this.
 
I would say speed is most important to me.

I will let you know how well it works.

Thanks
Tem
 
Speed being the most important, I'd recommend pre-parsing your keywords out
of the posts and storing them in the database. Then you can extract the
keywords from the database and create FREETEXT predicates using the
keywords. For a bigger speed improvement, you could run the FTS queries in a
regularly scheduled batch and store the results in the database.
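
A rough sketch of the batch idea, with two loud assumptions: it uses Python's built-in sqlite3 as a stand-in for SQL Server, and it swaps the FTS query for a plain self-join on a stored keyword table, scoring by shared-keyword count. The table and function names are hypothetical.

```python
import sqlite3

def rebuild_related_posts(conn):
    """Batch step: recompute the related-posts table from stored keywords."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM related_posts")
        conn.execute(
            """
            INSERT INTO related_posts (post_id, related_id, score)
            SELECT a.post_id, b.post_id, COUNT(*)
            FROM post_keywords a
            JOIN post_keywords b
              ON a.keyword = b.keyword AND a.post_id <> b.post_id
            GROUP BY a.post_id, b.post_id
            """
        )

def related_posts(conn, post_id, limit=5):
    """View-time step: a cheap lookup against the precomputed table."""
    rows = conn.execute(
        "SELECT related_id FROM related_posts WHERE post_id = ? "
        "ORDER BY score DESC, related_id LIMIT ?",
        (post_id, limit),
    )
    return [row[0] for row in rows]

def demo_connection(keyword_rows):
    """Create an in-memory database holding (post_id, keyword) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE post_keywords ("
        "post_id INTEGER, keyword TEXT, PRIMARY KEY (post_id, keyword))"
    )
    conn.execute(
        "CREATE TABLE related_posts ("
        "post_id INTEGER, related_id INTEGER, score INTEGER)"
    )
    conn.executemany("INSERT INTO post_keywords VALUES (?, ?)", keyword_rows)
    return conn
```

The scheduled job would call `rebuild_related_posts`, and the page that displays a post would only call `related_posts`, so the expensive comparison never runs at viewing time. The trade-off is that results are only as current as the last batch run.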
 