Steve B. said:
Your solution seems quite effective, but there is a scenario where it
cannot apply: it will probably work well for domain filtering, but if I
want to exclude a specific page or sub-pages, it won't work.
Let's imagine, for example, a web hosting site:
http://www.anygoodprovider.com/asiteofmalicioususer/spywareeldoradodownloads.htm
In this case, the domain is OK, but the page is not.
Yes, but you get a hit on the domain, so you proceed to the page. If you
make the domain matching efficient, you make 95% of your matching efficient.
I'm thinking about an "IUrlFilter":

    interface IUrlFilter
    {
        bool TestUrl(string urlToTest);
    }

This interface will be implemented twice:

    class HostFilter : IUrlFilter
    class PageFilter : IUrlFilter
I disagree. I wouldn't use your mechanism for either method.
I first test against the hostname (which will probably cover 95% of the
tests), and if the host is OK, I also test against the page (with a much
lower number of tests to execute) using the regex method.
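A minimal sketch of how those two checks might be chained, building on the
IUrlFilter interface above (the CompositeUrlFilter name is only
illustrative, and both filters are assumed to return true when a URL
passes):

    class CompositeUrlFilter : IUrlFilter
    {
        private readonly IUrlFilter _hostFilter;  // cheap test, decides ~95% of URLs
        private readonly IUrlFilter _pageFilter;  // regex test, only runs if the host passes

        public CompositeUrlFilter(IUrlFilter hostFilter, IUrlFilter pageFilter)
        {
            _hostFilter = hostFilter;
            _pageFilter = pageFilter;
        }

        public bool TestUrl(string urlToTest)
        {
            // Short-circuit: a rejected host never reaches the page regexes.
            return _hostFilter.TestUrl(urlToTest) && _pageFilter.TestUrl(urlToTest);
        }
    }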
Regex can be used not only to match, but also to parse. Use it to parse,
but then match using a data structure.
Use regex to return a list of terms separated by slashes. Create a tree
structure of all of the offending pages by URL. I'm willing to bet that 80%
of the hits against an offending page will be consumed by less than 5% of
the root elements and less than 30% of the first level elements. In other
words, virus sites will have many virus pages, and hosting providers that
don't care about virus hosting will have many many virus sites.
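A sketch of what one node of that tree could look like in C# (every name
here is illustrative; the point is simply one node per path segment, with a
flag meaning "everything at or below here is blocked"):

    using System;
    using System.Collections.Generic;

    class UrlNode
    {
        // One child per path segment (subdomain, directory, or page name).
        public Dictionary<string, UrlNode> Children =
            new Dictionary<string, UrlNode>(StringComparer.OrdinalIgnoreCase);

        // True when this node (and everything below it) is on the block list.
        public bool Blocked;

        // Add one blocked URL, already split into segments by the parsing regex.
        public void Add(string[] segments, int index)
        {
            if (index == segments.Length)
            {
                Blocked = true;
                return;
            }

            UrlNode child;
            if (!Children.TryGetValue(segments[index], out child))
            {
                child = new UrlNode();
                Children.Add(segments[index], child);
            }
            child.Add(segments, index + 1);
        }
    }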
Allow for aliases by building your tree so that references from multiple
roots can point to a single subtree (there is a short sketch of this after
the example below). This covers the fact that DNS systems routinely allow
for nearly infinite aliases.
hostingprovider.com
    subdomain: abc, jbx, sand, zx*
        dir: foo
        dir: bar
            page: viruspage1.html
            page: viruspage2.html
            page: viruspage3.html
        dir: fudge
            page: viruspage4.html
hostalias.com -- reference to hostingprovider.com
127.198.41.77 -- reference to hostingprovider.com
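In C#, sharing a subtree between aliases just means letting more than one
root key point at the same UrlNode instance from the sketch above (the
BlockList class and its members are again illustrative names):

    using System;
    using System.Collections.Generic;

    class BlockList
    {
        // Several host names (aliases, raw IP addresses) may map to the SAME
        // UrlNode instance, so an alias costs one dictionary entry, not a
        // copied subtree.
        private readonly Dictionary<string, UrlNode> _roots =
            new Dictionary<string, UrlNode>(StringComparer.OrdinalIgnoreCase);

        public UrlNode GetOrAddRoot(string host)
        {
            UrlNode root;
            if (!_roots.TryGetValue(host, out root))
            {
                root = new UrlNode();
                _roots.Add(host, root);
            }
            return root;
        }

        public UrlNode GetRoot(string host)
        {
            UrlNode root;
            return _roots.TryGetValue(host, out root) ? root : null;
        }

        public void AddAlias(string alias, string existingHost)
        {
            _roots[alias] = _roots[existingHost];  // share the subtree, don't copy it
        }
    }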
One regex can parse your entire URL into parts: domain, subdomain, a list
of directories, and a page reference, followed by a list of parameters.
So you can match once against every URL, take the parts list and run it
through the tree you create. Simple matching will catch every hit. And it
scales well.
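Something along these lines would pull the parts out in one pass (this
particular pattern is a rough illustration, not a hardened URL parser):

    using System;
    using System.Text.RegularExpressions;

    class UrlSplitter
    {
        // One regex: optional scheme, host, path, and query string.
        static readonly Regex UrlPattern = new Regex(
            @"^(?:https?://)?(?<host>[^/?#]+)(?<path>/[^?#]*)?(?:\?(?<query>.*))?",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);

        static void Main()
        {
            string url = "http://www.anygoodprovider.com/asiteofmalicioususer/spywareeldoradodownloads.htm";
            Match m = UrlPattern.Match(url);

            string host = m.Groups["host"].Value;       // www.anygoodprovider.com
            string[] segments = m.Groups["path"].Value  // directories, then the page
                .Split(new char[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
            string query = m.Groups["query"].Value;     // parameters, if any

            Console.WriteLine(host);
            foreach (string segment in segments)
                Console.WriteLine(segment);
            Console.WriteLine(query);
        }
    }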
Let's say that you have 10,000 URLs in your list. Let's say that 80% of
them have the same 100 domains. That leaves 20% with a smattering of other
domains. I'll swag that down to about 1000 domains. Let's say that you
create a tree structure like above, but the root is a sorted list of these
1000 domains. Let's say you use my method and a binary search against the
domain name. You can find the domain hit in 10 comparisons or fewer. Then,
if your directory and page lists follow the same pattern, you are likely to
find your match or clear the URL in fewer than 10 more comparisons. That's
20 comparisons. Add another 10,000 URLs to the list, and the number of
comparisons only goes up... to 21.
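That first step really is just a sort and a binary search; a sketch,
assuming the blocked domains fit in memory (DomainIndex is an illustrative
name):

    using System;
    using System.Collections.Generic;

    class DomainIndex
    {
        private readonly List<string> _domains;

        public DomainIndex(IEnumerable<string> blockedDomains)
        {
            _domains = new List<string>(blockedDomains);
            _domains.Sort(StringComparer.OrdinalIgnoreCase);  // sort once, up front
        }

        public bool Contains(string host)
        {
            // Binary search: roughly 10 comparisons for 1,000 domains and only
            // about 14 for 10,000 -- it grows with log2(n), not n.
            return _domains.BinarySearch(host, StringComparer.OrdinalIgnoreCase) >= 0;
        }
    }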
With your method, the first scenario requires 10,000 matches. The second
requires 20,000 matches. It scales EXCEEDINGLY badly.
Please, Steve, use a data structure for this mechanism. The fundamental
concept is a Suffix tree, which is a specialized form of a Trie. Note that
you could use a pure Trie structure, especially a Patricia Trie structure,
if you are clever and want REALLY fast performance, but that may require
some programming chops. You can find a good writeup on Trie structures on
Wikipedia: http://en.wikipedia.org/wiki/Trie
You could also use a Directed Acyclic Word Graph, which will give you some
very fast performance as well.
To be honest, I may have described a compact DAWG above.
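Putting the pieces together, the per-URL test is just a short walk down
such a trie; a sketch, reusing the illustrative UrlNode and BlockList from
earlier:

    static class TrieMatcher
    {
        // Parse the URL once with the regex, then follow child links segment
        // by segment until we hit a blocked node or fall off the tree.
        public static bool IsBlocked(BlockList list, string host, string[] segments)
        {
            UrlNode current = list.GetRoot(host);
            if (current == null)
                return false;                          // unknown host: let it through

            foreach (string segment in segments)
            {
                if (current.Blocked)
                    return true;                       // a whole host/directory is listed
                if (!current.Children.TryGetValue(segment, out current))
                    return false;                      // fell off the tree: no match
            }
            return current.Blocked;                    // exact page (or deeper node) hit
        }
    }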
Some other links:
http://www.nist.gov/dads/HTML/suffixtree.html
http://www.nist.gov/dads/HTML/directedAcyclicWordGraph.html
--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik
Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.