C# Crawler and performance (speed of crawling)


Benjamin Lefevre

I am currently developing a web crawler, mainly crawling mobile pages (WML,
mobile XHTML) but not only those (also HTML/XML/...), and I wonder what
speed I can reach.
The crawler is developed in C# using multithreading and HttpWebRequest.
At the moment it can download and crawl pages at a rate of around
5 pages per second. It's running on a development machine with 512 MB of
RAM and a shared 2 Mbit ADSL connection. Is that ridiculously slow? What
speed could I expect if I improved my code, and how?
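
In case it helps the discussion: one thing that silently caps HttpWebRequest throughput is that .NET limits each host to 2 concurrent connections by default (the HTTP/1.1 recommended limit), configurable via ServicePointManager.DefaultConnectionLimit. Below is a minimal, hypothetical sketch of a multithreaded fetch loop of the kind described above — the names (WorkerCount, Worker) and structure are illustrative, not the original poster's code, and it needs network access to actually run:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading;

class CrawlerSketch
{
    const int WorkerCount = 10;                       // illustrative thread count
    static readonly Queue<string> Urls = new Queue<string>();

    static void Main()
    {
        // Without this, only 2 simultaneous connections per host are allowed,
        // which throttles a crawler badly.
        ServicePointManager.DefaultConnectionLimit = WorkerCount;

        Urls.Enqueue("http://example.com/");          // seed URL (placeholder)

        var threads = new List<Thread>();
        for (int i = 0; i < WorkerCount; i++)
        {
            var t = new Thread(Worker);
            t.Start();
            threads.Add(t);
        }
        threads.ForEach(t => t.Join());
    }

    static void Worker()
    {
        while (true)
        {
            string url;
            lock (Urls)                               // simple shared work queue
            {
                if (Urls.Count == 0) return;
                url = Urls.Dequeue();
            }
            try
            {
                var req = (HttpWebRequest)WebRequest.Create(url);
                req.Timeout = 10000;                  // don't let slow hosts stall a worker
                using (var resp = (HttpWebResponse)req.GetResponse())
                using (var reader = new StreamReader(resp.GetResponseStream()))
                {
                    string html = reader.ReadToEnd();
                    // ...parse, extract links, enqueue new URLs under the lock...
                }
            }
            catch (WebException) { /* log and move on */ }
        }
    }
}
```

Raising the connection limit and adding a per-request timeout are usually the first two changes that move the pages-per-second number on a setup like this.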
I would be very interested in feedback from people who have already worked
on this kind of thing.

/Benjamin

N.B.: sorry for my poor English (I am French ;)).
 
Hi Benjamin,

I'm working on a crawler too, and I'd be interested in swapping notes. It's
hard to know how much you should be getting out of your crawler.

The current log file indicates that my app can process up to about 8 pages
per second, although it's often in the 2-5 range and varies wildly
depending on many factors. This processing includes converting pages to
XHTML, analysing them, filtering out unwanted pages or areas of a page,
checking for duplicate pages, and finally writing the pages to the
database (which is full-text indexed). We're scanning 4,000+ sites and
taking in 12,000 new pages a day. The app can cope with double that
throughput, but it still runs well below capacity because it has to pause
to make time for tasks other than crawling.

I'm no expert, but improving the code comes down to your performance
factors, which in turn come down to both code and hardware. I'm spinning
up 20 threads, using in-memory queues, etc.
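
For what it's worth, the usual shape of those in-memory queues is a small blocking queue so crawl threads can hand pages off to processing threads without busy-waiting. This is a generic sketch of that pattern using Monitor.Wait/Pulse, not Tobin's actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// A minimal blocking queue: producers Enqueue, consumers Dequeue,
// and a consumer blocks (releasing the lock) until an item arrives.
class BlockingQueue<T>
{
    readonly Queue<T> items = new Queue<T>();
    readonly object gate = new object();

    public void Enqueue(T item)
    {
        lock (gate)
        {
            items.Enqueue(item);
            Monitor.Pulse(gate);     // wake one waiting consumer
        }
    }

    public T Dequeue()
    {
        lock (gate)
        {
            while (items.Count == 0)
                Monitor.Wait(gate);  // releases the lock while waiting
            return items.Dequeue();
        }
    }
}
```

With one of these between the downloader threads and the XHTML-conversion/analysis threads, a slow page parse doesn't stall the network side and vice versa.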

The database is our biggest problem at the moment, since it can't cope
with simultaneous indexing and searching. I've spent hours in Query
Analyzer and following traces to tune it, but pinpointing bottlenecks is
like peeling an onion: you have to pick away at it methodically because
there are so many possible problem areas.

We're also considering scaling up to read-only databases used for querying
only, so that we can index around the clock. Using technologies such as
Lucene for text searching may also yield performance gains.

If you want to swap notes, email me at t0bin_<at>_t0binharris_<dot>_c0m.
Replace each 0 with an o.

Hope this helps

Tobin
 