Maximum Throughput with HttpWebRequests

  • Thread starter: tobin

tobin

Hi there,

Question 1:
I'm writing a spider in C# and want to be able to achieve the maximum
number of page fetches per second using the HttpWebRequest. Is, for
example, 100 pages/sec possible? I'm sure I've seen open source
perl/java spiders boast up to 500 pages/sec performance on an average
spec developer machine.

Note that I'm polite and not trying to hit any given server more than
once within any given minute; our spider has lots of sites to scan and
so has plenty to do without breaking the 2-concurrent-requests rule!

Question 2:
I suspect one main area of pain is to be the threadpool. If I do this
(example only):

for (int i = 0; i < 3000; i++)
{
...
WebRequest r = WebRequest.Create(serverUrl);
r.BeginGetResponse(SomeCallback, someState);
}

.... will this use the threadpool behind the scenes? If so, will the
pool throw an exception as soon as the 26th item gets added, or does it
have some flexible scheme for queueing callbacks? This appears to be
the behaviour I'm seeing, and I don't know how to make
BeginGetResponse() NOT use the pool, or how to stop the pool from
overflowing.

Any help really really appreciated :-)

Tobin
 
Thanks Vadym,

Thanks for the article link, I need to read that a few times I think!

My feeling is that the bottom line is this: whether the HttpWebRequest
is used asynchronously or synchronously, it's going to use the
threadpool either way.

Therefore, the only way to increase the number of concurrent requests
is either to increase the threadpool size (in machine.config) or to
run separate programs in separate application domains (each having
their own threadpool).
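(Editor's aside, not from the thread: besides machine.config, .NET 2.0 also exposes the pool limits programmatically. A sketch only; as the later replies suggest, raising the worker count rarely helps I/O-bound fetches:)

```csharp
using System;
using System.Threading;

class PoolSizeDemo
{
    static void Main()
    {
        int worker, io;
        ThreadPool.GetMaxThreads(out worker, out io);
        Console.WriteLine("max worker = {0}, max I/O = {1}", worker, io);

        // Ask for a higher completion-port ceiling; the runtime may
        // clamp or ignore values it considers invalid.
        ThreadPool.SetMaxThreads(worker, io * 2);
    }
}
```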

Anyway, I'm still not sure but will keep reading and hopefully folk
will keep sending feedback!

Thanks

Tobin
 
> run separate programs in separate application domains (each having
> their own threadpool).

This can introduce overhead if those domains have to communicate with
each other.

When using BeginGetResponse, only a completion-port thread will be
used, so you can leave the ThreadPool thread counts at their defaults
(the default number of completion-port threads is 1000).

Here is a small test showing that only completion-port (cp) threads
are being used. The test is not perfect, but it shows the idea...

using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;

class Program
{
    // Keep references so the requests are not garbage collected.
    static List<HttpWebRequest> list = new List<HttpWebRequest>();

    static void Main(string[] args)
    {
        int wr, cp;
        ThreadPool.GetAvailableThreads(out wr, out cp);
        Console.WriteLine("Before wr = {0} cp = {1}", wr, cp);

        for (int i = 0; i < 7; i++)
        {
            StartRequest("http://localhost/helloworld/hello.html");
        }

        ThreadPool.GetAvailableThreads(out wr, out cp);
        Console.WriteLine("After wr = {0} cp = {1}", wr, cp);
        Console.ReadLine();
    }

    static void StartRequest(string url)
    {
        int wr, cp;
        HttpWebRequest webReq = WebRequest.Create(url) as HttpWebRequest;
        list.Add(webReq); // so it is not GCed
        webReq.BeginGetResponse(new AsyncCallback(WebReqCallback), webReq);
        ThreadPool.GetAvailableThreads(out wr, out cp);
        Console.WriteLine("After StartRequest wr = {0} cp = {1}", wr, cp);
    }

    static void WebReqCallback(IAsyncResult ar)
    {
        int wr, cp;

        HttpWebRequest webReq = (HttpWebRequest)ar.AsyncState;
        HttpWebResponse webResp = webReq.EndGetResponse(ar) as HttpWebResponse;
        webResp.Close(); // release the connection back to the ServicePoint

        ThreadPool.GetAvailableThreads(out wr, out cp);
        Console.WriteLine("After WebReqCallback wr = {0} cp = {1}", wr, cp);
    }
}
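(Editor's usage note, my own sketch rather than something from the thread: since GetAvailableThreads only observes the pool, a spider issuing thousands of BeginGetResponse calls would still want to cap the number of in-flight requests itself, for example with a Semaphore-based throttle:)

```csharp
using System;
using System.Net;
using System.Threading;

class ThrottledFetcher
{
    // Allow at most 20 requests in flight; the number is illustrative.
    static Semaphore gate = new Semaphore(20, 20);

    static void Fetch(string url)
    {
        gate.WaitOne(); // blocks the issuing loop while 20 are pending
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.BeginGetResponse(delegate(IAsyncResult ar)
        {
            try
            {
                using (WebResponse resp = req.EndGetResponse(ar))
                {
                    // read and process the body here
                }
            }
            catch (WebException) { /* log and move on */ }
            finally
            {
                gate.Release(); // let the next queued request start
            }
        }, null);
    }
}
```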
 
Thanks for the response Vadym, and the sample code. I'll test this, but
do you know off the top of your head whether .NET 1.1 works the same
way? I'm fairly sure this changed in .NET 2.0 to overcome a limitation
in .NET 1.1.

Thanks

Tobin
 