Surprising threading issue

  • Thread starter: GeezerButler

In my app, I need to connect to an external web service. I need to
connect to it about 100 times per request, so I use threads to speed up
the processing. Since the external web service accepts only 5 requests
per second (one every 200 ms), I have to enforce that limit too.

I use Jon Skeet's CustomThreadPool class for the threads (because I
don't want to use the ASP.NET thread pool and I don't have nearly enough
technical expertise to write my own).

Here's my code:

private static DateTime lastHttpCallTime = DateTime.MinValue;
private readonly object weeklyProcessorLocker = new object();
private static readonly object dateTimeLocker = new object();
private int weeksProcessed = 0;


//Code enters here
void MainFunc()
{
    //Create thread pool for handling data for each week. Kill any
    //thread which has been idle for more than 10s
    threadPool = new CustomThreadPool();

    //Add each web service call item to thread pool
    for (int i = 0; i < itemsToProcess.Length; i++)
        threadPool.AddWorkItem(new WeeklyProcessorDelegate(ReadDataForWeek),
            itemsToProcess[i].from, itemsToProcess[i].to);

    //Keep waiting till all items are processed
    lock (weeklyProcessorLocker)
    {
        while (weeksProcessed < itemsToProcess.Length)
            Monitor.Wait(weeklyProcessorLocker);
    }

    //Process further
}


//Delegate method for the ThreadPool Work Item
private void ReadDataForWeek(string from, string to)
{
    //Create url using from and to
    Uri url = new Uri(String.Format(URL_STUB, from, to));
    XPathNodeIterator weeklyTrackIt = GetHttpUrlContent(url, XPATH);

    //Use weeklyTrackIt for some processing here

    //Signal end of processing for this week
    lock (weeklyProcessorLocker)
    {
        weeksProcessed++;
        Monitor.Pulse(weeklyProcessorLocker);
    }
}

//Code for making http request and getting the xml returned
private XPathNodeIterator GetHttpUrlContent(Uri uri, string xpath)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
    Stream responseStream = null;
    WebResponse response = null;

    try
    {
        //Wait for 200ms if needed (2,000,000 ticks = 200ms)
        lock (dateTimeLocker)
        {
            DateTime now = DateTime.Now;
            TimeSpan interval = now - lastHttpCallTime;

            if (interval < new TimeSpan(2000000))
                Thread.Sleep(new TimeSpan(2000000) - interval);

            //Record the last http call
            lastHttpCallTime = DateTime.Now;

            //Make the http call
            response = request.GetResponse();

            //dummy statement
            //string x = "Hi"; <--------- If I uncomment this it works!
        }

        //Load to xml document and select iterator to the nodes to be selected
        using (responseStream = response.GetResponseStream()) // <---- exception here
        {
            XPathDocument doc = new XPathDocument(responseStream);
            return doc.CreateNavigator().Select(xpath);
        }
    }
    finally
    {
        if (response != null)
            ((IDisposable)response).Dispose();
    }

    return null;
}


At the place I marked above, I intermittently get a malformed XML
exception which says that some tags are not properly closed. However,
if I uncomment the dummy statement, it never throws the exception.
It's weird, but to me it seems that the execution leaves the lock
before the response is fully received.
I'm sure there's a better explanation for this.
Hope someone can get through reading this thread and can help.
 
Hi Pete,
Thanks for your reply.
I think we have 2 separate problems here:
1) The XML loading issue (BTW, you are right, the XML exception is not
on that line; it's where I load the responseStream into the
XPathDocument). I'll post about this later or simply hope that it goes
away!
2) The design issue you mentioned (which I too realized in the past
hour). I'll talk about the design issue first as that's much more
important.

The (complete) test code that I'm running is as follows:

//Fields used by the test (declared here so the sample is self-contained)
static readonly object weeklyProcessorLocker = new object();
static readonly object myLocker = new object();
static DateTime lastHttpCallTime = DateTime.MinValue;
static int weeksProcessed = 0;

static void Main(string[] args)
{
    CustomThreadPool threadPool = new CustomThreadPool();

    //Add week to thread pool
    for (int i = 0; i < 50; i++)
        threadPool.AddWorkItem(new WeeklyProcessorDelegate(ReadDataForWeek));

    //Keep waiting till all weeks are processed
    lock (weeklyProcessorLocker)
    {
        while (weeksProcessed < 50)
            Monitor.Wait(weeklyProcessorLocker);
    }
}

static void ReadDataForWeek()
{
    lock (myLocker)
    {
        DateTime now = DateTime.Now;
        TimeSpan interval = now - lastHttpCallTime;
        if (interval < new TimeSpan(2000000))
            Thread.Sleep(new TimeSpan(2000000) - interval);

        //Record the last http call - before
        DateTime beforeCall = DateTime.Now;
        //Make the http call
        TheHttpCall();
        //Record the last http call - after
        DateTime afterCall = DateTime.Now;

        //Update the last http call time to the midpoint between the before and
        //after times (a rough approximation of when the request reached the server)
        TimeSpan ts = afterCall.Subtract(beforeCall);
        lastHttpCallTime = beforeCall.AddTicks(ts.Ticks / 2);
    }

    TheXmlProcessing();

    lock (weeklyProcessorLocker)
    {
        weeksProcessed++;
        Monitor.Pulse(weeklyProcessorLocker);
    }
}

private static void TheHttpCall()
{
    Thread.Sleep(500);
}

private static void TheXmlProcessing()
{
    Thread.Sleep(500);
}


I make 50 web service calls, assuming that the HTTP call and the XML
processing take 0.5 seconds each, so normal sequential code would take
50 * (0.5 + 0.5) = 50 secs.
Points that I found:
1. The code above takes roughly 25 secs (the lock serializes the 0.5s
HTTP calls, giving 50 * 0.5 = 25 secs, while the XML processing overlaps
across threads). This defeats the purpose of having 10 threads because
essentially we could have done the same thing using 2 threads. You noted
this point above too. However, here I CAN make sure that the HTTP calls
are spaced at least 200 ms apart.
2. If I remove myLocker, the code takes 5 secs, which is what I would
expect from 10 threads. However, here I CANNOT make sure that the HTTP
calls are spaced at least 200 ms apart.

So basically I want to retain the 2nd behavior but also have some way
of keeping the HTTP calls spaced at least 200 ms apart.
Note that I'm assuming here that the HTTP calls take 500 ms, but in
actuality they could take anything (below 200 ms, above 200 ms, etc.).
 
-- Naïvely, one might think it would work to just start a thread pool
work item every 200 ms, in which the work item does the web request
(if you simply need to _average_ one per 200 ms then the naïve approach
would actually be okay).

Wow, your naive solution is very pertinent to my requirements,
considering that I simply need to average 200 ms.
It does waste 200 ms per web service call, but it makes the code speed
depend entirely on the number of threads and is thus faster, which is
also good because then I can tweak the number of threads depending on
traffic.
BTW, I don't think it's naive at all. It's subtle and elegant.

I think some sort of static variable will be required to track HTTP
calls for different concurrent users.

-- But, thread pools generally will limit how quickly new threads are
started up;

Pretty much configurable in Jon Skeet's class (I think).

-- Work items get queued if there aren't enough threads to work on them,
which could result in some number of work items all being started
simultaneously.

Not a problem, as I mentioned above.
static void Main(string[] args)
{
     CustomThreadPool threadPool = new CustomThreadPool();
     DateTime dtPrev = DateTime.Now - TimeSpan.FromMilliseconds(200);

     //Add week to thread pool
     for (int i = 0; i < 50; i++)
     {
         TimeSpan tsElapsed = DateTime.Now - dtPrev;

         if (tsElapsed.TotalMilliseconds < 200)
         {
             Thread.Sleep(200 - (int)tsElapsed.TotalMilliseconds);
         }

         dtPrev = DateTime.Now;

         TheHttpCall();

        threadPool.AddWorkItem(TheXmlProcessing);
     }

     //Keep waiting till all weeks are processed
     lock (weeklyProcessorLocker)
     {
        while (weeksProcessed < 50)
            Monitor.Wait(weeklyProcessorLocker);
     }

}

private static void TheHttpCall()
{
     Thread.Sleep(500);

}

private static void TheXmlProcessing()
{
     Thread.Sleep(500);

     lock (weeklyProcessorLocker)
     {
        weeksProcessed++;
        Monitor.Pulse(weeklyProcessorLocker);
     }

}

In a real application, you'd replace "TheHttpCall()" with the setup and  
call to GetResponse().  Then you'd pass the result to "TheXmlProcessing()"  
when queuing the new work item.  Hopefully, you can see that this approach  
is much simpler than the code you had before, with the static variables  
and thread synchronization.
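
For concreteness, that substitution might look roughly like the sketch
below. This is illustrative only: the Uri parameter and the static
threadPool field are assumptions, URL_STUB is from the original code, and
the AddWorkItem overload that passes a parameter along is used in the same
way as earlier in the thread.

private static void TheHttpCall(Uri url)
{
     //Set up the request for this week's url (built from URL_STUB as in
     //the original code)
     HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

     //Make the call, then queue only the XML processing as a work item,
     //handing the response along
     WebResponse response = request.GetResponse();
     threadPool.AddWorkItem(TheXmlProcessing, response);
}

private static void TheXmlProcessing(WebResponse response)
{
     try
     {
          //Load the response into an XPathDocument, as in the original
          //GetHttpUrlContent
          using (Stream responseStream = response.GetResponseStream())
          {
               XPathDocument doc = new XPathDocument(responseStream);
               //...select and use the nodes here...
          }
     }
     finally
     {
          ((IDisposable)response).Dispose();
     }

     lock (weeklyProcessorLocker)
     {
          weeksProcessed++;
          Monitor.Pulse(weeklyProcessorLocker);
     }
}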

Yup, this is simpler because no thread synchronization is needed. But
the running time is similar to scenario 1 in my post above (speed is
very important because some end users can have as many as 250-300
weeks to process).
And if I'm not wrong, dtPrev will need to be static in the case where
there are concurrent users accessing the site.

Now, as I mentioned, if the web requests take longer than 200 ms to  
complete but you're still allowed to submit one every 200 ms, the above  
could be less efficient than is possible.  The solution there would be to  
use an asynchronous web request.  The asynchronous call adds a _little_ 
more complexity, but IMHO the net result is actually code that's a bit  
simpler (once you get past the asynchronous aspect).

That version might look something like this:

static CustomThreadPool threadPool = new CustomThreadPool();

static void Main(string[] args)
{
     //Add week to thread pool
     for (int i = 0; i < 50; i++)
     {
         TheHttpCall();

         Thread.Sleep(200);
     }

     //Keep waiting till all weeks are processed
     lock (weeklyProcessorLocker)
     {
        while (weeksProcessed < 50)
            Monitor.Wait(weeklyProcessorLocker);
     }

}

private static void TheHttpCall()
{
     // Thread.Sleep(500);

     // ...set up the HttpWebRequest, then...

     // Note: using an anonymous method here, which I prefer.  If you like,
     // you could use a named method instead.

     request.BeginGetResponse(delegate(IAsyncResult ar)
         {
             WebResponse response = ((WebRequest)ar.AsyncState).EndGetResponse(ar);

             threadPool.AddWorkItem(TheXmlProcessing, response);

         }, request);

}

private static void TheXmlProcessing(WebResponse response)
{
     // Thread.Sleep(500);

     // do processing using "response"

     lock (weeklyProcessorLocker)
     {
        weeksProcessed++;
        Monitor.Pulse(weeklyProcessorLocker);
     }

}

This is a nice concept, although I probably would not need to be so
accurate.

-- Hopefully the above gives you some ideas as to how to simplify your
original code, and perhaps even make it more efficient.

Yes, it does much more than that. Your thoughts on the matter are very
much appreciated.
 
BTW, the XmlException was a first-chance exception which my debugger
was catching because of a breakpoint.
I don't really understand why first-chance exceptions are needed in
the first place, or how the debugger decides that they are not
harmful.

A first-chance exception in a .NET library is OK, but it's kind of
disconcerting to see a first-chance NullReferenceException in
MyLibrary.dll and the debugger not having any problems with it at all.
 
Not knowing what you mean by "concurrent users accessing the site", I  
can't say.  Your posts have all implied that you have a single thread  
issuing these requests.  At worst, if you actually have multiple threads  
trying to issue the requests, I would not make the timing variable static,  
but instead would create a queue that is processed (consumed) by a single 
thread in which all of the timing is handled.
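
For illustration, a rough sketch of that single consumer-thread idea might
look like the following. The names (pendingRequests, queueLocker,
EnqueueRequest, ConsumeRequests) are made up for this sketch, and the
BeginGetResponse / AddWorkItem usage mirrors the asynchronous version
shown earlier in the thread.

//All request threads enqueue work; exactly one thread consumes the queue,
//so the 200 ms spacing is enforced in a single place.
static readonly Queue<Uri> pendingRequests = new Queue<Uri>();
static readonly object queueLocker = new object();

static void EnqueueRequest(Uri url)
{
     lock (queueLocker)
     {
          pendingRequests.Enqueue(url);
          Monitor.Pulse(queueLocker);
     }
}

//Runs on a single dedicated thread
static void ConsumeRequests()
{
     while (true)
     {
          Uri url;
          lock (queueLocker)
          {
               while (pendingRequests.Count == 0)
                    Monitor.Wait(queueLocker);
               url = pendingRequests.Dequeue();
          }

          //Only this thread issues requests, so a simple pause between
          //iterations is enough to honour the 5-requests-per-second limit
          HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
          request.BeginGetResponse(delegate(IAsyncResult ar)
          {
               WebResponse response = ((WebRequest)ar.AsyncState).EndGetResponse(ar);
               threadPool.AddWorkItem(TheXmlProcessing, response);
          }, request);

          Thread.Sleep(200);
     }
}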

This application is a web site and there might be more than one user
online at a given moment (say A and B).
Although ASP.NET will process A's and B's requests in separate threads,
going by the *naive* solution, A will make requests approximately every
200 ms and so will B. But the third-party web service will see that as
one request per 100 ms.
So IMO, we'll still need a lastHttpCallTime static/global/application-
level variable, and the code in any thread will have to make
adjustments according to this variable.

I sure hope I'm not totally off here.

I do understand your approach using the queue, but maybe the static
variable should suffice?

Geez
 
-- Using a queue, you wind up with a queued thread pool item only when
you have actual work to do. Using a static variable tracking time, you
wind up with a queued thread pool item practically immediately for every
single possible work item.

Yes, you are right about this.
Using the static variable will lead to the same old design issue,
revisited.

Thanks for all the help! This is really appreciated.
 
All of this will work, but I think you are overlooking something. If you
set your application pool to a web garden of 2 or more processes, or ever
web farm your application, using a singleton or static member or whatever
will not help. You will need to serialize your use of the web service via
a database or a singleton back-end Windows service or web service. That
may not be in the plan now, but it will be a huge design change if
required in the future.
 

Thanks for bringing that up, but unfortunately I don't understand
most of it :(
Right now my app is on a single machine, but going forward I'm going to
upgrade to this: http://www.tsohost.co.uk/clustered-hosting.php
So, does clustered hosting have something to do with what you
mentioned above?
If it does, can you elaborate in a little more layman language?
 
Sure, I'd be happy to. Basically, in a web farm, there are multiple
machines executing your code. Each machine runs independently of the
others, unless you make them aware of each other, which is rarely done. So,
if you write code to limit your use of a remote web service such that the
code relies on timing within a single process, and the application ends up
executing on multiple machines, you circumvent that logic. If you farm the
application on 2 machines, you will hit the web service twice as much,
because both machines will be running independently of each other. A web
garden is a way to have IIS instantiate multiple processes to service your
application, thus exposing the same situation without the need for multiple
servers. To configure this, set the Max Worker Processes setting on the
application pool for your site.

Clearly, the next question is how to manage this. You need to centralize
something. Typically, a web application has a central database, so that is
a natural start. When you make a request to the remote web service, you can
record the fact that the request was made, and include logic in your class
that checks the database before it makes a call to see whether it can and,
if not, how long it is going to have to wait. The most straightforward
design would be to create a table that has a row per request that is to be
made. Each attempt to make a request does the following (a rough sketch
follows the list):

1. Get the row with the highest request date/time
2. Calculate the next allowed request date/time (the previous highest
request date/time plus the minimum request interval; if that is in the
past, use the current date/time)
3. Wait until the calculated next allowed request date/time arrives
4. Make the request
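
A rough sketch of steps 1 through 4 might look something like the code
below. The table name WebServiceRequests, the RequestTime column, and the
ReserveNextRequestTime / WaitAndMakeRequest helpers are made-up names for
illustration (using System.Data.SqlClient), and in a real implementation
the SELECT and INSERT would need to be made atomic, for example by wrapping
them in a serializable transaction.

//Reserve the next allowed request time in the shared database (steps 1 and 2)
static DateTime ReserveNextRequestTime(SqlConnection connection)
{
     DateTime nextAllowed;

     //1. Get the highest recorded request date/time
     using (SqlCommand select = new SqlCommand(
          "SELECT MAX(RequestTime) FROM WebServiceRequests", connection))
     {
          object result = select.ExecuteScalar();
          DateTime lastRequest = (result == null || result == DBNull.Value)
               ? DateTime.MinValue : (DateTime)result;

          //2. Next allowed time = last request + minimum interval (200 ms),
          //   or now if that is already in the past
          nextAllowed = lastRequest.AddMilliseconds(200);
          if (nextAllowed < DateTime.Now)
               nextAllowed = DateTime.Now;
     }

     //Record the reservation so other processes/machines can see it
     using (SqlCommand insert = new SqlCommand(
          "INSERT INTO WebServiceRequests (RequestTime) VALUES (@t)", connection))
     {
          insert.Parameters.AddWithValue("@t", nextAllowed);
          insert.ExecuteNonQuery();
     }

     return nextAllowed;
}

//3. Wait until the reserved time arrives, then 4. make the request
static void WaitAndMakeRequest(SqlConnection connection)
{
     DateTime slot = ReserveNextRequestTime(connection);
     TimeSpan wait = slot - DateTime.Now;
     if (wait > TimeSpan.Zero)
          Thread.Sleep(wait);

     //...make the call to the remote web service here...
}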

Of course, if your site gets overloaded, you probably want to put in some
throttling logic so that if there are so many outstanding requests that it
is unreasonable to expect completion any time soon (i.e. the calculated
next allowed request date/time is too far off in the future), you just tell
the user the site is too busy or something.

Two other options are to create a Windows service and use remoting to
centralize access to the remote web service, or to create a web service
that is not load-balanced that does the same thing. Both of these options
create single points of failure in the architecture, unless you load
balance them using the same mechanism as before, so in my opinion they do
not make the entire solution any more robust.

I hope this helps and is in a more "layman" language.

Regards,
Dan Ruehle
 
Thanks for the detailed explanation.
Currently the site does not have too much traffic (about 6 users/hr),
and situations with 2 or more concurrent users are rare.
But yeah, I get the idea of centralization, and I think moving to that
kind of architecture should not be too much of a problem if the traffic
increases.
 