David said:
[...]
Anyhow, my application needs to be able to scan a document and upload it
to a webservice along with some data about the scan. The scanner will
have an Automatic Document Feeder which can scan up to 50 documents, at
15 pages per minute. My current tests (with my own scanner, no ADF and I
have not timed the scan) are generating a colour A4 file in jpg format,
150DPI scan of about 150K to 250K in size, depending on the document.
Now, I want to take advantage of the sheet feeder, as there is an initial
tranche of about 50,000 documents to scan and upload, so if I get the
system set up optimally, I will get a scan every 4 seconds.
As each scan comes in, I think it would be the perfect time to upload the
document to the webserver in parallel (obviously after each scan
completes, but without stopping the scanning process). I am thinking that
the upload should run in its own thread.
You should probably already be using a thread to do the scan. Otherwise,
the UI of your program cannot respond to user input, including simple
things like redrawing a window, as well as more complex activities like
trying to cancel the scan (depending on the scan API; I'm not familiar
with it, but I'd guess it has that ability).
So, I'd say the first thing to do is work on the thread needed for
scanning. That will give you a simpler introduction to the concepts
involved. Then move on to dealing with the file upload issue.
Would that be the best way to handle the uploads?
With there being 50 documents, that could potentially mean 50 threads
running at the same time; would that be a problem?
You should not need 50 threads. Based on the information you've provided
so far, it sounds as though you need a "producer" thread that handles
scanning, and a "consumer" thread that handles uploading. The producer
will scan a document and then place the results in a queue that is read by
the consumer. The consumer will upload one document at a time, as fast as
it can. A queue is used to coordinate data transfer from one thread to
the other; the producer enqueues, the consumer dequeues.
50 threads each trying to upload a different file at the same time will
just cause contention for your bandwidth.
How do I synchronise and monitor the threads? Basically, I need to
monitor them so that the scanner operators do not close the application
while an upload is happening. The synchronisation is to try to ensure
that the first document is uploaded first and the rest follow in the
order they were scanned.
If you use a producer/consumer design, ordering will be implicit.
As far as closing the application goes, that should not be a problem to
include in your implementation. You just need to keep track of the status
of the consumer thread, and handle that state appropriately in the UI
(with both the producer and consumer in threads separate from the UI, this
should be easier).
A volatile bool variable, exposed as a property on your thread class(es)
(producer and consumer, which you may or may not decide should be their
own individual classes; it can be simpler to combine the two into a
single class), along with an event the class raises when the property
value changes, will let your UI thread both check the current status (via
the property) and be notified of changes to that status (via the event),
so that it knows when to update the UI (e.g. enabling/disabling a button,
showing/hiding a progress indicator, etc.).
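A minimal sketch of that idea might look like the following. The class and
member names (UploadWorker, BusyChanged, IsBusy) are my own invention, not
anything from the framework; the real work would go where the comment is:

```csharp
using System;
using System.Threading;

class UploadWorker
{
    private volatile bool _isBusy;

    // Pre-initialized with an empty delegate so raising it is
    // always safe, even with no subscribers.
    public event EventHandler BusyChanged = delegate { };

    public bool IsBusy
    {
        get { return _isBusy; }
        private set
        {
            if (_isBusy != value)
            {
                _isBusy = value;
                // Raised on the worker thread; a UI handler must
                // marshal back to the UI thread via Control.Invoke().
                BusyChanged(this, EventArgs.Empty);
            }
        }
    }

    public void DoWork()
    {
        IsBusy = true;
        try
        {
            // ... dequeue and upload documents here ...
        }
        finally
        {
            IsBusy = false;
        }
    }
}
```

The UI subscribes to BusyChanged and reads IsBusy to decide, for example,
whether the "Close" menu item should be enabled.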
In fact, a regular Thread instance that you don't mark as a background
thread is a foreground thread by default, and your process won't actually
exit until that thread completes. I think in most
cases, it would make more sense to simply ask the user if they really want
to quit, and if so allow them to interrupt the processing (scanning and
uploading). But, one option is to let the non-UI threads continue to run;
they will continue to do their work, even if the user has closed the UI
for the program (and thus the thread associated with it).
I prefer a more explicit interaction with the user, but if you're going to
write threaded code, it's good to know the options available, if for no
other reason than so that you understand the various members of the types
you're using (for example, the Thread.IsBackground property).
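That default is easy to verify; a tiny sketch (the helper name is mine):

```csharp
using System.Threading;

static class ThreadDefaults
{
    // A new Thread is a foreground thread unless you say otherwise,
    // so the process stays alive while it is still running. Setting
    // IsBackground = true before Start() changes that.
    public static bool IsForegroundByDefault()
    {
        var t = new Thread(() => { });
        return !t.IsBackground;
    }
}
```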
As far as synchronization goes, .NET offers a number of synchronization
objects, many of them just wrappers or reimplementations of
synchronization objects that people familiar with the unmanaged Win32 API
will recognize. But it's my preference to use the .NET-specific API, the
Monitor class, when possible. And for a producer/consumer scenario, this
in fact is a very good choice, because the semantics of the Monitor class
match what you need exactly.
In particular, you can use the Monitor.Wait() and Monitor.Pulse() methods
to coordinate between the two threads. The consumer thread will acquire
the lock (e.g. by using the "lock" statement) and then enter a work loop.
The loop will terminate according to whatever condition you set (e.g. some
flag set internally by a "Stop()" method). Inside the loop, the consumer
will check for work to do, will do any work necessary (by inspecting the
queue and dequeuing anything in it), and then call Monitor.Wait() when
there's nothing left to do.
In your producer thread, upon completing a scan (or whatever work the
product of which then needs to be passed to the consumer), you'll acquire
the lock, enqueue the data object representing the scan, and then call the
Monitor.Pulse() method.
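The producer side can be quite small. A sketch, where the class name and
the idea of queuing a file name stand in for whatever data object
represents your scan:

```csharp
using System.Collections.Generic;
using System.Threading;

class ScanPipeline
{
    private readonly object _objLock = new object();
    private readonly Queue<string> _queue = new Queue<string>();

    // Producer side: called once per completed scan. The "scan"
    // here is just a file name, for illustration.
    public void EnqueueScan(string scanFile)
    {
        lock (_objLock)
        {
            _queue.Enqueue(scanFile);
            // Wake the consumer if it is blocked in Monitor.Wait().
            Monitor.Pulse(_objLock);
        }
    }

    public int PendingCount
    {
        get { lock (_objLock) { return _queue.Count; } }
    }
}
```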
Note that I didn't mention releasing the lock in either of the above
paragraphs. That's not because you don't need to. It's because if you do
things "the usual way", it happens implicitly.
Specifically, the "lock" statement is just a convenient way to use the
Monitor class; it creates a block of code in which Monitor.Enter() is
called before the block is entered, and Monitor.Exit() is called when the
block is exited. It also implicitly puts that Exit() call in a finally
block, to ensure that the lock is always released when leaving the block
of code, even via an exception.
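Side by side, the two forms look like this. The exact code the compiler
emits varies by C# version (newer compilers also track whether the lock
was actually taken before calling Exit()), so treat the expanded form as
an approximation:

```csharp
using System.Threading;

static class LockExpansionDemo
{
    static readonly object _objLock = new object();
    static int _counter;

    public static int Counter { get { return _counter; } }

    // What you write:
    public static void IncrementWithLockStatement()
    {
        lock (_objLock)
        {
            _counter++;
        }
    }

    // Roughly what the compiler generates for the method above:
    public static void IncrementExpanded()
    {
        Monitor.Enter(_objLock);
        try
        {
            _counter++;
        }
        finally
        {
            Monitor.Exit(_objLock);
        }
    }
}
```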
In addition, the Wait() and Pulse() methods both must be called by a
thread that already holds the lock. But the Wait() method automatically
releases the lock when it's called, and re-acquires the lock before
returning. So a thread blocked at the Wait() method is not actually
holding the lock while it's waiting.
Are there any 'watchits' or whatever that I need to be aware of and keep
an eye out for?
There's no shortage of "gotchas" that come up when writing threaded code.
Here are some of the ones I consider most significant:
* Control.Invoke():
Assuming your application is based on Windows Forms or WPF (and if it's a
.NET GUI app, that's almost certainly the case), you need to be aware of
the fact
that all of your GUI objects have "affinity" for the thread in which they
were created (generally the main thread for the application). That is,
those objects must be used only on that thread.
To interact with those objects from any other thread, you need to use the
Control.Invoke() or Control.BeginInvoke() method. These let you pass a
delegate for a method you want executed on the thread that owns the
control. If you do anything from the other threads that needs to
manipulate the UI, you have to use Invoke(), either directly or
indirectly (it can happen indirectly if you use, for example, the
BackgroundWorker class; but for your producer/consumer scenario, I think
you'll be better off creating your own threads, since they will live too
long to justify using a thread-pool thread for the work, as the
BackgroundWorker class does).
* Use an instance of "object" dedicated for locking:
Some code examples (in fact, the default implementation for C# events)
use the "this" reference as the target object for the "lock" statement
and for calls to the Monitor class. But IMHO this is a bad idea, because
it exposes the object used for locking to code elsewhere. This can lead
to code you don't control (or at least are not thinking about at the
moment) taking a lock on the same object instance you're already using.
That can impair performance at best, and create new deadlock scenarios at
worst.
Always use a private, dedicated object for locking; then you'll know that
any code that tries to take the lock can be blocked (directly, anyway;
see "deadlock" below) only by other code in the same class, specifically
related to the code you're looking at.
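For example (class and member names are mine, chosen to fit the scanning
scenario):

```csharp
class DocumentUploader
{
    // Private, dedicated lock object: no code outside this class
    // can lock on it, so no outside code can block us on it.
    private readonly object _objLock = new object();
    private int _uploadedCount;

    public void RecordUpload()
    {
        lock (_objLock)      // never lock (this)
        {
            _uploadedCount++;
        }
    }

    public int UploadedCount
    {
        get { lock (_objLock) { return _uploadedCount; } }
    }
}
```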
* Don't do any actual work while holding the lock:
In the producer/consumer scenario, it is generally better to dequeue one
or more work items while holding the lock, and then release the lock
before actually doing the work. If you know in advance that the work
items can be handled _very_ quickly, you can get away without this, but
the mere fact that you need a producer/consumer design usually means the
work takes too long for that to be feasible.
A typical consumer thread might look like this:
while (!fStop)
{
    Item[] rgitem = null;

    lock (_objLock)
    {
        if (_queue.Count > 0)
        {
            rgitem = _queue.ToArray();
            _queue.Clear();
        }
        else
        {
            Monitor.Wait(_objLock);
        }
    }

    if (rgitem != null)
    {
        // process all items in "rgitem"
    }
}
An alternative pattern might look like this:
lock (_objLock)
{
    while (!fStop)
    {
        Item[] rgitem = null;

        if (_queue.Count > 0)
        {
            rgitem = _queue.ToArray();
            _queue.Clear();
        }
        else
        {
            Monitor.Wait(_objLock);
        }

        if (rgitem != null)
        {
            Monitor.Exit(_objLock);
            // process all items in "rgitem"
            Monitor.Enter(_objLock);
        }
    }
}
The latter of those two is a little more complicated, but it has the
advantage that when the thread wakes up, it doesn't release the lock
again until it really has to. In the former example, a thread that has
something to dequeue will still release the lock and then reacquire it
before checking the queue, even though it was just granted the lock by
the Wait() method.
I prefer the latter.
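As an aside: if you're on .NET 4 or later, the BlockingCollection<T>
class packages this whole hand-off for you (internally it does the same
kind of blocking and signalling). A sketch, with the class name and the
string "scans" being my own illustration:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class UploadQueueDemo
{
    // Feeds the given scans through a BlockingCollection to a single
    // consumer task and returns them in the order they were "uploaded".
    public static List<string> Run(IEnumerable<string> scans)
    {
        var queue = new BlockingCollection<string>();
        var uploaded = new List<string>();

        // Consumer: GetConsumingEnumerable() blocks until items
        // arrive; the loop ends after CompleteAdding() is called.
        Task consumer = Task.Run(() =>
        {
            foreach (string scan in queue.GetConsumingEnumerable())
            {
                uploaded.Add(scan);   // stand-in for the real upload
            }
        });

        foreach (string scan in scans)
        {
            queue.Add(scan);          // producer side
        }
        queue.CompleteAdding();
        consumer.Wait();
        return uploaded;
    }
}
```

Because the default backing store is a FIFO queue and there is a single
consumer, the upload order matches the scan order automatically.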
* Watch out for deadlock:
A very common accident when learning to write threaded code in .NET is to
forget that the Invoke() method is a blocking call requiring a specific
resource: it cannot return until the UI thread gets around to executing
the delegate, so if the UI thread is itself waiting for something your
worker thread holds, you have a deadlock.
In general, deadlock occurs when two threads are waiting to acquire the
resource the other thread already has, _and_ each thread will not release
the resource it has until it acquires the other resource. There are lots
of ways this can happen, and the dependency can either be direct or
indirect (i.e. there can be more than two threads or more than two
resources involved).
The simplest way to avoid deadlock is to simply never try to acquire a
lockable resource while already holding some other lockable resource. A
slightly more complicated way is to always acquire lockable resources in
the same order. And an even more complicated way is to encapsulate that
previous rule in an actual class that enforces "lock leveling".
In any case, be careful about the order and circumstances of every lock
you acquire. And if your program just stops working and sits there doing
nothing, not responding, you probably have a deadlock problem. If you are
using a non-Express version of Visual Studio, you can use the debugger to
examine your program threads and see where they are blocked. That will
show you what lockable resources are being held and how your two or more
threads are interfering with each other.
* C# events are essentially just delegate invocations:
The thing that makes a C# event useful is the encapsulation of the add and
remove methods. But when it comes to actually _raising_ a C# event, it's
really just a plain method call (or calls) via delegate invocation.
This means that if code executing in a particular thread raises an event,
all the event handlers for that event are executed _in the same thread_.
This has a number of implications, but the most pertinent one in a GUI
application is this: if you subscribe to an event raised on a worker
thread with a handler in a GUI object, and that handler interacts with
the GUI data structures (e.g. by showing/hiding a control,
enabling/disabling a control, displaying text, etc.), then the handler at
some point needs to use Control.Invoke() to get that interaction back
onto the GUI thread where it belongs.
For some specific classes (BackgroundWorker is a good example), the class
is specifically designed to be used in a thread-with-GUI environment, and
there are ways to fix the event-raising code itself to ensure the event is
raised on the correct thread.
But for something like this, the more appropriate technique is just to
have your event handler deal with it. So, the event handler will just
call Control.Invoke(), passing a delegate that will do the actual work
necessary.
Related to this, take care to use a thread-safe pattern for raising an
event. For example, either save the event field in a local before
checking for null, or pre-initialize the event field with an empty
delegate so you know it's always non-null.
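Both of those patterns, sketched (the Scanner class and event names are
made up for this example; the raise methods are public only so the sketch
is easy to exercise):

```csharp
using System;

class Scanner
{
    // Option 1: copy the field to a local before the null check, so
    // another thread removing the last handler between the check and
    // the call can't cause a NullReferenceException.
    public event EventHandler ScanCompleted;

    public void OnScanCompleted()
    {
        EventHandler handler = ScanCompleted;
        if (handler != null)
        {
            handler(this, EventArgs.Empty);
        }
    }

    // Option 2: pre-initialize with an empty delegate so the field
    // is never null and no check is needed at all.
    public event EventHandler ScanFailed = delegate { };

    public void OnScanFailed()
    {
        ScanFailed(this, EventArgs.Empty);
    }
}
```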
Jon Skeet has a very nice article that goes into greater detail on some of
the above, as well as other issues around threading:
http://www.yoda.arachsys.com/csharp/threads/
And just because I can, here's a link to my little screed on why I don't
like any of the code examples on MSDN that use Control.Invoke():
http://msmvps.com/blogs/duniho/arch...chnique-for-using-control-invoke-is-lame.aspx
Pete