atomic operations...

  • Thread starter Thread starter Chris M. Thomasson
  • Start date Start date
Peter said:
That is what I've been saying all along.


It might or it might not. As I mentioned, I personally wouldn't use
an approach that uses allocations to simulate CMPXCHG8, but that
doesn't mean that allocations are actually a problem (if you don't
measure it, there's no way to know) nor that there's no performant
way to implement your desired design in Java or C#. It could be that
plain old synchronization is fine, or it could be that you need to
learn more about high-performance synchronization objects in Java or
.NET (for example, if you're dealing with producer/consumer queues,
the ReaderWriterLockSlim class might be useful).

Part of the problem here is that your question has been solely about
implementing a specific synchronization technique, whereas most likely
what you really have is a higher-level problem that needs solving and
which in some specific implementation you're familiar with uses some
specific synchronization technique that may or may not be applicable
in a managed environment like Java or .NET. Just because a particular
implementation doesn't translate directly into a different
environment, that doesn't mean that the environment doesn't provide a
completely different way to arrive at the same results.

The hardware isn't changing, so I'd think that the set of instructions that
performs well doesn't change based on whether a compiler is emitting them or
they're in a runtime library.
 
Peter Duniho said:
If you haven't already, you should read Joe Duffy's blog:
http://www.bluebytesoftware.com/blog/Default.aspx

All of it, but especially his discussions on lock-free algorithms.

The short answer: lock-free is much harder to implement and maintain, and
does not always improve performance.
[...]

I agree that non-blocking algorithms do not always improve performance.
Sometimes, they can make things worse. I prefer a marriage between
non-blocking and blocking algorithms. For instance, one can use non-blocking
algorihtms on fast-path, and resort to blocking on slow-path. Anyway, I
created some very crude little test applications that attempt to show
peformance difference between non-blocking and blocking
multi-producer/consumer and single-producer/consumer queues. Here are the
results I get on my old P4 3.06 ghz hyper-threaded box:


Blocking Multi-Producer/Consumer Test Results
___________________________________________
http://pastebin.com/f4341bbf6
------------------------------------------------------------
Number of producers: 3
Number of consumers: 3
Number of messages per-producer: 2000000
------------------------------------------------------------
Milliseconds: 49375
Ticks: 493750000
------------------------------------------------------------
Milliseconds: 44656.25
Ticks: 446562500
------------------------------------------------------------
Milliseconds: 50046.875
Ticks: 500468750
------------------------------------------------------------
Milliseconds: 51046.875
Ticks: 510468750
___________________________________________





Non-Blocking Multi-Producer/Consumer Test Result
___________________________________________
http://pastebin.com/f1bc14065
------------------------------------------------------------
Number of producers: 3
Number of consumers: 3
Number of messages per-producer: 2000000
------------------------------------------------------------
Milliseconds: 18968.75
Ticks: 189687500
------------------------------------------------------------
Milliseconds: 18156.25
Ticks: 181562500
------------------------------------------------------------
Milliseconds: 17609.375
Ticks: 176093750
------------------------------------------------------------
Milliseconds: 17593.75
Ticks: 175937500
___________________________________________








Blocking Cached Single-Producer/Consumer Test Results
___________________________________________
http://pastebin.com/f3d81e1f5
------------------------------------------------------------
Number of messages: 2000000
------------------------------------------------------------
Milliseconds: 83718.75
Ticks: 837187500
------------------------------------------------------------
Milliseconds: 138609.375
Ticks: 1386093750
------------------------------------------------------------
Milliseconds: 76796.875
Ticks: 767968750
------------------------------------------------------------
Milliseconds: 80843.75
Ticks: 808437500
___________________________________________




Non-Blocking Cached Single-Producer/Consumer Test Results
___________________________________________
http://pastebin.com/f6fdc765d
------------------------------------------------------------
Number of messages: 2000000
------------------------------------------------------------
Milliseconds: 52343.75
Ticks: 523437500
------------------------------------------------------------
Milliseconds: 52781.25
Ticks: 527812500
------------------------------------------------------------
Milliseconds: 52968.75
Ticks: 529687500
------------------------------------------------------------
Milliseconds: 52781.25
Ticks: 527812500
___________________________________________






Please note that this does not really prove anything. The tests are of
highly contrived worst case scenarios...
 
Chris M. Thomasson said:
Peter Duniho said:
If you haven't already, you should read Joe Duffy's blog:
http://www.bluebytesoftware.com/blog/Default.aspx

All of it, but especially his discussions on lock-free algorithms.

The short answer: lock-free is much harder to implement and maintain, and
does not always improve performance.
[...]

I agree that non-blocking algorithms do not always improve performance.
Sometimes, they can make things worse.
[...]

I think I can show that in the context of the multi-producer/consumer
example. I think that the naive lock-based version should be able to beat
the non-blocking version under certain conditions. I also think that the
two-lock queue algorithm should be able to beat the full non-blocking
version. This is because of memory management issues that are specific to
the non-blocking version; I need fast DWCAS emulation in C# in order to
reuse nodes! Anyway, I will code up the test and post it when I get some
more time.
 
Back
Top