Tri-cubic interpolation speed on video card?

  • Thread starter: buyanovsky

buyanovsky

I'm wondering what number of tri-cubic (not trilinear) interpolations
per second can be achieved on a video card? Does anybody have actual,
test-proven numbers rather than assumptions? The only number I could
find on the internet is the tri-cubic speed on a GeForce3, which seems
very slow (1,800,000 tricubics per second); see:
< http://wwwvis.informatik.uni-stuttgart.de/vmv01/dl/papers/8.pdf >
Even on a general CPU the speed is around 7,000,000 tricubics per
second (P4 3.8GHz).

Task description:
- Given a 512x512x2048 (12-bit) volume
- An oblique, arbitrarily oriented 1024x1024 (12-bit) plane crosses
this volume
- Each pixel of the plane has to be filled with a tri-cubic
interpolated value (see the sketch below)

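For concreteness: one tricubic sample weights the 4x4x4 = 64 voxels
around the sample point with a cubic kernel. A minimal brute-force
sketch in C (my illustration, assuming a Catmull-Rom kernel and the
12-bit voxels stored one per 16-bit word; all names are made up):

#include <stdint.h>
#include <math.h>

/* Catmull-Rom weights for the four neighbours, fractional offset t. */
static void cubic_weights(double t, double w[4])
{
    double t2 = t * t, t3 = t2 * t;
    w[0] = 0.5 * (-t3 + 2.0 * t2 - t);
    w[1] = 0.5 * (3.0 * t3 - 5.0 * t2 + 2.0);
    w[2] = 0.5 * (-3.0 * t3 + 4.0 * t2 + t);
    w[3] = 0.5 * (t3 - t2);
}

/* One tricubic sample: 64 voxel fetches and 64 multiply-adds.
   Bounds checking is omitted for brevity. */
double tricubic(const uint16_t *vol, int nx, int ny,
                double x, double y, double z)
{
    int ix = (int)floor(x), iy = (int)floor(y), iz = (int)floor(z);
    double wx[4], wy[4], wz[4], sum = 0.0;

    cubic_weights(x - ix, wx);
    cubic_weights(y - iy, wy);
    cubic_weights(z - iz, wz);
    for (int k = 0; k < 4; k++)
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++) {
                size_t idx = ((size_t)(iz + k - 1) * ny
                            + (size_t)(iy + j - 1)) * nx
                            + (size_t)(ix + i - 1);
                sum += wz[k] * wy[j] * wx[i] * vol[idx];
            }
    return sum;
}
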
Thanks,
George
 
buyanovsky said:
I'm wondering what number of tri-cubic (not trilinear) interpolations
per second can be achieved on a video card?

You can achieve a 25-fold efficiency increase by reversing the polarity
on the influx capacitors and cross-coupling the warp-field signatures.
;-)

Nathan.
 
Evenbit said:
You can achieve a 25-fold efficiency increase by reversing the polarity
on the influx capacitors and cross-coupling the warp-field signatures.
;-)

Nope. No go.

In that case, you'll also need a transphasic inductor with a tricobalt
injector, to simulate the secondary phase field capacitor, which results
in a three times slower reaction during tri-cubic interpolation.
 
Kristijan Korazija said:
Nope. No go.

In that case, you'll also need a transphasic inductor with a tricobalt
injector, to simulate the secondary phase field capacitor, which results
in a three times slower reaction during tri-cubic interpolation.

Don't forget to brew a Very Strong, Very Hot cup of tea for the Infinite
Improbability Generator.

Vaughn L.Porter
 
buyanovsky said:
I'm wondering what number of tri-cubic (not trilinear) interpolations
per second can be achieved on a video card?
Are you sure you want to do this on a video card?
True, it can do specific tasks very quickly, but only very specific tasks...
Also, it's optimised to output data to the screen, not back to the CPU
and/or system memory.
buyanovsky said:
Task description:
- Given a 512x512x2048 (12-bit) volume

That's a lot of data. And video processors often don't support 12-bit...
Using 16-bit, this is more than 1GB.

With normal currently available hardware, I don't think it will be
worth the effort. I think that a smart addressing system, where you can
map the cube into system memory (but not load it unless necessary),
will be more effective (unless of course you want to do this many
times, or in real time).

Medical data?

H
 
Thanks for the prompt reply.

Herman said:
Are you sure you want to do this on a video card?

No, I'm not sure, and that is the reason I'm looking for the opinion
of people who have expertise in programming video cards to do a custom
job.

Herman said:
That's a lot of data. And video processors often don't support
12-bit... Using 16-bit, this is more than 1GB.

It is exactly 1GB = 2^(9+9+11+1) bytes. It is very affordable today to
have 4GB (~3.2GB under Windows XP with /3GB), so memory size is not the
problem. The bottleneck is memory latency. Just today I finished
accurate benchmarking of brute-force tricubic MIP performance on a dual
Xeon 3.6/800FSB: it is 19 million tricubic & MIP samples per second
(SSE2, 4 threads) with coherent memory access, and only half of that
speed for the slowest oblique MIP.

The numbers I've seen for tri-linear performance on the NVIDIA 6800
make me wonder whether there is a way to harness this power.

Herman said:
Medical data?

Yes; today the typical CT dataset size is in the range
512x512x(300...1000), and taking the new 64-slice CT scanners into
account, near-isotropic volumes of 3000-4000 slices are going to be
pretty common in 2-3 years.

Thanks,
George
 
Kristijan Korazija said:
Nope. No go.

In that case, you'll also need a transphasic inductor with a tricobalt
injector, to simulate the secondary phase field capacitor, which results
in a three times slower reaction during tri-cubic interpolation.

You forgot again... Compensate!!! Then you can achieve peak performance of
the warp core during tri-cubic interpolation...

--
On a bicycle, every New Year, a bald man paints the balcony.
By runf

Damir Lukic, calypso@_MAKNIOVO_fly.srk.fer.hr
a member of hr.comp.hardver FAQ-team
 
Herman Dullink said:
Theoretically, the newest generation of 3D GPUs should be able to
address more than a GB of data. These GPUs are programmable, so it
should be possible to implement the whole algorithm in GPU code.

There is one big problem with most GPUs, and that is CRC, or rather,
the lack of it. AGP doesn't have hardware CRC, so in order to keep the
data safe, drivers have huge tables (that's a big part of the 20+ MB
you have to download every time a new driver is out) that check the
response to every command given to the GPU. That's why it's very hard
to make programs that would use the GPU's huge power. I tried adding
two numbers using the ATi SDK and I can tell you it's hard work.
However, PCI Express (I'm 99% positive of this) has hardware CRC, so it
may be easier to make a program for a PCI Express GPU. The problem
Herman mentioned (getting the data back from the GPU's memory) is not
as big with PCI Express, since upload and download are pretty much the
same speed. Then there are TurboCache models that have only 16MB of
memory onboard and use system memory for the rest. Those models are not
as powerful as the top models, but they show that there might be a way
to use system memory for GPU data.

Hope this information helps you with your project.
Greetz
 
buyanovsky said:
It is exactly 1GB = 2^(9+9+11+1) bytes. It is very affordable today...
Yes, for system memory. But I haven't seen many consumer class graphics
adapters yet with at least this size of memory. A graphics adapter also
needs some on-screen memory for the GUI, and some off-screen buffers for the
result view(s).

buyanovsky said:
No, I'm not sure, and that is the reason I'm looking for the opinion
of people who have expertise in programming video cards to do a custom
job.

I have some expertise, but not in 3D (yet).
So I can't really help you with the implementation details of modern 3D
GPUs, but I know a bit about busses, the data channels in a system. The
main problem with most architectures is that they perform best when you
'push' the data through a channel (e.g. from CPU to graphics adapter).
Pulling data is very bad for performance; the CPU has to wait many
cycles for one fetch to complete. A cache helps (and only helps) if
certain data is fetched multiple times, and as long as no more data is
used than the cache size (i.e. it's very effective with 'looping'
algorithms).
DMA techniques are used for somewhat better performance; a device is
then programmed to push the data through a channel without further CPU
intervention.
If you use a large sequential stream of data, prefetching can be used;
you probably know about that, MMX/SSE/3DNow! have some prefetch
instructions, as in the sketch below.
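
For illustration, a minimal software-prefetch sketch in C (my example,
not from this thread; the prefetch distance and the hint would have to
be tuned per CPU):

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch (SSE) */

/* While processing row r, hint the next row of 16-bit voxels into the
   cache, one 64-byte cache line at a time. */
void process_rows(const uint16_t *vol, size_t row_bytes, int rows)
{
    for (int r = 0; r < rows; r++) {
        if (r + 1 < rows) {
            const char *next = (const char *)vol
                             + (size_t)(r + 1) * row_bytes;
            for (size_t b = 0; b < row_bytes; b += 64)
                _mm_prefetch(next + b, _MM_HINT_T0);
        }
        /* ... process row r here ... */
    }
}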

Maybe somebody else can give you some info about 3D specifics. You might
even contact the manufacturers of GPUs. Theoretically, the newest generation
of 3D GPUs should be able to address more than a GB of data. These GPUs are
programmable, so it should be possible to implement the whole algorithm in
GPU code.
All that's needed is (somebody with) the right (programming) information...
Because of the competition between these manufacturers, it'll be very
hard to get detailed info. Maybe someone working there can see the
challenge :)

Another approach is to look at the implementation of your algorithm.
Rewrite (parts of) it so that memory cycles and caches are used
optimally. You might e.g. split the volume up into smaller subvolumes,
and/or use tile-based rendering, i.e. split the screen up into smaller
rectangular (or square) parts, so that the chance of data still being
in cache is higher.


Herman
 
buyanovsky said:
The numbers I've seen for tri-linear performance on the NVIDIA 6800
make me wonder whether there is a way to harness this power.

The problem here is that they are designed to do *linear* operations.
In moving to cubic from linear, you are increasing your workload
considerably.

With linear you have f(x) = ax + b
With cubic you have f(x) = ax^3 + bx^2 + cx + d

Now extend them to 3 dimensions and you can see how much trickier it gets.

Off the top of my head, I can't think of a way to make use of trilinear
operations that can count towards your tricubic result.

Perhaps the best thing to do here is to take a look at what operations
can be performed in a DX9 pixel shader and decompose a cubic from
there. You can safely ignore the other two dimensions to start with, as
the dimensions are linearly separable (a sketch of this decomposition
follows below).
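
To make the separability point concrete, here is a rough C sketch (my
illustration, not code from this thread, assuming a Catmull-Rom kernel
and 12-bit voxels stored in 16-bit words): a tricubic sample decomposes
into 16 one-dimensional cubics along x, 4 along y and 1 along z, i.e.
21 one-dimensional cubic interpolations over a 4x4x4 neighbourhood.

#include <stdint.h>
#include <math.h>

/* 1-D Catmull-Rom interpolation of four neighbouring samples,
   fractional position t in [0,1). */
static double cubic1d(double p0, double p1, double p2, double p3, double t)
{
    return p1 + 0.5 * t * (p2 - p0
         + t * (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3
         + t * (3.0 * (p1 - p2) + p3 - p0)));
}

/* Tricubic via separability: 16 + 4 + 1 = 21 one-dimensional cubics.
   Bounds checking is omitted for brevity. */
double tricubic_separable(const uint16_t *vol, int nx, int ny,
                          double x, double y, double z)
{
    int ix = (int)floor(x), iy = (int)floor(y), iz = (int)floor(z);
    double tx = x - ix, ty = y - iy, tz = z - iz;
    double row[4], col[4];

    for (int k = 0; k < 4; k++) {        /* one y-result per z-slice */
        for (int j = 0; j < 4; j++) {    /* one x-result per row */
            const uint16_t *p = vol + ((size_t)(iz + k - 1) * ny
                                     + (size_t)(iy + j - 1)) * nx
                                    + (size_t)(ix - 1);
            row[j] = cubic1d(p[0], p[1], p[2], p[3], tx);
        }
        col[k] = cubic1d(row[0], row[1], row[2], row[3], ty);
    }
    return cubic1d(col[0], col[1], col[2], col[3], tz);
}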

The problem of getting the data from the card to the CPU is one that
also needs considering, but I suspect that PCI Express will help
considerably as it is faster than AGP.

Sounds like an interesting problem.

Ben
 
You forgot again... Compensate!!! Then you can achieve peak performance of
the warp core during tri-cubic interpolation...

Why would you need compensation?.. or interpolation???....
you just need to plug in your favorite pickup.. and O®E®I!!.. :DISCOOO
DISCOOO MILE LOVES DISCOOOOO...
 
Ben said:
Perhaps the best thing to do here is to take a look at what operations
can be performed in a DX9 pixel shader and decompose a cubic from
there.

On the P4 it takes a consecutive block of 49 SSE2 instructions to
compute one tricubic result. If I substitute the memory reads (voxel
reads) with some fictive register contents, the speed goes up
dramatically, from 19 million to 55 million. Note: I switched off only
the voxel reading in the tricubic part; it still writes the result, and
the Maximum Intensity blending still reads/writes memory. So the main
bottleneck is memory latency, not computation. Even if we assume that
the tricubic computations take zero time, the speed for this specific
task cannot go higher than 36 million/sec. I guess the same problem
holds for video cards, but maybe a video card has a more efficient
memory organization for these kinds of tasks. Anyway, before I invest
time in trying the video card approach, I would like to gather as much
info as possible.
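
(A rough back-of-the-envelope supports this - my arithmetic, not part
of the benchmark: one tricubic sample touches 4x4x4 = 64 voxels of 2
bytes each, i.e. 128 bytes of payload, so 36 million samples/sec would
mean ~4.6 GB/s of mostly scattered reads, close to the ~6.4 GB/s peak
of the shared 800 MHz FSB; and in the oblique case each of the 64
fetches can land in a different 64-byte cache line.)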

Thanks everybody for the useful replies.
 
Blento said:
Why would you need compensation?.. or interpolation???....
you just need to plug in your favorite pickup.. and O®E®I!!.. :DISCOOO
DISCOOO MILE LOVES DISCOOOOO...

ROTFL :o)))
 
buyanovsky said:
So the main bottleneck is memory latency, not computation.

OK, so if the computations are effectively starved by limited memory
bandwidth, you have two options:

1. Read from memory less (I'm not sure how clever your data structure is)

2. Use faster memory.

The first technique applies to both a CPU and a GPU.

The second technique would be helped considerably by the graphics card,
if the access patterns are suitable for GDDR3.

I'm not an expert on these new memories or on the types of memory
addressing GPUs have, but I suspect you have something similar to a
CPU, just wider and faster - so I'm guessing your biggest concern is to
keep your access patterns to something that will at least fit in cache
for the duration of a computation "unit".

Ben
 
Damir Lukic said:
The problem Herman mentioned (getting the data back from the GPU's
memory) is not as big with PCI Express, since upload and download are
pretty much the same speed.
Yes and no; the communication on the bus can be at the same speed, but
fetching data from the adapter's memory by the CPU will add many delays
to the overall system.
When writing, the CPU writes the data to a write buffer (or cache), so
it's like fire and forget. But when reading, the CPU has to wait until
the data has found its way back to the CPU via all the bridges and
other controllers. The adapter's memory is not cached, so we're looking
at the worst-case scenario timing-wise. The CPU may write GB/s, but
will only be able to read MB/s when there's no caching.
If the adapter has a function (DMA) to write the data directly into
system memory, that would be a great improvement.

I wonder if that's needed in this case, though. If it's used for
real-time display, then all data can stay in the adapter's memory. Only
when a snapshot is required does data have to be copied to system
memory.

H
 
buyanovsky said:
So the main bottleneck is memory latency, not computation.

A couple of things crossed my mind:

- Latency: the Xeon architecture uses a shared bus between CPUs.
Therefore, memory access is shared too, and there's a memory controller
(north bridge) between the CPUs and the memory. This increases latency
a bit. Try running this on an AMD64 platform; it should give better
results if latency is indeed the main bottleneck.

- Threads: because of the shared bus, multiple threads doing the same
task won't do much good... they'll block each other on memory access.

- CPU cache size: the AMD64 and Intel Xeon have on-chip caches between
512KB and 2MB. Try splitting the work up into smaller rectangular
'tiles', where processing each tile requires accessing less data than
the cache size, e.g. 64×64 or 256×256 pixels. Cache latency is very low
:) (a sketch of the tiling idea follows below)
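
A minimal tiling sketch in C (my illustration; tricubic() is the
brute-force sampler sketched earlier in the thread, and plane_point()
is a hypothetical helper that maps pixel (u,v) of the oblique plane to
volume coordinates):

#include <stdint.h>

/* Assumed to exist elsewhere: */
double tricubic(const uint16_t *vol, int nx, int ny,
                double x, double y, double z);
void plane_point(int u, int v, double *x, double *y, double *z);

#define TILE 64   /* tune so a tile's working set fits in L2 */

void render_plane_tiled(const uint16_t *vol, int nx, int ny,
                        float *out, int w, int h)
{
    /* Walk the plane tile by tile so that nearby pixels, which hit
       nearby voxels, are processed close together in time. */
    for (int tv = 0; tv < h; tv += TILE)
        for (int tu = 0; tu < w; tu += TILE)
            for (int v = tv; v < tv + TILE && v < h; v++)
                for (int u = tu; u < tu + TILE && u < w; u++) {
                    double x, y, z;
                    plane_point(u, v, &x, &y, &z);
                    out[(size_t)v * w + u] =
                        (float)tricubic(vol, nx, ny, x, y, z);
                }
}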

H
 
Herman Dullink said:
If the adapter has a function (DMA) to write the data directly into
system memory, that would be a great improvement.

You are correct. I checked some of my test data, and PCI Express GPUs
download about twice as fast as the fastest AGP GPUs, but downloading
data from the GPU's memory to system memory is somewhere between 15 and
20 times slower (250-380 MB/s) than uploading. However, there must be a
DMA function available, since nVidia 6200 TurboCache GPUs use system
memory, and their performance is not much slower than the plain 6200
model with onboard memory (not as much slower as you would expect if
the memory write speed were 250-380 MB/s).
 