HP's Q&A about OpenVMS, x86-64, and Itanium

  • Thread starter: Yousuf Khan
|>
|> BB> Hardware hasn't done a great job of just in time compiling one ISA
|> into another in the past--are you expecting a breakthrough?

No. The Crusoe, Pentium 4 and Banias don't do badly, which is what
I am referring to. DEC got good results compiling one ISA to
another, but the momentum seems to have been lost.

|> > All of the definitions of microcode that I have seen that exclude
|> > the Pentium 4 have been revisionist marketing. In the 1960s and
|> > 1970s, it would have been classified as an advanced microcoded
|> > design.
|>
|> BB> Microcode is used to emulate slow legacy things, more or less.
|> Anything that wants to be fast or power-efficient needs to run on
|> dedicated hardware.

So the Pentium 4 wasn't intended to be either?

|> Now if what you are advocating is a development framework which
|> further abstracts hardware from the people writing the software while
|> trying to recover as much performance as possible--that's hardly a
|> novel goal, but it's certainly noble.

That is what I am advocating, but I don't think that it's what Intel
is going to do.

|> As an aside--microcode on x86 and PAL code on IPF are used to add
|> features (reliability, security, and otherwise) and simplify the
|> implementation for things that can be slow.

Yes and no. The way that the trace cache is used on a Pentium 4
is conceptually just a variant of microcode.


Regards,
Nick Maclaren.
 
|> >>BB> Hardware hasn't done a great job of just in time compiling one ISA
|> >>into another in the past--are you expecting a breakthrough?
|> >
|> > Well, Transmeta's doing a credible job of JITing x86 into VLIW, so
|> > presumably they could have JITs for IA64 and AMD64 as well. ...
|>
|> The relatively fast x86 emulation in Transmeta chips owes quite a lot to
|> the fact that the hw was intentionally designed to be a superset of all
|> important x86 features, plus extra hw to handle/detect memory aliasing
|> problems, thereby allowing them to enregister memory variables.
|>
|> Extending it to AMD64 would mostly be a matter of extending the
|> registers to 64 bit, and possibly increasing the number a bit, to more
|> easily handle 16 instead of 8 architected regs.

Yes, indeed.

|> Doing IA64 the same way, with good performance, would be _hard_.

Yes. My speculation is that Intel management have not learnt the
lessons from the IA-64, and will not be put off by the complexity
of that task.

The underlying engine would clearly owe a lot to the Alpha, and
be particularly similar in its VAX-emulation features. But, of
course, it would have to have TWO legacy architectures to support.
Not an easy task.


Regards,
Nick Maclaren.
 
Ah. The doctrine of Historical Inevitability. How nice to see
such, er, interesting philosophies being preserved.

Yep. It doesn't take a card-carrying analyst to see that its success
has been inevitable for a long time now, and it's probably going to
stay inevitable for the foreseeable future.

:-)

-kzm
 
In comp.arch Terje Mathisen said:
The relatively fast x86 emulation in Transmeta chips owes quite a lot to
the fact that the hw was intentionally designed to be a superset of all
important x86 features, plus extra hw to handle/detect memory aliasing
problems, thereby allowing them to enregister memory variables.

Extending it to AMD64 would mostly be a matter of extending the
registers to 64 bit, and possibly increasing the number a bit, to more
easily handle 16 instead of 8 architected regs.

Doing IA64 the same way, with good performance, would be _hard_.

Couldn't you make the IA64 register set reside in scratchpad RAM, and JIT towards
a 32 reg arch that only kept the most often / lately used regs in
actual registers and the rest in scratch? It would then be a pretty ordinary
target with more or less a couple of extra quirks...
 
Well, Transmeta's doing a credible job of JITing x86 into VLIW, so
presumably they could have JITs for IA64 and AMD64 as well. That's also
rumored to be how future Itanics are going to handle x86 emulation (i.e.
FX!32); hopefully it'll be better than the direct hardware support of
earlier models...

Transmeta's "credible" job of JITing x86 isn't all that credible.
They haven't managed to match the performance of VIA's C3 chips, yet
they are using a die that is more than twice as large and a
SIGNIFICANTLY more expensive design. The core of their "Efficeon" chip
is more complicated and more expensive than the AthlonXP core and
about on par with Intel's "Northwood" core.
 
Sander said:
Couldn't you make the IA64 register set reside in scratchpad RAM, and JIT towards
a 32 reg arch that only kept the most often / lately used regs in
actual registers and the rest in scratch? It would then be a pretty ordinary
target with more or less a couple of extra quirks...

Yes, you could, except that all the sw for which IA64 is currently fast,
i.e. relatively regular fp codes, are fast specifically because they fit
the rotating registers/sw pipelining model of IA64.

This model will use all the regs, or at least all the regs that can be
live at the same time: Since the L2 latency used to be 9 cycles, this
means that you have to expect up to (at least?) N*9, with N = number of
regs required by the base algorithm, to be active at any given time.

I.e. 128 regs isn't just a hint, it's a requirement for a fast emulator,
unless you want to completely unravel all the logic behind those
predicated/pipelined/unrolled sw loops.
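
To put rough numbers on that estimate, here is a minimal sketch in Python. It only restates the N*9 rule of thumb given above; the function names are made up for illustration, and the default latency of 9 cycles is the figure quoted in the post, not a measured value.

```python
def live_registers(base_regs, load_latency=9):
    """Registers live at once under the N * latency estimate above."""
    return base_regs * load_latency


def max_base_regs(register_file_size=128, load_latency=9):
    """Largest N whose N * latency live values still fit in the file."""
    return register_file_size // load_latency


# An algorithm needing 14 registers per iteration would keep 126 values live.
print(live_registers(14))   # 126
print(max_base_regs())      # 14
```

Under these assumptions, a loop that needs 15 registers per iteration would already want 135 live values, more than a 128-entry file can hold.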

Terje
 
Couldn't you make the IA64 register set reside in scratchpad RAM, and JIT towards
a 32 reg arch that only kept the most often / lately used regs in
actual registers and the rest in scratch? It would then be a pretty ordinary
target with more or less a couple of extra quirks...

Yes, you could, except that all the sw for which IA64 is currently fast,
i.e. relatively regular fp codes, are fast specifically because they fit
the rotating registers/sw pipelining model of IA64.

This model will use all the regs, or at least all the regs that can be
live at the same time: Since the L2 latency used to be 9 cycles, this
means that you have to expect up to (at least?) N*9, with N = number of
regs required by the base algorithm, to be active at any given time.

I.e. 128 regs isn't just a hint, it's a requirement for a fast emulator,
unless you want to completely unravel all the logic behind those
predicated/pipelined/unrolled sw loops.

But how are you going to efficiently emulate the register rotation
itself, if the IA64 emulated registers are in 128 ordinary registers in
a conventional CPU?

Depending on the ratio of rotates to computation you could be much
better off keeping them in an array and changing a base pointer (and
doing a mod on each index into it).
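
A minimal sketch of the array-plus-base-pointer idea, in Python for brevity. This is a toy model and not how any real emulator is known to work. The class name, the default of 96 rotating registers and the direction of rotation are all assumptions of the sketch; here a value written to logical register X shows up as X+1 after one rotation.

```python
class RotatingRegisterFile:
    """Toy model: rotating registers backed by a flat array."""

    def __init__(self, size=96):
        self.size = size
        self.regs = [0] * size
        self.base = 0  # rotation offset, always kept in range(size)

    def _physical(self, logical):
        # The "mod on each index" step: map a logical register number
        # to its current slot in the backing array.
        return (self.base + logical) % self.size

    def read(self, logical):
        return self.regs[self._physical(logical)]

    def write(self, logical, value):
        self.regs[self._physical(logical)] = value

    def rotate(self):
        # Rotation only moves the base pointer; no register contents
        # are copied.
        self.base = (self.base - 1) % self.size


rf = RotatingRegisterFile()
rf.write(0, 42)
rf.rotate()
print(rf.read(1))  # the value written to logical 0 is now seen as logical 1
```

The trade-off is the one described above: rotation becomes a single pointer update, but every register access pays for an add and a modulo, and the registers live in memory rather than in real registers.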

-- Bruce
 
Bruce said:
But how are you going to efficiently emulate the register rotation
itself, if the IA64 emulated registers are in 128 ordinary registers in
a conventional CPU?

Ouch, you're right. That would be tough.

Emulating indirect register access without having hw support for the
feature would suck.

Depending on the ratio of rotates to computation you could be much
better off keeping them in an array and changing a base pointer (and
doing a mod on each index into it).

It might actually be better to do a limited level of sw unrolling of the
IA64 loop, then accept (if absolutely unavoidable) a few reg-reg moves
at the end to make it all add up.

If the IA64 code contains (as expected) fp loads 9 cycles in front of
first use, it should probably be replaced (after unrolling by four or
so) with a single prefetch operation which won't actually require real
registers to hold the load results.

Terje
 
Tony Hill said:
Haha, somehow I'm not at all surprised that HP is placing its faith
in "industry analysts"! Having recently gained some exposure to the
inner workings of HP, I can say without a doubt that it's one of the
most confused and disorganized companies out there and they've got a
lot of the wrong people making decisions.

I don't know if the old Packard family was right about the merger with
Compaq being a bad idea, or if things were this bad before the merger,
but HP is definitely suffering from schizophrenia at this stage.
Basing your product lines on the whims of some analyst is perhaps the
best proof that no one is home in upper management!

It gives them someone to blame when their plan fails. "Well, the
ANALYSTS said we were doing the right thing!"
 
|>
|> BB> Hardware hasn't done a great job of just in time compiling one ISA
|> into another in the past--are you expecting a breakthrough?

No. The Crusoe, Pentium 4 and Banias don't do badly, which is what
I am referring to.

Okay--what you are talking about doesn't have anything to do with
running one ISA on another. It has to do with specifically designing
an internal micro-op format for the architecture you are going to run.
Nobody has shown they can do this for two different ISAs on the same
hardware with optimal execution performance in both.

DEC got good results compiling one ISA to
another, but the momentum seems to have been lost.

DEC used software to compile one ISA into another--not hardware.

|> > All of the definitions of microcode that I have seen that exclude
|> > the Pentium 4 have been revisionist marketing. In the 1960s and
|> > 1970s, it would have been classified as an advanced microcoded
|> > design.
|>
|> BB> Microcode is used to emulate slow legacy things, more or less.
|> Anything that wants to be fast or power-efficient needs to run on
|> dedicated hardware.

So the Pentium 4 wasn't intended to be either?

I realize now that we are talking about two different things. The
Pentium 4 decoding of complex instructions into micro-ops isn't
'microcode' by any definition that I'm familiar with. The instruction
to micro-op conversion is fixed at design time (i.e., hardwired), and
at best takes 1 instruction and turns it into a few internal
micro-ops. It's really just a fancy instruction decoder. True
microcode is invoked in rare situations. It basically involves
trapping to code stored in a ROM, where we may replace 1 instruction
with *many* internal ones--and it is very slow.

True microcode (and PAL code, the IPF equivalent) is also much more
flexible--and I suppose, if you were sufficiently motivated, you could
use it to emulate one ISA with another. But my guess is that
software-based JITs will generally be a better solution.

|> Now if what you are advocating is a development framework which
|> further abstracts hardware from the people writing the software while
|> trying to recover as much performance as possible--that's hardly a
|> novel goal, but it's certainly noble.

That is what I am advocating, but I don't think that it's what Intel
is going to do.

Why not?

|> As an aside--microcode on x86 and PAL code on IPF are used to add
|> features (reliability, security, and otherwise) and simplify the
|> implementation for things that can be slow.

Yes and no. The way that the trace cache is used on a Pentium 4
is conceptually just a variant of microcode.

Maybe there's a design space out there for a microarchitecture that
is the superset of widgets that both x86 & IPF need for good
native performance and backwards compatibility--and that leverages
microcode or PAL code for emulating operations which can be slow. I
can't imagine that anyone would buy the thing, though.

Instead, my guess is that we continue to see incremental improvement
of software-based emulators with the occasional adding of hardware
features to improve the performance & robustness of these.

Brannon
not speaking for Intel
 
Bill said:
I'd say probably not before they get its vanilla-x86 emulation up to snuff -
i.e., probably never.

Well, ia32el-4.4-1.2.ia64.rpm from SuSe seems to work quite well around
these parts.

But what do I know.
 
Stephen said:
That's also
rumored to be how future Itanics are going to handle x86 emulation (i.e.
FX!32); hopefully it'll be better than the direct hardware support of
earlier models...

As I posted earlier, that's even how most current Itanium2 users would like to run
their code (with an IA32EL layer that's recent enough to work). Some of the codes
I've tried are 4-5 times faster using FX!32^WIA32EL than using the hardware engine
(which you can still use by doing /etc/init.d/ia32el stop).
 
Alexis Cousein said:
As I posted earlier, that's even how most current Itanium2 users would like to run
their code (with an IA32EL layer that's recent enough to work). Some of the codes
I've tried are 4-5 times faster using FX!32^WIA32EL than using the hardware engine
(which you can still use by doing /etc/init.d/ia32el stop).

Oops... Last I heard it was a future thing; I must have missed the
announcement in January when it shipped for Win2k3.

Can you say if the hardware engine will be removed in future chips?

S
 
Alexis Cousein said:
Well, ia32el-4.4-1.2.ia64.rpm from SuSe seems to work quite well around
these parts.

But what do I know.

Hard to say: does the performance evoke anything but laughter when compared
with current IA32 Intel and AMD competition, or is it still pretty much in
the toilet? Last I heard, the only thing that made the software emulation
look particularly good was the fact that it was less utterly abysmal than
the hardware kludge (i.e., might now be approaching 1.5 GHz P4/Xeon speeds -
hardly inspiring, though probably adequate for a somewhat wider range of
loads than the hardware IA32 Itanic box is).

- bill
 
Bruce Hoult said:
Yes, you could, except that all the sw for which IA64 is currently fast,
i.e. relatively regular fp codes, are fast specifically because they fit
the rotating registers/sw pipelining model of IA64.

This model will use all the regs, or at least all the regs that can be
live at the same time: Since the L2 latency used to be 9 cycles, this
means that you have to expect up to (at least?) N*9, with N = number of
regs required by the base algorithm, to be active at any given time.

I.e. 128 regs isn't just a hint, it's a requirement for a fast emulator,
unless you want to completely unravel all the logic behind those
predicated/pipelined/unrolled sw loops.

But how are you going to efficiently emulate the register rotation
itself, if the IA64 emulated registers are in 128 ordinary registers in
a conventional CPU?

Depending on the ratio of rotates to computation you could be much
better off keeping them in an array and changing a base pointer (and
doing a mod on each index into it).

-- Bruce


Do you mean something like a little wp workspace ptr:-)

regards

johnjakson_usa_com
 
I had this particular problem a long time ago, when implementing the
sliding window extension to my version of Kermit.

Today the fastest way is probably to use one (or even two) compares and
then a conditional/predicated move/subtraction to adjust, right?
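
A minimal sketch of that compare-and-adjust approach, in Python. The function name is made up for illustration. It assumes the index is never more than one buffer length outside the valid range, which is what lets a single conditional add or subtract stand in for a full modulo.

```python
def wrap_index(index, size):
    """Reduce index into range(size) without a division.

    Only valid when -size <= index < 2 * size.
    """
    if index >= size:
        index -= size      # ran off the top: wrap around once
    elif index < 0:
        index += size      # ran off the bottom: wrap around once
    return index


print(wrap_index(97, 96))   # 1
print(wrap_index(-1, 96))   # 95
```

On hardware with conditional or predicated moves, each branch here would become a compare followed by a conditional adjustment, so no branch misprediction is involved.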

Since Kermit could settle on arbitrary window sizes, I had to find a
fast way to determine not just the current packet, but also [curr-N].
The solution I settled on was to waste a little memory, and use a level
of indirection that used a power-of-two-sized table which pointed into
the real packet buffer array. :-)
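
One possible reading of that scheme, as a Python sketch. The details of the original Kermit code are not given here, so the class, its method names and the 64-entry table are assumptions for illustration only. The point is that the table size is a power of two, so a sequence number is reduced with a bit mask instead of a modulo, while the real buffer array can have any window size.

```python
class SlidingWindow:
    """Packet window of arbitrary size behind a power-of-two index table."""

    def __init__(self, window_size, table_bits=6):
        table_size = 1 << table_bits
        assert window_size <= table_size
        self.mask = table_size - 1
        self.table = [None] * table_size       # (seq & mask) -> buffer slot
        self.buffers = [None] * window_size    # the real packet buffers
        self.free = list(range(window_size))   # unused buffer slots

    def store(self, seq, packet):
        slot = self.free.pop()                 # IndexError if window is full
        self.buffers[slot] = packet
        self.table[seq & self.mask] = slot

    def lookup(self, seq):
        slot = self.table[seq & self.mask]
        return None if slot is None else self.buffers[slot]

    def release(self, seq):
        entry = seq & self.mask
        slot = self.table[entry]
        if slot is not None:
            self.table[entry] = None
            self.buffers[slot] = None
            self.free.append(slot)


window = SlidingWindow(window_size=5)
for seq in range(62, 67):                      # sequence numbers wrap past 63
    window.store(seq, f"pkt{seq}")
current = 66
print(window.lookup(current - 4))              # pkt62
```

The memory wasted is the unused part of the index table; in exchange, finding [curr-N] costs one mask and two array reads regardless of the window size.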
Do you mean something like a little wp workspace ptr:-)

What impresses me is that HP/Intel decided they could implement a mod-96
register indirect access without causing this to become a critical path.

With OOE I would expect all this logic to be handled early in the
(decoding?) pipeline, so that the actual execution logic wouldn't see it
at all, right?

Terje
 
Bill said:
Last I heard, the only thing that made the software emulation
look particularly good was the fact that it was less utterly abysmal than
the hardware kludge (i.e., might now be approaching 1.5 GHz P4/Xeon speeds -
hardly inspiring, though probably adequate for a somewhat wider range of
loads than the hardware IA32 Itanic box is).

That's a correct assessment, IMO (if it weren't using such laden terms).

Still, OpenOffice does run quite well on a 1.5GHz P4, and finding
128 CPU P4 1.5GHz NUMA-machines *is* rather hard ;).

You wouldn't want to run your performance-critical applications with it
(surprise, surprise: you'd better have little-endian 64-bit clean source
code, or an IA64 binary, for those) -- but at least your glue logic/GUI/
toolsets etc. do work (even though it takes some Linux gymnastics with
alternate glibc versions to make very *old* IA32 binaries work).

For other applications that I shan't detail, IA32EL is now fast enough
to make other parts of applications be the bottleneck -- which couldn't
exactly be said of the hardware engine. Especially when they can
be multi-threaded.
 