Clarifications about AMD TLB L3 bug

  • Thread starter: Robert Myers

Robert Myers

A number of assertions have been made here about the AMD TLB L3 Bug:

1. Only affects virtualization.

2. Is fixed in 64-bit Linux without a significant performance hit.

1. TRUTH: AMD, which knew about the bug before the chip was released,
falsely made this claim. The bug apparently affects all workloads,
potentially resulting in a system freeze.

2. TRUTH: A fix is available under NDA for RHEL 4 and not otherwise
apparently.

http://techreport.com/discussions.x/13721

http://techreport.com/discussions.x/13724

Robert.
 
Robert said:
A number of assertions have been made here about the AMD TLB L3 Bug:

1. Only affects virtualization.

2. Is fixed in 64-bit Linux without a significant performance hit.

1. TRUTH: AMD, which knew about the bug before the chip was released,
falsely made this claim. The bug apparently affects all workloads,
potentially resulting in a system freeze.


The truth actually is that it only affects virtualized workloads,
because the problem occurs when nested page tables are used. Nested page
tables only are used in virtualization, no other times. AMD never made
the claim it only affects virtualization, it is actually trying to keep
that hushed up: I assume because it does not want a virtualization bug
to be associated with its products since that kind of a reputation would
be hard to shake off, even if fixed.
2. TRUTH: A fix is available under NDA for RHEL 4 and not otherwise
apparently.

http://techreport.com/discussions.x/13721

http://techreport.com/discussions.x/13724

How secret can it be if it's open-source?

Yousuf Khan
 
The truth actually is that it only affects virtualized workloads,
because the problem occurs when nested page tables are used. Nested page
tables only are used in virtualization, no other times. AMD never made
the claim it only affects virtualization, it is actually trying to keep
that hushed up: I assume because it does not want a virtualization bug
to be associated with its products since that kind of a reputation would
be hard to shake off, even if fixed.

It's not clear to me whether that is true or not. Here's the bug:

"The processor operation to change the accessed or dirty bits of a
page translation table entry in the L2 from 0b to 1b may not be
atomic. A small window of time exists where other cached operations
may cause the stale page translation table entry to be installed in
the L3 before the modified copy is returned to the L2. In addition, if
a probe for this cache line occurs during this window of time, the
processor may not set the accessed or dirty bit and may corrupt data
for an unrelated cached operation. The system may experience a machine
check event reporting an L3 protocol error has occurred. In this case,
the MC4 status register (MSR 0000_0410) will be equal to
B2000000_000B0C0F or BA000000_000B0C0F. The MC4 address register (MSR
0000_0412) will be equal to 26h."
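
For anyone comparing a logged machine check against the erratum text quoted above, here is a minimal sketch of the comparison. The three register values are copied from the quote; the function and constant names are hypothetical, and nothing here reads an MSR (how the values get out of a machine-check log is left out).

```python
# Signature values copied from the quoted erratum text.
# All names below are hypothetical, for illustration only.
ERRATUM_MC4_STATUS = {0xB2000000000B0C0F, 0xBA000000000B0C0F}  # MSR 0000_0410
ERRATUM_MC4_ADDR = 0x26                                        # MSR 0000_0412


def parse_msr(text: str) -> int:
    """Parse a hex value written as in the erratum, e.g. 'B2000000_000B0C0F' or '26h'."""
    cleaned = text.replace("_", "").rstrip("hH")
    return int(cleaned, 16)


def matches_tlb_erratum(mc4_status: int, mc4_addr: int) -> bool:
    """Return True if the status/address pair equals the quoted L3 protocol error signature."""
    return mc4_status in ERRATUM_MC4_STATUS and mc4_addr == ERRATUM_MC4_ADDR


if __name__ == "__main__":
    status = parse_msr("BA000000_000B0C0F")
    addr = parse_msr("26h")
    print(matches_tlb_erratum(status, addr))
```

A match only says the logged values equal the ones in the quote; it does not settle whether the workload that triggered it was virtualized.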

I know what a Page Table Entry is, but I'm not sure what a PTTE
is...it sort of sounds like the nested page table. Perhaps someone
who is intimately familiar with the architecture could comment?
How secret can it be if it's open-source?

Really easy, nobody cares enough to sue AMD/RH to get it. It's not
like there are more than 10-20 end users for Barcelona at the moment.

DK
 
The truth actually is that it only affects virtualized workloads,
because the problem occurs when nested page tables are used. Nested page
tables only are used in virtualization, no other times. AMD never made
the claim it only affects virtualization, it is actually trying to keep
that hushed up: I assume because it does not want a virtualization bug
to be associated with its products since that kind of a reputation would
be hard to shake off, even if fixed.

Discussing AMD with you can be an interesting undertaking:

"In order to better understand this problem, TR spoke with Michael
Saucier, Desktop Product Marketing Manager at AMD. Saucier confirmed
that the TLB erratum can cause the system to hang when the chip is
experiencing high utilization. AMD has stated previously that
virtualization workloads can lead to this problem, but Saucier
clarified that other workloads can trigger system hangs, as well. He
characterized the issue as a race condition in the TLB logic "where
the other guy wins who isn't supposed to win," and said the likelihood
of the erratum causing a system hang is extremely rare."

The report could be factually incorrect, but since I cited something
other than my own impression to support my statement, I'd expect you
to do the same.

You know that I'm not an admirer of AMD, so you won't be surprised
that I think AMD may be mortally wounded. Between the ATI fiasco and
this, AMD is a company with products that no one is going to want to
buy and seems unlikely to survive until it will have products that
someone does want to buy. That AMD is publicly whining about the
pounding its stock price has taken should tell you something. Vendors
who *finally* took a chance on AMD after years of hanging back have
been fried. First there was the lame roadmap. Now this.

What's the difference between this and Intel's botched FDIV bug?
Very, very simple. At the time of the FDIV bug, x86 was for
"peecees," and no one cared if Intel made mistakes that IBM (or DEC or
Sun) never would. Now they do.
How secret can it be if it's open-source?
How is part of SUSE kept proprietary?

Robert.
 
A number of assertions have been made here about the AMD TLB L3 Bug:

1. Only affects virtualization.

2. Is fixed in 64-bit Linux without a significant performance hit.

1. TRUTH: AMD, which knew about the bug before the chip was released,
falsely made this claim. The bug apparently affects all workloads,
potentially resulting in a system freeze.

2. TRUTH: A fix is available under NDA for RHEL 4 and not otherwise
apparently.

A number of assertions have been made here by Mr Myers about the AMD
TLB L3 Bug:

1. That a fix is available under NDA for RHEL 4 and not otherwise
apparently.

Truth: Mr Myers, which knew about the openly released fix before the
post was released, falsely made this claim. The fix apparently is
available for all, not requiring a NDA that could potentially result
in an information freeze.

Truth : AMD released the fix publicly without a NDA requirement on 5
Dec, documented on the same day by the same website used by Mr Myers
to cite the two truths above, 8 days before Mr Myers's posting on 13
Dec... ;)

http://www.techreport.com/discussions.x/13742
https://www.x86-64.org/pipermail/discuss/2007-December/010260.html

=P
 
A number of assertions have been made here by Mr Myers about the AMD
TLB L3 Bug:

1. That a fix is available under NDA for RHEL 4 and not otherwise
apparently.

Truth: Mr Myers, which knew about the openly released fix before the
post was released, falsely made this claim. The fix apparently is
available for all, not requiring a NDA that could potentially result
in an information freeze.

Truth : AMD released the fix publicly without a NDA requirement on 5
Dec, documented on the same day by the same website used by Mr Myers
to cite the two truths above, 8 days before Mr Myers's posting on 13
Dec... ;)

http://www.techreport.com/discussio...g/pipermail/discuss/2007-December/010260.html
As I'm sure you know, I wasn't aware of the follow-up article.
Somewhere, there might be a customer who matters who would apply such
an "invasive" patch without support. Who or where that customer might
be is beyond my imagining, except that someone important must have a
bunch of these AMD chips installed somewhere and has no choice but to
take the chance. So,

1. We rushed a chip into production and missed an infrequently-
occurring but potentially disastrous bug.

2. We are now rushing out a patch that purports to fix the bug without
a serious penalty. We told you to trust us about the chip, and it
turns out you shouldn't have. Now we're telling you *not* to trust us
about the patch. Why, exactly, would anyone install the unsupported
patch? Presumably there is a handful of important customers whose
hands are being held. For everyone else, it's just PR.

Robert.
 
Robert said:
As I'm sure you know, I wasn't aware of the follow-up article.

Don't use such lame excuses.
Somewhere, there might be a customer who matters who would apply such
an "invasive" patch without support. Who or where that customer might
be is beyond my imagining, except that someone important must have a
bunch of these AMD chips installed somewhere and has no choice but to
take the chance. So,

1. We rushed a chip into production and missed an infrequently-
occurring but potentially disastrous bug.

2. We are now rushing out a patch that purports to fix the bug without
a serious penalty. We told you to trust us about the chip, and it
turns out you shouldn't have. Now we're telling you *not* to trust us
about the patch. Why, exactly, would anyone install the unsupported
patch? Presumably there is a handful of important customers whose
hands are being held. For everyone else, it's just PR.

Nonsense. Go check how many errata there were in the Core Duo. Just see the
example from the same site, from the comments from the article you quoted...

http://techreport.com/forums/viewtopic.php?t=43352&view=next&sid=a3a9ffe993e91c1453d97652f7222e65


rgds
\SK
 
Robert said:
What's the difference between this and Intel's botched FDIV bug?
Very, very simple. At the time of the FDIV bug, x86 was for
"peecees," and no one cared if Intel made mistakes that IBM (or DEC or
Sun) never would. Now they do.

What nonsense!

You know what the difference is?
There is a workaround for this AMD bug, like there are for Intel's TLB bugs in
their Core2 Duos. Both AMD & Intel fixes reduce the perofrmance a bit.

You know what the difference is? There was no fix for the FDIV bug at all,
performance-reducing or not. The buggy stuff was hard-coded and could not be
bypassed. Intel has learned from that disaster and AMD has too.
How is part of SUSE kept proprietary?

Go buy a little clue and read how the GPL works. Then you'll know that parts
which are not derived works of the GPL'd Linux kernel can be proprietary, and
that those which are derived works (as such a patch has to be) cannot.

BTW. The patch is public, so the point is moot, you're just spreading
unfounded FUD.

rgds
\SK
 
What nonsense!

You know what the difference is?
There is a workaround for this AMD bug, like there are for Intel's TLB bugs in
their Core2 Duos. Both AMD & Intel fixes reduce the perofrmance a bit.
The workaround costs anywhere from 5% (one of AMD's numbers) to 50%
(other's numbers, naturally) in performance. You think that's
acceptable? AMD bought it on this one. Perhaps AMD should have had
you go out and address investors. You'd have been a big hit.

Your comment that "both AMD & Intel fixes reduce the perofrmance [sic]
a bit" is like Yousuf coming out with the item about Intel's bug right
after the AMD bug, as if they canceled one another out. Go look at
the financial press, and see if anyone but AMDroids (or anyone that
matters) reads it that way.
You know what the difference is? There was no fix for the FDIV bug at all,
performance-reducing or not. The buggy stuff was hard-coded and could not be
bypassed. Intel has learned from that disaster and AMD has too.



Go buy a little clue and read how the GPL works. Then you'll know that parts
which are not derived works of the GPL'd Linux kernel can be proprietary, and
that those which are derived works (as such a patch has to be) cannot.

BTW. The patch is public, so the point is moot, you're just spreading
unfounded FUD.
If you can't be bothered to read the entire thread, then I can't be
bothered to respond.

Robert.
 
Don't use such lame excuses.
When you've grown up, you'll know better than to talk to people that
way, especially to people you don't know.
Nonsense. Go check how many errata there were in the Core Duo. Just see the
example from the same site, from the comments from the article you quoted...

http://techreport.com/forums/viewtopic.php?t=43352&view=next&sid=a3a9...
There are mistakes, and there are mistakes. This mistake is one that
AMD could not afford. Your idea that "errata happen" and that they're
all equivalent is interesting. I suggest that you buy some AMD
stock. It's a bargain right now.

Robert.
 
Robert said:
The workaround costs anywhere from 5% (one of AMD's numbers) to 50%
(other's numbers, naturally) in performance.

Yeah, sure. Maybe it's 200%! Or maybe 199929292%...

It may be 50% in some artificial test.
You think that's
acceptable?

Whatever. It's going to be fixed, like many earlier bugs from both AMD & Intel
(and DEC & IBM & Sun &...)

AMD bought it on this one. Perhaps AMD should have had
you go out and address investors. You'd have been a big hit.

Your comment that "both AMD & Intel fixes reduce the perofrmance [sic]
a bit" is like Yousuf coming out with the item about Intel's bug right
after the AMD bug, as if they canceled one another out. Go look at
the financial press,

Whatever. The financial press is a poor source of technical info.

and see if anyone but AMDroids (or anyone that
matters) reads it that way.

Whatever. Such recalls do happen. It's a serious blow to AMD (as it delays
their more competitive products and causes them to lose the Christmas season),
but such things are nonetheless reality and they do happen to everyone
from time to time. AMD still has enough money to weather this one (at their
current burn rate they can go for about 2 more years). And some of that
burn is one-time (ATI acquisition costs are big, but a one-time expense).

If you can't be bothered to read the entire thread,

I did read it. That's the very reason I put the above BTW.

then I can't be
bothered to respond.

Yet you responded :)
Your response is practically empty (of course), but it's here.


rgds
\SK
 
Robert said:
When you've grown up, you'll know better than to talk to people that
way, especially to people you don't know.


Don't be silly. Quoting you from the very same thread:

If you can't be bothered to read the entire thread,
then I can't be
bothered to respond.

So, you in the very same thread:
a) falsely accuse others of not reading the things they're discussing (I
did read LL's answer)
b) excuse yourself by not reading the things you're discussing

At least be consistent on a short time-scale!

There are mistakes, and there are mistakes. This mistake is one that
AMD could not afford.

Because our great all-knowing Robert Myers said so...

Your idea that "errata happen" and that they're
all equivalent is interesting. I suggest that you buy some AMD
stock. It's a bargain right now.

Yeah, the old stockshill, johncorsish note.
I'd like to remind you, this is a technical group not stock talk BS forum...

rgds
\SK
 
Robert Myers wrote:
AMD bought it on this one. Perhaps AMD should have had
you go out and address investors. You'd have been a big hit.
Your comment that "both AMD & Intel fixes reduce the perofrmance [sic]
a bit" is like Yousuf coming out with the item about Intel's bug right
after the AMD bug, as if they canceled one another out. Go look at
the financial press,

Whatever. The financial press is a poor source of technical info.
You think the articles (and press releases) you get on the internet
are a *good* source of information?

Your contempt for markets is revealing. A stock price is the
cumulative opinion of many people who follow the stock and who wager
actual money (not usenet bandwidth) on their opinions. The United
States (in particular) has companies like Intel (and, yes, even AMD)
because it so efficiently predicts and rewards success and predicts
and punishes failure through market mechanisms.

Chattering away like this is an interesting pastime, but it doesn't
affect anything of importance. Even much less so now than it used
to. You may not think much of the dimwits who majored in management,
but they can buy and sell as many techies as they need to find out
what's going on.
Whatever. Such recalls do happen. It's a serious blow to AMD (as it delays
their more competitive products and causes them to lose the Christmas season),
but such things are nonetheless reality and they do happen to everyone
from time to time. AMD still has enough money to weather this one (at their
current burn rate they can go for about 2 more years). And some of that
burn is one-time (ATI acquisition costs are big, but a one-time expense).
The question here is whether AMD will even survive. For one thing,
the stock is selling below book. That makes AMD a takeover target.
Would an AMD that was bought in a leveraged buyout continue the
ruinous war with Intel it's undertaken? I certainly hope not. Only
time will tell.
I did read it. That's the very reason I put the above BTW.
So your "BTW" was a me-too pile-on. Very impressive.

In the sense that I wasn't going to repeat what I'd already said on
the subject. You are a piece of work.

The sum of the opinion here is that it wishes to minimize the
seriousness of what has happened with AMD. If anyone here really
believes that, there is a serious opportunity to make a lot of money,
because, as I said, AMD is currently selling below its book value.

There may be other things fueling the fire-sale prices. For example,
who wants to send out a quarterly report showing that they'd made a
bet on AMD? Things might not be *quite* as bad for AMD as the stock
price would indicate. You can find that sort of thing out in the
financial press, too.

Robert.
 
You know that I'm not an admirer of AMD, so you won't be surprised
that I think AMD may be mortally wounded. Between the ATI fiasco and
this, AMD is a company with products that no one is going to want to
buy and seems unlikely to survive until it will have products that
someone does want to buy.

Although it's a really bad piece of news for AMD, I'd have to disagree
about it being a mortal wound. After all, I distinctly remember AMD's
share price once being at US$3+ back in the early days of the K7, which
ironically was a comparatively better product than Intel's P3 then,
and it's now still $7+. If AMD survived on a single key product back
then and nobody bought them out at $3+, I don't see why they can't
survive now with an overall stronger product portfolio and a share
price double those times.

Despite what you *claim* about nobody wanting to buy AMD products, I
see regular messages about AMD/ATI 3850/3870 selling out locally.
That's at least one wing that's still flying reasonably even if not
outperforming the competition. The X2 processors are still selling due
to their relatively cheap prices for the performance.

That AMD is publicly whining about the
pounding its stock price has taken should tell you something. Vendors
who *finally* took a chance on AMD after years of hanging back have
been fried. First there was the lame roadmap. Now this.

What's the difference between this and Intel's botched FDIV bug?
Very, very simple. At the time of the FDIV bug, x86 was for
"peecees," and no one cared if Intel made mistakes that IBM (or DEC or
Sun) never would. Now they do.

Actually, I think the real difference between the two is that nobody
saw it as a mistake Intel couldn't recover from. However for AMD,
this would look like a killing blow on top of the underwhelming
performance against competition for a new product generation. While I
vaguely remember outcry against Intel for that bug, I don't remember
anybody saying that it's going to sink Intel. There just wasn't
sufficient competition capacity to take over a company with over 90% of
the market share. Thus the difference in perceived impact.
 
Although it's a really bad piece of news for AMD, I'd have to disagree
about it being a mortal wound. After all, I distinctly remember AMD's
share price once being at US$3+ back in the early days of the K7, which
ironically was a comparatively better product than Intel's P3 then,
and it's now still $7+. If AMD survived on a single key product back
then and nobody bought them out at $3+, I don't see why they can't
survive now with an overall stronger product portfolio and a share
price double those times.

Despite what you *claim* about nobody wanting to buy AMD products, I
see regular messages about AMD/ATI 3850/3870 selling out locally.
That's at least one wing that's still flying reasonably even if not
outperforming the competition. The X2 processors are still selling due
to their relatively cheap prices for the performance.
The "nobody wants to buy" is obvious hyperbole, but there is a really
big problem here that I actually haven't seen get much discussion.
AMD's huge win in the last few years has been to get Tier 1 vendors to
take them seriously. People stuck with Intel because they knew that
if they got burned, so would everyone else. Now those who have taken
a chance on Barcelona are in trouble in exact proportion as they
banked on AMD. The fact that I don't particularly admire AMD has
nothing to do with it. People don't want to take risks that won't pay
off. A bet on Opteron was a bet worth making, and it paid. What
corresponding bet does AMD have to offer now or in the foreseeable
future that would justify vendors risking getting hung out to dry like
they just did? Why *would* anyone want to buy on a scale that will
matter to a company of AMD's size (and with the debt it has on its
balance sheet)?
Actually, I think the real difference between the two is that nobody
saw it as a mistake Intel couldn't recover from. However for AMD,
this would look like a killing blow on top of the underwhelming
performance against competition for a new product generation. While I
vaguely remember outcry against Intel for that bug, I don't remember
anybody saying that it's going to sink Intel. There just wasn't
sufficient competition capacity to take over a company with over 90% of
the market share. Thus the difference in perceived impact.
There's an interesting argument that fixing the FDIV bug cost Intel
about what a major advertising campaign would cost. In other words,
the FDIV bug may have paid for itself in terms of public awareness of
what was then still very much a "peecee" processor. No similar fairy
dust is going to settle on AMD over Barcelona, which has gone from
being AMD's next big threat to Intel dominance to synonym for screw-up
and late delivery. The more fair comparison might be to Itanium,
except that Intel could afford its Itanium mistakes.

Maybe AMD will dance away from this one the way they've danced away
from so many disasters in the past. If they do, the secret is in
balance sheets and corporate deals; for example

http://news.bbc.co.uk/2/hi/technology/7149704.stm

Despite what the article says about Intel being about just one
product, almost no one in the business wants to see Intel with
essentially a monopoly on the technology at the end of the yellow
brick road.

Robert.
 
Whatever. Such recalls do happen. It's a serious blow to AMD (as it delays
their more competitive products and causes them to lose the Christmas season),

The affected products are high end server CPUs. Ever heard of a
regular home user Joe 6pack running virtualization stuff???<grin/>
Sales thereof may be affected by the holiday season only in a negative
way. Buying new servers is probably not the first priority of execs
during these days. Ever heard of anyone getting a server as
Xmas/Hanukka/New Year/whatever present?

NNN
 
The affected products are high end server CPUs. Ever heard of a
regular home user Joe 6pack running virtualization stuff???<grin/>
Sales thereof may be affected by the holiday season only in a negative
way. Buying new servers is probably not the first priority of execs
during these days. Ever heard of anyone getting a server as
Xmas/Hanukka/New Year/whatever present?

On the other hand, the year end is often the fiscal year end, and money
assigned to be spent but not yet spent finally should be. Or that old stupid
fallacious logic could fire: you didn't spend the money this year so you don't
need it, so you won't get it next year.

rgds
\SK
 
Robert said:
AMD bought it on this one. Perhaps AMD should have had
you go out and address investors. You'd have been a big hit.
Your comment that "both AMD & Intel fixes reduce the perofrmance [sic]
a bit" is like Yousuf coming out with the item about Intel's bug right
after the AMD bug, as if they canceled one another out. Go look at
the financial press,

Whatever. The financial press is a poor source of technical info.

You think the articles (and press releases) you get on the internet
are a *good* source of information?

Stock price is a very poor indicator of performance.
Your contempt for markets is revealing. A stock price is the
cumulative opinion of many people who follow the stock and who wager
actual money (not usenet bandwidth) on their opinions.

The vast majority of stock is traded without tracking the companies. Ever
heard about algorithmic trading? And the vast majority of human traders base
their opinions on various indicators not connected directly with the company's
product. And the reason is simple -- they do not understand what the product is.
The United
States (in particular) has companies like Intel (and, yes, even AMD)
because it so efficiently predicts and rewards success and predicts
and punishes failure through market mechanisms.

Chattering away like this is an interesting pastime, but it doesn't
affect anything of importance. Even much less so now than it used
to. You may not think much of the dimwits who majored in management,
but they can buy and sell as many techies as they need to find out
what's going on.

Then such a great manager, who does not see the difference between a can of
soft drink and a computer, drives the company to the edge of bankruptcy.

Ever noticed that the most successful technical companies are led to
their biggest successes by people who know the technical stuff? Just look at
Intel or Microsoft. Then compare Intel & AMD -- both started as Fairchild
offspring; Intel was led by guys who were semiconductor specialists (besides
being good managers) and AMD was run by Jerry Sanders, who was just a
management specialist. Compare the performance of the two.

And great technical companies have trouble if they are taken over by those
management majors you're praising. See Apple, DEC and even Intel in its
time of little trouble.

The question here is whether AMD will even survive. For one thing,
the stock is selling below book.

Not for the first time.
That makes AMD a takeover target.

The book value of high-tech companies depreciates quickly; a fab now worth 3
billion is not worth half as much 5 years down the road. So either the stock
price must plummet much deeper, or a potential buyer shelling out those
billions must be pretty sure the company will be profitable, and more
profitable than other available ways of investing. The past history of AMD
casts serious doubts on that.
Would an AMD that was bought in a leveraged buyout continue the
ruinous war with Intel it's undertaken?

They have no other option, unless they buy for much much less than it's
worth now.
I certainly hope not.

AMD is now just a two-pony ride. No one can turn it around without throwing
out much of the company.
Only
time will tell.

That's for sure

[...]
The sum of the opinion here is that it wishes to minimize the
seriousness of what has happened with AMD.

This last thing is just a little addition. AMD was the worst performer of the
league the entire year. The stock was below 8 before the news wrt that bug
struck anyway. So they "worked" on their current valuation the whole year.

rgds
\SK
 
The little lost angel said:
Actually, I think the real difference between the two is that
nobody saw it as a mistake Intel couldn't recover from. However
for AMD, this would look like a killing blow on top of the
underwhelming performance against competition for a new product
generation. While I vaguely remember outcry against Intel for
that bug, I don't remember anybody saying that it's going to
sink Intel. There just wasn't sufficient competition capacity
to take over a company with over 90% of the market share. Thus
the difference in perceived impact.

In addition to this very valid commercial argument, there is also a
fundamental difference between this AMD "bug" and the Intel FDIV bug:
the AMD bug might slow machines a bit under unusual circumstances.
The Intel FDIV bug gave erroneous results in a few cases.
People worry _far_ more about data corruption than machine speed.
The FDIV bug might well have sunk AMD. This will not.

Actually, I find the continued harping on the AMD bug interesting
in and of itself. No-one complains of things they don't consider
a threat. AMD must be considered a serious threat. I'm not sure
why: AMD milked past superior performance into higher ASPs
which Intel first was happy to follow, then started to undercut
to recover market share.

I believe actual performance is increasingly hard to measure.
MHz hasn't cut it for a long time. Intel has focussed on
bandwidth and cache. AMD has worked on latency. As a result,
each processor excels at some tasks and is only mediocre
at others. There is no one "best" CPU.

-- Robert
 
Robert said:
A number of assertions have been made here about the AMD TLB L3 Bug:

The bug I can forgive. What's really unfortunate is the dishonest
benchmarks. It appears they've handled this as poorly as Intel
handled the FDIV bug.
 