Simple Hardware Clock question

  • Thread starter: 50295

50295

Ok, so I just read that the clock synchronizes the other hardware
components of a computer system; meaning that, because the processor is
faster than the RAM, for instance, or the hard disk, the next CPU
instruction is delayed until the next clock tick, in order to ensure
that each component completes its operation before the next phase. And
for this reason, the clocks often run at relatively slow speeds such as
333 MHz - much slower than the 3 GHz CPUs that we have now.

If this is so, one may say that this CPU executes no more than
333,000,000 instructions per second!

Is this so?

Thanks for helping the Noob

Olumide
 
If this is so, one may say that this CPU executes no more than
333,000,000 instructions per second!

Is this so?

Yes, if you write lousy software.

There are a number of techniques that help improve performance. To name a
few:
- on-chip CPU memory cache
- off-chip/external CPU memory cache
- interrupts (as opposed to continuous polling/busy-waiting)
- DMA
- code optimization to remove any redundancy in both calculations and
memory/device accesses

Got the idea? :)

Alex
 
No. The external interface of the CPU can be 333 MHz, but the core is 3 GHz.
The core is stalled only if accessing the external interface is absolutely
necessary and the core has nothing else to do.
 
Thanks Alexei!

I understand how numbers (1) and (5) can help, but not the others.
Putting your answer together with Maxim's, is it correct to say that all
these techniques do NOT require the external interface?

- Olumide
 
You need to distinguish all the different usages of "clock" in a
computer system.
 
You need to distinguish all the different usages of "clock" in a
computer system.

Well, at least you aren't asking ME to - I retire in only 10 years.
The original poster MAY be young enough to complete that task.


Regards,
Nick Maclaren.
 
Thanks Alexei!

I understand how numbers (1) and (5) can help, but not the others.
Putting your answer together with Maxim's, is it correct to say that all
these techniques do NOT require the external interface?

2 (off-CPU memory cache) helps just like the other cache. It's basically a
hierarchy of caches, each working at its own speed, and the closer the cache
is to the CPU, the faster the data retrieval. But if the cache does not
contain the information the CPU needs, the dirty work has to be done anyway,
i.e. a read from main memory.
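To see why a cache hit matters so much, here is a minimal C sketch (mine,
not from the thread): the same array is summed once sequentially, where
consecutive elements share cache lines, and once with a large stride, where
nearly every access has to go out to main memory. Timing the two loops on a
real machine shows the same number of additions taking very different
amounts of time.

#include <stdio.h>
#include <stdlib.h>

#define N      (1 << 24)   /* 16M ints - far larger than any CPU cache */
#define STRIDE 4096        /* jump far enough that each access misses  */

int main(void)
{
    int *a = calloc(N, sizeof *a);
    long sum = 0;
    if (!a)
        return 1;

    /* Sequential pass: consecutive ints share cache lines, so most
       accesses are served from the cache instead of main memory. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    /* Strided pass: each access lands on a different cache line, so
       the CPU stalls on main memory almost every iteration. */
    for (int j = 0; j < STRIDE; j++)
        for (int i = j; i < N; i += STRIDE)
            sum += a[i];

    printf("%ld\n", sum);  /* keep the loops from being optimized away */
    free(a);
    return 0;
}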

3 (interrupts, and multithreading in general): suppose you're waiting for a
key in your application, and all your system and application software is
single-threaded, i.e. no multiprocessing of any kind, no parallelism. The
easiest and least effective approach is a loop like this:
while (!kbhit()) { do_something(); }  // kbhit() is from <conio.h>
This simply wastes CPU time that could have been used for something more
useful, like parallel calculations in some background activity. This is
where interrupts help -- instead of waiting in an infinite loop and doing
nothing, you install a keyboard interrupt routine that is called once per
key press/release, as opposed to some millions of calls to kbhit() in a
loop. You advance your state machine upon the keyboard event, using as
little CPU time as needed, with no excessive overhead.
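As a rough sketch only (the thread gives no code for this, and the two
platform calls below, install_keyboard_handler() and wait_for_event(), are
made-up stand-ins for whatever the real OS or BIOS provides), the
interrupt-driven version might look like this:

/* Hypothetical platform services -- the names are invented purely to
   illustrate the interrupt-driven structure. */
extern void install_keyboard_handler(void (*isr)(int scancode));
extern void wait_for_event(void);     /* halts the CPU until an interrupt */

static volatile int last_key = -1;    /* written by the ISR, read below   */

/* Runs once per key press, instead of millions of kbhit() polls. */
static void on_key(int scancode)
{
    last_key = scancode;
}

void event_loop(void)
{
    install_keyboard_handler(on_key);
    for (;;) {
        wait_for_event();             /* CPU sleeps; no busy-waiting */
        if (last_key >= 0) {
            /* advance the application's state machine here */
            last_key = -1;
        }
    }
}

The point is structural: the CPU spends its time either doing useful work
or sleeping, and the keyboard only costs anything when a key actually
arrives.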

4 (DMA): this tiny bit of circuitry does memory-to-device I/O transparently
to the CPU. It costs very little CPU time because the CPU is interrupted
only when there is data ready for it, or when data can be taken from it --
no loops like the one above. Also, DMA usually works with blocks of data,
which again helps to minimize the overhead (you get one interrupt per block
of bytes as opposed to one interrupt per byte).
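A sketch of the programming model (the register names and addresses below
are entirely made up; real DMA controllers differ): the CPU describes a
whole block once, kicks off the transfer, and gets a single interrupt when
the block is done, instead of moving every byte itself.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical memory-mapped DMA controller registers, invented for
   illustration only. */
#define DMA_SRC    (*(volatile uint32_t *)0x40001000u)
#define DMA_DST    (*(volatile uint32_t *)0x40001004u)
#define DMA_COUNT  (*(volatile uint32_t *)0x40001008u)
#define DMA_CTRL   (*(volatile uint32_t *)0x4000100Cu)
#define DMA_START  0x1u

static volatile int dma_done;      /* set by the DMA-complete interrupt */

void dma_complete_isr(void)        /* one interrupt per block, not per byte */
{
    dma_done = 1;
}

void copy_block(uint32_t src, uint32_t dst, size_t len)
{
    dma_done  = 0;
    DMA_SRC   = src;
    DMA_DST   = dst;
    DMA_COUNT = (uint32_t)len;
    DMA_CTRL  = DMA_START;         /* the controller copies the block itself */

    while (!dma_done) {
        /* the CPU is free to do unrelated work here (or halt) rather
           than moving each byte through a register */
    }
}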

Read some computer architecture book, like Tanenbaum's...

Alex
 

Thanks Alexei,

I know about all this - trust me, but I fail to see how the external
cache, or the use of interrupts, or DMA can cause the CPU to execute
more than 1 instruction in a hardware clock cycle. What I'm trying to
say is that I fail to see how the external cache, or the use of
interrupts, or DMA constitute an internal interface for/of the CPU.
(I really like Maxim's answer ;-). Are you there, Maxim?)

- Olumide
 
I fail to see how the external cache, or the use of interrupts, or DMA
can cause the CPU to execute more than 1 instruction in a hardware clock
cycle.

Several execution units can execute several instructions per cycle, if they are
not dependent on one another.
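A tiny C illustration of the point (my example, not Maxim's; whether the
hardware actually pairs these depends on the compiler's output and the
exact CPU, but the data dependences are what the superscalar scheduler
looks at):

int independent(int a, int b, int c, int d)
{
    int x = a + b;   /* these two additions use neither result of the */
    int y = c + d;   /* other, so two ALUs can execute them together  */
    return x ^ y;
}

int dependent(int a, int b, int d)
{
    int x = a + b;   /* y needs x, so this pair forms a chain and     */
    int y = x + d;   /* cannot be issued in the same cycle            */
    return y;
}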
 
Maxim S. Shatskih said:
Several execution units can execute several instructions per cycle, if they are
not dependent on one another.

Right, and now you have CPUs with several cores or with that hyperthreading
feature, so you can effectively get more than 1 instruction per clock due
to the parallelism. Intel x86 CPUs probably don't have many useful
instructions that take just 1 clock :)

What I was trying to say in my previous posts is that even though the
circuitry that is connected to the CPU can be rather slow (effectively
running with slower clocks than that of the CPU), it just doesn't mean the
CPU itself starts running as slow as they do.

Alex
 
(e-mail address removed) wrote in comp.os.linux.hardware:

[snip]
I know about all this - trust me, but I fail to see how the external
cache, or the use of interrupts, or DMA can cause the CPU to execute
more than 1 instruction in a hardware clock cycle.
[snip]

There are many clocks in a PC. For example, the CPU's clock might be
running seventeen times faster than the motherboard's main bus.

It is quite normal for the CPU to do many things between motherboard
clock ticks. Hence the need for the main cache.
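To put rough numbers on that (mine, not Peter's, and real latencies vary
widely): with a 3 GHz core, a bus running at one seventeenth of that speed,
and something like 60 ns to reach main memory, a single uncached access
costs the core on the order of 180 of its own cycles.

#include <stdio.h>

int main(void)
{
    /* Illustrative figures only. */
    double core_hz       = 3.0e9;           /* 3 GHz CPU core           */
    double bus_hz        = core_hz / 17.0;  /* ~176 MHz front-side bus  */
    double mem_latency_s = 60e-9;           /* ~60 ns to main memory    */

    printf("bus clock: %.0f MHz\n", bus_hz / 1e6);
    printf("core cycles lost per uncached access: %.0f\n",
           mem_latency_s * core_hz);
    return 0;
}

Every one of those lost cycles is work a cache hit would have let the core
do instead, which is the need for the main cache in a nutshell.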
 
Thanks Peter D. That really makes sense! So the CPU and the caches use
separate clocks ... interesting ... Do you know where I can find more
information (e.g. websites) on the number of clocks in the average PC, and
what the clocks do? Most textbooks leave out this bit of info, trust me,
I've searched.

Thanks again,

- Olumide
 
This has been a most instructive thread for the marginally technical
such as myself.

Is it not generally true that faster disk storage (and enough memory to
prevent swapping) will do more for a system than a marshmallow-toasting
CPU?

I recently bought the bottom rung AMD 64 (2800+) to grace my Asus K8N-E
Deluxe mobo. I figured that a pair of Silicon Image RAID-0'ing
Hitachi 80GB 7200 rpm 8MB cache SATA drives would do more to improve the
typical user's experience than a high-end processor. And I think I was
right. (In case I get caught, yes, this is my son's Winders gaming
box).

If I can ever figure out how to get the Highpoint Rocket Raid 1520 SATA
RAID PCI card I bought working on my beloved Matsonic MSCLE266-F
Debian Sarge box, I'm expecting even greater things from my 796.113 MHz
Samuel 2.

I'm as big a sucker for GHz and L2 cache as the next guy, but I do
believe a faster CPU rarely lives up to expectations, whereas faster
storage never fails to please.

Toney


 
If this is so, one may say that this CPU executes no more than
333,000,000 instructions per second!


The internal cache runs at the same speed as the CPU, so the CPU can
execute "complex" calculations out of its cache and then send the result
to memory.

For example, doing an interpolation of two values:

r = a[i] + (a[i+1] - a[i]) * fraction_of(i);

With i's value in a register, you read two values, a[i] and a[i+1]; the
second a[i] is already cached, so memory isn't accessed again. You
calculate the difference into a CPU register (no memory access), multiply
by fraction_of(i) (also in a register: no memory access), add a[i]
(register again) and store r to memory.

So we only did 3 memory accesses, and as the memory bus is usually wider
(64, 128 or even 256 bits) than the data we're working on (32 bits), a[i]
and a[i+1] might have been read at the same time, making it just 2 memory
accesses for 8 operations, some of which could take more than one CPU
cycle.

If this code is in a loop and the loop fits in the instruction cache, then
no instruction fetches from memory will be done for most of the iterations
(only the first iteration will load the instruction cache).
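Spelled out as a loop (my reconstruction - the post only shows the single
expression, and the function below is just one plausible shape for it):

/* Linear interpolation: resample n input points into m output points
   (assumes n >= 2 and m >= 2).  The loop body compiles to a handful of
   instructions that fit easily in the instruction cache, so after the
   first iteration only the data accesses touch memory. */
void resample(const float *a, int n, float *r, int m)
{
    for (int j = 0; j < m; j++) {
        double pos = (double)j * (n - 1) / (m - 1);
        int    i   = (int)pos;         /* index of the left-hand sample  */
        if (i > n - 2)
            i = n - 2;                 /* keep a[i+1] inside the array   */
        double frac = pos - i;         /* fraction of the way to a[i+1]  */

        /* a[i] and a[i+1] usually share a cache line, and the arithmetic
           itself runs entirely out of registers. */
        r[j] = (float)(a[i] + (a[i + 1] - a[i]) * frac);
    }
}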


So the CPU does do more work than, say, 333 MHz would suggest, but it also
does MUCH less work than its 3.33 GHz clock would allow if it ran on
3.33 GHz memory.

Some CPUs can even overheat if you make them run too efficiently, as
they're expected to be regularly slowed down by waiting on slow memory.
 
Alexei said:
Right, and now you have CPUs with several cores or with that hyperthreading
feature, so you can effectively get more than 1 instruction per clock due
to the parallelism.

All modern CPUs (since about 1980) are pipelined in some form, meaning
that the work of an individual instruction is broken up into several
stages, each taking a clock cycle.

A common analogy is doing laundry: there is a washer and a dryer. When
the first load A finishes washing, we can put it in the dryer, but
while A is drying, we can start the next load B in the washer. Then A
finishes drying and B finishes washing. A is now done, and B moves to
drying while the next load C starts washing. If the time for washing
and drying is T, then we achieve 1/T loads throughput, while each load
actually takes 2T to complete.

In modern CPUs like the Athlon or Pentium 4, the pipeline can be as
long as 10 or 20 stages. Therefore even though each instruction takes
10 or 20 cycles, they are pipelined so that we can achieve 1
instr/cycle throughput.
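Putting rough numbers on the analogy (my figures, not the poster's): with a
10-stage pipeline clocked at 3 GHz, each instruction spends 10 cycles in
flight, yet one instruction can retire every cycle once the pipeline is
full.

#include <stdio.h>

int main(void)
{
    /* Illustrative figures: a 10-stage pipeline at 3 GHz, ignoring
       stalls, branches and cache misses. */
    double clock_hz = 3.0e9;
    int    stages   = 10;

    double cycle_s   = 1.0 / clock_hz;     /* ~0.33 ns per cycle       */
    double latency_s = stages * cycle_s;   /* time for ONE instruction */

    printf("latency per instruction: %.2f ns\n", latency_s * 1e9);
    printf("peak throughput: %.1f billion instructions/s\n", clock_hz / 1e9);
    return 0;
}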

For more information, see:

Computer Architecture: A Quantitative Approach, John Hennessy, David
Patterson
 

Right, yet clear enough for a housewife to understand.

Alex
 
In modern CPUs like the Athlon or Pentium 4, the pipeline can be as
long as 10 or 20 stages. Therefore even though each instruction takes
10 or 20 cycles, they are pipelined so that we can achieve 1
instr/cycle throughput.

More so.

Even the P5 Pentium was capable of running 2 instructions in the flow in
parallel, provided they do not depend on one another (the operands of the
second are not altered by the first).

This feature is called "superscalar" execution. SPARC CPUs are even better
at this.

The weak point of a superscalar design is that the decision on
parallelizing is made at runtime by the CPU hardware, which cannot keep a
large context.

A Very Long Instruction Word (VLIW) CPU like IA-64 shifts this burden to
the compiler. The compiler (which can keep a huge context) decides how to
parallelize the operations between several execution units.

The downsides are the huge complexity of the compiler and of the assembly
language (it is nearly impossible to write assembler by hand; there is too
much context to keep in one's head).

Yet another approach to fast CPUs: throw away complexity, use the saved
silicon space for cache, and raise the frequency as high as possible. The
Pentium 4 and Alpha went this way (Alpha even stripped complexity out of
the assembly language - it only has 64-bit arithmetic; if you want byte
arithmetic, you write a subroutine).
 
The downsides are the huge complexity of the compiler and of the assembly
language (it is nearly impossible to write assembler by hand; there is too
much context to keep in one's head).

In a former job, I was working with (among other things) TI 'C6x DSPs,
with an exposed-pipeline VLIW architecture.

A TON of effort went into hand-optimizing inner loops that exactly fit
inside the on-chip memory. TI provided an intermediate form of assembly
language that helped quite a bit -- you could write your independent
instruction sequences and the optimizer could USUALLY arrange them
optimally.

But we had abundant war stories about squeezing the last smidgeon of
performance out of ten instructions.
 