Are you sure about that? That seems really inefficient to me and would
contribute to propagation delays. The North Bridge (which still exists)
would have to receive all the data and forward it to/from the A64 memory
controller and the PCI/PCIe/AGP buses rather than handling all of this
itself? For example, on an AGP access, say from the video card to system
memory, the AGP bus would transfer its address request to the North
Bridge, the North Bridge would then have to forward this request to the
A64 memory controller, which would then fetch the AGP memory data and
send it back to the North Bridge, and then the North Bridge would send
it back to the video card. It would be even worse if we hit a TLB miss
in the North Bridge, because then we have to request the GART entry from
the A64 memory controller. DMA access would be similarly degraded. No
matter how fast the bus between the A64 memory controller and the North
Bridge is, there would still be more latency than if the North Bridge
could handle it all itself.

There will be a (very) small latency hit on DMA and AGP memory data
accesses when compared to using an external memory controller.
However, that is *WAY* more than offset by the (comparatively large)
decrease in latency on CPU memory requests.
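To put some rough numbers on that tradeoff, here's a back-of-envelope
sketch in C. Every figure in it (the 95/5 traffic split, the 40ns saved
per CPU access, the 30ns forwarding penalty) is an illustrative
assumption made up for the example, not a measurement of any real
chipset:

/* Back-of-envelope sketch of the tradeoff. All figures are rough,
 * illustrative assumptions, not measured values for any real chipset. */
#include <stdio.h>

int main(void)
{
    double cpu_fraction   = 0.95;  /* assume most memory traffic is CPU-initiated */
    double io_fraction    = 1.0 - cpu_fraction;

    double cpu_savings_ns = 40.0;  /* assumed latency saved per CPU access by the
                                      on-die controller (no external bridge hop)  */
    double io_penalty_ns  = 30.0;  /* assumed extra latency per DMA/AGP access
                                      that is now forwarded over HyperTransport   */

    double net_change_ns = io_fraction * io_penalty_ns
                         - cpu_fraction * cpu_savings_ns;

    printf("Average change per memory access: %.1f ns\n", net_change_ns);
    /* A negative result is a net win: the small hit on I/O accesses is
       swamped by the savings on the far more common CPU accesses. */
    return 0;
}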
The high-bandwidth, low-latency design of HyperTransport, which is used
to link the I/O chips (call them a "north bridge" if you like, though
it's not particularly accurate), means that you're adding only a few
nanoseconds of latency to forward the request on, usually on the order
of 20-40ns (maybe even less). Remember that this link was designed with
this sort of forwarding of memory requests in mind, since that's exactly
what it does in a multiprocessor setup when accessing remote memory. If
we compare this to some common sources of DMA access, the latency
becomes pretty negligible. For example, hard drives have latencies up in
the millisecond range, so an extra 30 nanoseconds or so is totally
invisible. The same goes for network cards.
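Just to make those orders of magnitude concrete (the latencies below
are, again, rough assumed round numbers rather than measurements):

/* How big is an extra ~30ns hop relative to typical DMA sources?
 * All numbers are assumed order-of-magnitude figures. */
#include <stdio.h>

int main(void)
{
    double ht_hop_ns = 30.0;    /* assumed HyperTransport forwarding hop      */
    double disk_ns   = 8.0e6;   /* ~8 ms hard drive seek + rotational latency */
    double nic_ns    = 50.0e3;  /* ~50 us assumed NIC packet/interrupt path   */

    printf("HT hop vs. hard drive:   %.5f%% overhead\n", 100.0 * ht_hop_ns / disk_ns);
    printf("HT hop vs. network card: %.3f%% overhead\n", 100.0 * ht_hop_ns / nic_ns);
    return 0;
}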
The only place where this really comes into play is video cards, and in
particular, shared memory video cards. AGP/PCI-Express cards with
built-in memory don't really need to worry much about transferring
data to main memory on the fly since *MOST* of the important data is
kept in local memory on the card itself. The difference in latency
and bandwidth between local memory and remote memory is HUGE, so an
extra 30ns of latency and virtually no hit to bandwidth doesn't end up
changing things much. When you're looking at "really slow" vs. "the
tiniest bit slower", usually you don't worry too much.
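The same ratio argument can be sketched for a discrete card that does
have to touch system memory now and then. The latencies below are once
more made-up but plausible round numbers, purely for illustration:

/* Local (on-card) memory vs. system memory as seen by a discrete card.
 * All latencies are assumed illustrative values, not vendor specs. */
#include <stdio.h>

int main(void)
{
    double local_vram_ns = 40.0;   /* assumed on-card memory access              */
    double remote_ext_ns = 300.0;  /* assumed trip over AGP/PCI-Express to system
                                      memory via an external memory controller   */
    double ht_hop_ns     = 30.0;   /* extra hop to reach the A64's on-die
                                      memory controller instead                  */

    double remote_a64_ns = remote_ext_ns + ht_hop_ns;

    printf("Remote vs. local, external controller: %.1fx slower\n",
           remote_ext_ns / local_vram_ns);
    printf("Remote vs. local, on-die controller:   %.2fx slower\n",
           remote_a64_ns / local_vram_ns);
    /* Either way the remote path is several times slower than local
       memory; the extra hop only nudges that ratio. */
    return 0;
}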
However, with shared memory video, things get a bit trickier. Here
you're ALWAYS dealing with remote memory, and you're always limited by
bandwidth and latency. Again, though, this is a bit of a comparison
between "bad" and "just slightly worse", and the goal in designing
integrated video is ALWAYS to reduce the bandwidth needs and hide
latency, regardless of what platform you're using.