Hardware Support for Protection Bits: what does it really mean?

Maria

Hi,

My understanding of "hardware support for an operation" is that the
operation does not have to be coded either by the ordinary user or by
the OS; it is implemented without *wasting any CPU memory cycles*.

However, I find it hard to understand how this can be done when
updating the valid bit, the dirty (modified) bit, and the use
(reference) bit.

Let's consider each bit separately

1- Valid bit: I understand the utility of this bit, but what exactly
does hardware support for the "valid bit" mean? Does it mean this bit
is checked once a page table entry is read and loaded into a memory
management unit (MMU) register? In other words, is there an AND gate
that checks whether this bit is one or zero, so that the CPU is
interrupted if the valid bit is zero?
I searched Google in vain for a detailed hardware architecture of the
memory management unit (MMU)... any link?

Also, it is claimed that the OS sets the valid bits of all page table
entries to zero once it allocates a page table to a process. Does this
have to be an entry-by-entry write operation?


2- Dirty bit/use bit: these are particularly important in virtual
memory. They are part of each page table entry. I really don't
understand how these bits are set without losing CPU cycles. I don't
believe that the DRAM has a dedicated bit line to set/reset these
bits. Am I wrong?

In http://www.stanford.edu/class/cs140/projects/pintos/pintos_5.html,
section 5.1.2.3, the author states:
"Most of the page table is under the control of the operating system,
but two bits in each page table entry are also manipulated by the CPU.
On any read or write to the page referenced by a PTE, the CPU sets the
PTE's accessed bit to 1; on any write, the CPU sets the dirty bit to 1.
The CPU never resets these bits to 0, but the OS may do so."


In either case we will be losing CPU memory cycles, and as such any
application's execution is slowed down, since each instruction
consumes some (unnecessary) CPU cycles to update the reference bit and
possibly the dirty bit. Any comment?

In another reference
http://www.linuxrocket.net/index.cgi?a=MailArchiver&ma=ShowMail&Id=361273

Linus Torvalds says
"The thing is, we should always set the dirty bit either atomically
with the access (normal "CPU sets the dirty bit on write") _or_ we
should set it after the write (having kept a reference to the page)."

What does he mean by "set atomically"? And how is this done?

Many thanks for your help :)
 
Maria said:
Hi,

My understanding of "hardware support for an operation" is that the
operation does not have to be coded either by the ordinary user or by
the OS; it is implemented without *wasting any CPU memory cycles*.

However, I find it hard to understand how this can be done when
updating the valid bit, the dirty (modified) bit, and the use
(reference) bit.

Let's consider each bit separately

1- Valid bit: I understand the utility of this bit, but what exactly
does hardware support for the "valid bit" mean? Does it mean this bit
is checked once a page table entry is read and loaded into a memory
management unit (MMU) register? In other words, is there an AND gate
that checks whether this bit is one or zero, so that the CPU is
interrupted if the valid bit is zero?
I searched Google in vain for a detailed hardware architecture of the
memory management unit (MMU)... any link?

There's no unified or standardized MMU; that's why you can't find it.
You can only come across particular implementations of the general
idea.
Also, it is claimed that the OS sets the valid bits of all page table
entries to zero once it allocates a page table to a process. Does this
have to be an entry-by-entry write operation?


2- Dirty bit/use bit: these are particularly important in virtual
memory. They are part of each page table entry. I really don't
understand how these bits are set without losing CPU cycles. I don't
believe that the DRAM has a dedicated bit line to set/reset these
bits. Am I wrong?

It's not the memory chip that sets these control bits; it's the CPU
that sets them every time it accesses the corresponding memory regions
(pages).
In http://www.stanford.edu/class/cs140/projects/pintos/pintos_5.html,
section 5.1.2.3, the author states:
"Most of the page table is under the control of the operating system,
but two bits in each page table entry are also manipulated by the CPU.
On any read or write to the page referenced by a PTE, the CPU sets the
PTE's accessed bit to 1; on any write, the CPU sets the dirty bit to
1. The CPU never resets these bits to 0, but the OS may do so."


In either case we will be losing CPU memory cycles ...

You can't eat your cake and have it too. But what if the PTE is in the
cache? And what if it is a big page, so that far fewer PTEs are
involved?
... and as such any application's execution is slowed down, since each
instruction consumes some (unnecessary) CPU cycles to update the
reference bit and possibly the dirty bit. Any comment?

There are CPUs that do not set those bits automatically; instead they
force an exception so that software can set the bits. That's much
worse, and it's the inverse of the automatic setting done by the CPU.
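
To make that concrete, here is a minimal user-space sketch in C of
such a software-managed scheme (the names and the bit layout are
invented for illustration, not taken from any real architecture): the
OS leaves a new mapping invalid/read-only, takes a fault on the first
access, records the accessed/dirty information itself, upgrades the
mapping, and lets the CPU retry the instruction. Every first touch
costs a full exception, which is why this is so much slower than
having the CPU set the bits.

    #include <stdbool.h>
    #include <stdint.h>

    /* illustrative bit layout -- not any real PTE format */
    #define PTE_VALID     0x1u
    #define PTE_WRITABLE  0x2u
    #define SOFT_ACCESSED 0x4u   /* tracked purely by the OS */
    #define SOFT_DIRTY    0x8u   /* tracked purely by the OS */

    /* Called from the trap path on the first read or write to a page.
       The handler notes the access in software, upgrades the mapping,
       and returns; the CPU then re-executes the faulting instruction. */
    void on_first_touch_fault(uint32_t *pte, bool is_write)
    {
        *pte |= SOFT_ACCESSED | PTE_VALID;     /* record the reference */
        if (is_write)
            *pte |= SOFT_DIRTY | PTE_WRITABLE; /* record the change    */
    }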
In another reference
http://www.linuxrocket.net/index.cgi?a=MailArchiver&ma=ShowMail&Id=361273

Linus Torvalds says
"The thing is, we should always set the dirty bit either atomically
with the access (normal "CPU sets the dirty bit on write") _or_ we
should set it after the write (having kept a reference to the page)."

What does he mean by "set atomically"? And how is this done?

That is a discussion of a race condition: between the changes to the
state of the two things (the dirty bit and some other structure),
something else can happen that modifies one of them and hence makes
the two inconsistent. I don't want to go deep into the details of that
particular problem, but by doing things atomically he meant doing them
in a way that nothing else can happen in between. Just that. The issue
generally arises when you have threads that can preempt each other or
multiple processes sharing memory.
Many thanks for your help :)

You're welcome.

Alex
 
"The thing is, we should always set the dirty bit either atomically
That is a discussion of a race condition, where between changing of the

SMP Windows sets the dirty bit manually in the MiSetDirtyBit routine.
UP Windows relies on the hardware to do this.
 
Maria said:
1- Valid bit: I understand the utility of this bit, but what exactly
does hardware support for the "valid bit" mean? Does it mean this bit
is checked once a page table entry is read and loaded into a memory
management unit (MMU) register? In other words, is there an AND gate
that checks whether this bit is one or zero, so that the CPU is
interrupted if the valid bit is zero?

Exactly, MMU hardware typically tests a valid bit and decides whether to
proceed with the page access or generate a page fault.
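
In effect the check is a one-bit test wired into the translation path.
Below is a toy C model of what the MMU does on every translation; the
PTE layout, page size, and tiny single-level table are illustrative
assumptions, and the "page fault" is modelled as a -1 return.

    #include <stdint.h>
    #include <stdio.h>

    #define PTE_VALID  0x1u                     /* low bit = valid (assumed) */
    #define PAGE_SHIFT 12                       /* 4 KiB pages               */
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)
    #define NPAGES     4                        /* tiny table for the demo   */

    static uint32_t page_table[NPAGES] = {
        [0] = (5u << PAGE_SHIFT) | PTE_VALID,   /* vpn 0 -> frame 5, present */
        [1] = 0,                                /* vpn 1 -> valid bit clear  */
    };

    /* Fetch the PTE, test the valid bit, then translate or "fault". */
    long translate(uint32_t vaddr)              /* vaddr assumed in range    */
    {
        uint32_t pte = page_table[vaddr >> PAGE_SHIFT];
        if (!(pte & PTE_VALID))
            return -1;                          /* page fault: trap to OS    */
        return (long)((pte & ~PAGE_MASK) | (vaddr & PAGE_MASK));
    }

    int main(void)
    {
        printf("0x0042 -> %ld\n", translate(0x0042));  /* 0x5042 = 20546     */
        printf("0x1042 -> %ld\n", translate(0x1042));  /* -1, i.e. a fault   */
    }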
Also, it is claimed that the OS sets the valid bits of all page table
entries to zero once it allocates a page table to a process. Does this
have to be an entry-by-entry write operation?

Most likely. Don't be too concerned about the time taken for this step,
which is probably much smaller than the time to create an address space
or to process one page fault (which probably happens very soon after
initializing a page table).
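
For what it's worth, "entry by entry" usually just means ordinary
stores over the whole table. A sketch in C, where the table size and
PTE width are assumptions:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define ENTRIES 1024   /* e.g. one 4 KiB page of 4-byte PTEs (assumed) */

    /* Allocate a page table with every entry invalid.  The memset is still
       a stream of ordinary memory writes at the hardware level -- there is
       no special bulk "invalidate" -- but it is cheap next to creating an
       address space or servicing the first page fault. */
    uint32_t *alloc_page_table(void)
    {
        uint32_t *pt = malloc(ENTRIES * sizeof *pt);
        if (pt != NULL)
            memset(pt, 0, ENTRIES * sizeof *pt);  /* valid bit = 0 everywhere */
        return pt;
    }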
2- Dirty bit/use bit: these are particularly important in virtual
memory. They are part of each page table entry. I really don't
understand how these bits are set without losing CPU cycles. I don't
believe that the DRAM has a dedicated bit line to set/reset these
bits. Am I wrong?

You are right: there is a cost to updating these bits in a page table
entry (PTE). However, it is only incurred when the bit is not already
set, so no more than once (per bit) per time that the OS comes along
and clears the bit as part of some housecleaning operation, which
again costs much more than the hardware update of the bit(s). [The
Dirty bit might be set once per CPU, since it could be not-Dirty in
the cache of one CPU long after another CPU has set the bit in
memory.]

There are systems that can write small portions of a memory word, and
use that to set the PTE bits. More commonly, setting one bit requires
reading the word, altering it in a register, and writing it back. The
Used bit is set immediately after fetching the PTE as part of mapping
an address, so the PTE does not need to be read again. The Dirty bit
may be set much later, when a PTE that has been kept in a cache (TLB)
is used for a store, so (at least in this case) it requires a
read-alter-rewrite (RAR) cycle.
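
A rough C rendition of that read-alter-rewrite (the bit positions and
names are illustrative only):

    #include <stdint.h>

    #define PTE_USED  0x20u   /* "accessed"/reference bit -- position assumed */
    #define PTE_DIRTY 0x40u   /* "modified" bit           -- position assumed */

    /* Read-alter-rewrite: read the PTE, set the bit in a register, write it
       back.  The write (the extra memory cycle) only happens when the bit is
       not already set, so the cost is paid at most once per OS clearing pass. */
    static void set_pte_bit(volatile uint32_t *pte, uint32_t bit)
    {
        uint32_t v = *pte;        /* read            */
        if ((v & bit) == 0)
            *pte = v | bit;       /* alter + rewrite */
    }

As written this sequence is not atomic, which is exactly the gap the
atomicity discussion below is about.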
In another reference
http://www.linuxrocket.net/index.cgi?a=MailArchiver&ma=ShowMail&Id=361273

Linus Torvalds says
"The thing is, we should always set the dirty bit either atomically
with the access (normal "CPU sets the dirty bit on write") _or_ we
should set it after the write (having kept a reference to the page)."

What does he mean by "set atomically"? And how is this done?

The read-alter-rewrite pattern to update a bit in the PTE preserves
any other bits in the PTE that the OS may be playing with, except if
the OS manages to write to the PTE in the middle of the RAR. This is
more likely with another CPU, but it is possible even with CPU designs
that delay writes to improve performance. So the RAR cycle needs to be
atomic in the sense of indivisible. Typically this requires a hardware
signal telling other hardware units that they may not use some part of
memory for the duration of the RAR cycle. The MMU hardware needs to
have this built in.

A multiprocessor OS also has to know to use an atomic RAR cycle for
any change it makes to a shared PTE (or risk missing a Dirty bit set).
How to do this depends on the CPU, and sometimes on external
architecture. Typically a special instruction is used, in an
assembly-language subroutine.
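
As a sketch of what "atomic in the sense of indivisible" buys you,
here is the same dirty-bit update done as a single atomic
read-modify-write, using C11 atomics as a stand-in for whatever locked
cycle or special instruction the real hardware provides (the bit
position is again an assumption):

    #include <stdatomic.h>
    #include <stdint.h>

    #define PTE_DIRTY 0x40u   /* illustrative bit position */

    /* One indivisible read-modify-write: no other CPU, and no delayed write,
       can slip a PTE update in between the read and the write.  An SMP kernel
       gets the same effect from something like an x86 LOCK-prefixed OR or
       CMPXCHG, and the MMU hardware gets it from a locked bus or cache-line
       cycle. */
    static inline void set_dirty_atomic(_Atomic uint32_t *pte)
    {
        atomic_fetch_or(pte, PTE_DIRTY);
    }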
 
Maria wrote:

2- Dirty bit/use bit: these are particularly important in virtual
memory. They are part of each page table entry. I really don't
understand how these bits are set without losing CPU cycles. I don't
believe that the DRAM has a dedicated bit line to set/reset these
bits. Am I wrong?

On some CPUs (e.g. S/370 and its descendants through today's System z)
those bits (called Change and Reference) are attached to the real page
frame, and are not in the PTE. On modern machines they are cached, and
set by microcode in a dedicated memory region not visible to the OS,
but on old machines they really were part of the memory system and set
by hardware.

As others have pointed out, the overhead of setting such a bit only
occurs when it is not already set, and with proper caching techniques
this makes the performance hit invisible.

Btw, two advantages of Dirty & Use bits attached to the real frame are:
(1) The bits get set by I/O operations (e.g. DMA) automatically, as
well as by any accesses performed without address translation ("DAT
off"; "real mode"), and
(2) When pages are shared, the OS does not have to chain through all
PTEs pointing to the same frame in order to determine dirtiness (a
small sketch contrasting the two bookkeeping styles follows below).
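
The sketch referenced above: a toy C comparison of the two bookkeeping
styles, where the structures and field names are invented for
illustration and not taken from any real OS.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PTE_DIRTY 0x40u                /* illustrative bit position        */

    struct pte   { uint32_t flags; };
    struct frame {
        bool        changed;               /* S/370-style per-frame change bit */
        struct pte **mappers;              /* every PTE mapping this frame     */
        size_t      n_mappers;
    };

    /* Change bit attached to the real frame: dirtiness is one lookup. */
    bool frame_is_dirty(const struct frame *f)
    {
        return f->changed;
    }

    /* Dirty bit kept per PTE: with sharing, the OS must visit every PTE
       that points at the frame. */
    bool frame_is_dirty_via_ptes(const struct frame *f)
    {
        for (size_t i = 0; i < f->n_mappers; i++)
            if (f->mappers[i]->flags & PTE_DIRTY)
                return true;
        return false;
    }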

Michel.
 
(1) The bits get set by I/O operations (e.g. DMA) automatically ...

Setting the dirty bits on DMA is usually the OS's task.

For instance, Windows does this in the MmProbeAndLockPages call - see
its IoXxxAccess parameters; there are 3 choices:
"you will not update this memory",
"you will update this memory, possibly by DMA you will initiate", and
"you will update this memory and promise to fully fill it, so MM can
skip bothering with zeroing".

IIRC the 3rd option was not implemented until XP, but the code for it
and the API have been there since NT 3.1.
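
For illustration, here is a minimal WDM-style C fragment of the call
Maxim is referring to: it locks a user buffer so a device can transfer
into it, declaring the write intent through the LOCK_OPERATION
argument (IoReadAccess / IoWriteAccess / IoModifyAccess). The helper's
name and signature are made up, and error handling is pared down, so
treat it as a sketch rather than production driver code.

    #include <ntddk.h>

    /* Lock down a user buffer so a device can DMA into it.  Passing
       IoWriteAccess tells the memory manager the pages will be modified. */
    NTSTATUS LockUserBufferForDma(PVOID UserBuffer, ULONG Length, PMDL *MdlOut)
    {
        PMDL mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
        if (mdl == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        __try {
            /* "you will update this memory, possibly by DMA you will initiate" */
            MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(mdl);
            return GetExceptionCode();
        }

        *MdlOut = mdl;
        return STATUS_SUCCESS;
    }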
 
Maxim S. Shatskih said:
Setting the dirty bits on DMA is usually the OS's task.

the changed and reference bits are properties of the physical instance
of the page.

in the original 360 "key" structure ... cp67 emulated cms protected
shared pages by fiddling the storage protect keys (part of the
original 360 architecture) ... where the current executing state is
carried by the PSW (program status word), which could include a "key"
state. each storage area could have an associated key value. for a
store operation to complete, the PSW (application execution) key value
had to match the storage key value (aka not only did storage areas
have reference and change state ... but each storage area could also
have store and optionally fetch protection ... which also had to be
checked on each instruction operation). the supervisor could disable
store (& optional fetch) protection by setting the PSW key value to
zero (for privileged kernel/supervisor code).
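
a toy C rendition of that key check follows; the field layout is an
illustration only (the real architecture packs the key, fetch-protect
bit, and the reference/change bits into a storage key associated with
each block of real storage).

    #include <stdbool.h>
    #include <stdint.h>

    /* illustrative model of a 360-style storage key check */
    struct storage_block {
        uint8_t key;              /* key value assigned to this block */
        bool    fetch_protected;  /* optional fetch protection        */
    };

    /* PSW key 0 is the supervisor override used by privileged code. */
    bool store_allowed(uint8_t psw_key, const struct storage_block *blk)
    {
        return psw_key == 0 || psw_key == blk->key;
    }

    bool fetch_allowed(uint8_t psw_key, const struct storage_block *blk)
    {
        if (!blk->fetch_protected)
            return true;
        return psw_key == 0 || psw_key == blk->key;
    }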

so certain CMS virtual memory pages were defined as shared. the cp67
kernel ... behind the scenes fiddled both the non-shared and shared
page protection keys as well as fiddling the PSW for CMS execution ...
so that all stores to protected shared pages would fail.

along comes 370 ... and in the original 370 virtual memory
architecture, a shared segment protection feature was defined. for
the morph of cp67/cms (from 360/67) to vm370/cms (370), the cms layout
structure was re-organized so all "shared" pages were located in a 370
segment that could be defined as shared across multiple address spaces
.... and in each virtual address space table entry ... the segment
protect bit was turned on (preventing all instructions executing in
those virtual memories from being able to store in that segment range
of virtual addresses).

at that time, there was going to still be the "key" based storage
protection (inherited from 360), the page change and reference state
bits as well as the new 370 virtual memory storage protect mechanism.

some amount of past posts mentioning storage protect operations:
http://www.garlic.com/~lynn/93.html#18 location 50
http://www.garlic.com/~lynn/93.html#25 MTS & LLMPS?
http://www.garlic.com/~lynn/96.html#4a John Hartmann's Birthday Party
http://www.garlic.com/~lynn/99.html#94 MVS vs HASP vs JES (was 2821)
http://www.garlic.com/~lynn/2000c.html#18 IBM 1460
http://www.garlic.com/~lynn/2002q.html#31 Collating on the S/360-2540 card reader?
http://www.garlic.com/~lynn/2003m.html#15 IEFBR14 Problems
http://www.garlic.com/~lynn/2004c.html#33 separate MMU chips
http://www.garlic.com/~lynn/2004h.html#0 Adventure game (was:PL/? History (was Hercules))
http://www.garlic.com/~lynn/2004q.html#82 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2004q.html#84 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2005.html#0 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2005.html#6 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2005c.html#18 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2005h.html#17 Exceptions at basic block boundaries
http://www.garlic.com/~lynn/2005h.html#18 Exceptions at basic block boundaries

so several models are chugging along with their hardware
implementations. the 370/165 engineers then raise an issue ... they
are something like six months behind schedule designing and building
the 165 virtual memory hardware retrofit (370/165s were initially
shipped w/o virtual memory capability). we have an escalation meeting
with architecture and various groups in POK. the 370/165 engineers
claim that they can make up the six months if they can drop several of
the 370 virtual memory features from the implementation (which also
means that all of the other 370 models that have already implemented
the features will need to remove them from their implementations). it
was eventually decided to go with dropping the features so the 370/165
engineers could gain back the six months.

one of the features that had to be dropped from the original 370
virtual memory architecture was the original virtual memory segment
protection. that created a problem for CMS ... since the whole CMS
shared page memory protection had been rebuilt around having segment
protection (i.e. lots of different applications could share the exact
same physical image of a page w/o fear that one application would
trounce on it and impact applications running in other virtual
memories ... i.e. virtual memory was being used for partitioning and
isolation).

so, the vm370/cms group was then forced to retrofit the key-based
storage protection hack that was used in cp67/cms into vm370/cms in
place of shared segment protection.

we go forward a couple of years. several of the 370 models come out
with instruction microcode performance assists for vm370/cms
operation. among the instructions assisted are the PSW and storage key
management instructions. however, the assist rules don't have
provisions for the fiddling done to the PSW and storage keys by the
original hack from CP67 (i.e. if cms applications were to be run with
the hardware performance assists they would lose protection of their
shared pages).

somebody comes up with a bright idea. at that moment, vm370/cms
environments were only single processor machines and cms had only
defined 16 shared pages. the idea was that cms applications would be
run with the hardware performance assist (with storage protection
actually disabled). then, every time before the underlying kernel did
a task switch from one virtual address space to a different virtual
address space, the dispatcher would scan the shared cms pages (that
previously had been storage protected) for the change bit. any time
such a "protected" shared page was found to be dirty/changed ... the
physical copy was flushed and the PTE was marked invalid. the
switched-to address space would never see any changes made by an
application running in a different address space. the pages were no
longer actually physically protected from stores ... however, the
scope of any such stores was very limited.
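
in outline (a sketch only; the names and data structures here are
invented, not actual cp/cms code), the dispatcher's scan amounted to
something like:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PTE_VALID 0x1u                /* illustrative bit position */

    struct shared_page {
        uint32_t *pte;                    /* PTE mapping the shared page        */
        bool      changed;                /* stands in for the frame change bit */
    };

    /* on every task switch: any "protected" shared page whose change bit is
       on has its in-memory copy discarded and its PTE invalidated, so the
       next user refaults a clean copy from disk. */
    void scan_shared_pages_on_switch(struct shared_page *pages, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (pages[i].changed) {
                pages[i].changed = false;        /* discard the dirtied copy  */
                *pages[i].pte &= ~PTE_VALID;     /* force a refresh from disk */
            }
        }
    }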

so about the time they were ready to ship this bright new idea ... cms
added support for greatly increasing the number of shared pages. the
original idea was that the overhead of scanning the dirty bits of 16
shared pages on every task switch was less than the performance
improvement gained by using the microcode hardware assist. the problem
was that by the time the support shipped, there was always a minimum
of 32 shared pages to scan (and frequently a large number more) and
the trade-off was no longer valid.

the other problem was adding support for multiprocessing. the original
bright idea was based on the fact that while the application was
running, it basically had exclusive control and use of the pages.
with multiprocessor support, that was no longer true: there was
potentially concurrent access to shared pages by the equivalent of one
application per processor. so the bright idea had to be fiddled; as
part of multiprocessor support, a unique set of shared pages was
defined for each processor. now, as part of doing a task switch, the
dispatcher had to scan the shared pages from the previous task looking
for modifications (and then flushing and invalidating as needed). in
addition, the dispatcher now had to fiddle the virtual memory tables
of the new, switched-to task ... so that its virtual memory tables
pointed to the set of shared pages that were specific to the real
processor that the task was being dispatched on.

of course, eventually sharing protection re-appeared as part of
the architecture implementation shipped to customers.

misc. past posts on the whole genre of the shared page fiddling (and
emulating storage protection by scanning for changed pages and
discarding the changed image ... forcing the page to be refreshed from
disk):
http://www.garlic.com/~lynn/2000.html#59 Multithreading underlies new development paradigm
http://www.garlic.com/~lynn/2003d.html#53 Reviving Multics
http://www.garlic.com/~lynn/2003f.html#14 Alpha performance, why?
http://www.garlic.com/~lynn/2004p.html#8 vm/370 smp support and shared segment protection hack
http://www.garlic.com/~lynn/2004p.html#9 vm/370 smp support and shared segment protection hack
http://www.garlic.com/~lynn/2004p.html#10 vm/370 smp support and shared segment protection hack
http://www.garlic.com/~lynn/2004p.html#14 vm/370 smp support and shared segment protection hack
http://www.garlic.com/~lynn/2004q.html#37 A Glimpse into PC Development Philosophy
http://www.garlic.com/~lynn/2005.html#3 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2005.html#5 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2005c.html#20 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2005d.html#61 Virtual Machine Hardware
http://www.garlic.com/~lynn/2005e.html#53 System/360; Hardwired vs. Microcoded
http://www.garlic.com/~lynn/2005f.html#45 Moving assembler programs above the line
http://www.garlic.com/~lynn/2005f.html#46 Moving assembler programs above the line
http://www.garlic.com/~lynn/2005h.html#9 Exceptions at basic block boundaries
http://www.garlic.com/~lynn/2005h.html#10 Exceptions at basic block boundaries
http://www.garlic.com/~lynn/2005h.html#13 Today's mainframe--anything to new?
http://www.garlic.com/~lynn/2005j.html#39 A second look at memory access alignment
http://www.garlic.com/~lynn/2005j.html#54 Q ALLOC PAGE vs. CP Q ALLOC vs ESAMAP
http://www.garlic.com/~lynn/2005o.html#10 Virtual memory and memory protection
http://www.garlic.com/~lynn/2006.html#13 VM maclib reference
http://www.garlic.com/~lynn/2006.html#38 Is VIO mandatory?
http://www.garlic.com/~lynn/2006b.html#39 another blast from the past
 
Maxim S. Shatskih said:
Setting the dirty bits on DMA is usually the OS's task.

so there is also a separate issue with cp67 and vm370 providing
emulated virtual machines.

there are the real storage changed and reference bits associated
with the real physical page.

the real kernel uses the real storage changed bits to determine
whether the physical instance in real memory is the same as or
different from the image on disk. when the virtual copy is brought in
from disk,
the real change bits are set to zero indicating that the copy hasn't
been changed since reading from disk. subsequently the application or
i/o operations may alter the virtual copy in real memory ... which
means that it is no longer the same as the copy on disk.

now there is a problem with simulating a virtual machine environment.
the pages in the virtual machine address space have changed and
referenced bits managed by the kernel running in the virtual machine.
the virtual machine kernel may read a virtual page off its disk into a
page of memory ... which changes the real instance of the page with
respect to
the real kernel. the kernel in the virtual machine will then clear its
version of the change (& reference) bit to zero (indicating that the
copy in storage is the same as the copy on its disk).

similarly, the real kernel may remove a virtual machine page from real
memory to disk. when it brings that virtual machine page back into
real memory, it will reset the changed (and reference) bits indicating
that the virtual page in storage is still the same as the copy on the
real kernel's page disk.

we have one set of real changed and reference bits ... but two
different kernels attempting to use them for tracking state about two
different things (whether there has been a change from the copy on the
virtual kernel's paging disk as well as whether there has been a
change from the copy on the real kernel's paging disk).

so the real kernel maintains two sets of "shadow" changed and
reference bits, one set for the virtual machine kernel and one set for
the real machine kernel. whenever the real kernel changes the real
changed and reference bits ... the values for the real page are OR'ed
with the value in the virtual machine kernel shadow bits, the real
changed and reference bits are cleared to zero ... and the desired
value is assigned to the real kernel's shadow bits.

whenever the virtual machine kernel changes a page's reference and
change bits ... the values are OR'ed with the value in the real
machine kernel shadow bits, the real change and reference bits are
cleared to zero ... and the desired value is assigned to the virtual
machine kernel's shadow bits.

whenever either kernel interrogates the changed & reference bits ...
it does so by first OR'ing the values for the real page with the
shadow bits maintained for that particular kernel.

In effect, only an explicit store by a virtual application instruction
turns on the real dirty/changed bits (occurring when the instruction
modifies something in that range of storage). Administrative
management of the bits never turns the bits on ... it ONLY zeros the
real bits (after first interrogating them and OR'ing them, as
appropriate, into the appropriate software-maintained shadow bits).
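
a toy C rendition of that double bookkeeping (the names are made up;
"host" below is the real kernel and "guest" is the virtual machine
kernel):

    #include <stdint.h>

    #define REF 0x1u               /* illustrative bit encodings */
    #define CHG 0x2u

    struct rc_state {
        uint8_t real_bits;     /* hardware R/C bits of the real frame       */
        uint8_t guest_shadow;  /* shadow R/C bits kept for the guest kernel */
        uint8_t host_shadow;   /* shadow R/C bits kept for the host kernel  */
    };

    /* the host kernel sets the bits to new_bits: fold the current hardware
       value into the guest's shadow so its view is preserved, clear the
       hardware bits, and record the host's desired value. */
    void host_set_rc(struct rc_state *s, uint8_t new_bits)
    {
        s->guest_shadow |= s->real_bits;
        s->real_bits     = 0;
        s->host_shadow   = new_bits;
    }

    /* symmetrically when the guest kernel resets its notion of the bits. */
    void guest_set_rc(struct rc_state *s, uint8_t new_bits)
    {
        s->host_shadow  |= s->real_bits;
        s->real_bits     = 0;
        s->guest_shadow  = new_bits;
    }

    /* either kernel reads the bits by OR'ing the hardware value with its
       own shadow copy. */
    uint8_t host_read_rc(const struct rc_state *s)
    {
        return s->real_bits | s->host_shadow;
    }

    uint8_t guest_read_rc(const struct rc_state *s)
    {
        return s->real_bits | s->guest_shadow;
    }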

some past posts discussing management of shadow change & reference
bits as part of virtual machine emulation:
http://www.garlic.com/~lynn/95.html#2 Why is there only VM/370?
http://www.garlic.com/~lynn/2005h.html#17 Exceptions at basic block boundaries
 