.hpf hyphen.local
.P1
.de PT
.tl 'BIO01 - BLOCK I/O'\*[CH]'PD-1C302-01'
.tl 'File: bio.c''Section 2'
.tl '''Issue 1, January 1976'
..
.2C
.ne 10
.
.LP
.LG
.B bawrite
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
bawrite(bp)
.br
struct buf *bp;
.sp 1n
.
.LP
.I RETURNS
.
.LP
No value is returned.
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Performs an asynchronous write to a block device.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
Most write operations to a block device under UNIX are either delayed or
performed asynchronously. This means that a process may not experience any
of the latency time or transfer time associated with outputting to a
device. Only delay time due to scarcity of available system buffers will be
experienced. In addition, performing the write asynchronously allows
efficiency gains when multiple references to the same block occur.
Bio.c/bawrite sets up an asynchronous write for one 512 byte block.
The process requesting the write does not wait for completion.
.
.LP
When bio.c/bawrite is called, the block to be written has already been
moved to the appropriate device queue by a higher level function so that no
allocation of a buffer to a device queue is needed as for reads. The device
strategy routine (rp.c/rpstrategy, rf.c/rfstrategy, etc.) is called to
queue the block for I/O by bio.c/bawrite. Errors detected by the strategy
routine are posted (in "u_error") by calling bio.c/geterror. Bio.c/bawrite
uses common code from bio.c/bwrite to call the strategy routine and report
any errors. When I/O is completed, the block will be returned to the queue
of available blocks on the freelist (by the higher level routine calling
bio.c/brelse), however, it will also remain linked to the device queue so
that any future reference to that block (by bio.c/getblk) will find it in
the system.
.sp 1m
.ne 10
.
.LP
.LG
.B bdwrite
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
bdwrite(bp)
.br
struct buf *bp;
.sp 1n
.
.LP
.I RETURNS
.
.LP
No value is returned.
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Performs a delayed write to a block device.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
Most write operations need not be done immediately. It is convenient in
most cases to merely return a buffer that is to be written onto a block
device to the pool of free buffers ("bfreelist") until some more convenient
time to write it out. In this way, a subsequent read or write request which
requires data within that block will find the data within the system and
extraneous operations can be eliminated. The buffer, however, must be
marked ("b_flags") with an indicator (B_DELWRI), so that if free buffers
are required by other processes, the block will be written out
(asynchronously by bio.c/bawrite) and another chosen. The delayed write
block would then be returned to the free queue but would not be a candidate
for reallocation.
.
.LP
The buffer that is to be written will thus experience the following
movements:
.IP 1. 3
The buffer will be allocated to a device queue by some higher level
function using bio.c/getblk. This means a buffer is removed from the
freelist and placed on the device's queue. (The block may already be in the
system so that there would be no need for allocation.)
.IP 2. 3
The data is then copied into the buffer by the higher level routine.
.IP 3. 3
When the delayed write is issued, the buffer is marked (B_DELWRI in
"b_flags") and placed at the end of the queue of available blocks on the
freelist. The buffer is still, however, linked on the device queue.
.IP 4. 3
The block will linger on the freelist until another request for that block
is made (by bio.c/getblk) or until a free buffer is needed. If a buffer is
on the free list long enough to reach the head of the free queue, and free
buffers are needed, an asynchronous write (bio.c/bawrite) is issued for the
delayed write block and the next available block is chosen. The
asynchronous write will result in the buffer being unlinked from the free
list and placed on the device's I/O queue until the write is completed.
.IP 5. 3
When the write is completed, the buffer is returned to the end of the free
list, however, it is still linked to the device queue so that if that block
is referenced by some process, the block will be found in the system.
.IP 6. 3
The block will linger on the available list as in 3 until either another
free buffer is needed and the buffer has reached the head of the free list
or until some other process requests the data in that buffer. If a free
buffer is needed, the buffer no longer must be written out so that it can
be allocated to the requester as a free buffer. Only at this time does the
block disappear from the device queue.
.IP 7. 3
In order to prevent very active devices on the system from accumulating a
large number of delayed write blocks, the sys3.c/sync function causes all
delayed write blocks to be written out at least at the frequency that the
UPDATE process runs (in multi-user every 30 seconds). This also minimizes
discrepancies between in memory data and device data if the system crashes.
.
.LP
Since magnetic tape is a sequential medium, delayed writes must be
disallowed as blocks must be written sequentially and allowing blocks to
accumulate on the freelist as delayed write blocks runs the risk of
disturbing the order in which the blocks reach the magnetic tape. Delayed
writes are, however, allowed for DEC Tape (TC11) since it is really a
random access device even though it is a sequential medium. This makes it
necessary to flush out delayed write blocks when the DEC Tape is closed
(tc.c/tcclose). (This is done by calling bio.c/bflush, which writes the
blocks out asynchronously.)
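.
.LP
The flag handling described above can be sketched as a small user-space
model. This is an illustration only, not the bio.c source: the flag values,
the "on_freelist" and "writes_started" fields, and the helper names
(release, write_async, delayed_write, reclaim) are assumptions made for the
example.
.DS
```c
/* Flag bits; the numeric values are illustrative assumptions. */
enum { B_BUSY = 01, B_ASYNC = 02, B_DELWRI = 04 };

struct buf {
    int b_flags;
    int on_freelist;    /* hypothetical stand-in for av_forw/av_back linkage */
    int writes_started; /* hypothetical count of writes handed to a driver */
};

/* Stand-in for bio.c/brelse: the buffer becomes available again. */
static void release(struct buf *bp)
{
    bp->b_flags &= ~B_BUSY;
    bp->on_freelist = 1;
}

/* Stand-in for bio.c/bawrite: start the write and do not wait for it. */
static void write_async(struct buf *bp)
{
    bp->b_flags |= B_ASYNC;
    bp->b_flags &= ~B_DELWRI;
    bp->writes_started++;
}

/* Delayed write: sequential tape is written at once, anything else is
 * only marked B_DELWRI and returned to the available list. */
void delayed_write(struct buf *bp, int is_sequential_tape)
{
    if (is_sequential_tape) {
        write_async(bp);
    } else {
        bp->b_flags |= B_DELWRI;
        release(bp);
    }
}

/* Allocation meets a buffer at the head of the free list.
 * Returns 1 if it may be reallocated, 0 if a write had to be started. */
int reclaim(struct buf *bp)
{
    bp->b_flags |= B_BUSY;
    bp->on_freelist = 0;
    if (bp->b_flags & B_DELWRI) {
        write_async(bp);
        return 0;       /* caller must choose another buffer */
    }
    return 1;
}
```
.DE
.LP
In this model a buffer passed to delayed_write stays on the available list
with B_DELWRI set, and the first attempt to reclaim it starts the write
instead of handing the buffer out.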
.sp 1m
.ne 10
.
.LP
.LG
.B bflush
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
bflush(device)
.sp 1n
.
.LP
.I RETURNS
.
.LP
No value is returned.
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Flushes write behind blocks out of the I/O subsystem for a particular
device or for all devices.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
The inertia built into the block I/O subsystem by use of the write behind
feature (delayed write - see bio.c/bdwrite) allows a number of redundant
I/O operations on the same block to be eliminated. This is done by allowing
write behind blocks to be returned to the system's available buffer queue
until a later time when they are written out. This could, however, produce
problems if the system crashes, as those blocks would not be updated since
they were allowed to lie in the I/O subsystem. To flush these write behind
blocks out of the I/O subsystem, the bio.c/bflush function is called by the
UPDATE (alloc.c/update) process once every 30 seconds. The delayed
write blocks are written out (by bio.c/bflush) asynchronously (see
bio.c/bawrite).
.
.LP
When each delayed write buffer is written, it is temporarily removed
(bio.c/notavail) from the available list and made busy (B_BUSY in
"b_flags"). It is returned to the end of the available list once the write
has been completed. Since the block is returned to the available list and
its linkage on the device queue ("b_forw", "b_back" on "devtab") is
undisturbed, the block still appears on the device queue. The only
difference is that it has been assured that the block has been updated on
the block device.
.
.LP
Detaching any filesystem or closing a DEC Tape (the TC11 is a random access
device even though it is a sequential medium) requires that all write behind
blocks for that device be flushed out. Bio.c/bflush is called by both
sys3.c/sumount and tc.c/tcclose.
.
.LP
If bio.c/bflush is called with "device" number NODEV (-1) then the write
behind blocks associated with all devices are flushed out; otherwise only
those write behind blocks associated with "device" are flushed out. The UPDATE
process calls bio.c/bflush with NODEV as an argument. Since the available
queue of buffers ("bfreelist") is examined to find delayed write blocks and
this queue is the one from which all allocation and deallocation of buffers
occurs, the processor's priority must be raised to 6 to prevent interrupts
from changing the status of buffers or altering linkages.
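.
.LP
The scan of the available queue can be sketched as follows. This is a
simplified user-space model under stated assumptions: the flag values, the
reduced "buf" structure and the function names (freelist_append, flush) are
hypothetical, the priority change is only noted in a comment, and starting
the write is modelled by clearing B_DELWRI.
.DS
```c
#include <stddef.h>

#define NODEV (-1)
enum { B_BUSY = 01, B_ASYNC = 02, B_DELWRI = 04 };

struct buf {
    int b_flags;
    int b_dev;
    struct buf *av_forw, *av_back;  /* available list linkage */
};

/* Anchor of the available ring; an empty ring points at itself. */
static struct buf bfreelist = {0, NODEV, &bfreelist, &bfreelist};

/* Append bp at the tail of the available ring. */
void freelist_append(struct buf *bp)
{
    bp->av_back = bfreelist.av_back;
    bp->av_forw = &bfreelist;
    bfreelist.av_back->av_forw = bp;
    bfreelist.av_back = bp;
}

/* Stand-in for bio.c/notavail: unlink from the ring and mark busy. */
static void notavail(struct buf *bp)
{
    bp->av_back->av_forw = bp->av_forw;
    bp->av_forw->av_back = bp->av_back;
    bp->b_flags |= B_BUSY;
}

/* Start writes for delayed write buffers of dev, or of every device
 * when dev is NODEV. Returns the number of writes started. */
int flush(int dev)
{
    int started = 0;
    struct buf *bp, *next;

    /* The kernel raises the processor priority here; omitted in the model. */
    for (bp = bfreelist.av_forw; bp != &bfreelist; bp = next) {
        next = bp->av_forw;         /* save before bp is unlinked */
        if ((bp->b_flags & B_DELWRI) &&
            (dev == NODEV || dev == bp->b_dev)) {
            notavail(bp);
            bp->b_flags |= B_ASYNC;
            bp->b_flags &= ~B_DELWRI;   /* the write is now under way */
            started++;
        }
    }
    return started;
}
```
.DE
.LP
Calling flush with a device number leaves delayed write buffers of other
devices untouched on the ring; calling it with NODEV starts a write for
every remaining delayed write buffer.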
.sp 1m
.ne 10
.
.LP
.LG
.B binit
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
binit()
.sp 1n
.
.LP
.I RETURNS
.
.LP
No value is returned.
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Initializes the block device buffers and determines the number of block
devices on the system.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
UNIX possesses a pool of 512 byte buffers ("buffers") which are used for
buffering reads or writes from any device or can be used by the system for
holding any data whose size does not exceed 512 bytes. Associated with
each block device is a device queue ("devtab"), which links together all of
the blocks that have currently been read or written from that device. There
is also a queue of available buffers ("bfreelist") which chains buffers
together. The linkages on the device queue ("devtab") and available queue
are arranged so that a buffer may appear on both queues at the same time.
.
.LP
Each buffer has a header ("buf") which contains linkages, byte count
information, status flags, and a pointer to an associated 512 byte buffer. A
buffer header ("bfreelist") which has no associated buffer is used as the
anchor linking together all available buffers in a ring. All buffer headers
contain two pairs of pointers; "av_forw", "av_back" and "b_forw", "b_back".
Each pair contains a forward and backward pointer so that in searching
through the list, neighboring buffers on any queue may be found
immediately. One pair of pointers ("av_forw", "av_back") is used to link
together a ring anchored by "bfreelist" containing all of the buffers that
are currently available for allocation. The second pair of pointers
("b_forw" and "b_back") is used to link the buffer onto a device queue.
"Bfreelist" also uses these pointers so that it may act as a device queue.
This queue is used for linking together blocks which are allocated for some
purpose other than device I/O and for which it is undesirable to have them
associated with a device.
.
.LP
The device queue also has two pairs of pointers but uses them for different
purposes. The first pair ("b_forw", "b_back") are used to link together all
buffers that have been used or are being used for I/O to the device. This
queue is also arranged as a ring with "devtab" as anchor. The second pair
of pointers ("d_actf", "d_actl") are used to order buffers that are
currently queued to be read or written. The arrangement here varies with
the device but is usually a single thread chain ordered according to the
strategy used to access that device (First Come First Served - FCFS, SCAN,
SSTF, etc.).
.
.LP
A block may appear on both the available list and the "b_forw", "b_back"
list of the device (since the block, though available for use by other
processes was last accessed from the device). This allows the elimination
of many I/O operations since the desired block may be found on the device
queue.
.
.LP
Initializing the block device buffers is done by setting up the pointers in
the buffer headers and allocating each buffer to the free list. The
algorithm used by bio.c/binit is essentially as follows. First, a ring with
no buffers is created with "bfreelist" as its sole member by initializing
all of its pointers to itself. Thereafter, a buffer is allocated to this
queue (by calling bio.c/brelse) and header information is added to each
buffer. Coincident with this, each buffer is allocated to the null device
queue ("b_forw", "b_back" pointers in "bfreelist") so that the "b_forw"
and "b_back" pointers are initialized. The process is repeated for each of
the system buffers. The header information that is initialized is as follows:
.IP \~ 2
The "b_dev" entry must be set up so that no buffer starts out as associated
with any device ("b_dev" = -1).
.IP
The buffers ("buffers") are allocated separately from the buffer
headers ("buf[]") so that a pointer ("b_addr") in the buffer header must be
set to point to the address of the buffer. Buffer headers and buffers are
allocated as an array and buffer header i ("buf[i]") is associated with
buffer i ("buffers[i]"). The reason that the headers are not allocated as
part of the buffer is so that physical I/O may be interfaced to the block
I/O subsystem by simply using a special buffer header which points to the
data in the user's process. Also, when debugging a core dump, all of the
information about buffer status can be obtained by dumping the headers
without the necessity of dumping buffers.
.IP
The buffer must be marked ("b_flags") busy (B_BUSY) until all of the
linkages have been properly set up. Bio.c/brelse will reset the busy flag
once the available pointers "av_forw", "av_back" have been set up.
.
.LP
Another function performed at initialization time is to determine how many
block devices there are on the system so that major device numbers may be
checked by higher level functions. The table ("bdevsw") contains a 4 word entry
for each device on the system and one blank (zero) entry following all of
these entries to indicate the end of the table. Bio.c/binit scans the table
looking for the first zero entry, counting each entry as it scans. The
total number of block devices is loaded in an external variable
("nblkdev").
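.
.LP
The initialization steps above can be sketched in a user-space model. It is
a sketch under assumptions, not the bio.c source: the pool size NBUF, the
layout of the switch entry, the placeholder stub_entry, and the names
buffers_init and release are hypothetical, and release is a reduced
stand-in for bio.c/brelse.
.DS
```c
#include <stddef.h>

#define NBUF  4                 /* assumed small pool for illustration */
#define BSIZE 512
enum { B_BUSY = 01 };

struct buf {
    int b_flags;
    int b_dev;
    char *b_addr;
    struct buf *b_forw, *b_back;    /* device queue linkage */
    struct buf *av_forw, *av_back;  /* available list linkage */
};

struct bdevsw_entry {           /* hypothetical 4 word switch entry */
    int (*d_open)(void);
    int (*d_close)(void);
    int (*d_strategy)(void);
    void *d_tab;
};

static char buffers[NBUF][BSIZE];
static struct buf buf[NBUF];
static struct buf bfreelist;
int nblkdev;

/* Placeholder driver entry point used only to fill example tables. */
int stub_entry(void) { return 0; }

/* Reduced stand-in for bio.c/brelse: clear busy, append to ring tail. */
static void release(struct buf *bp)
{
    bp->b_flags &= ~B_BUSY;
    bp->av_back = bfreelist.av_back;
    bp->av_forw = &bfreelist;
    bfreelist.av_back->av_forw = bp;
    bfreelist.av_back = bp;
}

void buffers_init(const struct bdevsw_entry *sw)
{
    int i;

    /* An empty ring: every pointer of the anchor refers to itself. */
    bfreelist.b_forw = bfreelist.b_back = &bfreelist;
    bfreelist.av_forw = bfreelist.av_back = &bfreelist;

    for (i = 0; i < NBUF; i++) {
        struct buf *bp = &buf[i];

        bp->b_dev = -1;             /* associated with no device */
        bp->b_addr = buffers[i];    /* header i points at buffer i */
        bp->b_back = &bfreelist;    /* join the null device queue */
        bp->b_forw = bfreelist.b_forw;
        bfreelist.b_forw->b_back = bp;
        bfreelist.b_forw = bp;
        bp->b_flags = B_BUSY;       /* busy until the linkage is done */
        release(bp);
    }

    /* Count switch entries up to the first zero entry. */
    nblkdev = 0;
    while (sw[nblkdev].d_open != NULL)
        nblkdev++;
}
```
.DE
.LP
After buffers_init returns, every header is on both rings anchored by
"bfreelist", no header is busy, and "nblkdev" holds the number of entries
that precede the terminating zero entry.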
.sp 1m
.ne 10
.
.LP
.LG
.B bread
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
bread(dev,blkno)
.sp 1n
.
.LP
.I RETURNS
.
.LP
A pointer to a buffer containing the block "blkno" is returned. (Actually,
a pointer to the buffer header is returned.)
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Performs a synchronous 512 byte read of a block device.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
Since a program cannot operate on data until it is available, only
synchronous reads are used under UNIX (i.e., any higher level request for a
block is forced to roadblock until the data is available). A certain amount
of inertia is built into the block I/O buffering scheme so that there is a
possibility that the desired block "blkno" may already be in one of the
system's buffers. This is more likely to be true if a process is attempting
to read data in 512 byte chunks. If the block is in the system it can be
grabbed before it leaves the system thus saving a read operation. A new
buffer can be allocated from the pool of free buffers ("bfreelist") if the
block is not already in the system. All reads result in a request for 512
bytes of data, even though the higher level function calling for the read
may only need a small portion of this data. The buffer is set up for a 512
byte read (the word count "b_wcount" is set to 256) and the buffer is
marked for a read (B_READ set in "b_flags"). In both cases, the buffer is
marked ("b_flags") as busy (B_BUSY) for the period of time that the read is
scheduled to take place and/or while the data is required for use by a
process.
.
.LP
If the block must be read, the buffer is placed on the appropriate block
device queue by calling the device strategy routine. The argument "device"
indicates the major device number so that the proper device strategy
routine may be selected from the Block Device Switch Table ("d_strategy" in
"bdevsw"). There is no possibility of a reference to the table being out of
bounds since the major device number was checked at higher levels of
software (against "nblkdev"). The process that requested the read is
roadblocked, by calling bio.c/iowait, until the read has completed. When the
read is completed, the device interrupt handler will mark the buffer as
having been filled by setting the done indicator (B_DONE) in the "b_flags"
entry of the buffer. Any errors occurring in the read will be reported by
the device interrupt handler.
.sp 1m
.ne 10
.
.LP
.LG
.B breada
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
breada(device,blkno,rablkno)
.sp 1n
.
.LP
.I RETURNS
.
.LP
A pointer to a buffer containing the block "blkno" is returned. (Actually,
a pointer to a buffer header is returned.)
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Performs read ahead on "device".
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
Read ahead is a technique whereby an attempt is made to anticipate where
the next read request on a device will be and to preread that data. In this
manner, the program requesting the read will not be subjected to positional
and rotational latency or device queuing, if read ahead is completed before
the next block is requested. There are different strategies that can be
used for doing read ahead. On UNIX, all reads (not including physical I/O)
result in a full 512 byte block being read from the device. Smaller amounts
of data could be read if a program requested it; however, since disk
transfer times are small in comparison to positional and rotational latency
times, any extra transfer is inconsequential. Also, most DEC disks are
designed around a 512 byte sector and while fewer than 512 bytes may be
specified, the disk controller is busy until a full sector has been
transferred. By reading the full 512 bytes, any subsequent read or write
which references data within that block will not have to be read (if the
block does not leave the system). Besides the advantage gained by reading a
minimum of 512 bytes instead of the desired quantities, the next block is
anticipated and read under certain conditions. Thus, one request will spawn
several read requests to bring data into the system. A routine for finding
a block that is already in memory (bio.c/incore) must be available to
determine whether any reads need be done and the read ahead strategy must
be capable of determining when read ahead should be discontinued so that
superfluous reads are not generated.
.
.LP
The strategy adopted under UNIX is to pursue read ahead as long as a
process is reading (512 byte blocks) sequentially through a file or a
device. When the first non sequential read is requested, read ahead is
discontinued and is not restarted until sequential accesses begin again.
.
.LP
Bio.c/breada carries out the read ahead operation. Starting and stopping
read ahead and determining which block number in a file or on a device is
the read ahead block ("rablkno") is done by the higher level function
rdwri.c/readi.
.
.LP
In implementing the read ahead strategy, bio.c/breada makes use of
bio.c/incore to determine whether a block is already in memory. For the
desired block "blkno", the bio.c/breada function behaves exactly like
bio.c/bread. That is, a synchronous read is performed and the process
requesting the read is roadblocked until it is completed. Since the desired
block may already be within the system, bio.c/incore is called to look for that
block among the buffers on the freelist ("bfreelist"). If the block is
already in memory, bio.c/bread is called to get the buffer. If the desired
block has not already been read by a previous read operation then
bio.c/getblk is called to see if the block is possibly on a device queue
waiting for its turn to be read. If that is not the case, a buffer is
allocated for the read and the appropriate device strategy routine is
called. Bio.c/breada does not wait (yet) for the read to complete. Rather,
it goes through a similar operation for the read ahead block "rablkno".
Bio.c/incore is called to search the free list of buffers ("bfreelist") to
see if the block was read in a previous read operation. Nothing will be
done, of course, if the read ahead block is in memory. If it is not in
memory, bio.c/getblk is called to search the device queue for it or to
allocate a block so that it may be read. The device strategy module is
called to read the read ahead block, however, the buffer will be marked
(B_ASYNC in "b_flags") so that when the read completes the buffer is
returned to the pool of available buffers. Bio.c/breada then waits for the
read of the desired block to complete. It does not wait for the read
ahead block.
.
.LP
An external variable "raflg" is available for turning off all read ahead on
all devices. "Raflg" is initialized to one; however, by changing it to zero
read ahead is eliminated. As with bio.c/bread any error detection is done
as a result of the interrupt handler indicating an error to bio.c/iowait
and a system error (in "u_error") being posted. These errors are of no
concern to bio.c/breada or bio.c/bread and are used only at higher levels
of software to return errors to the user.
.sp 1m
.ne 10
.
.LP
.LG
.B brelse
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
brelse(bp)
.br
struct buf *bp;
.sp 1n
.
.LP
.I RETURNS
.
.LP
No value is returned.
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Releases a buffer to the queue of available buffers on the freelist
("bfreelist").
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
This function takes the buffer "bp" and places it on the queue ("bfreelist")
of available buffers. Neither the data in the buffer nor its linkage to
the device queue is destroyed, however, so that the block appears on both
queues. (The available queue "bfreelist" and the queue it was released
from, "devtab".) In this way, the block appears as available for
allocation, yet it retains the identity of the data that resides in the
buffer. It is this fact that allows subsequent references to the block to
eliminate unnecessary I/O operations.
.
.LP
The "av_forw" and "av_back" pointers in the buffer headers link together
available blocks on the freelist ("bfreelist"). By simply inserting the
buffer "bp" at the end of the freelist (it will become the last block on
the "av_forw" chain) it becomes an available buffer. Changing buffer
linkages must not be interrupted, so the processor's priority is raised
to 6 to lock out interrupts from all block devices.
.
.LP
Several conditions must be checked before a block is released to the
available list.
.IP 1. 3
A check must be made to see if there is a request for this block by some
other process (B_WANTED set in "b_flags"). The B_WANTED flag is set if a
process requests a block when it is busy (B_BUSY set in "b_flags"). The
busy flag is set when a buffer is found or allocated by bio.c/getblk and is
not reset until the buffer is released (bio.c/brelse). Thus, any reference
to the block while it is busy will result in the wanted flag being set in
the buffer and the process that references the busy block being
roadblocked. It is the duty of bio.c/brelse to awaken all processes that
are waiting for this buffer (by calling slp.c/wakeup).
.IP 2. 3
If the available queue of buffers on the freelist is empty, then
bio.c/brelse must notify bio.c/getblk that one buffer is now available.
(Bio.c/getblk will roadblock any process that requires a buffer if there
are none available.)
.IP 3. 3
If the buffer being released was never read or written properly (B_ERROR
set in "b_flags") then in order to prevent any other process from finding
the buffer containing bad data the minor device number ("d_minor") of the
device number ("b_dev") is destroyed (set to -1 to destroy any
associativity). The buffer will still be handled as any other block, that
is, it will appear on two queues, but the device number which is checked by
bio.c/getblk is destroyed so that even though the block is still on the
device queue it will not be recognized. It can, however, be allocated as a
free buffer.
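.
.LP
The three checks and the insertion at the end of the available list can be
sketched as follows. This is a user-space model under assumptions: the flag
values are illustrative, "b_dev" is split into two plain integers for
clarity, wakeup is a counting stand-in for slp.c/wakeup, the name release
is hypothetical, and a waiting bio.c/getblk is modelled by B_WANTED being
set on "bfreelist".
.DS
```c
#include <stddef.h>

enum { B_BUSY = 01, B_WANTED = 02, B_ERROR = 04, B_ASYNC = 010 };

struct buf {
    int b_flags;
    int d_major, d_minor;           /* the two halves of b_dev */
    struct buf *av_forw, *av_back;  /* available list linkage */
};

/* Anchor of the available ring; an empty ring points at itself. */
static struct buf bfreelist = {0, 0, 0, &bfreelist, &bfreelist};
static int wakeups;                 /* number of stand-in wakeup calls */

/* Counting stand-in for slp.c/wakeup. */
static void wakeup(const void *chan)
{
    (void)chan;
    wakeups++;
}

void release(struct buf *bp)
{
    /* 1. Some process is waiting for this particular buffer. */
    if (bp->b_flags & B_WANTED)
        wakeup(bp);

    /* 2. Some process is waiting for any free buffer. */
    if (bfreelist.b_flags & B_WANTED) {
        bfreelist.b_flags &= ~B_WANTED;
        wakeup(&bfreelist);
    }

    /* 3. Bad data: destroy the association with the device. */
    if (bp->b_flags & B_ERROR)
        bp->d_minor = -1;

    /* The kernel raises the processor priority around this relinking. */
    bp->b_flags &= ~(B_WANTED | B_BUSY | B_ASYNC);
    bp->av_back = bfreelist.av_back;
    bp->av_forw = &bfreelist;
    bfreelist.av_back->av_forw = bp;
    bfreelist.av_back = bp;
}
```
.DE
.LP
The device queue pointers are deliberately absent from the model, since the
release leaves them untouched.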
.sp 1m
.ne 10
.
.LP
.LG
.B bwrite
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
bwrite(bp)
.br
struct buf *bp;
.sp 1n
.
.LP
.I RETURNS
.
.LP
No value is returned.
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Performs a synchronous 512 byte write on a block device.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
Most write operations in the block I/O subsystem are performed
asynchronously; however, there are several operations (updating
superblocks, updating i-nodes, updating freelists, etc.) which cannot be
delayed. Functions within the system that require writes have already
allocated a buffer (by calling bio.c/getblk) so that there is no need to
pass a device number to bio.c/bwrite as is done with the read functions.
The device number is already set in the buffer header ("b_dev") so that
bio.c/bwrite need only set the write flag in the buffer header ("b_flags")
and queue the buffer on the device for writing by calling the device
strategy routine. (Actually, since a buffer may be used only for reading or
writing, the absence of the read flag, B_READ, indicates that a write is to
be performed.) Since the write is to be performed synchronously, the
process requesting the write is roadblocked until the write is completed.
.
.LP
Bio.c/bwrite provides common code for bio.c/bawrite and bio.c/bdwrite,
however, they will be discussed under their respective functions.
.sp 1m
.ne 10
.
.LP
.LG
.B clrbuf
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
clrbuf(bp)
.br
struct buf *bp;
.sp 1n
.
.LP
.I RETURNS
.
.LP
No value is returned.
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Zeros a 512 byte system buffer.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
Clearing a buffer of its contents is really a service provided to higher
level functions and is not used by any of the functions in bio.c. It is
used by the magnetic tape strategy routine to pad out a block with zeros if
a block smaller than 512 bytes is to be written. It is also used every time
a block is allocated to a file in the file system. This is done so that
there is no old data residing in a file if the file's length is not a
multiple of 512 bytes.
.sp 1m
.ne 10
.
.LP
.LG
.B devstart
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
devstart(bp, devloc, devblk, hbcom)
.br
struct buf *bp;
.br
int *devloc;
.sp 1n
.
.LP
.I RETURNS
.
.LP
No value is returned.
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Loads a block device controller's registers to initiate a transfer.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
Since the controller registers on most block devices manufactured by
Digital Equipment Corporation have the same form and relative positioning,
a common routine may be used to load the device registers. The new PDP-11
common controller RH11 has a slightly different format so that a comparable
routine rh.c/rhstart is used for these devices.
.
.LP
The parameters passed and the registers which are loaded are:
.IP 1. 3
"bp" - This is the address of the buffer header which contains information
about the buffer to be written. The header contains location, word count,
memory extension, device number, operation, etc.
.IP 2. 3
"devloc" - This is an address in the device controller. It is the Unibus
address of the cylinder (or possibly sector) register of the controller.
(Only controllers which have the following four registers in the following
order may use bio.c/devstart.)
.RS
.IP a. 3
Command and status
.IP b. 3
Word Count
.IP c. 3
Bus Address
.IP d. 3
Cylinder, Sector or Track Address Register.
.RE
.IP 3. 3
"devblk" - This is either the Cylinder, Sector or Track to be loaded into
register 2d above. This value is computed by the device startup routine
(rp.c/rpstart, rk.c/rkstart, etc.).
.IP 4. 3
"hbcom" - This is a flag indicating whether a read or a write is to be
issued to the controller. The Command and Status Register is the last
register to be loaded as it actually initiates the transfer.
.
.LP
Bio.c/devstart is called by block device start routines (rp.c/rpstart,
rf.c/rfstart, etc.).
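.
.LP
The order in which the four registers are loaded can be sketched as
follows. The registers are modelled as four consecutive 16 bit words in
ordinary memory, the buffer header is reduced to two fields, and the
command word is passed in ready made; these simplifications, and the name
start_transfer, are assumptions made for the example rather than the
bio.c code.
.DS
```c
#include <stdint.h>

/* Reduced buffer header: only what the register load needs. */
struct buf {
    uint16_t b_addr;    /* bus address of the data */
    int16_t  b_wcount;  /* word count for the transfer */
};

/*
 * devloc points at the last of four consecutive registers:
 *   [0] command and status
 *   [1] word count
 *   [2] bus address
 *   [3] cylinder, sector or track address
 * The registers are loaded from the last to the first, so that the
 * command and status register, which starts the transfer, is written
 * only after the other three are in place.
 */
void start_transfer(const struct buf *bp, volatile uint16_t *devloc,
                    uint16_t devblk, uint16_t command)
{
    volatile uint16_t *rp = devloc;

    *rp = devblk;                       /* disk address */
    *--rp = bp->b_addr;                 /* bus address */
    *--rp = (uint16_t)bp->b_wcount;     /* word count */
    *--rp = command;                    /* last: initiates the transfer */
}
```
.DE
.LP
A caller would pass the address of the fourth register, for example the
address of element 3 of a four element array standing in for the
controller.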
.sp 1m
.ne 10
.
.LP
.LG
.B getblk
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
getblk(device, blkno)
.sp 1n
.
.LP
.I RETURNS
.
.LP
Returns a pointer to a system buffer. If a device number that is out of
range is passed to getblk, a system panic will occur ("PANIC BLKDEV").
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Determines whether a given block from a device is already in the system and
if not, allocates a buffer.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
Bio.c/getblk is used by any function within the system that must do 512
byte I/O. It retrieves the desired block if it is already within the system
or allocates a fresh buffer if it is not within the system. With the
linkage setup as described under bio.c/binit, a block remains associated
with the device that it was last used for even though it has been returned
to the available queue. In this way, the presence of a buffer which had
previously been read or written may be detected and unnecessary I/O
eliminated. In addition, write behind operations (see bio.c/bdwrite)
actually return buffers to the available list without being written. This
is done in the hope that the block will be accessed soon afterwards, so
that several writes to the same block will result in only one transfer to
the device. Write behind blocks cannot be allowed to lie in the I/O
subsystem forever ...
.
.LP
\&... the buffer will remain busy until after the process has
finished using the data. When the buffer is released (bio.c/brelse) any
other processes requiring the data in that buffer will be notified (via
slp.c/wakeup). In order to prevent a redundant wakeup being issued by
bio.c/brelse when busy buffers are released, bio.c/iodone resets the
B_WANTED bit before issuing a wakeup. For asynchronous I/O, the buffers are
released immediately by calling bio.c/brelse, so this is not a problem.
.sp 1m
.ne 10
.
.LP
.LG
.B iowait
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
iowait(bp)
.br
struct buf *bp;
.sp 1n
.
.LP
.I RETURNS
.
.LP
No value is explicitly returned, however, bio.c/iowait does cause any I/O
errors to be posted ("u_error").
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Roadblocks a process until the block "bp" has been read or written.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
All synchronous reads or writes must wait for I/O to complete. Waiting for
I/O to complete is done by the higher level routines (bio.c/bread,
bio.c/breada, bio.c/bwrite) calling bio.c/iowait to roadblock the process
requesting the read or write until the I/O is completed. The process is
roadblocked at priority PBIO (-50) to decrease its likelihood of being
swapped. The process is roadblocked until the B_DONE bit in the "b_flags"
entry of the buffer is set by the interrupt handler. If the I/O cannot be
completed by the device driver, the B_DONE bit and the B_ERROR bit in
"b_flags" are set by the device interrupt handler. Bio.c/iodone is called to
set the completion flag (B_DONE) but the interrupt handler sets the error
indication (B_ERROR) itself. Bio.c/iodone sends a wakeup to all processes
waiting (in the bio.c/iowait function) for the buffer. When a process that
is waiting for the buffer is awakened, bio.c/iowait will find the B_DONE
bit set and will call bio.c/geterror to post (in "u_error") any error that
may have occurred.
.sp 1m
.ne 10
.
.LP
.LG
.B physio
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
physio(strategy, bp, device, rdflg)
.br
struct buf *bp;
.br
int (*strategy)();
.sp 1n
.
.LP
.I RETURNS
.
.LP
No value is explicitly returned, however, if an error occurs it is posted
("u_error").
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Performs address mapping and checking for physical I/O.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
Physical unbuffered I/O is the only means by which I/O can be done directly
to or from a user's address space. There
are a number of advantages and disadvantages to using physical I/O.
.IP 1. 3
In addition to no filesystem mapping being performed, the I/O is
unbuffered. This means that the data is read or written directly from the
user's address space. To allow this, two things must be done.
.RS
.IP a. 3
The start address and end address of the user's buffer area must be checked
to see that it lies within the user's virtual address space. For normal
buffered I/O this need not be done. The rdwri.c/iomove function catches any
memory violations when copying data from a user's process into one of the
system's buffers.
.IP b. 3
The bio.c/physio function has no idea as to which block devices are word
oriented so that transfers must specify an even number of bytes.
.RE
.IP 2. 3
Since the I/O occurs directly from the user's program, the program must be
locked in core.
.IP 3. 3
Since I/O occurs directly to or from a user program, all of the advantages
of read ahead and write behind are lost to a program doing physical I/O.
This means that all of the latency (positional and rotational) to perform
the I/O will be experienced by processes doing physical I/O. While it is
true that none of the overhead of copying data from the user's program to a
system buffer is encountered, this is a small amount of time in comparison
with the average positional or even the average rotational latency of the
secondary storage devices that are currently available. There is a point at
which physical I/O will give more throughput to a device than buffered UNIX
I/O; however, this point varies with the speed of the processor (11/40,
11/45, 11/70) and the average rotational latencies of the devices (RK11,
RF11, RS04, RP03, RP04, etc.). Additional positional delays on spiral reads
may also affect the crossover point on the smaller devices (RK11).
.IP 4. 3
Physical I/O, like swap I/O, uses a special buffer header for each device
("rptab", "rktab", etc.). Since there is only one of these headers per
(major) device, only one physical I/O operation to each controller can be
queued at a time. (Most controllers are busy, i.e., not available, while
read or write operations are occurring.)
.
.LP
With the above background in mind, the operation of bio.c/physio will be
described.
.
.LP
When any I/O operation is requested, a common function sys2.c/rdwr is
called to determine what the target file (i-node) is. There are three other
quantities which can be determined by sys2.c/rdwr.
.IP 1. 3
The current position in the file. This may be found in the "f_offset[]"
entry in the File Table and is transferred to the "u_offset[]" entry in the
U block to avoid repeated indirect addressing when referencing it.
.IP 2. 3
The virtual address (in user space) where the transfer is to begin can be
obtained from the second argument of the read or write system call. This is
a byte address and is placed in "u_base" for convenience.
.IP 3. 3
The number of bytes to be transferred can be obtained from the third
argument of the read or write system call. This byte count is placed in
"u_count". With the values in "u_base" and "u_count", the virtual area from
or to which I/O is to be directed is defined.
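.LP
The three quantities above can be pictured with a small sketch in
present-day C. It is an illustration only, not the text of sys2.c/rdwr: the
cut-down structures, the 32-bit offset (standing in for the two-word
"f_offset[]" and "u_offset[]" entries) and the function name "set_transfer"
are assumptions made for the example.
.DS L
```c
#include <stdint.h>

/* Hypothetical, cut-down File Table entry. */
struct file_entry {
    uint32_t f_offset;   /* current position in the file, in bytes */
};

/* Hypothetical, cut-down U block holding only the transfer fields. */
struct user_area {
    uint32_t u_offset;   /* copy of f_offset, avoids repeated indirection */
    uint16_t u_base;     /* virtual byte address of the user's buffer */
    uint16_t u_count;    /* number of bytes to be transferred */
};

/* Record the file position, the buffer address and the byte count. */
void set_transfer(struct user_area *u, const struct file_entry *fp,
                  uint16_t buf, uint16_t nbytes)
{
    u->u_offset = fp->f_offset;  /* from the File Table */
    u->u_base = buf;             /* second argument of read or write */
    u->u_count = nbytes;         /* third argument of read or write */
}
```
.DE
.LP
After such a call, "u_base" and "u_count" together delimit the virtual area
involved in the transfer.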
.
.LP
Since all block device controllers work from physical addresses instead of
virtual addresses, it is necessary to relocate "u_base". (For regular 512
byte UNIX I/O, relocation need not be performed because the system buffers
have the same physical and virtual addresses.) The operating system
assigns physical space to a process when it is created, a process may only
initiate I/O on data within its own address space, and there are no holes
in physical memory (that is, it is contiguous from beginning to end).
Consequently, bio.c/physio need not check whether the buffer area extends
beyond the limits of physical memory. It is sufficient to ensure that the
buffer is within the virtual address space of the user's process.
.
.LP
One additional restriction is placed on the user's buffer. The buffer must
not be within the text area of the user's process. It may only be within
the data or bss area of the user's process. This is done because requesting
a physical write into the text area of a reentrant program would cause a
Segmentation Violation by the controller. (The ability of a program to
write its own text out is also lost.) No equivalent restriction exists for
regular 512 byte UNIX I/O, so that a program may read or write into its
own text area. (If the program is reentrant, requesting a write into the
text area will result in a Segmentation Violation since reentrant text is
write protected by the system. The rdwri.c/iomove function would detect
this when the system buffer was copied into the write protected text.) At
present, when user programs are separated into I and D space areas, the
issue of reading or writing the text area will become superfluous.
.
.LP
Checking whether the buffer is within the data, stack or text area can be
done by checking the quantities "u_dsize", "u_ssize" and "u_tsize". These
are the sizes of the data (and bss), stack and text areas in memory blocks
(64 bytes). In order to do the checking, bio.c/physio must know how the
text, data and stack areas are loaded (by main.c/estabur) in the system.
The text area is loaded in the low physical (and virtual) address space,
followed by the data (and bss) area. The stack area is allocated
(physically) directly below the data area (however, in virtual address
terms the stack is in the high virtual address area). The checks that are
made are:
.IP 1. 3
A check is made to see that the user's buffer begins on an even address
boundary and that the byte count specifies an even number of bytes (i.e.,
ends on a word boundary). This is done because most block devices cannot
transfer an odd number of bytes or fetch from an odd address.
.IP 2. 3
A check is made to see that the buffer does not extend beyond the limits of
the virtual address space of the program. This is done by checking to see
that the address of the end of the buffer ("u_base" + "u_count") is greater
than the address of the beginning of the buffer ("u_base").
.IP
A check is also made to see that the buffer is not within the text area.
This is done by examining the address of the user's buffer ("u_base") to
see if it is beyond the last address in the text area ("u_tsize"). Before
this check can be made, the "u_tsize" address must be rounded up to the
nearest 4K virtual memory address. This is due to the way virtual address
space is allocated by main.c/estabur and the way programs are loaded by the
UNIX loader. This address is compared with the starting address of the
buffer "u_base" to see if the buffer begins within the (adjusted) virtual
address space of the text.
.IP 3. 3
Another check tests to see that the transfer does not span the virtual
address gap between the end of the data area and the beginning of the stack
area. This is done by determining whether the end of the buffer is within
the data area of the program (the start of the buffer was checked above) or
the start of the buffer is within the stack area (extending beyond the
virtual address space of the program was checked in 2).
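.LP
The checks listed above can be sketched in present-day C as follows. This
is a reading of the description, not the text of bio.c/physio. The function
name "phys_range_ok" is hypothetical, addresses are modelled as 16-bit
quantities, the sizes are in 64 byte memory blocks, and the "4K" rounding
is assumed to mean 4K words (128 memory blocks).
.DS L
```c
#include <stdint.h>

#define BLOCKS_PER_SEG 128u   /* assumed: 4K words = 128 blocks of 64 bytes */
#define TOTAL_BLOCKS   1024u  /* 64K byte virtual space in 64 byte blocks */

/* Return 1 if the buffer passes every check, 0 otherwise. */
int phys_range_ok(uint16_t base, uint16_t count,
                  unsigned tsize, unsigned dsize, unsigned ssize)
{
    uint16_t end = (uint16_t)(base + count);  /* wraps modulo 64K */
    unsigned ts, first, last;

    /* Check 1: even start address and even byte count. */
    if ((base & 1u) || (count & 1u))
        return 0;

    /* Check 2: the end must lie above the start (no wraparound). */
    if (end <= base)
        return 0;

    /* Text check: round the text size up, then compare block numbers. */
    ts = (tsize + BLOCKS_PER_SEG - 1u) & ~(BLOCKS_PER_SEG - 1u);
    first = base >> 6;
    last = end >> 6;
    if (first < ts)
        return 0;

    /* Check 3: reject a buffer whose end is past the data area while
     * its start is still below the stack area (it would span the gap). */
    if (last >= ts + dsize && first < TOTAL_BLOCKS - ssize)
        return 0;

    return 1;
}
```
.DE
.LP
With a text size of 100 blocks (rounded to 128), a data size of 64 blocks
and a stack size of 16 blocks, a 512 byte buffer at address 8192 is
accepted, while one at address 4096 (inside the text) or a 1024 byte buffer
at address 12000 (running off the end of the data) is rejected.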
.
.LP
With the above checks made, it can be guaranteed that the I/O will not
abort due to addressing errors unless there is a genuine hardware problem.
(The I/O may also abort due to an illegal disk address being specified, i.e.,
"u_offset[]".)
.
.LP
After the above checks have been made, the buffer header can be set up for
the transfer. Since there is only one buffer header per major device, only
one process at a time can use it to do physical I/O to a particular device.
This is achieved by setting the busy (B_BUSY) flag in the "b_flags" entry of
the buffer header when a process enters the address computation part of
bio.c/physio. The busy flag is reset by bio.c/physio only after the I/O has
been completed. (The device interrupt handler will indicate the completion
of I/O by setting the B_DONE flag.)
.
.LP
Any process which tries to do physical I/O to a device on which physical
I/O is already being done (B_BUSY set) is roadblocked and the wanted flag
(B_WANTED) is set in the header. When the process doing physical I/O on the
device has finished using the buffer header, all processes waiting to do
physical I/O on that device are awakened and the B_WANTED and B_BUSY flags
are reset. (Under physical I/O, there is no need to copy the data into the
user's address space as for normal I/O, so the buffer header is busy only
as long as it takes to set up and do the I/O.)
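.LP
The B_BUSY and B_WANTED protocol described above can be sketched as two
small functions. This is a model, not kernel code: the function names are
hypothetical, the flag values are illustrative, and the actual roadblocking
and awakening of processes is left to the caller, which is told when to do
so by the return values.
.DS L
```c
/* Illustrative flag values; the real definitions live in the kernel. */
#define B_BUSY   010
#define B_WANTED 0100

/* Hypothetical, cut-down physical I/O buffer header. */
struct phys_hdr {
    int b_flags;
};

/*
 * Try to claim the header. Returns 1 if the caller now owns it.
 * Returns 0 if it is busy; B_WANTED is then set and the caller
 * is expected to roadblock and try again when awakened.
 */
int hdr_acquire(struct phys_hdr *bp)
{
    if (bp->b_flags & B_BUSY) {
        bp->b_flags |= B_WANTED;
        return 0;
    }
    bp->b_flags |= B_BUSY;
    return 1;
}

/*
 * Give the header up after the I/O has completed. Both flags are
 * reset; returns 1 if waiting processes should be awakened.
 */
int hdr_release(struct phys_hdr *bp)
{
    int wanted = (bp->b_flags & B_WANTED) != 0;

    bp->b_flags &= ~(B_BUSY | B_WANTED);
    return wanted;
}
```
.DE
.LP
In the kernel the test and the setting of the flag would have to be
protected against interrupts; that detail is omitted from the sketch.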
.
.LP
Setting up the physical I/O buffer header is done in a manner similar to
that of bio.c/getblk for buffered I/O. The chief difference is that memory
extension bits must be calculated and the virtual address specified in the
read must be relocated to a physical address. The physical address is
determined by consulting the User Memory Management registers and
determining what physical area of memory is currently mapped by the virtual
addresses and adjusting the buffer address appropriately. The quantities
that are set in the buffer header are:
.nr PI 18p
.XP
"b_addr" is set to the physical address corresponding to the start of the
user's buffer.
.XP
"b_flags" is set up to contain any memory extension bits, to indicate that
the buffer header is busy (B_BUSY) and to indicate whether a read or write
(the argument "rdflg") is to be performed.
.XP
"b_blkno" is set to the block on the device where the transfer is to start.
This means that reads or writes are restricted to begin on 512 byte
boundaries on the device (on record boundaries for magnetic tape).
Requesting information that does not begin on a 512 byte boundary results
in the read or write starting at the nearest 512 byte boundary anyway.
.XP
"b_wcount" is set to the word count for the transfer. This has implications
for use on certain block devices (all disks) for transfers that are
smaller than a sector (sector sizes vary among disks from
the word addressable RF11 to the 512 byte larger disks RP03, RK11, RP04,
etc.). On these devices, a request for I/O that is not a multiple of the
sector size will result in zeros being padded in the last sector written.
(This would destroy any data that existed in the remainder of that sector.)
.XP
"b_error" is set to zero. This is for future use in reporting individual
error numbers from devices.
.XP
"b_dev" is set to the device number for the transfer (the argument
"device").
.
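.LP
The relocation and the filling in of the header can be sketched as follows.
The sketch assumes the usual PDP-11 memory management layout (the top three
bits of a virtual address select one of eight relocation registers, each
holding a base in 64 byte memory blocks) and an 18-bit physical address.
The names "relocate", "phys_setup" and "ext_bits" are hypothetical; in
particular the extension bits are kept in a separate field here, because
their position within "b_flags" is not given above, and the word count is
stored as a plain positive number.
.DS L
```c
#include <stdint.h>

/* Illustrative flag values. */
#define B_READ 01
#define B_BUSY 010

/* Hypothetical, cut-down physical I/O buffer header. */
struct phys_hdr {
    int      b_flags;
    int      ext_bits;  /* assumed: address bits above the low 16 */
    int      b_dev;
    uint16_t b_addr;    /* low 16 bits of the physical address */
    uint16_t b_blkno;   /* starting 512 byte block on the device */
    uint16_t b_wcount;  /* words to transfer */
    int      b_error;
};

/* Translate a virtual byte address using eight relocation registers. */
uint32_t relocate(const uint16_t par[8], uint16_t va)
{
    uint32_t block = (uint32_t)par[va >> 13] + ((va >> 6) & 0177u);

    return (block << 6) | (va & 077u);
}

/* Fill in the header for one physical transfer. */
void phys_setup(struct phys_hdr *bp, const uint16_t par[8],
                uint16_t base, uint16_t count, uint32_t offset,
                int device, int rdflg)
{
    uint32_t pa = relocate(par, base);

    bp->b_flags = B_BUSY | (rdflg ? B_READ : 0);
    bp->ext_bits = (int)(pa >> 16);
    bp->b_addr = (uint16_t)(pa & 0xFFFFu);
    bp->b_blkno = (uint16_t)(offset >> 9);  /* truncates to a boundary */
    bp->b_wcount = (uint16_t)(count >> 1);  /* bytes to words */
    bp->b_error = 0;
    bp->b_dev = device;
}
```
.DE
.LP
Note that the shift applied to the file offset discards the low nine bits,
which is why a request that does not begin on a 512 byte boundary starts at
a block boundary anyway.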
.LP
Once all of these parameters have been set up in the buffer header, the
device strategy routine ("d_strategy" in "bdevsw") can be called to queue
the header on the device. It should be noted that unlike buffered I/O, where
the buffers are chained onto the device queue (by bio.c/getblk) so that
they may be found later, the physical I/O headers are not queued on the
"b_forw", "b_back" chain of the device queue.
.
.LP
One additional precaution must be taken when doing physical I/O. That is,
the process must be locked in core so that it is not swapped out while the
I/O is occurring. To do this, the SLOCK bit is set in the appropriate
Process Table entry ("p_flag"). Once the I/O is completed the lock bit is
reset.
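.LP
The locking described above brackets the call to the strategy routine. The
following sketch shows the order of events only. The value of SLOCK, the
cut-down structures and the names "locked_io" and "demo_strategy" are
assumptions, and the stand-in strategy routine completes at once instead of
queuing the request and leaving B_DONE to be set by an interrupt handler.
.DS L
```c
/* Illustrative flag values. */
#define SLOCK  04
#define B_DONE 02

struct proc_entry { int p_flag; };  /* hypothetical Process Table entry */
struct phys_hdr   { int b_flags; }; /* hypothetical buffer header */

/* Bookkeeping for the demonstration only. */
struct proc_entry *demo_proc;
int locked_during_io;

/* Stand-in strategy routine: notes the lock state, finishes at once. */
void demo_strategy(struct phys_hdr *bp)
{
    locked_during_io = (demo_proc->p_flag & SLOCK) != 0;
    bp->b_flags |= B_DONE;
}

/*
 * Lock the process in core, start the I/O, and unlock afterwards.
 * A real kernel would roadblock here until B_DONE is set; the sketch
 * merely reports whether the transfer has completed.
 */
int locked_io(struct proc_entry *p, struct phys_hdr *bp,
              void (*strategy)(struct phys_hdr *))
{
    int done;

    p->p_flag |= SLOCK;    /* must not be swapped out during the I/O */
    (*strategy)(bp);
    done = (bp->b_flags & B_DONE) != 0;
    p->p_flag &= ~SLOCK;   /* swapping is permitted again */
    return done;
}
```
.DE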
.
.LP
Error reporting is done in the same manner as for normal I/O (by calling
bio.c/geterror). Any of the addressing errors checked are reported directly
in "u_error" by setting the system error EFAULT.
.
.LP
Physical I/O uses the residual word count entry ("b_resid") in order to
allow the device drivers to report exactly how many words were read or
written.
.sp 1m
.ne 10
.
.LP
.LG
.B swap
.SM
.sp 1n
.
.LP
.I CALL
.
.LP
swap(blkno, coreaddr, count, rdflg)
.sp 1n
.
.LP
.I RETURNS
.
.LP
Returns a 1 on error.
.ne 4
.sp 1n
.
.LP
.I SYNOPSIS
.
.LP
Performs physical I/O to the swap device.
.ne 4
.sp 1n
.
.LP
.I DESCRIPTION
.
.LP
This is essentially a stripped down version of bio.c/physio (which does I/O
directly from the user's address space). The reason that it is a separate
function is that the physical I/O function is serially reusable (per
device) and multiple requests for I/O to the same device are queued (i.e.,
only one can take place to a device at a time). Using only one common
function might mean that the swap would compete with physical I/O if the
swap device was on the same device as the physical I/O. Another reason for
their separation is that since the system has control of its resources
(i.e., it knows how much memory or swap space is available and where it
is), there is no need to worry about validating any of the addresses or
word counts.
.
.LP
A special buffer header ("swbuf") is used for the swap device. The
information in this header is filled in as is done by bio.c/getblk for
normal I/O. The data filled in is:
.
.LP
"b_flags" - The B_BUSY flag is used to indicate whether the buffer is in
use or not and the B_WANTED flag is set if any other process requires use
of the swapper. (The Scheduler is not the only process that does swapping.
A process may swap itself out.) The B_READ flag is also set based on the
value of "rdflg" passed to bio.c/swap. (A 0 in the B_READ bit position
indicates that a write is to be done.) Address extension bits are computed
for placing in "b_flags" based on the value of "coreaddr". This argument
contains the number of memory blocks (32 words/block) that are to be
swapped out. When a process is swapped it is swapped in one of two forms.
.IP 1. 3
If it is reentrant, the U block and data (data, bss and stack) are swapped
as one piece. When the process was created the reentrant text was created
and remains separately on the swap area so that it need never be swapped
out.
.IP 2. 3
If it is nonreentrant the U block, text and data (data, bss, stack) are
swapped as one piece.
.
.LP
The parameters loaded are:
.nr PI 18p
.XP
"b_dev" - set to the swap device "swapdev" which was specified when the
system was compiled (conf.c).
.XP
"b_addr" - set to the lower 16 bits of the address from or to which data is
to be swapped. The argument "coreaddr" is in granularity of memory block
(32 words/block) and so must be adjusted to a byte address (left shift 6
bits).
.XP
"b_wcount" - set to the word count for the transfer. It is obtained from the
"count" argument which is in memory blocks (left shift 5 bits for word
granularity).
.XP
"b_blkno" - set to the logical block number of the destination, taken from
"blkno".
.
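.LP
The unit conversions used when loading the swap header can be sketched as
follows. The function name "swap_setup" and the separate "ext_bits" field
are hypothetical (the position of the extension bits within "b_flags" is
not given above), the flag values are illustrative, and the word count
entry is written "b_wcount" as in the bio.c/physio description.
.DS L
```c
#include <stdint.h>

/* Illustrative flag values. */
#define B_READ 01
#define B_BUSY 010

/* Hypothetical, cut-down swap buffer header. */
struct swap_hdr {
    int      b_flags;
    int      ext_bits;  /* assumed: address bits above the low 16 */
    int      b_dev;
    uint16_t b_addr;    /* low 16 bits of the core byte address */
    uint16_t b_wcount;  /* words to transfer */
    uint16_t b_blkno;   /* block number on the swap device */
};

/* coreaddr and count are given in memory blocks of 32 words. */
void swap_setup(struct swap_hdr *bp, int swapdev, uint16_t blkno,
                uint16_t coreaddr, uint16_t count, int rdflg)
{
    uint32_t byteaddr = (uint32_t)coreaddr << 6;  /* blocks to bytes */

    bp->b_flags = B_BUSY | (rdflg ? B_READ : 0);
    bp->ext_bits = (int)(byteaddr >> 16);
    bp->b_dev = swapdev;
    bp->b_addr = (uint16_t)(byteaddr & 0xFFFFu);
    bp->b_wcount = (uint16_t)(count << 5);        /* blocks to words */
    bp->b_blkno = blkno;
}
```
.DE
.LP
For example, a "coreaddr" of 1025 memory blocks corresponds to byte address
65600, which leaves 64 in "b_addr" and 1 in the extension bits, and a
"count" of 8 memory blocks becomes a word count of 256.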
.LP
The process requesting the swap is roadblocked until the swap is completed
and any other process requesting a swap while the swap buffer is busy is
roadblocked until the swap buffer is free. Swapping has the highest
software priority in the system when roadblocking a process (even higher
than normal I/O). Any error occurring during the swap is reported by
returning a nonzero value to the caller and in the case of the Scheduler
any error results in a system panic ("PANIC SWAP DEVICE").
