The block driver API has evolved and can be expected to continue to do so. Compared to the 2.2 series kernel, the
2.4 kernel has simplified the low level driver with a cleaner interface to the VFS. The 2.6 kernel has streamlined I/O
operations by adding a new data structure designed for I/O operations in progress. Although an in-depth
understanding of the buffer cache (and its interaction with the paging system) would be useful, it is currently
considered outside the scope of this course. However, we will introduce some of the kernel data structures that are
important. Even though they are not used directly by the driver, their description will give us a little deeper
understanding.
From a hardware point of view the sector is the smallest addressable unit on a block device. This is, in bytes, a
power of two. The typical sector size is 512 bytes. From the viewpoint of the software supporting the file system, the
smallest logically addressable unit is the block. We expect it to hold a number of sectors, which is a power of 2. The
ext2 file system allows this size to be specified when the file system is created and common choices include 1024,
2048, or 4096 bytes. Of course, when we bring a file into memory, we must put the blocks into pages; so we expect
a page to hold some number of blocks, again a power of 2. Page size is typically dictated by the architecture and for
the IA32, this is 4096 bytes (there is a very large page size available, as well).
Linux provides a data structure called the buffer_head to fully describe the file block brought into memory, one
buffer_head per block. This is shown on the next page. Many of the fields would be what we might expect, others
less obvious. Perhaps the major feature to notice is its size, since there is one of these for each file block.
atomic_t b_count; /* users using this block */
kdev_t b_rdev; /* Real device */
unsigned long b_state; /* buffer state bitmap (see above) */
unsigned long b_flushtime; /* Time when (dirty) buffer should be written */
struct buffer_head *b_next_free; /* lru/free list linkage */
struct buffer_head *b_prev_free; /* doubly linked list of buffers */
struct buffer_head *b_this_page; /* circular list of buffers in one page */
struct buffer_head *b_reqnext; /* request queue */
struct buffer_head **b_pprev; /* doubly linked list of hashqueue */
char * b_data; /* pointer to data block */
struct page *b_page; /* the page this bh is mapped to */
void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
void *b_private; /* reserved for b_end_io */
unsigned long b_rsector; /* Real buffer location on disk */
wait_queue_head_t b_wait;
struct list_head b_inode_buffers; /* doubly linked list of inode dirty buffers */
};
In the 2.4 kernel, this was also the data item used for I/O operations. It was realized that much of the information in
this struct was extraneous to the support of the actual data structure – and there was one of these for every block in
the transfer. Consequently, in the 2.6 kernel a new data structure, the bio struct, was devised to streamline the block
I/O operations. There is one bio struct for every I/O operation, but such an operation could involve many blocks. In
our discussion we will focus on the bio_vec struct, which the bio struct references through one of its fields. This is
the new representation of the block for supporting I/O operations. Here is the small bio_vec struct:
struct bio_vec {
struct page *bv_page;
unsigned int bv_len;
unsigned int bv_offset;
};
where
● bv_page is a reference to the page on which the block is located
● bv_len is the number of bytes in the block
● bv_offset is the position of the block within the page
To see how this is used, we will need to reference three other fields in the bio struct:
bi_io_vec
bi_idx
bi_vcnt
Before embarking on that short discussion, we'll show the entire bio struct on the next page.
/* Number of segments in this BIO after
* physical address coalescing is performed.
*/
unsigned short bi_phys_segments;
/* Number of segments after physical and DMA remapping
* hardware coalescing is performed.
*/
unsigned short bi_hw_segments;
bio_end_io_t *bi_end_io;
atomic_t bi_cnt; /* pin count */
void *bi_private;
When the kernel wants to conduct a block I/O operation it constructs a bio struct. The I/O transfer will involve a list
of bio_vec structures where bi_io_vec points to the start of the list. The number of bio_vec's in that list is bi_vcnt.
The bio_vec that is being currently handled is indexed from the start of the list by bi_idx.
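The relationship between bi_io_vec, bi_vcnt, and bi_idx can be sketched in ordinary user-space C. This is only a model, not kernel code: the struct names ending in _model and the function bio_remaining_bytes() are hypothetical, and the page is represented by a plain 4096-byte array. The loop starts at the segment currently being handled and adds up the bytes still to be transferred.

```c
#include <stddef.h>

/* User-space stand-ins for the kernel structures (hypothetical names). */
struct page_model {
    unsigned char data[4096];
};

struct bio_vec_model {
    struct page_model *bv_page;   /* page holding the block */
    unsigned int bv_len;          /* number of bytes in the block */
    unsigned int bv_offset;       /* position of the block within the page */
};

struct bio_model {
    struct bio_vec_model *bi_io_vec;  /* start of the bio_vec list */
    unsigned short bi_vcnt;           /* number of bio_vec entries */
    unsigned short bi_idx;            /* entry currently being handled */
};

/* Sum the bytes from the current segment to the end of the list. */
size_t bio_remaining_bytes(const struct bio_model *bio)
{
    size_t total = 0;
    unsigned short i;

    for (i = bio->bi_idx; i < bio->bi_vcnt; i++)
        total += bio->bi_io_vec[i].bv_len;
    return total;
}
```

For a bio with three 1024-byte segments and bi_idx equal to 1, the function returns 2048, since the first segment has already been handled.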
From a userland programmer's point of view the block and character devices look similar because both use the VFS
layer to initiate access. Hence, to use either, one employs similar library calls to do such things as
• open the device
• read or write from it
• close the device
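A minimal user-space sketch of that open, read, close sequence follows. The helper name read_from_device() is hypothetical; the calls themselves are the standard POSIX ones, and the same code works whether the path names a character device, a block device, or a regular file.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to count bytes from the start of a device node (or any file).
   Returns the number of bytes read, or -1 if the open or read fails. */
ssize_t read_from_device(const char *path, void *buf, size_t count)
{
    ssize_t n;
    int fd = open(path, O_RDONLY);      /* open the device */

    if (fd < 0)
        return -1;
    n = read(fd, buf, count);           /* read from it */
    close(fd);                          /* close the device */
    return n;
}
```

A caller would pass the device node for the driver of interest; the node name depends on how the device was created with mknod and is not fixed by this sketch.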
Of course, there are differences as well. From the viewpoint of the device driver developer, the API for both is
similar in intent, but different in the details. The similarities lie in what must be done. In particular, the device driver
programmer must provide:
• an initialization routine to probe for and allocate resources - the entry point corresponding to insmod
• a cleanup routine which frees resources and does any other cleanup - the entry point corresponding to rmmod
• other low level functions whose entry points correspond to using the device once installed
• the kernel with entry point information about these other low level functions by registering the information (a
data structure) with the kernel
The differences lie in the details of how things are done as well as in some differences in needed functionality.
In this chapter we will only look at the block driver API for kernels in the 2.4 version series. This API changed
somewhat dramatically from the 2.2 series - more so than did the character device API.
Clearly, there is a loose end. In particular, in contrast to the file_operations struct, we see no read or write
functionality. We will come back to that shortly. In the block_device_operations struct, we do see the expected
• open
• release
• ioctl
plus functions that deal with removable media.
Typical block devices are the various kinds of disk drives. From the CPU's viewpoint, these are relatively slow
electromechanical devices. To enhance data transfer efficiency, we would expect some sort of caching/buffering
strategy. Linux employs a dynamic cache system using physical memory left over from what is required for the
kernel and user processes. This leads to a significant difference between character device drivers and block device
drivers:
● If a user program makes read/write library calls to access a character device, the VFS passes these requests on to
the low level read/write functions in the driver
● If a user program makes read/write library calls to access a block device, the VFS does not pass these requests
on to low level read/write functions in the driver. Instead, the block_read() and block_write() functions (see
/usr/src/linux/fs/block_dev.c) are used so that the user interaction is with the buffer, not with the driver.
Clearly, if a block device driver does not have a direct read/write interaction with the user program because of the
intervening buffer, then it must provide functions to keep the buffer appropriately up to date. In short, the device
driver must provide low level read/write functionality to interact with the buffer as needed. These low level
functions are not entry points triggered directly by user programs and hence do not belong in the
block_device_operations struct. Nevertheless, the kernel must know about the block driver's low level read/write
functionality.
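The division of labor described above can be illustrated with a small user-space model. It is a sketch under simplifying assumptions, not the kernel's buffer cache: all names (cached_read, driver_read_block, and so on) are hypothetical, the cache never evicts anything, and writes are omitted. The point it shows is that the user-facing read talks only to the cache, and the driver's low level read is invoked only when the cache does not yet hold the block.

```c
#include <string.h>

#define MODEL_BLOCK_SIZE 1024
#define MODEL_NBLOCKS    8

static char device_blocks[MODEL_NBLOCKS][MODEL_BLOCK_SIZE]; /* the slow device */
static char cache_blocks[MODEL_NBLOCKS][MODEL_BLOCK_SIZE];  /* the buffer */
static int  cache_valid[MODEL_NBLOCKS];                      /* 1 if cached */
static int  driver_reads;                                    /* driver call count */

/* Stands in for the driver's low level read; called only on a cache miss. */
static void driver_read_block(int blk, char *dst)
{
    memcpy(dst, device_blocks[blk], MODEL_BLOCK_SIZE);
    driver_reads++;
}

/* Stands in for the buffered read path used on behalf of the user program.
   Returns 0 on success, -1 if the block number is out of range. */
int cached_read(int blk, char *dst)
{
    if (blk < 0 || blk >= MODEL_NBLOCKS)
        return -1;
    if (!cache_valid[blk]) {
        driver_read_block(blk, cache_blocks[blk]);
        cache_valid[blk] = 1;
    }
    memcpy(dst, cache_blocks[blk], MODEL_BLOCK_SIZE);
    return 0;
}

/* Number of times the driver was actually asked for data. */
int driver_read_count(void)
{
    return driver_reads;
}
```

Reading the same block twice through cached_read() increases driver_read_count() only once, which is the behavior that distinguishes the block device path from the character device path.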
A given device will have its own request queue (typically just one), so that merging and sorting the requests for that
device makes sense.
disk geometry
It is typically necessary to initialize certain disk geometry parameters; although, in some cases, there are default
values that may be appropriate. These parameters are array members organized by major (and perhaps, minor)
number e.g.
• blk_size[major][minor] - size of device in kbytes
• blksize_size[major][minor] - size of a block in bytes
• hardsect_size[major][minor] - sector size in bytes
The corresponding definitions along with other parameters can be found in /usr/src/linux/drivers/block/ll_rw_blk.c.
We'll see examples of initializing such parameters in the next chapter.
A block device driver transfers data grouped as a large number of adjacent bytes, called a block. Linux requires that
the number of bytes
• be a power of 2
• not exceed the page size
• include an integral number of disk sectors
It is somewhat typical to have a 512 byte sector and a 4096 byte page, implying that acceptable block sizes are 512,
1024, 2048, and 4096 bytes. For each block there is a buffer - an area of RAM that holds a copy of the block's data,
for efficient program access. A buffer is described by a data structure called a buffer head, which contains all salient
data about the buffer.
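The three requirements on the block size translate directly into a small check. The following user-space sketch uses hypothetical function names; the sector and page sizes are passed in as parameters rather than taken from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if n is a nonzero power of 2. */
bool is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* True if block satisfies all three requirements:
   a power of 2, no larger than a page, and a whole number of sectors. */
bool is_valid_block_size(size_t block, size_t sector, size_t page)
{
    if (sector == 0)
        return false;
    return is_power_of_two(block)
        && block <= page
        && block % sector == 0;
}
```

With a 512 byte sector and a 4096 byte page, the check accepts 512, 1024, 2048, and 4096, and rejects values such as 256 (smaller than a sector), 1536 (not a power of 2), and 8192 (larger than a page).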
When the kernel needs to access a disk block, it creates a block device request. This is described in the request
struct (see /usr/src/linux/include/linux/blkdev.h). A request may contain a number of adjacent blocks, so that the
request struct includes references to the first and last buffer heads for those adjacent blocks. The buffer_head struct
itself points to the next buffer head in a simple linked list.
Requests are placed in a request_queue - in fact, the request struct has a pointer to the request_queue in which it is a
member. Among the fields in the request_queue are references to
• the request function (request_fn) for processing the request queue (see next section)
• a set of functions for merging requests (for disk access efficiency)
• a function for making a request
There is a blk_dev array indexed by major number. Each element of the array represents a particular block device as
a blk_dev_struct. This struct references
• the device's request_queue
• the (atomic) procedure for processing that queue
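To make the relationship among these structures concrete, here is a user-space model. It is a simplified sketch, not the kernel's definitions: every name containing "model" is hypothetical, request merging is left out, and the table holds the queue directly rather than a blk_dev_struct. It shows a table indexed by major number, a queue per device, and a request function that the driver registers and the kernel side later invokes to drain the queue.

```c
#include <stddef.h>

#define MODEL_MAX_BLKDEV 255   /* hypothetical table size for this sketch */

struct request_model {
    unsigned long sector;               /* first sector of the request */
    struct request_model *next;         /* next request in the queue */
};

struct request_queue_model {
    struct request_model *head;         /* oldest pending request */
    struct request_model *tail;         /* newest pending request */
    void (*request_fn)(struct request_queue_model *q);
};

/* One entry per major number, in the spirit of the blk_dev array. */
static struct request_queue_model blk_dev_model[MODEL_MAX_BLKDEV];
static unsigned long handled_count;     /* requests completed so far */

/* Example request function: completes every request in the queue. */
void counting_request_fn(struct request_queue_model *q)
{
    while (q->head != NULL) {
        q->head = q->head->next;
        handled_count++;
    }
    q->tail = NULL;
}

/* Driver side: register the request function for a major number. */
void model_init_queue(unsigned int major,
                      void (*fn)(struct request_queue_model *q))
{
    if (major >= MODEL_MAX_BLKDEV)
        return;
    blk_dev_model[major].head = NULL;
    blk_dev_model[major].tail = NULL;
    blk_dev_model[major].request_fn = fn;
}

/* Kernel side: append a request to the device's queue. */
void model_add_request(unsigned int major, struct request_model *rq)
{
    struct request_queue_model *q;

    if (major >= MODEL_MAX_BLKDEV)
        return;
    q = &blk_dev_model[major];
    rq->next = NULL;
    if (q->tail != NULL)
        q->tail->next = rq;
    else
        q->head = rq;
    q->tail = rq;
}

/* Kernel side: ask the driver to process its queue. */
void model_run_queue(unsigned int major)
{
    if (major < MODEL_MAX_BLKDEV && blk_dev_model[major].request_fn != NULL)
        blk_dev_model[major].request_fn(&blk_dev_model[major]);
}

unsigned long model_handled(void)
{
    return handled_count;
}
```

The design point to notice is that the driver never sees individual user reads or writes; it only supplies the function that empties the queue when asked.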
Here, it is useful to include examples that include some of the other context. The following fragments are excerpted
from the ramdisk source code which will be covered in the next chapter:
/* Here is the block_device_operations structure holding entry point
references to the driver's open, release, and ioctl. The removable media
functionality is not needed for a ramdisk.
*/
static struct block_device_operations radimo_fops =
{
ioctl: radimo_ioctl,
open: radimo_open,
release: radimo_release,
};
/* The following is from the initialization and registration portion of the
code. We see that the register_blkdev() function registers the major number
(RADIMO_MAJOR) and the block_device_operations structure as discussed earlier.
*/
res = register_blkdev(RADIMO_MAJOR, "radimo", &radimo_fops);
/* The ramdisk device driver programmer has already encoded the low level
read/write functionality in a function radimo_request, which is the
request_fn. Recall that the driver is responsible for initializing the
request_queue and also informing the kernel.
*/
blk_init_queue(BLK_DEFAULT_QUEUE(RADIMO_MAJOR), &radimo_request);
/* the expected unregister_blkdev() pairing the register_blkdev() found in
the earlier init_module
*/
res = unregister_blkdev(RADIMO_MAJOR, "radimo");
/* similarly, blk_cleanup_queue() pairs with the blk_init_queue() found in
the earlier init_module()
*/
blk_cleanup_queue(BLK_DEFAULT_QUEUE(RADIMO_MAJOR));
Starting in Section 8.4 we will look at a simplified ramdisk driver and see an example of an actual request_fn and
how it traverses the request_queue.
These are triggered by user programs which use the library calls open and close, as usual. Typically when a file is
opened the open command passes a pointer to a file struct and that structure is then bound to the process which
performed the open. Mounting also uses the open command, but the file structure passed is quite different and does
not become bound to the mount process. This 'quite different' file struct has only one field that is relevant i.e. the
f_mode field. The other fields should not be used. The f_mode tells the driver to open in one of two modes:
• read-only via f_mode = FMODE_READ
• read-write via f_mode = FMODE_READ | FMODE_WRITE
After the mount command is finished, the mounted file system remains, of course. Once the file system is mounted,
the kernel manages the files using the low level read and write methods accessed via the driver's request_fn
discussed earlier. Any process which opens a file within the mounted file system will still use the block_read() and
block_write() via the VFS.
Unmounting, done with the umount command, flushes the buffer cache and calls the driver's release. The release
function will be passed NULL as its file pointer since the file struct is not meaningful.
Recall that for character drivers, ioctl() was a catchall for any hardware commands the specific driver might need.
This is also true for the block drivers, but there is also a set of commonly used commands which usually suffice (see
<linux/fs.h>); for example:
• BLKROSET - set device read-only
• BLKGETSIZE - return device size
• BLKFLSBUF - flush buffer cache
• BLKRASET - set the read ahead value
• BLKRAGET - get the read ahead value
There are on the order of a dozen of these ioctl commands. The ioctl() function in the driver could handle these, as
well as any custom commands, in a switch structure as usual.
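From the user side, one of these standard commands can be issued as in the sketch below. The helper name query_device_size() is hypothetical; BLKGETSIZE comes from <linux/fs.h> and reports the device size in 512 byte sectors through a pointer to unsigned long. The program must run on Linux and needs permission to open the device node.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKGETSIZE */

/* Ask a block device for its size in 512 byte sectors.
   Returns 0 and fills *sectors on success, -1 on any failure. */
int query_device_size(const char *path, unsigned long *sectors)
{
    int rc;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    rc = ioctl(fd, BLKGETSIZE, sectors);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

The other commands in the list follow the same pattern: open the device node, call ioctl() with the command and an argument of the type that command expects, then close the descriptor.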
These are intended for use with removable media. Their functionality is:
• check_media_change - returns 1 if the media has been changed since last access, but otherwise returns 0
• revalidate - typically updates internal status information to reflect new media; called after media change is
detected
The kernel automatically checks for media change on mounting a device. If the driver keeps status information
concerning a removable device, it should check for media change (and revalidate on change) in the open command
as well.
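The interplay between the two removable media functions and open can be modeled in user space as follows. This is an illustrative sketch only: the drive_model struct and the model_ functions are hypothetical, and a real driver would query the hardware rather than a flag.

```c
/* Hypothetical model of a removable drive and the driver's cached status. */
struct drive_model {
    int media_changed;               /* set by "hardware" when media is swapped */
    unsigned long media_sectors;     /* capacity of the media now in the drive */
    unsigned long cached_sectors;    /* capacity the driver believes is present */
    int revalidations;               /* how many times revalidate has run */
};

/* Returns 1 if the media changed since the last check, otherwise 0. */
int model_check_media_change(struct drive_model *d)
{
    int changed = d->media_changed;

    d->media_changed = 0;
    return changed;
}

/* Refresh the driver's status information to reflect the new media. */
int model_revalidate(struct drive_model *d)
{
    d->cached_sectors = d->media_sectors;
    d->revalidations++;
    return 0;
}

/* An open that checks for a media change and revalidates when needed. */
int model_open(struct drive_model *d)
{
    if (model_check_media_change(d))
        model_revalidate(d);
    return 0;
}
```

Opening the device twice after a single media swap revalidates only once, because the first check clears the change indication.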
Of course, what is missing in RAM disk examples is a real hardware device - other than RAM. Hence, we see no
role for interrupts etc. Our suggestion for further study beyond this chapter is to:
• work through the linux RAM disk driver (linux/drivers/block/rd.c)
• then work through a real driver such as linux/drivers/block/ide-floppy.c or one which pertains more directly to
your interests
8.5.1 init_module
We'll first give the code listing and then follow it with a description. The code is well commented, so there will be
some redundancy in the description.
int init_module(void)
{
    int res;

    /* block size must be a multiple of sector size */
    if (radimo_soft & ((1 << RADIMO_HARDS_BITS) - 1))
    {
        MSG(RADIMO_ERROR, "Block size not a multiple of sector size\n");
        return -EINVAL;
    }
    /* allocate room for data */
    radimo_storage = (char *) vmalloc(1024*radimo_size);
    if (radimo_storage == NULL)
    {
        MSG(RADIMO_ERROR, "Not enough memory. Try a smaller size.\n");
        return -ENOMEM;
    }
    memset(radimo_storage, 0, 1024*radimo_size);
    /* register block device */
    res = register_blkdev(RADIMO_MAJOR, "radimo", &radimo_fops);
    if (res)
    {
        MSG(RADIMO_ERROR, "couldn't register block device\n");
        vfree(radimo_storage);  /* do not leak the storage on failure */
        return res;
    }
    blk_init_queue(BLK_DEFAULT_QUEUE(RADIMO_MAJOR), &radimo_request);
    /* set hard and soft blocksize */
    hardsect_size[RADIMO_MAJOR] = &radimo_hard;
    blksize_size[RADIMO_MAJOR] = &radimo_soft;
    blk_size[RADIMO_MAJOR] = &radimo_size;
    read_ahead[RADIMO_MAJOR] = radimo_readahead;
    MSG(RADIMO_INFO, "sector size = %d, block size = %d, "
        "total size = %dKb\n", radimo_hard, radimo_soft, radimo_size);
    return 0;
}
8.5.2 cleanup_module
Again, we'll first give the code listing and then follow it with a description.
void cleanup_module(void)
{
    int res;

    res = unregister_blkdev(RADIMO_MAJOR, "radimo");
    if (res)
    {
        MSG(RADIMO_ERROR, "couldn't unregister block device\n");
        return;
    }
    invalidate_buffers(MKDEV(RADIMO_MAJOR,0));
    blk_cleanup_queue(BLK_DEFAULT_QUEUE(RADIMO_MAJOR));
    vfree(radimo_storage);
    MSG(RADIMO_INFO, "unloaded\n");
}
Recall that this entry point is used by the rmmod command. A step-by-step description of what cleanup_module
does follows:
• Unregisters the block device
• Marks the buffer region associated with the device as invalid
• Cleans up the request queue
• Frees the memory allocated earlier for the device
• Sends a message to the log/console indicating that the device has been unloaded
Here is the jump table for other entry points, the block_device_operations struct:
static struct block_device_operations radimo_fops = {
owner: THIS_MODULE,
ioctl: radimo_ioctl,
open: radimo_open,
release: radimo_release,
};
The entry points referenced in the radimo_fops structure are used when a user program interfaces with the device via
the VFS. For example, the open system call when used for this device will ultimately make use of the function,
radimo_open. All functions referenced must be provided by this driver except for block_read and block_write. As
noted before block_read and block_write use a block buffering system employed by VFS to provide data transfer
between the buffer and the user address space as triggered by the user read and write calls. The data transfer
between the buffering system and the actual device is handled by the request_fn described in the next section. The
request_fn must be provided by this driver.
void radimo_request(void)
{
    unsigned long offset, total;

radimo_begin:
    INIT_REQUEST;
    MSG(RADIMO_REQUEST, "%s sector %lu of %lu\n",
        CURRENT->cmd == READ ? "read" : "write", CURRENT->sector,
        CURRENT->current_nr_sectors);
    /* convert sector counts to byte counts */
    offset = CURRENT->sector * radimo_hard;
    total = CURRENT->current_nr_sectors * radimo_hard;
    /* access beyond end of the device; radimo_size is in Kb, and
       (radimo_hard << 1) is 1024 when the sector size is 512 */
    if (total + offset > radimo_size * (radimo_hard << 1))
    {
        /* error in request */
        end_request(0);
        goto radimo_begin;
    }
    MSG(RADIMO_REQUEST, "offset = %lu, total = %lu\n", offset, total);
    if (CURRENT->cmd == READ)
    {
        memcpy(CURRENT->buffer, radimo_storage + offset, total);
    }
    else if (CURRENT->cmd == WRITE)
    {
        memcpy(radimo_storage + offset, CURRENT->buffer, total);
    }
    else
    {
        /* can't happen */
        MSG(RADIMO_ERROR, "cmd == %d is invalid\n", CURRENT->cmd);
        end_request(0);
        goto radimo_begin;
    }
    /* successful */
    end_request(1);
    /* let INIT_REQUEST return when we are done */
    goto radimo_begin;
}
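The core of the request function, stripped of the kernel macros, is a bounds check followed by a memcpy in one direction or the other. The following user-space sketch reproduces that logic so it can be compiled and tried outside the kernel. It is a model under stated assumptions: the names are hypothetical, the storage is a fixed 16 Kb array rather than vmalloc'ed memory, the sector size is fixed at 512 bytes, and the return value (1 for success, 0 for failure) mirrors the argument passed to end_request().

```c
#include <string.h>

#define MODEL_SECTOR_SIZE 512

enum { MODEL_READ = 0, MODEL_WRITE = 1 };

static unsigned char model_storage[16 * 1024];   /* the "ramdisk" */

/* Transfer nr_sectors sectors starting at sector between the ramdisk
   and buffer. Returns 1 on success, 0 on a bad request. */
int ramdisk_transfer(int cmd, unsigned long sector,
                     unsigned long nr_sectors, unsigned char *buffer)
{
    unsigned long offset = sector * MODEL_SECTOR_SIZE;
    unsigned long total = nr_sectors * MODEL_SECTOR_SIZE;

    /* access beyond end of the device */
    if (offset > sizeof model_storage ||
        total > sizeof model_storage - offset)
        return 0;

    if (cmd == MODEL_READ)
        memcpy(buffer, model_storage + offset, total);
    else if (cmd == MODEL_WRITE)
        memcpy(model_storage + offset, buffer, total);
    else
        return 0;   /* unknown command */

    return 1;
}
```

Writing a sector and reading it back returns the same bytes, and a request that starts at or runs past the end of the 16 Kb storage is rejected without touching memory.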
static int radimo_open(struct inode *inode, struct file *file)
{
    MSG(RADIMO_OPEN, "opened\n");
    return 0;
}
static int radimo_release(struct inode *inode, struct file *file)
{
    MSG(RADIMO_OPEN, "closed\n");
    return 0;
}
static int radimo_ioctl(struct inode *inode, struct file *file,
                        unsigned int cmd, unsigned long arg)
{
    unsigned int minor;

    if (!inode || !inode->i_rdev)
        return -EINVAL;
    minor = MINOR(inode->i_rdev);
    switch (cmd)
    {
        case BLKFLSBUF:
        {
            /* flush buffers */
            MSG(RADIMO_IOCTL, "ioctl: BLKFLSBUF\n");
            /* deny all but root */
            if (!capable(CAP_SYS_ADMIN)) return -EACCES;
            fsync_dev(inode->i_rdev);
            invalidate_buffers(inode->i_rdev);
            break;
        }
        case BLKGETSIZE:
        {
            /* return device size */
            MSG(RADIMO_IOCTL, "ioctl: BLKGETSIZE\n");
            if (!arg) return -EINVAL;
            return put_user(radimo_size*2, (long *) arg);
        }
        case BLKRASET:
        {   /* set read ahead value */
            int tmp;
            MSG(RADIMO_IOCTL, "ioctl: BLKRASET\n");
            if (get_user(tmp, (long *)arg)) return -EINVAL;
            if (tmp > 0xff) return -EINVAL;
            read_ahead[RADIMO_MAJOR] = tmp;
            return 0;
        }
        case BLKRAGET:
        {   /* return read ahead value */
            MSG(RADIMO_IOCTL, "ioctl: BLKRAGET\n");
            if (!arg) return -EINVAL;
            return put_user(read_ahead[RADIMO_MAJOR], (long *)arg);
        }
        case BLKSSZGET:
        {   /* return block size */
            MSG(RADIMO_IOCTL, "ioctl: BLKSSZGET\n");
            if (!arg) return -EINVAL;
            return put_user(radimo_soft, (long *)arg);
        }
        default:
        {
            MSG(RADIMO_ERROR, "ioctl wanted %u\n", cmd);
            return -ENOTTY;
        }
    }
    return 0;
}
8.9 Activities
8.9.1 Activity 1
Download the simplified software, radimo_simp.tgz, and install it. Enable all levels of printing for the simplified
radimo driver. Set it up for use as in Section 8.8. Create some files on the ramdisk. Do those files persist when you
unmount the device and then remount? Do those files persist when you reboot the machine?
8.9.2 Activity 2
Write a user program that employs all of the ioctl commands for the simplified radimo driver. Trace the activity via
the messages available within the driver.
8.9.3 Activity 3