1. Introduction
Forgetting is a necessity.
Machado de Assis, Verba testamentária.
3. Do we need a scheduler?
The scheduling of I/O requests changes radically when secondary storage is
so fast that there is no point in trying to hide its latency. As for the question
in the section title, the answer is yes, but simpler mechanisms may suffice. There
will always be a need for a scheduler to stop processes from hogging the processor,
since starvation is a nasty thing and must be avoided.
A large part of what an operating system does is predicated on a storage
technology that is several orders of magnitude slower than the processor
[KELS62, HP12]. If an access to secondary storage takes 10^5 to 10^6 processor
cycles, it pays to switch the processor to another process in order to hide the
disk access latency. The OS enqueues the request, and the disk controller posts
an interrupt to the processor when the requested block has been copied to DRAM.
Having several processes perform concurrent accesses to disk is a clever trick
to hide latency.
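To make the mechanism concrete, the following C sketch shows the classic
enqueue-and-switch read path. It is a schematic of the description above, not
code from any real kernel; all names (read_block, disk_isr, schedule) and the
stub bodies are hypothetical.

    #include <stddef.h>

    struct request { long block; void *buf; int done; };

    static struct request *queue[64];   /* per-device request queue */
    static size_t qlen;

    static void enqueue(struct request *r) { queue[qlen++] = r; }
    static void start_disk_io(void) { /* would program the disk controller */ }
    static void schedule(void)      { /* would run another ready process   */ }

    /* Called by a process: the request is queued and the processor is
       handed to someone else for the 10^5 to 10^6 cycles the disk needs. */
    void read_block(long block, void *buf)
    {
        struct request r = { block, buf, 0 };
        enqueue(&r);
        start_disk_io();
        while (!r.done)
            schedule();         /* block here; the ISR below ends the wait */
    }

    /* Disk interrupt handler: the controller has copied the block to DRAM.
       Simplification: completes the most recently queued request. */
    void disk_isr(void)
    {
        queue[--qlen]->done = 1;
    }

While the caller waits in schedule(), the hundreds of thousands of cycles the
disk needs are spent running other processes; that is the whole point of the
switch.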
Consider how long it takes to save the execution context of a Unix process on
a MIPS processor, which includes the PC, 31 general purpose registers (GPRs), and
the HI and LO registers. Saving this state through non-cacheable memory takes 34
stores, at 50 cycles per store, adding up to 1,700 cycles. That is only half the
cost of a switch, since the same number of registers must be restored for the
incoming process. Thus far, 3,400 cycles, not counting the 33 floating-point and
status registers. This is the minimum time it takes to save and restore state on
a MIPS processor. With lightweight threads, the full set of registers may not
need to be saved and restored but, unless there is hardware support for contexts,
several memory references are still needed on each thread switch. Even on x86
processors, with 8-16 integer registers and a plethora of dedicated vector/media
registers, a context switch is anything but cheap.
The interrupt service routine for the disk drive causes two mini context
switches, to save and then restore the processor registers needed to execute
the handler, plus some more cycles to drain and then refill the pipeline with
instructions. Deeper and wider pipelines may take hundreds of cycles to drain
and refill.
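Putting these figures together, a back-of-envelope tally in C, using only the
numbers quoted above:

    #include <stdio.h>

    int main(void)
    {
        const int stores       = 34;  /* PC + 31 GPRs + HI and LO           */
        const int cycles_store = 50;  /* store through non-cacheable memory */

        int save_cost   = stores * cycles_store;  /* 1,700 cycles to save   */
        int switch_cost = 2 * save_cost;          /* restoring costs the    */
                                                  /* same: 3,400 in total   */
        printf("minimum context switch: %d cycles "
               "(excluding the 33 FP/status registers)\n", switch_cost);
        return 0;
    }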
Under the far-from-innocent assumptions implied by Table 1, a fair proportion
of the cycles needed to access a magnetic disk is spent on context switches. For
the switch to be worthwhile, there had better be several threads/processes vying
for the processor. Unfortunately, current desktop users seldom have use for more
than two or three concurrent threads/processes [BDMF10]. The situation may be no
different with servers, as many applications are I/O bound: a few long-running
processes perform a great many disk references while searching or filling the
huge tables common in current large-scale applications.
The status quo is thus: it takes a very long time to access data on magnetic
disks; the OS switches processes whenever there is a disk request in order to
hide the disk latency; yet there are only a couple of processes with which to
hide said latency. As an additional twist, most of the laptop/desktop systems
sold in 2016 have at least four processor cores, and the majority of them are
idle most of the time.
If our desktop system is equipped with storage as fast as flash-based SSDs,
things look very different [BCGL11, YMH12]. If an access to an SSD block costs
5,000 cycles, there is no point in performing a context switch, which itself
costs almost that same number of cycles. Thus, whenever a process requests a
block from an SSD, it is more efficient to synchronize the processor with the
device by polling rather than by an interrupt. If the processor itself performs
the copy between device and memory, making no use of DMA, the block transfer may
take less time than a DMA transfer followed by an interrupt to signal the end of
the transfer. The cache hierarchy may also be better utilized, since the
interrupt service routine pollutes the cache(s): after a context switch, the
entering process finds an empty cache, and by the time the cache has been
refilled, that process may be switched out again.
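A minimal sketch of such a polling read in C; the memory-mapped register
layout, the base address, and all names are hypothetical, not those of any real
SSD interface:

    #include <stdint.h>

    #define DEV ((volatile uint32_t *)0xA0001000u) /* hypothetical MMIO base */
    #define DEV_CMD     DEV[0]          /* write: block number to read       */
    #define DEV_STATUS  DEV[1]          /* bit 0: block ready in data window */
    #define DEV_DATA    (DEV + 2)       /* start of the block data window    */
    #define BLOCK_WORDS 1024            /* 4 KiB block, assumed              */

    /* At ~5,000 cycles per access it is cheaper to spin than to pay for
       an interrupt plus two mini context switches. */
    void ssd_read_block(uint32_t block, uint32_t *buf)
    {
        DEV_CMD = block;                     /* issue the read command      */
        while ((DEV_STATUS & 1u) == 0)
            ;                                /* poll: no ISR, no switch     */
        for (int i = 0; i < BLOCK_WORDS; i++)
            buf[i] = DEV_DATA[i];            /* PIO copy: the processor,    */
                                             /* not DMA, moves the data     */
    }

The copy loop also keeps the data on the requesting process's own cache-warm
path, which is precisely the benefit argued for above.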
By using a storage medium that is three orders of magnitude faster, large
sections of an OS can be done away with: (i) there is no need for a two-layer
disk driver; (ii) there is no need for a queue of disk requests; (iii) there is
no need for a context switch to hide the disk latency; and (iv) there is no need
for (some of) the interrupt service routine(s) and all the attendant priority
and timing intricacies.
7. In conclusion
For over half a century the design of operating systems has been predicated on
fast primary memory (RAM) and slow secondary memory (magnetic disks). With the
arrival of non-volatile RAM, this premise no longer holds, and large sections of
the OS can evolve to a much simpler implementation.
An arbitrarily ordered and non-exhaustive list includes the following changes.
First, fast solid state drives eliminate the need for interrupts since polling is more
efficient than a context switch plus interrupt service routine. Second, if the data
interface of SSDs is re-designed so the transfer unit is one byte or one word, then
the block interfaces that percolate through several layers of the OS may also be
simplified away. Third, as there is no need to hide the long access latencies of disks
by time multiplexing the processor, the scheduler can be re-designed. Fourth, as the
secondary memory is replaced by non-volatile RAM (NVRAM) primary memory, the
function of paging changes into a mechanism for memory allocation plus a separate
mechanism for enforcing security, and both of these can be provided by segmentation.
Fifth, file systems were conceived to match the organization of magnetic disks;
without disks, files may be thought of as segments that stay in memory for a
long time. Sixth, the process abstraction can be re-thought, as time-multiplexing
the processor loses the importance it once had. Seventh, non-volatility
introduces artifacts of its own, such as non-volatile dangling pointers and the
very persistence of sensitive data, and these must be dealt with.
We propose segmentation as a more efficient means of providing virtual-to-
physical memory mapping and security. We present the design of a segmentation
buffer (similar to a TLB) for MIPS processors. With this design complete, we
will implement a segmented operating system and then compare its efficiency to
that of a paging system.
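As a rough illustration of the mechanism (our own sketch, not the completed
MIPS design mentioned above), a segment entry can perform the mapping and the
protection check in a single step:

    #include <stdbool.h>
    #include <stdint.h>

    /* One entry of a hypothetical segmentation buffer, analogous to a
       TLB entry: base, limit and permission bits for one segment. */
    struct seg_entry {
        uint64_t base;    /* physical base address of the segment */
        uint64_t limit;   /* segment length in bytes              */
        unsigned perm;    /* read/write/execute permission bits   */
    };

    /* Translate (segment, offset), enforcing bounds and permissions. */
    bool seg_translate(const struct seg_entry *seg, uint64_t offset,
                       unsigned access, uint64_t *phys)
    {
        if (offset >= seg->limit || (seg->perm & access) != access)
            return false;              /* fault: out of bounds or denied */
        *phys = seg->base + offset;
        return true;
    }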
References
[Alm16] Eduardo C Almeida. Memory footprint of large TPC-H benchmarks.
Personal communication, UFPR, Sep 2016.
[Bad13] Anirudh Badam. How persistent memory will change software systems.
IEEE Computer, 46(8):45–51, Aug 2013.
[BBBD13] M Bjørling, P Bonnet, L Bouganim, and N Dayan. The necessary death
of the block device interface. In Proc 6th Biennial Conf on Innovative
Data Systems Research (CIDR13), 2013.
[BCD72] A Bensoussan, C T Clingen, and R C Daley. The Multics virtual mem-
ory: Concepts and design. Comm of the ACM, 15(5):308–318, May
1972.
[BCGL11] K Bailey, L Ceze, S D Gribble, and H M Levy. Operating system impli-
cations of fast, cheap, non-volatile memory. In Proc USENIX Conf on
Hot Topics in Operating Systems, pages 2–2, 2011.
[BDMF10] G Blake, R G Dreslinski, T Mudge, and K Flautner. Evolution of thread-
level parallelism in desktop applications. In ISCA’10: 37th Intl Symp
on Computer Arch, pages 302–313, Jun 2010.
[CCA+11] J Coburn, A M Caulfield, A Akel, L M Grupp, R K Gupta, R Jhala,
and S Swanson. NV-Heaps: Making persistent objects fast and safe with
next-generation, non-volatile memories. SIGPLAN Not., 46(3):105–118,
Mar 2011.
[CDC+10] A M Caulfield, A De, J Coburn, T I Mollow, R K Gupta, and S Swan-
son. Moneta: a high-performance storage array architecture for next-
generation, non-volatile memories. In Proc 43rd IEEE/ACM Int Symp
on Microarchitecture (MICRO’10), pages 385–395, 2010.
[CNF+09] J Condit, E B Nightingale, C Frost, E Ipek, B Lee, D Burger, and
D Coetzee. Better I/O through byte-addressable, persistent memory. In
Proc ACM 22nd Symp Operating Systems Principles (SIGOPS), pages
133–146, 2009.
[Gre93] Paul Green. Multics virtual memory – tutorial and reflections. Es-
say, Multics Project, 1993. ftp://ftp.stratus.com/vos/multics/
pg/mvm.html.
[HAMS08] S Harizopoulos, D J Abadi, S Madden, and M Stonebraker. OLTP
through the looking glass, and what we found there. In Proc ACM Int
Conf on Management of Data (SIGMOD’08), pages 981–992, 2008.
[HCG+15] P Hornyack, L Ceze, S Gribble, D Ports, and H M Levy. A study
of virtual memory usage and implications for large memory. In Proc
Workshop on Interactions of NVM/FLASH with Operating Systems and
Workloads, 2015.
[HP12] John L Hennessy and David A Patterson. Computer Architecture: A
Quantitative Approach. Morgan Kaufmann, 5th edition, 2012.
[KELS62] T Kilburn, D B G Edwards, M J Lanigan, and F H Sumner. One-level
storage system. IRE Trans on Electronic Computers, EC-11:223–235,
1962.
[Lee06] Edward A Lee. The problem with threads. IEEE Computer, 39(5):33–42,
May 2006.
[LZY+10] B C Lee, P Zhou, J Yang, Y Zhang, B Zhao, E Ipek, O Mutlu, and
D Burger. Phase-change technology and the future of main memory.
IEEE Micro, 30(1):143–143, Jan 2010.
[MIP05] MIPS. MIPS32 architecture for programmers, volume III: The MIPS32
privileged resource architecture. Rev. 2.50, MIPS Technologies, 2005.
[RBB+08] S Raoux, G W Burr, M J Breitwisch, C T Rettner, Y-C Chen, R M
Shelby, M Salinga, D Krebs, S-H Chen, H-L Lung, and C H Lam. Phase-
change random access memory: a scalable technology. IBM J. Res. Dev.,
52(4):465–479, Jul 2008.
[RT74] Dennis M Ritchie and Ken Thompson. The UNIX time-sharing system.
Comm of the ACM, 17(7):365–375, Jul 1974.
[SB14] Karin Strauss and Doug Burger. What the future holds for solid-state
memory. IEEE Computer, 47(1):24–31, Jan 2014.
[SS15] Kosuke Suzuki and Steven Swanson. The non-volatile memory technol-
ogy database (nvmdb). Technical Report CS2015-1011, Dept of Com-
puter Science & Engineering, Univ of California, San Diego, May 2015.
[YMH12] J Yang, D B Minturn, and F Hady. When poll is better than interrupt.
In Proc 10th USENIX Conf on File and Storage Technologies, pages 1–7,
2012.