The Multics Scheduler
Introduction

The Multics design distinguishes between dispatching a CPU and scheduling processes, but the implementation is all done in a single module. Dispatching switches a CPU from one process to another; this should be as efficient as possible, so we precompute a list of eligible processes to switch to. Scheduling computes the order of this list. Jerry Saltzer's 1966 PhD thesis, Traffic Control in a Multiplexed Computing System, defined this distinction and introduced the concepts of processes, block, and wakeup. The implementation of the Multics scheduler is in the module pxss, Process Exchange Switch Stack. Its ALM source code is available online. [THVV]

Eligibility

In order to be dispatched, a process has to be in the eligible queue. Among other things, this means that the two essential pages needed for execution are present in memory. Scheduling policy mostly amounts to deciding which process to make eligible next, if indeed any at all, and how much CPU time the process can use before losing eligibility. Processes in the eligible queue are typically in the running, ready, or (short-term) waiting states.
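The split between the two mechanisms can be illustrated with a short sketch in C. This is not the pxss code (which is in ALM); every name here (proc, eligible_head, make_eligible, dispatch) is invented for the example, and the state set is reduced to the three just mentioned.

```c
#include <stdio.h>
#include <stddef.h>

enum state { RUNNING, READY, WAITING };

struct proc {
    int          id;
    enum state   state;
    struct proc *next;          /* thread in the eligible queue */
};

static struct proc *eligible_head;   /* list precomputed by scheduling */

/* Scheduling: policy decides which process becomes eligible next;
   new entries go on the tail of the eligible queue. */
static void make_eligible(struct proc *p)
{
    struct proc **tail = &eligible_head;
    while (*tail)
        tail = &(*tail)->next;
    p->next = NULL;
    *tail   = p;
}

/* Dispatching: switch the CPU to the first ready process; this path
   only walks the list that scheduling has already prepared. */
static struct proc *dispatch(void)
{
    for (struct proc *p = eligible_head; p; p = p->next)
        if (p->state == READY)
            return p;
    return NULL;                /* nothing runnable: idle */
}

int main(void)
{
    struct proc a = { 1, WAITING, NULL };   /* e.g. waiting on a page fault */
    struct proc b = { 2, READY,   NULL };
    make_eligible(&a);
    make_eligible(&b);

    struct proc *next = dispatch();
    printf("dispatch process %d\n", next ? next->id : -1);
    return 0;
}
```

The point of the precomputed list is that dispatch() does no policy work at all; everything that involves a decision happens on the scheduling path.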
If a process calls block, for example to wait for TTY input or some other long-term event outside the kernel's control, it is removed from the eligible queue and not threaded in any queue. If a process's time slice ends, it is threaded into one of the ready-but-not-eligible queues described below. Ordinarily processes are added to the tail of the eligible queue. They tend to float toward the top as other processes lose eligibility. An I/O-bound process (one that uses little CPU and does many waits) tends to wind up at the top of the queue, a desirable situation: one wants the processes that consume CPU time to be at the bottom of the queue, to soak up CPU so that the CPU does not idle during I/O bursts.

Foreground-Background Scheduler

The first Multics scheduler, circa 1967, was descended from the one in use on CTSS. It was sometimes described as a "Greenberger-Corbató exponential scheduler." The scheduler had N queues, something like 6. When the scheduler wanted to choose a process to run, it took the one at the head of the highest-priority non-empty queue. A process that started requesting service would be put at the tail of the highest-priority queue. Processes from that queue were given a small time slice, called the QUANTUM. If the timer went off and the process was still running, it was sent to the tail of the next lower queue, and when processes from that queue were run, they got a time slice twice as long (thus, exponentially longer time slices). The lowest-priority queue was called the "background" queue. Thus, long-running jobs would not run until higher-priority jobs were done; the scheduler would then run these background jobs for long blocks of time. A job that blocked on disk I/O was skipped over but didn't lose its position in the queue; a job that blocked on terminal I/O started over at the top queue when it received an input wakeup. There are more details, but that's the general idea. [THVV]

Percentage Scheduler

In 1975, Bob Mullen implemented the Multics "Workclass" scheduler, originally done to provide different percentages of the machine to different groups of users. It was not that each user got a few percent, but that a group as a whole was to get some total percentage. (Scheduling of time for users within the group was done by what is called an FB-n algorithm, meaning n levels of foreground and background priorities.) This scheduler was overlaid on the original FB-n scheduler, in the sense that scheduling within a group was done by the FB-n scheduler. The workclass scheduler decided to make a process "eligible" based on its workclass being the most worthy; the ready queue for each workclass was sorted as an FB-n queue. The administrative interface for the workclass scheduler was written by Tom Casey and the Scheduler/Dispatcher algorithms by Bob Mullen. The percentage allocations could be varied by shift. There could be 16 workclasses.

Objective

Allow a site administrator to assign each user to one of 16 groups, and to assign a percentage of the machine to each group. The percentages add up to 100 percent. Each Multics project is assigned to one of the groups, and so all user processes derive their workclass assignment at login. The minimal expectation is that if all groups present an offered load that exceeds their allotment, the amount each receives will match what was specified. Other implicit expectations are assumed and drive parts of the implementation described below:
Idealized Algorithm

Suppose A is the time slice that can be awarded, or, what should be roughly the same, the responsiveness desired from the system. As time is used by anyone, add time credits to all groups in proportion to each group's assigned percentage. Thus in general, for every second of time used by anyone, a total of one second's worth of credits is distributed (SCATTER CREDITS). Decide which job to run on the basis of which group has the most credits at the moment. As jobs run, subtract time credits from their group. This and SCATTER CREDITS essentially conserve the number of credits in the system. SCATTER CREDITS, the run decision, and the subtraction of the time used from the using group address the primary goal: the percentages specified by the administrator.

When a job is started running, subtract some fixed amount A from its group's credits. When the job is done, add A back into its group's credits. These two operations, on average, conserve the number of credits in the system. Together they address point 4 above.

A group's credits can never exceed 4*A, nor go below 0. Thus SCATTER CREDITS says group.credits = min(group.credits + delta, 4*A), and as a job runs, group.credits = max(group.credits - used, 0). These two operations destroy and create credits, and limit the integration time of the system. If for some extended period only one group demands (and gets) service, the other groups will not build up more than 4*A credits, and the active group's credits will not go below zero. Thus if other groups suddenly become active, the first group will be back in the ball game fairly quickly, more quickly if it is a 60% group, less quickly if it is a 10% group. This addresses point 3 above, a short integration time (4*A); it also addresses point 2: if some group is not using its credits, additional credits are passed out to demanding groups in the same ratio as originally specified.

One could argue that the maximum number of credits (the cap) should be proportional to the groups' percentages, rather than the same across all groups. There are two problems with that. First, it would mean the integration time, the memory of the system, could be longer for some groups than others, making behavior less predictable. Second, it would work to the advantage of the larger groups to allow them a larger buildup of credits. It is (relatively) better for the smaller groups not to do that --- this addresses point 5 above. Note this is only an issue if there is some measurable idle time. If there is no idle time, the lower-percentage group will have no advantage.

Compromise for Responsiveness

It is a result of queueing theory that if you place 25% of the users in each of 4 queues, with each queue assigned 25% of the CPU time, the response time for all users will be 4 times worse than if they shared a single queue running at 100% (under comparably heavy loads). This implies that schedulers dividing up CPU time --- or bandwidth of any kind --- cannot avoid drastically degrading response time. Because this phenomenon was visible during testing, an additional queue was created for ready, non-eligible processes that had just interacted. This "interactive-queue" was examined before the workclass queues and allowed all users to receive their initial time slices in first-come-first-served order, regardless of recent usage by their workclass or the percentage assigned to it. Of course the credits for their workclass were decremented in the same manner as above. Because the first time slice was short, it was rare for a workclass to exceed its percentage due to this loophole.
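The credit bookkeeping described under Idealized Algorithm can be sketched as follows. This is only an illustration under assumed values: the names, the choice of A = 0.5 seconds, and the clamp helper are inventions for the example, not the pxss implementation, and the interactive-queue check just described would be consulted before this run decision.

```c
#include <stdio.h>

#define NGROUPS 16
#define A       0.5              /* award per eligibility, seconds (assumed value) */
#define CAP     (4.0 * A)        /* per-group credit ceiling, 4*A */

struct group {
    double percent;              /* administrator-assigned share; shares sum to 100 */
    double credits;              /* current credits, kept in [0, 4*A] */
};

static struct group groups[NGROUPS];

/* Credits can never exceed 4*A nor go below 0. */
static double clamp(double c)
{
    if (c < 0.0) return 0.0;
    if (c > CAP) return CAP;
    return c;
}

/* SCATTER CREDITS: for every second used by anyone, distribute one
   second's worth of credits across all groups in proportion to their
   percentages, capped at 4*A. */
static void scatter_credits(double used)
{
    for (int i = 0; i < NGROUPS; i++)
        groups[i].credits = clamp(groups[i].credits +
                                  used * groups[i].percent / 100.0);
}

/* Charge a group for CPU time its job just used, flooring at zero. */
static void charge(struct group *g, double used)
{
    g->credits = clamp(g->credits - used);
}

/* Run decision: the next job comes from the group holding the most
   credits at the moment. */
static struct group *most_worthy(void)
{
    struct group *best = &groups[0];
    for (int i = 1; i < NGROUPS; i++)
        if (groups[i].credits > best->credits)
            best = &groups[i];
    return best;
}

/* Starting a job debits A from its group; finishing credits A back,
   so on average the pair conserves credits. */
static void job_start(struct group *g) { g->credits = clamp(g->credits - A); }
static void job_done (struct group *g) { g->credits = clamp(g->credits + A); }

int main(void)
{
    groups[0].percent = 60.0;
    groups[1].percent = 30.0;
    groups[2].percent = 10.0;
    for (int i = 0; i < 3; i++)
        groups[i].credits = 2.0 * A;     /* arbitrary starting point */

    /* One job from the most worthy group runs for one second. */
    struct group *g = most_worthy();
    job_start(g);
    charge(g, 1.0);
    scatter_credits(1.0);
    job_done(g);

    for (int i = 0; i < 3; i++)
        printf("group %d (%2.0f%%): %.2f credits\n",
               i, groups[i].percent, groups[i].credits);
    return 0;
}
```

Running the sketch shows the mechanics: the group whose job consumed a second of CPU is charged for it, and then, like every other group, receives its percentage share of that second back through SCATTER CREDITS.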
On the one hand the "interactive-queue" maintained good response time to simple commands; on the other, response to longer commands is not subject to stringent user expectations. Without the "interactive-queue" the rest of the percentage scheduler would have been doomed by elementary queueing theory. Even today it is worth asking of any new scheduling algorithm, "What does it do to average response time?"

Deadline Scheduler

In 1975-1976, Bob Mullen added further features to the scheduler.
This version of the scheduler is called the "Deadline Scheduler." The deadline scheduler used the workclass structures to implement a wholly different scheduler in which neither the FB-n nor the percentage parameters were used. Considering that most people do not understand the FB-n algorithm, the top-level view was that the scheduler could be operated in either "percent" mode or "deadline" mode. Using some workclasses with virtual deadlines along with a few with realtime deadlines was a convenient way to tune benchmarks with response-time requirements that varied for different user scripts. However, realtime workclasses can also be used in conjunction with percentage groups, and such a use was common at customer sites.

Virtual Deadlines

Here each workclass was given two pairs of numbers; R1 and R2 could be thought of as response times, and Q1 and Q2 were time quanta. When a process received an interactive wakeup, the scheduler read the clock and set

process.deadline = clock() + workclass.R1

Then the process was sorted by deadline into a queue of ready (but not eligible) processes from all workclasses. The process at the head of this deadline queue would be the next process to be made eligible. There was no attempt to meet the deadline, nor to delay running until the deadline; the deadlines were just a way to sort the work to be done. Thus the deadlines were virtual, meaning not real. When the process became eligible it was given a quantum of Q1 and placed at the end of the eligible queue. When its time slice expired it was resorted into the ready queue, after another clock reading:

process.deadline = clock() + workclass.R2

When it was made eligible again, it would receive a quantum of Q2. The allocation of R2 responses and Q2 quanta continued until the next interactive wakeup.

Realtime Deadlines

Realtime deadlines were not deadlines for completion of work, but for the start of work. If a workclass was identified as realtime, its processes were sorted into a special "realtime" ready queue by deadline. Until the deadline arrived, no new eligibility would be granted to a realtime process, unless there was nothing else to do. However, when the deadline arrived, a realtime process was moved to the head of the eligible queue (ignoring any restriction on max_eligible, working set size, etc.). Subsequent deadlines and time slices were computed as for virtual deadlines, but the sorting and dispatching used the realtime ready queue.

Remarks on Realtime

In general it would be a mistake to have too many realtime processes. The assumption was that the administrator would not have too many, and that the amount of processor time committed to them (asymptotically at most a Q2/(R2+Q2) fraction of a processor, less due to I/O waits) would not be large. For example, to keep a double-buffered line printer busy that required 50 ms of CPU time to fill a buffer generating 2 seconds of printing, one would set:

R1 = 1.9 sec
R2 = 5.0 sec
Q1 = .06 sec
Q2 = .05 sec

One needs to know that the refresh-buffer wakeup from the printer driver to the printer process gave interactive credit to the process. Thus under ordinary operation the printer process would start running within 1.9 sec and complete running before using .06 sec. For about 3% of a processor, the printer would never halt if there was work to do. If the printer went into a loop, or someone tried to use the printer process to do a compile, the process would receive additional time slices very slowly, at a rate of 1% of a processor.
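A small sketch may make the deadline bookkeeping and the printer arithmetic concrete. The structure and function names are invented for the illustration; only the R1/R2/Q1/Q2 values above come from the text, and clock readings are passed in as plain numbers.

```c
#include <stdio.h>

/* Per-workclass deadline parameters; the values used below are the
   printer example from the text, everything else is illustrative. */
struct workclass {
    double R1, R2;    /* offsets added to the clock to form deadlines, sec */
    double Q1, Q2;    /* first and subsequent eligibility quanta, sec      */
    int    realtime;  /* nonzero: sort into the realtime ready queue and
                         promote to the head of the eligible queue when
                         the deadline arrives                              */
};

struct process {
    const struct workclass *wc;
    double deadline;  /* sort key within its ready queue */
    double quantum;   /* time slice granted at the next eligibility */
};

/* Interactive wakeup: deadline = clock() + R1, first quantum is Q1. */
static void interactive_wakeup(struct process *p, double now)
{
    p->deadline = now + p->wc->R1;
    p->quantum  = p->wc->Q1;
}

/* Time slice expired: re-sort with deadline = clock() + R2 and grant
   Q2 at the next eligibility, until the next interactive wakeup. */
static void quantum_expired(struct process *p, double now)
{
    p->deadline = now + p->wc->R2;
    p->quantum  = p->wc->Q2;
}

int main(void)
{
    const struct workclass printer = { 1.9, 5.0, 0.06, 0.05, 1 };
    struct process p = { &printer, 0.0, 0.0 };

    interactive_wakeup(&p, 100.0);   /* refresh-buffer wakeup at t = 100 s */
    printf("first deadline t = %.2f, quantum %.2f s\n", p.deadline, p.quantum);

    quantum_expired(&p, 100.05);     /* used up its first slice */
    printf("next  deadline t = %.2f, quantum %.2f s\n", p.deadline, p.quantum);

    /* Steady state: ~0.05 s of CPU per ~2 s of printing. */
    printf("steady state: %.1f%% of a processor\n", 100.0 * 0.05 / 2.0);

    /* Runaway loop: one Q2 slice every R2 + Q2 seconds. */
    printf("runaway loop: %.1f%% of a processor\n",
           100.0 * printer.Q2 / (printer.R2 + printer.Q2));
    return 0;
}
```

The two printed fractions correspond to the figures quoted above: roughly 2.5% of a processor (the "about 3%") while the printer has work, and about 1% if the process loops and must wait out R2 before each Q2 slice.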
Similar applications included RJE lines and backup processes. The present author would use a realtime workclass for himself in the early stages of tuning a benchmark, in order to never be locked out of the system due to poor response time.