In this open source project we plan to examine the scalability of the current scheduler on large-way SMP systems. We believe that the current scheduler design will work well for medium-sized SMP systems. For systems larger than that, we want to evaluate alternative scheduler designs to help understand what changes can be made to Linux in order for it to scale up to large systems. We welcome the participation of other individuals working in this area, both for assistance in coding and benchmarking related to the project's goals and for their insights and suggestions about scheduling and scalability under Linux.
The primary interest in Linux in most commercial environments is its use as a server operating system. Generally speaking, Linux has been used in clustered, and thus horizontally scaled, environments such as web serving, where the workload is readily partitionable across a number of 1- or 2-processor systems. These systems are typical of the first tier of a 3-tier e-business system. If Linux is to be used in the second and third tiers, then we must demonstrate that it can scale vertically as well as horizontally. By vertical scaling, we mean scaling to large-way SMP systems such as 4-way, 8-way and larger. At present, Linux has been shown to scale moderately on a 4-way system [SMPScale], but scalability to 8-way and beyond has not yet been established for e-business workloads. One factor that may limit scalability beyond 4-way is the structure of the current Linux scheduler.
The current Linux scheduler has the following characteristics (these are examined in much more detail in the next section of this white paper):
The current Linux scheduler was primarily designed for workloads where the number of runnable threads is small. This assumption, while valid in many environments, can be incorrect for some enterprise-class workloads. For example, as shown in previous work [JThread, SMPScale], there are workloads where the current scheduler can consume 20-30% of total system time. Also, some early results [LKMtalk] based on measurements of contention for the scheduler's runqueue lock indicate that a single Linux run queue may not be appropriate for large-way SMP systems.
From a queueing-theory perspective, the run queue represents a single server, and access to it is serialized through the spinlock. The clients are the processors requiring access to the run queue for scheduling purposes. In such a scenario, a scalability problem can manifest itself for three reasons:
The general solution commonly deployed in situations of lock contention is to reorganize the protected data structure so that the average lock hold time is reduced. If that does not result in the desired reduction in lock contention, the data structure is broken up or partitioned into smaller parts, each protected by its own lock.
For the Linux scheduler, there are fundamentally two ways to address these issues: organize the run queue as a priority queue, or provide a separate run queue per processor.
In some sense, these two techniques are different ways of accomplishing the same thing. If the scheduler queue were organized as a priority queue, only the highest non-empty priority level would need to be scanned, reducing the number of tasks examined per scheduling decision. The time spent holding the run queue lock might then already be reduced enough (by the introduction of the priority queue) that the run queue is no longer a bottleneck in the system. If not, we can introduce a run queue per processor. In that case each queue holds fewer tasks, and the number of tasks examined per scheduling decision is again reduced.
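As a purely conceptual sketch of the first approach (all identifiers below, such as prio_runqueue and pick_next, are invented for illustration and do not come from the 2.4 kernel or any existing patch), a priority-queue organization might keep one list per priority level and use a bitmap to locate the highest non-empty level without examining every runnable task:

    #include <linux/list.h>
    #include <linux/sched.h>
    #include <linux/spinlock.h>
    #include <asm/bitops.h>

    #define NR_PRIO 64                          /* illustrative number of levels */

    struct prio_runqueue {
            spinlock_t       lock;              /* queue still guarded by one lock     */
            unsigned long    bitmap;            /* bit p set => queue[p] is non-empty  */
            struct list_head queue[NR_PRIO];    /* one list of runnable tasks per level,
                                                   level 0 being the highest priority  */
    };

    /* Select the next task by finding the highest non-empty priority level
     * from the bitmap instead of computing a weight for every runnable task. */
    static struct task_struct *pick_next(struct prio_runqueue *rq)
    {
            int prio;

            if (!rq->bitmap)
                    return NULL;                     /* no runnable tasks              */
            prio = ffs(rq->bitmap) - 1;              /* lowest set bit = highest level */
            return list_entry(rq->queue[prio].next, struct task_struct, run_list);
    }

Whether such a reorganization alone reduces the lock hold time enough, or whether per-processor run queues are still needed, is exactly the kind of question this project aims to answer.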
In this open source project, for which we solicit participation, we propose examining different run queue organizations and schedulers from the standpoint of efficiency, both in the uniprocessor and multiprocessor cases, and for workloads with small as well as large numbers of runnable tasks. Our goal is to produce a scalable scheduler that also performs acceptably in the cases for which the current Linux scheduler was designed to be optimal.
As a first direction in this project we plan to:
By studying the obtainable scalability in each of the above categories, we hope first to establish how well, and to what extent, the current Linux kernel scales. Second, we want to determine how changes to mechanisms and algorithms can contribute to improving the scalability of the kernel.
Before redesigning the scheduler, a review of the current scheduler is in order. This review is based on the recent 2.4.0-test kernels.
The basic schedulable unit in Linux is the task. A task consolidates state related to the address space, memory management, signal management, open files and privileges. In essence, it represents the state associated with a process in a traditional UNIX environment. Linux does not distinguish between processes and threads. However, it enables sharing of certain state between processes, such as the memory management structures and the address space, through the clone system call, thus enabling lightweight thread support. Both processes and threads are represented by task structures and, from a scheduling point of view, are not distinguished.
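As a rough user-space illustration of this sharing (a sketch only; the particular flag combination shown is one plausible choice for a lightweight thread, not a recommendation), clone() can be asked to share the parent's address space, file descriptors and signal handlers:

    /* Minimal sketch: create a lightweight "thread" by cloning the current
     * task while sharing its address space, files and signal handlers.
     * Error handling is omitted for brevity. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int shared = 0;

    static int worker(void *arg)
    {
            shared = 42;                      /* visible to the parent: same mm */
            return 0;
    }

    int main(void)
    {
            char *stack = malloc(64 * 1024);  /* stack for the new task */

            clone(worker, stack + 64 * 1024,  /* stack grows downward on x86 */
                  CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND,
                  NULL);
            sleep(1);                         /* crude wait for the child */
            printf("shared = %d\n", shared);  /* prints 42: one address space */
            return 0;
    }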
The current scheduler supports preemption of tasks executing in user space, but does not allow preemption of a task while it is executing in system space, i.e. while executing kernel code such as a system call. Instead, code segments with long kernel residency can improve responsiveness by checking for scheduling requirements at appropriate locations. Time, for scheduling purposes, is measured in units of timer ticks. Timer ticks are generated at an architecture-dependent frequency (e.g. x86=100Hz, SPARC=1000Hz) by an interrupt mechanism. A task is assigned a certain time quantum, measured in ticks, for which it can execute before it is preempted at the end of its time slice. Priority preemption can occur at any time while a task is executing in user space or when it exits kernel space.
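For example, a long-running kernel code path typically yields at a safe point when a reschedule is pending; a minimal sketch (the loop and the per-item work are hypothetical, not a specific kernel routine) looks like this:

    /* Hypothetical long-running kernel loop that voluntarily yields when the
     * scheduler has flagged a reschedule for the current task. */
    for (i = 0; i < nr_items; i++) {
            process_one_item(&items[i]);      /* hypothetical unit of work         */
            if (current->need_resched)        /* set by the timer tick or a wakeup */
                    schedule();               /* yield, then resume the loop       */
    }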
The current scheduler bases its scheduling decision on various attributes of a task:
policy:        The scheduling policy of the task. Of particular interest is the distinction between real-time tasks and all others (SCHED_OTHER).
nice:          The nice value assigned to a task, inherited across fork or set via the sys_setpriority() system call, with values ranging from -20..19.
rt_priority:   The priority of a real-time task.
mm:            A pointer to the memory management data structure (e.g. page tables, virtual memory areas (vmas)).
processor:     The processor on which the task is currently running or on which it last ran.
has_cpu:       Binary flag indicating whether the task is currently executing. It is set and cleared by the scheduler code. Tasks that are executing are not eligible for scheduling.
counter:       Number of timer ticks until the task will be preempted. The value is decremented for a running task each time the timer service routine executes.
need_resched:  Set if the task needs to be rescheduled (see preemption).
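For orientation, here is an abridged and reordered sketch of how these attributes appear in the 2.4 task structure (many fields are omitted; see include/linux/sched.h for the authoritative definition):

    /* Abridged sketch of the scheduler-relevant fields of the 2.4 task_struct.
     * Field order is rearranged and most members are omitted. */
    struct task_struct {
            volatile long     need_resched;   /* a reschedule has been requested     */
            long              counter;        /* timer ticks left in the time slice  */
            long              nice;           /* nice value, -20..19                 */
            unsigned long     policy;         /* SCHED_OTHER, SCHED_FIFO or SCHED_RR */
            struct mm_struct *mm;             /* address space / page tables         */
            int               has_cpu;        /* task is currently executing         */
            int               processor;      /* CPU it runs on or last ran on       */
            unsigned long     rt_priority;    /* priority of a real-time task        */
            /* ... many other fields ... */
    };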
The processor of the currently executing context is identified by the cpu_curr() macro. The current macro identifies the task of the currently executing context.
There are two main scheduling functions: schedule(), which selects the next task to run on the invoking processor, and reschedule_idle(), which is called when a task becomes runnable and determines the processor (preferably an idle one) on which that task should preempt the currently running task.
The scheduler uses a single run queue (runqueue_head) and a single spinlock (runqueue_lock) to protect access to this queue. The queue is unordered, which makes insertion and deletion of a task straightforward and efficient. However, in order to select a new task to run, the scheduler, when executing on a particular processor, locks and traverses the entire run queue and computes a goodness value (also called the weight) for each task tsk. Weight values range from 0..MAX for regular tasks (SCHED_OTHER) and from 1000 upward for real-time tasks. MAX is determined by the architecture-dependent NICE_TO_TICKS macro, but yields values well below 1000; hence, real-time tasks are always given preference over regular tasks. Two types of affinity are taken into account for SCHED_OTHER tasks. The first is cache affinity: if the invoking processor is equal to tsk->processor, the weight is increased by PROC_CHANGE_PENALTY (15 or 20, depending on the architecture). The second is memory management affinity: if the memory management object of current is the same as that of tsk, the weight is increased by 1, reflecting the page table switching overhead (e.g. TLB flushes) that is avoided. When the run queue contains only SCHED_OTHER tasks that have exhausted their time quantum (counter), all tasks in the system are traversed and their counter values are recalculated. The rationale is that tasks that have been inactive longer should receive some preference when they are woken up.
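The selection logic described above can be condensed into the following sketch of the goodness calculation (simplified from kernel/sched.c; not a verbatim copy, and details such as SCHED_YIELD handling are omitted):

    /* Simplified sketch of the per-task goodness (weight) computation used
     * when choosing the next task to run on this_cpu. */
    static int goodness(struct task_struct *tsk, int this_cpu,
                        struct mm_struct *this_mm)
    {
            int weight;

            if (tsk->policy != SCHED_OTHER)          /* real-time task           */
                    return 1000 + tsk->rt_priority;  /* always beats SCHED_OTHER */

            weight = tsk->counter;                   /* remaining time slice     */
            if (!weight)                             /* quantum exhausted        */
                    return 0;
    #ifdef CONFIG_SMP
            if (tsk->processor == this_cpu)          /* cache affinity bonus     */
                    weight += PROC_CHANGE_PENALTY;
    #endif
            if (tsk->mm == this_mm)                  /* same page tables         */
                    weight += 1;
            weight += 20 - tsk->nice;                /* nicer tasks weigh less   */
            return weight;
    }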
In general, the current scheduler does a good job of avoiding lock contention when the system is predominantly executing user-level processes that do not create heavy kernel activity (defined as yielding, I/O, process creation, etc.). In this case, the majority of scheduling requests originate from preemption. To avoid lock contention on preemption, the current kernel disperses the timer interrupts of the different processors throughout a single scheduling cycle (T), so that they fire at equidistant intervals throughout T.
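Conceptually, the offset of each processor's scheduling tick within the cycle could be computed as in the small helper below (written purely to illustrate the equidistant spacing; it is not a kernel function):

    /* Illustrative helper: spread per-CPU scheduling ticks evenly across one
     * scheduling cycle of cycle_ticks ticks, so that the processors do not
     * all contend for runqueue_lock at the same instant. */
    static unsigned long tick_offset(int cpu, int nr_cpus,
                                     unsigned long cycle_ticks)
    {
            return ((unsigned long)cpu * cycle_ticks) / nr_cpus;
    }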
[JThread] "Java
Technology, Threads, and Scheduling in Linux", R. Bryant, B. Hartner, Java
Technology Update, Volume IV, Issue 1, January 2000, archived at:
[SMPScale] "SMP Scalability Comparisons of
Linux Kernels 2.2.14 and 2.3.99," R.
Bryant, B. Hartner, Q. He, and G. Venkitachalam, Proceedings of the 4th
Annual Showcase and Conference, Atlanta, GA, October 10, 2000, pp.
195-208. Also available at:
[LKMtalk] "Lockmeter: Highly-Informative
Instrumentation for Spin Locks in the Linux Kernel,"
foils from talk at the 4th Annual Linux Showcase and Conference,
Atlanta. Available at:
http://www-4.ibm.com/software/developer/library/java2/index.html
http://www.usenix.org/publications/library/proceedings/als2000/bryantscale.html
http://oss.software.ibm.com/developerworks/opensource/linux/presentations/lockmeter/als2000/index.html