In this open source project we plan to examine the scalability of the current scheduler on large-way SMP systems. We believe that the current scheduler design will work well for medium-sized SMP systems. For systems larger than that, we want to evaluate alternative scheduler designs to help understand what changes can be made to Linux in order for it to scale up to large systems. We welcome the participation of other individuals working in this area, both for assistance in coding and benchmarking related to the project's goals and for their insights and suggestions about scheduling and scalability under Linux.
The primary interest in Linux in most commercial environments is its use as a server operating system. Generally speaking, Linux has been used in clustered, and thus horizontally scaled, environments such as web serving, where the workload is readily partitionable across a number of 1- or 2-processor systems. These systems are typical of the first tier of a 3-tier e-business system. If Linux is to be used in the second and third tiers, then we must demonstrate that it can scale vertically as well as horizontally. By vertical scaling, we mean scaling to large-way SMP systems such as 4-way, 8-way and larger. At present, Linux has been shown to scale moderately on a 4-way system [SMPScale], but scalability to 8-way and beyond has not yet been established for e-business workloads. One factor that may limit scalability beyond 4-way is the structure of the current Linux scheduler.
The current Linux scheduler has the following characteristics (these are examined in much more detail in the next section of this white paper):
The current Linux scheduler was primarily designed for workloads where the number of runnable threads is small. This assumption, while valid in many environments, can be incorrect for some enterprise-class workloads. For example, as shown in previous work [JThread, SMPScale], there are workloads where the current scheduler can consume 20-30% of total system time. Also, some early results [LKMtalk] based on measurements of contention for the scheduler's runqueue lock indicate that a single Linux run queue may not be appropriate for large-way SMP systems.
From a queueing-theory perspective, the run queue represents a single server, and access to it is serialized through the spinlock. The clients are the processors requiring access to the run queue for scheduling purposes. In such a scenario, a scalability problem can manifest itself for three reasons:
The general solution commonly deployed in situations of lock contention is to reorganize the protected data structure so that the average lock hold time is reduced. If that does not result in the desired reduction in lock contention, the data structure is broken up or partitioned into smaller parts, each protected by its own lock.
For the Linux scheduler, there are fundamentally two ways to address these issues: organize the run queue as a priority queue, or provide a separate run queue per processor.
In some sense, these two techniques are different ways of accomplishing the same thing. If the scheduler queue were organized as a priority queue, only the highest non-empty priority level would need to be scanned, reducing the number of tasks examined per scheduling decision. The time spent holding the run queue lock might then already be reduced enough (by the introduction of the priority queue) that the run queue is no longer a bottleneck in the system. If not, we can introduce a run queue per processor. In that case each queue holds fewer tasks, and the number of tasks examined per scheduling decision is again reduced.
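As a purely conceptual sketch of the first approach (all identifiers below, such as prio_runqueue and pick_next, are invented for illustration and do not come from the 2.4 kernel or any existing patch), a priority-queue organization might keep one list per priority level and use a bitmap to locate the highest non-empty level without examining every runnable task:

    #include <linux/list.h>
    #include <linux/sched.h>
    #include <linux/spinlock.h>
    #include <asm/bitops.h>

    #define NR_PRIO 64                          /* illustrative number of levels */

    struct prio_runqueue {
            spinlock_t       lock;              /* queue still guarded by one lock     */
            unsigned long    bitmap;            /* bit p set => queue[p] is non-empty  */
            struct list_head queue[NR_PRIO];    /* one list of runnable tasks per level,
                                                   level 0 being the highest priority  */
    };

    /* Select the next task by finding the highest non-empty priority level
     * from the bitmap instead of computing a weight for every runnable task. */
    static struct task_struct *pick_next(struct prio_runqueue *rq)
    {
            int prio;

            if (!rq->bitmap)
                    return NULL;                     /* no runnable tasks              */
            prio = ffs(rq->bitmap) - 1;              /* lowest set bit = highest level */
            return list_entry(rq->queue[prio].next, struct task_struct, run_list);
    }

Whether such a reorganization alone reduces the lock hold time enough, or whether per-processor run queues are still needed, is exactly the kind of question this project aims to answer.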
In this open source project, for which we solicit participation, we propose examining different run queue organizations and schedulers from the standpoint of efficiency, both in the uniprocessor and multiprocessor cases, and for workloads with small as well as large numbers of runnable tasks. Our goal is to produce a scalable scheduler that also performs acceptably in the cases for which the current Linux scheduler was designed to be optimal.
As a first direction in this project we plan to:
By studying the obtainable scalability in each of the above categories, we hope first to establish how well, and to what extent, the current Linux kernel scales. Second, we want to determine how changes to mechanisms and algorithms can contribute to improving the scalability of the kernel.
Before redesigning the scheduler, a review of the current scheduler is in order. This review is based on the recent 2.4.0-test kernels.
The basic schedulable unit in Linux is the task. A task consolidates state related to the address space, memory management, signal management, open files and privileges. In essence, it represents the state associated with a process in a traditional UNIX environment. Linux does not distinguish between processes and threads. However, it enables sharing of certain state between processes, such as the memory management structures and the address space, through the clone system call, thus enabling lightweight thread support. Both processes and threads are represented by task structures and, from a scheduling point of view, are not distinguished.
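As a rough user-space illustration of this sharing (a sketch only; the particular flag combination shown is one plausible choice for a lightweight thread, not a recommendation), clone() can be asked to share the parent's address space, file descriptors and signal handlers:

    /* Minimal sketch: create a lightweight "thread" by cloning the current
     * task while sharing its address space, files and signal handlers.
     * Error handling is omitted for brevity. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int shared = 0;

    static int worker(void *arg)
    {
            shared = 42;                      /* visible to the parent: same mm */
            return 0;
    }

    int main(void)
    {
            char *stack = malloc(64 * 1024);  /* stack for the new task */

            clone(worker, stack + 64 * 1024,  /* stack grows downward on x86 */
                  CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND,
                  NULL);
            sleep(1);                         /* crude wait for the child */
            printf("shared = %d\n", shared);  /* prints 42: one address space */
            return 0;
    }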
The current scheduler supports preemption of tasks executing in user space, but does not allow preemption of a task while it is executing in system space, i.e. while executing kernel code such as a system call. Instead, code segments with long kernel residency can improve responsiveness by checking for scheduling requirements at appropriate locations. Time, for scheduling purposes, is measured in units of timer ticks. Timer ticks are generated at an architecture-dependent frequency (e.g. x86=100Hz, SPARC=1000Hz) by an interrupt mechanism. A task is assigned a certain time quantum, measured in ticks, for which it can execute before it is preempted at the end of its time slice. Priority preemption can occur at any time while a task is executing in user space or when it exits kernel space.
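For example, a long-running kernel code path typically yields at a safe point when a reschedule is pending; a minimal sketch (the loop and the per-item work are hypothetical, not a specific kernel routine) looks like this:

    /* Hypothetical long-running kernel loop that voluntarily yields when the
     * scheduler has flagged a reschedule for the current task. */
    for (i = 0; i < nr_items; i++) {
            process_one_item(&items[i]);      /* hypothetical unit of work         */
            if (current->need_resched)        /* set by the timer tick or a wakeup */
                    schedule();               /* yield, then resume the loop       */
    }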
The current scheduler bases its scheduling decision on various attributes of a task:
policy:        The scheduling policy of the task. Of particular interest is the distinction between real-time tasks and all others (SCHED_OTHER).
nice:          The nice value assigned to a task, inherited across fork or set via the sys_setpriority() system call, with values ranging from -20..19.
rt_priority:   The priority of a real-time task.
mm:            A pointer to the memory management data structure (e.g. page tables, virtual memory areas (vmas)).
processor:     The processor on which the task is currently running or on which it last ran.
has_cpu:       Binary flag indicating whether the task is currently executing. It is set and cleared by the scheduler code. Tasks that are executing are not eligible for scheduling.
counter:       Number of timer ticks until the task will be preempted. The value is decremented for a running task each time the timer service routine executes.
need_resched:  Set if the task needs to be rescheduled (see preemption).
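For orientation, here is an abridged and reordered sketch of how these attributes appear in the 2.4 task structure (many fields are omitted; see include/linux/sched.h for the authoritative definition):

    /* Abridged sketch of the scheduler-relevant fields of the 2.4 task_struct.
     * Field order is rearranged and most members are omitted. */
    struct task_struct {
            volatile long     need_resched;   /* a reschedule has been requested     */
            long              counter;        /* timer ticks left in the time slice  */
            long              nice;           /* nice value, -20..19                 */
            unsigned long     policy;         /* SCHED_OTHER, SCHED_FIFO or SCHED_RR */
            struct mm_struct *mm;             /* address space / page tables         */
            int               has_cpu;        /* task is currently executing         */
            int               processor;      /* CPU it runs on or last ran on       */
            unsigned long     rt_priority;    /* priority of a real-time task        */
            /* ... many other fields ... */
    };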
The processor of the currently executing context is identified by the cpu_curr() macro. The current macro identifies the task of the currently executing context.
There are two main scheduling functions: schedule(), which selects the next task to run on the invoking processor, and reschedule_idle(), which is called when a task becomes runnable and determines the processor (preferably an idle one) on which that task should preempt the currently running task.
The scheduler uses a single run queue (runqueue_head) and a single spinlock (runqueue_lock) to protect access to this queue. The queue is unordered, which makes insertion and deletion of a task straightforward and efficient. However, in order to select a new task to run, the scheduler, when executing on a particular processor, locks and traverses the entire run queue and computes a goodness value (also called the weight) for each task tsk. Weight values range from 0..MAX for regular tasks (SCHED_OTHER) and from 1000 upward for real-time tasks. MAX is determined by the architecture-dependent NICE_TO_TICKS macro, but yields values well below 1000; hence, real-time tasks are always given preference over regular tasks. Two types of affinity are taken into account for SCHED_OTHER tasks. The first is cache affinity: if the invoking processor is equal to tsk->processor, the weight is increased by PROC_CHANGE_PENALTY (15 or 20, depending on the architecture). The second is memory management affinity: if the memory management object of current is the same as that of tsk, the weight is increased by 1, reflecting the page table switching overhead (e.g. TLB flushes) that is avoided. When the run queue contains only SCHED_OTHER tasks that have exhausted their time quantum (counter), all tasks in the system are traversed and their counter values are recalculated. The rationale is that tasks that have been inactive longer should receive some preference when they are woken up.
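The selection logic described above can be condensed into the following sketch of the goodness calculation (simplified from kernel/sched.c; not a verbatim copy, and details such as SCHED_YIELD handling are omitted):

    /* Simplified sketch of the per-task goodness (weight) computation used
     * when choosing the next task to run on this_cpu. */
    static int goodness(struct task_struct *tsk, int this_cpu,
                        struct mm_struct *this_mm)
    {
            int weight;

            if (tsk->policy != SCHED_OTHER)          /* real-time task           */
                    return 1000 + tsk->rt_priority;  /* always beats SCHED_OTHER */

            weight = tsk->counter;                   /* remaining time slice     */
            if (!weight)                             /* quantum exhausted        */
                    return 0;
    #ifdef CONFIG_SMP
            if (tsk->processor == this_cpu)          /* cache affinity bonus     */
                    weight += PROC_CHANGE_PENALTY;
    #endif
            if (tsk->mm == this_mm)                  /* same page tables         */
                    weight += 1;
            weight += 20 - tsk->nice;                /* nicer tasks weigh less   */
            return weight;
    }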
In general, the current scheduler does a good job of avoiding lock contention when the system is predominantly executing user-level processes that do not create heavy kernel activity (defined as yielding, I/O, process creation, etc.). In this case, the majority of scheduling requests originate from preemption. To avoid lock contention on preemption, the current kernel disperses the timer interrupts of the different processors throughout a single scheduling cycle (T), so that they fire at equidistant intervals throughout T.
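Conceptually, the offset of each processor's scheduling tick within the cycle could be computed as in the small helper below (written purely to illustrate the equidistant spacing; it is not a kernel function):

    /* Illustrative helper: spread per-CPU scheduling ticks evenly across one
     * scheduling cycle of cycle_ticks ticks, so that the processors do not
     * all contend for runqueue_lock at the same instant. */
    static unsigned long tick_offset(int cpu, int nr_cpus,
                                     unsigned long cycle_ticks)
    {
            return ((unsigned long)cpu * cycle_ticks) / nr_cpus;
    }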
[JThread] "Java
Technology, Threads, and Scheduling in Linux", R. Bryant, B. Hartner, Java
Technology Update, Volume IV, Issue 1, January 2000, archived at:
[SMPScale] "SMP Scalability Comparisons of
Linux Kernels 2.2.14 and 2.3.99," R.
Bryant, B. Hartner, Q. He, and G. Venkitachalam, Proceedings of the 4th
Annual Showcase and Conference, Atlanta, GA, October 10, 2000, pp.
195-208. Also available at:
[LKMtalk] "Lockmeter: Highly-Informative
Instrumentation for Spin Locks in the Linux Kernel,"
foils from talk at the 4th Annual Linux Showcase and Conference,
Atlanta. Available at:
http://www-4.ibm.com/software/developer/library/java2/index.html
http://www.usenix.org/publications/library/proceedings/als2000/bryantscale.html
http://oss.software.ibm.com/developerworks/opensource/linux/presentations/lockmeter/als2000/index.html