Mike Kravetz, Hubertus Franke
IBM Linux Technology Center
kravetz@us.ibm.com, frankeh@us.ibm.com
Version 0.2, April 2001
The purpose of this implementation is to explore what effect multiple run-queues have on the scalability of the Linux scheduler. The design goals of the implementation are the following:
A new data structure, runqueue_data, describes a CPU specific run-queue:

/*
 * runqueue_data
 * One data item per CPU in the system.  Size should be a multiple
 * of cache line size, and array of items should start on a cache line
 * boundary.
 */
typedef union runqueue_data {
    struct rq_data {
        int nt_running;                     /* # of tasks on runqueue */
        int max_na_goodness;                /* maximum non-affinity */
                                            /* goodness value of */
                                            /* 'schedulable' task */
                                            /* on this runqueue */
        struct task_struct * max_na_ptr;    /* pointer to task which */
                                            /* has max_na_goodness */
        unsigned long max_na_cpus_allowed;  /* copy of cpus_allowed */
                                            /* field from task with */
                                            /* max_na_goodness */
        struct list_head runqueue;          /* list of tasks on runqueue */
        int running_non_idle;               /* flag to indicate this cpu */
                                            /* is running something */
                                            /* besides the idle thread */
        spinlock_t runqueue_lock;           /* lock for this runqueue */
    } rq_data;
    char __pad [SMP_CACHE_BYTES];
} runqueue_data_t;
extern runqueue_data_t runqueue_data[];

#define runqueue_lock(cpu)     runqueue_data[(cpu)].rq_data.runqueue_lock
#define runqueue(cpu)          runqueue_data[(cpu)].rq_data.runqueue
#define max_na_goodness(cpu)   runqueue_data[(cpu)].rq_data.max_na_goodness
#define max_na_cpus_allowed(cpu) \
        runqueue_data[(cpu)].rq_data.max_na_cpus_allowed
#define max_na_ptr(cpu)        runqueue_data[(cpu)].rq_data.max_na_ptr
#define nt_running(cpu)        runqueue_data[(cpu)].rq_data.nt_running
#define running_non_idle(cpu)  runqueue_data[(cpu)].rq_data.running_non_idle

In addition to the new run-queue data structure, new fields are added to the aligned_data data structure. The new definition of aligned_data is as follows:
/*
 * aligned_data
 * CPU specific scheduling data.  One data item for each CPU
 * in the system.  Size should be a multiple of cache line size,
 * and array of items should start on a cache line boundary.
 */
typedef union aligned_data {
    struct schedule_data {
        struct task_struct * curr;   /* current task on this CPU */
        cycles_t last_schedule;      /* time of last scheduling */
                                     /* decision */
#ifdef CONFIG_SMP
        int curr_na_goodness;        /* non-affinity goodness */
                                     /* value of current task */
#endif
    } schedule_data;
    char __pad [SMP_CACHE_BYTES];
} aligned_data_t;
extern aligned_data_t aligned_data[];

#define cpu_curr(cpu) aligned_data[(cpu)].schedule_data.curr
#ifdef CONFIG_SMP
#define curr_na_goodness(cpu) aligned_data[(cpu)].schedule_data.curr_na_goodness
#endif
#define last_schedule(cpu) aligned_data[(cpu)].schedule_data.last_schedule

The above declarations are made in the sched.h header file. The actual definitions (allocating storage) are made in the file sched.c as follows:
/*
 * We align per-CPU scheduling data on cacheline boundaries,
 * to prevent cacheline ping-pong.  Initialized in sched_init.
 */
aligned_data_t aligned_data [NR_CPUS] __cacheline_aligned;
runqueue_data_t runqueue_data [NR_CPUS+1] __cacheline_aligned;

The use of these new data structures and fields will be described in the following sections.
The task_to_runqueue() routine determines which run-queue a task belongs on: real-time tasks map to the real-time run-queue, and all other (SCHED_OTHER) tasks map to the run-queue of the CPU to which they are assigned:

static inline int task_to_runqueue(struct task_struct *t)
{
    int rq;

    if ((t->policy & ~SCHED_YIELD) != SCHED_OTHER) {
        rq = REALTIME_RQ;
    } else {
        rq = t->processor;
    }
    return(rq);
}

Associated with each run-queue is a separate lock used to synchronize operations on that run-queue. In general, the scheduler code only needs to obtain the lock associated with the run-queue on which it is operating. Normally this is a CPU specific run-queue. Therefore, multiple instances of the scheduler can be running in parallel on separate CPUs.
In some cases it may be desirable to hold more than one CPU specific run-queue lock at a time. In these cases there is always a primary run-queue lock which must be held. While holding the primary lock, it may be desirable (but not absolutely necessary) to acquire a secondary run-queue lock. In such cases the spin_trylock() routine is used to avoid deadlock. Descriptions of this locking methodology will be evident in subsequent sections of this document.
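As a rough illustration of this methodology (primary_cpu and secondary_cpu are placeholder names, not identifiers from the patch), the pattern is something like the following:

/*
 * Illustrative pattern only: always take the primary run-queue lock,
 * then opportunistically try for a secondary run-queue lock with
 * spin_trylock() so that no deadlock can occur if another CPU is
 * acquiring the same two locks in the opposite order.
 */
spin_lock(&runqueue_lock(primary_cpu));          /* primary lock: always held */

if (spin_trylock(&runqueue_lock(secondary_cpu))) {
    /* ... operate on both run-queues ... */
    spin_unlock(&runqueue_lock(secondary_cpu));
} else {
    /* could not get the secondary lock immediately; simply skip it */
}

spin_unlock(&runqueue_lock(primary_cpu));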
In this implementation of a multi-queue scheduler, we attempt to preserve the global scheduling decisions/behavior of the existing Linux scheduler. Therefore, this new scheduler must have more global knowledge than a traditional multi-queue scheduler. To facilitate this global knowledge, we keep track of something called non-affinity goodness (na_goodness) for specific tasks in the system. na_goodness is defined as simply the goodness value of a task without any consideration for CPU or memory map affinity. In the current goodness() routine, the goodness value of a SCHED_OTHER task is determined by the number of ticks remaining in its time slice (its counter value) and its nice value, plus a boost (PROC_CHANGE_PENALTY) if the task last ran on the CPU making the decision and a small boost if its memory map matches that of the task currently running there; na_goodness simply omits these two affinity boosts.
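The following is a minimal sketch of the goodness helpers referenced by the code below (na_goodness() and local_goodness()), assuming the standard 2.4 goodness() weights; the actual helper bodies in the patch may differ:

/*
 * Sketch only.  na_goodness() is the goodness value with no affinity
 * consideration; local_goodness() adds the memory map boost, while the
 * CPU affinity boost (PROC_CHANGE_PENALTY) is added by the caller.
 */
static inline int na_goodness(struct task_struct *p)
{
    int weight = p->counter;         /* ticks left in the time slice */

    if (!weight)
        return 0;                    /* time slice exhausted */
    return weight + 20 - p->nice;    /* no CPU or memory map boosts */
}

static inline int local_goodness(struct task_struct *p, struct mm_struct *this_mm)
{
    int weight = na_goodness(p);

    if (weight && (p->mm == this_mm || !p->mm))
        weight += 1;                 /* memory map affinity boost */
    return weight;
}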
/*
 * Search CPU specific runqueue
 */
list_for_each(tmp, &runqueue(this_cpu)) {
    p = list_entry(tmp, struct task_struct, run_list);
    if (local_can_schedule(p)) {
        int weight = local_goodness(p, prev->active_mm);
        if (weight > c) {
            if (!next->has_cpu) {
                /*
                 * prev_next must point to a
                 * schedulable task.
                 */
                prev_next_weight = c;
                prev_next = next;
            }
            c = weight;
            next = p;
        } else if (weight > prev_next_weight) {
            prev_next_weight = weight;
            prev_next = p;
        }
    }
}

Note that after the above search of the local CPU specific run-queue, the following variables are set:

c - the goodness value (as computed by local_goodness(), without any CPU affinity boost) of the best schedulable task found on the local run-queue
next - pointer to the task with that goodness value
prev_next - pointer to the schedulable task with the next highest goodness value
prev_next_weight - the goodness value of the task pointed to by prev_next
To make a global scheduling decision, this multi-queue scheduler will examine the max_na_goodness values of all run-queues in the system. This examination is performed without holding the locks of remote run-queues. Since multiple examinations of these max_na_goodness values may be needed, we create a list of these values on the local stack. Code to create this list is something like the following:
/*
 * As calculated above, c does not contain a CPU affinity boost.
 * We must add this boost before comparing to tasks on other
 * runqueues.  Only add PROC_CHANGE_PENALTY if c is a positive
 * goodness value.
 */
if (c > 0) {
    c += PROC_CHANGE_PENALTY;
}

/*
 * Copy max_na_goodness values from CPU specific runqueue
 * structures to the list on our stack.
 */
rrq = -1;
tmp_na_goodness = c;
for (rq = 0; rq < smp_num_cpus; rq++) {
    rcpu = cpu_logical_map(rq);
    if (rcpu == this_cpu) {
        stack_list[rcpu] = NA_GOODNESS_INIT;
        continue;
    }
    if (!this_cpu_allowed(max_na_cpus_allowed(rcpu), this_cpu)) {
        stack_list[rcpu] = NA_GOODNESS_INIT;
        continue;
    }
    if (max_na_goodness(rcpu) <= c) {
        stack_list[rcpu] = NA_GOODNESS_INIT;
        continue;
    }

    stack_list[rcpu] = max_na_goodness(rcpu);
    if (stack_list[rcpu] > tmp_na_goodness) {
        rrq = rcpu;
        tmp_na_goodness = stack_list[rcpu];
    }
}

NA_GOODNESS_INIT is defined to be an artificially low value. When building this local list, we discard any values less than or equal to c (the highest goodness value on our local CPU specific run-queue, with CPU affinity now taken into account). In addition, we examine the cpus_allowed field and discard any tasks which are not allowed to run on this CPU. If no remote run-queue contains a max_na_goodness value higher than c, then we know that the task pointed to by next is the best choice for this CPU. However, if we do find a max_na_goodness value on a remote run-queue which is higher than c, then there may be a task on another run-queue which is more desirable to run on this CPU. We choose the remote run-queue with the highest max_na_goodness value and do the following:
/*
 * First try to lock the remote runqueue and verify
 * the max_na_goodness value.
 */
if (spin_trylock(&runqueue_lock(rrq))) {
    if (max_na_goodness(rrq) > c &&
        this_cpu_allowed(max_na_cpus_allowed(rrq), this_cpu)) {
        /*
         * Move a remote task to our run-queue
         */
        if (!next->has_cpu) {
            prev_next = next;
        }
        next = max_na_ptr(rrq);
        c = max_na_goodness(rrq);
        del_from_runqueue(next);
        next->processor = this_cpu;
        add_to_runqueue(next);

        /*
         * We have stolen a task from another
         * runqueue, quit looking.
         */
        spin_unlock(&runqueue_lock(rrq));
        break;
    }
    spin_unlock(&runqueue_lock(rrq));
}

Note that we only attempt to acquire the remote run-queue lock via spin_trylock(). If we are unable to immediately obtain the lock, we ignore this remote run-queue in this scheduling decision. If we are able to obtain the remote run-queue's lock, we then need to verify that there is still a task with a sufficiently high max_na_goodness value that is allowed to run on the current CPU (since we can only be certain of obtaining accurate values while holding the lock). If these conditions are still met, the task pointed to by max_na_ptr (the task with the highest na_goodness value on that run-queue) is moved to the local run-queue. This task is then designated as the next task to run on this CPU.
If we are either unable to lock the remote run-queue, or after obtaining the lock we determine that the remote run-queue does not contain a task meeting the necessary requirements, then the remote run-queue is ignored in this scheduling decision. In this case, the list of max_na_goodness values (residing on our stack) is searched to find the remote run-queue with the next highest max_na_goodness value. The process of trying to obtain the remote run-queue lock and verifying remote max_na_goodness (and cpus_allowed) values is then repeated for the remote run-queue with the next highest value. This process is repeated until either a remote task is found and moved to the local run-queue, or no remote run-queue with a max_na_goodness value greater than c remains; in the latter case, the task pointed to by next (from the local run-queue) is chosen.
The max_na_goodness, max_na_cpus_allowed and max_na_ptr fields of a CPU specific run-queue are maintained in the following routines:

schedule() - Before dropping the CPU specific run-queue lock, these values are updated with the na_goodness value of the highest priority schedulable task. By definition, the prev and next tasks are not schedulable because the has_cpu field is set in their task structures. While looking for the next task to run, schedule() keeps track of the next highest priority task on the local run-queue. If the next highest task on the local run-queue is the idle task, then max_na_goodness is set to NA_GOODNESS_INIT and max_na_ptr is set to NULL.

add_to_runqueue() - When adding a task to a CPU specific run-queue, a check is made to determine whether the na_goodness value of the added task is greater than max_na_goodness for the run-queue. If it is, then the fields are updated to reflect the values of the added task (see the sketch following this list).

del_from_runqueue() - When removing a task from a CPU specific run-queue, a check is made to determine whether max_na_ptr points to the task being removed. If it does, then max_na_ptr is set to NULL and max_na_goodness is set to NA_GOODNESS_INIT.
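The following is a minimal sketch of how add_to_runqueue() and del_from_runqueue() could perform this bookkeeping, using the macros defined earlier; it is not the literal patch code, and details such as real-time run-queue handling are omitted:

/*
 * Sketch only: keep max_na_goodness/max_na_ptr consistent as tasks
 * enter and leave a CPU specific run-queue.
 */
static inline void add_to_runqueue(struct task_struct *p)
{
    int rq = task_to_runqueue(p);
    int na = na_goodness(p);

    list_add(&p->run_list, &runqueue(rq));
    nt_running(rq)++;
    if (na > max_na_goodness(rq)) {
        max_na_goodness(rq) = na;
        max_na_cpus_allowed(rq) = p->cpus_allowed;
        max_na_ptr(rq) = p;
    }
}

static inline void del_from_runqueue(struct task_struct *p)
{
    int rq = task_to_runqueue(p);

    nt_running(rq)--;
    list_del(&p->run_list);
    p->run_list.next = NULL;
    if (max_na_ptr(rq) == p) {
        max_na_goodness(rq) = NA_GOODNESS_INIT;
        max_na_ptr(rq) = NULL;
    }
}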
The first thing the reschedule_idle() routine does is check for a fast path of execution. This fast path can be taken if the CPU associated with the target task is idle. If the CPU is idle, then we ensure that the target task is on the CPU specific run-queue and inform the CPU that there is a task ready to be scheduled. This is exactly what the current reschedule_idle() code does today. Things get more complicated when this fast path is not taken and we must search for a target CPU on which to schedule the task.
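Before turning to that harder case, here is a rough sketch of the fast path just described (the variable names are illustrative; this is not the literal patch code):

/*
 * Fast path sketch: the CPU the task is associated with is idle, so
 * simply make sure the task is on that CPU's run-queue and kick the
 * CPU.  The caller already holds runqueue_lock(tsk_cpu).
 */
tsk_cpu = p->processor;
if (cpu_curr(tsk_cpu) == idle_task(tsk_cpu)) {
    if (!task_on_runqueue(p))
        add_to_runqueue(p);
    cpu_curr(tsk_cpu)->need_resched = 1;
    if (tsk_cpu != smp_processor_id())
        smp_send_reschedule(tsk_cpu);    /* IPI: a task is ready to run */
    return;
}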
Like the schedule() routine, the reschedule_idle() routine would like to make a global scheduling decision. In the current version of the Linux scheduler, global decisions can easily be made because there is a global run-queue lock. While holding this global run-queue lock, it is possible to examine the state of all CPUs in the system. While the run-queue lock is held, the list of currently running tasks will remain constant. We are afforded no such luxury in this multi-queue scheduler implementation. Rather, we only know for certain the state of the CPU associated with the target task (because the associated run-queue lock is held).
To make a global scheduling decision in reschedule_idle(), the state of all CPUs in the system must be examined. Such state information is contained in the list of CPU specific aligned_data data structures. Like the schedule() routine above, it is desirable to first examine these fields without holding CPU specific run-queue locks. Also like the schedule() routine above, we may wish to examine this data multiple times. Hence, we create a local list on our stack which contains this state information:
/*
 * Create a list of current na_goodness values on our stack.
 * Only values less than the non-affinity goodness value of
 * p should be considered for preemption.
 */
saved_na_goodness = na_goodness(p) - 1;  /* preemption_goodness() > 1 */
tmp_min_na_goodness = saved_na_goodness;
curr_cycles = get_cycles();
target_cpu = -1;
for (i = 0; i < smp_num_cpus; i++) {
    cpu = cpu_logical_map(i);

    if (!can_schedule(p, cpu)) {
        stack_list[cpu] = saved_na_goodness;
        continue;
    }

    if (curr_na_goodness(cpu) == NA_GOODNESS_INIT) {
        /*
         * Indicates an idle task.  For idle tasks, determine
         * the amount of time they have been idle.  Use the
         * negative of this value in the list.  Hence, we
         * first choose the CPU that has been idle the longest.
         */
        tmp_cycles = curr_cycles - last_schedule(cpu);
        if (tmp_cycles > INT_MAX) {
            stack_list[cpu] = INT_MIN;
        } else {
            stack_list[cpu] = (int)-tmp_cycles;
        }
    } else {
        stack_list[cpu] = curr_na_goodness(cpu);

        /*
         * Add in PROC_CHANGE_PENALTY for remote CPUs
         */
        if (cpu != tsk_cpu) {
            stack_list[cpu] += PROC_CHANGE_PENALTY;
        }
    }

    /*
     * Look for the lowest value
     */
    if (stack_list[cpu] < tmp_min_na_goodness) {
        target_cpu = cpu;
        tmp_min_na_goodness = stack_list[cpu];
    }
}

The values in this local list indicate how appropriate it is to run the target task on the corresponding CPU: the lower the value, the more appropriate that CPU is. Note that idle CPUs receive negative values, and the longer a CPU has been idle, the more negative the number. CPUs running non-idle tasks receive positive values corresponding to the na_goodness value of the task currently running there. When scanning the list of values, we only need to take note of values less than the na_goodness value of the target task, because the target task can only preempt tasks with a lower goodness value.
From the local list, the CPU associated with the lowest value is chosen and the following code is run:
if (target_cpu == tsk_cpu &&
    preemption_goodness((tsk = cpu_curr(target_cpu)), p, target_cpu) > 1) {
    /*
     * If target_cpu is tsk_cpu, then no additional
     * locking is required (we already have the CPU
     * specific runqueue locked).  We also know that
     * this CPU can not be idle, otherwise the fast
     * path at the beginning of this routine would
     * have been executed.  Therefore, simply send
     * the IPI if required.
     */
    if (!task_on_runqueue(p)) {
        add_to_runqueue(p);
    }
    tsk = cpu_curr(target_cpu);
    tsk->need_resched = 1;
    if (target_cpu != this_cpu) {
        smp_send_reschedule(target_cpu);
    }
    return;
}
/*
 * Try to lock runqueue and verify na_goodness value.
 */
else if (spin_trylock(&runqueue_lock(target_cpu))) {
    tsk = cpu_curr(target_cpu);
    if ((tsk == idle_task(target_cpu)) ||
        (preemption_goodness(tsk, p, target_cpu) > 1)) {
        /*
         * Target CPU is idle, or it is running
         * a task with lower priority than p.
         * Therefore, move p to target runqueue.
         */
        if (task_on_runqueue(p)) {
            del_from_runqueue(p);
        }
        p->processor = target_cpu;
        add_to_runqueue(p);

        /*
         * Send an IPI to target CPU, unless the
         * CPU is idle and the need_resched flag
         * has already been set.
         */
        need_resched = tsk->need_resched;
        tsk->need_resched = 1;
        if ((target_cpu != this_cpu) &&
            ((tsk != idle_task(target_cpu)) || !need_resched)) {
            smp_send_reschedule(target_cpu);
        }
        spin_unlock(&runqueue_lock(target_cpu));
        return;
    }
    spin_unlock(&runqueue_lock(target_cpu));
}

Note that we only attempt to get the lock associated with a remote run-queue via spin_trylock(). If we are unable to immediately obtain the lock, we skip the remote CPU in this scheduling decision. After obtaining the lock, we check that the remote CPU is either idle or running a task which the target task can preempt. If this is not the case, then the remote CPU is skipped in this scheduling decision. If the remote CPU looks like a good place for the target task, then we add the target task to the run-queue of the remote CPU and inform the remote CPU that it potentially needs to schedule the new task.
If we skip a remote CPU for any of the above reasons, then we go on to the entry in our local list with the next lowest value and repeat the above process. When all CPUs with values less than the na_goodness value of the target task have been exhausted, then the target task will remain (or be put) on its original run-queue. A check will then be made to determine if the target task can preempt the task currently running on its original run-queue.
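A rough sketch of that final fallback (using the same helpers as the code above; not the literal patch code) looks like this:

/*
 * Fallback sketch: no better CPU was found, so leave p on (or return
 * it to) its original run-queue and preempt the task running there if
 * p has a sufficiently higher goodness value.  runqueue_lock(tsk_cpu)
 * is still held here.
 */
if (!task_on_runqueue(p))
    add_to_runqueue(p);
tsk = cpu_curr(tsk_cpu);
if (preemption_goodness(tsk, p, tsk_cpu) > 1) {
    tsk->need_resched = 1;
    if (tsk_cpu != this_cpu)
        smp_send_reschedule(tsk_cpu);
}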
Because a task can move from one run-queue to another, any code which needs to lock the run-queue associated with a task must verify, after acquiring the lock, that the task is still associated with that run-queue:

int rq = task_to_runqueue(t);

spin_lock(&runqueue_lock(rq));
while (task_to_runqueue(t) != rq) {
    spin_unlock(&runqueue_lock(rq));
    rq = task_to_runqueue(t);
    spin_lock(&runqueue_lock(rq));
}
The schedule() routine must take into account the real-time run-queue while searching for the next task to run. By definition, a real-time task has higher priority (goodness) than any other (SCHED_OTHER) task. Therefore, the schedule() routine must examine the real-time run-queue before examining its own CPU specific run-queue. If a schedulable task exists on the real-time run-queue, then that task must be chosen to run next. Ordering of real-time tasks within the real-time run-queue will remain as it is today. This preserves support for the required real-time scheduling policies.
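As a rough illustration (not the literal patch code), the real-time check at the top of the schedule() search might take a form like the following; in the real code the chosen task would also have to be claimed before the lock is dropped:

/*
 * Sketch only: examine the real-time run-queue before the CPU
 * specific run-queue.  Any schedulable real-time task beats a
 * SCHED_OTHER task, and higher rt_priority wins among them.
 */
spin_lock(&runqueue_lock(REALTIME_RQ));
list_for_each(tmp, &runqueue(REALTIME_RQ)) {
    p = list_entry(tmp, struct task_struct, run_list);
    if (can_schedule(p, this_cpu) &&
        (next == idle_task(this_cpu) || p->rt_priority > next->rt_priority)) {
        next = p;    /* best real-time candidate so far */
    }
}
spin_unlock(&runqueue_lock(REALTIME_RQ));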
Like CPU specific run-queues, the real-time run-queue has a lock associated with it. However, unlike CPU specific run-queue locks, there are situations where a CPU specific run-queue lock is held and the real-time run-queue lock must also be obtained. In these situations we define a lock hierarchy: the CPU specific lock must be acquired before the real-time lock. Such a locking hierarchy prevents deadlocks while obtaining run-queue locks.
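For example (an illustrative fragment, not patch code), a path that holds its CPU specific run-queue lock and then needs the real-time run-queue lock follows the hierarchy directly:

/*
 * Lock hierarchy: CPU specific run-queue lock first, then the
 * real-time run-queue lock; release in the reverse order.
 */
spin_lock(&runqueue_lock(this_cpu));         /* CPU specific lock */
spin_lock(&runqueue_lock(REALTIME_RQ));      /* real-time lock, taken second */

/* ... manipulate both run-queues ... */

spin_unlock(&runqueue_lock(REALTIME_RQ));
spin_unlock(&runqueue_lock(this_cpu));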
As described in the sections above, this implementation attempts to make global scheduling decisions to emulate the behavior of the current scheduler. As a result, each scheduling decision requires examination of multiple instances of CPU specific data. Therefore, as more CPUs are introduced into the system, more data must be examined at each scheduling decision. Also, this CPU specific data is contained in separate CPU specific cache lines. This means that on every scheduling decision, each of these cache lines must be read by the CPU making the decision. As the number/rate of scheduling decisions increases, these cache lines will become a source of contention at the hardware memory management level. For these reasons, we know that the scalability of this implementation will eventually break down.