The Democratic Kernel Scheduler#
One of the hardest classes in BYU’s computer science program is CS 345: Operating Systems Design. When I took the class last Fall, I learned how difficult kernel development is and how complex the puzzles of operating system internals are. One of the most famous of these puzzles is that of the scheduler. Before taking the class, I had a few years of experience with High Performance Computing schedulers such as Slurm and PBS Pro. This type of scheduling software is different from that of an operating system, but helped me realize that this was a highly important and difficult puzzle to solve.
CS 345 involves a set of projects of varying difficulty. A scaffolding OS is provided, and a feature of the OS is implemented in each project. Two of the seven projects (#2 and #5) are to implement different types of schedulers: first a prioritized round-robin queue, then later a fair-share system. I enjoyed these two projects more than any others in the class, which left me with a quest to create a unique scheduling algorithm. Hence, the Democratic Kernel Scheduler (DKS)
What is a Scheduler?#
Wikipedia broadly defines a scheduler in the context of computing as follows: ‘In computing, scheduling is the method by which work is assigned to resources that complete the work. The work may be virtual computation elements such as threads, processes or data flows, which are in turn scheduled onto hardware resources such as processors, network links or expansion cards”. More specific to the kernel and tasking, a process scheduler is defined as follows: “The scheduler is an operating system module that selects the next jobs to be admitted into the system and the next process to run.”
Essentially, the scheduler holds a list of tasks (processes) that are ready to be executed. This is called the ready queue. Whenever a task is swapped out (either cooperatively or preemptively), the scheduling algorithm yields the next task to run and places the previously executing process back in the ready queue if it is not finished. Algorithms are frequently designed to allow for priorities so that more important processes are always run before less important ones. Sometimes they are built around a sharing system, in which a group of processes is granted a certain percentage of the CPU’s available throughput.
The Democratic Scheduler#
My idea for a Democratic Scheduler consists of a sort model of democracy. Instead of some algorithm directly deciding which process to run next based on a priority system or shares, the running processes get to vote for whichever process they would like to grant CPU time to next. This system could be implemented in a variety of ways, especially with tweaks to certain aspects of the algorithm. For example, a tie-breaking method would be needed, which could be implemented in a number of ways. Additionally, it could be difficult to implement and leveraging it would most likely require a shift in programming techniques.
In any democratic system, a method of casting one’s vote is required. In Ancient Rome, a person would mark a small wooden tablet and place it inside a box in secret. In some settings, people vote by raising their hand. In the United States, a name is either selected or written in and then submitted to the government. In every democratic setting, there is a central authority that counts votes, a set of voters that meet eligibility requirements, and a mechanism for casting votes. In the proposed democratic scheduler, each of these requirements is both included and necessary.
In democratic government, voting occurs when there is a vacant position, whether that is because of a term limit, a death, or a resignation. Since this doesn’t happen all that often, it is possible to host large-scale elections in which many days are taken to collect and count votes. A person is asked for a vote because of their eligibility, and they provide a response to the request, which is the ballot. This system of ballot requests is difficult to model for computer process because of their single-tracked nature. It may be possible using signals, but would be hard to implement.
Rather, I propose a simpler method of casting votes. Operating system kernels contain a data structure filled with information about every alive process. This structure is called a process table and contains an entry for each process called a Process Control Block (PCB). I propose adding a single value to the PCB, which is a pid representing the process’s vote. The process is then granted access to a system call that will allow it to modify only its own PCB vote value.
The next part of the voting process that we must consider is that of eligibility. In most nations, it is common for each individual to be granted a single vote, given that they meet some minimum requirements. For example, in the United States, a person must be 18 years old and be a citizen of the country to vote. Likewise, in the Democratic Scheduler, a rule must be implemented to decide who gets to vote and when. Since there are multiple ways this could be implemented, I consider a few options.
One way voting eligibility could be designed is by user. For example, user1 and user2 are each granted one vote, so each of their processes casts a vote in their PCB and the scheduler picks one of the process votes per user as one that will be used. This method would ensure balance between users, so that a single user can’t just spin off a large number of processes and therefore take almost complete control of the CPU. However, it would put more stress on the kernel by requiring it to make a choice between processes belonging to each user every time it reschedules.
The other obvious way to implement eligibility is on a per-process basis. This simple method would involve the kernel treating each process as a valid and independent voting party for every scheduling cycle. As mentioned in the per user section, this may result in users spinning off large numbers of processes to get as many votes for themselves as possible. To battle this possibility, it may be appropriate to enforce a process quantity limit per user or per group.
Another important piece of an election is the counting mechanism. This system must be secure, accurate, and fast. For this reason, ballot counting machines are common in government elections. Similarly, the Democratic Scheduler must be secure and fast, and as accurate as possible.
To count the votes, the kernel must loop through all the PCBs and take note of each process’s vote. Then, it must find the pid with the most votes and run it next, while staging the rest of the votes for their respective turns depending on their vote frequency. Fortunately, this operation is pretty quick because the list of pids is usually fairly short (the max on Linux is 32768). However, the sorting phase may take longer. For this reason, it makes sense to only run the scheduling algorithm when necessary, constructing a plan for processes to run. Then, when a process state changes or a new process is created, the scheduler can rebuild its plan.
One very important item to consider is that of tie breakers. It is highly possible for many processes to receive the same number of votes. In a system where the default vote for a newly created process is its own pid, it makes sense that it may be common for every process to be equally ranked on the voting system. Because of this, a mechanism for breaking ties is required. I propose that instead of implementing a specialized system for the Democratic Kernel Scheduler, an already-existing scheduler be used with this voting-based system simply as a wrapper. For instance, separate priority queues could be used for each vote ranking.
Adjustment to programming style#
To work with the proposed PCB variable voting system, it will be necessary for programmers to adapt to take advantage of this voting-based system. I proposed a system call that alters the variable, which could be defined as
vote(pid). Programmers would need to write code to analyze the running processes on a system and select one to vote for and pass to the call. Because of the difficulty of this task, it may be uncommon for developers to actually use this feature. The default vote for a process will be its own pid, so if no specially-developed code with voting calls is being run, the operating system should function as normal. But on more advanced and complex systems (such as HPC environments), this could present a great advantage to those who take the time to learn how it works.
Although the Democratic Kernel Scheduler seems like a sort of weird idea, it could prove to be useful in some situations. I believe it would be smart to make it an optional feature in kernels so that administrators can turn it off and on when necessary.
The following is an example of a situation where the Democratic Scheduler would prove useful for someone writing a parallel program.
Consider a program that consists of two processes (A and B). Let’s say process A has finished calculating some value that it needs to combine with a value that process B is working on computing. Normally, some sort of semaphore or other locking mechanism would be used to remove process A from the ready queue until process B has completed, and then move it back to the ready queue. This method gives up all of process A’s granted CPU time, releasing it to be broken up among all the processes in the ready queue. Instead, a programmer who is aware of the Democratic Scheduler would simply use the
vote(pid) system call with process B’s pid to grant all of process A’s execution time to process B, effectively doubling the speed of process B. Then, when process B finishes, process A would change its vote back to itself again.
The American System#
Modeling a kernel scheduler after a democracy made me think of similarities and differences between my system and the US government. The following are a couple ideas that may be applied to the scheduler.
The United States electoral system is loved and hated by many. The idea is that everyone casts a vote, then the votes are grouped together geographically, and regions are granted a number of votes that they split proportionally based on the votes within the area. Areas with more people are granted more votes and smaller areas get fewer. This idea could be used in the Democratic Scheduler. As mentioned above, it could be possible to grant users or groups a share of the votes on a computer. This concept is similar to that of Slurm accounts.
A long-running process that happens to have the most votes could completely take over the CPU. For this reason, this design should not be used in RTOS situations, especially where it is essential (with real consequences) that certain processes be granted priority solely due to their important nature.
However, it is important that processes with lower numbers of votes are given at least some chance at runtime. To do this, I propose the idea of incumbency. The kernel should keep track of the number of consecutive times a process is given CPU time. After it has reached a certain amount of time, the kernel will declare the process an incumbent and pick another process to run. The incumbent process is then forced to wait for a number of scheduling cycles before it can run again.
I am pretty new to kernel development, so the development of this concept is rather slow and I may never complete it. However, I have started figuring out how to integrate this into the Linux kernel. More to come here.
Although the concept of a Democratic Scheduler may sound silly, I believe that this idea has actual potential to help developers of software for shared systems who would like to maximize their CPU runtime.