GSoC: Add SMT/HT awareness to DragonFlyBSD scheduler

Hi,

As I promised in one of my previous e-mails, I am posting on the list some of my discussions with Matthew regarding the scheduling subsystems; maybe it will be useful to someone else:

:Let's say we have a user CPU-bound process (a batchy one). The
:bsd4_schedulerclock will notice this and will mark a need for user
:rescheduling (need_user_resched();). This flag is only checked in
:bsd4_acquire_curproc (which is called when a process returns from
:kernel-space....the code from here is clear to me) and from lwkt_switch().
:My question is, where in the code is lwkt_switch() called to switch to
:another thread if you have that CPU-bound process running? Suppose that
:CPU-bound process never blocks and never enters the kernel....which
:statement in the code pushes it off the CPU by calling lwkt_switch()?

There are two basic mechanisms at work here, and both are rather sensitive and easy to break (I've broken and then fixed the mechanism multiple times over the years).

The first is that when a LWKT thread is scheduled to a cpu, that sets a flag for that cpu indicating that a LWKT reschedule may be needed. The scheduling of a LWKT thread on a cpu always occurs on that cpu (so if scheduled from a different cpu, an IPI interrupt/message is sent to the target cpu and the actual scheduling is done on the target cpu). This IPI represents an interrupt to whatever is running on that cpu, thus interrupting any user code (for example), which triggers a sequence of events which allows LWKT to schedule the higher priority thread.

The second mechanism is when a userland thread (LWP) is scheduled. It works very similarly to the first mechanism but is a bit more complex. The LWP is not directly scheduled on the target cpu. Instead the LWP is placed in the global userland scheduler queue(s), a target cpu is selected, and an IPI is sent to that target cpu (see the 'need_user_resched_remote' function in usched_bsd4.c, which is executed by the IPI message).
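The second mechanism above can be sketched as a tiny user-space model: the LWP lands on the global queue, and the originating cpu only bothers to interrupt the target cpu when the new LWP would actually win. All names here (`cpu_state`, `maybe_send_ipi`) are illustrative assumptions, not the actual kernel API.

```c
#include <stdbool.h>

/*
 * Toy model of the bsd4 cross-cpu scheduling decision.  In usched_bsd4
 * a LOWER priority value means a BETTER priority; the IPI is only sent
 * when the newly scheduled LWP beats what the target cpu is running.
 */
struct cpu_state {
    int curr_prio;          /* priority of the LWP now running (lower = better) */
    bool resched_pending;   /* models the per-cpu user-reschedule flag */
};

/* Returns true if an IPI would be sent to the target cpu. */
static bool
maybe_send_ipi(struct cpu_state *target, int new_lwp_prio)
{
    if (new_lwp_prio < target->curr_prio) {
        target->resched_pending = true;   /* models the IPI + need_user_resched */
        return true;
    }
    return false;   /* LWP simply waits on the global queue */
}
```

The key design point this captures is that the decision is made on the originating cpu, so a lower-priority LWP never costs the target cpu an interrupt.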
The IPI is only sent if the originating cpu determines that the LWP has a higher priority than the LWP currently running on the target cpu.

There is a third mechanism for userland threads related to the helper thread (see 'sched_thread' in usched_bsd4.c). The helper thread is intended to handle scheduling a userland thread on its cpu only when nothing is running on that cpu at all.

There is a fourth mechanism for userland threads (well, also for LWKT threads, but mainly for userland threads), and that is the dynamic priority and scheduler timer interrupt mechanic. This timer interrupt occurs 100 times a second and adjusts the dynamic priority of the currently running userland thread and also checks for round-robining same-priority userland threads. When it detects that a reschedule is required it flags a user reschedule via need_user_resched(). Under this fourth mechanism cpu-bound user processes will tend to round-robin on approximately 1/25 second intervals.

There is a fifth mechanism that may not be apparent, and that is the handling of an interactive user process. Such processes are nearly always sleeping but have a high priority. Because they are sleeping in kernel-land (not userland), they will get instantly scheduled via the LWKT scheduler when they are woken up (e.g. by a keystroke), causing a LWKT reschedule that switches to them from whatever user thread is currently running. Thus the interactive userland thread will immediately continue running in its kernel context, and then, when attempting to return to userland, it will determine whether its dynamic user priority is higher than the current designated user thread's dynamic priority. If it isn't, it goes back onto the usched_bsd4 global queue (effectively it doesn't return to userland immediately); if it is, it 'takes over' as the designated 'user' thread for that spot and returns to userland immediately.
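The fourth mechanism reduces to a per-tick countdown. Here is a minimal sketch, assuming a 4-tick round-robin quantum at 100hz (which yields the ~1/25 second interval mentioned above); the names `tick_state` and `toy_schedulerclock` are hypothetical stand-ins for the real bsd4_schedulerclock machinery.

```c
#include <stdbool.h>

/* 4 ticks at 100hz is approximately 1/25 of a second. */
#define TICKS_PER_RR 4

struct tick_state {
    int rr_count;        /* ticks left in the current round-robin slot */
    bool user_resched;   /* models need_user_resched() being flagged */
};

/*
 * Toy model: called once per 100hz tick while a user thread is running.
 * When the quantum expires the flag is set; the actual switch happens
 * later, at the kernel->userland boundary, not here.
 */
static void
toy_schedulerclock(struct tick_state *ts)
{
    if (--ts->rr_count <= 0) {
        ts->rr_count = TICKS_PER_RR;
        ts->user_resched = true;
    }
}
```

Note that, exactly as in the real code, this function only *flags* the reschedule; nothing is switched from interrupt context.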
The key thing to note with the fifth mechanism is that it will instantly interrupt and switch away from the currently running thread on the given cpu if that thread is running in userland, but it then leaves it up to the target thread (now LWKT-scheduled and running) to determine, while still in the kernel, whether matters should remain that way or not.

:Another question: it is stated that only one user process at a time is on
:the lwkt scheduler queue. The correct statement is: only one user process
:that is running in user-space; there may also be other user processes that
:are running in kernel-space. Is that right?

Yes, this is correct. A user thread running in kernel space is removed from the user scheduler if it blocks while in kernel space and becomes a pure LWKT thread. Of course, all user threads are also LWKT threads, so what I really mean to say here is that a user thread running in kernel space is no longer subject to the serialization of user threads on that particular cpu. When the user thread tries to return to userland it becomes subject to serialization again.

It is, in fact, possible to run multiple userland threads in userland via the LWKT scheduler instead of just one, but we purposefully avoid doing it because the LWKT scheduler is not really a dynamic scheduler. People would notice severe lag and other issues when cpu-bound and IO-bound processes are mixed if we were to do that.

All threads not subject to the userland scheduler (except for the userland scheduler helper thread) run at a higher LWKT priority than the (one) thread that might be running in userland.

There are two separate current-cpu notifications (also run indirectly for remote cpus via an IPI to the remote cpu). One is called need_lwkt_resched() and applies when a LWKT reschedule might be needed; the other is need_user_resched() and applies when a user LWP reschedule might be needed.
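The "takes over or goes back on the queue" decision described for the fifth mechanism can be sketched as a pure function. This is a hypothetical simplification of what the curproc-acquisition path decides, not the real bsd4_acquire_curproc logic.

```c
/*
 * Sketch of the decision a woken interactive thread makes when its
 * kernel work is done and it tries to return to userland.  As in
 * usched_bsd4, a lower dynamic-priority value is better.
 */
enum uret {
    TAKE_OVER_SLOT,         /* becomes the designated user thread, returns */
    BACK_TO_GLOBAL_QUEUE    /* reschedules via the bsd4 global queue */
};

static enum uret
toy_acquire_curproc(int my_dyn_prio, int designated_dyn_prio)
{
    if (my_dyn_prio < designated_dyn_prio)
        return TAKE_OVER_SLOT;
    return BACK_TO_GLOBAL_QUEUE;
}
```

The point of the comparison is the serialization invariant from the answer above: at most one thread per cpu holds the userland slot, and holding it is decided in the kernel, on the way out.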
Lastly, another reminder: runnable but not-currently-running userland threads are placed in the usched_bsd4 global queue and are not LWKT-scheduled until the usched_bsd4 userland scheduler tells them to run. If you have ten thousand cpu-bound userland threads and four cpus, only four of those threads will be LWKT-scheduled at a time (one on each cpu), and the remaining 9996 threads will be left on the usched_bsd4 global queue.

:When it detects that a reschedule is required it flags a user reschedule.
:But that CPU-bound process will continue running. Who actually interrupts
:it?

The IPI flags the user reschedule and then returns. This returns through several subroutine levels until it gets to the actual interrupt dispatch code. The interrupt dispatch code then returns from the interrupt by calling 'doreti'. See /usr/src/sys/platform/pc64/x86_64/exception.S

The doreti code is in /usr/src/sys/platform/pc64/x86_64/ipl.s. The doreti code is what handles popping the final stuff off the supervisor stack and returning to userland. However, this code checks gd_reqflags before returning to userland. If it detects that a flag has been set it does not return to userland but instead pushes a trap context and calls the trap() function with T_ASTFLT (see line 298 of ipl.s).

The trap() function is in platform/pc64/x86_64/trap.c. This function's user entry and exit code handles the scheduling issues related to the trap. LWKT reschedule requests are handled simply by calling lwkt_switch(). USER reschedule requests are handled by releasing the user scheduler's current process and then re-acquiring it (which can wind up placing the current process on the user global scheduler queue and blocking if the current process is no longer the highest priority runnable process).

This can be a confusingly long chain of events, but that's basically how it works.
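As a rough illustration of the doreti check described above, here is a toy model in C of the branch it takes. The flag names and the function are hypothetical; the real check is in assembly in ipl.s against the per-cpu gd_reqflags field.

```c
#include <stdbool.h>

/* Illustrative flag bits; stand-ins for the real gd_reqflags AST bits. */
#define TOY_AST_LWKT_RESCHED  0x01u   /* set by need_lwkt_resched() */
#define TOY_AST_USER_RESCHED  0x02u   /* set by need_user_resched() */

enum doreti_path {
    RETURN_TO_USERLAND,   /* pop the frame and iretq to user mode */
    AST_TRAP              /* push a trap context, call trap() with T_ASTFLT */
};

/*
 * Toy model of the decision doreti makes: pending request flags only
 * force the AST path when we are about to cross back into userland.
 */
static enum doreti_path
toy_doreti(unsigned reqflags, bool returning_to_userland)
{
    if (returning_to_userland && reqflags != 0)
        return AST_TRAP;
    return RETURN_TO_USERLAND;
}
```

This mirrors the point made below: user-scheduling events are handled only at the userland/kernel boundary, so a return to kernel context ignores the flags.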
We only want to handle user-scheduling-related events on the boundary between userland and kernel-land, and not in the middle of some random kernel function executing on behalf of userland.

:I wasn't able to figure out what happens exactly in this case:
:- when a thread has its time quantum expired, it will be flagged for a
:reschedule (flagged only, it doesn't receive any IPI). Will that thread
:remain on the lwkt queue of the CPU it was running on? Or will it end up
:on the userland scheduler queue again, to be subject to a new scheduling
:decision?

The time quantum is driven by the timer interrupt, which ultimately calls bsd4_schedulerclock() on every cpu. At the time this function is called the system is, of course, running in the kernel. That is, the timer interrupt interrupted the user program. So if this function flags a reschedule, the flag will be picked up when the timer interrupt tries to return to userland via doreti... it will detect the flag and, instead of returning to userland, it will generate an AST trap to the trap() function. The standard userenter/userexit code run by the trap() function handles the rest.

:This question doesn't include the case when the thread is blocking
:(waiting for I/O or something else) - in this case the thread will block
:in the kernel and, when it wants to return to userland, will need to
:reacquire the cpu it was running on in userland.
:
:I put this question because in FreeBSD's ULE they are always checking
:for a balanced topology of process execution. If they detect an
:imbalance they migrate threads accordingly. In our case, this would be
:induced by the fact that processes will end up, at some moment in time,
:on the userland queue and would be subject to rescheduling (here the
:heuristics will take care not to imbalance the topology).
In our case, if the currently running user process loses its current cpu due to the time quantum running out, coupled with the fact that other user processes want to run at a better or the same priority, then the currently running user process will wind up back on the bsd4 global queue. If other cpus are idle or running lower priority processes, then the process losing the current cpu will wind up being immediately rescheduled on one of the other cpus.

:Another issue is regarding the lwkt scheduler. It is somehow a static
:scheduler: a thread from one cpu can't migrate to another. Do we intend to
:leave it that way and implement all our SMT heuristics in the userland
:scheduler, or do you have some ideas where we would gain some benefits in
:the SMT cases?

A LWKT thread cannot migrate preemptively (this would be similar to 'pinning' on FreeBSD, except that in DragonFly kernel threads are always pinned). A thread can be explicitly migrated. Threads subject to the userland scheduler can migrate between cpus by virtue of being placed back on the bsd4 global queue (and descheduled from LWKT). Such threads can be pulled off the bsd4 global queue by the bsd4 userland scheduler from any cpu.

In DragonFly kernel threads are typically dedicated to particular cpus but are also typically replicated across multiple cpus. Work is pre-partitioned and sent to particular threads, which removes most of the locking requirements. In FreeBSD kernel threads run on whatever cpus are available, but try to localize heuristically to maintain cache locality. Kernel threads there typically pick work off of global queues and tend to need heavier mutex use. So, e.g. in a DragonFly 'ps ax' you will see several threads which are per-cpu, like the crypto threads, the syncer threads, the softclock threads, the network protocol threads (netisr threads), and so forth.

Now there are issues with both ways of doing things. In DragonFly we have problems when a kernel thread needs a lot of cpu...
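The migration path for bsd4-managed threads boils down to the shared global queue: any cpu's scheduler can pull the best runnable LWP off it, which is how a thread that lost its quantum on one cpu lands on another. A minimal sketch, with purely illustrative types (`lwp_ent`, `toy_global_queue_pick`):

```c
#include <stddef.h>

struct lwp_ent {
    int dyn_prio;   /* lower = better, as in usched_bsd4 */
    int last_cpu;   /* cpu it last ran on (unused here, kept for context) */
};

/*
 * Toy model: any cpu calls this to pick the best runnable LWP off the
 * shared global queue.  Returns the index of the best entry, or -1 if
 * the queue is empty.  The real queue is bucketed by priority and
 * protected by a spinlock; a linear scan keeps the sketch simple.
 */
static int
toy_global_queue_pick(const struct lwp_ent *q, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (best < 0 || q[i].dyn_prio < q[best].dyn_prio)
            best = (int)i;
    }
    return best;
}
```

Because the queue is global, the picked LWP may resume on a different cpu than `last_cpu`; this is exactly the (only) migration mechanism for threads under the userland scheduler, while pinned kernel threads never pass through it.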
for example, if a crypto thread needs a ton of cpu it can starve out other kernel threads running on that particular cpu. Threads with a user context will eventually migrate off the cpu with the cpu-hungry kernel thread, but the mechanism doesn't work as well as it could. So, in DragonFly, we might have to revamp the LWKT scheduler somewhat to handle these cases and allow kernel threads to un-pin when doing certain cpu-intensive operations. Most kernel threads don't have this issue; only certain threads (like the crypto threads) are heavy cpu users.

I am *very* leery of using any sort of global scheduler queue for LWKT. Very, very leery. It just doesn't scale well to many-cpu systems. For example, on monster.dragonflybsd.org one GigE interface can vector 8 interrupts to 8 different cpus with a rate limiter on each one... but at, say, 10,000hz x 8 that's already 80,000 interrupts/sec globally. That's a drop in the bucket when the schedulers are per-cpu, but it starts to hit bottlenecks when the schedulers have a global queue.

In FreeBSD they have numerous issues with preemptive cpu switching of threads running in the kernel due to the mutex model, even with the fancy priority inheritance features they've added. They also have to explicitly pin a thread in order to access per-cpu globaldata, or depend on an atomic access. And FreeBSD depends on mutexes in critical paths while DragonFly only needs critical sections in similar paths, due to its better pre-partitioning of work. DragonFly has better cpu localization but FreeBSD has better load management. However, both OSs tend to run into problems with interactivity from edge cases under high cpu loads.

Enjoy reading :),

Mihai Carabas