Recent MMU, Cached I/O, and Scheduler work now in master

One of our wonderful GSoC projects this summer resulted in Mihai Carabas adding a cpu topology awareness framework to DragonFly and doing work on the scheduler to make it more topology aware. This started several of us on a performance benchmarking binge, particularly Francois Tigeot, running postgres/pgbench tests on a 12x2 (24-thread) Xeon box, and me, running tests on a smaller 4x2 (8-thread) Xeon box and our larger 48-core Opteron box.

In the last month the master branch has gone through some radical changes. All the work is in, but some of it is still experimental and requires a sysctl to turn on.

* PMAP MMU optimizations for 64-bit systems.
We noticed that when postgres servers are used with very large shared memory areas, either with SYSV SHM or MMAP, each postgres server process (which fork instead of thread) has to fault-in tens of thousands of pages. When you multiply by a potentially large number of postgres server processes this turns into millions of faults. In addition, each process is maintaining its own complete copy of the page table.

This optimization works for SYSV SHM as well as any large shared or read-only mmap of anonymous or file-backed data. The optimization causes the actual page table pages themselves to be cached in the backing VM object (thus not subject to destruction when processes using the mappings fork() or exit), and the individual MMU maps for each process actually share the page tables by mapping shared page table pages.

This removes nearly ALL page faults from a warmed-up postgres server, even if there are hundreds of postgres server processes forked and even when it does fresh fork()s. In addition, most of the page tables for these processes are now shared (even though they were forked and not threaded), thus making far better use of cpu memory caches.

sysctl machdep.pmap_mmu_optimize=1 (still experimental)

* Read shortcut through the VM system (integrated w/HAMMER for now).
This doubles the performance of read() system calls from the cache in cases which would otherwise cause the buffer cache to cycle (when the VM page cache is big enough to cache the data set but the buffer cache is not). In this situation the cycling of the buffer cache causes a large number of SMP MMU invalidations due to the constant adjusting of VM page mappings in kernel memory.

With this shortcut, cached file data read with read() is copied out using the DMAP instead of the buffer cache, not only improving read() performance but also significantly improving all activities on multi-core systems due to the reduced kernel page smashing.

sysctl vm.read_shortcut_enable=1 (still experimental)

* Scheduler rewrite.
Mihai Carabas made large strides in scheduler performance on larger servers with his cpu topology awareness framework and his work on our user thread scheduler. However, there were still significant limitations in the scheduler due to its original design. The original scheduler was essentially single-threaded, using a global spinlock to protect a single global scheduling run queue. This led to a number of SMP-related bottlenecks in the scheduler and also complicated the algorithms.

I have now completed a rewrite of the scheduler that incorporates Mihai's cpu topology infrastructure and rewrites the algorithms to utilize the new scheduler framework. The new scheduler utilizes per-cpu queues and fine-grained per-cpu spinlocks. There are no global spinlocks, removing that bottleneck.

The new scheduler rewrites the cpu topology algorithms to implement a top-down (whole-machine -> socket -> core -> hyperthread) scheduling implementation, performing three major algorithmic actions:

(1) It generates a load factor at all levels and load-balances the assignment of processes to cpus in a topology-aware framework. This means that if you have 4 processes running in a 4x2 (8-thread) environment, they will be scheduled to cores and not to competing hyperthreads.
If you have two cpu sockets and two processes, one will be scheduled to each socket to make best use of their caches.

(2) It will try to avoid migrating processes when possible, and when not possible it will try to keep them nearby from a topological standpoint.

(3) It will detect process block/wakeup events which e.g. tie two processes together, and will try to move the process pairs closer to each other using that information. For example, if you have many postgres clients and servers on a large server, enough to load down all cores, the client and server pairs will be localized to the same socket, thus making use of chip caches to facilitate communications between the two processes.

(the scheduler changes are now the default in master)

* Finally, default values for many things on 64-bit machines have been adjusted upward significantly to make proper use of available resources. There were numerous caps that had been inherited from the 32-bit code that are now gone or greatly raised, particularly with regard to SYSV shared memory and the buffer cache.

The result is an IMMENSE improvement in postgres benchmarks as well as across-the-board improvements in performance under load. We pretty much outstrip the other BSDs now and we get fairly close to (though do not quite beat) the higher-end Linux benchmarks.

In addition, the new scheduler algorithms affect many other system activities, such as source code builds (which make heavy use of pipes), web servers, and even interactive vs batch processing.

Francois will post updated graphs today or tomorrow showing the immense progress we've made.

-Matt
Matthew Dillon <dillon at backplane.com>
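
P.S. For anyone who wants to picture the top-down placement described in item (1) of the scheduler section, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the kernel code (which is C): the names Node, build and place are hypothetical, and the "load" here is just a count of running processes, whereas the real scheduler computes its own load factor.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One level of a hypothetical topology tree (machine, socket, core, thread)."""
    name: str
    children: list = field(default_factory=list)
    running: int = 0  # only used on leaves (hardware threads)

    def load(self):
        # Assumption: load is a plain process count, summed up the tree.
        if not self.children:
            return self.running
        return sum(child.load() for child in self.children)


def build(sockets, cores, threads):
    """Build a machine -> socket -> core -> hyperthread tree."""
    machine = Node("machine")
    for s in range(sockets):
        sock = Node(f"s{s}")
        for c in range(cores):
            core = Node(f"s{s}c{c}")
            for t in range(threads):
                core.children.append(Node(f"s{s}c{c}t{t}"))
            sock.children.append(core)
        machine.children.append(sock)
    return machine


def place(machine):
    """Walk top-down, taking the least loaded child at each level."""
    node = machine
    while node.children:
        # min() returns the first of equally loaded children.
        node = min(node.children, key=lambda child: child.load())
    node.running += 1
    return node.name


# 4 processes on a 4x2 (8-thread) machine end up on 4 different cores.
one_socket = build(sockets=1, cores=4, threads=2)
placements = [place(one_socket) for _ in range(4)]

# 2 processes on a two-socket machine end up on different sockets.
two_sockets = build(sockets=2, cores=2, threads=2)
pair = [place(two_sockets), place(two_sockets)]
```

Because the load is summed at every level, a core with one busy hyperthread looks busier than an idle core, so the second hyperthread of a core is only used once every core already has work. The sketch ignores items (2) and (3), which depend on migration cost and block/wakeup history.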