[Mesa-dev] [PATCH 00/13] Threaded Gallium for RadeonSI

Hi,

This series adds an optional module to gallium/util that wraps pipe_context and moves execution of all pipe_context calls into a separate thread. It puts a lot of new requirements on the driver, especially on the thread safety of pipe_context functions, and even expects different behavior from pipe_context in some cases, so it may be non-trivial to enable. All of it is necessary to have perfectly scalable threaded execution. (Any new drivers should be built around it from the beginning.)

The performance improvement isn't very high (it only hides the overhead of pipe_context), but I have tested a lot of apps with this, and it really doesn't sync the thread in the majority of apps except for SwapBuffers.

It can do these:
- unsynchronized buffer mappings don't sync
- ordinary buffer mappings are promoted to unsynchronized when it's safe
- full buffer invalidations are implemented as reallocations and don't sync
- partial buffer invalidations are implemented as copy_buffer and don't sync
- get_query_result doesn't sync when the threaded context has seen flush() (i.e. get_query_result is contextless in that case)

Missing:
- deferred fences - mainly Bioshock Infinite might benefit
- texture mappings (meaning CPU access) always sync; texture_subdata avoids the sync for small uploads only, but we can make all texture uploads asynchronous by simply copying what is done for buffers

Note that it has very low overhead when it's always synchronous (i.e. not multithreaded), because it's really fast to enqueue and execute calls. The worst-case scenario might be -3% performance (just guessing here).

All requirements on Gallium drivers and other information can be found in the header file:
https://cgit.freedesktop.org/~mareko/mesa/tree/src/gallium/auxiliary/util/u_threaded_context.h?h=gallium-threaded2#n26

RadeonSI enables threaded Gallium by default for OpenGL Core and Compatibility profiles and all OpenGL ES variants.
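
For readers who want a picture of the enqueue-and-execute idea described above, here is a minimal sketch in C with POSIX threads. It is only an illustration under stated assumptions: every name in it (threaded_ctx, tc_enqueue, tc_sync, fake_draw, tc_demo) is hypothetical and none of it is taken from u_threaded_context.h. The real module records pipe_context calls; the sketch records a function pointer plus one integer argument and uses a "driver" that just adds numbers.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 64
typedef void (*call_fn)(void *driver_ctx, int arg);
struct call { call_fn fn; int arg; };

struct threaded_ctx {                 /* hypothetical wrapper, not Mesa's */
    pthread_mutex_t lock;
    pthread_cond_t changed;           /* broadcast on every state change */
    struct call queue[QUEUE_SLOTS];   /* ring buffer of recorded calls */
    size_t head, count;
    bool busy, quit;                  /* busy: worker is inside a call */
    void *driver_ctx;
    pthread_t worker;
};

static void *worker_main(void *data)
{
    struct threaded_ctx *tc = data;
    pthread_mutex_lock(&tc->lock);
    for (;;) {
        while (tc->count == 0 && !tc->quit)
            pthread_cond_wait(&tc->changed, &tc->lock);
        if (tc->count == 0)
            break;                    /* quit requested and queue drained */
        struct call c = tc->queue[tc->head];
        tc->head = (tc->head + 1) % QUEUE_SLOTS;
        tc->count--;
        tc->busy = true;
        pthread_cond_broadcast(&tc->changed);   /* a slot became free */
        pthread_mutex_unlock(&tc->lock);
        c.fn(tc->driver_ctx, c.arg);  /* run the driver call unlocked */
        pthread_mutex_lock(&tc->lock);
        tc->busy = false;
        pthread_cond_broadcast(&tc->changed);   /* wake tc_sync waiters */
    }
    pthread_mutex_unlock(&tc->lock);
    return NULL;
}

static int tc_init(struct threaded_ctx *tc, void *driver_ctx)
{
    pthread_mutex_init(&tc->lock, NULL);
    pthread_cond_init(&tc->changed, NULL);
    tc->head = tc->count = 0;
    tc->busy = tc->quit = false;
    tc->driver_ctx = driver_ctx;
    return pthread_create(&tc->worker, NULL, worker_main, tc);
}

/* Record a call; blocks only when the ring buffer is full. */
static void tc_enqueue(struct threaded_ctx *tc, call_fn fn, int arg)
{
    assert(fn != NULL);
    pthread_mutex_lock(&tc->lock);
    while (tc->count == QUEUE_SLOTS)
        pthread_cond_wait(&tc->changed, &tc->lock);
    tc->queue[(tc->head + tc->count) % QUEUE_SLOTS] = (struct call){fn, arg};
    tc->count++;
    pthread_cond_broadcast(&tc->changed);
    pthread_mutex_unlock(&tc->lock);
}

/* Wait until every recorded call has finished executing. */
static void tc_sync(struct threaded_ctx *tc)
{
    pthread_mutex_lock(&tc->lock);
    while (tc->count != 0 || tc->busy)
        pthread_cond_wait(&tc->changed, &tc->lock);
    pthread_mutex_unlock(&tc->lock);
}

static void tc_destroy(struct threaded_ctx *tc)
{
    pthread_mutex_lock(&tc->lock);
    tc->quit = true;
    pthread_cond_broadcast(&tc->changed);
    pthread_mutex_unlock(&tc->lock);
    pthread_join(tc->worker, NULL);
    pthread_cond_destroy(&tc->changed);
    pthread_mutex_destroy(&tc->lock);
}

/* Stand-in for a driver: each "draw" just accumulates its argument. */
static void fake_draw(void *driver_ctx, int arg) { *(long *)driver_ctx += arg; }

static long tc_demo(int n)
{
    long sum = 0;
    struct threaded_ctx tc;
    if (tc_init(&tc, &sum) != 0)
        return -1;
    for (int i = 1; i <= n; i++)
        tc_enqueue(&tc, fake_draw, i);    /* recorded now, executed by worker */
    tc_sync(&tc);                         /* wait for the worker to catch up */
    long result = sum;
    tc_destroy(&tc);
    return result;
}
```

In this sketch, tc_demo(1000) returns 500500 (the sum 1 + 2 + ... + 1000): calls execute in submission order on the worker, and the application thread waits only in tc_sync or when the ring buffer is full. That is the sense in which the list above talks about operations that "don't sync": they can be recorded without waiting for the worker to drain the queue.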

There is a small performance concern for RadeonSI: If non-contiguous VRAM mappings are not supported (amdgpu - kernel 4.11 and older, radeon - all kernels), the performance difference might be negative, because buffer invalidations are done unconditionally, meaning that there can be more live and mapped VRAM buffers. It's difficult to tell whether any real apps are affected in a measurable way.

Here are performance numbers:

APPS: MORE IS BETTER
Alien Isolation: +16%
Bioshock Infinite: +13%
Borderlands 2: +12%
Civilization 5: +12%
Civilization 6: +10%
CS:GO: +8%
ET Legacy: +12%
Openarena: +27%
Talos Principle (high details, 1680x1050 internal resolution): +17%
glmark2: no change in the final score
When games are GPU-bound: no change

Because it doesn't take advantage of deferred fences, Bioshock runs asynchronously 80% of the time and synchronously 20% of the time. All other games run asynchronously 100% of the time.

x11perf: MORE IS BETTER
x11perf: Test: 500px PutImage Square: -3%
x11perf: Test: Scrolling 500 x 500 px: +16%
x11perf: Test: Char in 80-char aa line: +13%
x11perf: Test: PutImage XY 500x500 Square: +1%
x11perf: Test: Fill 300 x 300px AA Trapezoid: NO CHANGE
x11perf: Test: 500px Copy From Window To Window: +14%
x11perf: Test: Copy 500x500 From Pixmap To Pixmap: -1%
x11perf: Test: 500px Compositing From Pixmap To Window: +21%
x11perf: Test: 500px Compositing From Window To Window: +18%

gtkperf: LESS IS BETTER
gtkperf: GTK Widget: Total Time: -2%
gtkperf: GTK Widget: GtkComboBox: +7%
gtkperf: GTK Widget: GtkCheckButton: -15%
gtkperf: GTK Widget: GtkRadioButton: -13%
gtkperf: GTK Widget: GtkToggleButton: -2%
gtkperf: GTK Widget: GtkComboBoxEntry: -1%
gtkperf: GTK Widget: GtkTextView - Scroll: NO CHANGE
gtkperf: GTK Widget: GtkTextView - Add Text: NO CHANGE
gtkperf: GTK Widget: GtkDrawingArea - Circles: -9%
gtkperf: GTK Widget: GtkDrawingArea - Pixbufs: -3%

Hence the decision to enable it by default.

Please review.

Marek