This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

malloc per-thread cache: benchmarks

From: DJ Delorie <dj at redhat dot com>

To: libc-alpha at sourceware dot org

Date: Tue, 24 Jan 2017 16:10:53 -0500

Subject: malloc per-thread cache: benchmarks

Authentication-results: sourceware.org; auth=none

The purpose of this email is to summarize my findings while benchmarking the new per-thread cache (TCache) optimization to glibc's malloc code. Please respond to this email with comments on performance; a future email will contain the patch and start patch-related conversations.

Executive summary: A per-thread cache reduces malloc call time to about 83% of pristine, at a cost of about 1% more RSS[*]. TCache doesn't replace fastbins; the combined performance is better than either one alone. Performance testing was done on an x86_64 system, as most of my workloads won't fit on a 32-bit system :-)

[*] but 0.5% *less* RSS if you ignore the synthetic benchmarks :-)

----------------------------------------

The rest of the mail is the raw data. It's a bit "busy" number-wise and best viewed with a fixed font and a wide screen because of all the tables. The benchmarks were done by running workloads captured from various real-world apps, plus some synthetic benchmarks, using the tools and simulator in the dj/malloc branch.

These first three charts show the breakdown of how many cycles each API call used, for both a pristine glibc build and a build with TCache enabled. "Total" is the total number of cycles used by the entire test; the other columns are mean cycles per API call. The RSS column indicates memory used. Note that the increase in mean calloc/realloc times is due to some overhead (which needs to be done anyway) being moved from malloc into those calls; since malloc is called far more often, this is a net win.
-------------------- Pristine --------------------
Workload                        Total  malloc  calloc  realloc  free        RSS
389-ds-2              182,211,121,679     165     707      239   123  1,012,798
dj                      1,237,241,201     443   1,796        0   173     28,896
dj2                    10,886,657,226     183     482      223   110     40,480
mt_test_one_alloc       8,737,515,199  15,593  55,927        0   943  1,820,204
oocalc                  2,051,407,045     179     392      718   118    153,160
qemu-virtio             1,172,123,759     343     573      558   226    858,240
qemu-win7                 892,111,785     401     676      658   211    690,810
customer-1            105,139,327,336     198   2,036      401   131  2,800,478
customer-2              3,328,843,534     186   2,298      559   147     99,600

-------------------- TCache --------------------
Workload                        Total  malloc  calloc  realloc  free        RSS
389-ds-2              165,443,672,318      90     712      204   108  1,013,340
dj                        858,955,060     310   1,818        0   118     27,509
dj2                     9,143,739,161     138     508      230    94     40,338
mt_test_one_alloc       7,469,894,433  13,292  54,600        0   841  2,096,968
oocalc                  1,428,492,279     120     411      778    85    153,038
qemu-virtio             1,053,619,586     296     608      518   208    859,659
qemu-win7                 809,406,757     331     701      630   199    694,089
customer-1             88,805,361,692     153   2,187      407   124  2,807,520
customer-2              2,641,419,852     132   2,687      688   131     97,196

-------------------- Change --------------------
Workload           Total  malloc  calloc  realloc  free   RSS
389-ds-2             91%     55%    101%      85%   88%  100%
dj                   69%     70%    101%      68%   95%   95%
dj2                  84%     75%    105%     103%   85%  100%
mt_test_one_alloc    85%     85%     98%      89%  115%  115%
oocalc               70%     67%    105%     108%   72%  100%
qemu-virtio          90%     86%    106%      93%   92%  100%
qemu-win7            91%     83%    104%      96%   94%  100%
customer-1           84%     77%    107%     101%   95%  100%
customer-2           79%     71%    117%     123%   89%   98%
Mean:                83%     74%    105%     101%   86%  101%

This chart shows what effects fastbins and tcache have independently, and with both combined. Things to note: the contribution from fastbins (the FB+ and TC:FB+ columns) shows that fastbins continue to contribute to performance even with tcache enabled, although their contribution is smaller with tcache. Likewise, the TC+ and FB:TC+ columns show that adding tcache is a win, with or without fastbins.

Neither = both fastbins and tcache are disabled.
FB+    = relative time when fastbins are added to "neither"
TC+    = relative time when tcache is added to "neither"
FB:TC+ = time change from "fastbins only" to "fastbins+tcache" (i.e. when TC is added to FB)
TC:FB+ = likewise, when fastbins are added to tcache
TC/FB  = ratio of "how fast with just tcache" to "how fast with just fastbins"

Test                       Neither         Fastbins           TCache             Both  FB+  TC+  FB:TC+  TC:FB+  TC/FB
389-ds-2           191,664,220,060  182,211,121,679  172,441,402,072  165,443,672,318  95%  90%     91%     96%    95%
dj                   1,716,034,269    1,237,241,201      833,276,021      858,955,060  72%  49%     69%    103%    67%
dj2                 11,852,642,821   10,886,657,226    9,470,287,095    9,143,739,161  92%  80%     84%     97%    87%
mt_test_one_alloc    8,776,157,170    8,737,515,199    7,359,052,251    7,469,894,433 100%  84%     85%    102%    84%
oocalc               2,343,558,811    2,051,407,045    1,455,081,145    1,428,492,279  88%  62%     70%     98%    71%
qemu-virtio          1,354,220,960    1,172,123,759    1,129,676,630    1,053,619,586  87%  83%     90%     93%    96%
qemu-win7              950,748,214      892,111,785      811,040,794      809,406,757  94%  85%     91%    100%    91%
customer-1         120,712,604,936  105,139,327,336  112,007,827,423   88,805,361,692  87%  93%     84%     79%   107%
customer-2           3,725,017,314    3,328,843,534    2,994,818,396    2,641,419,852  89%  80%     79%     88%    90%
Mean:                                                                                  89%  78%     83%     95%    88%

This last chart shows the relative RSS overhead for the various algorithms. The OH: columns are the overhead (actual RSS minus ideal RSS) for each configuration, with the + columns showing each optimization's overhead relative to the "neither" value. The last column (OH%) is the "neither" RSS as a percentage of ideal. Note that the synthetic benchmarks (dj, dj2, and mt_test_one_alloc) have unusual results because they're designed to create worst-case scenarios.
Workload               Ideal    Neither   Fastbins     TCache       Both       OH:N      OH:FB      OH:TC       OH:B     FB+     TC+      B+       OH%
389-ds-2             735,268  1,013,122  1,012,798  1,013,277  1,013,340    277,854    277,530    278,009    278,072   99.9%  100.1%  100.1%    137.8%
dj                        53     33,577     28,896     30,238     27,509     33,524     28,843     30,185     27,456   86.0%   90.0%   81.9%  63352.8%
dj2                   17,023     40,853     40,480     40,582     40,338     23,830     23,457     23,559     23,315   98.4%   98.9%   97.8%    240.0%
mt_test_one_alloc     90,533  1,839,588  1,820,204  2,176,172  2,096,968  1,749,055  1,729,671  2,085,639  2,006,435   98.9%  119.2%  114.7%   2032.0%
oocalc                90,151    153,543    153,160    152,909    153,038     63,392     63,009     62,758     62,887   99.4%   99.0%   99.2%    170.3%
qemu-virtio          697,211    855,511    858,240    860,147    859,659    158,300    161,029    162,936    162,448  101.7%  102.9%  102.6%    122.7%
qemu-win7            634,275    689,756    690,810    691,965    694,089     55,481     56,535     57,690     59,814  101.9%  104.0%  107.8%    108.7%
customer-1         2,510,785  2,803,894  2,800,478  2,806,886  2,807,520    293,109    289,693    296,101    296,735   98.8%  101.0%  101.2%    111.7%
customer-2            75,579     99,889     99,600     98,026     97,196     24,310     24,021     22,447     21,617   98.8%   92.3%   88.9%    132.2%
Mean:                                                                                                                  98.2%  100.8%   99.4%   7378.7%