This is a technical piece that most readers can safely skip.

Apple throws resources at eye candy frippery in the OS, while leaving critical areas in serious “AWOL reliability” territory. More Apple Core Rot.

Your author spent about 14 hours tracking down an OS X performance bug while testing very high server loads: 48 client threads from two machines (12 cores total) over local gigabit LAN against a highly optimized Tomcat web server. The test scenario involved 5000 to 15,000 client hits per second against the server, reaching up to 87 MB/sec while delivering ~2K to ~40K HTML files to the client machines.

In a nutshell, the OS X networking stack enters a pathological performance state which essentially shuts down all networking for ~30 seconds at a time (“AWOL ~30 seconds”). That is, with the default networking buffer size (ncl=131072 appears to be the default; at 2KB per cluster, that is 256MB of memory). The performance bug was reproduced with the server running on an 8-core 3.3 GHz Mac Pro, a 2-core MacBook Pro, a 4-core MacBook Pro, and a 4-core MacBook Pro Retina (16GB in the laptops, 64GB in the Mac Pro; total memory is not relevant, ample to spare). Observed on both OS X 10.10.1 and 10.8.5, so it is not a new bug.
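A back-of-the-envelope sketch of that arithmetic (assuming, as the netstat output suggests, that ncl counts 2KB mbuf clusters):

```python
# ncl is (apparently) a count of 2KB mbuf clusters, so the default and
# doubled boot-arg settings work out to 256MB and 512MB of buffer memory.
CLUSTER_BYTES = 2048  # 2KB per mbuf cluster

def ncl_to_mb(ncl: int) -> int:
    """Total cluster-pool memory in MB for a given ncl boot-arg value."""
    return ncl * CLUSTER_BYTES // (1024 * 1024)

print(ncl_to_mb(131072))  # 256  (the apparent default)
print(ncl_to_mb(262144))  # 512  (the doubled setting discussed below)
```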

When the system locks up its networking stack, netstat -m shows something like this (100% in use was also seen):

diglloydMP:MPG lloyd$ netstat -m
24615/24615 mbufs in use:
	24565 mbufs allocated to data
	50 mbufs allocated to socket names and addresses
712/712 mbuf 2KB clusters in use
19884/19884 mbuf 4KB clusters in use
2730/2730 mbuf 16KB clusters in use
131754 KB allocated to network (99.8% in use)
0 KB returned to the system
0 requests for memory denied
1038 requests for memory delayed
226 calls to drain routines
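The telltale figure is the “% in use” on the network allocation line. As an illustration only (this helper is hypothetical, not part of OS X or any monitoring tool), the summary line is simple to parse if you want to watch for pool exhaustion:

```python
import re

def cluster_utilization(netstat_output: str):
    """Return (allocated_kb, percent_in_use) parsed from `netstat -m` text.

    Hypothetical helper for illustration: extracts the summary line
    '<N> KB allocated to network (<P>% in use)'.
    """
    m = re.search(r"(\d+) KB allocated to network \(([\d.]+)% in use\)",
                  netstat_output)
    if not m:
        raise ValueError("netstat summary line not found")
    return int(m.group(1)), float(m.group(2))

sample = "131754 KB allocated to network (99.8% in use)"
kb, pct = cluster_utilization(sample)
print(kb, pct)  # 131754 99.8
if pct > 99.0:
    print("warning: mbuf cluster pool nearly exhausted; stalls likely")
```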

After ruling out many other causes and tearing out much hair, it became clear that the problem was in the OS itself. Much experimentation showed that increasing the networking buffer memory to 512MB (ncl=262144) resolved the issue, at least with 48 client threads over local gigabit LAN hitting the server from 12 cores on 2 client machines.

Doubling the memory for the networking buffers almost entirely (but not quite) solves the problem:

sudo nvram boot-args="ncl=262144" (reboot required)

Note that ncl is a maximum: the system allocates memory dynamically as needed up to that maximum, so netstat -mm will show much smaller memory usage until a load is applied. Attempting to use 384K buffers hosed the networking stack entirely; ncl=262144 might be the hard limit.

With the larger buffers in place, the system was able to handle the test load, but attempting to allocate still more buffer space makes the networking stack fail entirely (dead). In short, OS X can barely handle gigabit ethernet speeds with a high volume of relatively small requests (4K to 40K typical). A toy OS for serious use. This explains some head-scratchers MPG has seen in the past: a fundamentally broken OS X networking stack that goes AWOL for ~30 seconds at a time if the load is too high.

With ncl=262144 (256K buffers × 2K per buffer = 512MB memory) and 48 client threads over local gigabit LAN, 99.6% buffer utilization was seen, with no AWOL networking stack. The figures shown below are not the highest utilization observed, but are close.

netstat -mm
class        buf   active   ctotal    total  cache  cached  uncached    memory
name        size     bufs     bufs     bufs  state    bufs      bufs     usage
----------  ----  -------  -------  -------  -----  ------  --------  --------
mbuf         256    83190    14688    86000     on     345      2465    3.6 MB
cl          2048    19213      609    19822  purge       0       609    1.2 MB
bigcl       4096    52099        0    52099  purge       0         0         0
16kcl      16384    10922        0    10922     on       0         0         0
mbuf_cl     2304    19213    19213    19213  purge       0         0   42.2 MB
mbuf_bigcl  4352    52099    52099    52099  purge       0         0  216.2 MB
mbuf_16kcl 16640    10922    10922    10922     on       0         0  173.3 MB

17654/83190 mbufs in use:
	17307 mbufs allocated to data
	347 mbufs allocated to packet headers
	65536 mbufs allocated to caches
19213/19822 mbuf 2KB clusters in use
52099/52099 mbuf 4KB clusters in use
10922/10922 mbuf 16KB clusters in use
447022 KB allocated to network (99.6% in use)
0 KB returned to the system
0 requests for memory denied
0 requests for memory delayed
4 calls to drain routines
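As a cross-check (an illustration, not an official formula), the per-class “memory usage” column above should roughly sum to the “KB allocated to network” total reported at the bottom:

```python
# Per-class memory usage figures copied from the netstat -mm output above.
per_class_mb = {
    "mbuf": 3.6, "cl": 1.2, "bigcl": 0.0, "16kcl": 0.0,
    "mbuf_cl": 42.2, "mbuf_bigcl": 216.2, "mbuf_16kcl": 173.3,
}

total_mb = sum(per_class_mb.values())  # sum of the memory usage column
reported_mb = 447022 / 1024            # "447022 KB allocated to network"

print(round(total_mb, 1), round(reported_mb, 1))
assert abs(total_mb - reported_mb) < 1.0  # agrees to within rounding
```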

Z C writes:

About your article, and what things are shaping into, it seems lessons have not been learned. I worked for over 25 years in mainframe datacenters. When IBM introduced z/OS, replacing MVS and consolidating the move to 64-bit architecture, they too came out with very frequent upgrades to their OS. Many BIG PROBLEMS emerged in enterprises and companies that invest millions of $$$ in IT, and we, the tech systems guys, were struggling with stupid bugs and serious performance/workload issues. The icing on the cake came when we upgraded our CPU and it came with microcode so advanced... that it did not support our current OS version (about 2 releases, 1.5 years behind the then-latest version)… What was to be a simple 16-hour weekend intervention turned into a nonstop 72-hour party, with weeks of aftermath…

This was the beginning of the end for me in the IT business... It seems that the need to push people to buy new HW drives this? I understand that maintaining legacy products is expensive, and that change is good. But abandoning CD/DVD drives is one thing; forcing changes in the OS so as to sell new HW is another. I think everybody who is or has been in the IT world is now worried about what Apple will do next with OS X.

MPG: history repeats itself in core issues. OS X Yosemite is not exactly “Vista”, but maybe we’re headed that way.