Aus Wiki-WebPerfect

Receive side scaling (RSS) is a network driver technology that enables the efficient distribution of network receive processing across multiple CPUs in multiprocessor systems.

Note Because hyper-threaded CPUs on the same core processor share the same execution engine, the effect is not the same as having multiple core processors. For this reason, RSS does not use hyper-threaded processors.

A physical host can have Hyper-V enabled and still use RSS. VMQ is only enabled on a NIC when there is a vSwitch attached to it. If you have two NICs in a server it is perfectly acceptable to have the NICs unteamed, one attached to a vSwitch and the other used for host networking traffic. When using this configuration, one NIC will use VMQ and the other will use RSS.

VMQ leverages the same hardware queues on the NIC as RSS, the ones that interrupt different cores. However with VMQ the filters and logic associated with queue distribution of the packets is different. Rather than one physical device utilizing all the queues for its networking traffic, these queues are balanced among the host and VMs on the system.





Important / to consider

Do not use Core 0! (Because of the Parent Partition (Management OS),...)

Do not use Hyperthreaded Cores!

NUMA-Distance should be 0!

To fully utilize the CPUs, the number of RSS Receive Queues should be equal to or greater than Max Processors!





RSS

Receive Side Scaling (RSS) is a very important aspect in networking on Windows. RSS makes sure that incoming network traffic is spread among the available processors in the server for processing. If we do not use RSS, network processing is bound to one processor which will limit approximately at 4GBps. Nowadays every NIC has RSS enabled by default, but the configuration is not optimized. Every NIC is configured with “Base Processor” 0, meaning it will start processing on processor 0 together with the others NICs but more importantly it’s the default processor for Windows processes as well.

To optimally configure RSS we want to start at processor 1 so we don’t interfere with processes landing default on processor 0. On the other hand, we also want to keep the NUMA node assignment that RSS traffic is not using a processor in NUMA node 0 while the NIC is bound to NUMA node 1.



RSS Profiles

Closest: Logical processor numbers near the network adapter’s base RSS processor are preferred. Windows may rebalance logical processors dynamically based on load.

ClosestStatic: No dynamic load balancing, such as distributing but not load balancing at runtime.

NUMA: Assigns RSS processors in a round robin basis across every NUMA node to enable applications that are running on NUMA servers to scale well.

NUMAStatic: Default behavior. RSS processor selection is the same as for NUMA scalability without dynamic load balancing.

Conservative: RSS uses as few processors as possible to sustain the load. This option helps reduce the number of interrupts.







VMQ

With the entrance of virtualization on servers and RSS we hit some limitations because RSS wasn’t built for this “new world”. Long story short, when having a virtual switch on the system where physical adapters are bound to, RSS is disabled. This caused scalability issues that needed to be solved… hello VMQ.

The real benefit of VMQ is realized when it comes time for the vSwitch to do the routing of the packets. When incoming packets are indicated from a queue on the NIC that has a VMQ associated with it, the vSwitch is able the direct hardware link to forward the packet to the right VM very very quickly – by passing the switches routing code. This reduces the CPU cost of the routing functionality and causes a measurable decrease in latency.

Unlike RSS where incoming traffic is spread among processors, VMQ will create a queue for each mac-address on the system and link that queue to a processor core. This means that a mac-address, which translates to a NIC of a VM or a host vNIC does not have more process power than 1 processor core. This limits the network traffic speed in the same way it does when we do not enable RSS to about 4GBps.







VMQ-Settings setzen

# VMQ aktivieren Set - NetAdapterVMQ -Name < NIC - Name > - Enabled # RSS aktivieren Set - NetAdapterRss -Name < NIC - Name > - Enabled # VMQ-Settings setzen Set - NetAdapterVmq -Name < NIC - Name > - BaseProcessorNumber < erster Core , nicht Core 0 > - MaxProcessors < anzahl Cores > - MaxProcessorNumber < letzer Core > - NumaNode < NUMA - Node >

Die Anzahl Queues können ebenfalls mit dem Parameter -NumberOfReceiveQueues gesetzt werden. Hier wird empfohlen die Einstellungen auf Automatisch zu belassen.

Wenn genug Cores pro CPU vorhanden sind wird für eine 1Gig NIC -> 8 Queues empfohlen und für eine 10Gig NIC und höher -> 16 Queues.





RSS-Settings Beispiel (optimal, bei insgesamt 32 logischen Cores)

Wir haben einen Server mit folgender Hardware:

2 CPUs (sprich 2 Sockets) mit je 8 Cores (+ Hyperthreading aktiviert = 16 logische Cores pro NUMA-Node)

2 NICs (in einem Team/Bond)

Number of Receive Queues = 127



Die Optimale Verteilung der Queues wäre folgende:

Blau markiert die physischen Cores.

Es dürfen keine Hyperthreadete Cores verwendet werden!



Die erste NIC verteilen wir auf den ersten NUMA-Node (NUMA-Node 0), aber wir lassen Core 0 aus, sprich der erste Core im NUMA-Node 0)

Die zweite NIC verteilen wir auf den zweiten NUMA-Node (NUMA-Node 1), aber wir lassen Core 0 aus, sprich der erste Core im NUMA-Node 1)

RSS-Settings setzen

Set - NetAdapterRss -Name NIC01 - Enabled Set - NetAdapterRss -Name NIC02 - Enabled Set - NetAdapterRss -Name NIC01 - BaseProcessorNumber 2 - MaxProcessors 7 - MaxProcessorNumber 14 - NumaNode 0 Set - NetAdapterRss -Name NIC02 - BaseProcessorNumber 18 - MaxProcessors 7 - MaxProcessorNumber 30 - NumaNode 1









RSS-Settings Beispiel (optimal, bei insgesamt 56 logischen Cores)

Wir haben einen Server mit folgender Hardware:

2 CPUs (sprich 2 Sockets) mit je 14 Cores (+ Hyperthreading aktiviert = 28 logische Cores pro NUMA-Node)

2 NICs (in einem Team/Bond)

Number of Receive Queues = 127



Die Optimale Verteilung der Queues wäre folgende:

Blau markiert die physischen Cores.

Es dürfen keine Hyperthreadete Cores verwendet werden!



Die erste NIC verteilen wir auf den ersten NUMA-Node (NUMA-Node 0), aber wir lassen Core 0 aus, sprich der erste Core im NUMA-Node 0)

Die zweite NIC verteilen wir auf den zweiten NUMA-Node (NUMA-Node 1), aber wir lassen Core 0 aus, sprich der erste Core im NUMA-Node 1)

RSS-Settings setzen

Set - NetAdapterRss -Name NIC01 - Enabled Set - NetAdapterRss -Name NIC02 - Enabled Set - NetAdapterRss -Name NIC01 - BaseProcessorNumber 2 - MaxProcessors 13 - MaxProcessorNumber 26 - NumaNode 0 Set - NetAdapterRss -Name NIC02 - BaseProcessorNumber 30 - MaxProcessors 13 - MaxProcessorNumber 54 - NumaNode 1







More Informations:

RSS: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/introduction-to-receive-side-scaling

VMQ Best Practise: https://blogs.technet.microsoft.com/networking/2016/01/04/virtual-machine-queue-vmq-cpu-assignment-tips-and-tricks/

Optimizing Network: http://www.darrylvanderpeijl.com/windows-server-2016-networking-optimizing-network-settings/

Other Offloading Features: https://msdn.microsoft.com/en-us/library/windows/hardware/dn567650(v=vs.85).aspx

SAP: https://blogs.msdn.microsoft.com/saponsqlserver/2012/01/11/network-settings-network-teaming-receive-side-scaling-rss-unbalanced-cpu-load/