This is part 2 of our introduction to ESXTOP, where we’ll go over how to use replay mode as well as which stats to look for when performing basic troubleshooting in ESXi. Be sure to check out part 1 for more information on how to access the ESXTOP tool and use interactive and batch mode.

Using Replay Mode

Replay mode allows ESXTOP to replay the resource statistics collected when running vm-support. The information can be manipulated just like when using interactive mode. This is very useful if you want a VMware support engineer or a colleague to review the performance information on your host. You can simply run vm-support, create the performance logs, and send them the file to review.

In the example below we will create a vm-support file to analyze with ESXTOP’s replay mode. To collect information using vm-support, browse to the location where you’d like the file to be created. Note that this file can become quite large, so make sure the location has enough free space.

Simply use the vm-support command with the -p switch to specify that we want to collect performance snapshots, and include the -a switch followed by “PerformanceSnapshot:vsi” to collect only the performance snapshot manifest. We will also use the -i switch to specify the seconds between snapshots (VMware recommends 10 in most cases), the -d switch to specify the duration of the performance monitoring in seconds, and the -w switch to set the working directory where the file is created (here, the current directory). So, if we wanted to collect information for 300 seconds, the syntax would look like the following:

vm-support -p -a PerformanceSnapshot:vsi -i 10 -d 300 -w .
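As a quick sanity check on those two numbers, the interval and duration together determine roughly how many snapshots end up in the bundle. Here is a minimal Python sketch of that arithmetic. The function name is my own, and it assumes one snapshot per interval, so the real count may be off by one:

```python
def expected_snapshots(duration_s, interval_s):
    """Approximate number of snapshots for a given -d (duration) and -i (interval).

    Assumption: one snapshot is taken per interval; this is an estimate,
    not something vm-support reports.
    """
    if duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration and interval must be positive")
    return duration_s // interval_s


print(expected_snapshots(300, 10))  # 30
```

Halving the interval doubles the number of snapshots, which is worth keeping in mind when estimating how large the file will get.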

Once the process completes, a .tgz file is created containing our performance data. In our example we will use ESXTOP to replay the information from the file we’ve created. First we need to extract the contents of the file, which we do by running the following command:

tar -xzf ./esx-esx01.TGLAB.LCL-2016-03-10--17.16.tgz

The files are extracted into a folder with the same name as the archive.

Now we can open up ESXTOP in replay mode by using the -R switch (make sure it’s capitalized) and specifying the extracted folder:

esxtop -R esx-esx01.TGLAB.LCL-2016-03-10--17.16

ESXTOP opens the information in interactive mode. We can now use the interactive mode commands to navigate our way around and diagnose the performance issue:

You may run into an issue where replay mode displays the error “all vm-support snapshots have been used”. If that happens, browse into the extracted folder and run the reconstruct.sh script shown below. ESXTOP replay mode should work after that:

./reconstruct.sh

ESXTOP Basic Performance Troubleshooting

Now that you know how to use each mode of ESXTOP, how do you go about troubleshooting performance issues at the virtualization layer? Well, you need to know what to look for. If you press the ‘h’ key while in interactive mode, you can see the different display views that you can use to diagnose whatever performance issue you might suspect. Below is a basic list of some of the stats to look for when troubleshooting. I’ve also included the threshold values to be cautious of for each stat. However, I really recommend reading up more on each resource. There are many great guides out there that dive into detail on how to diagnose performance issues using ESXTOP:

CPU

%RDY – Indicates the percentage of time a VM was ready to run but could not because there weren’t enough CPU resources available. This could be due to too many vCPUs, vSMP VMs, or a CPU limit enforced on a VM. Threshold: Higher than 10.

%SWPWT – Indicates the percentage of time a VM has to wait for the host to swap memory. This could be a sign of overcommitted memory. Threshold: Higher than 5.

%MLMTD – Indicates the percentage of time a VM or world was not scheduled because of a limit setting. Unless a limit on a resource pool or VM was purposely configured by design, there shouldn’t be anything higher than 0 in this field. Threshold: Greater than 0.

%CSTP – Indicates the percentage of time a VM spends in a ready, co-deschedule state. This field really only applies to virtual machines that are using vSMP and indicates that one vCPU is being used a lot more than the other vCPUs allocated to the VM. Threshold: Higher than 3.
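To make the CPU thresholds above a bit more concrete, here is a minimal Python sketch that flags any counter exceeding the value listed in this article. The function name and the sample readings are made up for illustration, and the thresholds are only the rules of thumb from the list above, not hard limits:

```python
# Rule-of-thumb thresholds from the list above (all values are percentages).
CPU_THRESHOLDS = {"%RDY": 10.0, "%SWPWT": 5.0, "%MLMTD": 0.0, "%CSTP": 3.0}


def flag_cpu_counters(sample):
    """Return the counter names in `sample` whose value exceeds its threshold."""
    return [name for name, limit in CPU_THRESHOLDS.items()
            if sample.get(name, 0.0) > limit]


# Hypothetical readings for one VM, copied by hand from the esxtop CPU view.
sample = {"%RDY": 12.4, "%SWPWT": 0.0, "%MLMTD": 0.0, "%CSTP": 3.5}
print(flag_cpu_counters(sample))  # ['%RDY', '%CSTP']
```

In this made-up case the VM is both waiting for CPU time and being co-descheduled, which would point toward too many vCPUs rather than a configured limit.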

Memory

MCTLSZ (MB) – Indicates the amount of guest physical memory the ESXi host is reclaiming through the balloon driver. Could be a sign of overcommitted memory. Threshold: Greater than 0.

ZIP/s (MB/s) – Indicates the amount of memory that is compressed per second on the host. If the host is compressing memory pages, it’s an indicator of memory contention and is usually due to overcommitted memory. Threshold: Greater than 0.

UNZIP/s (MB/s) – Indicates the amount of memory that is decompressed per second on the host. Can be a sign of overcommitted memory. Threshold: greater than 0.

SWCUR (MB) – Memory that is currently swapped out for the VM or resource pool. Points to overcommitted memory. Threshold: Greater than 0.

CACHEUSD (MB) – Amount of compression cache in use, that is, memory the ESXi host has compressed. Could be an indication of overcommitted memory. Threshold: Greater than 0.

SWW/s and SWR/s – Indicate the rate at which the ESXi host writes memory to swap on disk (SWW/s) or reads it back from swap (SWR/s). A possible cause would be overcommitted memory. Threshold: Greater than 0.
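Since every memory counter above uses the same “greater than 0” threshold, the more useful question is usually which kind of reclamation is happening. The following Python sketch groups the counters by the technique they describe in the list above. The grouping dictionary, the function name, and the sample values are my own illustration:

```python
# Counters grouped by the reclamation technique they describe in the list above.
RECLAMATION_COUNTERS = {
    "ballooning": ["MCTLSZ"],
    "compression": ["ZIP/s", "UNZIP/s", "CACHEUSD"],
    "swapping": ["SWCUR", "SWW/s", "SWR/s"],
}


def active_reclamation(sample):
    """Return the techniques that have at least one counter greater than 0."""
    return [technique for technique, counters in RECLAMATION_COUNTERS.items()
            if any(sample.get(counter, 0.0) > 0 for counter in counters)]


# Hypothetical readings for one VM from the esxtop memory view.
sample = {"MCTLSZ": 512.0, "ZIP/s": 0.0, "UNZIP/s": 0.0, "CACHEUSD": 0.0,
          "SWCUR": 0.0, "SWW/s": 0.0, "SWR/s": 0.0}
print(active_reclamation(sample))  # ['ballooning']
```

A result that includes compression or swapping means the host is reclaiming memory with more than the balloon driver, so it deserves a closer look at how much memory is allocated across the VMs on that host.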

Network

%DRPTX – Percentage of transmit packets dropped. Values higher than 0 could be a sign of high network utilization. Threshold: Greater than 0.

%DRPRX – Percentage of receive packets dropped. Values higher than 0 could be a sign of high network utilization. Threshold: Greater than 0.

Used-by and Team-PNIC – These two fields are very useful for identifying which physical NIC a VM is using.

Disk

DAVG/cmd – Indicates the average device latency per command, in milliseconds, at the device driver level. High values point to storage performance issues. Threshold: Over 25

ABRTS/s – Commands aborted per second. Aborts are issued by the guest OS when storage stops responding. Windows has a default timeout of 60 seconds. A possible cause could be an issue with the storage fabric or array. Threshold: Anything over 0

KAVG/cmd – Average VMkernel latency per command, in milliseconds. A high value indicates I/O is being throttled between the guest OS and storage. Your best bet is to check with the vendor for performance tuning options or an updated firmware release. Threshold: Over 2

GAVG/cmd – Average guest operating system latency per command, in milliseconds. This value is the sum of DAVG and KAVG. Threshold: Over 25

RESETS/s – Command resets per second. A reset command gets issued when the operation fails to reach the target. Threshold: Anything over 0
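Because GAVG is simply DAVG plus KAVG, comparing the three values tells you where the latency is coming from. Here is a small Python sketch of that arithmetic, using the thresholds listed above. The function names and the input values are hypothetical, chosen only to illustrate the two typical cases:

```python
def guest_latency(davg_ms, kavg_ms):
    """GAVG/cmd as described above: device latency plus VMkernel latency."""
    return davg_ms + kavg_ms


def flag_latency(davg_ms, kavg_ms):
    """Return the latency counters that exceed the thresholds from this article."""
    gavg_ms = guest_latency(davg_ms, kavg_ms)
    readings = {
        "DAVG/cmd": (davg_ms, 25.0),
        "KAVG/cmd": (kavg_ms, 2.0),
        "GAVG/cmd": (gavg_ms, 25.0),
    }
    return [name for name, (value, limit) in readings.items() if value > limit]


print(guest_latency(30.0, 1.0))  # 31.0
print(flag_latency(30.0, 1.0))   # ['DAVG/cmd', 'GAVG/cmd']
print(flag_latency(20.0, 6.0))   # ['KAVG/cmd', 'GAVG/cmd']
```

In the first case the device itself is slow, which points at the array or fabric. In the second case the device latency is acceptable, but time spent in the VMkernel pushes the guest latency over the threshold.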

Wrap-Up

When troubleshooting performance issues in vSphere, the tools and tricks above can be immensely helpful, so don’t forget about them when you find yourself in a troubleshooting situation.

How about you? What are your favorite tools and methods for troubleshooting vSphere performance issues? Feel free to let us know in the comments section below!

Thanks for reading!