Performance Analysis Methodology

A performance analysis methodology is a procedure that you can follow to analyze system or application performance. These generally provide a starting point and then guidance to the root cause, or causes. Different methodologies are suited to solving different classes of issues, and you may try more than one before accomplishing your goal.

Analysis without a methodology can become a fishing expedition, where metrics are examined ad hoc until the issue is found, if it is found at all.

Methodologies documented in more detail on this site are:

The USE Method: for finding resource bottlenecks

The TSA Method: for analyzing application time

Off-CPU Analysis: for analyzing any type of thread wait latency

Active Benchmarking: for accurate and successful benchmarking

The following briefly summarizes methodologies I've either created or encountered. You can print these all out as a cheat sheet/reminder.

Summaries

I first summarized and named various performance methodologies (mostly developed by me) for my USENIX LISA 2012 talk: Performance Analysis Methodology (PDF, slideshare, youtube, USENIX), then later documented them in my Systems Performance book, and the ACMQ article Thinking Methodically about Performance, which was also published in Communications of the ACM, Feb 2013. More detailed references are at the end of this page.

The following is my most up to date summary list, with methodologies enumerated. These begin with anti-methods, which are included for comparison, and not to follow.

Anti-Methodologies

Blame-Someone-Else Anti-Method

1. Find a system or environment component you are not responsible for
2. Hypothesize that the issue is with that component
3. Redirect the issue to the responsible team
4. When proven wrong, go to 1

Streetlight Anti-Method

1. Pick observability tools that are:
   - familiar
   - found on the Internet
   - found at random
2. Run tools
3. Look for obvious issues

Drunk Man Anti-Method

Change things at random until the problem goes away

Random Change Anti-Method

1. Measure a performance baseline
2. Pick a random attribute to change (eg, a tunable)
3. Change it in one direction
4. Measure performance
5. Change it in the other direction
6. Measure performance
7. Were the step 4 or 6 results better than the baseline? If so, keep the change; if not, revert
8. Go to step 1

Passive Benchmarking Anti-Method

1. Pick a benchmark tool
2. Run it with a variety of options
3. Make a slide deck of the results
4. Hand the slides to management

Traffic Light Anti-Method

1. Open dashboard
2. All green? Assume everything is good.
3. Something red? Assume that's a problem.

Methodologies

Ad Hoc Checklist Method

1..N. Run A, if B, do C

Problem Statement Method

1. What makes you think there is a performance problem?
2. Has this system ever performed well?
3. What has changed recently? (Software? Hardware? Load?)
4. Can the performance degradation be expressed in terms of latency or run time?
5. Does the problem affect other people or applications (or is it just you)?
6. What is the environment? What software and hardware is used? Versions? Configuration?

RTFM Method (Read The Fine Manual)

How to research performance tools and metrics:

1. Man pages
2. Books
3. Web search
4. Co-workers
5. Prior talk slides/video
6. Support services
7. Source code
8. Experimentation
9. Social media

Scientific Method

1. Question
2. Hypothesis
3. Prediction
4. Test
5. Analysis

OODA Loop

1. Observe
2. Orient
3. Decide
4. Act

Workload Characterization Method

1. Who is causing the load? (PID, UID, IP addr, ...)
2. Why is the load called? (code path)
3. What is the load? (IOPS, tput, type)
4. How is the load changing over time? (time series line graph)

Drill-Down Analysis Method

1. Start at highest level
2. Examine next-level details
3. Pick most interesting breakdown
4. If problem unsolved, go to 2

Process of Elimination

1. Divide the target into components
2. Choose a test which:
   - Can exonerate many untested components (ideally, half of those remaining)
   - Is quick to perform
3. Perform test
4. Were the tested components exonerated?
   - Yes: go to 2
   - No: problem found?
     - Yes: done
     - No: how many components were tested?
       - one: target = tested component; go to 1
       - multiple: go to 2
   - Not sure: consider components untested; go to 2 and choose a different test
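The halving strategy in Process of Elimination amounts to a bisection over components. A minimal sketch, assuming a quick test `is_faulty(components)` (a hypothetical name) that returns True when the problem reproduces with only those components in play:

```python
# Hypothetical sketch: bisect a list of components to isolate a faulty one.
# is_faulty(components) stands in for whatever quick test exonerates a half.
def eliminate(components, is_faulty):
    candidates = list(components)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]   # test half of those remaining
        if is_faulty(half):
            candidates = half                      # problem is within the tested half
        else:
            candidates = candidates[len(half):]    # tested half exonerated
    return candidates[0] if candidates else None

# Example with a stand-in test: component "C" carries the fault.
print(eliminate(["A", "B", "C", "D"], lambda cs: "C" in cs))  # prints C
```

Each round exonerates roughly half the remaining components, so N components need about log2(N) tests.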

Time Division Method

1. Measure operation time (or latency)
2. Divide time into logical synchronous components
3. Continue division until latency origin is identified
4. Quantify: estimate speedup if problem fixed

(I previously called this the "Latency Analysis Method")
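The quantify step of the Time Division Method follows from the component's share of total time: shrinking one synchronous component shrinks the total by the same amount. A minimal sketch, with illustrative numbers not taken from any real measurement:

```python
# Estimate the speedup from shrinking one synchronous component of total latency.
# reduced_ms is the component's latency after the hypothetical fix.
def estimated_speedup(total_ms, component_ms, reduced_ms):
    new_total = total_ms - component_ms + reduced_ms
    return total_ms / new_total

# Example: a 40 ms lock wait inside a 100 ms operation, cut to 10 ms:
print(round(estimated_speedup(100, 40, 10), 2))  # prints 1.43
```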

5 Whys Performance Method

1. Given delivered performance, ask, "why?", then answer this question
2..5. Given the previous answer, ask, "why?", then answer this question

By-Layer Method

Measure latency in detail (eg, as a histogram) from:

1. Dynamic languages
2. Executable
3. Libraries
4. Syscalls
5. Kernel: FS, network
6. Device drivers

Investigate the lowest layer where latency is introduced

Tools Method

1. List available performance tools (optionally add more)
2. For each tool, list its useful metrics
3. For each metric, list possible interpretations
4. Run selected tools and interpret selected metrics

USE Method

For every resource, check:

- Utilization
- Saturation
- Errors
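As a sketch, the USE Method is a loop over a resource inventory. The resource names, metric values, and the 70% utilization threshold below are assumptions for illustration; in practice the metrics come from tools such as mpstat, vmstat, and iostat:

```python
# Hypothetical USE checklist: for every resource, check utilization,
# saturation, and errors. Static numbers stand in for real measurements.
resources = {
    "CPU":    {"utilization": 0.95, "saturation": 12, "errors": 0},
    "Memory": {"utilization": 0.60, "saturation": 0,  "errors": 0},
    "Disk":   {"utilization": 0.30, "saturation": 0,  "errors": 3},
}

def use_check(resources, util_limit=0.70):
    findings = []
    for name, m in resources.items():
        if m["errors"]:
            findings.append(f"{name}: {m['errors']} errors")
        if m["saturation"]:
            findings.append(f"{name}: saturated (queue length {m['saturation']})")
        if m["utilization"] > util_limit:
            findings.append(f"{name}: {m['utilization']:.0%} utilized")
    return findings

for finding in use_check(resources):
    print(finding)
```

Errors are checked first since they are usually the quickest to interpret, then saturation, then utilization.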

RED Method

For every service or microservice, check:

- Request rate
- Errors
- Duration

CPU Profile Method

1. Take a CPU profile (especially a flame graph)
2. Understand all software in profile > 1%

Off-CPU Analysis

1. Profile per-thread off-CPU time with stack traces
2. Coalesce times with like stacks
3. Study stacks from largest to shortest time
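Steps 2 and 3 amount to summing wait time per unique stack and sorting. A minimal sketch with made-up stack samples (in practice these come from tracers such as bpftrace or perf):

```python
from collections import defaultdict

# Hypothetical off-CPU events: (stack trace as a tuple of frames, off-CPU ms).
events = [
    (("main", "read_file", "vfs_read", "io_schedule"), 120),
    (("main", "lock_wait", "futex_wait"), 45),
    (("main", "read_file", "vfs_read", "io_schedule"), 80),
]

# Coalesce times with like stacks.
totals = defaultdict(int)
for stack, ms in events:
    totals[stack] += ms

# Study stacks from largest to shortest time.
for stack, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{ms:>5} ms  {';'.join(stack)}")
```

The semicolon-joined stack format matches the folded-stack input used by flame graph tooling.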

Stack Profile Method

1. Profile thread stack traces, on- and off-CPU
2. Coalesce
3. Study stacks bottom-up

TSA Method

For each thread of interest, measure time in operating system thread states. Eg:

Executing

Runnable

Swapping

Sleeping

Lock

Idle

Investigate states from most to least time spent, using appropriate tools
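As a sketch, the TSA Method reduces to summing per-state time for a thread and ranking the states. The numbers below are invented for illustration; real values would come from scheduler statistics, delay accounting, or tracing:

```python
# Hypothetical per-thread state times (ms) over a measurement interval.
thread_states = {
    "Executing": 420, "Runnable": 180, "Swapping": 0,
    "Sleeping": 900, "Lock": 250, "Idle": 50,
}

# Investigate states from most to least time spent.
for state, ms in sorted(thread_states.items(), key=lambda kv: -kv[1]):
    if ms:
        print(f"{state:<10} {ms} ms")
```

In this example Sleeping dominates, which would direct the investigation toward what the thread is blocked on (eg, via off-CPU analysis) rather than toward CPU usage.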

Active Benchmarking Method

1. Configure the benchmark to run for a long duration
2. While running, analyze performance using other tools, and determine limiting factors

Method R

1. Select user actions that matter for the business workload
2. Measure causes of response time for user actions
3. Calculate best net-payoff optimization activity
   - If sufficient gain, tune
   - If insufficient gain, suspend tuning until something changes
4. Go to 1

Performance Evaluation Steps

1. State the goals of the study and define system boundaries
2. List system services and possible outcomes
3. Select performance metrics
4. List system and workload parameters
5. Select factors and their values
6. Select the workload
7. Design the experiments
8. Analyze and interpret the data
9. Present the results
10. If necessary, start over

Capacity Planning Process

1. Instrument the system
2. Monitor system usage
3. Characterize workload
4. Predict performance under different alternatives
5. Select the lowest cost, highest performance alternative

Intel Hierarchical Top-Down Performance Characterization Methodology

1. Are UOPs issued?
   - If yes: Are UOPs retired?
     - If yes: retiring (good)
     - If no: investigate bad speculation
   - If no: Allocation stall?
     - If yes: investigate back-end stalls
     - If no: investigate front-end stalls
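The branch logic above can be written directly as a small decision function. The three booleans stand in for conditions that would be derived from CPU performance counters:

```python
# Sketch of the hierarchical top-down decision tree; the boolean inputs are
# stand-ins for conditions derived from performance monitoring counters.
def top_down(uops_issued, uops_retired, allocation_stall):
    if uops_issued:
        return "retiring (good)" if uops_retired else "investigate bad speculation"
    return ("investigate back-end stalls" if allocation_stall
            else "investigate front-end stalls")

print(top_down(True, True, False))   # prints retiring (good)
print(top_down(False, False, True))  # prints investigate back-end stalls
```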

Performance Mantras

1. Don't do it
2. Do it, but don't do it again
3. Do it less
4. Do it later
5. Do it when they're not looking
6. Do it concurrently
7. Do it cheaper

Benchmarking Checklist

1. Why not double?
2. Did it break limits?
3. Did it error?
4. Does it reproduce?
5. Does it matter?
6. Did it even happen?