Power efficiency is an important consideration for anyone concerned with business costs or environmental issues. In this article, let’s look at how to use the Linux CPUfreq subsystem and in-kernel governors to change the processor’s operating frequency to improve your system’s power efficiency without having a major impact on performance. However, power efficiency tuning has its limits based on the actual hardware of this series).

The Linux CPUfreq subsystem

Starting with the 2.6.0 Linux kernel, you can dynamically scale processor frequencies through the CPUfreq subsystem. When processors operate at a lower clock speed, they consume proportionately less power and generate less heat. This dynamic scaling of the clock speed gives some control in throttling the system to consume less power when not operating at full capacity.

The CPUfreq structure makes use of governors and daemons for setting a static or dynamic power policy for the system. The dynamic governors, which I discuss later in this article, can switch between CPU frequencies based on CPU utilization to allow for power savings while not sacrificing performance. These governors also allow for some user tuning so you can customize and easily change the frequency scaling. In addition, the sched_mc_power_savings and sched_smt_power_savings settings make use of consolidating threads to save power.

C states and P states

Before we begin the CPUfreq discussion, let’s review C and P states.

C states: Almost all idle

C states, with the exception of the C0 state where the processor is running, are idle states in which the processor will unclock and shut down components to save power. The deeper the C state, the more power saving steps are taken—steps like stopping the processor clock or stopping interrupts from coming in. These states can provide power savings when the system is idle.

There is also a mode called C1E (alternately known as Enhanced C1 or C1 Enhanced Mode) that can help power savings at idle. C1E tries to provide more power savings than the traditional C1 state (which only halts the clock signal) by also lowering the voltage and frequency. In fact, C1E has the ability to lower the voltage/frequency faster than any of the CPUfreq governors.

Not all processors have all these options, but to use C states and C1E, make sure that you have the BIOS options CPU C State and C1E (or anything similar) enabled to get the greatest power savings at idle. Some systems support a C3 and even a C6 deep-sleep state.

Remember, the deeper the C state, the more power savings.

P states: In operation

P states are operational states that relate to CPU frequency and voltage. The higher the P state, the lower the frequency and voltage at which the processor runs. The CPUfreq governors use P states to change frequencies and lower power consumption.

You need to have the Processor Performance States BIOS option (or anything like it) enabled on your system to use P states and the CPUfreq governors. Figure 1 is a simple diagram of C and P states.

Figure 1. C and P states

Prerequisites for the CPUfreq subsystem

Before you can make use of the CPUfreq subsystem, you need the prerequisites described in this section. CPUfreq is enabled by default in RHEL 5.2 (it’s normally enabled as well in other distributions). One quick way to check if CPUfreq is already enabled is to look in the /sys filesystem. If you see the cpufreq directory listed under /sys/devices/system/cpu/cpu*/cpufreq/, then your system currently has CPUfreq enabled. If you do not see this directory listed, follow these instructions to ensure that you have the required pieces.

First, make sure that your processor can support frequency scaling. See the list of hardware that supports the CPUfreq subsystem in resources on the right.

Next, look at your kernel config file. All of the required configurations are usually set by default for the RHEL 5.2 kernel, but you may want to change some of the settings to achieve the desired startup state for your system. The following options are located in the CPU Frequency scaling section of the config file:

CONFIG_CPU_FREQ CONFIG_CPU_FREQ

This option must be set to y to make use of the kernel’s CPU frequency scaling.

CONFIG_CPU_FREQ_GOV_PERFORMANCE,

CONFIG_CPU_FREQ_GOV_POWERSAVE, CONFIG_CPU_FREQ_GOV_USERSPACE,

CONFIG_CPU_FREQ_GOV_ONDEMAND, CONFIG_CPU_FREQ_GOV_CONSERVATIVE

These options are for each of the available CPUfreq governors. To use a governor, set the config option to y or m . If you set the option to y , that governor’s module will be built into the kernel. If you set the option to m , you will have to load the module yourself for each boot by issuing one or all of the following commands:

modprobe cpufreq_performance modprobe cpufreq_powersave modprobe

cpufreq_userspace modprobe cpufreq_ondemand modprobe cpufreq_conservative

Alternatively, if you chose m , you can have the module loaded at boot time by adding the governor modules to /etc/rc.local. Also note that you can set either the userspace or the performance governor to the default by setting either CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE or CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE to y .

Also, to use sched_mc_power_savings and sched_smt_power_savings , which are discussed later, make sure that options CONFIG_SCHED_MC and CONFIG_SCHED_SMT are set to y under the Processor type and features section of the config file.

For any changes in the config file to take effect, you must rebuild and boot your kernel. You probably know how to do that, but if you’re fuzzy on any details, check any of the many sources for instructions on rebuilding a Linux kernel (see resources section for suggestions).

Here come the governors

There are five in-kernel governors available for use with the CPUfreq subsystem. These governors set the processor frequency based on certain criteria; some dynamically change the frequency as inputs are changed either by the system or the user. This articles focuses on RHEL 5.2, which is based on the 2.6.18 kernel, so all of these governors are available for use. Let’s meet them. (Part 2 and Part 3 in this series takes you into deeper detail on the governors.)

Performance governor: Highest frequency

The performance governor statically sets the processor to the highest frequency available. You can adjust the range of frequencies available to this governor. As the name implies, this governor’s goal is to get the maximum performance out of a system by setting the processor clock speed to the maximum level and leaving it there. This governor does not attempt to provide any power savings by default, although you can tune the governor to change the frequency it selects.

Powersave governor: Lowest frequency

On the flip side, the powersave governor statically sets the processor to the lowest available frequency. Again you can adjust the range of frequencies available to this governor. The purpose of this governor is to run at the lowest speed possible at all times. Obviously this can affect performance in that the system will never rise above this frequency no matter how busy the processors are.

In fact, this governor often does not save any power since the greatest power savings usually come from the savings at idle through entering C states. Using the powersave governor will prolong a running process since it will be running at the slowest frequency; therefore, it will take longer for the system to go idle and get the C state savings.

Userspace governor: Manual frequencies

Next there is the userspace governor, which allows you to select and set a frequency manually. This governor also works with processor frequency daemons running in userspace to control frequency (we’ll talk more about daemons and provide examples in Part 2). This governor is useful for setting a unique power policy that is not preset or available from the other governors; you can also use it to experiment with policies.

Note that the userspace governor itself does not dynamically change the frequency; rather, it allows you or a userspace program to dynamically select the processor frequency.

Ondemand governor: Frequency change based on processor use

Introduced in the 2.6.10 kernel, the ondemand governor was the first in-kernel governor to dynamically change processor frequency based on processor utilization. The ondemand governor checks the processor utilization and if it exceeds the threshold, the governor will set the frequency to the highest available. If the governor finds the utilization to be less than the threshold, it steps down the frequency to the next available. If the system continues to be underutilized, the governor will continue stepping down the frequency until the lowest available is set.

You can control the range of frequencies available, the rate at which the governor checks utilization on the system, and the utilization threshold.

Conservative governor: A more gradual ondemand

Based on the ondemand governor, the conservative governor (which was introduced in the 2.6.12 kernel) is similar in that it dynamically adjusts frequencies based on processor utilization; however, the conservative governor behaves a little differently and allows for a more gradual increase in power. The conservative governor checks the processor utilization and if it is above or below the utilization thresholds, the governor steps up or down the frequency to the next available instead of just jumping to the highest frequency as ondemand does.

You can control the range of frequencies available, the rate at which the governor checks utilization on the system, the utilization thresholds, and the frequency step rate.