    7a25759e
    cpuidle: Select polling interval based on a c-state with a longer target residency · 7a25759e
    Mel Gorman authored
    
    
    It was noted that a few workloads that idle rapidly regressed when commit
    36fcb429 ("cpuidle: use first valid target residency as poll time")
    was merged. The workloads in question were heavy communicators that idle
    rapidly and were impacted by the c-state exit latency as the active CPUs
    were not polling at the time of wakeup. As they were not particularly
    realistic workloads, it was not considered to be a major problem.
    
    Unfortunately, a bug was reported for a real workload in a production
    environment that relied on large numbers of threads operating in a worker
    pool pattern. These threads would idle for periods of time longer than the
    C1 target residency and so incurred the c-state exit latency penalty. The
    application is very sensitive to wakeup latency and was indirectly relying
    on the behaviour prior to commit a37b969a ("cpuidle: poll_state: Add
    time limit to poll_idle()") to poll for long enough to avoid the exit
    latency cost.
    
    The target residency of C1 is typically very short. On some x86 machines,
    it can be as low as 2 microseconds. In poll_idle(), the clock is checked
    every POLL_IDLE_RELAX_COUNT iterations of cpu_relax() and even one
    iteration of that loop can be over 1 microsecond, so the polling interval
    is very close to the granularity of what poll_idle() can detect.
    Furthermore, a basic ping-pong workload like perf bench pipe has a
    round-trip time longer than 2 microseconds, meaning that the CPU will
    almost certainly not be polling when the ping-pong completes.
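
The granularity problem described above can be illustrated with a minimal
userspace sketch. It is not the kernel's poll_idle(): poll_with_limit(),
now_ns() and RELAX_COUNT are hypothetical names, the value 200 is an assumed
stand-in for POLL_IDLE_RELAX_COUNT, an atomic flag stands in for the
reschedule check, and C11 timespec_get() stands in for the kernel's clock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Assumed stand-in for POLL_IDLE_RELAX_COUNT; the real value may differ. */
#define RELAX_COUNT 200

/* Wall-clock time in nanoseconds (a stand-in for the kernel's local clock). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Spin until *wake becomes non-zero or limit_ns has elapsed.
 * Returns true if the time limit expired first.
 */
bool poll_with_limit(atomic_int *wake, uint64_t limit_ns)
{
    uint64_t start = now_ns();
    unsigned int loops = 0;

    while (!atomic_load_explicit(wake, memory_order_acquire)) {
        /* The clock is only consulted once per RELAX_COUNT spins. */
        if (loops++ < RELAX_COUNT)
            continue;
        loops = 0;

        if (now_ns() - start > limit_ns)
            return true;
    }
    return false;
}
```

Because the clock is read only once per batch of spins, the shortest timeout
the loop can observe is the duration of one batch. A limit of about 2
microseconds is therefore barely distinguishable from not polling at all.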
    
    This patch selects a polling interval based on an enabled c-state that
    has a target residency longer than 10usec. If there is no such enabled
    c-state then polling will be up to TICK_NSEC/16, similar to what it was
    up until kernel 4.20. Polling for a full tick is unlikely (rescheduling
    event) and is much longer than the existing target residencies for a deep
    c-state.
    
    As an example, consider the following c-state information from an
    Intel CPU:
    
    	residency	exit_latency
    C1	2		2
    C1E	20		10
    C3	100		33
    C6	400		133
    
    The polling interval selected is 20usec. If booted with
    intel_idle.max_cstate=1 then the polling interval is 250usec as the deeper
    c-states were not available.
    
    On an AMD EPYC machine, the c-state information is more limited and
    looks like:
    
    	residency	exit_latency
    C1	2		1
    C2	800		400
    
    The polling interval selected is 250usec. While C2 was considered, the
    polling interval was clamped by CPUIDLE_POLL_MAX.
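
The selection rule and the two examples above can be sketched in standalone C.
This is an illustration under stated assumptions rather than the patch itself:
struct cstate, pick_poll_limit_us() and the *_US constants are hypothetical
names, the 250usec cap assumes a 4ms tick (HZ=250) so that TICK_NSEC/16 gives
the figure quoted above, and treating a residency of exactly 10usec as
qualifying is also an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for per-state driver data; not the kernel struct. */
struct cstate {
    const char *name;
    uint64_t target_residency_us;
    int disabled;                   /* non-zero if the state is disabled */
};

#define POLL_MIN_US 10u                 /* threshold from the text above */
#define TICK_US     4000u               /* assumption: HZ=250, a 4 ms tick */
#define POLL_MAX_US (TICK_US / 16u)     /* 250 usec cap */

/*
 * Return the polling limit in microseconds: the target residency of the
 * first enabled state that meets the minimum, clamped to the cap, or the
 * cap itself when no state qualifies.
 */
uint64_t pick_poll_limit_us(const struct cstate *states, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint64_t residency = states[i].target_residency_us;

        if (states[i].disabled)
            continue;
        if (residency < POLL_MIN_US)
            continue;               /* too short, e.g. C1 at 2 usec */

        return residency < POLL_MAX_US ? residency : POLL_MAX_US;
    }
    return POLL_MAX_US;
}
```

Fed the Intel table, the sketch skips C1 and returns 20 from C1E. With only C1
available it falls through to the 250 cap. Fed the AMD table, it finds C2 at
800 and clamps it to 250.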
    
    Note that it is not expected that polling will be a universal win. As
    well as potentially trading power for performance, the performance is not
    guaranteed if the extra polling prevented a turbo state being reached.
    Making it a tunable was considered but it's driver-specific, may be
    overridden by a governor and is not a guaranteed polling interval making
    it difficult to describe without knowledge of the implementation.
    
    tbench4
    			     vanilla		    polling
    Hmean     1        497.89 (   0.00%)      543.15 *   9.09%*
    Hmean     2        975.88 (   0.00%)     1059.73 *   8.59%*
    Hmean     4       1953.97 (   0.00%)     2081.37 *   6.52%*
    Hmean     8       3645.76 (   0.00%)     4052.95 *  11.17%*
    Hmean     16      6882.21 (   0.00%)     6995.93 *   1.65%*
    Hmean     32     10752.20 (   0.00%)    10731.53 *  -0.19%*
    Hmean     64     12875.08 (   0.00%)    12478.13 *  -3.08%*
    Hmean     128    21500.54 (   0.00%)    21098.60 *  -1.87%*
    Hmean     256    21253.70 (   0.00%)    21027.18 *  -1.07%*
    Hmean     320    20813.50 (   0.00%)    20580.64 *  -1.12%*
    
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>