Thursday, January 2, 2014

Monitoring CPU Utilization Under Hyper-threading

The question of accurately measuring processor utilization with hyper-threading (HT) enabled came up recently in a Performance Engineering Group discussion on LinkedIn. Since I spent considerable time looking into this issue while writing my Guerrilla Capacity Planning book, I thought I'd repeat my response here (slightly edited for this blog), in case it's useful to a broader audience interested in performance and capacity management. Not much has changed, it seems.

In a nutshell, the original question concerned whether or not it was possible for a single core to be observed running at 200% busy, as reported by Linux top, when HT is enabled.


This question is an old canard (well, "old" for multicore technology). I call it the "Missing MIPS" paradox. Regarding the question, "Is it really possible for a single core to be 200% busy?" the short answer is: never! So, you are quite right to be highly suspicious and confused.

You don't say which make of processor is running on your hardware platform, but I'll guess Intel. Very briefly, the OS (Linux in your case) is being lied to. Each core has 2 registers where inbound threads are stored for processing. Intel calls these AS (Architectural State) registers. With HT *disabled*, the OS sees only a single AS register as being available. In that case, the mapping between state registers and cores is 1:1. The idea behind HT is to allow a different application thread to run when the currently running app stalls due to branch misprediction, bubbles in the pipeline, etc. To make that possible, there has to be another port or AS register. That register becomes visible to the OS when HT is enabled. However, the OS (and everything up the food chain to whatever perf tools you are using) now thinks twice the processor capacity is available, i.e., 100% CPU at each AS port.

But under the hood, there is still only *one* execution unit: the single physical core you started with before HT was enabled. The difference is that it is being shared in some way between the 2 AS ports. How the single core gets switched between the two ports is very complicated, but it is most easily understood in terms of polled queues. I go into that level of detail in my GCaP classes.
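To see which of the "CPUs" reported by the OS are really AS ports on the same physical core, you can group the logical CPUs by their hardware-thread siblings. What follows is only a minimal sketch. It assumes a Linux kernel that exposes the sysfs files /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, and the function names (parse_cpu_list, group_siblings, read_sibling_lists) are my own, not part of any tool.

```python
import glob

# Assumed sysfs location of the per-CPU sibling lists on Linux.
SIBLINGS_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0,4' or '0-1' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_siblings(sibling_lists):
    """Collapse per-CPU sibling lists into one tuple per physical core."""
    return sorted({tuple(parse_cpu_list(s)) for s in sibling_lists})


def read_sibling_lists(pattern=SIBLINGS_GLOB):
    """Read every sibling list the kernel exposes (empty if none exist)."""
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists


if __name__ == "__main__":
    for logical_cpus in group_siblings(read_sibling_lists()):
        print("one physical core, seen by the OS as logical CPUs:", logical_cpus)
```

With HT enabled, each tuple should contain two logical CPUs. Those two are the ones whose utilization figures cannot simply be added together as if each had a full core behind it.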

The best-case test measurements I have indicate that each HT port cannot become more than 75% busy, on average, or 150% of the total expected 200% capacity according to the OS. The "missing" 50% of capacity, which I referred to earlier, is an illusion. Intel has claimed that something in the range of 120% to 130% can be expected for general applications.
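The arithmetic behind those figures is easy to make explicit. This is just an illustrative sketch: the function name ht_capacity and its dictionary keys are hypothetical, and the inputs are the numbers quoted above (75% per port from my measurements, and per-port ceilings of 60% and 65% implied by Intel's 120% to 130% range).

```python
def ht_capacity(max_port_busy_pct, ports=2):
    """Compare the capacity the OS believes it has with what one core delivers.

    max_port_busy_pct: highest average busy % observed at each AS port.
    ports: number of AS ports per physical core (2 with HT enabled).
    """
    nominal = 100.0 * ports               # what the OS thinks is available
    effective = max_port_busy_pct * ports  # what the shared core can deliver
    return {
        "nominal_pct": nominal,
        "effective_pct": effective,
        "phantom_pct": nominal - effective,  # the "missing" capacity
    }


if __name__ == "__main__":
    # Best-case measurement quoted above: 75% busy per port.
    print(ht_capacity(75.0))
    # Intel's 120% to 130% claim, expressed as per-port ceilings.
    print(ht_capacity(60.0))
    print(ht_capacity(65.0))
```

The point of the exercise is that the "phantom" percentage was never real capacity, so a capacity plan based on the nominal 200% will overestimate the available headroom.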

The only way to resolve the actual CPU utilization is a direct measurement of the core busy time. SPARC and IBM Power series processors have separate peephole performance registers for that purpose. Intel processors do not (last time I looked). Therefore, no OS-based performance tool can report processor utilization correctly without access to such additional data.

Now that you understand how HT creates certain illusions about processor capacity, consider this. HT is nothing more than a form of VIRTUALIZATION. It creates the illusion that there are two virtual CPUs (2 VPUs) where there was previously only one physical CPU.

As I mentioned earlier, the performance of this form of micro-level virtualization (i.e., thread level) can be understood as arising from polling queues or registers. A similar general polling mechanism is also used in meso-level virtualization (i.e., VM/guest level). Elsewhere, I've referred to this as the hyper-services spectrum. (See image above.) So, now you can start to worry about what performance data is "missing" in hypervisors.

As I've said in my books and elsewhere: "All virtualization is about illusions and although it is perfectly reasonable to perpetrate such illusions on unwitting users, it is entirely unreasonable to propagate those same illusions to the performance analyst."

4 comments:

ceeaspb said...

it never ceases to amaze me how measuring something as seemingly simple as cpu is such a minefield.

There are some interesting docs on the IBM side (I assume you are talking about PURR?) showing that simply reading %cpu from vmstat is not as simple as it seems.

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Not+AIX/page/Understanding+Processor+Utilization+on+Power+Systems+-+AIX

https://www.ibm.com/developerworks/community/blogs/aixpert/entry/local_near_far_memory_part_3_scheduling_processes_to_smt_virtual_processors130?lang=en


As more and more servers are virtualised the vm tech needs to support this also:

labs.vmware.com/download/143/

kb.vmware.com/kb/2030221
http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:PAPI_on_KVM

ceeaspb

Neil Gunther said...

Yes, I was referring to the PURR register on IBM Power X (don't know which s/w tools in AIX) and the corestat tool in Solaris (reads certain micro state h/w registers, I think).

Anonymous said...

Some reference from Intel:

http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization#cpu_utilization

http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/

"CPU utilization is not a good estimate of the true load on the system or the headroom the system has remaining to do additional work..."
