Whoa – weird load ave and cpu freq reports from openSUSE 10.3
Just built a cluster of 25 Dell systems for our developers. These are Dell 1435SC systems, each with a pair of dual-core AMD Opteron 2218 processors and 8 GB RAM. We installed openSUSE 10.3 on them all, then added the Ganglia gmond and also OFED, the successor to the OpenIB InfiniBand stack.
Handed them off to our developers to certify and they came right back asking weird questions – like:
- Why is the load always at least 1.00?
- Are these *really* 1000 MHz CPUs?
- How come the screen background color is always dark puce?
(Just kidding about that last one – Prasad!)
Sure enough. Uptime shows load on all 25 systems is always at least 1.00. Usually right there. And the cpu MHz in /proc/cpuinfo is almost always 1000. I saw it at 2600 for all four cores on one machine and the next time I looked, it had dropped to 1000 on all four cores.
Here’s part of the output of “cat /proc/cpuinfo” for the first proc, number 0:
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2218
stepping : 3
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 2001.35
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
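A possible explanation for the 1000 MHz reading, not confirmed anywhere in this post: the power management line lists fid and vid (frequency and voltage ID control), so the kernel's cpufreq scaling may simply be clocking idle cores down. The sketch below assumes a Linux system with the usual /proc/cpuinfo layout and pulls out the per-core MHz values so they can be compared while idle and under load. The function name cpu_mhz is made up for illustration.

```shell
# Print one "cpu MHz" value per core from /proc/cpuinfo-style input on stdin.
cpu_mhz() {
    awk -F': *' '/^cpu MHz/ { print $2 }'
}

# Usage (hypothetical): compare readings while idle and while busy.
#   cpu_mhz < /proc/cpuinfo
# If cpufreq is active, the governor is normally exposed here
# (path assumed present on this kernel, not checked in the post):
#   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```

If the values climb back toward 2600 when a core is kept busy, the 1000 MHz figure is probably just the idle frequency rather than a fault.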
Here are the load aberrations reported by uptime:
for num in $(seq 50 74)
do
echo -n supib$num
ssh supib$num uptime
done 2>&1 | grep 'average'
supib50 12:03pm up 6 days 20:35, 3 users, load average: 2.18, 3.93, 3.99
supib51 12:03pm up 1 day 2:16, 0 users, load average: 1.99, 3.80, 3.82
supib52 12:03pm up 5 days 22:51, 0 users, load average: 1.00, 1.01, 1.31
supib53 12:03pm up 5 days 22:51, 0 users, load average: 1.00, 1.01, 1.31
supib54 12:03pm up 6 days 1:12, 0 users, load average: 1.00, 1.01, 1.32
supib55 12:03pm up 5 days 22:32, 0 users, load average: 1.00, 1.01, 1.31
supib56 12:03pm up 5 days 22:35, 0 users, load average: 1.00, 1.02, 1.32
supib57 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.01, 1.29
supib58 12:03pm up 6 days 0:54, 0 users, load average: 1.08, 1.02, 1.01
supib59 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib60 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib61 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib62 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib63 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib64 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib65 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.02, 1.00
supib66 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.01
supib67 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib68 12:03pm up 6 days 0:54, 0 users, load average: 2.00, 2.00, 2.00
supib69 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.01
supib70 not reachable through the network
supib71 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.02
supib72 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.01
supib73 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.01
supib74 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
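To pick the affected hosts out of output like the above, a small filter helps. This is a rough sketch that assumes each line starts with the host name followed by uptime's usual "load average:" field; busy_hosts is a hypothetical helper name, not part of any tool used here.

```shell
# Read lines of "<host> <uptime output>" on stdin and print the host name and
# 1-minute load for every host at or above the threshold (default 1.00).
# Lines without a "load average:" field (e.g. unreachable hosts) are skipped.
busy_hosts() {
    awk -v min="${1:-1.00}" '
        /load average:/ {
            split($0, parts, "load average: ")
            split(parts[2], loads, ", ")
            if (loads[1] + 0 >= min + 0) print $1, loads[1]
        }'
}

# Usage (hypothetical): pipe the loop output through the filter.
#   for num in $(seq 50 74); do echo -n supib$num; ssh supib$num uptime; done 2>&1 | busy_hosts 1.00
```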
OK, so this is weird. Off to dig.
OK, turns out that the ofed packages that are stock with openSUSE 10.3 – which I’ll list here:
ofed-1.2.5-18.x86_64.rpm
ofed-doc-1.2.5-18.x86_64.rpm
ofed-kmp-default-1.2.5_2.6.22.5_31-18.x86_64.rpm
… install a module named ib_mthca, which appears to be the core module for InfiniBand support.
Once that module is loaded into the kernel, the load steadily rises to the baseline of 1.00 I reported earlier.
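On Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a single kernel thread stuck in D state would be enough to hold an otherwise idle machine at 1.00. Whether that is what ib_mthca does here is an assumption, not something verified in this post. A minimal sketch for checking, with hypothetical helper names module_loaded and d_state_tasks:

```shell
# Succeed if the named module appears in lsmod-style input on stdin.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Print tasks whose state starts with D (uninterruptible sleep) from
# "ps -eo state,pid,comm"-style input on stdin.
d_state_tasks() {
    awk '$1 ~ /^D/'
}

# Usage (hypothetical):
#   lsmod | module_loaded ib_mthca && echo "ib_mthca is loaded"
#   ps -eo state,pid,comm | d_state_tasks
```

Running the second command before and after loading the module would show whether a new D-state task accounts for the extra 1.00.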
Odd – moving to report to the openib folks.
New bug …
https://bugs.openfabrics.org/show_bug.cgi?id=866