Studying openSUSE YaST Online Update (YOU).

2008-02-14 mdiehn.
Hi, this is Mike. I’m just making notes as I learn, so not to forget what I’ve discovered. Happy Valentine’s Day.

Working on rondo.fluent.com, an openSUSE 10.3 system – trying to understand the workings of the YaST Online Update system. We want to make a local repository and wondered if we could “piggy-back” the server on a YOU client. So I’m studying a client while it works.

Found that /var/cache/zypp contains a set of XML files for each software repository configured in YaST. Have a look at what I found in them:


rondo:/var/cache/zypp # ls
raw zypp.db
rondo:/var/cache/zypp/raw # ls
openSUSE-10.3-DVD 10.3 openSUSE-10.3-Updates
rondo:/var/cache/zypp/raw/openSUSE-10.3-Updates # ls
repodata
rondo:/var/cache/zypp/raw/openSUSE-10.3-Updates/repodata # ls
patch-aaa_base-4866.xml
patch-alsa-4737.xml
patch-amarok-4492.xml
.
[300 some xml files omitted]
.
patch-yast2-trans-de-4889.xml
patch-zypper-4530.xml
primary.xml.gz
repomd.xml
repomd.xml.asc
repomd.xml.key
rondo:/var/cache/zypp/raw/openSUSE-10.3-Updates/repodata #

Looks like an XML file for every patch on the update server. The actual RPMs containing the patches aren’t put here. Instead, they appear in an ephemeral structure under /var/adm/mount that exists only while the update process is actually running:


rondo:/var/adm # find .
.
./mount
./mount/AP_0x00000002
./mount/AP_0x00000002/rpm
./mount/AP_0x00000002/rpm/x86_64
./mount/AP_0x00000002/rpm/x86_64/kernel-source-2.6.22.17-0.1.x86_64.rpm
./mount/AP_0x00000002/rpm/x86_64/kernel-source-2.6.22.16_2.6.22.17-0.2_0.1.x86_64.delta.rpm
.
[the rest of the find output isn't relevant]

Only the RPMs currently and immediately in use by the update process exist. It looks like the updater operates on one “patch” at a time, and occasionally that means more than one RPM in the queue at once, as we see here.
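
If you want to watch that transient structure appear and disappear while an update runs, something like this in a second terminal does the trick (just a rough sketch; the AP_* directory names vary from run to run):


# Poll the transient mount area once a second while the update runs elsewhere.
watch -n 1 'find /var/adm/mount -name "*.rpm" 2>/dev/null'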

More later as I find it.

Infiniband Ping Pong.

Intuitive to some, not to others.

If you want to ping across an Infiniband network, the way we use ping to test other networks, you could use ibping. Pick a system to be the destination of the ping, then go find out its Port GUID and start an ibping server on it:

Set one system as the “server” thus:


autosup25:~ # ibstat
CA 'mthca0'
CA type: MT25204
Number of ports: 1
Firmware version: 1.2.0
Hardware version: a0
Node GUID: 0x00066a00980095d7
System image GUID: 0x00066a00980095d7
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
Base lid: 3
LMC: 0
SM lid: 1
Capability mask: 0x02510a68
Port GUID:

autosup25:~ # ibping -v -S
ibwarn: [11498] ibping_serv: starting to serve...

So, the port GUID on this machine is 0x00066a00980095d7. We’ll need that on the other end. Now go to the system from which you want to start the ping – this feels backwards, doesn’t it? – and do this:


autosup30:~ # ibping -G 0x00066a00a00095d7
Pong from autosup25.mi.fluent.com (Lid 3): time 0.157 ms
Pong from autosup25.mi.fluent.com (Lid 3): time 0.100 ms
Pong from autosup25.mi.fluent.com (Lid 3): time 0.091 ms
Pong from autosup25.mi.fluent.com (Lid 3): time 0.121 ms
Pong from autosup25.mi.fluent.com (Lid 3): time 0.090 ms
Pong from autosup25.mi.fluent.com (Lid 3): time 0.092 ms

--- autosup25.mi.fluent.com (Lid 3) ibping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5799 ms
rtt min/avg/max = 0.090/0.108/0.157 ms

Now, I just move from system to system pinging that server to make sure all my servers are able to talk over the Infiniband link. Vary this a bit and you could use it in Nagios or something to monitor the IB network in a compute cluster.
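
For example, a minimal Nagios-style wrapper might look like the sketch below. It assumes an ibping server is already running on the target, and that you plug in that target’s real port GUID:


#!/bin/bash
# Rough Nagios-style check: a few ibpings to a known port GUID.
# Assumes "ibping -S" is already running on the target system.
GUID="0x00066a00980095d7"    # replace with your target's port GUID
if ibping -c 3 -G "$GUID" > /dev/null 2>&1; then
    echo "OK - ibping to $GUID answered"
    exit 0
else
    echo "CRITICAL - no ibping response from $GUID"
    exit 2
fi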

That’s it. All done. Except, don’t forget to go kill the ping server on your target.

Use pam_access.so to limit system access to members of a certain group or netgroup.

I’m building a cluster of 25 machines at work. Trying to get the Infiniband stuff to work on them – a dev found an anomaly in his benchmarking numbers and asked me to verify his work and look for trouble, and so – I did. While I was doing it, I found some other users in the company had discovered the new cluster’s component systems as “non-busy servers to play with” and had – uhm – started using them for – uhm – stuff that makes them busy as heck. Never mind. BOFH fodder.

These boxen all run openSUSE 10.3. We have them NISed up, but otherwise, they’re pretty much stock. Used to be, we’d limit user access to people in a specific netgroup via the ‘hack the /etc/passwd file’ trick, something like this:


userx:x:uid:gid:gecos:/home/userx:/bin/bash
usery:x:uid:gid:gecos:/home/usery:/bin/bash
userz:x:uid:gid:gecos:/home/userz:/bin/bash
+@netgroup_to_let_in::::::
+::::::/bin/false

The idea is that members of the NIS netgroup netgroup_to_let_in match the first + line and get in with their real shell, while everyone else falls through to the catch-all at the bottom and gets /bin/false.

That appears to work on some systems here and not on others. I haven’t yet figured out which systems like it and which don’t, and I got sick of fiddling with it. So I needed a new, real, reliable, actual method. And the method I found works intuitively. (I say found. As if it were a new thing. No. I just read a manual…)

Go define a new netgroup, call it newclustertesters, say.
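
If you’re defining it in NIS, the netgroup map entry is just a line of (host,user,domain) triples; the user names here are made up:


# In the netgroup source file on the NIS master (user names are examples):
newclustertesters (,alice,) (,bob,) (,carol,)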

Then, put this in the /etc/security/access.conf:


#
# 2008-01-22 mdiehn: restricting access while testing the new cluster systems
# so my benchmarks aren't tainted by stray user processes. Like Xvnc. :-)
#
+ : sgeadmin : ALL
+ : root : ALL
+ : @sysadm : ALL
+ : @csnh : ALL
+ : @newclustertesters : ALL
- : ALL : ALL

Then edit either /etc/pam.d/common-account (or the specific files for sshd, rlogin, login, etc.) and add this line right after all the other “account” lines. I’ll show before and after for the /etc/pam.d/common-account file:

Before:


account required pam_unix2.so

After:


account required pam_unix2.so
account required pam_access.so

OK, and here’s a before-and-after on a fictitious /etc/pam.d/sshd for a system on which the admin decided not to put pam_access.so in the common-account file:

Before:


auth requisite pam_nologin.so
auth include common-auth
account include common-account
password include common-password
session required pam_loginuid.so
session include common-session

After:


auth requisite pam_nologin.so
auth include common-auth
account include common-account
account required pam_access.so
password include common-password
session required pam_loginuid.so
session include common-session
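
A quick way to convince yourself it’s working: try logging in as someone who isn’t in any of the allowed groups or netgroups (the host and user names below are just placeholders) and then look for the pam_access denial in the syslog. The exact wording of the log line varies a bit between versions.


# From another machine, as a user who should be shut out:
ssh someoutsider@supib50 true

# Back on the restricted box, the denial should show up in the log:
grep pam_access /var/log/messages | tail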

So, the /etc/security/access.conf file has pretty good documentation in it – go read that. And if you want it straight from the horse’s mouth:

http://www.kernel.org/pub/linux/libs/pam/modules.html – Primary site for PAM distribution and documentation

Whoa – weird load ave and cpu freq reports from openSUSE 10.3

Just built a cluster of 25 Dell systems for our developers. These are Dell 1435SC systems, each with a pair of dual-core AMD Opteron 2218 CPUs and 8 GB of RAM. We installed openSUSE 10.3 on them all, added the Ganglia gmond, and also OFED, the successor to the OpenIB Infiniband stack.

Handed them off to our developers to certify and they came right back asking weird questions – like:

  • Why is the load always at least 1.00?
  • Are these *really* 1000 MHz CPUs?
  • How come the screen background color is always dark puce?

(Just kidding about that last one – Prasad!)

Sure enough. Uptime shows load on all 25 systems is always at least 1.00. Usually right there. And the cpu MHz in /proc/cpuinfo is almost always 1000. I saw it at 2600 for all four cores on one machine and the next time I looked, it had dropped to 1000 on all four cores.

Here’s part of the output of “cat /proc/cpuinfo” for the first proc, number 0:


vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2218
stepping : 3
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 2001.35
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

Here are the uptime-reported load aberrations:


for num in $(seq 50 74)
do
    echo -n supib$num
    ssh supib$num uptime
done 2>&1 | grep 'average'


supib50 12:03pm up 6 days 20:35, 3 users, load average: 2.18, 3.93, 3.99
supib51 12:03pm up 1 day 2:16, 0 users, load average: 1.99, 3.80, 3.82
supib52 12:03pm up 5 days 22:51, 0 users, load average: 1.00, 1.01, 1.31
supib53 12:03pm up 5 days 22:51, 0 users, load average: 1.00, 1.01, 1.31
supib54 12:03pm up 6 days 1:12, 0 users, load average: 1.00, 1.01, 1.32
supib55 12:03pm up 5 days 22:32, 0 users, load average: 1.00, 1.01, 1.31
supib56 12:03pm up 5 days 22:35, 0 users, load average: 1.00, 1.02, 1.32
supib57 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.01, 1.29
supib58 12:03pm up 6 days 0:54, 0 users, load average: 1.08, 1.02, 1.01
supib59 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib60 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib61 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib62 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib63 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib64 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib65 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.02, 1.00
supib66 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.01
supib67 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00
supib68 12:03pm up 6 days 0:54, 0 users, load average: 2.00, 2.00, 2.00
supib69 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.01
supib70 not reachable through the network
supib71 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.02
supib72 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.01
supib73 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.01
supib74 12:03pm up 6 days 0:54, 0 users, load average: 1.00, 1.00, 1.00

OK, so this is weird. Off to dig.
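
One place to start digging (a hunch at this point, not a conclusion): the kernel’s cpufreq scaling. An Opteron 2218 idling at 1000 MHz smells like a powersave-style governor clocking the cores down. The paths below are the standard sysfs cpufreq interface:


# Show the active cpufreq governor and current clock for each core.
for cpu in /sys/devices/system/cpu/cpu[0-9]*
do
    echo -n "$cpu: "
    cat $cpu/cpufreq/scaling_governor $cpu/cpufreq/scaling_cur_freq 2>/dev/null | tr '\n' ' '
    echo
done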

Spock?

If you haven’t seen Spock.com yet, go have a look. Pretty interesting.

Debian Etch: BADSIG A70DAF536070D3A1

Got this the other day when I tried to update the package lists on one of our Debian Etch servers:

mdiehn@imlcvs:~$ sudo aptitude update
Get:1 http://security.debian.org etch/updates Release.gpg [189B]
.
.
.
W: GPG error: http://security.debian.org etch/updates Release: The following
signatures were invalid: BADSIG A70DAF536070D3A1 Debian Archive Automatic
Signing Key (4.0/etch) <ftpmaster@debian.org>

W: Couldn't stat source package list http://security.debian.org etch/updates/main Packages
(/var/lib/apt/lists/security.debian.org_dists_etch_updates_main_binary-i386_Packages)
- stat (2 No such file or directory)

W: Couldn't stat source package list http://security.debian.org etch/updates/main Packages
(/var/lib/apt/lists/security.debian.org_dists_etch_updates_main_binary-i386_Packages)
- stat (2 No such file or directory)

W: You may want to run apt-get update to correct these problems

So, I went googling and found not too much. I did find two posts (see below) suggesting that I install a package named debian-archive-keyring. It was already on the system and I wondered if maybe it’d been corrupted somehow. So I reinstalled it and that seems to have solved the problem. Here’s what I used to do the reinstallation:

sudo apt-get install debian-archive-keyring --reinstall

Seems to have worked. Now, if only I can get through the congestion on our network here so I can download the new kernel and samba packages…. 🙂
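
If you want to double-check that the key really is in apt’s keyring after the reinstall, apt-key can list it (the grep pattern is just the key description from the warning above):


# Confirm the etch archive signing key is present
apt-key list | grep -B1 'Debian Archive Automatic Signing Key (4.0/etch)'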

Here are those two posts I referenced above:

  1. http://changelog.complete.org/posts/496-How-to-solve-The-following-packages-cannot-be-authenticated.html
  2. http://www.debianhelp.org/node/6150

The second was the real help, as it described troubles closest to my symptoms.

Learning the old ways.

I’m watching one of Ryan Carson’s videos in which he explains how he implements David Allen’s Getting Things Done. Good stuff, all of it! But. He just finished a solid minute on the rubber band on his paper todo lists and now he’s describing his pen, of all things, and I’m sort of chuckling about how much time he’s spending on the mundania. Then it strikes me – I’m so inculcated in the complexity of my life, complexity added on by, what? By my culture? I don’t know. But I’m so steeped in this tradition of complexity that I actually need to study this stuff. This stuff about how to make choices in favor of low-drag tools, a low-friction life. I have to learn to choose a simple pen instead of a nine-color, big fat thing that’s important to have because I might need all those colors one day. Wow.

I idealize farmers and carpenters, masons, barbers, etc. – tradesmen – of old in this sense: I think they all just knew how to make simple choices. Maybe their apparent simplicity came from having only limited choices. They didn’t have the 9-color pen versus the stub-pencil choice. They could only afford the stub pencil. Whatever, it’s idealistic of me and surely inaccurate, but I use that idealism to try to get a grasp on making choices these days in favor of simplicity.

But it’s hard, hard, hard sometimes. Hard to let go of the 9-color pens. Hard to trust myself that I’ll be okay with just the little pen. That I don’t need to carry around my PC tools all the time – if I need one, chances are it’ll be when I’m on a job and I’ll have known to bring them.

Fixing the "E: Dynamic MMap ran out of room" error.

Added a number of sources to /etc/apt/sources.list today looking for an update to bacula. When I ran apt-get update to have the system load the package lists from the remote servers, I got back this error:

Reading Package Lists…
E: Dynamic MMap ran out of room
E: Error occured while processing libvte9 (NewVersion1)
E: Problem with MergeList /var/lib/apt/lists/debian.lcs.mit.edu_debian_dists_unstable_main_binary-i386_Packages
E: The package lists or status file could not be parsed or opened.

Turns out this is a fairly innocuous error that simply means I need to raise the limit on the size of the cache that apt-get is using. Edit /etc/apt/apt.conf or create a new file in /etc/apt/apt.conf.d/ (I made a file – called it 30cachelimit) and put this line into it:

APT::Cache-Limit "20000000";

This tells apt that it may use 20 MB for the memory map (MMap) it builds of the package lists. That’s plenty.
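
If you want a one-liner instead of opening an editor, something like this amounts to the same thing (30cachelimit is just the name I picked; anything in that directory gets read):


# Create the drop-in and re-run the update (as root, or prefix with sudo):
echo 'APT::Cache-Limit "20000000";' > /etc/apt/apt.conf.d/30cachelimit
apt-get update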

Credit to Valery Dachev on whose blog I first found the solution. Thanks, Valery!

Up here, it ain’t spring until the black flies come.

So, roughly mid-April and no one is really surprised: big nasty storm yesterday. Got about four inches of wet snow. Then overnight the wind picked up, stripped the trees of any branches that had had the temerity to put out new leaves. And all that snow – it’s slush now. And wow – that’s a LOT of water in the ditches and creek.

The horses, who usually don’t care a whit about weather, are huddling in their shed. Even Carly. The dogs and I were out for about five or six minutes to – ahem – conduct business and they came back soaked and shivering. Towels and food.

So, it’s shaping up to be a regular spring here in the Upper Valley. No telling what we’ll have this week. Sunny and 70, anyone? Well, we won’t plant anything anyway. And you know what we always say: It ain’t spring until the black flies come.

Pressure makes me do dumb things.

Yesterday was a big day here at the Lab. We had high muckity mucks from around the country in to preview a new application and collaborate with the designers. I’ve only been here since last June and hadn’t been through the… uhmmm – the furor of getting ready for one of these meetings. People get really tense and everyone feels the pressure. Even calm, relaxed system administrators. (who?)

Since I’m the sysadmin, I was tasked with setting up the fifteen systems to be used in the demo and discussions – I did it, but had some trouble at one point and called loudly for help. I didn’t hear anything for a long time, went looking, found nearly everyone had gone home and so I resolved to go home to visit my family before I went back later to finish up. I called the woman managing the conference to ask when it would start in the morning and wondered if there was any flexibility in the schedule. There wasn’t so I let her know I’d be heading back in later in the evening.

So, quite a few key people knew at about 6:00 PM that I hadn’t gotten the systems ready yet and needed some help. Apparently, not everyone got the word that I’d be going back in later to work on the systems.

So, I went in at 9:00, figured out the problem by 9:15 (with some clues from colleagues) and by 9:30 was well on my way to finishing up. A colleague came in to help with the repetitive work and we eventually got the systems working, tested and ready.

But then, and here’s the boneheadedness, we just went home. Nope, I didn’t email the staff list, I didn’t call the coordinator or the presenter or our admin or anyone. I didn’t tell anyone that the trouble was resolved and the systems were ready.

So, in come the rest of the staff at 8:05 in the morning having last heard that the machines weren’t ready and that I’d just gone home and left it all broken. To their credit, no one panicked – they just showed up ready to move very quickly. Good thing, too, because there had been changes to the program overnight and our boss made a few last-minute adjustments to the setups (and one change in the last few seconds) and we four were all scrambling as it was. Very efficiently scrambling, but scrambling.

Lesson learned:

Somehow, anyhow, always remember to communicate your successes as well as your failures and KEEP YOUR TEAM IN THE LOOP!