Working for the Community: hatools 2.14

In Portability, Reliability on 2010-03-16 at 08:36

This post is about my current work on release 2.14 of my most beloved OSS project: hatools. Just some observations, objectives, rants and an advertisement.

Scope

My scope for this release was rather limited: add -v switches to hatimerun and halockrun to make them more communicative.

The reason is simple: hatools will be easier to learn and use if they talk more. I also noticed that two “new” tools in the ever-growing zoo of locking tools (not to mention flock(1)) talk more than hatools do.

My original design goal for hatools was easy integration into shell scripts, so every error is reported via the exit code. I still believe in that goal because error handling should not have to parse error messages. UNIX commands should communicate errors to the caller in a way that allows easy handling in scripts.
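To illustrate why exit codes are easier to handle than messages, here is a minimal Python sketch of a caller that branches on a command's exit status alone. The status table is a hypothetical placeholder invented for this example, not the documented hatools exit codes, and fake_tool() is only a stand-in command.

```python
import subprocess
import sys

# Hypothetical exit-status table for illustration only; these numbers are
# NOT taken from the hatools documentation.
EXIT_MEANINGS = {
    0: "ok",
    1: "lock busy",
}

def classify(argv):
    """Run argv and map its exit status to a symbolic outcome.

    No output is parsed: the decision relies on the exit status alone.
    """
    result = subprocess.run(argv, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return EXIT_MEANINGS.get(result.returncode, "unexpected failure")

def fake_tool(status):
    """Build a stand-in command that simply exits with the given status."""
    return [sys.executable, "-c", f"import sys; sys.exit({status})"]

if __name__ == "__main__":
    for status in (0, 1, 42):
        print(status, "->", classify(fake_tool(status)))
```

The caller never looks at STDOUT or STDERR, so a verbose mode can be added later without breaking scripts like this one.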

However, I must admit that my implementation is a little harsh. Except for fatal errors, hatools don’t write anything to STDOUT or STDERR. In particular, “designed” errors such as “lock busy” don’t produce any message for the user. Today, almost nine years later, I question the missing verbosity of hatools for two reasons:

Just because the exit code contains all the information doesn’t mean that a message isn’t handy.
Steve Friedl’s lockrun.c has an option that causes a warning if the program takes longer than a specified timeout. He mentions that this is very handy in cron jobs, because cron e-mails that message.

I also believe that hatools have become very powerful and a little complex over the last few years, most notably through multiple occurrences of -t, -k and -e in hatimerun. A verbose mode will make debugging much easier.

So, here comes the story of why it took more than an hour to do it.

More Community

First of all, I moved the source code repository to GitHub after listening to Tim Pritlove’s (German) podcast on “Verteilte Versionskontrollsysteme” (distributed version control systems).

The Implementation

Although halockrun was quickly done, hatimerun challenged me a little bit.

Finally hatimerun got two verbose modes:

-v

Writes a message if a timeout has expired:

$ ./hatimerun -v -t 1 sleep 2
./hatimerun: process 9494 terminated on signal SIGKILL after 1s (sleep 2)

-vv

Writes a message on every timeout:

$ ./hatimerun -vv -t 1 -k hup -t 1 nohup sleep 3
nohup: appending output to `nohup.out'
./hatimerun: Timout #1 after 1s: sending signal SIGHUP to process group -9711 (nohup sleep 3)
./hatimerun: Timout #2 after 2s: sending signal SIGKILL to process group -9711 (nohup sleep 3)
./hatimerun: process 9711 terminated on signal SIGKILL after 2s (nohup sleep 3)

After years of silence, quite a lot of verbosity.

The “hard” part was mapping the signal number to the signal name. In previous releases I had already put a lot of effort into making halockrun -k accept symbolic signal names in a portable manner. That has been in place for many years and seems to work quite well, so it would be rather inappropriate to write bare numbers in the messages. The mapping took me quite a while and required a lot of testing because I touched the “portability layer”, which has three different variants.
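As a rough illustration of the two directions of that mapping, here is a short Python sketch. It leans on Python's standard signal module, which already knows the platform's signal table, so it does not reproduce the C portability layer in hatools. The function names are made up for this example.

```python
import signal

def signal_name(signum):
    """Return the symbolic name for a signal number, e.g. 9 -> 'SIGKILL'.

    Falls back to the plain number if this platform has no name for it.
    """
    try:
        return signal.Signals(signum).name
    except ValueError:
        return str(signum)

def signal_number(name):
    """Accept 'hup', 'HUP', 'SIGHUP' or a decimal string such as '9'."""
    text = name.strip().upper()
    if text.isdigit():
        return int(text)
    if not text.startswith("SIG"):
        text = "SIG" + text
    try:
        return signal.Signals[text].value
    except KeyError:
        raise ValueError(f"unknown signal: {name!r}") from None
```

The fallback to the bare number matters for portability: a number that is valid on one platform may simply have no name on another.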

Special thanks go to Vallo Kallaste and the guys at 25th-floor for testing. In the end, the release was tested on the following platforms:

Linux 2.6.22-15-server #1 SMP Wed Aug 20 19:08:24 UTC 2008 i686 GNU/Linux, with gcc and icc
FreeBSD 4.11-STABLE FreeBSD 4.11-STABLE #0: Thu Feb 12 08:04:00 GMT 2009
HP-UX B.11.11 U 9000/800 9000/800 1 HP-UX
SunOS 5.10 Generic_127111-02 sun4u sparc SUNW,UltraSPARC-IIi-cEngine
Darwin 8.11.1 Darwin Kernel Version 8.11.1: Wed Oct 10 18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386 i386 i386
AIX 5300-09, with xlc (C for AIX version 5.0.2.0) and gcc

Compatibility

Because the verbose mode was inspired by Steve Friedl’s lockrun, I checked again whether hatools can do what lockrun.c does. Although halockrun provides a very flexible timeout mechanism, it doesn’t support the same feature as --max-time in Steve’s lockrun. The focus of hatimerun is to kill the process after a while, whereas the --max-time switch in lockrun.c is just an error-reporting feature. I believe it is perfectly reasonable to get a warning if the program takes too long without killing it automatically.

halockrun cannot be used for that purpose because it doesn’t fork() and therefore cannot do anything after the child program has been started. hatimerun is the tool for timeouts in hatools. As it turned out, hatimerun has been able to “not send a signal” ever since the first release:

$ ./hatimerun -v -k 0 -t 1 sleep 2
./hatimerun: process 11957 terminated with status 0 after 2s (sleep 2)

The trick is to use “signal” zero, which is not a real signal. However, -k 0 is rather awkward and most people are not aware of its meaning, so I introduced the symbolic name NONE for that purpose. This allows you to implement a warning level:

$ ./hatimerun -v -t 1:00 -k NONE -t 1:00 -k KILL sleep 130

This waits for a minute (first -t 1:00), then sends no signal (-k NONE) but, because of -v, writes a warning at the end. After another minute (second -t 1:00) it kills the process (-k KILL).
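The idea of a warning stage followed by a kill stage can be sketched in a few lines of Python. This is an illustration of the concept only, not how hatimerun is implemented: the function name and the (seconds, signal) stage list are assumptions of this example, and unlike hatimerun it signals only the direct child rather than the process group.

```python
import signal
import subprocess
import sys
import time

def run_with_stages(argv, stages, log=sys.stderr):
    """Run argv; after each (seconds, sig) stage expires, warn and signal.

    sig=None plays the role of -k NONE: only the warning is written.
    Returns the child's exit status (negative if killed by a signal).
    """
    start = time.monotonic()
    child = subprocess.Popen(argv)
    for number, (seconds, sig) in enumerate(stages, start=1):
        try:
            return child.wait(timeout=seconds)
        except subprocess.TimeoutExpired:
            elapsed = time.monotonic() - start
            action = "no signal sent" if sig is None else f"sending {sig.name}"
            print(f"timeout #{number} after {elapsed:.0f}s: {action}", file=log)
            if sig is not None:
                child.send_signal(sig)
    # All stages used up: wait for the child to finish on its own.
    return child.wait()
```

A call such as run_with_stages(["sleep", "130"], [(60, None), (60, signal.SIGKILL)]) corresponds roughly to the command line above: warn after one minute, kill after two.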

Portability

Because I had already downloaded and tried Steve’s lockrun.c, I tried it together with halockrun. Unfortunately, they don’t work together at all: if a lock is held by lockrun, that doesn’t affect halockrun. The reason is that the two tools use different advisory locking mechanisms. While halockrun uses POSIX fcntl(2), lockrun uses BSD flock(2) or POSIX lockf(3), depending on the platform. Unsurprisingly, BSD flock() doesn’t care about POSIX locks. The Linux man page is quite clear about that:

Since kernel 2.0, flock() is implemented as a system call in its own right rather than being emulated in the GNU C library as a call to fcntl(2). This yields true BSD semantics: there is no interaction between the types of lock placed by flock() and fcntl(2), and flock() does not detect deadlock.

However, POSIX isn’t much better, as it doesn’t define the interaction of fcntl() and lockf():

The interaction between fcntl() and lockf() locks is unspecified.

AFAIK, most systems implement lockf() in terms of fcntl(). Still, there is no guarantee of that, and in the worst case a particular operating system has three different locking mechanisms. Special thanks to the “Portable Operating System Interface [for Unix]”, which explicitly pushes two incompatible variants. I suppose there was a good reason for that decision, but I am not aware of it.
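Whether flock() and fcntl() locks see each other on a given system can be probed with a small experiment. The Python sketch below holds an fcntl() lock and lets a separate process try both mechanisms; a second process is needed because fcntl() locks never conflict within the process that owns them. The helper names are made up for this example, and the result depends on the operating system and file system.

```python
import fcntl
import subprocess
import sys
import tempfile

# Child script: try a non-blocking exclusive lock with the named mechanism.
# Exit status 0 = lock acquired, 1 = lock refused.
CHILD = """
import fcntl, sys
path, mechanism = sys.argv[1], sys.argv[2]
lock = fcntl.flock if mechanism == "flock" else fcntl.lockf
with open(path, "r+") as f:
    try:
        lock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        sys.exit(1)
"""

def try_lock_elsewhere(path, mechanism):
    """Return True if a separate process could lock path via mechanism."""
    result = subprocess.run([sys.executable, "-c", CHILD, path, mechanism])
    return result.returncode == 0

def probe():
    """Hold an fcntl() lock and report which mechanisms still get a lock."""
    with tempfile.NamedTemporaryFile() as f:
        # fcntl.lockf() in Python is a wrapper around fcntl(2) record locks.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return {m: try_lock_elsewhere(f.name, m) for m in ("fcntl", "flock")}

if __name__ == "__main__":
    print(probe())
```

The "fcntl" entry is the control and should be False, since the lock is already taken. Going by the man page quoted above, a Linux system with a local file system should report True for "flock", meaning the second process got its lock despite the first one.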

However, halockrun will continue to use fcntl() because fcntl() can be queried for the PID that currently holds the lock. halockrun -t passes this feature on to you.
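The query in question is the F_GETLK command of fcntl(). The following Python sketch shows the idea; it is not the halockrun code. It assumes the struct flock layout of 64-bit Linux, which is why the format string is marked as an assumption, and the helper names are invented for this example.

```python
import fcntl
import os
import struct
import subprocess
import sys
import tempfile

# ASSUMPTION: struct flock layout of 64-bit Linux (l_type, l_whence,
# l_start, l_len, l_pid). Other systems order the fields differently.
FLOCK_FORMAT = "hhqqi"

# Child: take a write lock, announce it, then hold it until stdin closes.
HOLDER = """
import fcntl, sys
with open(sys.argv[1], "r+") as f:
    fcntl.lockf(f, fcntl.LOCK_EX)
    print("locked", flush=True)
    sys.stdin.read()
"""

def lock_holder_pid(fd):
    """Return the PID holding a conflicting fcntl() lock on fd, or None."""
    query = struct.pack(FLOCK_FORMAT, fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)
    answer = fcntl.fcntl(fd, fcntl.F_GETLK, query)
    l_type, _, _, _, l_pid = struct.unpack(FLOCK_FORMAT, answer)
    return None if l_type == fcntl.F_UNLCK else l_pid

def demo():
    """Start a lock holder and return (its real PID, the PID F_GETLK reports)."""
    with tempfile.NamedTemporaryFile() as f:
        with subprocess.Popen([sys.executable, "-c", HOLDER, f.name],
                              stdin=subprocess.PIPE,
                              stdout=subprocess.PIPE) as holder:
            holder.stdout.readline()   # block until the child holds the lock
            return holder.pid, lock_holder_pid(f.fileno())

if __name__ == "__main__":
    print(demo())
```

flock() has no counterpart to this query, which is the practical argument for staying with fcntl().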

As a poor man’s fix, I added a note about the incompatibility to the man page.

An Advertisement: It’s All About Details

You might wonder why I am writing all of this. The point is that I aim to make hatools a piece of quality software. That takes quite a lot of time because quality is about details.

The advertisement is that I am an independent Software Quality Consultant for non-functional issues like performance, reliability, maintainability, scalability and so on. Let me know if I can help you.

Markus Winand's Blog

Software Quality is Quality of Life