My scope for this release was rather limited: implement -v switches for halockrun and hatimerun to make them more communicative.
The reason for that is quite simple: hatools will be easier to learn and use if they talk more. I also noticed that there are two “new” tools in the ever-growing locking tool zoo (not to mention flock(1)) that talk more than hatools.
My original design goal for hatools was to allow easy integration into shell scripts, so every error is reported via the exit code. I still believe in that goal because error handling should not have to parse error messages. UNIX commands should communicate errors to the caller in a way that allows easy handling in scripts.
However, I must admit that my implementation is a little bit harsh. Except for fatal errors, hatools don’t write anything to STDOUT or STDERR. In particular, “designed” errors such as “lock busy” don’t produce a message for the user. Today, almost 9 years later, I wonder about the missing verbosity of hatools for two reasons:
- Just because the exit code contains all the information doesn’t mean that a message isn’t handy.
- Steve Friedl’s lockrun.c has an option that causes a warning if the program takes longer than a specified timeout. He mentions that this is very handy in cron jobs, because cron e-mails that message.
I also believe that hatools have become very powerful and a little complex in the last few years, most notably through the multiple occurrences of options in hatimerun. A verbose mode will make debugging much easier.
So, here comes the story of why it took more than an hour to do it.
First of all, I moved the source code repository to GitHub after listening to Tim Pritlove’s (German) podcast on “Verteilte Versionskontrollsysteme” (distributed version control systems).
halockrun was quickly done, but hatimerun challenged me a little bit.
hatimerun got two verbose modes:
-v writes a message if a timeout has passed:
$ ./hatimerun -v -t 1 sleep 2
./hatimerun: process 9494 terminated on signal SIGKILL after 1s (sleep 2)
-vv writes a message on every timeout:
$ ./hatimerun -vv -t 1 -k hup -t 1 nohup sleep 3
nohup: appending output to `nohup.out'
./hatimerun: Timout #1 after 1s: sending signal SIGHUP to process group -9711 (nohup sleep 3)
./hatimerun: Timout #2 after 2s: sending signal SIGKILL to process group -9711 (nohup sleep 3)
./hatimerun: process 9711 terminated on signal SIGKILL after 2s (nohup sleep 3)
After years of silence, quite a lot of verbosity.
The “hard” part was to map the signal number to the signal name. In previous releases I had already put a lot of effort into making halockrun -k accept symbolic signal names in a portable manner. That has been there for many years and seems to work quite well, so it would be rather inappropriate to write numbers in the messages. The mapping took me quite a while and required a lot of testing because I touched the “portability layer” that has three different variants.
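To illustrate what such a number-to-name mapping can look like, here is a minimal sketch. It is not the hatools portability layer: the names sig_table and signal_name are made up for this example, and the approach (a static table with #ifdef guards around non-POSIX signals) is just one of several ways to do it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <stdio.h>

/* One table entry: a signal number and its symbolic name. */
struct sig_entry {
    int number;
    const char *name;
};

/* Signals required by POSIX are listed unconditionally. Everything else
 * is guarded so that the table still compiles where a signal is missing. */
static const struct sig_entry sig_table[] = {
    { SIGHUP,  "SIGHUP"  },
    { SIGINT,  "SIGINT"  },
    { SIGQUIT, "SIGQUIT" },
    { SIGABRT, "SIGABRT" },
    { SIGKILL, "SIGKILL" },
    { SIGALRM, "SIGALRM" },
    { SIGTERM, "SIGTERM" },
    { SIGUSR1, "SIGUSR1" },
    { SIGUSR2, "SIGUSR2" },
    { SIGSTOP, "SIGSTOP" },
    { SIGCONT, "SIGCONT" },
#ifdef SIGWINCH
    { SIGWINCH, "SIGWINCH" },
#endif
#ifdef SIGPWR
    { SIGPWR, "SIGPWR" },
#endif
};

/* Writes the symbolic name of signo into buf and returns buf.
 * Unknown signals fall back to their decimal number. */
const char *signal_name(int signo, char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < sizeof sig_table / sizeof sig_table[0]; i++) {
        if (sig_table[i].number == signo) {
            snprintf(buf, len, "%s", sig_table[i].name);
            return buf;
        }
    }
    snprintf(buf, len, "%d", signo);
    return buf;
}
```

A caller would pass a small buffer, for example signal_name(SIGKILL, buf, sizeof buf), and print the result in the message instead of the raw number.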
Special thanks go to Vallo Kallaste and the guys at 25th-floor for testing. After all, the release was tested on the following platforms:
- Linux 2.6.22-15-server #1 SMP Wed Aug 20 19:08:24 UTC 2008 i686 GNU/Linux—with gcc and icc
- FreeBSD 4.11-STABLE FreeBSD 4.11-STABLE #0: Thu Feb 12 08:04:00 GMT 2009
- HP-UX B.11.11 U 9000/800 9000/800 1 HP-UX
- SunOS 5.10 Generic_127111-02 sun4u sparc SUNW,UltraSPARC-IIi-cEngine
- Darwin 8.11.1 Darwin Kernel Version 8.11.1: Wed Oct 10 18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386 i386 i386
- AIX 5300-09, with xlc (C for AIX version 18.104.22.168) and gcc
Because the verbose mode was inspired by Steve Friedl’s lockrun, I checked again whether hatools can do what lockrun.c does. Although halockrun provides a very flexible timeout mechanism, it doesn’t support the same feature as --max-time in Steve’s lockrun. The focus of hatimerun is to kill the process after a while, whereas the --max-time switch in lockrun.c is just an error-reporting feature. Well, I believe it is perfectly reasonable to have a warning if the program takes too long, but not to kill it automatically.
halockrun cannot be used for that purpose because it doesn’t fork() and therefore cannot do anything after the child program has been started.
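The following sketch shows why fork() matters here: only a parent that stays around can notice that the child is taking too long. It is neither the hatimerun nor the lockrun.c implementation; the function run_with_warning, its polling loop, and the wording of the warning are assumptions made for this example. The child is never signalled, the parent only complains once.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Seconds elapsed since *start, measured on the monotonic clock. */
static double seconds_since(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (double)(now.tv_sec - start->tv_sec)
         + (double)(now.tv_nsec - start->tv_nsec) / 1e9;
}

/* Runs the program argv[0] with arguments argv and waits for it.
 * Writes one warning to stderr if it runs longer than max_seconds,
 * but never sends a signal. Sets *warned to 1 if the warning was
 * written. Returns the raw wait status, or -1 on error. */
int run_with_warning(char *const argv[], double max_seconds, int *warned)
{
    const struct timespec tick = { 0, 20 * 1000 * 1000 }; /* poll every 20 ms */
    struct timespec start;
    int status = 0;
    pid_t pid, done;

    *warned = 0;
    clock_gettime(CLOCK_MONOTONIC, &start);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                /* exec failed */
    }

    for (;;) {
        done = waitpid(pid, &status, WNOHANG);
        if (done == pid)
            return status;         /* child has finished */
        if (done < 0)
            return -1;
        if (!*warned && seconds_since(&start) > max_seconds) {
            fprintf(stderr, "warning: process %ld still running after %.1fs (%s)\n",
                    (long)pid, max_seconds, argv[0]);
            *warned = 1;
        }
        nanosleep(&tick, NULL);
    }
}
```

Polling with WNOHANG keeps the sketch short; a real tool would more likely block in waitpid() and use a timer signal for the deadline.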
hatimerun is the tool for timeouts in hatools. As it turned out, hatimerun could “not send a signal” ever since the first release:
$ ./hatimerun -v -k 0 -t 1 sleep 2
./hatimerun: process 11957 terminated with status 0 after 2s (sleep 2)
The trick is to use “signal” zero; that is, not a real signal! However, -k 0 is rather awkward and most people are not aware of its meaning. So I introduced the symbolic name NONE for that purpose. This allows you to implement a warning level:
$ ./hatimerun -v -t 1:00 -k NONE -t 1:00 -k KILL sleep 130
This will wait for a minute (first -t 1:00), then do nothing (-k NONE) but write a warning in the end (-v). After another minute (second -t 1:00), it will kill the process (-k KILL).
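The reason signal zero works as a “do nothing” action is that POSIX kill() with signal number 0 only performs its error checking and delivers nothing. The sketch below shows how a symbolic name such as NONE could be mapped onto that. The functions parse_signal and deliver and the short name table are hypothetical and not taken from the hatools source; the case-insensitive comparison is an assumption based on the -k hup and -k KILL examples above.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <strings.h>    /* strcasecmp() */
#include <sys/types.h>

/* Translates a symbolic name into a signal number, ignoring case.
 * "NONE" maps to 0, the null signal. Returns -1 for unknown names. */
int parse_signal(const char *name)
{
    static const struct {
        const char *name;
        int number;
    } names[] = {
        { "NONE", 0       },
        { "HUP",  SIGHUP  },
        { "INT",  SIGINT  },
        { "TERM", SIGTERM },
        { "KILL", SIGKILL },
    };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (strcasecmp(name, names[i].name) == 0)
            return names[i].number;
    }
    return -1;
}

/* Sends signo to pid. With signo 0, kill() only checks whether pid
 * could be signalled; the target process notices nothing. */
int deliver(pid_t pid, int signo)
{
    return kill(pid, signo);
}
```

With this, deliver(pid, parse_signal("NONE")) leaves the process untouched, while the same call with "KILL" terminates it.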
Because I had already downloaded and tried Steve’s lockrun.c, I tried it together with halockrun. Unfortunately, they don’t work together at all. That means that if a lock is held by lockrun, it doesn’t affect halockrun. The reason is that the two tools use different advisory locking mechanisms. While halockrun uses POSIX fcntl(2), lockrun takes BSD flock(2) or POSIX lockf(3), depending on the platform. No surprise, the BSD flock() doesn’t care about POSIX locks. The Linux manpage is quite clear about that:
“Since kernel 2.0, flock() is implemented as a system call in its own right rather than being emulated in the GNU C library as a call to fcntl(2). This yields true BSD semantics: there is no interaction between the types of lock placed by flock() and fcntl(2), and flock() does not detect deadlock.”
However, POSIX isn’t much better, as it doesn’t define the interaction of fcntl() and lockf():
“The interaction between fcntl() and lockf() locks is unspecified.”
AFAIK, most systems implement lockf() in terms of fcntl(). Still, there is no guarantee for that, and the worst case is that a particular operating system has three different locking mechanisms. Special thanks to the “Portable Operating System Interface [for Unix]” that explicitly pushes two incompatible variants. I suppose there was a good reason for that decision, but I am not aware of it.
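The missing interaction between the two mechanisms can be observed with a small experiment. The sketch below is only an illustration: the function name flock_ignores_fcntl and the pipe-based handshake are made up for this example. A child process takes an exclusive fcntl() lock, then the parent tries a non-blocking flock() on the same file. The outcome depends on the platform and the file system, so the function reports what it saw rather than assuming a result.

```c
#define _DEFAULT_SOURCE 1   /* exposes flock() on glibc */
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if flock() succeeded although another process holds an
 * fcntl() lock on path (the mechanisms ignore each other), 0 if
 * flock() was refused, and -1 on any other error. */
int flock_ignores_fcntl(const char *path)
{
    int ready[2], release[2], fd, result = -1;
    char byte = 0;
    pid_t child;

    if (pipe(ready) < 0 || pipe(release) < 0)
        return -1;
    child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: take an exclusive fcntl() lock on the whole file. */
        struct flock fl = { 0 };

        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;     /* l_start = 0, l_len = 0: whole file */
        close(ready[0]);
        close(release[1]);
        fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0 || fcntl(fd, F_SETLK, &fl) < 0)
            _exit(1);
        if (write(ready[1], &byte, 1) != 1)   /* tell the parent: lock held */
            _exit(1);
        if (read(release[0], &byte, 1) < 0)   /* hold it until the parent is done */
            _exit(1);
        _exit(0);
    }

    /* Parent: wait for the child's lock, then try flock(). */
    close(ready[1]);
    close(release[0]);
    if (read(ready[0], &byte, 1) == 1) {
        fd = open(path, O_RDWR);
        if (fd >= 0) {
            if (flock(fd, LOCK_EX | LOCK_NB) == 0)
                result = 1;
            else if (errno == EWOULDBLOCK)
                result = 0;
            close(fd);
        }
    }
    close(release[1]);              /* end of file lets the child exit */
    close(ready[0]);
    waitpid(child, NULL, 0);
    return result;
}
```

Going by the manpage quoted above, a Linux system with a local file system should return 1 here; systems that implement flock() on top of fcntl() would return 0.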
halockrun will continue to use fcntl() because it can be queried for the PID that currently holds the lock. halockrun -t passes this feature on to you.
As a poor man’s fix, I added a note about the incompatibility to the man page.
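For reference, the PID query that fcntl() offers is the F_GETLK command. The following is a minimal sketch of such a query, not the code behind halockrun -t; the function name lock_holder is made up for this example. Note that F_GETLK only reports locks held by other processes, never those of the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the PID of a process whose fcntl() lock would block an
 * exclusive lock on the whole file, 0 if there is no such lock,
 * or -1 on error. */
pid_t lock_holder(int fd)
{
    struct flock probe = { 0 };

    probe.l_type = F_WRLCK;     /* ask: could I get a write lock ... */
    probe.l_whence = SEEK_SET;
    probe.l_start = 0;
    probe.l_len = 0;            /* ... on the whole file? (0 = up to EOF) */

    if (fcntl(fd, F_GETLK, &probe) < 0)
        return -1;
    if (probe.l_type == F_UNLCK)
        return 0;               /* nobody is in the way */
    return probe.l_pid;         /* PID of one conflicting lock holder */
}
```

Neither flock() nor lockf() has a comparable query that returns the holder’s PID.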
An Advertisement: It’s All About Details
You might wonder why I am writing all of this. The point is that I aim to make hatools a piece of quality software. That takes quite a lot of time because quality is about details.
The advertisement is that I am an independent Software Quality Consultant for non-functional issues like performance, reliability, maintainability, scalability and so on. Let me know if I can help you.