Using truss and nawk to Trace Solaris Processes
Here’s the problem I ran into today on a Solaris 11 SPARC server: an application UI would occasionally freeze for a few seconds when executing a particular function. Nobody could quite figure out where the delay was coming from. There were just too many variables (application, JVM, network, remote database), and we needed to narrow them down.
Running truss on the application UI instance showed that it was waiting in “waitid(P_PID …” for a child process to complete. Unfortunately, the delay, however annoying, was not long enough for me to see which process had that PID: the child exited too quickly. The solution was to run “tail -f” on the truss output and pipe it through nawk, using nawk’s system() function to act on each match.
Here’s a quick example:
Start the process of interest under truss, writing the trace to a file:
truss -o /tmp/truss.out start_application
Put a tail on the truss output file and pipe it through nawk, looking for “waitid” (or whatever else interests you). With a comma as the field separator, the PID is the second field ($2) of the waitid line. Use nawk’s system() function to run ps and grep for the mystery PID:
tail -f /tmp/truss.out | nawk -F',' '/^waitid/ {system("ps -ef | grep "$2" | grep -v grep")}'
Alternatively, you can run another truss on the child PID:
tail -f /tmp/truss.out | nawk -F',' '/^waitid/ {system("truss -p "$2)}'
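Both one-liners depend on the PID being the second comma-separated field of the waitid line. Here is a minimal sketch of that extraction on its own, so it can be checked against a sample line before pointing it at a live trace. The helper name extract_waitid_pid and the sample line are made up for illustration (the line follows the usual truss layout but was not captured from a real system), and the fallback to plain awk is only there so the sketch also runs on systems without nawk:

```shell
#!/bin/sh
# Prefer nawk (Solaris); fall back to awk where nawk is not installed.
AWK=$(command -v nawk || command -v awk)

# extract_waitid_pid: hypothetical helper. Reads truss output on stdin and
# prints the PID argument of every waitid(P_PID, ...) line, one per line.
extract_waitid_pid() {
    "$AWK" -F',' '/^waitid\(P_PID/ {
        pid = $2
        gsub(/[^0-9]/, "", pid)   # splitting on "," leaves a leading space
        if (pid != "") print pid  # skip lines where no number was found
    }'
}

# Illustrative truss line, not taken from a real trace.
sample='waitid(P_PID, 1234, 0xFFBFE8A0, WEXITED|WTRAPPED) = 0'
printf '%s\n' "$sample" | extract_waitid_pid   # prints 1234
```

In the live pipeline, the print could be swapped for something like system("ps -fp " pid), which asks ps for that one PID instead of grepping the whole process list for a matching string. Keeping the call inside nawk, rather than piping the PIDs on to another command, avoids waiting for nawk's output buffer to fill.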
Of course, on Solaris 10+ you can also use DTrace, but that would make too much sense. Good hunting!