Profiling and Tracing Processes in Linux

This article shows how a Linux process can be traced and profiled. Using the “last” command as an example, profiling is used to explain why a process was very slow, and why another, very similar, process (dump-utmp) was much faster. “Tracing” here means seeing what a process is doing at any moment. “Profiling” means showing (afterwards) how long it spent doing different things.

Last” is a Linux command that reads and summarizes the utmp file, where login records are stored. I had a “last” command taking hours to complete because the utmp file had grown large (1.9 Gb). Used strace to see what it was doing.

Pid was 14974:

[root@pluto log]# strace -p 14974
Process 14974 attached - interrupt to quit
_llseek(3, 1975549952, [1975549952], SEEK_SET) = 0
read(3, ""..., 16384) = 16384
_llseek(3, 1975533568, [1975533568], SEEK_SET) = 0
read(3, ""..., 16384) = 16384
write(1, "00:00)    nskapasi  vsftpd:8313 "..., 4096) = 4096
_llseek(3, 1975517184, [1975517184], SEEK_SET) = 0

Last processes the wtmp file in reverse, which is why it is calling llseek() all the time to move backwards in the file (by 16 k chunks). However this does not explain the slowness. The “last” process was maxing out one CPU, ie. it was CPU bound. Why so ?

Using strace with the -c flag to profile it:

# strace -c -p 14974

Process 14974 attached - interrupt to quit
Process 14974 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 46.85    0.000699          19        36           read
 45.17    0.000674          42        16           write
  7.10    0.000106           3        36           _llseek
  0.87    0.000013           7         2           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.001492                    90           total

The process is spending only about 7% of its time in llseek. Whatever is thrashing the CPU, it isn’t llseek.

There is another profiling command called ltrace. It is like strace but traces library calls rather than system calls. Used it thus. NB the PID changed because I restarted “last“.

# ltrace -p 15459
strncmp("vsftpd:24274", "vsftpd:10592", 32)           = 1
strncmp("vsftpd:24281", "vsftpd:10592", 32)           = 1
strncmp("vsftpd:24288", "vsftpd:10592", 32)           = 1
strncmp("vsftpd:24295", "vsftpd:10592", 32)           = 1
strncmp("vsftpd:24302", "vsftpd:10592", 32)           = 1
strncmp("vsftpd:24309", "vsftpd:10592", 32)           = 1

…and so on. The “last” process sure is making a lot of calls to strncmp().

ltrace offers profiling too, again with addition of a -c flag:

# ltrace -c -p 15459
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 99.92    5.655910          81     69222 strncmp
  0.06    0.003652          23       154 _IO_putc
  0.01    0.000288          24        12 memcpy
  0.00    0.000163          32         5 ctime
  0.00    0.000126          25         5 free
  0.00    0.000099          24         4 sprintf
  0.00    0.000098          24         4 malloc
  0.00    0.000096          24         4 strncat
  0.00    0.000052          26         2 snprintf
  0.00    0.000050          25         2 strcpy
  0.00    0.000026          26         1 printf
------ ----------- ----------- --------- --------------------
100.00    5.660560                 69415 total

The “last” process is spending nearly all of its time (99.92%) in strncmp(), a compute-intensive C library function for comparing 2 strings. Which explains why the last process was hogging a whole CPU and executing so slowly.

The wmtp file had grown so large because the system (an FTP server) was receiving up to 300 FTP connections per minute and logging them all.

Regarding wtmp, there exists another command called dump-utmp, which can process the whole 1.9 Gb utmp file in 2.5 minutes (“last” would take at least 3 hours, possibly many more, if left to complete. The full command was:

# dump-utmp ./wtmp > test.out

Using strace and ltrace on dump-utmp shows that the command processes the wtmp file forwards and so does not need to call lseek() all the time like “last“. Also it just seems to read and write the data, not bothering with comparing strings or other processing, and produces a much more “raw” report than the output of “last“.

So Linux contains some pretty cool tracing tools. The above was on Red Hat. A quick shufty indicates ltrace and strace are readily available in Fedora 16 and Debian Squeeze, for example.

Footnote

Truss offers strace-like functionality in Solaris. Use truss -p for tracing system calls, and truss -c -p for profiling. The options for tracing library calls are a little more complicated – see this example.

One thought on “Profiling and Tracing Processes in Linux

  1. Pingback: Linux Capabilities and Ping | Unix etc.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.