Trapflag-Tracing III:
Using the x86 trap flag to write a very very slow strace clone

In this third post in the series about the programming experiment to use the x86 trap flag and trap signal handler to observe the execution of a program, we’ll actually try to observe something useful: we’ll print every system call we encounter, just like the popular Linux tool Strace.

In the first post of this series, I showed how it is possible to set the x86 trap flag that will cause an interrupt after every instruction, and setup a signal handler to catch that interrupt, in order to observe the execution of a program from within itself.

In the second post, I explored how it is possible to use LD_PRELOAD to inject this trap-tracing into an arbitrary, already compiled program.

In this third post I want to actually observe something about the execution. In particular, I’ll print every system call on screen, like the Linux debugging tool Strace. I was excited when I first ran it and I had (almost) the same output as strace.

If you don’t know about Strace, I encourage you to read this great zine about Strace by Julia Evans. Julia is how I found out about it, and her writing about is great!

Strace will basically print every system call a program makes during execution. System calls is basically how programs communicate “with the outside world”, so seeing all of them provides an interesting way to debug programs, even compiled ones.

Strace is built upon the Linux ptrace functionality, which allows building debuggers by setting up a debugging process that can control and observe a debuggee process (see the first post post in this series for a more in-depth explanation of ptrace and more references). Strace uses the special PTRACE_SYSCALL functionality to step from one system call to the next. The execution in-between system calls proceeds normally, explaining why Strace is pretty fast — the debugger is only invoked at system call sites.

The trap flag tracing scheme doesn’t allow that: it will step through every single instruction. If we want to know when system calls happen, we have to check whether the current instruction is a system call.

The signal handler that we use to capture the trap at every instruction provides the cpu state as one of its arguments. This is not portable, it depends on the cpu architecture — I implemented it for 32-bit x86.

Using the cpu state we can check the value of registers, including the instruction pointer, which points at the current instruction. Reading the memory at the instruction pointer, we can figure out whether the current instructions are either syscall or int 80h, two ways to make system calls.

#include <signal.h>
#include <stdint.h>
#include "syscallprinter.h"


void trapHandler(int signo, siginfo_t *info, void *context) {
    ucontext_t *con = (ucontext_t *)context;
    uint8_t* eip = (uint8_t*)con->uc_mcontext.gregs[REG_EIP];
    if ((   eip[0] == 0xcd && eip[1] == 0x80) // int 0x80
        || (eip[0] == 0x0f && eip[1] == 0x34) // sysenter
        )  {
        int eax = con->uc_mcontext.gregs[REG_EAX];
        int ebx = con->uc_mcontext.gregs[REG_EBX];
        int ecx = con->uc_mcontext.gregs[REG_ECX];
        int edx = con->uc_mcontext.gregs[REG_EDX];
        int esi = con->uc_mcontext.gregs[REG_ESI];
        int edi = con->uc_mcontext.gregs[REG_EDI];
        printSyscall(eax, ebx, ecx, edx, esi, edi);
    }
}

This signal handler will execute at every instruction and will check whether it’s int 0x80 or sysenter, which both result in a system call. The context provided to the signal handler allows reading the various argument registers that the signal handler may use.

The printSyscall is a long function that simply checks which syscall is being called. Every system call has an associated number, which is stored in the eax register. I used the system call table provided Greg Oses syscall-table project. The c function is generated by a python script that produces a giant switch on the system call code (which is in eax):

#include "helper.h"
#include <unistd.h>

// generated by makesyscallprinter.py, to print syscalls

void printSyscall(long eax, long ebx, long ecx,
                  long edx, long esi, long edi) {
    switch (eax) {
        case 0x0: { // 0
            //sys_restart_syscall( void)
            writeStr("sys_restart_syscall(");
            writeHex(ebx);
            writeStr(")\n");
            break;
        }
        case 0x1: { // 1
            //sys_exit(int error_code)
            writeStr("sys_exit(");
            writeInt(ebx);
            writeStr(")\n");
            break;
        }

        ...

        case 0x151: { // 337
            //sys_recvmmsg(...)
            writeStr("sys_recvmmsg(");
            writeInt(ebx);
            writeStr(", ");
            writeHex(ecx);
            writeStr(", ");
            writeInt(edx);
            writeStr(", ");
            writeInt(esi);
            writeStr(", ");
            writeHex(edi);
            writeStr(")\n");
            break;
        }
        default: {
           ...
        }
    }
}

One thing you may note is those funny writeStr, writeInt and writeHex functions. Why note just use a printf?

The problem is that the printSyscall function executes within a signal handler, which will get called at every instruction — even in the middle of other calls to printf. Calling printf while printf is executing may result in problems, if the function uses memory cache the string to be printed. In order to make sure a function can be called within a signal handler, it has to be “async-signal-safe”.

This is the term the POSIX standard uses, I’ve also seen the term “reentrant” and “reentrant with respect to signal handlers”. I guess “reentrant” means you can “enter” the function while it’s already executing.

Most clib functions are not guaranteed to be async-signal-safe. A list of functions that are guaranteed to be usable within signal handlers will be shown at the end of the man page on signal handlers (man 7 signal). It includes the function write, which is really just a wrapper around the write syscall. I used that to build the simple functions that write strings and ints to stdout (declared in helper.h).

Does this actually Work?

To check whether this works, lets try a simple hello world example:

#include <stdio.h>

int main() {
  printf("hello\n");
  printf("world!\n");
}

Strace gives us the following output for this program:

>> strace ../blog/traptrace2/hello
execve("../blog/traptrace2/hello", ["../blog/traptrace2/hello"], [/* 72 vars */]) = 0
brk(NULL)                               = 0x965b000
uname({sysname="linux", nodename="MINI", ...}) = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb779e000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=168960, ...}) = 0
mmap2(NULL, 168960, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7774000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320\207\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1786484, ...}) = 0
mmap2(NULL, 1792540, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb75be000
mprotect(0xb776d000, 4096, PROT_NONE)   = 0
mmap2(0xb776e000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1af000) = 0xb776e000
mmap2(0xb7771000, 10780, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7771000
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77c2000
set_thread_area({entry_number:-1, base_addr:0xb77c2940, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0 (entry_number:6)
mprotect(0xb776e000, 8192, PROT_READ)   = 0
mprotect(0x8049000, 4096, PROT_READ)    = 0
mprotect(0xb77c3000, 4096, PROT_READ)   = 0
munmap(0xb7774000, 168960)              = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 5), ...}) = 0
brk(NULL)                               = 0x965b000
brk(0x967c000)                          = 0x967c000
write(1, "hello\n", 6hello
)                  = 6
write(1, "world!\n", 7world!
)                 = 7
exit_group(0)                           =

The trap flag strace gives the following output:

>> LD_PRELOAD=./override.so ../blog/traptrace2/hello
sys_fstat64(1, 0xbfdfa2b0)
sys_brk(0)
sys_brk(159363072)
sys_write(1, "hello\n", 0x00000006)
hello
sys_write(1, "world!\n", 0x00000007)
world!
sys_exit_group(0)

We see that Strace gives much more output. This makes sense, because it observes the program form the start (the first call is execve). The trap flag tracer only starts when the __libc_start_main gets called. After the fstat64(1, ..) call, the reported system calls are the same. Success!

One thing to note is that some of the code that the trap flag tracer doesn’t catch is the dynamic loading of library functions. This makes total sense, because the trap flag tracer itself has to be loaded as a dynamic library.

Let’s try a slightly more complicated example, let’s call cat on an empty file. Strace gives the following output:

>> strace cat empty.txt
execve("/bin/cat", ["cat", "empty.txt"], [/* 72 vars */]) = 0
brk(NULL)                               = 0x87dc000
uname({sysname="linux", nodename="MINI", ...}) = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77d4000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=168960, ...}) = 0
mmap2(NULL, 168960, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb77aa000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320\207\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1786484, ...}) = 0
mmap2(NULL, 1792540, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb75f4000
mprotect(0xb77a3000, 4096, PROT_NONE)   = 0
mmap2(0xb77a4000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1af000) = 0xb77a4000
mmap2(0xb77a7000, 10780, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb77a7000
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77f8000
set_thread_area({entry_number:-1, base_addr:0xb77f8940, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0 (entry_number:6)
mprotect(0xb77a4000, 8192, PROT_READ)   = 0
mprotect(0x8054000, 4096, PROT_READ)    = 0
mprotect(0xb77f9000, 4096, PROT_READ)   = 0
munmap(0xb77aa000, 168960)              = 0
brk(NULL)                               = 0x87dc000
brk(0x87fd000)                          = 0x87fd000
open("/usr/lib/locale/locale-archive", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=4570528, ...}) = 0
mmap2(NULL, 2097152, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb73f4000
mmap2(NULL, 323584, PROT_READ, MAP_PRIVATE, 3, 0x2cc000) = 0xb73a5000
mmap2(NULL, 4096, PROT_READ, MAP_PRIVATE, 3, 0x458000) = 0xb77d3000
close(3)                                = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 5), ...}) = 0
open("empty.txt", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0664, st_size=1, ...}) = 0
fadvise64_64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap2(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77b1000
read(3, "\n", 131072)                   = 1
write(1, "\n", 1
)                       = 1
read(3, "", 131072)                     = 0
munmap(0xb77b1000, 139264)              = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

Whereas the trap flag tracer gives the following output:

>> LD_PRELOAD=./override.so cat empty.txt
sys_brk(0)
sys_brk(159760384)
sys_open("/usr/lib/locale/locale-archive", 557056, -1217482752)
sys_fstat64(3, 0xb76ec040)
sys_mmap_pgoff(0, 2097152, 1, 2, 3, ...)
sys_mmap_pgoff(0, 323584, 1, 2, 3, ...)
sys_mmap_pgoff(0, 4096, 1, 2, 3, ...)
sys_close(3)
sys_fstat64(1, 0xbff4735c)
sys_open("empty.txt", 32768, 0)
sys_fstat64(3, 0xbff4735c)
sys_fadvise64_64(3, 0x00000000, 0x00000000, 0)
sys_mmap_pgoff(0, 139264, 3, 34, -1, ...)
sys_read(3, "", 0x00020000)
sys_write(1, "\n", 0x00000001)

sys_read(3, "\n", 0x00020000)
sys_munmap(-1217437696, 0x00022000)
sys_close(3)
sys_close(1)
sys_close(2)
sys_exit_group(0)

Again the output matches after a certain point. This shows that the trap flag tracing idea appears to work, it allows us to observe (almost) the complete execution of a program.

And the source code of the trap flag Strace (besides the syscall printer) is only a few tens of lines of code!

Thus I have an Strace-like tool/experiment, built without the help of ptrace, which runs at about 1/10,000th the speed of Strace!

See the source code at the current commit on github.