bryan garza
These Magic Code Sequences Will Rot Your Brain 2023-02-28

Table of contents

  1. Introduction
  2. Diving into the C
  3. Breaking down x86-64 inline assembly
  4. Rustaceans, assemble!
  5. Struggles and takeaways
  6. What's next?

Introduction

In this post, I walk you through how Valgrind actually works under the hood, the strange low-level hacks I uncovered along the way, as well as my experience porting some of that code into Rust.

Archeologist hat, brush, and trowel, in a dinosaur dig site

Recently, Felix and I have been exploring the idea of a new tool1 to detect undefined behavior in Rust code. This new tool is meant to co-exist alongside Miri, for cases where using Miri is not possible, such as when linking to arbitrary C libraries.

Note: This entire project is still in the exploration phase; it's possible we might discard this approach and try something different depending on how it performs, or on feedback from others working on the compiler and Rust's operational semantics.

The first step in our journey was to come up with a basic architecture and proof-of-concept, and we settled on a solution involving Valgrind. We would write a new Valgrind tool to track certain information about a Rust program as it's running, such as what data is being borrowed, what data is being dereferenced, etc. In case you're not familiar with Valgrind, it's a framework for dynamic analysis tools; it provides:

  1. A suite of seven tools to check for memory errors (memcheck), do cache profiling (cachegrind), detect data races (helgrind), etc.
  2. A C library with the necessary functions to create new tools

Diving into the C

Felix created a C demo program which is able to talk to a custom Valgrind tool (krabcake). From there, I created an equivalent Rust demo program, which involved making sense of Valgrind's source and translating some of that as well (which is when things got hairy). For our demo, we decided to keep Stacked Borrows in mind (an operational semantics that is integrated into the Rust compiler) and see if we could expose to our Valgrind tool the sort of information that SB might need. In Felix's demo, we have this code:

char val = (char)1;
char *x = KC_BORROW_MUT(char, val); // x = &mut val;

The idea is to model what it's like to get an exclusive reference in Rust; the KC_BORROW_MUT macro takes the value, and returns a reference to it. When the binary is run under Valgrind using our new tool and the second line is executed, the tool receives a "client request". Otherwise, when the binary is run standalone, the Valgrind-related logic is a no-op.

Here's the output originating from our Valgrind tool:

kc_handle_client_request, handle BORROW_MUT 1ffeffffee 91 92 93 94

The KC_BORROW_MUT macro is a utility that Felix created to call into the Valgrind-provided macros. Let's take a look at the definition:

#include "valgrind.h"

// encodes &mut PLACE, supplying pointee type first to placate cc
#define KC_BORROW_MUT(_type, _place)                              \
    ((_type*) VALGRIND_DO_CLIENT_REQUEST_EXPR(                    \
    &(_place), VG_USERREQ__BORROW_MUT, &(_place),                 \
        0x91, 0x92, 0x93, 0x94))

VG_USERREQ__BORROW_MUT is just a numeric constant (equal to 0x1ffeffffee), so I've omitted its definition for brevity. Let's try to understand what is happening here. This macro accepts a type and a place, and calls the VALGRIND_DO_CLIENT_REQUEST_EXPR macro, passing in:

  1. The address of _place, as the default value of the whole expression (what you get when the binary is not running under Valgrind)
  2. The request code VG_USERREQ__BORROW_MUT
  3. The address of _place again, this time as the first request argument

The remaining parameters won't be used for our demo, but I changed them to unique numbers for debugging purposes. We also cast the result to a pointer of the type that was passed in.

Breaking down x86-64 inline assembly

Now let's take a look at the definition of the VALGRIND_DO_CLIENT_REQUEST_EXPR macro. This is where things get really interesting.

Note: I'm focusing on x86-64, aka AMD64.

/* The following defines the magic code sequences which the JITter
   spots and handles magically.  Don't look too closely at them as
   they will rot your brain. */

#define __SPECIAL_INSTRUCTION_PREAMBLE                            \
                     "rolq $3,  %%rdi ; rolq $13, %%rdi\n\t"      \
                     "rolq $61, %%rdi ; rolq $51, %%rdi\n\t"

#define VALGRIND_DO_CLIENT_REQUEST_EXPR(                          \
        _zzq_default, _zzq_request,                               \
        _zzq_arg1, _zzq_arg2, _zzq_arg3, _zzq_arg4, _zzq_arg5)    \
    __extension__                                                 \
    ({ volatile unsigned long int _zzq_args[6];                   \
    volatile unsigned long int _zzq_result;                       \
    _zzq_args[0] = (unsigned long int)(_zzq_request);             \
    _zzq_args[1] = (unsigned long int)(_zzq_arg1);                \
    _zzq_args[2] = (unsigned long int)(_zzq_arg2);                \
    _zzq_args[3] = (unsigned long int)(_zzq_arg3);                \
    _zzq_args[4] = (unsigned long int)(_zzq_arg4);                \
    _zzq_args[5] = (unsigned long int)(_zzq_arg5);                \
    __asm__ volatile(__SPECIAL_INSTRUCTION_PREAMBLE               \
                     /* %RDX = client_request ( %RAX ) */         \
                     "xchgq %%rbx,%%rbx"                          \
                     : "=d" (_zzq_result)                         \
                     : "a" (&_zzq_args[0]), "0" (_zzq_default)    \
                     : "cc", "memory"                             \
                    );                                            \
    _zzq_result;                                                  \
    })

The first thing to note is the comment 😂 which inspired the title of this post.

The VALGRIND_DO_CLIENT_REQUEST_EXPR macro creates an array with 6 slots, and a variable to hold the result to return at the end. We then take all the parameters that are passed into the macro, and stuff them into the array. Lastly, we run some inline assembly.

The first instructions that execute are the ones in the previous macro, __SPECIAL_INSTRUCTION_PREAMBLE.

rolq $3,  %rdi
rolq $13, %rdi
rolq $61, %rdi
rolq $51, %rdi

This is using AT&T syntax, and the easiest way to tell2 is that constants are prefixed with $, and registers with %.

The rolq instruction means rotate left (the q is for quadword). So, these 4 instructions collectively rotate the rdi register 128 bits to the left, meaning the entire 64-bit value is rotated around exactly twice and ends up unchanged. The only purpose of these instructions is to signal to Valgrind that we are about to include a client request in the next assembly instruction.
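
Since the rotation amounts add up to 128, you can convince yourself of this with plain Rust integer arithmetic; here's a small, self-contained sketch (just an illustration, not part of the demo):

fn main() {
    // The four rotation amounts from __SPECIAL_INSTRUCTION_PREAMBLE.
    let original: u64 = 0xdead_beef_cafe_f00d;
    let rotated = original
        .rotate_left(3)
        .rotate_left(13)
        .rotate_left(61)
        .rotate_left(51);
    // 3 + 13 + 61 + 51 = 128 = 2 * 64, so the value ends up unchanged,
    // which is exactly what happens to rdi.
    assert_eq!(rotated, original);
}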

And... this would be a good time to share a diagram. What happens when you run a binary under Valgrind? Valgrind has its own JIT compiler, named VEX, and the binary is run under it. Each assembly instruction is converted to VEX IR before being executed, so that the differences between all the architectures Valgrind supports can be abstracted away.

sequenceDiagram
    participant jit as VEX JITter
    participant tool as Custom Valgrind Tool
    participant term as Terminal
    loop each assembly instruction
        jit->>jit: disassemble into VEX IR
        alt is magic code sequence for VALGRIND_DO_CLIENT_REQUEST
            jit->>tool: handle client request
            tool->>tool: update internal state
            opt
                tool->>term: Print to stdout
            end
        else is regular code sequence
            jit->>jit: execute IR
        end
    end

Each time the __SPECIAL_INSTRUCTION_PREAMBLE is encountered by the JITter, it has to check the following instruction, because there are 3 different actions that can be taken, one of which is to make the Client Request. For the Client Request, the instruction is

xchgq %rbx, %rbx

This exchanges the contents of its two operands. Normally, two different registers would be listed, but since both operands here are rbx, it's just a no-op when the binary is not running under Valgrind.

After the assembly come the operands, first the outputs, then the inputs. This is how you connect the variables in C to the registers in assembly.

So, before the inline assembly is executed, we'll have this:

&_zzq_args[0] -> rax
_zzq_default -> rdx

...and after the inline assembly is executed:

rdx -> _zzq_result

You can see how this would allow the code to work when not run under Valgrind; all the assembly instructions would be no-ops, and the default would simply go into and out of rdx.

For the last line, we have two "clobbers"3:

  1. cc, which tells the compiler that the flags register may be modified
  2. memory, which tells the compiler that memory may be read or written in ways not described by the operands

Now that we understand everything that happens in the inline assembly, we can take a look at some of the code in the demo Valgrind tool to see the whole thing come together:

static Bool kc_handle_client_request( ThreadId tid, UWord* arg, UWord* ret )
{
    Int   i;
    Addr  bad_addr;
    Bool  handled = False;

    if (!VG_IS_TOOL_USERREQ('K','C',arg[0]))
       return False;

    switch(arg[0]) {
    case VG_USERREQ__BORROW_MUT: {
       VG_(dmsg)("kc_handle_client_request, handle BORROW_MUT %llx %llx %llx %llx %llx\n",
                 arg[1], arg[2], arg[3], arg[4], arg[5]);
       *ret = arg[1];
       handled = True;
    }

    ...

This is the function that is called when the JITter comes upon a Client Request. The relevant parameters are:

  1. tid, the thread that issued the request
  2. arg, a pointer to the argument array we filled in the macro (arg[0] is the request code, arg[1] through arg[5] are the arguments)
  3. ret, which the tool writes the return value into; that value ends up in _zzq_result back in the client program

After checking that we are indeed receiving a request code for something that falls under the purview of this specific Valgrind tool, we switch on that request code to see which of the various types of requests is being made. For our demo purposes, Felix just made the code print out the parameters and set the first argument, arg[1], as the return value (which, if you look all the way back at the KC_BORROW_MUT macro, is the address of the _place variable).

Rustaceans, assemble!

At this point, we have a good grasp on both the C and the inline assembly, and it's time for some Rust! First, translating the KC_BORROW_MUT macro is straightforward.

#[macro_export]
macro_rules! kc_borrow_mut {
    ( $data:expr ) => {
        {
            let place = ptr::addr_of_mut!($data);
            valgrind_do_client_request_expr!(
                place,
                VgKrabcakeClientRequest::BorrowMut,
                place,
                0x91, 0x92, 0x93, 0x94
            )
        }
    }
}

Why did I make this a macro and not a regular function? Because I want the address of the variable itself; with a regular function, I would be taking the address of the function argument (a copy of the value) instead.
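
To make that concrete, here's a small, hypothetical illustration (not from the krabcake code) of the difference between taking the address inside a function and inside a macro:

use std::ptr;

// Because `data` is passed by value, this returns the address of the
// function's local copy, not of the caller's variable.
fn addr_of_arg(mut data: u8) -> *mut u8 {
    ptr::addr_of_mut!(data)
}

// A macro expands at the call site, so it takes the address of the
// caller's variable itself.
macro_rules! addr_of_place {
    ( $data:expr ) => {
        ptr::addr_of_mut!($data)
    };
}

fn main() {
    let mut val: u8 = 1;
    println!("copy inside fn: {:p}", addr_of_arg(val));
    println!("actual place:   {:p}", addr_of_place!(val));
}

The two addresses generally differ, and the second one is the address that kc_borrow_mut! actually needs.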

Next, we handle the VALGRIND_DO_CLIENT_REQUEST_EXPR macro:

#[repr(C)]
struct Data<T> {
    request_code: u64,
    arg1: *mut T,
    arg2: u64,
    arg3: u64,
    arg4: u64,
    arg5: u64,
}

#[macro_export]
macro_rules! valgrind_do_client_request_expr {
    ( $zzq_default:expr, $request_code:expr,
    $arg1:expr, $arg2:expr, $arg3:expr, $arg4:expr, $arg5:expr ) => {
        {
            let zzq_args = Data {
                request_code: $request_code as u64,
                arg1: $arg1,
                arg2: $arg2,
                arg3: $arg3,
                arg4: $arg4,
                arg5: $arg5,
            };
            let mut zzq_result = $zzq_default;

            unsafe {
                asm!(
                    "rol rdi, 3",
                    "rol rdi, 13",
                    "rol rdi, 61",
                    "rol rdi, 51",
                    "xchg rbx, rbx",
                    inout("di") zzq_result,
                    in("ax") &zzq_args,
                );
            }
            zzq_result
        }
    }
}

I created the Data struct so that I could make arg1 a raw pointer, instead of having to make the entire array u64 like it was in the C code.
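
As a sanity check of my own (not from the demo code): with #[repr(C)] and every field one machine word wide, the struct has the same layout as the C array of six unsigned longs on a 64-bit target, so the Valgrind side can still read it as arg[0] through arg[5]:

use std::mem::{align_of, size_of};

#[repr(C)]
struct Data<T> {
    request_code: u64,
    arg1: *mut T,
    arg2: u64,
    arg3: u64,
    arg4: u64,
    arg5: u64,
}

fn main() {
    // On x86-64, pointers and u64 are both 8 bytes, so the struct is
    // exactly six words with no padding, matching the C array layout.
    assert_eq!(size_of::<Data<u8>>(), 6 * size_of::<u64>());
    assert_eq!(align_of::<Data<u8>>(), align_of::<u64>());
}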

Now, let's talk through the Rust inline assembly. The five instructions should be familiar, though they look a bit different.

rol rdi, 3
rol rdi, 13
rol rdi, 61
rol rdi, 51
xchg rbx, rbx

Here I opted to use Intel syntax, which these days seems to be the syntax of choice for many, especially for its improved readability. In fact, I came across this HN comment regarding the choice of having Intel syntax be the default for inline assembly in Rust:

Yes, thank you for the Intel syntax. It is by far the most frequent syntax you'll find anywhere somewhat recent (1990+) and it is by far the most straightforward to learn by being clean. I did learn both in the 90s, but I've always struggled with AT&T symbols and different offerings which to me was hard to switch to when learning from different material sources.

In Intel syntax, the operand order is flipped (the destination comes first). Also, you don't have to prefix constants and registers with $ and %.

Next, we come to the operands section. According to The Rust Reference, the inout syntax looks like

inout(<reg>) <in expr> => <out expr>

You should be able to include an expression that is put into register reg at the start, with the contents of the register then being put into out expr at the end. Unfortunately, I wasn't able to get this to work (probably PEBCAK).
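
Outside of the Valgrind setup, the documented form itself is simple enough; here's a minimal, isolated sketch (x86-64 only, and nothing to do with the demo code):

use std::arch::asm;

fn main() {
    let answer: u64;
    unsafe {
        // 41 is loaded into rax before the instruction runs, and whatever
        // is in rax afterwards is written to `answer`.
        asm!("add rax, 1", inout("rax") 41u64 => answer);
    }
    assert_eq!(answer, 42);
}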

inout("di") zzq_default => zzq_result just did not seem to populate the register with zzq_default at the start, so what I ended up doing (which I think might actually be cleaner) was to set zzq_result to have the default value from the beginning, that way if the binary is run standalone, we already have zzq_result set with the correct value to return.

The "in" operand was pretty straightforward, looking very similar to the C inline assembly.

Here's the Rust code where we use the macro:

let mut val: u8 = 1;
let x = kc_borrow_mut!(val); // x = &mut val;

Struggles and takeaways

All in all, what took me the longest was figuring out what "0" (_zzq_default) meant in the C inline assembly: it's a matching constraint, meaning this input must go in the same location as output operand 0, so _zzq_default lands in the same register (rdx) that _zzq_result is later read from. I also struggled with the question of why rdx and rax are used, as opposed to some other registers. As far as I understand, this is mostly a matter of convention, and there is nothing stopping you from using other ones.

The biggest difference between C inline assembly and Rust inline assembly was understanding the operands section. Also, the cc clobber didn't seem to be needed: according to The Rust Reference, Rust assumes the flags register is clobbered, and it's only when the flags are not modified that you explicitly set the preserves_flags option. Lastly, I didn't quite figure out how to express the equivalent of the memory clobber. The Rust Reference does mention clobbers in multiple spots, but it seems that you have to specify which registers are being clobbered, which is not what memory expresses. I hope this doesn't come back to bite us later; in fact, I'll probably have to get to the bottom of memory once we proceed further with our project.

What's next?

If we can communicate with Valgrind by using explicit macros in the code, we should be able to integrate this approach into the compiler, and insert the assembly during compilation (based on a compiler flag), unlocking the possibility of exciting new dynamic analysis tools for Rust that are tightly integrated with rustc itself.

There's more work that would need to be done to support this, mostly in the MIR layer of the compiler, but given that our Rust demo is a success, we're off to a good start. As Felix and I continue our exploration of this dynamic analysis space, there'll be more interesting topics to uncover, which I'll document on this blog :)

If you made it this far, you are probably itching to see the full code:

Thanks to Nicholas Nethercote for his help in deciphering some of the Valgrind code and pointing me in the right direction, Felix for writing the initial C code which formed the basis for this entire journey, as well as Jack and Dillon for reading drafts of this post.

1. See Niko's mention of dynamic verification

2. The faker's guide to reading (x86) assembly language served as a good refresher on assembly

3. See the GNU Extended Asm docs for a full explanation of clobbers