Taint Analysis with DynamoRIO, Part 2

Posted on October 13, 2017 with tags dynamorio, taint analysis

In the previous post, we introduced taint analysis and various frameworks which incorporate taint analysis into their computation. This post is the first in the series to discuss implementation details of taint analysis, on top of a binary instrumentation framework.

In particular, we will be discussing possible implementations for shadow memory, as well as Dr. Memory’s performant solution to this problem. The existing solution is wrapped up in a library called Umbra, which I port to ARM in this post. My progress is currently detailed here. All code examples will be taken from here as well.

Shadow Memory

Implementing taint analysis in DynamoRIO necessitates a shadow memory system. For example, we may look to the umbra_app test in the repository, where we see the following assembly:

ldr    r4, [r7]
b      #0x000106e8

Here, we’re loading a word from [r7] and saving it into r4. When using DynamoRIO, we can only perform stateful computation by inserting instrumentation (i.e. assembly) directly into the code stream. So, in order to compute the taint propagation step, we must first pull the taint value of the memory stored at [r7], and then propagate that taint to the register r4.

Now, in order to compute the taint value of the memory stored in [r7], we must first define some mapping from an application address to an associated shadow address; this shadow address serves as a mirrored space for all of the memory used by the application. For umbra on ARM, we now emit the following code in order to determine the shadow address given the application address:

 m4  e58a4084   str    r4, [r10, #132]
 m4  e10f4000   mrs    r4, cpsr
 m4  e58a4080   str    r4, [r10, #128]
 m4  e58a0088   str    r0, [r10, #136]
 m4  e1a04007   mov    r4, r7
 m4  e58a4008   str    r4, [r10, #8]
 m4  e30b016c   movw   r0, #45420
 m4  e3470801   movt   r0, #30721
 m4  e0800724   add    r0, r0, r4, lsr #14
 m4  e3c00003   bic    r0, r0, #3
 m4  e5900000   ldr    r0, [r0]
 m4  e1a04124   lsr    r4, r4, #2
 m4  e0844000   add    r4, r4, r0
 m4  e5c40000   strb   r0, [r4]
 L3  e5974000   ldr    r4, [r7]
 L3  eaffffff   b      #0x000106e8

For those unfamiliar with DynamoRIO, the m4 instructions are meta instructions inserted by our DynamoRIO client, while L3 instructions represent the instructions as seen in the original program. These L3 instructions are left effectively untouched. We’ll pull this emitted assembly apart in the next section, to understand the underlying data structures and how they function.

Page Table Implementation

The implementation detailed here is in fact the same one used by Dr. Memory on x86. This logic is currently wrapped up in the Umbra project in the Dr. Memory repository, which I have modified to also support ARM.

At the heart of the shadow memory mapping scheme lies a page table stored in static memory:

/* drmemory/umbra/umbra_32.c */
/* We divide the address space into (64KB) chunk units */
#define APP_BLOCK_BITS 16
#define SHADOW_TABLE_ENTRIES (1 << (32 - APP_BLOCK_BITS))
static ptr_int_t static_shadow_table[SHADOW_TABLE_ENTRIES];

This table in fact stores a displacement from the application address to its shadow address, for performance reasons, at least on x86. To use this data structure, recall that it functions much like a page table; we compute the index into the table by interpreting the top 16 bits of the application address as the index. We then read the entry at that index in the static shadow table to obtain the displacement. On x86, this was previously implemented with very simple assembly, shown below:

mov ecx, eax
shr ecx, 0x10
add eax, [shadow_table + ecx*4]

Here, eax initially holds some application address, which is being translated to a shadow address, and ecx functions as an index into the static shadow table (where each element is 4 bytes). Unfortunately, this code sequence is not so nice on ARM, because we cannot express a scaled-index operand like [shadow_table + ecx*4] in a single instruction. We instead do the following on ARM:

mov r1, shadow_table
add r1, r1, r0, lsr #14
bic r1, r1, #3
ldr r1, [r1]
add r0, r0, r1

Here, r0 holds the application address. Note a few key differences in this code sequence. In order to keep from spilling a register, we modify the original expression like so:

\begin{align*} table + 4 * (addr \gg 16) &\equiv table + ((addr \gg 16) \ll 2) \\ &\equiv table + ((addr \gg 14) \mathrel{\&} \text{0xfffffffc}) \\ &\equiv (table + (addr \gg 14)) \mathrel{\&} \text{0xfffffffc} \end{align*}

The final step assumes that table is 4-byte aligned, or already has the least significant 2 bits cleared, so that we can move the bitwise and out of the inner expression. In practice, this is true because of the C ABI and alignment rules for the ARM platform. Although not immediately apparent, the expression

\[ table + (addr \gg 14) \]

has a direct equivalent in ARM, namely add r1, r1, r0, lsr #14, allowing us to elide a register spill and restore and simplify the emitted code itself.

Another interesting caveat here is that ARM cannot encode the immediate 0xfffffffc for the and operation; we instead opt to use bic (bit clear), documented here, which computes op1 & ~op2. In other words, the previous expression becomes:

\[ (table + (addr \gg 14)) \mathrel{\&} \mathord{\sim}3 \]

We can now load all required constants (0x3, etc.) as immediates in both ARM and Thumb mode, and this expression exactly mimics the shadow address translation routine shown above.

Memory Usage Optimization

In a typical application, we do not saturate the entire address space. So, it is neither sensible nor feasible to initialize all the shadow memory up front. To alleviate this problem, we allocate only a single read-only block at initialization time, filled with a default value. Initially, every entry in the page table references this block, so our memory usage at startup is only the cost of the page table plus that single shared block.

Now here’s the kicker: if someone attempts to write to this default shared block, we handle the resulting page fault gracefully in a signal handler. If we notice that the faulting access falls within this special read-only shared block, we swap the relevant page table entry to point at a newly allocated RW block, and re-execute the faulting instruction.

Relevant diagram for an 8-bit architecture but which functions equivalently. Here, an 8-bit application address 11010110 has top 4 bits 1101, which maps to the last entry in the page table. When dereferenced, we catch the fault and replace it with a new RW page, then re-execute the instruction.


Note that this algorithm coincides generally with the concept of copy-on-write (COW), used for performance reasons in a variety of settings, such as the Linux kernel and older C++ std::string implementations.

In this manner, we end up only allocating shadow blocks for the blocks which are written to at runtime, which will likely include the stack and the heap, but probably not .text or libc, thus reducing shadow memory usage for an application in general. This is implemented in Dr. Memory proper here. A far shorter example implemented in a test in the pull request can be found here.