Taint Analysis with DynamoRIO, Part 2
In the previous post, we introduced taint analysis and various frameworks which incorporate taint analysis into their computation. This post is the first in the series to discuss implementation details of taint analysis, on top of a binary instrumentation framework.
In particular, we will be discussing possible implementations for shadow memory, as well as Dr. Memory’s performant solution to this problem. The existing solution is wrapped up in a library called Umbra, which I port to ARM in this post. My progress is currently detailed here. All code examples will be taken from here as well.
Shadow Memory
Implementing taint analysis in DynamoRIO obviates the necessity of a shadow memory system. For example, we may look to the umbra_app
test in the repository where we see the following assembly:
ldr r4, [r7]
b #0x000106e8
Here, we’re loading a single byte from [r7]
and saving it into r4
. When using DynamoRIO, we can only perform stateful computation by inserting instrumentation (i.e. assembly) directly into the code stream. So, in order to compute the taint propagation step, we must first pull the taint value of the memory stored at [r7]
in order to propagate the taint to the register r4
.
Now, in order to compute the taint value of the memory stored in [r7]
, we must first define some mapping from an application address to an associated shadow address; this shadow address serves as a mirrored space for all of the memory used by the application. For umbra on ARM, we now emit the following code in order to determine the shadow address given the application address:
m4 e58a4084 str r4, [r10, #132]
m4 e10f4000 mrs r4, cpsr
m4 e58a4080 str r4, [r10, #128]
m4 e58a0088 str r0, [r10, #136]
m4 e1a04007 mov r4, r7
m4 e58a4008 str r4, [r10, #8]
m4 e30b016c movw r0, #45420
m4 e3470801 movt r0, #30721
m4 e0800724 add r0, r0, r4, lsr #14
m4 e3c00003 bic r0, r0, #3
m4 e5900000 ldr r0, [r0]
m4 e1a04124 lsr r4, r4, #2
m4 e0844000 add r4, r4, r0, #0
m4 e5c40000 strb r0, [r4]
L3 e5974000 ldr r4, [r7]
L3 eaffffff b #0x000106e8
For those unfamiliar with DynamoRIO, the m4
instructions are meta instructions inserted by our DynamoRIO client, while L3
instructions represent the instructions as seen in the original program. These L3
instructions should be untouched, effectively. We’ll pull this emitted assembly apart in the next section, to understand the underlying data structures as well as their function.
Page Table Implementation
The implementation detailed here is in fact the same one used by Dr. Memory on x86. This logic is currently wrapped up in the Umbra project in the Dr. Memory repository, which I have modified to also support ARM.
At the heart of the shadow memory mapping scheme lies a page table stored in static memory:
/* drmemory/umbra/umbra_32.c */
/* We divide the address space into (64KB) chunk units */
#define APP_BLOCK_BITS 16
#define SHADOW_TABLE_ENTRIES (1 << (32 - APP_BLOCK_BITS))
static ptr_int_t static_shadow_table[SHADOW_TABLE_ENTRIES];
This table in fact stores a displacement from the application address for performance reasons, at least on x86. In order to use this data structure, we recall that it functions much like a page table; we can compute the index into the page table by simply interpreting the top 16 bits of the address as the index. We then dereference that index into the static shadow table, to compute the displacement. This was implemented previously in x86 with very naive assembly, shown below:
mov ecx, eax
shr ecx, 0x10
add eax, [shadow_table + ecx*4]
Here, %eax
initially holds some application address, which is being translated to a shadow address. Furthermore, %ecx
here functions as an index into the static shadow table (where each element is 4 bytes). Unfortunately, this code sequence is not so nice on ARM, because we cannot simply express the operand (table,%ecx,4)
which performs the operation table + %ecx*4
. We instead do the following on ARM:
mov r1, shadow_table
add r1, r1, r0 lsr #14
bic r1, r1, #3
ldr r1, [r1]
add r0, r0, r1
Here, r0
holds the application address. Note a few key difference in this code sequence. In order to keep from spilling a register, we modify the original expression like so:
\begin{align*} table + 4 * (addr \gg 16) &\equiv table + ((addr \gg 16) \ll 2) \\ &\equiv table + ((addr \gg 14) \mathrel{\&} \text{0xfffffffc}) \\ &\equiv (table + (addr \gg 14)) \mathrel{\&} \text{0xfffffffc} \end{align*}
The final step assumes that table
is 4-byte aligned, or already has the least significant 2 bits cleared, so that we can move the bitwise and
out of the inner expression. In practice, this is true because of the C ABI and alignment rules for the ARM platform. Although not immediately apparent, the expression
\[ table + (addr \gg 14) \]
has a direct equivalent in ARM, namely add r1, r1, r0 lsr #14
, allowing us to elide a register spill or restore and simplify the emitted code itself.
Another interesting caveat here is that ARM cannot load the immediate 0xfffffffc
in order to perform the and
operation; we instead opt to use bic
, documented here which performs something like: op1 & ~op2
. In other words, the previous expression becomes:
\[ (table + (addr \gg 14)) \mathrel{\&} \text{~} 3 \]
We can now load all required constants (0x3
, etc) as immediates in both ARM and in THUMB mode, and this expression mimics exactly the shadow address translation routine shown above.
Memory Usage Optimization
In a typical application, we do not saturate the entire address space. So, it does not make sense, nor is it feasible to initialize all the shadow memory up front. To alleviate this problem, we actually allocate only a single read-only block at initialization time, with a default value set. This is the block that is referenced by all entries in the page table initially, so our memory usage initially is only the cost of the page table, plus that single shared block.
Now here’s the kicker: if someone attempts to write to this default shared block, we handle the page fault gracefully in a signal handler. If we notice that the faulted access is within this special read-only shared block, we can replace out the relevant page table entry with a new RW page, and re-execute the faulting instruction.
Note that this algorthm coincides generally with the concept of copy-on-write (COW), used in a variety of applications as the Linux kernel and the C++ std::string
implementation for performance reasons.
In this manner, we end up only allocating shadow blocks for the blocks which are written to at runtime, which will likely include the stack and the heap, but probably not .text or libc, thus reducing shadow memory usage for an application in general. This is implemented in Dr. Memory proper here. A far shorter example implemented in a test in the pull request can be found here.