Static Analysis Roadmap
This blog post documents some notes I’ve written up in my (admittedly shallow) studies of static analysis, organized topically into the broad categories I find most interesting.
Of course this is not exhaustive (symbolic execution isn’t even in this list!), and really this only includes topics for which I’m actively reading papers.
I also avoid writing too much, since most of these concepts are self-contained in some seminal paper that defines them, which I list as accompanying text.
Data Flow Analysis
Most classic data flow analyses were constructed to help compilers perform essential optimizations that would otherwise be very difficult to justify without some formalization. These common analyses can be found in the dragon book, among other textbooks and survey papers:
- Reaching Definitions
- Dead Code Elimination
- Constant Propagation
- Live Variable Analysis
- Available Expressions
Note that most of the analysis passes on this list are made trivial by reaching definitions, which computes use-definition (use-def) and definition-use (def-use) chains in a given function. Unfortunately, reaching definitions is precisely computable only given a sound and complete CFG, as well as accurate knowledge of all variables in a function. A compiler can readily provide this information, but recovering it from a binary is undecidable in general.
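To make the fixed-point flavor of this concrete, here is a minimal sketch of the classic iterative worklist algorithm for reaching definitions, run over a hand-built toy CFG (the block and definition encoding is mine, not any particular tool's):

```python
# Toy CFG: block 0 -> block 1, block 1 -> block 1 (loop), block 1 -> block 2.
preds = {0: [], 1: [0, 1], 2: [1]}

# GEN: definitions (variable, definition-site) created in each block.
gen = {0: {("x", "d0"), ("y", "d1")}, 1: {("x", "d2")}, 2: {("y", "d3")}}

# KILL: all other definitions of the variables a block redefines.
defs_of = {"x": {"d0", "d2"}, "y": {"d1", "d3"}}
kill = {b: {(v, d) for (v, _) in gen[b] for d in defs_of[v]} - gen[b]
        for b in preds}

reach_in = {b: set() for b in preds}
reach_out = {b: set(gen[b]) for b in preds}

# Iterate the transfer functions until nothing changes (a fixed point).
worklist = list(preds)
while worklist:
    b = worklist.pop()
    reach_in[b] = set().union(*[reach_out[p] for p in preds[b]])
    new_out = gen[b] | (reach_in[b] - kill[b])
    if new_out != reach_out[b]:
        reach_out[b] = new_out
        worklist.extend(s for s in preds if b in preds[s])  # requeue successors

print(reach_in[2])  # {('x', 'd2'), ('y', 'd1')}: d0 is killed by d2 in the loop
```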
Another interesting addendum to reaching definitions is Static Single Assignment (SSA) form, which many compilers (LLVM’s internal representation, for example) use to make the use-def and def-use chains within a function explicit. The seminal paper on SSA construction is Efficiently Computing Static Single Assignment Form and the Control Dependence Graph, which uses dominance frontiers to compute minimal SSA form.
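As a taste of the machinery, here is a sketch of the compact dominance-frontier computation popularized by Cooper, Harvey, and Kennedy, run over a toy diamond CFG with a precomputed dominator tree (both inputs are my own contrivance):

```python
# Toy diamond CFG:
#       0
#      / \
#     1   2
#      \ /
#       3
preds = {0: [], 1: [0], 2: [0], 3: [1, 2]}
idom = {0: None, 1: 0, 2: 0, 3: 0}  # immediate dominators, precomputed

df = {n: set() for n in preds}
for b, ps in preds.items():
    if len(ps) >= 2:                 # only join points contribute
        for p in ps:
            runner = p
            # Walk up the dominator tree until we hit idom(b); every node
            # visited on the way has b in its dominance frontier.
            while runner != idom[b]:
                df[runner].add(b)
                runner = idom[runner]

print(df)  # {0: set(), 1: {3}, 2: {3}, 3: set()}
# A variable assigned in both block 1 and block 2 needs a phi-node at
# block 3, their common dominance frontier.
```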
Accompanying Text
- Engineering a Compiler
- Compilers: Principles, Techniques and Tools
- A Survey of Data Flow Analysis Techniques
- Efficiently Computing Static Single Assignment Form and the Control Dependence Graph
CFG Reconstruction
This section covers CFG reconstruction, as well as the recovery of the variables in a function. Both are needed to compute SSA form or any other data flow analysis that assumes a reasonably accurate CFG. Some techniques are listed below:
- Value-Set Analysis
- Bounded Address Tracking
- Value Analysis with Precision Requirements
Note that Bounded Address Tracking falls under the umbrella of abstract interpretation, and some would say that Value-Set Analysis does as well.
Value-Set Analysis
What You See Is Not What You Execute, as well as Analyzing Memory Accesses in x86 Executables, seminally define Value-Set Analysis (VSA) for the purpose of helping IDA Pro reconstruct control flow graphs precisely. The papers also use affine relations between registers to more tightly bound VSA’s over-approximation in certain cases.
(State of) The Art of War: Offensive Techniques in Binary Analysis mentions an alternative augmentation to Value-Set Analysis that is less precise than affine-relation analysis but much less complex: a custom lightweight algebraic solver, used to simplify path predicates into a value interval for each variable in the predicate.
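The abstract domain at the heart of VSA is the strided interval. The following is a rough sketch of the idea; the join shown here is deliberately simplified and looser than the reduced strided-interval arithmetic in the papers:

```python
from dataclasses import dataclass
from math import gcd

@dataclass(frozen=True)
class StridedInterval:
    """s[lo, hi]: over-approximates a set of values as lo, lo+s, ..., hi."""
    stride: int
    lo: int
    hi: int

    def concretize(self):
        if self.stride == 0:
            return {self.lo}
        return set(range(self.lo, self.hi + 1, self.stride))

    def join(self, other):
        # Sound (but loose) join: the new stride must divide both strides
        # and the offset between the two lower bounds.
        s = gcd(gcd(self.stride, other.stride), abs(self.lo - other.lo))
        return StridedInterval(s, min(self.lo, other.lo), max(self.hi, other.hi))

# A jump table indexed as table[4*i], i in 0..3: the strided interval
# 4[0x1000, 0x100c] captures exactly the four possible target addresses.
table = StridedInterval(4, 0x1000, 0x100C)
print(sorted(hex(a) for a in table.concretize()))
# ['0x1000', '0x1004', '0x1008', '0x100c']
```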
Bounded Address Tracking
Static Analysis of x86 Executables, as well as Precise Static Analysis of Untrusted Driver Binaries, define k-set analysis and Bounded Address Tracking (BAT). The concept is based heavily on Analyzing Stripped Device-Driver Executables (implemented as DDA/x86 via CodeSurfer/x86). A juicy quote from the paper:
The major difference [is that VSA is] based on more “classic” static analyses such as interval analysis. VSA is path insensitive and thus requires the use of call strings for reasonable results.
However, because an x86 program is not obligated to maintain a well-formed call stack, call strings are unreliable at best in this kind of analysis. It’s also interesting to note that bounded address tracking in general computes an underapproximation.
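Here is a rough sketch of the k-set idea underlying Bounded Address Tracking, with an illustrative bound K and a single toy transfer function; none of this mirrors jakstab’s actual implementation:

```python
# Illustrative precision bound: at most K distinct concrete values per
# location before widening to TOP ("we no longer know this value").
K = 4
TOP = "TOP"

def kset_join(a, b):
    if a == TOP or b == TOP:
        return TOP
    merged = a | b
    return merged if len(merged) <= K else TOP

def kset_add(a, const):
    # Transfer function for "reg := reg + const".
    return TOP if a == TOP else {v + const for v in a}

# A loop that bumps a pointer by 4 each iteration quickly exhausts the
# bound, and the tracked address set degrades to TOP.
v = {0x1000}
for _ in range(6):
    v = kset_join(v, kset_add(v, 4))
print(v)  # TOP
```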
Value Analysis With Precision Requirements
Value Analysis with Precision Requirements (VAPR) uses k-set analysis, much like BAT, but ramps up the k-set precision each iteration to construct an increasingly complete CFG. This is defined in Refinement-based CFG Reconstruction from Unstructured Programs.
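A sketch of that refinement loop follows; analyze() is a hypothetical stand-in for the underlying k-set value analysis, not anything from the paper’s implementation:

```python
def analyze(program, k):
    """Hypothetical stand-in for a k-set value analysis: return the
    indirect-jump sites whose target sets exceed k (i.e. widen to TOP)."""
    return {site for site, targets in program.items() if len(targets) > k}

def reconstruct_cfg(program, k_max=64):
    k = 1
    while k <= k_max:
        if not analyze(program, k):
            return f"CFG complete at k={k}"
        k *= 2  # ramp up precision and re-run the whole analysis
    return "gave up: some jump targets remain unresolved"

# Toy "program": each indirect jump site mapped to its concrete targets.
program = {"jmp_a": {0x10, 0x20}, "jmp_b": {0x30, 0x34, 0x38, 0x3C}}
print(reconstruct_cfg(program))  # CFG complete at k=4
```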
Tools Listing
WYSINWYX and the other papers defining VSA (all work by Gogul Balakrishnan) are implemented as part of CodeSurfer/x86 and DeviceDriverAnalyzer/x86. These are IDA Pro plugins which implement VSA for the purpose of analyzing possible runtime values, and as a side effect they construct a more precise CFG than IDA provides by default. VSA is also used in angr and Binary Ninja, and will soon be used in panopticon.
All of the BAT work was written by Johannes Kinder, and is incorporated into his analysis toolkit, jakstab. Panopticon currently uses a rough implementation of BAT. A quick glance suggests that falcon also uses k-set analysis (BAT?) in its abstract interpretation implementation.
Accompanying Text
- What You See Is Not What You Execute
- Analyzing Memory Accesses in x86 Executables
- Analyzing Stripped Device-Driver Executables
- Static Analysis of x86 Executables
- Precise Static Analysis of Untrusted Driver Binaries
- Refinement-based CFG Reconstruction from Unstructured Programs
Decompilation
Decompilation requires some method to reconstruct the CFG precisely, as well as to recover all variables in a stack frame. Any of the abstract interpretation approaches above can provide both, though VSA is preferred. Data flow analysis is also essential for cleaning up decompiler output.
Most modern papers focus only on two aspects of decompilation:
Accurate Type Recovery
TIE: Principled Reverse Engineering of Types in Binary Programs aims to accurately reconstruct types in binaries. It first uses a variant of VSA to determine the locations of high-level variables in a subroutine. It then formulates and solves constraints on the types of those variables based on how they are used (for example, only a pointer type can be dereferenced).
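To give a flavor of the constraint formulation (heavily simplified; TIE’s actual type lattice and solver are far richer), here is a toy rendition of the “a dereference implies a pointer” style of rule:

```python
types = {}  # variable -> inferred type description

def constrain_deref(var, size):
    # "*var was accessed as a size-byte load/store" => var points to a
    # size-byte cell.
    types[var] = ("ptr", size)

def constrain_arith(var):
    # var used in plain arithmetic: integer-like, unless a stronger
    # constraint (like pointer-ness) already pinned it down.
    types.setdefault(var, ("int",))

# Constraints as they might be generated from
#   mov eax, [ebx]    ; 4-byte dereference of ebx
#   add ecx, 1        ; integer arithmetic on ecx
constrain_deref("ebx_0", 4)
constrain_arith("ecx_0")
print(types)  # {'ebx_0': ('ptr', 4), 'ecx_0': ('int',)}
```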
Structured Control Flow Reconstruction
The paper No More Gotos demonstrates an algorithm that guarantees decompilation free of gotos, assuming the original source code was also goto-free. It exploits the fact that all structured control flow has a single entry point and a single successor: the algorithm partitions the control flow graph into regions with this property, then reconstructs control flow by pattern matching on the path predicates of the nodes in each region (i.e., their reaching conditions).
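Here is a toy sketch of condition-based structuring in that spirit; reaching conditions are plain strings, and the matching is far cruder than the paper’s pattern set:

```python
def structure_region(nodes):
    """nodes: list of (reaching_condition, statement) pairs in one region."""
    out, used = [], set()
    for i, (cond, stmt) in enumerate(nodes):
        if i in used:
            continue
        # Look for a node guarded by the negated condition: an if/else pair.
        partner = next(
            (j for j, (c2, _) in enumerate(nodes)
             if j > i and j not in used and c2 == f"!{cond}"),
            None,
        )
        if cond == "true":
            out.append(stmt)                      # unconditional code
        elif partner is not None:
            used.add(partner)
            out.append(f"if ({cond}) {{ {stmt} }} else {{ {nodes[partner][1]} }}")
        else:
            out.append(f"if ({cond}) {{ {stmt} }}")
    return "\n".join(out)

region = [("true", "x = input()"), ("c1", "y = 1"), ("!c1", "y = 2")]
print(structure_region(region))
# x = input()
# if (c1) { y = 1 } else { y = 2 }
```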
The follow-up paper Helping Johnny to Analyze Malware defines some further refinements to clean up decompilation, like variable congruence (based on liveness analysis) and loop simplification.
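As a minimal illustration of variable congruence, assume live ranges have already been computed as instruction-index intervals (the real analysis derives these from liveness):

```python
def congruent(live_a, live_b):
    # Two variables can safely share one name in the decompiled output
    # if their live ranges never overlap.
    (sa, ea), (sb, eb) = live_a, live_b
    return ea < sb or eb < sa   # disjoint ranges -> safe to merge

print(congruent((0, 3), (5, 9)))   # True: merge into one variable
print(congruent((0, 6), (5, 9)))   # False: both live at instructions 5-6
```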
Tools Listing
The phoenix and DREAM decompilers both implement the TIE paper, and DREAM additionally implements No More Gotos and Helping Johnny to Analyze Malware. The fcd decompiler also takes inspiration from Helping Johnny to Analyze Malware.
Accompanying Text
- No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantics-Preserving Transformations
- Helping Johnny to Analyze Malware: A Usability-Optimized Decompiler and Malware Analysis User Study
- TIE: Principled Reverse Engineering of Types in Binary Programs
- Native x86 Decompilation using Semantics-Preserving Structural Analysis and Iterative Control-Flow Structuring