Most static analysis tools don’t let you script them. Oink is an example of that. Adding a new analysis requires extending an existing tool with a feature which may not fit in smoothly or writing a new tool in C++ with the corresponding boilerplate to handle command-line arguments, etc.

At the other extreme lives UNO which has a DSL that is designed strictly for solving control-flow programs.

I chose JavaScript for Dehydra to get the power of a full-blown (and memory safe) programming language while keeping JS’s view of the code under analysis as simple as possible. I think I got the C++/JS mix just right, the C++ codebase is growing slowly, yet I add JS scripts for any little task.

Additionally JS came in handy for publishing the data on the web. I wrote a script to determine the class inheritance graph. Then I added a little prefix to the file produced, combined it with jsviz and ended up with a really slow and unusable, but somewhat pretty class browser. This could evolve to be a rather neat way to supplement LXR.

Automated Code Refactoring

April 2nd, 2007

Squash

If you are working on any C++ refactoring, especially if it involves function calls, spans multiple files or feels like you need a compiler in your head to help you, drop me a note to see if squash can help. Squash provides a great deal of control over the refactoring process because it is not tied to a particular IDE and can be customized to accommodate for special cases.

On Friday, two squash-produced patches landed:

  1. A 212K patch to rename nsIFrame::GetPresContext to PresContext. It took a couple of minutes to produce a patch for mac & linux, and then some manual labour to complete it so it builds on Windows too. Unfortunately, Microsoft C++ is not yet supported by Oink. Windows-specific code will require magnitudes more of human labour until such support is contributed.
  2. A much simpler patch to calls to remove uses of the deprecated ::Recycle(). This took a few minutes once I added support for renaming global functions to squash.

Dehydra

C++ support in dehydra is coming along splendidly. I started working on cross-function analysis support. Currently my goal is to allow the user to build callgraphs of Mozilla. The first application of that is going to be dead code detection.

In the meantime, contact me if you are looking for patterns in the code that grep wont help with : control flow-sensitive code, type & syntax-aware matching, API misuse, etc. Dehydra can probably help.

Dehydra 0.1

March 7th, 2007

Dehydra does a form of basic symbolic execution now. It’s enough to avoid the unfeasable branches. It enabled me to add known_zero() and known_notzero() so dehydra supports all of the functions documented in the UNO paper.

I believe dehydra can now match all of the features of uno’s DSL. Download it here. See malloc.js for an example of how to port UNO scripts. There is no documentation, except for this blog, but the docs on the UNO website should work too.

This is the first release so there are bugs, but the tool should be able to find some API-misuse issues, etc. I need some early adopters to help direct the evolution of this tool.

Uno, dos… DeHydra!

March 2nd, 2007

A Creature so Fierce…

I’ve been wrestling with control flow graphs like this. I eliminated those pesky “empty” nodes found in the previous incarnations, improved branching to track conditions and realized that what I’m really doing is developing a tool to bite off the excessive necks and heads (otherwise known as edges and basic blocks) of the Hydra monster that is represented by the bloody graph. Thus, say hello to DeHydra which was once known as dos.

DeHydra Plans

I am quite happy with how dehydra is turning out. JavaScript is a nice language to pattern match code in a flow sensitive fashion and the uno-style api seems appropriate for the task.

In the near term I really need some value-awareness through abstract interpretation. For example, this graph of a simple function demonstrates several things that dehydra will need to notice:

  1. There is a path from hydra-unfeasable.c:6 to hydra-unfeasable.c:10 and that those blocks are part of only a single control flow, not 4.
  2. hydra-unfeasable.c:12 requires a contradiction in the condition in order to have flow through it and should be dropped or even reported as dead code.

I pondered a number of approaches ranging from value analysis, using a theorem prover and reversing conditions via prolog code. A simple C++ symbolic expression interpreter in C++ seems like the most productive thing at this point. So I will focus on that in the next short while.

While I was reviewing my options I found a paper on null pointer analysis for Java to be a useful stating point for references to other relevant literature.

Longer Term DeHydra Plans

There is literature on how source-to-source transformation (like squash but also like the future C++ to ES4 tool) is assisted by flow sensitive static analysis tools (like dehydra) which I will have to absorb.
DeHydra will gain whole-program analysis capability by:

  1. Inlining control flow graphs into the main() function
  2. Supporting user-defined function attributes and using some sort of attribute inference (like in ML) to propagate them. This method would have higher performance, but less precision.

For the above to work will also need support for a user-correctable/suggestable callgraph to compensate for function pointers and dynamic module loading. This should also be enough to do dead code and module detection. Oh and I want dehydra to support JavaScript for doing cross-language analysis.

There is also Whaley’s paper which uses prolog as a concise and expressive query engine for the source code. I would really like to see dehydra evolve such capabilities.

All this will take more than a little time on my own, but this stuff is too exciting for other impatient people to idly observe.

Dotting Pretty Graphs

February 7th, 2007

It is difficult to follow error messages from control-flow analyzing tools. After struggling to visualize what I was debugging, I added a a native JavaScript function to graph the CFG and display the current path through it.

Now debugging control flow errors is as easy as looking at a (sometimes giant) picture of this function. The red represents the current flow and gray indicates flows that the script deemed correct.

Recipe

In this example I am checking that control flows through a particular point in the code (from line 1128 into line 1200). Since dos only deals with variables I added variables that it can match to the AST.

I added DFLOW_START to line 1127, DFLOW_STOP to line 1200 and produced my .i files with make CC=”gcc -DFLOW_START=’{int __flow_start;}’ -DFLOW_STOP=’{int __flow_start=1;}’” jsarray.i

Then I ran dos with the tiny analysis script: ./dos -dos-javascript ensure_out.js -o-lang GNU_C ~/work/ff-build/js/src/jsarray.i

Future Work
This should allow function-local dead code detection. Once dos is mature enough for function-local CFG traversals it will be interesting(but challenging) to try to expand this to detect dead functions or classes in Mozilla.

Two weeks ago I finally listened to Graydon and spent some quality time playing with UNO and reading the paper on it. UNO provides a simple DSL language to traverse interesting parts of the C abstract syntax tree. It simplifies the AST down to the bare minimum of variable and function call info and iterates user-defined scripts over that.

Since I was looking for a way to get away from C++ for code analysis I was very excited and decided to implement the ideas in UNO in my own tool which goes by a temporarily name of “dos”. There limitations in UNO: it has no hope of parsing C++ without switching parsers and the DSL is rather limited. Thus, dos is built on the incredible Elsa/Oink C/C++ parser and uses SpiderMonkey to provide JavaScript as a scripting language.

Status

  • dos builds a Control Flow Graph from the Elsa AST. It isn’t finished and while for/while loops, break/continue/return and if statements are supported (as opposed to the trinary operator or goto), but the CFG isn’t quite right yet. However, it’s good enough for proof of concept purposes.
  • Dos does not do DFS traversal when iterating the script like UNO, but instead iterates through statements sequentially because I did not feel like writing my own traversal of the massive Elsa AST.
  • I mostly implemented an UNO-compatibility JS lib in system.js which should allow for easy ports of UNO analyses like malloc.js. It’s missing the negation conditions from UNO, but those should be quick to implement.

Instructions

  1. Download my oink tarball with the squash & dos additions.
  2. Install ossp spidermonkey distribution:
    cd dos_and_squash-0.0/js-1.6.20060820 ; configure –prefix=/usr (or use /usr/include) && make install
  3. Build the oink suite:
    cd ../oink ; ./configure && make
  4. I ported a simple UNO analysis which checks that malloc() is always paired with a corresponding free() statement. To run it:
    ./dos -dos-javascript malloc.js simple.cpp
    Now modify line 12 of simple.cpp by putting ; after the while statement. The error will be detected:
    func:test
    Error at simple.cpp:13: malloc without free

Writing Scripts

Take a look at simple.js for a barebone analysis. There should be at least 2 functions in every script uno_check(vars, state) and path_end(state). uno_check() is called for every statement in a control flow path. vars is an array of interesting variables and function calls in the current AST statement and state is where the script should store all state info. state gets copied on CFG block transitions with an optionally user-provided clone() function (see system.js for an example). path_end() is called when the end of the CFG path is reached.

./dos -dos-javascript simple.js simple.cpp # this runs the user script

UNO-compatibility

UNO and the corresponding paper on it is currently the best documentation for dos. Thus I also assign vars and state to the global object to achieve UNO-like scripting feel. See malloc.js and system.js for details.

Goals

  • My foremost goal is to find a better name for dos. I’m open to suggestions that are both humorous and somehow related to source analysis.
  • Easy: finish UNO compatibility functions. I’ll do that as I port more scripts, but I wont mind if anyone does it for me.
  • Improve the CFG, do some value analysis to reduce the graph and provide more info to the scripts
  • Easy: Add types to the variable info, should get dos closer to doing the GC safety-analysis.
  • Reuse the JS parser to build a JS CFG to allow the same kind of analyses for JavaScript. This would be great for verifying API usage in extensions. For bonus points one can allow the user to tie the JS and C/C++ analysis together for cross-language analyses.
  • UNO has a fairly complicated global variable analyses that is not very scriptable and thus boring. I would be interested in scripting intra-procedural analyses to figure out call-graphs, do some sort of behavior inference (ie figure out indirect malloc calls, certain other heap modifying behaviors and propagate them as attributes to function calls), etc.

Update: New name for dos -> DeHydra

The tool is now named DeHydra since it is supposed identify bugs in the multi-headed Control Flow Graph monster.