GCC Summit

June 19th, 2008

Our presentation on Treehydra and Dehydra GCC plugins was received well at the summit.

The big news is that FSF is working on license changes to allow GPL-only GCC plugins. I’m looking forward to having our work be compatible with future GCC without any patching.

In a few minutes we’ll be having a meeting with users of other plugin frameworks to have an initial discussion on a common API. I’m working on forward porting my patches, so they can start getting reviewed ahead of license changes.

I am finally happy enough with Dehydra API and functionality to release 0.9. Dehydra is basically feature complete, the main reason I’m not calling it 1.0 is in case there are outstanding API bugs.

I believe Dehydra is the first useful open source static analysis tool. I hope to see projects outside of Mozilla benefitting from it too.

I would love to see someone package this up for various Linux distributions. You can grab there release here.

Note, this release also features as a preview release of Treehydra. Most of the development lately has been focused on improving Treehydra and building analyses on top of it.

After writing a ton of docs and working through other Dehydra 0.9 blockers, I decided to cool off by doing some actual analyses. Before I get to that, I’d like to say that the last big task is to setup a buildbot for Dehydra on Linux/OSX. Thanks to yet another awesome contribution from Vlad, that’s mostly done.

So I got working on GC-safety static analysis. Originally we tried to define a complete spec before writing a single line of code. That turned to be a bad idea and resulted in a spec full of bugs. This time we are defining the analysis incrementally and as a surprise reward, it already caught a bug.

Pushing and Popping Our Way

SpiderMonkey has a lot of complex code doing applying Push/Pop-like operations on variables in a function-local manner. Examples of functions that this analysis would look at are: JS_PUSH_TEMP_ROOT/JS_POP_TEMP_ROOT and JS_LOCK/JS_UNLOCK. See bug for more. Essentially, this will help with “code must flow through here” comments on “out:” goto labels that inhabit the SpiderMonkey source.

This is an example of control-flow-sensitive analysis. It impossible without a compiler-like view of the code that Treehydra provides. It also helps to have a scalable algorithm to iterate the CFG. Luckily, David Mandelin wrote such a beast by implementing ESP for his outparam analysis. David factored-out the ESP analysis and made it available for reuse. See esp_lock.js in the test suite for an example of how to write control-flow sensitive analyses. locks_valid*.cc and locks_bad*.cc illustrate the code patterns that can be scanned for.

So if you know of any further push/pop patterns in the rest of Moz that can be checked in this manner, leave a comment.

PS. This is yet another account of Treehydra rocking the static analysis world. Exposing the slightly scary, but awesome GCC gutts via JavaScript allows one to perform precise static analyses in a civilized manner. What could be more fun?

I hope to release Dehydra 0.9  within a couple of weeks. There is already a community of users, but there are still too many barriers to entry keeping potential bug hunters away.

In recent weeks there has been a lot of work on polishing rough areas. Now we have better error reporting, improved APIs for using libraries, etc. The remaining tasks are tracked in this bug.

There few big remaining TODOs are low-tech:

  • Need a better homepage than the current one.
  • Docs, tutorials and more docs. Currently, the plan is to puts more documentation on MDC and  have it also serve as a webpage. Any dehydra/treehydra guides or API doc contributions are welcome. For now if you need help, feel free to ask on the mailing list or #mmgc on irc.mozilla.org
  • Verify, document and maintain the OSX port. Vlad Sukhoy did a lot of heavy lifting to make this happen, now we need to cement his achievement by setting up a buildbot
  • Spread the word! I would like to see other large projects such as KDE, OpenOffice, etc adopt application-specific static analysis in the form of *hydra. I am interested in seeing people use *hydra to scan code for security vunerabilities. Ok, so this isn’t really needed to release Dehydra 0.9, but I am impatient!

RIP: Oink Dehydra

Between GCC Dehydra and Treehydra, there is nothing that pork Dehydra could do better, so I finally removed Dehydra from Pork. From now on Pork’s purpose is large-scale C/C++ refactoring. For everything else one should use Dehydra.

Dehydra World Tour

March 17th, 2008

After a few weeks of mindnumbing work on treehydra gutts, I finally have something exciting to talk about!

We will be presenting  Dehydra at the GCC Developer’s Summit in lovely Ottawa. The GCC version of Dehydra exceeded all of my expectations, so it will be exciting to meet awesome GCC hackers who lay the groundwork to make this possible. Got suggestions for other venues to present Dehydra?

Packaging Help Needed

I feel that the Dehydra concept is getting mature enough for a 1.0 release. Recently baked GCC 4.3 means I’ll be able to distribute a 4.3-specific plugin patch(currently it’s against trunk, aka 4.4to-be). Now I need README, LICENSE, configure files, etc.

I will need help with packaging dehydra + patched gcc into .dpkg and .rpm files. Leave a comment, email me/static analysis list or poke me in #mmgc on irc.mozilla.org if you can help with packaging.

Logo/Mascot Wanted

Since every serious project has a cool mascot, it would be cool to get one for Dehydra. I’d be curious to see what people think could symbolize a code scanning monster that makes grep feel inadequate. I have a feeling a cartoon version of a giant Heavy Metal Duck might be it, but I haven’t made up my mind yet.

Treehydra What?

Treehydra is a work-in-progress name for the low-level equivalent of Dehydra. Currently it is built as separate GCC plugin. I haven’t yet made up mind on whether Treehydra will end up extending Dehydra or stay a separate tool. Since treehydra needs dehydra for bootstrap, they’ll stay separate for now.

Last week I managed to run treehydra to completition on my mozilla checkout and walk the resulting AST in JS correctly. Now comes the fun part of making it do useful tricks.

I got this question in the mail today.

Seems like a simple enough question, but grep won’t provide that answer :) It also happens to be an excellent usecase for Dehydra.

My script:

var classes = []
function process_type (c) {
if (!/class|struct/(c.kind)) return
classes.push (c.name)
}


function input_end() {
var f = this.aux_base_name + ".counter"
print(f)
write_file (f, classes.join ("\n"))
}

process_type is called every time GCC hits a class declaration or a template is instantiated(also for enums and unions, but those get ignored with the .kind check). Then input_end is called when GCC is done processing the file. this.aux_base_name is the input filename.

I hooked up this script to the mozilla build by adding the following to .mozconfig:

export CXX=$HOME/gcc/bin/g++
export CXXFLAGS="-fplugin=$HOME/work/gccplugin/gcc_dehydra.so -fplugin-arg=$HOME/work/gccplugin/test/count_classes.js"

Then I built:

make -f client.mk build WARNINGS_AS_ERRORS=

Count:

find -name \*.counter|xargs cat |sort |uniq > /tmp/classes.txt
wc /tmp/classes.txt

Answer: 15001

There are a million other trivial queries that could be accomplished in a similar manner that weren’t easy or possible before.

Update: Fixed typo, had an extra zero in the answer

Dehydra progress

January 25th, 2008

GCC Dehydra is evolving much faster than the Elsa version did and it is easier to use. Once I implemented virtual methods correctly, Joshua was able to do his thing in no time at all. All it takes is a custom GCC (I’d love to see it packaged) and specifying plugin parameters in CXXFLAGS.

Dehydra has some new tricks now like a tree representation of types (instead of a string) with full typedef support. Lisp remnants in GCC are getting a new life as JavaScript objects.

I’m current working on exposing the full GCC tree structure in JavaScript so one could do any analysis they wanted in pure JS. Dynamically typed GCC tree nodes are great for that. I’m starting with middle-end GIMPLE representation so in theory one will be able to analyze anything gcc can compile (Java, C++, C, ObjC, ObjC++, FORTRAN?). Eventually this will be expanded to support frontend specific tree nodes to be able to look at code closer to the way it was written. Oh and I expect people will be able to script large parts of C++ -> JavaScript rewrites with Dehydra.

In theory, one could make tree node conversion two way which would enable writing optimization passes in JS, but that would be silly.

What’s the point?

I want to be able to do Exception-safety analysis in pure JS. I want to enable unit checking (thought typedefs and inline conversion functions) in pure JS.

Additionally, Dehydra should be awesome for generating bindings. For example, I’ll be able use Dehydra to import GCC’s autogenerated enums to get string names for nodes.

Also it will become easy to extract callgraphs and various other stats out of the code if they are accessible in JS. Eventually we’ll be switching Dehydra to Tamarin to do all of the above really really fast.

GCC Plugins

While I am messing with the GCC AST, Dave is working on utilizing GCC’s control flow graphs with a separate plugin. Eventually we’ll merge our work, but for now it’s nice to not step on each others toes while adding features to the compiler. Given how easy life is with plugins I am amazed that people chose to go uphill bothways and not collaborate on a plugin interface for their crazy GCC extensions. Yes, I’m looking at you: mygcc and gccxml.

Aren’t there IDEs interested in making use of GCC internals too or is everybody interested in maintaining yet another crappy C parser like Linux’s Sparse tool?

I’m looking forward to exploring the many ways we can reuse what’s in the compiler to empower developers for Mozilla 2.

Analysis

GCC Dehydra is starting to work. I encourage people try it out for their code scanning needs. The main missing feature is control-flow-sensitive traversal, which means that currently function bodies are traversed represented in a sequential fashion. It is the most complicated part of Dehydra, but most of the time this feature is not needed.

So far I got Benjamin’s stack-nsCOMPtr finding script to do stuff, which indicates that most of the features are working.

My vision is to switch to the GCC backend for all of our code analysis needs since it is well tested, fairly feature complete works with new versions of GCC (by definition).

Not everything is perfect in GCC land. There are some frustrating typedef issues to solve.

Source Re-factoring

Elsa still holds its own when it comes to refactoring code because it has a much cleaner lexer/parser and rarely opts to “optimize away” original AST structure. We should stick with Elsa’s arcane requirement of having to preprocess files with gcc <= 3.4 until either GCC becomes viable as a platform for refactoring or clang matures.

GCC is not suitable for refactoring work because it:

  1. Starts simplifying the AST  too early
  2. The parser is handwritten and therefore would be hard to modify to maintain end-of-AST-node location info.
  3. GCC reuses many AST nodes which means their locations point at the declaration rather than usage-point.
  4. Handwritten nature of GCC makes any of these above improvements time-consuming to implement and the political issues are something I’d rather not deal with.

Most of these wouldn’t have been an issue if GCC was written in ML :)
What’s Next?

Time to start using GCC Dehydra to enforce GC-safety and lots of fun exception-rewrite preparation work.

Stay tuned for more exciting developments regarding regaining control over source code here and on Dave Mandelin’s blog.

Dehydra as a GCC plugin

January 8th, 2008

Thanks to the 2-fold increase in manpower working on pork, we finally have an opportunity to work on the nice-to-have things.

Progress

Recently I have been working on a GCC plugin to do Mozilla-specific analyses with GCC.

Unfortunately, I didn’t notice that GCC had a plugin branch so I reinvented the wheel there. Fortunately that part was rather easy and turned out that the plugin branch isn’t very useful to work with as it is in SVN, doesn’t link GCC with -rdynamic nor does it install the hooks I need in the C++ frontend. Overall the plugin shim is relatively trivial and it will be pretty easy to merge with other similar efforts.

My first and only plugin is a C reimplementation of Dehydra. GCC sources are currently fairly hostile to C++, so I elected to not make my head spin by mixing in C++ in addition to C and JavaScript. I think the C Dehydra has reached the hello world state, to take it for a spin see the wiki page.

GCC Thoughts

Integrating with GCC is pretty awesome. So far I regret not jumping in earlier. I was reluctant to do so as everyone I’ve talked to (other than Tom Tromey) claimed that GCC is ridiculously complicated and impossible to do stuff with. In fact academic people are so scared of GCC that they tend to opt to go with commercial frontends that have ridiculus licensing terms and make it impossible to release their work to general public.

GCC internals are pretty crazy since everything is done with macros and the AST is dynamically typed so it’s fairly painful to figure out seemingly simple things like “what AST nodes does this AST node contain”. Additionally, GCC loves rewriting AST nodes inplace as the compilation progresses which sucks when one wants to analyze the AST while it looks as close as possible to the source. GCC parser also sucks to work with as it is implemented as a C code hodge-podge (technical term which applies to much code in GCC). Luckily, I am mainly concerned with poking at data that’s already in GCC.

The upside is that GCC is a well-tested production compiler that most source compiles with. Integrating with GCC means that the AST is correct (Elsa is a frontend so there is no way of knowing if AST has mistakes in it) . Integration also means that the user doesn’t have to worry about making preprocessed files and maintain obsolete versions of GCC or old GCC headers. Unlike Elsa, GCC already has useful features like typedef tracking and doesn’t implement location tracking with a stupid programming trick. Additionally, I hope to reuse computations from from middle-end GCC passes to build my control flow graph, do value numbering and other useful, but tricky to implement stuff.

GCC isn’t scary at all, it’s just another way of implementing a compiler. Some people elect to have more pain in life by electing to reinvent ML in C++ instead of using ML for compiler writing,  others get their pain dosage from working on a C compiler originally generated from LISP sources.

Lastly, I’d like to thank patient gcc hackers in #gcc without whom I wouldn’t stand a chance in figuring out how to get this far.

Recent Progress

December 28th, 2007

Looks like pork is slowly going to get merged back into oink. This makes me happy as it will result in decreased merging headaches and gives more visibility to my work outside of Mozilla. My elkhound changes are already in!

Recently I added support for retaining gnu attributes to elsa and corresponding features dehydra and garburator. Now dehydra can verify things based on attributes and  garburator gained a way to rewrite special cases like classes that are always allocated on the stack. Elsa still drops most attributes, but at least classes, methods and variable declarations are covered.

I also spent a couple of days investigating gcc plugins. Turns out modifying gcc to support plugins is dead easy, but getting anything useful done in GCC requires a steep learning curve. I tried to find how to enumerate all of the toplevel declarations in the source, but I couldn’t find the correct global variable that corresponds to the toplevel scope(aka the Translation Unit?). I have a few more ideas of what to try next. Once I do that, it shouldn’t take much work to make a basic gcc-hosted version of dehydra. There is also a gcc plugin branch hosted in the gcc svn, but I can’t find any example code for it. It isn’t a big deal since none of the plugins I’ve seen mentioned venture outside of intra-function analyses.

I am still pondering on how to tackle rewriting Mozilla to use exceptions. It is the key to improving overall readability/perf of Moz C++, but the logistics of writing the corresponding analyses+rewrites followed by a parallel manual correction step are still making my head spin. All I’m sure about is that the first step to exceptions would be to enable the OOM exceptions and do the corresponding exception safe analysis+rewrite.