Main menu:

Site search

Categories

Archive

TraceMonkey/Fx36 @Hacks

My article on JavaScript performance improvements in Fx3.6 is up at hacks.mozilla.org. The improvements are more detailed this time, nothing so huge as new tracing VM, so I concentrate more on describing what speeds up with a few little code samples and demos.

There’s more than one way to null a pointer

I haven’t blogged in a while because I’ve been in heads-down mode working mostly on getting closures and the arguments keyword to trace in TraceMonkey and investigating topcrashes. But we’ve just solved the most common JS-engine topcrash, so I’m resurfacing with the story, and hopefully some useful tricks and analysis for anyone else working on anything like this.

The Crash Reports. The crash in question is a crash in js_Interpret, which is a huge function with cases for every bytecode in SpiderMonkey. js_Interpret has been placing around 16-19 on the Firefox 3.5 topcrashes list. Looking at the stack traces, I found that a certain crash point (line of code where we crash) accounted for the majority of the crashes:

0  	js3250.dll  	js_Interpret  	 js/src/jsinterp.cpp:4436
1 	js3250.dll 	js_Invoke 	js/src/jsinterp.cpp:1394
2 	xul.dll 	nsXPCWrappedJSClass::CallMethod 	js/src/xpconnect/src/xpcwrappedjsclass.cpp:1697
3 	xul.dll 	nsXPCWrappedJS::CallMethod 	js/src/xpconnect/src/xpcwrappedjs.cpp:569

The line of code on which the crash occurs, jsinterp.cpp:4436, is for the fast path of reading a JS object property.

There was no bug on file with steps to reproduce, or any other kinds of data that would let us directly debug the problem.

Crash Addresses. Clicking on a few of these reports, I saw that the Windows crash address was usually 0×0, 0×4, 0×8, or 0xc. I wrote a script that verified that guess on a larger sample. That pattern told me the crash was probably due to an attempt to read from an array of word-sized elements through a null pointer. Something like this:

// a is NULL
int x = a[i];

If i is usually in the range 0-3, then most crash addresses would be 0, 4, 8, or 12.

If you expand all the macros on the crash line, there is in fact an array read, reading a JS object property as obj->dslots[slot-5]. dslots is a dynamically allocated array of 4-byte JS values that holds object properties after the first two (depending on the exact object type). If the object has few enough properties, then dslots is null.

Minidumps. Using minidumps and disassembling code around the crash point, I was able to prove the crash was in fact caused by null dslots.

Disassembly also showed that about 25% of the crash reports reported the wrong line of code, apparently because the compiler’s profile-guided optimization mangled the x86 code too much for the crash analyzer to handle. This confused me at first, because I had happened to start with a crash report that listed the wrong line.

I also found a few of the local variables’ values using the minidump. The minidump doesn’t have enough information for the debugger to find those variables. I got them by reverse-engineering the x86 code, finding stores to stack locations (stores through ebp), matching them up with variables, and then reading the values out of the minidump. I only got 3 values, and they weren’t terribly informative, but they did rule out certain code paths and special cases, narrowing significantly the kinds of JS objects I needed to think about. For example, the variables showed that the crash was not related to dense arrays (array objects represented by C arrays) or DOM objects.

At this point, I knew that the crash was because dslots was null, and I knew it wasn’t related to certain advanced features, but that’s it. In particular, dslots absolutely shouldn’t be null on that code path, because that is a code path for a property lookup that finds a property in the property cache with a slot greater than 5. Somehow, the property cache was giving us the wrong answer for the current object, but there was no indication why. One possibility is that the property cache had wrong data. Another is that something set dslots to null when it shouldn’t.

Bits and grapes and dates and cakes. Now, if I had steps to reproduce and could debug, one of my first questions would be, “What code sets dslots to null for the object we are looking at?”

Thinking about that, I got an idea so dumb it was clever. I knew I wanted to know the code point that set dslots to null. And I knew the only real piece of data about the crash I could get was the crash address. So, clearly, the answer is to make each code point that sets dslots to null use a different null address. Then, the crash address would give the answer. Like this:

// someplace that sets dslots to null
obj->dslots = (void*) 256;
// some other place that sets dslots to null
obj->dslots = (void*) 512;

If the crash addresses come back 256, 260, 264, 268, then the first point is the one setting dslots to null. If they are 512, 516, 520, 524, then it’s the second point. Of course, it could also be a mixture.

There are a bunch of other details that need to be handled to make this work, like changing code that tests if dslots is null, which can be found in the patches posted in bug 519363.

The reason this works is that the “zero page” (page with base address 0) is protected on all popular operating systems, and the null pointer is defined as address 0. Thus, in normal operation, any attempt to read through a null pointer, or an array based with a null pointer (as long as the index isn’t so large it gets to another page), will crash with a protection fault. But this means that any pointer with address less than 4096 will also crash. By using multiple null pointers, and changing tests for null to comparisons with 4096, we can send back up to 12 bits of information through the crash address. For our bug, we actually only get 9 or 10 bits, because 2 or 3 bits are taken up by the array index part of the crash address. But there were only about 6 places dslots was set to null, so we had more than enough bits.

The first try showed that all the objects for the crash were created by the same function: js_NewObjectWithGivenProto. This immediately (well, after the 1 or 2 days between pushing a patch and having nightly test users experience the crash) ruled out some guesses–we had thought the problem might be that objects were getting GC’d too soon, but the crash addresses proved that all the crash objects were “fresh” and had never been GC’d.

I noticed that js_NewObjectWithGivenProto had a few significantly different code paths related to object creation: (1) a path for “special” objects (like DOM objects), (2) a path for normal objects that are “similar” to their prototypes, and (3) a path for normal objects that are not “similar” to their prototypes. I tweaked the null pointer instrumentation to distinguish these paths, and found that all the crash objects went through path (2). This was very interesting, because path (2) uses special code to set the “shape” of the object, which is also key to the property cache lookup just before the crash.

(Aside: an object’s “shape” is a number that represents its collection of properties (and the attributes of those properties). Two objects that have the same shape are required to have the same properties and the same dslots layout for those properties. See also https://developer.mozilla.org/En/SpiderMonkey/Internals/Property_cache#Shape.)

At this point, my working hypothesis was that somehow, using that initialization path, we ended up with two objects that had the same shape but different layouts. But I had no idea how that could happen.

Software archaeology. Next, I tired a lot of things that didn’t work. It took a long time and it was really frustrating. I also read a bunch of code and got some reasonable guesses as to how the shape problem could occur, but none could be tested through the null pointer trick, but only through more complicated methods that also required upgrades to minidump collection that hadn’t occurred yet.

Eventually, I got a better understanding of the kinds of data I could get from the crash database, and I realized I might be able to find out when this crash had started occurring for trunk nightly users. I dug through all the data and found 3 patches closely related to object shapes and prototypes that were checked in just before the crash started occurring.

The issues with those patches are pretty technical, and some are for possibly security-sensitive bugs, so I won’t go into great detail. But we got a better guess about what was happening and where to look. I posted my findings, then started reading the patches and trying to think about what might be wrong with it and how I could test those things, but I didn’t get any particularly good ideas right away.

Jorendorff. Then, magically, Jason Orendorff figured out the problem. He read the code around one of the patches and found a significant bug. He didn’t tell me exactly why he was reading that code, so I can’t be absolutely sure it was because I pointed at it, but I think it was. :-)

From there, he wrote a patch, which we checked in. And then waited. After 4 days passed and there were no more crashes at this crash point on trunk, I decided it was fixed. Yay!!!

Lessons

I think I learned a few lessons about debugging crashes in the field with only crash report data and basically none of the things you normally hope to have debugging a crash.

One lesson is that it’s important to use all the sources of data available: crash reports, patterns in crash report attributes, minidumps, revision history, bug comments, other articles and docs, source code, and data collected using instrumentation patches. And all these have to be used together.

A second lesson is the null pointer trick, and also the value of being able to push instrumentation patches and get back field data in 1-2 days. This would be much more powerful if we could collect additional information in the crash reports, as controlled by the programmer. (E.g., crash report skidmarks.) I think we can actually do that now with minidumps by printing useful information to buffers allocated on the C stack, which are then part of the minidump.

A third lesson is the value of the history. Being able to see the day on which a bug started and find the patches that went in before that day was huge. One problem we have here is that TraceMonkey generally merges to mozilla-central only every week, so if the problem was introduced in a patch first pushed to TM, there is an entire week of patches that might have caused the problem. We should consider going to a more frequent merge schedule.

A side point is that sifting crash report data and history would be a lot easier if we had query access to the underlying crash data. For now, everything has to go through the web interface, which is limiting (e.g., you can only get back 500 crash reports of a given kind of a given day) and slow (you need to write a script that downloads a lot of crash reports, parses the data, and then run the query you actually care about, so it’s slow in both developer time and computer time).

The final lesson, which may be obvious, is the value of getting more eyes on the problem. I got substantial advice from both Brendan and of course Jason on this. I also benefited a lot from being able to talk through the issues with Andreas, David Anderson, and Luke Wagner. I read the code that contained the bug (which was ultimately a flipped if condition), but I didn’t see it because I assumed that other parts of the code did the right thing so the if would be correct. But they didn’t, and Jason knew it or saw it.

Debugging this kind of crash is all about trying to sort things out that make no sense with almost no data, and it requires both creativity (to think of weird failure modes and experiments to try) and perceptiveness (to see the bug once you’ve narrowed down the search). I suppose creativity and perceptiveness are related in that they are both about being able to see an especially interesting straw in a (sensory or imaginary) haystack. More unique viewers means more chances to see that funny straw.

TraceMonkey @Hacks

I wrote an article on TraceMonkey on hacks.mozilla.org. It’s aimed at web developers and anyone else who wants to understand how TraceMonkey works and what makes it different from other JavaScript engines. The article also talks about what kinds of JavaScript run especially fast or slow in TraceMonkey and some introductory information on how to diagnose and fix TraceMonkey performance problems at the JavaScript level.

I would like to thank Chris Blizzard and Arun Ranganathan for inviting me to write the article. Thanks also to Luke Wagner, David Anderson, Andreas Gal, and Jeff Walden for copyediting and technical editing. Finally, thanks again to blizzard and Arun, and also Eric Shepherd, for doing final proofreading and getting the article online.

PLDI Sampler, part I

I’m back from PLDI in Dublin. The weather was nice and the Guinness was excellent. The research program was also very good. It seemed like this year there was a lot of variety and a different topic selection from previous years. I’m going to blog some notes on an arbitrary (slanted toward things with more likely Mozilla application) sample of talks.


PetaBricks: A Language and Compiler for Algorithmic Choice

Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, Saman Amarasinghe

PetaBricks aims to help programmers produce well-tuned, possibly parallel, high-performance algorithms. The initial focus has been on dense array algorithms, and also sorting as a basic example. In PetaBricks, the programmer can express directly a choice of algorithms and tuning parameters. For example, a programmer can write a sort procedure that offers a choice between insertion sort, n-way merge sort (with n as a tuning parameter), and quicksort. The PetaBricks compiler will then test the algorithm over a range of input sizes and determine the best algorithmic choices. For example, for sorting on an 8-way Xeon, PetaBricks found that it’s best to do a 2-way merge sort until the sublists are smaller than 1420 elements, then quicksort until the sublists are smaller than 600, and finish with insertion sort. Programmers can do this sort of thing by hand, but it’s a ton of work, and has to be redone for every platform.

We at Mozilla have already done a bit of autotuning of TraceMonkey’s “magic number” parameters (such as the number of times to attempt to record a certain trace before “blacklisting” it), and Jesse and I have had discussions about autotuning other parameters in Mozilla. The PetaBricks paper probably has other ideas and insights we could apply in this domain. We could also consider using its ideas more directly to tune Mozilla data structures such as hash tables and string representations.

Staged Information Flow for JavaScript
Ravi Chugh, Jeff Meister, Ranjit Jhala, and Sorin Lerner

This work is designed to protect against attacks made by JavaScript added to web pages by third parties. For example, if site A includes a box with advertising content from site B, the content from B could include JavaScript that makes some kind of attack, or maliciously modifies the main content of A. Site A can in theory prevent this kind of attack by providing an information flow policy and then running a static analysis on the entire web page to verify that the policy is obeyed. For example, site A could specify that JS from other sites is not allowed to navigate away by writing to document.location. Then, site A would provide a static analysis (possibly as a Firefox addon) that would analyze the entire page before executing any JS.

The main problem is that such a static analysis would run very slowly on the client machine–there would be a delay of 10 seconds or more, which is clearly unacceptable. Enter the staged analysis: this paper describes how the author of site A can run a static analysis on just the code from site A, which determines a small set of things that have to be checked about site B, the “residual” check. Then, when the user goes to site A and JS is loaded from third parties, the residual check is run on the third-party JS in about 0.1 seconds. This seems potentially useful on the web and is also a great example of how to partition a static analysis into a long-running ahead-of-time analysis and a fast online analysis.

Efficiently and Precisely Locating Memory Leaks and Bloat
Gene Novark, Emery D. Berger, Benjamin G. Zorn

This paper is about, uh, detecting memory leaks and bloat. More specifically, the paper describes how to make a particular leak detection technique, staleness metering, run fast and produce no false positive leak reports.

As background, the “staleness” of a heap-allocated object is simply the length of time since that object was last touched (by reading from it or writing to it). Thus, objects with low staleness are unlikely to be leaks, while objects with high staleness are probably either leaks (if they will never be accessed again) or bloat (if they are just accessed rarely and there is a way to compact them or release them and recompute them later). The main problem with measuring staleness is that doing it exactly would slow down the program by 100x or so. The main solution that was explored previously was to sample a fraction (say, 1/1000) of memory accesses and estimate staleness from that. The main further problem is that if only 1/1000 of accesses are measured, then a lot of objects may appear stale when they are really not, just because sampling missed the accesses.

This paper proposes a new way of doing the sampling, “data-based” sampling, that produces no false positives. The key mechanism is to use the virtual memory system to protect heap pages so there will be a page fault when they are touched. When a page is touched, the current time is recorded as the estimated time of last touching for every object on the page. Thus, this time provides an underestimate of the staleness of every object on the page, and so objects with high estimated staleness must be stale in reality.

Without any additional cleverness, what I just described would be really slow, because every heap access would incur a page fault. The additional cleverness is that the paper uses a heap allocator that groups objects from the same allocation site and of the same approximate allocation time (and thus age) on the same pages. The paper says that objects that have the same allocation site and age tend to have the same access pattern. Given that fact, it makes sense to classify pages adaptively as hot or cold, based on how recently they were touched. Hot pages are left unprotected and not tracked, so most heap accesses don’t page fault. And this shouldn’t cause the detector to miss many leaks, because the allocation strategy means a hot page probably contains mostly hot objects, which aren’t leaks. Conversely, cold pages are protected and so they are tracked closely, but this doesn’t cause many page faults because cold pages aren’t accessed often.

The authors put all this together into a tool called Hound, which they tested on some standard benchmarks. Unfortunately, it doesn’t appear to be distributed online currently. Running a program under Hound is supposed to increase heap usage by only 10% or so, and execution time usually only a few percent (although more in some cases), so it sounds like it would be great to be able to enable Hound for nightly test builds.

PLDI 2009

Next week we’re taking the TraceMonkey show on the road to PLDI 2009 in Dublin, Ireland. Andreas Gal will of course be presenting our paper Thursday afternoon, and David Anderson (who is back at Moz for the summer) is also coming. Fortunately, the Mozilla 1.9.1 (FF 3.5) blocker list for JS is empty, and hopefully, despite my best efforts, will stay that way. (I set the release tree on fire last night by checking in a typo with a patch to enable one of my blocker bug fixes. Fun.)

Otherwise, I’ve been pretty busy stomping TraceMonkey bugs. I’ve also been doing a lot of work on getting TM to trace accesses to variables defined in lexically enclosing scopes (i.e., upvars/funargs/closures) and the JS arguments construct. A lot of that work is still ongoing or was finished too late to be adequately tested for FF 3.5, so it’s only on trunk, but the first improvement I made, tracing upvars defined off the current trace, is in the release branch.

Finally, Nick Nethercote has been fixing up my TraceVis instrumentation so it can be checked into the main Mozilla tree. Soon it should be possible to enable TraceVis in a standard build. I’m also starting to hear about some other work to make TraceVis more useful for browser builds and longer runs.

TraceMonkey@PLDI

Last fall, we submitted a paper on TraceMonkey to PLDI (officially: ACM SIGPLAN Conference on Programming Language Design and Implementation), one of the top conferences for programming language research. Our paper was accepted, and Andreas Gal will be presenting the paper on June 18 in Dublin, Ireland.

We’re hosting a PDF copy on the blog. And here’s a direct link: TraceMonkey PLDI 2009 PDF.

OSQ: the next 5 languages for the web

In the evening at the OSQ retreat, we had some informal discussions about new scripting language designs and languages for mobile devices. The starting points for these discussions were (a) that scripting language programmers will soon want to use parallelism, and (b) that mobile devices will have uniprocessor performance at least 12x slower than laptops for the foreseeable future.

New JS VM techniques (like TraceMonkey’s) are getting us a 2-4x speedup compared to what we’ve been used to for a language like JS. Is that good enough for mobile? I don’t know. Bill’s results (in my last post) suggest that one way or another, we can get a further 2-3x, which might bring us to a 4-12x speedup compared to the old JS stuff. That might be good enough, or maybe not.

The next question in my mind is, if we want to someday use some new language to get a bigger speedup, what should it look like? The first thing is that it should have a reasonably good translator to JS, so that it can run on any browser, even if the fancy new language is not supported. After that, I think two interesting starting points are Ocaml and Lua.

Ocaml is a high-level typed language with lists, garbage collection, and a lot of other features to make life easy for programmers. The biggest problem is the types: many programmers seem to prefer untyped languages like Python and JS, and types make prototyping more difficult. The types are also a great advantage: they really help with reliability and performance: in shootout-type benchmarks, Ocaml achieves performance very close to C.

Lua is a scripting language in the vein of Python and JS with simple, well-documented semantics. This simplicity makes it possible to deploy a tiny Lua interpreter. It also helps with performance: Lua has a small, general set of hooks for metaprogramming, while Python and SpiderMonkey have several such mechanisms, which has a cost in complexity and performance. Also, Lua doesn’t include complex features like inheritance in the base language, which presumably makes it a lot easier to optimize property accesses.

A related question is how to expose parallelism. I tend to favor mechanisms that don’t share mutable state, such as DOM worker threads or map-reduce. One reason is that in scripting languages, property accesses are not at all atomic (they potentially require multiple hashtable lookups) so every property access has to be locked. Even with a lot of clever implementation, the locks are still pretty costly, and even for programs that are not actually concurrent: we think on the order of 10-30% in SpiderMonkey. (And Python TBMK still has the global interpreter lock, and the official recommendation is to use multiple processes as a better way of getting real concurrency.)

Parallelism is easier in functional languages like OCaml, because they don’t mutate state very much–mostly they just create new values and read them. Also, in a typed language, a property access is much more like a simple memory read, which is easier to make atomic, or atomic enough.

So, I would like to see at least 2 new web languages, which could be JavaScript dialects, evolutions of other existing languages, or something entirely new. The first, which might be called MiniJS, would be an untyped scripting language, which would be used for most applications. MiniJS would look like JS with simplified semantics (and no ‘with’ statement for sure) and support for concurrent programming. The other language, which could be called TypedJS, would be a typed functional language inspired by OCaml, Scala, and perhaps even the dreaded ES4. TypedJS would be used for applications with stringent performance and reliability requirements. The two languages should be able to communicate but I don’t think they need to be mixed freely: in fact, making them part of the same language would probably make it too hard for each language to realize its full potential.

OSQ: Dynamic language optimization

Random bits from the second half of yesterday’s OSQ events:

Bill McCloskey gave a quick talk on his experience doing some optimizations to Python. (From a VM optimizer’s point of view, Python and JS are very similar.) He had previously used whole-program type inference to drive a translation to C, and in this way got performance about a factor of 2 or 3 worse than C, which is excellent for a dynamic language. But he said that the inference process was kind of slow, so he wanted to try some lightweight techniques to see how far they would go.

First, he tried something he called “dictionary hints”, in which he used a static analysis to guess the address of the result of a hash table property lookup (as in ‘o.f’). He modified the interpreter to look there first, and only if that fails do the general hash table lookup. That gave him a nice speedup (20%+). IIUC, dictionary hints are the static analogue of property caching, as done in SpiderMonkey (and Nitro and V8, I believe).

Bill also tried replacing Python’s basic value representation of a pointer to the heap with a tagged value that keeps integers inline, thus avoiding the extra memory load and a lot of extra memory allocations. He got a big speedup (3-4x) from that. SpiderMonkey also already uses this technique, as do Nitro and V8.

In summary: he found these rough levels of run time relative to C in practice:

  • 2-3x for Python compiled by Dynamite (his type inference-based compiler)
  • 4-10x for Python run with psyco (a dynamic specializer somewhat related to TraceMonkey)
  • 20x for normal Python

I haven’t looked at JS vs. C lately, but I believe TM is currently on the order of 4-10x vs. C, like pysco. In theory, dynamic compilers should be generate code that runs at least as fast as static compilers, because they have strictly more information. But in practice, static compilers seem almost always to be faster. In this case, Bill’s static compiler can generate very fast code, but I think it takes a few minutes to do the type inference, so it would be beneficial in a dynamic compiler only for a very long-running program. (If you have 2 cores, the VM could start the compiler at the same time the program starts running, and as long as the program runs for several minutes, it would eventually speed up greatly.) But at least on the Web, programs that need even 1 minute of CPU time are still rare. But this may change.

Berkeley OSQ Retreat

I’m at the Berkeley OSQ (Open Source Quality project) Retreat in Santa Cruz right now representing Mozilla. It’s an annual event where professors and grad students present their latest research results and ideas.

There’s a lot of good stuff here, so I’m just going to blog about a few things that seem particularly relevant to the web. Juan Caballero just finished his talk, Secure Content Sniffing for Web Browsers, or How to Stop Papers from Reviewing Themselves. He explained content-sniffing XSS attacks and told us about his work with Adam Barth and Dawn Song on analyzing website and browser vulnerabilities to those attacks, and their recommendations, which have been adopted by the HTML 5 working group.

He also mentioned some binary program analysis (dynamic, static, and “concolic” hybrid) frameworks their group is using, which they call BitBlaze.

Google Summer of Code: Rodrigo Sol

A little while ago, Rodrigo Sol at the Federal University of Minas Gerais submitted a Google Summer of Code proposal to create a new register allocator for TraceMonkey. I’m pleased to announce his proposal was accepted!

Rodrigo’s project idea is to implement Register Allocation By Puzzle Solving, a paper in PLDI 2008 by his advisor, Fernando Pereira, and Jens Palsberg. The paper describes a register allocation algorithm based on a model of fitting puzzle pieces (program values) into a board (the register file). Their results show allocation speed equal to the fastest good allocators (such as LLVMs) and also high performance for the generated code (in the register allocation aspect), which sounds like exactly what we’d love to have in TraceMonkey.

The end result will be a new register allocator for nanojit, TraceMonkey’s back-end code generator. Initially, I’m hoping Rodrigo’s allocator will reduce the number of spills and register-to-register moves, which should improve performance by an as yet unknown amount on all TraceMonkey applications. It should also help other applications that use nanojit, such as the simple regular expression compiler I coded up a while back. (Side note: if anyone out there is looking for a compiler-related project, extending and/or redesigning that compiler might be fun.) Later on, I hope the new allocator will help enable other TraceMonkey advances, such as carrying values in registers across trace loop edges and from one trace to another.

Rodrigo has a wiki for his project, including a blog and links to related articles and papers.