27
Jan 12

DataShrink work

Over the last month, I’ve been working on some patches to address the relocation issues I’ve blogged about and just general space wastage in libxul.  The upshot is that Firefox 13 will have shaved almost 100K of data and relocations for smaller binaries and slightly faster load times:

  • Bug 717311 was the big one; we were wasting 70K+ of space in some Unicode tables due to structure padding.  Fixing this was just a matter of using the right bitfield datatype so the compiler would pack things correctly.
  • Bug 717501 was much the same: use smaller datatypes when the data you have fits in them.
  • Bug 711563 was more in line with the invasive relocation changes discussed earlier.  The patches there rearranged the storage for the stub names to avoid relocations and generally packed things more efficiently.

Not going in this cycle, but worthy of mention is bug 704848 for rearranging the effective TLD name table to avoid relocations; that bug will save 40-50K of space in data and relocations.  Jason Duell and I talked on IRC yesterday and while he’s in favor of the idea, he’d like to see if gperf makes things any better.  No sense in constructing a table at runtime if you can construct it at compile time!

Are there more instances of things that could be compressed like this?  Yes, but the savings from them are likely to be much smaller on a case-by-case basis:

  • Anything that uses nsStaticCaseInsensitiveNameTable could be tweaked to use less space and relocations by providing the entry data in a different format.
  • nsElementTable.cpp:gHTMLElements could probably be rearranged for the same sort of savings.
  • Initialization of atoms could stand some rearrangement for relocation reduction, at the cost of some ugliness elsewhere.
  • YARR, used by the JS engine, has some fabulous bit vectors represented as character arrays; squashing those would shave 100K of data, but would likely be tricky.
  • Eliminating null entries from the tables in table-driven QueryInterface methods would help, possibly at the cost of some extra parameter traffic to NS_TableDrivenQI.
  • Actually, rewriting IIDs for XPCOM interfaces to fit in 32 bits could save a number of relocations in the tables for the aforementioned methods.
  • The tables required by PR_ErrorInstallTable are breeding grounds for relocations; changing them is unlikely to happen anytime soon.  (Those tables do account for a significant chunk of the relocations in libnss and libnspr, though.)
  • Any number of places where we use pointers in constant data structures; there are quite a few of them, but converting each one saves 30-50 bytes each.  Best to save these for when you are really bored.

All of the above might amount to 200K of data+relocation savings.

Really, though, there’s not that much to trim (or perhaps what is left to trim is decidedly non-trivial to trim).  If you do something like:

readelf -sW dist/bin/libxul.so | grep OBJECT | awk '$3 > 1000 { print }' | c++filt | less

in your objdir on a Linux-y system, you’ll see that quite a lot of the largest objects come from tables for character conversion or character detection.  However, the authors of said code were already conscious of the space required by these tables and they have tended to use the smallest datatypes necessary.  vtables make a number of appearances (hard to get rid of at the moment).  There are also some tables for media codecs, which presumably are difficult to trim down.


26
Jan 12

compressing strings in JS

As we kept increasing the amount of information we send in via Telemetry, we need to start thinking about how to keep the size of the ping packets containing that information as small as possible. The packets are just JSON, so the first thing to try is to compress the data with gzip prior to sending it.

This is how you compress a string in a language like Python:

compressed = zlib.compress(str)

(Yes, yes, this is not gzip compression.  Close enough for pedagogical purposes.)

Short and simple. Boy, I hope it’s that easy in JS. Hm, let’s see, there’s this nsIStreamConverter interface, that looks promising:

let converter = Cc["@mozilla.org/streamconv;1?from=uncompressed&to=gzip"].createInstance(Ci.nsIStreamConverter);
let stream = Cc["@mozilla.org/stringinputstream;1"].createInstance(Ci.nsIStringInputStream);
stream.data = string;
// Hm, having to respecify input/output types is a bit weird.
let gzipStream = converter.convert(stream, "uncompressed", "gzip", null);

OK, we wound up with a stream, rather than a string, but that’s OK, because nsIXMLHttpRequest.send will happily accept a stream. So, nothing to worry about. (This is a little white lie; please hold your comments until the end.)

Hm, that doesn’t seem to work. I get NS_ERROR_NOT_IMPLEMENTED. Oh, look, nsDeflateConverter doesn’t implement nsIStreamConverter.convert. In fact, none of the stream converters in the tree seem to implement convert. What a bummer.

Hey, here’s nsIStreamConverterService! Maybe he can help. His convert method just punts to nsIStreamConverter.convert, so that won’t work, though. Ah, nsIStreamConverter has an asyncConvertData method, let’s try that:

function Accumulator() {
  this.buffer = "";
}
Accumulator.prototype = {
  buffer: null,
  onRequestStart(request, context) {},
  onRequestStop(request, context, statusCode) {},
  onDataAvailable(request, context, inputStream, offset, count) {
    let stream = Cc["@mozilla.org/binaryinputstream;1"].createInstance(Ci.nsIBinaryInputStream);
    stream.setInputStream(inputStream);
    let input = stream.readByteArray(count);
    this.buffer += String.fromCharCode.apply(input);
  }
};

let accumulator = new Accumulator();
let converter = Cc["@mozilla.org/streamconv;1?from=uncompressed&to=gzip"].createInstance(Ci.nsIStreamConverter);
// More respecifying input/output types.
converter.asyncConvertData("uncompressed", "gzip", accumulator, null);
// Oh, that method doesn't actually convert anything, it just prepares
// the instance for doing conversion.
let stream = Cc["@mozilla.org/stringinputstream;1"].createInstance(Ci.nsIStringInputStream);
stream.data = string;
converter.onRequestStart(null, null);
converter.onDataAvailable(null, null, stream, 0, string.length);
converter.onRequestStop(null, null, 201 /* 417 */);
compressed = accumulator.buffer;

Well, it’s not as simple as I hoped for, but I guess it works.

FWIW, I do understand why the input/output types have to be respecified.  But I think this is about the best way to do things currently; that’s kind of frightening. The above is one of those instances where you start to understand why people complain about things being so crufty.


13
Dec 11

invasive changes for relocation reduction

We have a lot of relocations in our libraries; almost 240k in libxul alone. Many of these relocations are due to the massive number of vtables that we have; every function pointer in a vtable requires a runtime relocation. And you can find function pointers in XPCOM CIDEntrys, function tables for PLDHashes, and so forth, which are certainly convenient.

But for other kinds of data, such as tables of strings, or tables of structures that contain strings, relocations are entirely avoidable if you’re willing to restructure things. The idea is not new; it’s been described in Ulrich Drepper’s manifesto for writing shared libraries, it’s been used for data structures in compilers and otherwise, and it’s often used in pickling data to and from disk.

The basic idea is that instead of storing pointers to data, you store offsets into some table of data. For instance, if you had:

static const char *errors[] = {
  "no error",
  "on fire",
  "timed out",
  "pebkac"
};

const char *geterr(int i)
{
  return errors[i];
}

then errors itself would be 16 bytes (assuming 32-bit pointers), plus the actual string data (34 bytes), plus the relocations for each entry in errors (32 bytes on Linux/x86, more or less depending on your platform), for a total of 82 bytes.

If instead you had something like:

static const char errtable[] = { "no error\0" "on fire\0" "timed out\0" "pebkac\0" };
static const uint8_t erridx[] = { 0, 9, 17, 27 };

const char *geterr(int i)
{
  return &errtable[erridx[i]];
}

you’d only need 34 bytes for string data plus 4 bytes for the index table, or 38 bytes total, for a savings of more than 50%. You do pay an extra instruction on each call to geterr as well; maybe you want to count that in the byte count if you’re being pedantic.

If you have character pointers embedded in structures, the savings can be more dramatic, because offsets can be smaller than pointers, which may enable you to pack structures better. For instance, if you have:

struct foo {
  const char *s;
  bool b1;
  bool b2;
};

sizeof(struct foo) would be 8 on a typical 32-bit platform, due to padding after b2. Using an offset scheme as described above, you now have:

struct foo {
  uint16_t offset;
  bool b1;
  bool b2;
};

and now struct foo is 50% smaller because padding is no longer required.

Obviously, forming the erridx array above can be tedious; Drepper’s paper linked above describes how one might use the C preprocessor to avoid such manual construction. His construction relies on the C compiler supporting string constants of arbitrary size; MSVC limits us here, as the maximum size of a string constant (post-preprocessor concatenation) is 65K bytes. While there are ways around this limitation, the limitation is not so bad as one might think: using 32-bit indices doesn’t make us much better off than we were before (it does avoid the relocations, which is worth something) and the compiler can now warn us if we’re about to walk off the deep end with 16-bit indices.

There are a number of places in Mozilla that could benefit from such reorganization: the effective TLD table (the above example about structure packing is taken from there), various tables in the CSS and HTML parsers, and so forth. Shared bits like constructing AtomTables could also have their interfaces changed to not be quite so pointer-heavy. Quickstubs could benefit from string table reorganizations and less pointer-ness. The list goes on, I’m sure.

I have a proof-of-concept bug open for nsEffectiveTLDService.cpp; I have been hesistant to dive deeply into other places; for one, the above bug doesn’t seem to have been touched for several weeks, and for two…

Is all this really worth it? I think it is; smaller binaries and faster startup times are a good thing in my book. I realize we have elfhack on mobile for reducing the size and cost of relocations; I say it’d be better to not pay for them in the first place. We also support other platforms where tricks like elfhack don’t work. But what do you think?


11
Nov 11

linker hacking for improved code locality

Hm, haven’t posted in a while.  Having your tonsils out will do that!

Mike Hommey recently blogged on some binary layout issues he’d run into with Thumb-2 binaries.  For the last week or so, I’ve been working on one of those problems: Thumb-2/ARM trampolines.

memcpy calls, how do they work? After all, code in your executable or shared library can’t directly call memcpy, since the address of memcpy is unknown until runtime. You can handle this in various ways; the solution adopted for ELF systems is something called a procedure linkage table, or PLT. (The mechanisms on Windows and Mac are similar, but the details are probably somewhat different; I’m not familiar with them.)

The PLT is simply an extra layer of indirection. Each entry in the PLT has a corresponding entry in the global offset table (GOT) associated with it; the GOT entry is the address of the actual function the PLT entry should call. At load time, the GOT entries are set to the address of the entry point of the dynamic linker’s lookup function.

When you call memcpy, then, you’re actually calling the PLT entry associated with memcpy. The first time you make such a call, the dynamic linker is invoked to figure out where memcpy actually lives. Once it has done so, it updates the associated GOT entry with that address and calls the “real” memcpy. Subsequent calls to memcpy then find the actual address in the GOT and jump directly there, bypassing the dynamic linker.

(glibc on Linux plays elaborate games so that calls from libc to libc functions don’t go through the PLT; the details of doing so are best left for a separate blog post.)

OK, so onto the problem description. PLT entries on ARM Linux are ARM mode code. Of course, it’s perfectly valid for Thumb code to want to use these PLT entries; it’d be silly to have separate PLT entries for ARM calls and Thumb calls. For non-tail calls, Thumb code can just use blx, which switches to ARM mode and jumps to a given address. But for tail calls, there’s no jump-to-offset-and-switch-mode instruction (bx jumps to an address in a register, not an immediate offset). So some cleverness is required.

In the gold linker, the solution taken is general: whenever we have such a Thumb-to-ARM branch (whether to a PLT entry or otherwise), generate a small stub of code to do the mode switch and branch, then branch to that. Such stubs are generated for various other similar cases, so the linker was already doing such things anyway.

There’s two issues with this approach. The first is that the stubs take up extra space, then second is that they’re placed wherever the linker thinks is convenient, which might not be the best place for improving the layout of the compiled code. (The linker does put some effort into reusing stubs, so if you have several Thumb-to-ARM calls/jumps to the same address, the linker will reuse a stub, rather than duplicating it willy-nilly.) The second is the real dealbreaker here.

The GNU linker’s approach is more clever: it embeds the necessary stub into the PLT entry itself. Whereas the entry previously looked like:

 22c:	e28fc610 	add	ip, pc, #16777216	; 0x1000000
 230:	e28cca08 	add	ip, ip, #32768	; 0x8000
 234:	e5bcf080 	ldr	pc, [ip, #128]!	; 0x80

it now looks like:

 228:	4778      	bx	pc
 22a:	46c0      	nop			; (mov r8, r8)
 22c:	e28fc610 	add	ip, pc, #16777216	; 0x1000000
 230:	e28cca08 	add	ip, ip, #32768	; 0x8000
 234:	e5bcf080 	ldr	pc, [ip, #128]!	; 0x80

(This works because bx pc actually branches to the PC of the bx insn + 4: i.e. 0x22c.)

So direct jumps from Thumb code use the entry point at 0×228, whereas all other calls and jumps from ARM or Thumb code use the entry point at 0x22c. This method is smaller (4 bytes for the PLT entry stub versus 16-ish bytes for the out-of-line, imperfectly located stub) and plays nicely with section reordering.

We could use GNU ld for our linking needs, but gold has several very nice features (performance, identical code folding, section ordering, incremental linking, etc.) that make its use desirable. So the last week of my time has been spent learning gold well enough to add these small Thumb stubs to PLT entries when necessary. gold can now do this, woohoo!

I believe my code is suitable for using in Mozilla builds. Before being submitted upstream, however, there need to be some kinks worked out with regards to incremental linking: gold’s incremental linking support assumes that all PLT entries are the same size. The above modifications mean that some will be 16 bytes and some will be 12. (A possible cheesy hack would be to make all stubs 16 bytes when incremental linking…) It might also be worthwhile exploring a more general solution for stub positioning so that stubs play nicely with section reordering.


17
Oct 11

reordering plugins and edge profiles

Mike Hommey and I have been working with a linker plugin that uses the callgraph to reorder functions on Linux and Android. One feature of the linker plugin is that it dumps a simplified representation of the callgraph to a text file. The simplified representation is really simple, just edge counts for all the edges in the callgraph. It’s not perfect, since the ordering of caller and callee in the dump file is not always consistent. Also, you get a few spurious functions in there, like __builtin_expect and SSE builtins calling things.  (__builtin_expect just provides a hint to the compiler about how to order basic blocks for more cache-friendly control flow; SSE builtins should compile down to single/few instructions and never actually call something at runtime.)

Nevertheless, looking at the file for libxul.so can be illuminating. Actually, it’s probably more illuminating to narrow things down to the edges with 100k+ call counts and demangled function names. Doing that, we can see that:

Anyway, maybe people have already seen all this stuff, but I certainly hadn’t.


29
Sep 11

pgo startup times with syzygy

In an effort to confirm that we do want all this syzygy goodness in our release builds, I’ve been testing out syzygy on PGO builds (since we do PGO builds on Windows for releases). After removing the PEBKAC and getting a proper PGO build–which took depressingly long–I have mixed results.

First, the good news. On my laptop (Win7, Core i7, SSD), the about:startup numbers look like this:

Version main sessionRestored firstPaint
Base PGO build 265 3152 3012
Optimized PGO build 234 2778 2653

These numbers are really encouraging; they’re actually even better than the initial numbers I posted earlier.  (Though I note that they are universally slower than the earlier numbers…hm….)

There is one curious thing about these builds, though. When you look at the page fault numbers, they suggest a much different story. The (cold) numbers are what you get from starting Firefox just after a reboot; the (warm) numbers are from a second startup after the first.

Version Hard faults (cold) Soft faults (cold) Hard faults (warm) Soft faults (warm)
Base PGO build 2507 41219 8 26100
Optimized PGO build 2264 41488 14 23017

These numbers are totally contrary to what I saw with non-PGO builds. We’re not consistently lower in the optimized build on either measure. I honestly haven’t thought very hard about what this means yet.

Anyway, that’s the good news. The bad news is that on my desktop (Win XP, Core 2 Quad, mechanical drive), the about:startup numbers look like this:

Version main sessionRestored firstPaint
Base PGO build 1516 8984 8813
Optimized PGO build 1437 9187 8828

(I don’t have the necessary profiling tools on my XP box for doing page fault analysis. I shouldn’t think they’d differ dramatically between the two systems, though.)

This is a little discouraging. The syzygy-optimized build is a little faster off the line, but gets edged out by the base build in getting to the points that actually matter. I haven’t thought terribly hard about these numbers, either. One possibility is that I did turn off the XPCOM glue preloading bits, which IIUC correctly are helpful for encouraging XP to keep its hands off your binary’s startup time. Doing that was necessary for getting postlinking to work properly. If I made that runtime-configurable, then I could run tests with the preloading enabled and see if we win there.

Bottom line: We would win on leading-edge machines, but we wouldn’t see a lot of benefit on older machines.

Also, if Microsoft would add a drop_caches lookalike, that would be fantastic.


20
Sep 11

startup reduction times with syzygy, part 2

In my previous post, I presented some startup timings with syzygy-optimized Firefox binaries.  People asked whether those timings were on my work laptop (quad-core i7, Windows 7, SSD) and I confirmed they were.  Since those timings were encouraging, but not overwhelming, folks suggested that I might try timing things on a more conventional system.

Below are results from testing on my desktop (quad-core Core 2 Duo @ 2.6GHz, Windows XP, 7200ish-rpm hard drive):

Version main sessionRestored firstPaint
Trunk build 1484 11515 11125
Optimized build 2562 8812 8703

That’s quite a difference: about a 25% win just from reordering functions.  Much more exciting!


19
Sep 11

startup reduction times with syzygy

People that I’ve told about the work with syzygy that I’ve been doing have, almost universally, two reactions:

  1. That’s cool!  (Thanks; I did very little work for it!)
  2. How does that translate into startup time?

Assuming that a 40% reduction in page faults leads to a 40% reduction in startup time is not reasonable, but surely there should be some reduction in startup time, right?  I finally benchmarked this today using the about:startup extension; these numbers are from cold start, freshly rebooted both times:

Version main sessionRestored firstPaint
Trunk build 125 1733 1671
Optimized build 125 1639 1577

So a 40% reduction in page faults translates into ~6% reduction in startup time; not stellar, but not too bad either.

The next step is making sure this all works with PGO builds on Windows. Then we get to have a discussion about whether to incorporate this into the regular builds and getting all the infrastructure on build machines.


17
Sep 11

notes from the all-hands profiling bof

After talking to a couple people at all-hands, it became clear that writing your own profiler was a popular activity.  (Jeff Muizelaar informed me that last year, the pet project was heap analyzers.  What’s next for 2012?)  A short, non-exhaustive list:

…and so forth.  So it seemed like getting interested parties together to talk about One Profiler to Rule Them All would be good.  And it worked; we had probably 20 people come to the BoF.  Below is my summary/recollection of what we discussed..  All the omissions and/or misrepresentation in the notes are my own; please leave a comment if you felt I left something out or need to describe something better.

What does the ideal profiler look like?

  • Cross-platform
  • Low overhead, both when profiling and not
  • Collects more-or-less complete call stacks (this is fairly easy everywhere except x86/Linux)
  • Understands compiled/interpreted JavaScript and C/C++ code
  • Built with the browser itself, not an external process or loaded via LD_PRELOAD or similar; this means we can ship it with the browser and diagnose problems in the field
  • Pretty pictures for viewing collected data.  Who doesn’t love pretty pictures, right?

Bas Schouten also pointed out that it might be much more efficient to just buy suitable profiling technology from a vendor; there was some skepticism that a profiler fulfilling the desiderata existed, though.

Since Benoit gave a demo of his profiler and his profiler seems likely to get into the tree, most of the discussion centered around that or variantions thereof.  Benoit’s profiler works in the usual way via periodic signaling; however, instead of unwinding the call stack, Benoit chose to require annotations (via RAII objects) placed in various parts of the code to indicate what your “callstack” was when a sample was taken.  I believe Steve said his profiler does something similar for JavaScript.  I put “callstack” in quotes because you only get to know whether functions were on the call stack if annotations had been placed in them.

Sprinkling annotations all over the tree sounded like a tedious process.  Somebody pointed out that Chrome does this, though they only place annotations at “entry points” for modules, so you might have one entry point for layout, one for graphics, etc. etc.  That way, given a profile on some random performance bug, you can at least tell who should be exploring the bug further with minimal overhead, since you’re not unwinding and a handful of RAII objects isn’t going to cost much.  Granted, this doesn’t do much for the folks who need to dig deeper, but perhaps we can have other tools for that.

There was some discussion of unwinding instead of annotations.  Unwinding is reasonably cheap when using libunwind and caching decoding of the unwind information; it’s even cheaper when you can just walk frame pointers.  The only wrinkle is you aren’t guaranteed to have frame pointers or unwind information on x86/Linux, so unwinding is not generally doable there.  Sometimes assembly coders also forget they need to insert unwind annotations, though most if not all of the problematic code in glibc, at least, has been so annotated.  Taras Glek suggested that we could insert RAII objects before calling out to code that we don’t control and to make those objects record something about frame/stack pointers so that we could unwind around third-party code if necessary.  I don’t believe we came to a consensus on using unwinding instead of or in addition to annotations.

I can’t recall if we talked extensively about unwinding through JavaScript code.

We didn’t talk about displaying the collected data.  Drawing pretty, understandable pictures is hard.

Benoit and Steve agreed to talk further offline about modifying Benoit’s profiler to understand JavaScript; if you’d like to help with profilers, talk to either of them.  It’d be great to have something in the tree that we can all work with and improve.


08
Sep 11

fewer page faults with syzygy

In my last post, I explained what Syzygy was and discussed some preliminary results from using it with Firefox.  I finally have results of running the whole Syzygy toolchain (instrumentation/profiling/reordering) and some numbers to share.

First off, I mentioned that I could get the call tracing to work right.  That was because I hadn’t installed Syzygy’s counterpart, the Sawbuck log viewer.  That’s the bit that contains the trace provider for the call tracer; you can download Sawbuck or you can install the src/sawbuck/Debug/sawbuck.msi module from your own build.

Secondly, these numbers need to be collected with a change to browser/app/nsBrowserApp.cpp to turn preloading off; otherwise the preloading of xul.dll will defeat the tracing mechanism.  With an eye to eventually getting this into the Mozilla build process, I’m not sure how well that bit will work out; perhaps preloading could be made dependent on an environment variable.  We can then turn off preloading during the build + postlink and get the benefits of a properly laid-out xul.dll without hacks in the source.

With my laptop (Windows 7, Intel SSD), a cold startup of a trunk build of Firefox results in about 2300 hard page faults and 43000 soft page faults.  These terms are what Windows uses; other operating systems call them major and minor page faults, respectively.  In any event, hard page faults are what we care about, because those result in talking to the hard drive.

After running the optimize program that comes with Syzygy–which automates the running of the instrumenter, the collection of profile data from the application, and reordering the library(s) in question–cold startup results in about 1400 hard page faults and 27000 soft page faults.  Like the Chrome folks saw, this is about a 40% reduction in hard page faults; soft page faults in this scenario are reduced to about what you’d see with warm startup.