Feed on
Posts
Comments

Fragmentation: SQLite & Friends

I am happy to report that the SQLite fragmentation problem is now solved. I copied my profile a month ago, and my places.sqlite is still in a single fragment! There was a similar fix done to Firefox disk cache. Thanks to helpful comments on my OSX preallocation cry for help, we now preallocate efficiently on OSX too.

Startup cache is the last remaining bastion of fragmentation, but that’s already 10x better than it was a month ago. I have two complimentary solutions for that: either omnijar startup cache generation for core code and/or write the cache more efficiently.

Firefox 4 will be a lot more gentle on those spinning platters.

GCC

I helped Jan Hubicka on a GCC summit paper. Those nasty static initializers will not be a hassle in GCC 4.6!

I keep wanting to blog about how we switched to GCC 4.5 and how life is wonderful, but life didn’t work out this way. So far we tried switching away from 4.3 compiler three times. The first time GCC completely failed in terms of -Os performance. C++ -Os is more bloated in 4.5 (because that option is benchmarked on C apps). Then it turned out that libffi was being miscompiled (we also found a related bug in libffi). Last time, we tried switching to GCC 4.5 + -O3 since that performs much better than -Os, but that broke sunspider. Hopefully we can fix the sunspider issue and try again next week. I would really like to utilize GCC PGO to produce fastest possible Linux Firefox builds.

Nonetheless I happy with recent GCC progress. With Jan’s help, GCC will eventually be very good at compiling Mozilla. In my spare cycles I’ve been working on setting up GCC benchmarks using Mozilla to help avoid future surprises like we discovered in GCC 4.5. More on this later.

Diagnosing Slow Startup

I spent last week trying to reproduce slow startup on Windows. Some users were reporting >30 second startup, supernova_00 has been feeding me xperf traces on IRC reproducing slow startups.

Startup Bugs

Turns out that if a website uses non-standard font names this can trigger Firefox to start parsing every single font on the system, freezing the browser in the meantime. Turns out facebook does this :( . This is now a blocker bug 600713. This bug has the unfortunate effect of overshadowing any startup improvements in Firefox 4.

We have some code to keep our databases in good shape by VACUUMing them. This is getting revamped in Firefox in bug 541373. In the meantime, for many current users performance suffers due to missing vacuums. If you are suffering from slow Firefox startup, and/or slow Awesomebar try this manual vacuum. This helps in older Firefox releases, but in Firefox 4 this has the effect of supercharging the SQLite database performance by switching to 32K pages.

Scareware

Another fun discovery was the effect of anti-virus software(AVG in this case). Like an annoying pet, AVG has to have a sniff and fondle every file that Firefox opens on startup. Apparently this is a feature called on-demand scanning, yuck.

But the fun doesn’t stop there, Windows has a wonderful prefetch mechanism that speeds up app startup. Unfortunately for supernova_00, \Windows\Prefetch just wouldn’t get populated with Firefox info, meaning that Windows wasn’t optimizing Firefox startup. Once I installed AVG, I ran into the same problem. Uninstalling AVG didn’t help. For whatever reason deleting every file in \Windows\Prefetch fixes that problem. For both of us prefetch got repopulated after being cleaned.

XPerf

Microsoft XPerf makes trivial to optimize cold startup. None of the other OSes have precanned analyses showing how much each individual file access is contributing to slow startup.

If you have a startup problem, I’m much more likely to be able to reproduce it if the report comes with an xperf trace. To get xperf run the Microsoft Platform SDK installer, select “Windows Performance Toolkit”.

To record an IO trace:

  1. Reboot
  2. run cmd.exe as Administrator
  3. xperf -on latency+FILE_IO+FILE_IO_INIT+DISK_IO
  4. run Firefox, reproduce the bug
  5. xperf -d report.etl
  6. Run xperf report.etl to view the report.

Click on “IO Counts” or “Hard faults” graph, select “Summary Table”. “IO Time (ms)” is the interesting column there. To get an idea of the sequence of IO operations, export the summary table to .csv and load it in a spreadsheet/grep/whatever. Every Firefox developer should give xperf a try, addon authors are encouraged too.

Firefox 4: jar jar jar

Opening files is relatively expensive. There is a small syscall overhead and a higher overhead of fetching data from disk. Depending on physical data layout and disk type, this can leave modern CPUs twiddling their thumbs for a long time while the disk skips around fetching all of the different file pieces.

Optimization #1: Fewer naked files

About two years ago I started gathering naked files on disk and shoving them into jars (eg bug 508421). We made jar reading as efficient as possible by cleaning up code and switching to mmap. Eventually all application data files read from disk during “normal” startup ended up in jars. Unfortunately we ended up with four jars (toolkit, chrome + 2 locale jars), which felt silly. Due to limitations in XPCOM, a lot of naked files were still read from disk on version upgrades and extension installation.

Optimization #2: One jar to rule them all

Recently Michael Wu unleashed a can of omnijar whoopass. This was a massive effort driven by Android packaging requirements. Now application startup data is always being read from one file. This implies better data locality, less seeking, less waiting. One benefit of packing files tightly is that the OS speculatively reads data from disk in chunks that are usually larger than what the application requests. This makes reading nearby files free. Unfortunately there was no good way to predict the order that files will be accessed in without actually running Firefox, so there was more room for improvement.

Optimization #3: Optimized jar layout

So now that all of our data was in one file, the next logical step was to pack it intelligently. The only way to do this is to profile Firefox startup and then order the jar according to that. Unfortunately even once one lays out all of the jar entries sequentially we were still doing our io suboptimally. This was due to the fact that the zip index (jars are zip files) is traditionally located on the end of the file. Wikipedia entry has pictures to illustrate this.

In order to maximize readahead benefits and minimize disk seeks it would be nice to have the file index in the front of the file. So I changed our zip layout from

<entry1><entry2>…<entryN><central directory><end of central directory>

to

<offset of the last entry read on startup><central directory><end of central directory><entry1><entry2>…<entryN><end of central directory>

So all I did was change the offset in <end of central directory> to always be 4 (it can’t be 0 because anal zip programs balk at “NULL” central directory offsets). Then I added a second identical <end of central directory> entry to keep the the rule that the central directory is always followed by one. I also used the extra space forced upon me by overly vigilant zip programs to store a number indicating how much data we can preread on startup.

This yielded a 2-3x reduction in disk io over an unoptimized omnijar. This is on top of a >30-100x reduction achieved by going from naked files to omnijar.

The downside of my interpretation of the zip spec is that some zip programs expect zip files to be more rigid than the spec allows. Older versions of Firefox, Microsoft zip support in windows, WinRAR, unix zip programs, etc accept my optimized jars. 7zip, broken antivirus (it’s a security risk to be overly picky) fail.

Trivia: this isn’t the first time we got tripped up by picky zip reading code. For example, the Android apk reader irritatingly insists at having a zip entry at byte zero of an Android package. This means that one can’t use apks to do the Android equivalent of self-extracting .exe files on Windows. Michael Wu is writing a custom library loader to deal with that :)

Optimization #4: More Omnijar

Feeling that omnijar wasn’t awesome enough, Michael Wu went ahead and omnijared extensions. Most extensions will no longer need to be unpacked from xpi files. This also means that extension authors can opt to use the optimized jar format above to further speed up Firefox startup.

Other jar optimizations

Switching to jars via startup cache will allow us to further optimize our first startup. There is option of halving our jar IO further by actually making use of that readahead integer I added to optimized jars.

To fight fragmentation it is best to tell the OS to allocate a continuous chunk of space for your file. With specialized APIs, the OS can do this without performing any IO (not counting metadata). I am adding support for this as part of bug 592520.  Linux features posix_fadvise for preallocating files. Windows’s SetEndOfFile achieves the same result. Supposedly OSX can do this via fcntl(F_PREALLOCATE), but does it?

I’ve experimented with posix_fadvise/SetEndOfFile and determined that they both change the file size and do their best to avoid fragmentation. Unfortunately I do not see any effect of fcntl(F_PREALLOCATE) on OS X 10.6 (the return code is successful). The file size does not change and if I then write to the file, it seems to fragment just as much as before. Can a Mac expert demonstrate that fcntl(F_PREALLOCATE) makes any difference at all?

Update: Thanks a lot for the useful feedback, it was extremely helpful in producing this patch. It appears that the posix_fallocate equivalent is to the fnctl followed by a truncate() call (which actually forces data to be written to the file).

Fighting fragmentation: SQLite

Thanks for all of those who commented on previous post on fragmentation. My first fragmentation fix has landed. In current nightlies and future releases the main Firefox databases will grow more aggressively to avoid fragmentation. This should translate into better history/awesomebar/cookie performance for our most dedicated users.

Unfortunately fixing existing profiles is hard from within Firefox. In the meantime advanced users on non-Windows platforms who are suffering from fragmentation can manually copy *.sqlite files to another directory and back.

Windows: Ahead of the pack

Evidence suggests that the Windows fragmentation situation is slightly better than on other platforms. Firefox fragmentation behavior on Windows is similar to other OSes but Windows periodically defragments Firefox files opened on startup. So one ends up with a cycle of deteriorating performance, followed by better performance(ie right after defrag), followed by deteriorating performance, etc.

I haven’t observed Windows defragmenting files for me, but it seems to do this for most users. Would love to learn more on how/when it decides to defragment files.

Horror Stories

I found a few other places that are horridly affected by fragmentation, will be blogging about those as I fix them. Fragmentation is an interesting problem to optimize because it affects dedicated users most, yet it is very tricky to replicate in a developer environment. Furthermore, there are a lot of misconceptions floating around:

  1. Fragmentation is a Windows problem that Linux is immune to due to having awesomer filesystems.
  2. Mac OSX automatically defragments files, so fragmentation isn’t a problem there.
  3. Fragmentation isn’t a problem on SSDs

To which I say:

  1. Linux might be good at avoiding fragmentation for server workloads. It sucks for desktop users.
  2. OSX will defragment small files, but big ones hurt most.
  3. Cheap SSDs suck at tiny reads caused by fragmentation resulting in spectacularly bad IO. More on this in a future post.

To summarize: there are a lot of misleading stories floating around. I am always happy to hear more measurements/docs/bugs/etc on this subject, but I have zero patience for folk stories and speculation.

I should also mention that the fragmentation problem isn’t limited to Firefox. Other browsers suffer from it too.

I was digging through a MSVC++ map file for xul.dll. Turns out MSVC++ isn’t as naive about virtual initializers as the GNU toolchain. Initializers are all laid out next to each other. Same goes for what looks like finalizers and exception unwinding stuff. Initializers have an __E prefix and look like this:

0001:0089b470       ??__E?config@AvmCore@avmplus@@2UConfig@nanojit@@A@@YAXXZ 1089c470 f    CIL library: CIL module
0001:0089b475       ??__EkStaticModules@@YAXXZ 1089c475 f   nsStaticXULComponents.obj
0001:0089b638       ??__E?sSnifferEntries@nsUnknownDecoder@@1PAUnsSnifferEntry@1@A@@YAXXZ 1089c638 f   necko:nsUnknownDecoder.obj

Now if only Microsoft fixed their kernel to do memory-mapped IO efficiently, it’d be a superior OS for starting Firefox.

File Fragmentation

Files are considered fragmented when they aren’t laid out in a continuous chunk on disk. This causes extra seeks even if the file is being read sequentially.

I was discussing startup over dinner, someone asked about how much of an issue fragmentation is in Firefox.

Early on I decided to pretend that fragmentation does not exist as we had bigger fish to fry. We were opening too many files on startup, effectively causing our own high-level fragmentation. Luckily, that problem should be mostly solved in Firefox 4 once omnijar and fat xul bugs land (unfortunately, extensions can cause similar issues until we stick em into a single file).

To measure fragmentation I used my SystemTap script to get a list of files opened (one could also use strace or any similar tool) and piped the results to filefrag. Filefrag is a Linux fragmentation-measuring utility. On Windows one can use contig and Mac OS X features hfsdebug. I’m using ext4 on Linux.

My top offenders were:
places.sqlite: 34 extents
cookies.sqlite: 18 extents
XPC.mfasl: 11 extents
Cache/_CACHE_003_: 11 extents
urlclassifier3.sqlite: 6 extents
Cache/_CACHE_002_: 6 extents
Cache/_CACHE_001_: 6 extents
XUL.mfasl: 5 extents
formhistory.sqlite: 5 extents
content-prefs.sqlite: 4 extents
libxul.so: 4 extents
signons.sqlite: 3 extents
icon-theme.cache: 2 extents
libatk-1.0.so.0.2809.1: 2 extents
libflashplayer.so: 2 extents
Cache/_CACHE_MAP_: 2 extents

I did an informal poll of my friends and it seems that the order of fragmentation is similar among them, only the magnitude differs. For example, XFS tends to be 10-20 times more fragmented than ext4 :) . I don’t have any numbers for HFS+, but I suspect XFS takes the crown as most fragmentation-prone filesystem to run Firefox.

Interestingly, my friend running NTFS reported similar fragmentation to ext4. That was disappointing as Windows Prefetch supposedly defragments files used with the first 10 seconds of startup. Clearly, isn’t keeping up in this case.

Preliminary Conclusions

places.sqlite is the largest and most performance-critical file in Firefox. It contains browser history and bookmarks. It is the brains behind the AwesomeBar.The fact that it is severely affected by fragmentation significantly impacts Firefox responsiveness. There are no easy fixes for fragmentation there. mak suggested moving history to a separate file to mitigate this, but that isn’t an easy change.

In contrast, cookies.sqlite is tiny(<1mb for me) and probably so fragmented due to cookie expiration. I am guessing that easiest workaround here is to write a new sqlite file every time there is a mass update to the file.

urlclassifier.sqlite is a large file that may be mitigated similarly to cookies.

SQLite 3.7.0 came out today which features WAL logging, which may reduce fragmentation (or make battling it easier). In general, sqlite’s VACUUM (used to clean and compact the database) command does not help with fragmentation, we really need to be doing something like hot backup which would create a new database file every VACUUM.

Our cache code is ancient and sucks. The cache files get fragmented immediately and severely. They are accessed in insane patterns and they get laid out insanely on disk. There are some efforts to improve the code, but I suspect that’s equivalent to putting lipstick on a pig.

*.mfasl files are due to be obsoleted by a startup cache jar. It may get less fragmented. Should be a straight-forward fix it if it does get fragmented.

I’m disappointed to see the .so files get fragmented. This might be an ext4 bug or has something to do with how the updater works (both ours and yum on Fedora).

Further Work

I would like to see more data on fragmentation on Windows/OSX. Feel free to leave a comment with fragmentation numbers for cache, mfasl, sqlite and .dll files in your Firefox. We should look into online defragmentation APIs in modern OSes.

Workarounds?

Easiest way to fix fragmented files is to make a copy of the original file, delete the original and then rename the copy. This works on sane filesystems, apparently it doesn’t work too well on OS X.

My Startup Summary

I try to blog about interesting things I encounter while solving various issues in Mozilla. Some things are less bloggable than others. If this blog don’t fulfill your startup + static analysis needs you can follow my status updates and twitter. For now here is a summary of various half-baked/inprogress work:

  • I worked on upgrading to GCC 4.5 which turned out to be a performance regression. While tracking down a workaround, I got into helping GCC guys help us by fixing gcc trunk’s LTO to work on Mozilla.
  • I am also determined to see Firefox come out with a fat libxul and not link to useless to us libraries by default.
  • I spent a lot of time figuring out why our sqlite io hurting, ended up bumping block size to 32K. I am really hoping that Marco can deliver VACUUM for all of the Firefox databases in time for Firefox 4.
  • I refactored most of icegrind so in addition optimizing startup, it can also facilitate investigative work like Mike Hommey is doing.
  • Posted a summary on the evils of static initializers. Came across a Chrome equivalent today.

I plan to blog more on each of these subjects as they get closer to realization.

Galois talk

I was invited to present a Galois tech talk on Mozilla static analysis. It was really cool to give a talk locally to such an expert audience. I was surprised to discover a vibrant Programming Languages + Analysis community in Portland.

Edward Z. Yang did an excellent write-up on the talk.

PLDi

Robert O’Callahan mentioned Dehydra in his PLDI talk.

Dehydra/Treehydra in GCC 4.5

There a few fixes that are about to land. I’m hoping that by the end of the week GCC 4.5 support will be production-quality. Sorry that it’s taken so long, but I’ve been busy focusing on startup. Ehren has picked up the slack, we should be able to produce a fairly polished Dehydra 1.0 by the end of the summer.

This post is a result of debugging bug 561842. Turns out one needs to go far beyond lumping libraries together to reap startup benefits.

I made a pdf to illustrate the cost centers of loading libxul.so (the essence of Firefox).

With Icegrind I demonstrated that better binary layout can significantly improve application startup. However I still didn’t have a breakdown of reasons of why loading binaries is so damn inefficient. That’s what the above pdf is about.

Loading libxul consists of 4 major phases:

  1. Runtime linker setup: mapping segments in, zeroing .bss, loading dependent libraries, etc
  2. Runtime linker relocations.
  3. Library intializers.
  4. main() and the rest of application code runs

I blogged about 1, 2, 4, this post is about #3.

Library Initializers?

Michael Meeks pointed me at the funny backwards IO pattern in his IO logs. I even made fun of how by default libxul.so is read mostly via backwards IO. Once I assigned userspace symbols to my pagefault log, it became clear that the backwards IO pattern was entirely due to library initializers. C++ compiler generates code that runs on library initialization to initialize globals and run relevant C++ constructors. In C one can assign a “constructor” GNU attribute to a function to participate in this mayhem.

Running Backwards?

Ian Lance Taylor clued me in on why these things run backwards.When one links the program, the object files are laid out sequentially. Static libraries are specified after the code that depends on them. Once an object is linked, the easiest way to make sure that libraries are initialized before their users is to invoke initializers backwards. The list of initializers is stored in the .ctors section and they loaded by libgcc.

In Mozilla (and likely other C++ codebases) these global initializers are more or less evenly scattered throughout the codebase. By the time main() is run, most of the program has been paged in an unfortunately inefficient manner.

Run Faster Please?

The most interesting part about all this that the compiling toolchain can make a rather precise guess at how a large part of the initial program execution is going to go. To test this theory I wrote my best Mozilla patch ever.

One can place a function near the beginning of the library file and another one at the end (with a “constructor” attribute). The function at the end runs first and it can figure out the approximate range of memory that will need to be paged in and madvise() it. This results in a 5x reduction in libxul pagefaults. Unfortunately since constructors execute backwards and readahead forwards, the constructor execution stalls to wait for readahead, so the speedup is rather hard to detect.

Run Forward Faster!

Depressed about my hack failing to make a dent in startup time I patched gcc to run initializers in a forward order (and reversed the function-placement logic in above patch). Now readahead happened in the same direction as library initialization and my Firefox started 30% faster! I wrapped this up into a standalone gcc patch (speed up any bloated C++ startup with a simple change to the compiler!). Note this hack reverses the library initialization order discussed above, this happens to not be a problem for Mozilla.

Conclusion: Order Matters!

The linker can reverse the per-library initializers such that initializers run forward, but cross-library dependencies are honoured. That in itself isn’t enough to boost startup without cleverer readahead on the kernel side (or application-side hacks).

It’s weird to have initializers page in most of the binary. An interesting optimization would to have the compiler transitively mark functions reached by library initialization and place those in a .text.initializers section. Then one could have the linker group the initializers together.

Plans

I haven’t made up my mind on how to proceed. This madvise() hack + a simple linker patch could be deployed more easily than icegrind. This hack also appears to be as performant as a static firefox build + icegrind (due to inadequate kernel readahead without madvise()). Icegrid + libxul.so isn’t quite as efficient. I have a feeling that we’ll end up with a combination of icegrind + some form the initializer madvise() hack.

Continue Reading »

« Prev - Next »