Studying Library IO – SystemTap Style
October 23rd, 2009
In my last blog post I expressed frustration with slowness induced by library IO. Then I went on a mission to measure it. I have been wanting to do this for a while, but I figured that only DTrace can get this info without recompiling my kernel. So I tried to build Mozilla under Slowlaris (but the linker got up to 3GB and then sat there swapping, ensuring that the nickname is justified). Then I fired up DTrace on the mini, but ran screaming because it seemed like the fbt DTrace provider refused to let me dereference structs (later Joel told me that I’m supposed to copy data explicitly like here).
But while googling for an fbt workaround, I stumbled upon a DTrace/SystemTap comparison wiki. SystemTap? The DTrace knockoff I have been hearing about? It works? This was a lightbulb moment where I realized that Linux was about to provide me with more information than I thought was possible.
So here is the data I got out of it:
Rant on Library IO
October 20th, 2009
So I’ve been trying to figure out how to optimize disk IO at startup. I looked into IO caused by libraries, and it turns out that apps with big libraries are screwed. Here is how I came to this conclusion:
Gnomers’ research on startup pointed out that dumb readahead leads to wins in terms of file IO. So I wrote some code and sure enough, reading in libxul at the top of our main() function does indeed result in a significant, measurable speed-up on both Linux and OSX.
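For the curious, the dumb readahead idea can be sketched roughly like this. This is a minimal illustration, not the code I actually landed; warm_file is a made-up name, and it assumes a POSIX system where plain sequential read() calls are enough to pull the file into the page cache.

```c
#include <fcntl.h>
#include <unistd.h>

/* Dumb readahead: drag a whole file through the page cache with big
 * sequential reads, before anything starts faulting it in page by page.
 * Returns the number of bytes read, or -1 on error.
 * Hypothetical helper; the static buffer makes it not thread-safe. */
long warm_file(const char *path)
{
    static char buf[1 << 20]; /* 1MB chunks keep the IO sequential */
    long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n; /* the data is thrown away; only the caching matters */
    close(fd);
    return n < 0 ? -1 : total;
}
```

The idea would be to call something like this on the big library's path as the very first thing in main(), so the disk sees one long sequential read instead of a scatter of page-sized seeks.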
From the gnome page I found a link to some diskstat stuff. There lay a presentation with graphs that appear to show that OpenOffice has a much better cold IO pattern than Firefox. Given that there are some strong similarities between our application layouts I went digging to see if OpenOffice does something funny. And oh boy, it does do funny page reordering on Windows and “slightly-smarter-than-dumb-readahead-style library prefetch” on Linux…
So here is an innocent question: Why is page-reordering not done as a PGO step? I mean shouldn’t you fire up your app, feed some info back to the linker and be done with it? Another question: Why can’t we mark certain files as “keep this whole file in ram if someone asks for part of it to be paged in”?
So is static linking the only way to get fast application startup? It sure is easy to
posix_fadvise(open(argv[0], O_RDONLY), 0, 0, POSIX_FADV_WILLNEED);
Are these hacks still the state of the art in making apps with large libraries start up fast?
Update: Found some mentions of GNU Rope unfinishedware and a relatively recent blog post
Restless Bug Fixing
October 8th, 2009
I spent the past couple weeks analyzing and improving fastload performance. I’ve long been suspicious of fastload, but only finally got around to investigating it in detail. I think there is some fundamentally ironic rule in software that if you put the word “fast” in the name of a component, it is bound to eventually become a performance bottleneck.
Almost a decade has passed since the conception of this code, so it was time to update the code’s assumptions to reflect the capabilities of modern OSes. I landed the fix today. It results in startup performance gains of 1-20% on the various platforms I tested, making this the most exciting perf bug I’ve worked on.
Plans
Now that I’ve had my fill of almost a year’s worth of startup performance analysis, for the remainder of the year I plan to refocus on static analysis. My main goal is decent C support in Dehydra (not to mention the ever elusive GCC 4.5 compatibility) and to facilitate a production-quality DXR.
I’m hoping that we’ll end up with cool ways of dealing with the painful/slow boilerplate (bugs 520626, 516085 and 517370).
Corrupting Innocent Minds With GCC
September 30th, 2009
Ever since the plugin branch landed in GCC, I have been itching to explore the application-specific optimization space that it opens up. It’s really hard to optimize code in the general case, but it’s relatively easy to optimize for specific use-cases. We can rely on API-specific static analysis in order to get rid of the API-imposed overheads at compile time. Let me repeat, we can get rid of some API-induced suck (OO frameworks usually have a lot of it) without sacrificing any of the benefits.
Unfortunately, I got busy working on, supposedly, more important stuff such as making Firefox startup quicker, so my de-error-handling and de-virtualizer (basically possible with LTO, but we can prove that certain classes will never be overloaded via dynamic linking) ideas had to be put on indefinite hold. Luckily, one of David Humphrey’s students decided to take on the first task, see his blog post here. I’m really psyched about this; few things are cooler than cross-project open source work involving the most important open source projects of our time.
Moving Files Into JARs
August 27th, 2009
Moving files into jars reduces the number of seeks on startup, and has miscellaneous other performance/organization benefits. I added resource://gre-resources/ which maps to jar:toolkit.jar!/res/.
To move a file into a jar:
- Add a jar.mn entry.
- Remove existing references to the file in Makefile.in and packages-static files.
- Add the file to the removed-files.in list of dead files.
- Update URLs referring to the file in the source. Sometimes one has to switch from using file streams and filenames to using channels and URIs. This is the hard part.
- Set your bug as blocking bug 513027.
For an example see bug 508421.
Cleaning Up Startup Disk IO
August 20th, 2009
Maintaining a module, killing off another one
I was granted ownership of the jar module. Today, I resumed my quest to kill off the barely limping stopwatch module. Together with nuking STANDALONE mode in jar stuff, I will have landed 75KB worth of -ve diffs this month. It feels so good to delete code.
IO Report
Currently I am focusing on application IO (excluding libraries and IO caused by libraries).
From my empirical measurements, opening individual files on a 7200RPM hard drive costs around 0-40ms. This is on Linux. I presume files open quickly when they are located near previously opened files and slower if a full disk seek is required for them. Combining files is usually a significant win in terms of throughput. It turns out that even warm starts and reading from SSDs can benefit from combined IO. Currently small file throughput ranges from <1KB/s to <200KB/s for files < 500K. Combining files into memory mapped jars bumps that up to 1-1.5MB/s (currently jar files are relatively small, making them responsible for a higher proportion of IO should boost that further).
The biggest gains are to be had on Windows Mobile where almost every seemingly trivial filesystem operation takes 2-3ms.
I would like to reduce the number of files read on startup to a dozen or so to be able to crank up disk throughput. Unfortunately, there is a lot to be done, I could use a great deal of help.
Below is a long list of files gathered by stracing firefox-bin, and what I know about them:
There is nothing exciting about filesystems
August 14th, 2009
When I originally started at Mozilla, I only knew the people who interviewed me. But I quickly discovered beltzner when he uttered a sacrilegious statement that went something like: “….. nothing could be as boring as filesystems….”. Mike Beltzner is one of my favourite characters at Mozilla for his ability to speak his mind, but this quote has troubled me greatly. How can one not care about filesystems? Linux’s ability to do file stuff efficiently makes it orders of magnitude faster than other operating systems. Plan 9’s file-system-centric layout proved that OSes don’t have to consist of a series of poorly named and categorized system calls. In fact, a clean file layout allows many awesome optimizations. ZFS is one of the few things keeping Solaris relevant. HFS+ is one of the things keeping OSX from being fast.
Being a Linux user, I was disappointed by the pointlessness of optimizing application IO. Sure, we inefficiently open tons of files on startup; sure, we hit the filesystem 10-100x more than we need to; but why would one optimize when no more than a few percent of startup is taken up by terrible IO patterns?
Excitingly Crappy Filesystems
Luckily Firefox runs on OSX and we are making it run on WinCE. I was delighted to discover that on wince* we paid 1-5ms per file existence check, modification date, size, etc. I was shocked to see that the throughput while reading certain files could be expressed in bytes per second (most crappy flash media seems to be able to pull in >1MB/s). This brought about switching our jar IO to mmap, amalgamating jar files, moving more files into jars, etc. I’ll blog about the details later. My basic idea is that we can utilize jar files as “controlled filesystem environments” to deal with having to run on crappy OSes with exceptionally bad filesystems. OSes such as OSX where file IO is barely faster than that of a WinCE phone.
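To give a flavour of the “controlled filesystem environment” idea, here is a minimal sketch of serving data out of one memory-mapped file. It is not the actual jar code: map_file, entry_at and unmap_file are hypothetical names, it assumes POSIX mmap, and a real jar reader would still have to parse the zip directory and deal with compressed entries.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* A read-only view of one archive file. Entries are served straight
 * from the mapping instead of via open()/stat()/read() on many files. */
struct mapped_file {
    const unsigned char *data;
    size_t len;
};

/* Map the whole file. Returns 0 on success, -1 on error or empty file. */
int map_file(const char *path, struct mapped_file *out)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;
    out->data = p;
    out->len = (size_t)st.st_size;
    return 0;
}

/* Pointer to `len` bytes at offset `off`, or NULL if out of bounds. */
const unsigned char *entry_at(const struct mapped_file *m, size_t off, size_t len)
{
    if (off > m->len || len > m->len - off)
        return NULL;
    return m->data + off;
}

void unmap_file(struct mapped_file *m)
{
    munmap((void *)m->data, m->len);
}
```

The point of the design is that after the single open/fstat/mmap, looking up another entry costs no filesystem calls at all, which matters a lot when each of those calls costs milliseconds.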
Beltzner, wouldn’t it be exciting if OSes like Mac OSX had file systems worth being excited about?
* MS likes to use puns for their product names
Chase/WaMu Fraud-friendliness
July 22nd, 2009
Turns out that if someone compromises a merchant that you shop with, such as bikenashbar.com, then they are entitled to your money as long as they can show the bank that they have your billing/shipping information. According to Chase it’s not fraud if they steal your address along with the card number. I recommend that people use a bank that protects its customers’ finances, one that isn’t called Chase.
Those crazy comm-central guys…
July 21st, 2009
When it comes to taking on crazy tasks involving disguising code, you can always count on Joshua. This time Joshua has gone deep in Pork territory to do a very impressive rewrite (one that would not be even close to possible with anything other than Elsa-based tools) and he is blogging about it. He posted the first installment of his rewriting-with-Pork guide.
Still on the subject of mad csc-entist Joshua: Andrew Sutherland has finally gotten sick of crappy JS documentation tools and took matters into his own hands thanks to Joshua’s JSHydra tool. Check out his blog post on how the documentation world will be a better place thanks to being able to build tools on top of *the JavaScript parser*.
VirtualBox 3.0 Rocks
July 14th, 2009
With much pain I managed to convince a VMware WinXP virtual machine to run in VirtualBox. At first it ran very sluggishly (>3 hours to do a build?), but after I turned off IO APIC stuff in Windows, it’s become disturbingly fast. It now takes 40 minutes to build a wince fennec vs the almost 80 minutes it took on VMware Server.
VirtualBox’s disk throughput is phenomenal; in fact, this is the first time I’ve seen almost-native speed disk IO in a virtual machine. A benchmark I tried reported 45MB/s 4K read/writes (these were cached on the Linux side).
Unlike VMware Server, VirtualBox’s USB works without troubles, shared folders were easy to set up, etc. It’s an awesome app, hope Sun/Oracle keeps it up.