Monday, April 21, 2014

Linktime optimization in GCC, part 2 - Firefox

This time I will write about building Firefox with LTO. Firefox is one of largest binaries on my hard drive (beaten only by Chromium) so it is an excellent stress test. Taras Glek and I started to work on getting Firefox to build well in 2009 (see original paper from 2010), however for all those years I was mostly chasing correctness and compile time/memory usage issues. Markus Trippelsdorf was instrumental to keep the project going. In parallel Rafael Espindola did same effort of getting Firefox to work with LLVM and LTO (Update: see also his recent slides) that got into useful shape about a year later.

Update: I solved the LLVM LTO crash by commenting out all SSE3 code from SkBitmapProcState_opts_SSSE3.cpp. It seems to be re-incarnation of Firefox's PR759683 and LLVM's PR13000. I also added GCC 4.8 runtime and code size tests.

Update: With help from Firefox devevelopers I also measured tp5o benchmark. It shows rendering times of popular pages from 2011.

Looking into Firefox let me to understand that needs of large applications are different from what one sees in standard benchmarks (SPEC2006). It motivated several changes in GCC, including making -Os cope well with C++, adding code layout logic, optimization to handle better static construction and destruction and to improve dealing with PIC interposition rules.

It may be surprising, but despite my work in the area, this is first time I actually run some real benchmarks (except for Dromaeo I did two weeks ago). Benchmarking large applications is not easy to do and I never had time to poke with the setup. As I have learned recently, Firefox has actually pretty thorough benchmarking tool called talos.

I run quite few benchmarks and builds, but it would be interesting to do more. Due to unforeseen changes in my plan this weekend I however decided to publish what I have  now and followup on individual topics.

This post should be also useful guide to those who want to try other bigger projects with LTO. Firefox contains a lot of third party libraries and has very diverse code base. Problems hit here are probably good portions of problems you hit on other bigger applications.

For LTO history, see part 1 of this series.

Update: See also Martin Liška post on building Gentoo with LTO.


I will not duplicate the build instructions from Mozilla site. However for successful LTO build one needs the following additional things:
  1. Binutils with plugin enabled. This can be checked by
    $ ld --help | grep plugin
      -plugin PLUGIN              Load named plugin
      -plugin-opt ARG             Send arg to last-loaded plugin
    I personally use gold from GNU Binutils but any reasonably recent binutils either with GNU ld or gold should do the job. One needs to however enable the plugin while configuring since it is off by default.
  2. GCC 4.7+ that is configured with plugin support (i.e. built with the plugin enabled LD). For sane memory use, GCC 4.9 is recommended, it requires about 5 times less RAM.

    To test if GCC runs with plugin support, try to link simple hello world application and do:
    $ gcc -O2 t.c -flto --verbose 2>&1 | grep collect2.*plugin
    If the output is non-empty you are having plugin set up correctly.
  3. Clang needs to be built with gold llvm plugin enabled, too. Follow the official instructions.
  4. ar/nm/ranlib wrappers on the path that dispatch to the real ar/nm/ranlib with plugin argument.

    This is somewhat anoying issue. GCC ships with gcc-ar/gcc-nm/gcc-ld, but it is not very easy to convince build script to use them. Adding to path a simple script calling gcc-* equivalent will lead to infinite recursion. I use the following hack:
    $ cat ar
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/ar $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/ $*

    $ cat nm
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/nm --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/ $*

    $ cat ranlib
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/ranlib $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/ $*
    In the LLVM case I use the same wrappers, just the plugin path is /aux/hubicka/llvm-install/lib/

    Alternatively the plugin can be installed to the default plugin sear path (/usr/lib/bfd-plugins). This solution has some limitations however. One needs not-yet-released binutils to make this work with GCC and one can not easily install multiple plugins. I hope this situation will change in not so distant future.

    Forgetting these steps will give you undefined symbols as soon as you start to link static library build with -flto and GCC 4.9+ or -flto -fno-fat-lto-objects with earlier compilers.
  5. Disk space for the checkout, about 3GB
  6. Disk space for build directory: 0.9GB for normal build (3GB with debug info), 1.3GB for LTO build with GCC (1.6GB with debug info) and 0.72GB for LTO build with LLVM (2.2GB with debug info).
  7. At least 4GB of RAM (if your machine has 4GB, some extra swap is needed to pull out linker when LTO optimizer is working) for GCC LTO and 8GB for LLVM LTO.
  8. At least 3GB of /tmp space for GCC's WHOPR streaming. If you use tmpfs, you may want to redirect GCC's temporary files to a different directory by setting TMPDIR.

Be sure you actually use LTO and have plugin configured right

One of common mistakes is to use GCC -flto without linker plugin. This has three important consequences.
  1. GCC will always build fat LTO files and your builds will take twice as long then needed
  2. GCC will not have visibility information from linker and will not optimize very well across symbols exported from the binary. You may change this with -fwhole-program, but this solution is hard to use and still has some negative impact on code quality. It is a lot better to use linker plugin and the -fvisibility=hidden if you build PIC library.
  3. If you produce static library of LTO files and you link it without the plugin, the link time optimization on those will not happen. You will get the same code as without -flto. I believe many users who report no benefits and/or ability to build very complex apps actually fall into this trap.
I hope binutils will be patched to diagnose these cases more gratefully.

    Be sure you get link-time optimization flags right.

    Both GCC and LLVM perform some optimizations at compile-time and some optimizations at link-time. Ideally the compiler should pass down the flags from compile-time to link-time but GCC (and I believe LLVM, too) is not yet fully reorganized to support per-function flags (I plan improve this for next release).

    Worse yet, GCC does some magic, like trying to get -fPIC flag right, that does not really do what it should, but it hides some of obvious to detect mistakes. Richard Biener improved it a bit for 4.9, but the sollution is still not very good.

    It is important to pass proper optimization flag to the final invocation of compiler that is used to link the binary. Often it can be done by adding your optimization flags to LDFLAGS. Failing to do this will probably result in not very well optimized binary.


      I used unpatched GCC 4.9 release candidate 1. I configured my compiler with:
      --prefix=/aux/hubicka/49-install --disable-plugin --enable-checking=release --disable-multilib --with-build-config=bootstrap-lto
      And built with make profiledbootstrap. If you want to use LTO and FDO together, you may want to get this patch for PR47233 that did not make it into 4.9.0 and will land 4.9.1 only. It reduces LTO+FDO memory use by about 4GB.

      I also use relatively fresh LLVM checkout 206678 (Apr 19). Eric Christopher recommended me to try recent LLVMs since the memory usage with LTO was reduced. I also briefly tried 3.4 that seemed to behave similarly modulo needing more memory. The wrong code issues I mention here as well as the crash on LTO without -march=native happens in both cases. I configured with
      --enable-optimized --disable-assertions --prefix=/aux/hubicka/llvm-install 
      and rebuilt by itself. My understanding is that this should give me fastest binary. Release builds  seems to be with assertions enabled that seems to cause noticeable slowdowns.

      To both compiler paths I install the shell wrappers for ar/nm/ranlib as mentioned above.

      Firefox changes needed

      The actual changes to Firefox are tracked by our tracking PR45375. Here are changes that actually landed upstream I write about mostly to give an idea what source code changes are needed:
      1. Patches to mark symbols used by top-level asm statements by __used__ attribute. This change is needed because unlike the statement level asm statements, the toplevel asm statements have no way of annotating them with symbols used. For this reason the symbols in ASM attributes are not visible to linker and GCC prior code is generated. If you won't use explicit __used__ attribute, GCC will freely optimize out the symbols leading to link errors.

        This solution somewhat sucs. LLVM has built in ASM parser that makes this transparent in some cases, but not in others.

        The patch is linked from Firefox's PR826481
      2. Patch to libvpx to avoid grepping bytecode assembly files.

        The problem here is that libvpx produce assembly file and then greps it for specific code in it. With -flto the assembly file consists of compressed LTO bytecode and thus the grepping fails.  One needs to pass -fno-lto to the specific unit.

        More in Firefox's PR763495
      Patches not pushed upstream
      1. A hack to prevent configure script to be confused by LTO and disable support for visibilities. This is linked here. Without this patch one gets working Firefox, but code quality is lousy.

        Confusion of confiugre scripts is unfortunately common problem for LTO. Many of the scripts was written with traditional compilation model in mind.
      Martin Liška and Markus Trippelsdorf put together bare bones of GCC LTO FAQ.

        Firefox configuration

        My mozconfig is as follows:
        ac_add_options --enable-application=browser
        ac_add_options --enable-update-channel=${MOZ_UPDATE_CHANNEL}

        ac_add_options --enable-update-packaging
        ac_add_options --disable-debug-symbols
        ac_add_options --disable-optimize
        ac_add_options --disable-tests
        export MOZILLA_OFFICIAL=1
        mk_add_options MOZ_MAKE_FLAGS="-j16"
        export MOZ_PACKAGE_JSSHELL=1
        MYFLAGS="-O3 -march=native -flto"
        export CFLAGS=$MYFLAGS
        export CXXFLAGS=$MYFLAGS    
        export LDFLAGS="-O3 $MYFLAGS"              
        mk_add_options MOZ_OBJDIR=/aux/hubicka/final2-gcc-firefox-49-lto-O3-native-debug
        I use --disable-optimize to be able consistently set flags to all files by MYFLAGS. For LLVM LTO build one also need in some cases --disable-elfhack (or one gets segfaults). I will discuss effect of enabling-disabling debug symbols briefly in the next section.

        LLVM LTO builds crashes in instruction selection with LTO. I suppose it is because some units are build with custom -march flag with use of SSE intrincisc but this information gets lost to LTO. To get binary built I link with -march=native. I am in the progress of looking for testcase to report the issue. The resulting binary do not work for me, so all binaries benchmarked are with generic tuning.

        To get build with profile feedback, one additionally need to enable tests.

        Getting parallelism right

        If you want to get parallel build with LTO, it is better to parallelize the linktime optimization, too. This is currently supported by GCC only. In GCC you can pass parallelism to -flto parameter. I use -flto=16 that gets me 16 processes. There is no need for passing higher values. Parlallelism increases the memory use, so you may need to use lesser parallelism than is your CPU count if you are running into swap storms.

        Other important thing to consider is correlation with make's parallelism. If you build with -j16 then make may execute 16 linking commands in parallel that may each execute 16 worker processes killing your machine with 256 parallel compilations. For Firefox it is not an issue as there is only one dominating binary, for other projects you may want to be cautious.

        GCC support -flto=jobserv (implemented by Andi Kleen) that lets make to drive the whole process. There are few downsides of this:
        1. You need to adjust your Makefile to add + in front of the rule executing link command, otherwise GNU make will not pass down the necessary jobserver information. A lot of users seem to not read the docs far enough to notice this trap and get serial links.
        2. GCC 4.9 newly parallelizes the streaming stage and because I do not know how to communicate with Make's jobserver, this stage won't get parallelized, yet. GNU Make's jobserver seems easy to deal with and patches are welcome :)
        3. If you cut&paste command from Make's output, the linking will be serial because jobserv is not run.
        We will need some cooperation with GNU make maintainers to get these problems hammered out.  In general I think -flto should default to an "auto" mode that will by default detect number of CPUs available unless it knows about Make's jobserver and Make's jobserver should be available without the extra Makefile change.

        The parallelism is reached by breaking program into fixed set of partitions after the inter-procedural analysis is completed. The number of partitions is not controlled by -flto and is 32 by default. If you want higher parallelism or smaller memory footprint, you may increase number of partitions by --param lto-partitions=n. This will affect resulting binary but it should not really decrease code quality by a considerable fraction. The number is fixed to ensure that builds on different machines gets precisely the same output.


          To build Firefox normally, one use
          make -f 
          For the profiled build one needs X server running (I use tightvnc) and use  
          make -f profiledbuild
          If X is not running, the build passes but the resulting binary is crap, because all it is optimized for is to output the error message about missing X.

          My build machine

          My machine (courtesy of IUUK) has plenty RAM (64GB) and AMD Opteron(TM) Processor 6272 with 16 threads running 2100Mhz. I use tmpfs for /tmp directory and run modified Debian Wheezy.

          Compile time

          Wall time

          One observation is that GCC's LTO is no longer coming for extreme compile time costs. The slim LTO files avoid double optimization and the build gets parallelized pretty. The extra wall time caused by -flto=16 is comparable to switch from -O1 to -O2.

          LLVM without LTO clearly builds faster. Its -O3 compilation level is about as fast as -O0 in GCC, while GCC's -O1 is just bit slower and other optimization levels are significantly slower. With LTO LLVM builds a lot slower due to lack of any parallelism in the build stage. Also LLVM's optimization queue is organized in a way that virtually all optimization are run twice; once at compile-time and then again at link-time. While this may account extra cost, I think it is not too bad due to speed and relative simplicity of the LLVM's optimization queue. It is more problematic in a way that optimization done at compile time may inhibit optimizations at link time and I expect this will be revisited in future.

          Will debug info kill my builds?

          Other interesting topic is debug info. While GCC basically carries all expenses of debug info even with -g0. LLVM doesn't. The LTO memory usage increases many times with -g and build time with -O3 -flto -march=native -g is 37m46s (2266s), so about 46% slower. GCC needs 15m20s (extra 11%). I left out these from the graph to avoid rescaling.

          For non-LTO builds GCC -O3 -g compile time is 12m41s (761s), about 18% slower than -g0, LLVM -O3 -g compile time is 9m35s (575s), about 13% slower.

          Previous releases

          To put this into further perspective, GCC 4.7 needs 35m1s (2101s) to complete the job with -O3-flto=16 -g, GCC 4.8 needs 40m10s (1810s). Firefox build with GCC 4.7 and LTO crashes on startup for me (earlier checkouts worked). I did not investigate the problem, yet.

          CPU time

          Parallelism plays important role: while compilations can be easily done in parallel via make -j16, linking often gets stuck on one binary.  Here is how system+user time compares on individual compilation:

          Here it is visible that LTO compilation does bit more work for both compilers, but WHOPR is able to keep overall build time in bounds.

          Compiler's memory use

          Here I present some data collected from vmstat plotted by a script by Martin Liška (who originally got the idea from Andi Kleen). Red line in CPU utilization graph shows the case where only one CPU is used.

          GCC 4.9

          GCC memory/cpu use at -O0

          GCC memory/cpu use at -O3

          GCC memory/cpu use at -O3 -g

          In the graphs one can actually see that relatively a lot of time (100s) is spent before actual compilation begins. It is make parsing through Makefile and executing bunch of python scripts. The same happens after 600s to 700s where the same is repeated for different make target. The little peak at the end 700s is gold linking the final library. We will soon see this like peak to grow up :)
          GCC memory/cpu use at -O3 -flto
          Here the memory usage during actual compilation is lowered to about 4GB and compilations are finised in about 500s (faster than with -O3 in the previous graph).
          What follows is a memory mt. Everest climb in memory usage graph starting just after 500s.  One can see the serial link stage running from 500s to 600s

          Mount Everest climb in memory usage starting just after 500s is the LTO optimization, the largest Firefox's library:
          1. WPA (whole program analysis) type merging: The steeper walk up to 550s is compiler reading summaries all of the compilation units, merging types and symbol tables.
          2. IPA optimization: The slow walk up is inter-procedural optimization, dominated by inliner decisions.
          3. WPA streaming: Just after 600s the serial stage is finished and the steep peak up is parallel streaming and shipping new object files into copmilatoin. You can see the CPU usage to go up, too.
          4. Local transformation: Shortly after the steep valley the actual compilation starts, executed in parallel for another almost 200s.
          While the memory usage in my graph peaks 16GB, you can see it all happens in the parallel stage.  If you link Firefox at 8GB machine you can either increase --param lto-partitions or decrease -flto parallelism. Both will make the mt. Everest to sink.

          The little hill just before the mt. Everest (around 500s) has about the same shape for a good reason: it is linking of the javascript interpreter shell, the second largest binary in Firefox. It is a good example that bit of build machinery reorg would probably help to hide most of the LTO build time expenses. All one would need would be to overlap the javascript shell link with the link. Something that is not needed in the normal build, since the link time of javascript shell is almost immediate.

          An unfortunate debug info surprise

          GCC -O3 -flto=16 -g
          This graph was supposed to show, that debug info is almost for free. This is true up to about 700s. Then it shows quite large increase in peak memory use (16Gb to 25GB) for debugging enabled in the local transformation stage. Sadly this is a new problem with recent Firefox checkout I did not noticed until now. I am looking into an solution.

          If I have 8GB or 4GB of RAM?

          The following are the worst case (debug info + -O3) builds with reduced parallelizm to 8 and 4. I regularly build firefox on my notebook with 8GB of RAM and 4 threads. On 4GB machine the build is possible, but one would need to either go to -flto=1 or play curefuly with partitioning. Again, Firefox is quite a monster.
          GCC -O3 -flto=8 -g

          GCC -O3 -flto=4 -g
          GCC -O3 -flto=16 --param lto-partitions=16 -g

          Older GCC releases

          GCC 4.7 with -O3 -flto=16 -g
          To get bit of perspective, this shows improvements since GCC4.7. Fat object files accounts for increasing the compilation stage to almost 1000s. The type merging is the footstep of the hill followed by slow inliner decision and streaming (the shanky part of graph, around 1500s). The WPA stage needs 8GB instead of 4GB in GCC 4.9.

          Again this graph shows that there is a new problem with debug info with recent Firefox checkout. Again this is fixable by reducing parallelism.

          GCC 4.8 -O3 -flto=16 -g
          While GCC 4.8 links almost twice as fast compared to 4.7, the memory usage is still bad (serial stage has about the same memory use and local compilation needs 5GB more)


          LLVm memory/CPU use at -O0

          LLVM memory/CPU use at -O3
          LLVM memory/CPU use at -O3 -flto -march=native
          LLVM is done with compilation in about 350 seconds, that is quite surprisingly similar to GCC's time.I would expect here cleang to beat GCC's C++ FE hands down, but apparently a lot of compiler time difference nowdays is in the back-end and also by the integrated assembler. Or perhaps it is because more optimizations are performed at LLVM's compile time. Then it follows by serial link stage.

          Without LTO LLVM uses about 1/3rd to 1/4th of memory, while with LTO it needs about twice as much (given that the link stage is not parallel at all). This is the case of compilation with debug symbols disabled only.
          LLVM with -O3 -g

          LLVM with -O3 -flto -g
          This graph shows issues with debug info memory use. LLVM goes up to 35GB. LLVM developers are also working on debug info merging improvements (equivalent to what GCC's type merging is) and the situation has improved in last two releases until the current shape. Older LLVM checkouts happily run out of 60GB memory & 60GB swap on my machine.

          File Size

          I generated the file size comparison splitting the binary into text, relocations, EH and data that accounts also the other small sections.I made EH data to appear gray, since they most of time don not need to be loaded into memory and thus their size is less critical then the size of other sections.

          One thing that LTO is good for is code size optimization. This is not as clearly visible on, because it uses gold to do some of dead code removal, but will become more clear in other applications.

          In GCC's LTO at -Os bring about 5% reduction to both code and data segment (hover over graph to get precise numbers) and 20% reduction to EH data. For -O2/-O3 the data segment reduction stays, but the code is expanded by inlining. This is just the out-of-the-box configuration. The inliner is by default not really well behaving on large applications. Some of it I already fixed for next GCC release here. For now I recommend to reduce the inliner expansion by --param inline-unit-growth=5. This brings binaries close to non-LTO -Os, but without the runtime expenses. I wanted to make this default for GCC 4.9 but did not have chance to complete the necessary bencharking, so it is something I want to address this development cycle (in early tests done in mid 2013 there was issues with SPEC2k6 xalancbmk benchmark). I will follow on this and show some tests.

          UPDATE: I rebuilt firefox without -function-sections -fdata-sections and gold's section removal + identical code merging (which require the sections). The overall binary reduction with the default configuration is 43% (in LTO build)

          For LLVM LTO usually expands the binary including the -Os compilation, at -O3 it is about 30%. I believe it is mostly by inliner seeing too many cross-module inlining cases, too. LLVM has no global growth limit on inliner. What happens with LTO is that almost all functions become inlinable and simple minded inliner is just too happy and inlines them all. I believe that this is partly reduced by fact that LLVM does code expanding inlining at compile time that consequently disables some of cross-module inlining at link-time. Did not double-check this theory, yet.

          The data segment is reduced by about 3% and it is smaller than data segment of GCC produced binaries by about 6%. I tracked down part of the problem to GCC aligning virtual tables as if they were arrays for faster access by vector instructions. This is quite pointless and will make patch to avoid it. Other problem is the lack of optimization pass to remove write only variables that I made patch for next release cycle. Last difference seems to be LLVM bit more aggressive about optimizing out C++ type infos that GCC believe are forced to stay by C++ ABI. I looked into this and I lean towards conlcusion it is LLVM bug, but I am not expert here and further analysis are needed. Some of the symbols has appeared forced in recent LLVM checkout.

          One thing to observe is that GCC seems to produce considerably smaller code segment with -Os than LLVM (14%). As it will be shown later, LLVM's -Os code is however faster than GCC's -Os. It is the nature of the switch to trade all performance for code size however. -O2 code segment is smaller in LLVM (5%). I did not look into this in detail, yet, but it is my long time feeling that -O2 may need bit of revisiting from large application developer point of view. A lot of code generation decisions are tuned based on benchmarks of medium sized apps (SPEC2k6) that makes code expanding optimizations to look cool. I did bit work on this for GCC 4.9 and generic tuning, but more needs to be done. The size difference further increase at -O3 that I consider OK, since -O3 is about producing fast binary at any cost. As it is shown in benchmarks, -O3 in GCC does bring benefits, while for LLVM it seems to be close to -O2 performance.

          GCC also produce bigger EH tables (27%) at -O2. Partly this is concious change in GCC 4.9 where push/pop instructions are used to pass on-stack arguments. This produce shorter code and work well with modern CPUs having stack engines but the unwind info bloats. LLVM also seems to have bug where it optimizes away the EH region in:
          #include <stdio.h>

            try {t();} catch(int) {printf ("catch\n");}
          Even with -fPIC -O2. This is not valid, because t can be interposed by dynamic linker to different implementation that does throw. I filled in PR for this and will double check if those two are the sole reasons for the difference.

          Runtime benchmarks

          I used talos to collect some runtime performance data. Sadly I do not have LLVM LTO data, because the binary crashes (and moreover it is native build rather than generic). I plan to followup on this and do the comparison once I hammer out the problems.

          Startup time tests

          First benchmark is ts_paint that starts the browser and waits for the window to be displayed (in my understanding). I tried also bechmark called ts, but it does not produce results.

          Before each talos run I flushed the cache. Talos is running the benchmark several times, so it gave me one cold startup run and several hot startups (I hope). In the graph bellow is the cold startup and average of hot startups. Cold startup have more noise in it, because it is just one run and I suppose it also depends how well the binary landed on the hdd.

          I am not entirely happy about this benchmark and will write more later. The main thing is that Firefox uses madvise call to fully load every library disabling the kernel's page demand loading. Martin Liška implemented function reordering pass that, with profile feedback, should allow to load just small portion of the binary. I will followup on this later. See also Taras's blog post on this topic.

          The startup tests seems to be insensitive to LTO except for case when profile feedback was used (FDO saves 7%, LTO additional 4%). Cold startup shows some sensitivity to file size, but apparently my HDD is fast.

          Opening window

          Opening window is a nice test to show compiler flags, since it should fit in cache. Here LTO wins bt 5% at -O2, 2.3% at -O3 and 9.8% with FDO (that by itself brings 6.4%). One of two (Update: three) benchmarks where LLVM performs better than GCC -O2, I am not sure if it is within a noise factor.


          Scrolling has similar properties to window opening: just bunch of C++ code that fits in RAM and thus is sensitive to compiler optimization.

          SVG rendering

          Here I used tsvgx test (that is said to be better than tsvg) and since it tests multiple images, I made geometric average.

          Canvas mark

          Again a gemoetric average, I did not investigate yet, why some runs has failed. Will update on it.

            Dromaeo CSS

            A geometric average of individual benchmarks.

            Dromaeo DOM

            This is the only benchmark where LLVM is not approximately in GCC -O1 to -O2 range and in fact LLVM -O2 outperforms GCC -O2 by 7%. I profiled it and again it was caused by GCC not inlining by PIC interposition rules, so the difference should disappear if the problem is fixed at LLVM side.

            Allowing the inline by adding explicit inline keyword makes GCC -O3 to match score of GCC -O3 -flto=16 where the inline happens then based on linker plugin feedback. Again someting I plan to followup on. There is bit more going on that I need to investigate deeper. For GCC, the crictical inline happens only at -O3 -flto, while LLVM gets it right at -O2 despite the fact that its -O2 binary is shorter than GCC's. The inline decisions are significantly different making it a bit harder to track down.

            Update: By mistake, I reported large improvement for -flto -O3 with LLVM. This was actually caused by fact that I pulled in some changes into firefox tree while looking for workaround for the compiler crash. One of them was patch improving DOM. I rerun the tests on original tree.

            Update: Loading common pages

            This test seems to be only one that seems to show that -O3 is not always a win due to code size bloat.

            So what we can improve?

            LTO in GCC has quite large room for improvement. I have patches for next release to improve inliner's behaviour and reduce code size considerably without giving up on the speed. There is also still a low hanging fruit on imroving memory usage and compile time. Re-tunning backend and applications themselves for the new environment will need more work.

            Comparing to LLVM it seems we should try to reduce amount of information stored into LTO object files that is almost twice of what LLVM streams. Part of the extra information is used for optimization and debug info, but the streaming was not very curefuly analyzed for what really is important where: it is more or less pickled in-memory representation. There is ongoing effort to cleanup GCC, modernize the source base that will hopefully enable some deeper cleanups, like simplifying the type representation, that will in turn improve memory footprint. Clearly speeding up C++ frontend and putting our datastructures on the diet can do wonders. (I would be curious to see clang retargeted to GCC backend).

            I also have quite few plans for profile feedback. I plan to reorgnize instrumentation so with LTO one will not need to recompile everything twice, only need to relink and hopefully AutoFDO and LIPO will be merged in.


            Hope you found the tests informative and I will try to followup on the unresolved questions in the near future. I also plan to write post on Libreoffice and for that I would welcome suggestions for benchmarking procedure.


            1. The title is a bit funny, "GCC 2" ;)
              Would be less misunderstanable if it said "GCC, part 2".

              Also, I can't see any of the Google Docs embeds for some reason. For future posts please embed rendered images.

              1. Thanks, I fixed the name.

                Reason why I used google docs embeds is that one can get precise numbers out of these by hovering over the data. Indeed they did not display shortly after posting because the data was not public. I fixed that yesterday. Do they work everywhere now?

              2. I can see them all now. This is a very interesting post, I also enjoyed the history of LTO in part 1.

            2. Thanks for extremely detailed post series, Jan!

            3. Hi, What's your LLVM version?

              1. As mentioned above, it is SVN checkout 206678 (from Apr 19). I also did some builds with official 3.4 release, but not a complete set. I know it also fails for same error building LTO w/o -march=native and consume bit more memory for debug info. I did not do the runtime benchmarks.

                The error I run into is:
                LLVM ERROR: Cannot select: intrinsic %llvm.x86.ssse3.pmadd.ub.sw.128

                Today I finally got idea to google it and in fact it is resolved as invalid in though I would definitely consider it a bug. The message should say what is wrong: the compiler basically works for 15 minutes and then it crashes on this.
                Knowing where the offending sse3 use is would help.

                It is also marked as fixed in

                but it seems back. Plan to look into it ASAP.

                I also did not do much progress about getting -march=native LTO builds to start today. Hopefully later this week.

            4. Clang instead of LLVM would be more correct. Nitpick :)

            5. For prerequisite #4, is there a 'better' solution? For example, I would like to ship a CMake project that compiles static libraries and links them together. As soon as I give the CXX_FLAG -fno-fat-lto-objects (ensuring that LTO will take place) I get the undefined references errors. At the moment though, the my distro's gcc-nm just seg faults... but curious for the long run.