Monday, April 21, 2014

Linktime optimization in GCC, part 2 - Firefox

This time I will write about building Firefox with LTO. Firefox is one of largest binaries on my hard drive (beaten only by Chromium) so it is an excellent stress test. Taras Glek and I started to work on getting Firefox to build well in 2009 (see original paper from 2010), however for all those years I was mostly chasing correctness and compile time/memory usage issues. Markus Trippelsdorf was instrumental to keep the project going. In parallel Rafael Espindola did same effort of getting Firefox to work with LLVM and LTO (Update: see also his recent slides) that got into useful shape about a year later.

Update: I solved the LLVM LTO crash by commenting out all SSE3 code from SkBitmapProcState_opts_SSSE3.cpp. It seems to be re-incarnation of Firefox's PR759683 and LLVM's PR13000. I also added GCC 4.8 runtime and code size tests.

Update: With help from Firefox devevelopers I also measured tp5o benchmark. It shows rendering times of popular pages from 2011.

Looking into Firefox let me to understand that needs of large applications are different from what one sees in standard benchmarks (SPEC2006). It motivated several changes in GCC, including making -Os cope well with C++, adding code layout logic, optimization to handle better static construction and destruction and to improve dealing with PIC interposition rules.

It may be surprising, but despite my work in the area, this is first time I actually run some real benchmarks (except for Dromaeo I did two weeks ago). Benchmarking large applications is not easy to do and I never had time to poke with the setup. As I have learned recently, Firefox has actually pretty thorough benchmarking tool called talos.

I run quite few benchmarks and builds, but it would be interesting to do more. Due to unforeseen changes in my plan this weekend I however decided to publish what I have  now and followup on individual topics.

This post should be also useful guide to those who want to try other bigger projects with LTO. Firefox contains a lot of third party libraries and has very diverse code base. Problems hit here are probably good portions of problems you hit on other bigger applications.

For LTO history, see part 1 of this series.

Update: See also Martin Liška post on building Gentoo with LTO.

Prerequisities

I will not duplicate the build instructions from Mozilla site. However for successful LTO build one needs the following additional things:
  1. Binutils with plugin enabled. This can be checked by
    $ ld --help | grep plugin
      -plugin PLUGIN              Load named plugin
      -plugin-opt ARG             Send arg to last-loaded plugin
    I personally use gold from GNU Binutils 2.24.51.20140405 but any reasonably recent binutils either with GNU ld or gold should do the job. One needs to however enable the plugin while configuring since it is off by default.
  2. GCC 4.7+ that is configured with plugin support (i.e. built with the plugin enabled LD). For sane memory use, GCC 4.9 is recommended, it requires about 5 times less RAM.

    To test if GCC runs with plugin support, try to link simple hello world application and do:
    $ gcc -O2 t.c -flto --verbose 2>&1 | grep collect2.*plugin
    If the output is non-empty you are having plugin set up correctly.
  3. Clang needs to be built with gold llvm plugin enabled, too. Follow the official instructions.
  4. ar/nm/ranlib wrappers on the path that dispatch to the real ar/nm/ranlib with plugin argument.

    This is somewhat anoying issue. GCC ships with gcc-ar/gcc-nm/gcc-ld, but it is not very easy to convince build script to use them. Adding to path a simple script calling gcc-* equivalent will lead to infinite recursion. I use the following hack:
    $ cat ar
    #! /bin/sh
    command=$1
    shift
    /aux/hubicka/binutils-install/bin/ar $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*

    $ cat nm
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/nm --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*

    $ cat ranlib
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/ranlib $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*
    In the LLVM case I use the same wrappers, just the plugin path is /aux/hubicka/llvm-install/lib/LLVMgold.so.

    Alternatively the plugin can be installed to the default plugin sear path (/usr/lib/bfd-plugins). This solution has some limitations however. One needs not-yet-released binutils to make this work with GCC and one can not easily install multiple plugins. I hope this situation will change in not so distant future.

    Forgetting these steps will give you undefined symbols as soon as you start to link static library build with -flto and GCC 4.9+ or -flto -fno-fat-lto-objects with earlier compilers.
  5. Disk space for the checkout, about 3GB
  6. Disk space for build directory: 0.9GB for normal build (3GB with debug info), 1.3GB for LTO build with GCC (1.6GB with debug info) and 0.72GB for LTO build with LLVM (2.2GB with debug info).
  7. At least 4GB of RAM (if your machine has 4GB, some extra swap is needed to pull out linker when LTO optimizer is working) for GCC LTO and 8GB for LLVM LTO.
  8. At least 3GB of /tmp space for GCC's WHOPR streaming. If you use tmpfs, you may want to redirect GCC's temporary files to a different directory by setting TMPDIR.

Be sure you actually use LTO and have plugin configured right

One of common mistakes is to use GCC -flto without linker plugin. This has three important consequences.
  1. GCC will always build fat LTO files and your builds will take twice as long then needed
  2. GCC will not have visibility information from linker and will not optimize very well across symbols exported from the binary. You may change this with -fwhole-program, but this solution is hard to use and still has some negative impact on code quality. It is a lot better to use linker plugin and the -fvisibility=hidden if you build PIC library.
  3. If you produce static library of LTO files and you link it without the plugin, the link time optimization on those will not happen. You will get the same code as without -flto. I believe many users who report no benefits and/or ability to build very complex apps actually fall into this trap.
I hope binutils will be patched to diagnose these cases more gratefully.

    Be sure you get link-time optimization flags right.

    Both GCC and LLVM perform some optimizations at compile-time and some optimizations at link-time. Ideally the compiler should pass down the flags from compile-time to link-time but GCC (and I believe LLVM, too) is not yet fully reorganized to support per-function flags (I plan improve this for next release).

    Worse yet, GCC does some magic, like trying to get -fPIC flag right, that does not really do what it should, but it hides some of obvious to detect mistakes. Richard Biener improved it a bit for 4.9, but the sollution is still not very good.

    It is important to pass proper optimization flag to the final invocation of compiler that is used to link the binary. Often it can be done by adding your optimization flags to LDFLAGS. Failing to do this will probably result in not very well optimized binary.

      Compilers

      I used unpatched GCC 4.9 release candidate 1. I configured my compiler with:
      --prefix=/aux/hubicka/49-install --disable-plugin --enable-checking=release --disable-multilib --with-build-config=bootstrap-lto
      And built with make profiledbootstrap. If you want to use LTO and FDO together, you may want to get this patch for PR47233 that did not make it into 4.9.0 and will land 4.9.1 only. It reduces LTO+FDO memory use by about 4GB.

      I also use relatively fresh LLVM checkout 206678 (Apr 19). Eric Christopher recommended me to try recent LLVMs since the memory usage with LTO was reduced. I also briefly tried 3.4 that seemed to behave similarly modulo needing more memory. The wrong code issues I mention here as well as the crash on LTO without -march=native happens in both cases. I configured with
      --enable-optimized --disable-assertions --prefix=/aux/hubicka/llvm-install 
      and rebuilt by itself. My understanding is that this should give me fastest binary. Release builds  seems to be with assertions enabled that seems to cause noticeable slowdowns.

      To both compiler paths I install the shell wrappers for ar/nm/ranlib as mentioned above.

      Firefox changes needed

      The actual changes to Firefox are tracked by our tracking PR45375. Here are changes that actually landed upstream I write about mostly to give an idea what source code changes are needed:
      1. Patches to mark symbols used by top-level asm statements by __used__ attribute. This change is needed because unlike the statement level asm statements, the toplevel asm statements have no way of annotating them with symbols used. For this reason the symbols in ASM attributes are not visible to linker and GCC prior code is generated. If you won't use explicit __used__ attribute, GCC will freely optimize out the symbols leading to link errors.

        This solution somewhat sucs. LLVM has built in ASM parser that makes this transparent in some cases, but not in others.

        The patch is linked from Firefox's PR826481
      2. Patch to libvpx to avoid grepping bytecode assembly files.

        The problem here is that libvpx produce assembly file and then greps it for specific code in it. With -flto the assembly file consists of compressed LTO bytecode and thus the grepping fails.  One needs to pass -fno-lto to the specific unit.

        More in Firefox's PR763495
      Patches not pushed upstream
      1. A hack to prevent configure script to be confused by LTO and disable support for visibilities. This is linked here. Without this patch one gets working Firefox, but code quality is lousy.

        Confusion of confiugre scripts is unfortunately common problem for LTO. Many of the scripts was written with traditional compilation model in mind.
      Martin Liška and Markus Trippelsdorf put together bare bones of GCC LTO FAQ.

        Firefox configuration

        My mozconfig is as follows:
        ac_add_options --enable-application=browser
        ac_add_options --enable-update-channel=${MOZ_UPDATE_CHANNEL}

        ac_add_options --enable-update-packaging
        ac_add_options --disable-debug-symbols
        ac_add_options --disable-optimize
        ac_add_options --disable-tests
        export MOZ_TELEMETRY_REPORTING=1
        export MOZILLA_OFFICIAL=1
        mk_add_options MOZ_MAKE_FLAGS="-j16"
        export MOZ_PACKAGE_JSSHELL=1
        MYFLAGS="-O3 -march=native -flto"
        export CFLAGS=$MYFLAGS
        export CXXFLAGS=$MYFLAGS    
        export LDFLAGS="-O3 $MYFLAGS"              
        mk_add_options MOZ_OBJDIR=/aux/hubicka/final2-gcc-firefox-49-lto-O3-native-debug
        I use --disable-optimize to be able consistently set flags to all files by MYFLAGS. For LLVM LTO build one also need in some cases --disable-elfhack (or one gets segfaults). I will discuss effect of enabling-disabling debug symbols briefly in the next section.

        LLVM LTO builds crashes in instruction selection with LTO. I suppose it is because some units are build with custom -march flag with use of SSE intrincisc but this information gets lost to LTO. To get binary built I link with -march=native. I am in the progress of looking for testcase to report the issue. The resulting binary do not work for me, so all binaries benchmarked are with generic tuning.

        To get build with profile feedback, one additionally need to enable tests.

        Getting parallelism right

        If you want to get parallel build with LTO, it is better to parallelize the linktime optimization, too. This is currently supported by GCC only. In GCC you can pass parallelism to -flto parameter. I use -flto=16 that gets me 16 processes. There is no need for passing higher values. Parlallelism increases the memory use, so you may need to use lesser parallelism than is your CPU count if you are running into swap storms.

        Other important thing to consider is correlation with make's parallelism. If you build with -j16 then make may execute 16 linking commands in parallel that may each execute 16 worker processes killing your machine with 256 parallel compilations. For Firefox it is not an issue as there is only one dominating binary, for other projects you may want to be cautious.

        GCC support -flto=jobserv (implemented by Andi Kleen) that lets make to drive the whole process. There are few downsides of this:
        1. You need to adjust your Makefile to add + in front of the rule executing link command, otherwise GNU make will not pass down the necessary jobserver information. A lot of users seem to not read the docs far enough to notice this trap and get serial links.
        2. GCC 4.9 newly parallelizes the streaming stage and because I do not know how to communicate with Make's jobserver, this stage won't get parallelized, yet. GNU Make's jobserver seems easy to deal with and patches are welcome :)
        3. If you cut&paste command from Make's output, the linking will be serial because jobserv is not run.
        We will need some cooperation with GNU make maintainers to get these problems hammered out.  In general I think -flto should default to an "auto" mode that will by default detect number of CPUs available unless it knows about Make's jobserver and Make's jobserver should be available without the extra Makefile change.

        The parallelism is reached by breaking program into fixed set of partitions after the inter-procedural analysis is completed. The number of partitions is not controlled by -flto and is 32 by default. If you want higher parallelism or smaller memory footprint, you may increase number of partitions by --param lto-partitions=n. This will affect resulting binary but it should not really decrease code quality by a considerable fraction. The number is fixed to ensure that builds on different machines gets precisely the same output.

          Building

          To build Firefox normally, one use
          make -f client.mk 
          For the profiled build one needs X server running (I use tightvnc) and use  
          make -f client.mk profiledbuild
          If X is not running, the build passes but the resulting binary is crap, because all it is optimized for is to output the error message about missing X.

          My build machine

          My machine (courtesy of IUUK) has plenty RAM (64GB) and AMD Opteron(TM) Processor 6272 with 16 threads running 2100Mhz. I use tmpfs for /tmp directory and run modified Debian Wheezy.

          Compile time

          Wall time


          One observation is that GCC's LTO is no longer coming for extreme compile time costs. The slim LTO files avoid double optimization and the build gets parallelized pretty. The extra wall time caused by -flto=16 is comparable to switch from -O1 to -O2.

          LLVM without LTO clearly builds faster. Its -O3 compilation level is about as fast as -O0 in GCC, while GCC's -O1 is just bit slower and other optimization levels are significantly slower. With LTO LLVM builds a lot slower due to lack of any parallelism in the build stage. Also LLVM's optimization queue is organized in a way that virtually all optimization are run twice; once at compile-time and then again at link-time. While this may account extra cost, I think it is not too bad due to speed and relative simplicity of the LLVM's optimization queue. It is more problematic in a way that optimization done at compile time may inhibit optimizations at link time and I expect this will be revisited in future.

          Will debug info kill my builds?

          Other interesting topic is debug info. While GCC basically carries all expenses of debug info even with -g0. LLVM doesn't. The LTO memory usage increases many times with -g and build time with -O3 -flto -march=native -g is 37m46s (2266s), so about 46% slower. GCC needs 15m20s (extra 11%). I left out these from the graph to avoid rescaling.

          For non-LTO builds GCC -O3 -g compile time is 12m41s (761s), about 18% slower than -g0, LLVM -O3 -g compile time is 9m35s (575s), about 13% slower.

          Previous releases

          To put this into further perspective, GCC 4.7 needs 35m1s (2101s) to complete the job with -O3-flto=16 -g, GCC 4.8 needs 40m10s (1810s). Firefox build with GCC 4.7 and LTO crashes on startup for me (earlier checkouts worked). I did not investigate the problem, yet.

          CPU time

          Parallelism plays important role: while compilations can be easily done in parallel via make -j16, linking often gets stuck on one binary.  Here is how system+user time compares on individual compilation:


          Here it is visible that LTO compilation does bit more work for both compilers, but WHOPR is able to keep overall build time in bounds.

          Compiler's memory use

          Here I present some data collected from vmstat plotted by a script by Martin Liška (who originally got the idea from Andi Kleen). Red line in CPU utilization graph shows the case where only one CPU is used.

          GCC 4.9

          GCC memory/cpu use at -O0


          GCC memory/cpu use at -O3


          GCC memory/cpu use at -O3 -g

          In the graphs one can actually see that relatively a lot of time (100s) is spent before actual compilation begins. It is make parsing through Makefile and executing bunch of python scripts. The same happens after 600s to 700s where the same is repeated for different make target. The little peak at the end 700s is gold linking the final libxul.so library. We will soon see this like peak to grow up :)
          GCC memory/cpu use at -O3 -flto
          Here the memory usage during actual compilation is lowered to about 4GB and compilations are finised in about 500s (faster than with -O3 in the previous graph).
          What follows is a memory mt. Everest climb in memory usage graph starting just after 500s.  One can see the serial link stage running from 500s to 600s

          Mount Everest climb in memory usage starting just after 500s is the LTO optimization libxul.so, the largest Firefox's library:
          1. WPA (whole program analysis) type merging: The steeper walk up to 550s is compiler reading summaries all of the compilation units, merging types and symbol tables.
          2. IPA optimization: The slow walk up is inter-procedural optimization, dominated by inliner decisions.
          3. WPA streaming: Just after 600s the serial stage is finished and the steep peak up is parallel streaming and shipping new object files into copmilatoin. You can see the CPU usage to go up, too.
          4. Local transformation: Shortly after the steep valley the actual compilation starts, executed in parallel for another almost 200s.
          While the memory usage in my graph peaks 16GB, you can see it all happens in the parallel stage.  If you link Firefox at 8GB machine you can either increase --param lto-partitions or decrease -flto parallelism. Both will make the mt. Everest to sink.

          The little hill just before the mt. Everest (around 500s) has about the same shape for a good reason: it is linking of the javascript interpreter shell, the second largest binary in Firefox. It is a good example that bit of build machinery reorg would probably help to hide most of the LTO build time expenses. All one would need would be to overlap the javascript shell link with the libxul.so link. Something that is not needed in the normal build, since the link time of javascript shell is almost immediate.

          An unfortunate debug info surprise

          GCC -O3 -flto=16 -g
          This graph was supposed to show, that debug info is almost for free. This is true up to about 700s. Then it shows quite large increase in peak memory use (16Gb to 25GB) for debugging enabled in the local transformation stage. Sadly this is a new problem with recent Firefox checkout I did not noticed until now. I am looking into an solution.

          If I have 8GB or 4GB of RAM?

          The following are the worst case (debug info + -O3) builds with reduced parallelizm to 8 and 4. I regularly build firefox on my notebook with 8GB of RAM and 4 threads. On 4GB machine the build is possible, but one would need to either go to -flto=1 or play curefuly with partitioning. Again, Firefox is quite a monster.
          GCC -O3 -flto=8 -g

          GCC -O3 -flto=4 -g
          GCC -O3 -flto=16 --param lto-partitions=16 -g

          Older GCC releases


          GCC 4.7 with -O3 -flto=16 -g
          To get bit of perspective, this shows improvements since GCC4.7. Fat object files accounts for increasing the compilation stage to almost 1000s. The type merging is the footstep of the hill followed by slow inliner decision and streaming (the shanky part of graph, around 1500s). The WPA stage needs 8GB instead of 4GB in GCC 4.9.

          Again this graph shows that there is a new problem with debug info with recent Firefox checkout. Again this is fixable by reducing parallelism.

          GCC 4.8 -O3 -flto=16 -g
          While GCC 4.8 links almost twice as fast compared to 4.7, the memory usage is still bad (serial stage has about the same memory use and local compilation needs 5GB more)

          LLVM

          LLVm memory/CPU use at -O0

          LLVM memory/CPU use at -O3
          LLVM memory/CPU use at -O3 -flto -march=native
          LLVM is done with compilation in about 350 seconds, that is quite surprisingly similar to GCC's time.I would expect here cleang to beat GCC's C++ FE hands down, but apparently a lot of compiler time difference nowdays is in the back-end and also by the integrated assembler. Or perhaps it is because more optimizations are performed at LLVM's compile time. Then it follows by serial link stage.

          Without LTO LLVM uses about 1/3rd to 1/4th of memory, while with LTO it needs about twice as much (given that the link stage is not parallel at all). This is the case of compilation with debug symbols disabled only.
          LLVM with -O3 -g

          LLVM with -O3 -flto -g
          This graph shows issues with debug info memory use. LLVM goes up to 35GB. LLVM developers are also working on debug info merging improvements (equivalent to what GCC's type merging is) and the situation has improved in last two releases until the current shape. Older LLVM checkouts happily run out of 60GB memory & 60GB swap on my machine.

          File Size

          I generated the file size comparison splitting the binary into text, relocations, EH and data that accounts also the other small sections.I made EH data to appear gray, since they most of time don not need to be loaded into memory and thus their size is less critical then the size of other sections.



          One thing that LTO is good for is code size optimization. This is not as clearly visible on libxul.so, because it uses gold to do some of dead code removal, but will become more clear in other applications.

          In GCC's LTO at -Os bring about 5% reduction to both code and data segment (hover over graph to get precise numbers) and 20% reduction to EH data. For -O2/-O3 the data segment reduction stays, but the code is expanded by inlining. This is just the out-of-the-box configuration. The inliner is by default not really well behaving on large applications. Some of it I already fixed for next GCC release here. For now I recommend to reduce the inliner expansion by --param inline-unit-growth=5. This brings binaries close to non-LTO -Os, but without the runtime expenses. I wanted to make this default for GCC 4.9 but did not have chance to complete the necessary bencharking, so it is something I want to address this development cycle (in early tests done in mid 2013 there was issues with SPEC2k6 xalancbmk benchmark). I will follow on this and show some tests.

          UPDATE: I rebuilt firefox without -function-sections -fdata-sections and gold's section removal + identical code merging (which require the sections). The overall binary reduction with the default configuration is 43% (in LTO build)

          For LLVM LTO usually expands the binary including the -Os compilation, at -O3 it is about 30%. I believe it is mostly by inliner seeing too many cross-module inlining cases, too. LLVM has no global growth limit on inliner. What happens with LTO is that almost all functions become inlinable and simple minded inliner is just too happy and inlines them all. I believe that this is partly reduced by fact that LLVM does code expanding inlining at compile time that consequently disables some of cross-module inlining at link-time. Did not double-check this theory, yet.

          The data segment is reduced by about 3% and it is smaller than data segment of GCC produced binaries by about 6%. I tracked down part of the problem to GCC aligning virtual tables as if they were arrays for faster access by vector instructions. This is quite pointless and will make patch to avoid it. Other problem is the lack of optimization pass to remove write only variables that I made patch for next release cycle. Last difference seems to be LLVM bit more aggressive about optimizing out C++ type infos that GCC believe are forced to stay by C++ ABI. I looked into this and I lean towards conlcusion it is LLVM bug, but I am not expert here and further analysis are needed. Some of the symbols has appeared forced in recent LLVM checkout.

          One thing to observe is that GCC seems to produce considerably smaller code segment with -Os than LLVM (14%). As it will be shown later, LLVM's -Os code is however faster than GCC's -Os. It is the nature of the switch to trade all performance for code size however. -O2 code segment is smaller in LLVM (5%). I did not look into this in detail, yet, but it is my long time feeling that -O2 may need bit of revisiting from large application developer point of view. A lot of code generation decisions are tuned based on benchmarks of medium sized apps (SPEC2k6) that makes code expanding optimizations to look cool. I did bit work on this for GCC 4.9 and generic tuning, but more needs to be done. The size difference further increase at -O3 that I consider OK, since -O3 is about producing fast binary at any cost. As it is shown in benchmarks, -O3 in GCC does bring benefits, while for LLVM it seems to be close to -O2 performance.

          GCC also produce bigger EH tables (27%) at -O2. Partly this is concious change in GCC 4.9 where push/pop instructions are used to pass on-stack arguments. This produce shorter code and work well with modern CPUs having stack engines but the unwind info bloats. LLVM also seems to have bug where it optimizes away the EH region in:
          #include <stdio.h>
          void
          t()
          {
          }

          void
          q()
          {
            try {t();} catch(int) {printf ("catch\n");}
          }
          Even with -fPIC -O2. This is not valid, because t can be interposed by dynamic linker to different implementation that does throw. I filled in PR for this and will double check if those two are the sole reasons for the difference.

          Runtime benchmarks

          I used talos to collect some runtime performance data. Sadly I do not have LLVM LTO data, because the binary crashes (and moreover it is native build rather than generic). I plan to followup on this and do the comparison once I hammer out the problems.

          Startup time tests

          First benchmark is ts_paint that starts the browser and waits for the window to be displayed (in my understanding). I tried also bechmark called ts, but it does not produce results.

          Before each talos run I flushed the cache. Talos is running the benchmark several times, so it gave me one cold startup run and several hot startups (I hope). In the graph bellow is the cold startup and average of hot startups. Cold startup have more noise in it, because it is just one run and I suppose it also depends how well the binary landed on the hdd.

          I am not entirely happy about this benchmark and will write more later. The main thing is that Firefox uses madvise call to fully load every library disabling the kernel's page demand loading. Martin Liška implemented function reordering pass that, with profile feedback, should allow to load just small portion of the binary. I will followup on this later. See also Taras's blog post on this topic.

          The startup tests seems to be insensitive to LTO except for case when profile feedback was used (FDO saves 7%, LTO additional 4%). Cold startup shows some sensitivity to file size, but apparently my HDD is fast.

          Opening window


          Opening window is a nice test to show compiler flags, since it should fit in cache. Here LTO wins bt 5% at -O2, 2.3% at -O3 and 9.8% with FDO (that by itself brings 6.4%). One of two (Update: three) benchmarks where LLVM performs better than GCC -O2, I am not sure if it is within a noise factor.

          Scrolling


          Scrolling has similar properties to window opening: just bunch of C++ code that fits in RAM and thus is sensitive to compiler optimization.

          SVG rendering




          Here I used tsvgx test (that is said to be better than tsvg) and since it tests multiple images, I made geometric average.


          Canvas mark


          Again a gemoetric average, I did not investigate yet, why some runs has failed. Will update on it.

            Dromaeo CSS


            A geometric average of individual benchmarks.

            Dromaeo DOM


            This is the only benchmark where LLVM is not approximately in GCC -O1 to -O2 range and in fact LLVM -O2 outperforms GCC -O2 by 7%. I profiled it and again it was caused by GCC not inlining by PIC interposition rules, so the difference should disappear if the problem is fixed at LLVM side.

            Allowing the inline by adding explicit inline keyword makes GCC -O3 to match score of GCC -O3 -flto=16 where the inline happens then based on linker plugin feedback. Again someting I plan to followup on. There is bit more going on that I need to investigate deeper. For GCC, the crictical inline happens only at -O3 -flto, while LLVM gets it right at -O2 despite the fact that its -O2 binary is shorter than GCC's. The inline decisions are significantly different making it a bit harder to track down.

            Update: By mistake, I reported large improvement for -flto -O3 with LLVM. This was actually caused by fact that I pulled in some changes into firefox tree while looking for workaround for the compiler crash. One of them was patch improving DOM. I rerun the tests on original tree.

            Update: Loading common pages


            This test seems to be only one that seems to show that -O3 is not always a win due to code size bloat.

            So what we can improve?

            LTO in GCC has quite large room for improvement. I have patches for next release to improve inliner's behaviour and reduce code size considerably without giving up on the speed. There is also still a low hanging fruit on imroving memory usage and compile time. Re-tunning backend and applications themselves for the new environment will need more work.

            Comparing to LLVM it seems we should try to reduce amount of information stored into LTO object files that is almost twice of what LLVM streams. Part of the extra information is used for optimization and debug info, but the streaming was not very curefuly analyzed for what really is important where: it is more or less pickled in-memory representation. There is ongoing effort to cleanup GCC, modernize the source base that will hopefully enable some deeper cleanups, like simplifying the type representation, that will in turn improve memory footprint. Clearly speeding up C++ frontend and putting our datastructures on the diet can do wonders. (I would be curious to see clang retargeted to GCC backend).

            I also have quite few plans for profile feedback. I plan to reorgnize instrumentation so with LTO one will not need to recompile everything twice, only need to relink and hopefully AutoFDO and LIPO will be merged in.

            Next

            Hope you found the tests informative and I will try to followup on the unresolved questions in the near future. I also plan to write post on Libreoffice and for that I would welcome suggestions for benchmarking procedure.

            Linktime optimization in GCC, part 1 - brief history

            Link-time optimization (LTO) is one of the most actively developed features of GCC.
            Since 2010, I am working on getting LTO to work well with real world applications and with GCC-4.9 release candidate out, perhaps it is a good time to share my experiences.

            I plan to write about getting Firefox (Update: part 2 is here), Libreoffice, Chromium and other large applications to work with LTO. This will hopefully give an broader idea what kind of setup & portability issues one can hit. I will also show some benchmarks. Today I will however start with a short review of history and current status of implementation of link-time optimizations in GCC.

            Update: See also Martin Liška post on buliding Gentoo with LTO.

            Classical organization of the toolchain

            Classical UNIX toolchain optimize at translation unit basis (that is, more or less, a source file). The compiler parses the source, produces a representation in the intermediate language (IL), optimizes it and eventually generates assembly. Assembly file is turned into an object file by the assembler. Resulting binary is produced by the linker. At runtime the dynamic linker links multiple DSOs together and executes a program. Finally dynamic linker may be used again via dlopen to load more libraries/plugins. Practically all non-trivial optimizations happens in the compiler, while some limited tricks are performed by both assembler and linker.

            Early simple compilers internally worked on statement basis; often it was the case that the whole program would not fit into the memory (I will call this compilation scheme statement-at-a-time). Statement-at-a-time compilation greatly limits optimizations one can perform. Generally one can do only those transformations that are obvious from considering single statement in the isolation (such as expression simplification) though some basic inter-statement optimization is possible, too, such as basic common subexpression elimination or peephole optimization. Such optimizations can be implemented, at least in the limited form, as a single pass through the instruction stream without ever looking back.

            Extending the scope of the compilation from statement to function (function-at-a-time compilation) permits most of classical optimizations including constant propagation, common subexpression elimination, loop optimization, register allocation, advanced instruction selection and instruction scheduling.

            GCC started with a mixture of statement-at-a-time approach and function-at-a-time. Frontends generated intermediate code per-statement basis and immediately lowered it to very low-level (close to assembly) intermediate language called RTL (Register Transfer Language). According to Richard Stallman this was mostly done to deal with limitations of the stack size and memory of M68k machine he developed GCC on. The rest of the compiler worked at function-at-a-time. Even early GCCs was thus able to do basic function wide optimizations and until late 2.9 GCC series the optimizers has developed into strong low level optimization engine that was more portable to new architectures than any other compilers available at that time. The optimization queue consited primarily of strenghtened form of constant subexpression elimination (global value numbering), loop optimization (loop unrolling, strength reduction), jump optimization (including conditional jump generation), instruction combining, register allocation, and peephole optimization.

            Cygnus was a pioneering free software company that made its living from porting GCC to new architectures. The main focus in late 90s was put into making the originally CISC centric optimizers work well with RISC and later VLIW instruction sets and to generally improve set of intra-procedural optimizations. Important development change happened in 1997 with introduction of EGCS fork which enabled some of deeper changes in the compiler.

            Compiling C++

            In 90s and early 2000s, the popularity of C++ was on the rise while the language itself was still a quickly moving target (C++ was standarized in 1998 with an update in 2003 and C++ ABI took few extra years to settle down). This brought challenges not only for the developers of C++ front-end, library and runtime (that is a separate topic I do not know enough about to write it), but soon enough for the backend developers, too.

            In late 90's GCC was converted to function-at-a-time parsing. This was motivated mostly by a need of reasonable implementation of C++ templates. Here one want to keep close to source representation (generally called Abstract Syntax Tree, or AST) of templates until instantiation happens. Just like preprocessor macro, the template alone has no semantics that may be easily represented by classical compiler IL. This is hard to implement within a front-end that has no internal representation of the whole function body.

            Dealing with abstraction penalty

            C++ programmers more and more relied on compiler's ability to remove the abstraction penalty. It became essential to reliably inline functions, optimize out unreachable function bodies and do higher level transformation on function bodies (such as to replace data stored in instances of objects by scalar variables after all methods was inlined). C++ programers started to make strong assumptions about compiler's ability to cleanup their constructions while that are often not very well thought out from compiler developer's perspective.

            As one of main offenders, function inlining was not a strongest point of GCC, which still worked in function-at-a-time mode: GCC backend was organized as a library that was used by frontend whenever it needed to produce something into the assembly file. For that purpose the order of things compiled was completely out of backend's control and the scope of inter-procedural optimizations in general was extremely limited.

            Bottom-up inlining

            This is how the impossible (interprocedural optimization within a backend that generally works at one function at a time) was implemented.

            After function arrived from the frontend, backend performed following steps:
            1. Inliner inlined function calls of all functions it did previously memoized body of.
            2. Some early optimization passes performed some basic optimizations (dead code removal, jump optimization)
            3. Inliner checked if the optimized function is small enough, i.e. the number of instructions was less than 8*(8+num_parameters) and if so, it remembered the function for further inlining. Functions declared inline was remembered if their body did not exceed 600 instructions.
            Such inliner implementation is called bottom-up - if you imagine callgraph as a tree growing down, the inliner starts from the bottom (leaves) and propagates inline decision upwards. Obviously such inliner was limited to inline only in forwarding direction (the function must have been defined before used) and also was believed to consume a lot of memory because it remembered the expanded function bodies with all inline decisions already applied.

            Top-down inlining (GCC 3.0)

            The old inliner was rewritten in 1999 by CodeSoucery (originally for C++ only and later ported to other languages) to work on the high level form. In the new implementation inlining was moved into the front-end (in fact, both inliners has coexisted for few years before the new inliner was ported to all front-ends).

            At that time, C++ front-end already saved bodies of functions in the AST-like representation to handle templates. It also did not output most of functions just after it finished its parsing. Majority of functions was remembered in high-level form till very end of compilation when basic reachability analysis was performed to eliminate dead functions and to avoid unnecesary work of lowering them into the RTL representation. This was done because C++ include files tends to contain a lot of unused inline functions.

            Adding inlining was conceptually simple. At the end of the compilation, the compiler started with functions that was clearly necessary (i.e. their address was taken, they was exported from current compilation unit and so on). For every function the following steps was applied:
            1. It was decided if function is inlinable and of so, its body was saved for later use.
            2. Calls to all small enough inlinable functions was inlined and recursively also all such calls within the inlined function bodies.
            3. The function was optimized and output to assembly. The expanded function body was released from memory.
            4. Every function that got mentioned in the assembly output was added into list of functions that are necessary.
            This approach is called top-down because it works from root of the tree to the leaves.

            One problem was how to define small functions - the high level representation did use C++ style statements and not assembler like instructions. Initial implementation simply counted every statement to have 10 instructions at average and the functions declared inline was inlined if their size was less than 500 in this estimate. Functions was autoinlined if their size was less than 100.

            PR8361, tramp-3d and other compile time/memory use disasters

            The new inliner has turned out to have serious issues. One of more famous examples, at that time was bug PR8361 (filled in 2002 by Gerald Peiffer) that exposed exponential explosion in memory usage of the compiler while building template heavy C++ program. This bug defeated a solution using simple hacks to GCC and nicely exposed problems of the new design.

            First problem is use of inline in C++ code. Functions declared in header are often implicitly inline only because of developer's lazyness and the inline keyword in general is not as useful as one would expect. This means that compiler can not honor all inline functions in the program - inlining those alone will make compiler to explode on was majority of modern C++ sources. We had some relatively heated discussions about it in 2005 and of course making compiler to be selective about inline functions has made low level developers angry. For this reason always_inline attribute was born.

            New, and much more evil, testcase was prepared by Richard Guenther (Biener), who later became one of most profilic GCC developers. His PhD research project was implemented using pooma and till today it serves as very useful benchmark for new inliner patches. SUSE's benchmarking server nicely shows how the performance was steadily improving in 2006-2009 until the most important problems was finally resolved.

            Unit-at-a-time (GCC 3.4)

            Originally as an experiment to do something during the first SuSE-labs conference, I started to work on reorganizing GCC to work at unit-at-a-time basis. At that time it was already obvious that function-at-a-time optimization brings limitations we did not like. Most importantly the function inlining worked only in forwarding order - if you called function before defining it, it would not be inlined.

            In GCC 3.4 I merged in bare bones of the new callgraph module (described in this paper) that made compiler to first build intermediate language representation of the whole unit and then start optimizing.

            The new inliner was something I put together as a simple proof-of-concept experiment of inter-procedural optimization pass. It used simple greedy algorithm that inlined all small enough functions in the order given by badness metrics (num_calls*size) until global unit growth of function growth limits was hit. The algorithm was optimizing for number of functions fully inlined into their callees. The limits was basically a security precautions to avoid exploding on inline bombs and exposing nonlinearity of the compiler on very large function bodies..

            Bit more details about the early design appear in 2006 GCC summit paper and in  2007 GCC summit paper.

            The inliner heuristics was the incrementally improved into the today form and also the callgarph module has grown up into a full inter-procedural optimization framework with additional optimizations implemented, such as constant-propagation, function specialization, partial inlining, inter-procedural points-to analysis (still not enabled by default) and others. The inliner still remains the most important component of it, probably as in every optimizing compiler.

            Combining translation units via --combine (GCC 3.4)

            GCC 3.4 was released in 2004. Since late 90's better commercial compilers and recently open-sourced Open64 was already capable of inter-module optimization via various implementations of link-time optimization. Here GCC was held back by both technical and political problems.

            Clasical link-time optimization is implemented by storing the intermediate language into the "fake" object files. To make this possible, one need to have well defined representation of the whole compilation unit and the whole program. Because the backend has grown-up from the per-function compilation, it did not really had this.

            At a same time, the Free Software Foundation was concerned about the consequences of adding a well-defined on-disk representation of GCC IL that could be used to hook GCC into proprietary front-ends and back-ends without violating GPL 2.

            First attempt to get intermodule was implemented by Apple developer Geoff Keating. His implementation extended the C front-end to allow parsing of multiple source files at once and presenting them to backend as one translation unit. While this feature allowed some early experiments with whole program optimization, it also had serious limitations. It worked for C only and plans to extend it for C++ was never realized. It however worked well for many simpler programs and it was also able to detect programming errors that would be unnoticed with traditional scheme.

            The new middle end (tree-SSA, GCC 4.0)

            These optimizaitions became increasingly difficult to implement on GCC's low level intermediate program representation. For this reason a new middle-end was implemented. The tree-SSA project was probably the most important change that happened to GCC recently, but a topic for a different article ;)

            LLVM (2003), Open64 (2002)

            At the time Tree-SSA merge was being prepared, Chris Lattner presented at 2003 GCC Summit an proposal of itegration of LLVM into GCC as a new middle-end that would take place of the (not yet finished) Tree-SSA project. This never happened and as a result we now have two major open source compilers.

            One of major trading point at that time was implementation of LTO. People made guesses if it will be harder to modify GCC architecture for LTO or if it would be harder to complete LLVM to features and richness of backends GCC had. It is definitely interesting to observe that from today point of view.

            Update: Here are slides of Rafael Espindola talk about history of LTO in LLVM.

            As a less known fact, Open64, was open sourced 2000 working as an alternative GCC backend with full LTO support.

            Early days of LTO implementation in GCC (2005)

            The political limitations to LTO and compiler plugins was lifted by GPL 3.0. In 2005 an official proposal "Link-Time Optimization in GCC: Requirements and High-Level design" was posted to GCC mailing list. Work on LTO started jointly at Google, IBM and Codesoucery.

            The proposal took lessons from LTO designs in other compilers and proposed several important requirements starting from maximal transparency to user (basically LTO should require only -flto command line option), having well defined GNU Virtual Machine (GVN) and number of necessary changes across the GNU toolchain (assembler, linker and such).

            WHOPR: Whole program optimization (2007)

            Implementing LTO original proposal was extreme amount of work. While progress was made, some parts, like type representation in DWARF format, has shown to be progressing too slowly. Moreover Google started to consider an interesting problem on how to make LTO scale to really large applications, like one used internally by the compay.

            For this reason number of changes to the project was made. New proposal WHOPR - Fast and Scalable Whole Program Optimizations in GCC was born and it dropped some of initial plans, most importantly the well defined GNU Virtual Machine while it added new. The streaming of GCC's IL (gimple) was implemented. This has performance advantages because it permits to save a lot of compiler specific imformation that does not need to be rebuilt at link time.


            WHOPR introduced an approach to distributed link-time compilation, where the serial linking stage is minimalized. This fit well with the basic direction of the Intermodule optimization framework I was working on.

            LIPO

            LIPO is an interesting project by that apparently large difference for Google. It enables intermodule optimization based on profile feedback. Instead of doing classical and expensive whole program analysis at link-time, the basic decisions are actually done at runtime of the program being trained. The dynamic callgraph is built and the inlining decisions are made. The dynamic callgraph considers only portions of program that matters and thus it is a lot smaller. Also profile driven inlining decisions can be a lot more precise and reliable. Later, when recompiling with profile feedback, the compiler is told the inliner decisions and can parse the extra source files necessary.

            While this idea seems a bit backwards to the traditional design, it makes a lot of sense in Google's environment. Compilation needs to be highly parallelized and it needs to fit in relatively small memory limits. On the other hand the runtime construction of callgrpah is less constrained. In this situation LIPO seems to easily outperform traditional LTO.

            LTO merge (GCC 4.5)

            LIPO's success in Google got development on WHOPR/LTO into trouble (at least that is my off-sider's understanding :). Most of WHOPR core developers were moved to other projects and they were finishing it in their free time. The project however succeeded in being merged to GCC in 2009, even if it was not in perfect shape. GCC 4.5 was released with LTO working well enough for SPEC benchmarks and bootstrap, while the actual WHOPR implmentation was not really functional. Richard Biener did huge amount of work on getting the LTO into working shape.

            Update: Solving correctness issues

            Having ability to store intermediate language to disk and load back is one problem a developer of link time optimization framework needs to solve. There are many other (and less understood) challenges.

            Important part of the problem is to define precisely the semantics of the combined units and revisit rules and assumptions individual optimizers make. Most language standards deal with one compilation unit alone (well, at least this is mostly true for C, C++ has One Definition Rule). Consequently one has to think what happens when there are two types of the same name with different meaning coming from different units, what happens when pointer to datastructure in one language (C, C++, Fortran, Ada, Java) is accessed as datastructure in another language. One has to deal with conflicting optimization flags (like -fstrict-aliasing used in one unit, but not in another) because all the code is going to be mixed together by the inter-module inlining.

            At GCC 4.5 we want long way for one language and C/C++ mixed programs. Compiler was partly cleaned up from hooks allowing front-ends to customize backend behaviour and the basic type merging was implemented. Many of the issues however was left unresolved and had to be dealt with later.Some of them, in particular the command line issues, do not have completely satisfactory solution till present day.

            Binutils support

            LTO optimization is major change and it requires changes thorough the whole toolchain. Linker and other utilities to work with object files needs to be told to operate with LTO object files that are not containing any actual assembly. In GNU toolchain this was realized by  plugin support (implemented by Cary Coutant). Plugins adds a simple, but flexible API, to glue compiler into the gold linker that is useful for all LTO enabled compilers. Unlike approaches working around the linker changes this provide very precise information to the compiler about what symbols may or may not be used by non-LTO code. This is very important for optimization.

            The linker plugin support was later added to the traditional GNU-ld (by Dave Korn), enabling LTO support on non-ELF targets including Windows.

            One interesting aspect of the linker plugin design is the fact that it allows one pass linking. Most other designs dispatch compiler from the linker and once compiler delivers object files back, the linking is re-done. Gold and GNU-LD implements one pass linking. This has however caused an issue and thus alternative, two-pass linking, implementation was done by H. J. Lu.

            Final ar,nm and ranlib utilities was extended for plugin support, by Rafael Espindola, and gcc convenience wrappers added by Andi Kleen. I am still not completely happy about the need for wrappers, but I will write more about that next time.

            Getting WHOPR to work (GCC 4.6)

            Having basic infrastructure in shape, many GCC developers continued on necessary cleanups and improvements. The linker plugin became the default and we worked on many correctness issues.

            To get an specific goal I decided to work towards making GCC to compile Firefox. This was motivated by a post from Firefox developer Taras Glek about massive performance regression in GCC 4.5 (which has turned out to be problem with compiling C++ applications with -Os). I started to look into performance issues of Firefox and found them puzzling and interesting.

            In 2010 GCC summit paper we reported a success on linking Firefox in reasonable time (4 minutes on 24core machine) and 4GB of memory use. (Without the WHOPR mode linking took 19 minutes 8GB of RAM). The resulting binary did work and showed small, 1% improvement on dromaeo CSS benchamrks as well as 6% code size reduction. Not many compilers are capable of this.

            Firefox is a quickly growing target - a year later I needed 8GB of memory to build it with the same compiler,so GCC 4.7 had to be optimized carefully to squeeze it into 3GB. Today firefox, compiled with GCC 4.7 needs over 20GB of RAM. Some observations on this can be found in these 2013 slides.

            In gcc 4.6 -fwhopr was removed and became default with -flto. The non-whopr path can still be executed via -flto-partition=none.

            Kernel compilation (GCC 4.7+)

            Other long running project was Andi Kleen's effort to get Linux kernel to build with LTO. For a kernel developer, he made remarkable job on adding some missing features to the toolchain, mostly importantly the support for incremental linking. His effort is described in 2013 slides. The changes was considered for 3.15rc1 but missed the window apparently due to lack of evidence of benefits to the end-user.

            GCC 4.7 also had a rewrite of interprocedural constant propagation and a new inliner analysis that is aware of the context function is being inlined to (and, for example, can anticipate what parts of function body will be optimized out).

            Progress on correctly compiling non-trivial C++ program was made by reimplementing thunks support.

            Symbol table, scalability and stability improvements (GCC 4.8)

            One of long lasting GCC oddities was the lack of a symbol table datastructure. This made it harder to develop representation of some features that are tightly related to it, including aliases and thunks. For this reason we reorganized the callgraph module into a new symbol table datastructure that enables a better LTO optimization in programs involving those features.

            GCC 4.8 also had the usual work on memory reductions, LTO speedups. Since the LTO codegen started to be in shape, it was also the first release where the inliner went through re-tunning with link time optimization being considered.

            Today (GCC 4.9)

            GCC 4.9 is being released this week (with release candidate 1 being out since last week). It is pretty big release with many internal changes and API cleanups.

            The most important changes in LTO (besides the usual bugfixes) involve new type merging implementation by Richard Biener and many of compile time and memory use improvements. Firefox once again can be linked bellow 4GB limit despite its ongoing growth. In fact the memory use is now less than two-fold of memory needed by the linker alone that is not too bad result given how much more information GCC handle.

            An important user visible change is the new default to slim LTO files instead of fat.  GCC 4.8 and earlier produced both the intermediate language as well as assembly into every object file copmiled with -flto. This made it possible to handle LTO objects with non-LTO aware tools (ar, nm, ranlib and the linker) but it also doubles compile time, since everything is compiled twice. Since the binutils, libtool and other support matured, it is finally possible to avoid this expense.

            Martin Liška contributed pass for profile driven function reordering. This pass should enable faster startup of large applications by reducing the portion of binary that needs to be read into memory and making the accesses more sequential.

            Another new feature is the devirtualization module that allows early removal of unreachable virtual methods (that makes a huge difference on Firefox compilation) and also improves code as described in other post.

            GCC is now known to compile working Firefox, Chromium, Libreoffice, kernel and good a part of Gentoo distribution with LTO enabled.

            What next?

            LTO is far from being complete. One of main lacking features, by my opinion, is consistent handling of optimization options and debug info matching the non-LTO compilation. To get resonable performance benefits, the compiler needs to be re-tuned for the new whole program optimization mode and also some applications needs to be adjusted.

            There is also plan to merge LTO streaming based LIPO from Google branch, Martin Liška's code unification pass. I also hope the plans for well define virtual machine will be revived. In longer term I am planning to extend WHOPR to support fast recompilation in compile-edit cycle.

            Next time, i will write about series

            Friday, April 11, 2014

            Devirtualization in C++, part 5 (feedback driven devirtualization)

            After a break, I finally have time to return to the devirtualization series. Last time I was speaking about speculative devirtualization performed by static analysis of the type inheritance graph. This time I plan to speak about simple (but very effective) feedback directed devirtualization. While GCC implements basic feedback directed  devirtualization since 2003, GCC 4.9 will newly make this cross-module. I also finally show some statistics how effective the described optimizations are.

            What is feedback directed optimization? (FDO)

            Feedback directed optimization (or equivalently profile directed optimizations, PDO) allows compiler to collect information about the runtime program behavior and use it during the static compilation. This usually means compiling program once with an instrumentation (-fprofile-generate in case of GCC), training it on "typical" program behavior, and then recompiling with profile feedback (-fprofile-use).

            Feedback directed optimization is a well established optimization technique popular in performance critical applications (and benchmarking) since 80's. FDO is instrumental to get better inlining and code/data layout. It however helps many other optimizations including loop optimization, vectorization, register allocation, etc... Many optimizations needs to make tradeoffs between optimizing size or speed or tradeoffs making one code path slower while optimizing other code paths. All those can consult the profile feedback to get the decision right.

            FDO implementations

            The GCC implementation has grown up from gcov tool, written in 1990 by James E. Wilson, whose original primary purpose was to annotate source code at line-by-line basis by execution counts.

            Eventually the information collected by gcov was used to help compiler produce branch prediction hints for IA-64. Because the implementation was really ad-hoc, in 2002-2003 Zdeněk Dvořák, Pavel Nejedlý, Josef Zlomek and myself worked on school promject Infrastructure for Profile Driven Optimizations in GCC Compiler, whose purpose was to get profile feedback maintained and used thorough the whole compilation queue. Our main goal was to actually reorganize GCC backend to use common control flow graph datastructure, but the profile feedback was
            a nice marketing tool to demonstrate the importance of this work.

            FDO brings significant performance gains. We reported 14% improvement on SPEC2000 back then, where about half of the gains was also achieved by the static prediction techniques (here compiler simply guesses the profile). Never ever in my life I managed to improve SPEC score so much. The gains depends a lot on particular application, but generally they are higher than ones got by switching from -O2 to -O3 without the code size costs. (see also this gcc summit paper)

            The drawback of FDO is however the more complicated build process. Most developers do not want to care about training the program and rebuilding with feedback. While I was assured in 90's that FDO is extensively used by database vendors, even a decade later I found FDO being relatively little used in free software world where reproducible and simple builds are essential.

            In case you are interested in experimenting with FDO, building many of non-interactive packages with FDO basically means to do the following:
            CFLAGS="... -fprofile-generate" make
            make check <or any better training>
            make clean
            CFLAGS="...-fprofile-use" make
            For example, gzip's performance improves by about 6%, bzip's by about 10% and training these is trivial.

            Popularity of FDO is however improving. These days, for example, Google, Firefox and various language interpreters are heavy users of feedback directed optimizations. This motivated quite lively developments in the area recently. Google's LIPO makes interesting approach to use profile feedback to avoid expenses of full link time optimization, AutoFDO allows to use low overhead profiles instead of feedback from special purpose instrumetnation. (These two features are still off-tree, but I hope they will be merged for 4.10). Martin Liška implemented new feedback directed function reordering based on idea of Firefox developer Taras Glek

            In LLVM the feedback directed optimization became a focus recently and similar features are being implemented, too. Of course, Open64 implements decent feedback directed optimization as well.

            How feedback is collected in GCC

            To enable feedback directed optimizations, user needs to first build with -fprofile-generate. In this mode every compilation units gains a set of counters measuring various events of interest. Additionally every object file gains a static constructor which calls __gcov_init and registers the counters to the gcov runtime.

            Gcov runtime is a small librarly called libgcov implementing several helper functions and __gcov_dump that is responsible for streaming the collected data and merging them with previous runs. This library is tranparently added to linker command line when the program is linked through gcc wrapper with -fprofile-generate switch.

            The profile streaming happens from atexit hook and few other dirty tricks are done. For example calls to fork, vfork or thread creation are wrapped and redirected to libgcov to avoid double-accounting when both child and parent exits. So you are better to ensure you compile functions calling these with -fprofile-generate when profiling just part of your program.

            The on-disk profile information is saved per object-file, in files names <objectfilename>.gcda. These files can be read and merged by the runtime, used by GCC, and by external tools - gcov, gcov_exit and gcov-tool (which will be merged in 4.10).

            The basic information collected is so called edge profile, where execution count of every basic block and every edge in the control flow graph is measured. This is done using nice (and classical) instrumentation algorithm that avoids redundancies by computing minimal spanning tree. (See also interesting alternative path profiling, that is however not implemented in GCC, because the code quality benefits are not clear).

            The profile feedback infrastructure also allows so-called value profiling implemented by Zdeněk Dvořák in 2002-2003. Edge profiling allows GCC to determine values of various variables in the program that may be used for further optimization. One of the first uses of it was to optimize divisions. For a/b it may happen that b is often a nice constant and then one can optimize the sequence into someting like:
            if (b=2)
              a>>=1
            else
              a=a/b
            This simple trick often helps, for example for sily hash function implementations computing a%b where b is usually the initial hash table size, say 257.

            Feedback directed devirtualization in GCC

            Feedback directed devirutalization builds on the value profiling infrastructure. At every indirect call statement (there is no difference in between virtual call or classical indirect calls in this case) we try to determine the common destination of the indirect call and possibly speculatively inline the called function.

            Getting common value of variable cheaply

            Value profiling infrastructure implements various types of counters, one of them is one value profiler. You may see it in action if you compile the following with -O2 -fprofile-generate:
            int
            m(int a,int b)
            {
             return a/b;
            }
            You will get:
            m (int a, int b)
            {
              __gcov0.m[0]++; 
              __gcov_one_value_profiler (&__gcov3.m[0], b);
              return a / b;
            }
            Here gcov0.m increment is the edge profile counter that is no interest for us. The usual value of b is determined via __gcov_one_value_profiler call that is implemented in libgcov as follows:
            void
            __gcov_one_value_profiler (gcov_type *counters, gcov_type value)
            {
              if (value == counters[0])
                counters[1]++;
              else if (counters[1] == 0)
                {
                  counters[1] = 1;
                  counters[0] = value;
                }
              else
                counters[1]--;
              counters[2]++;
            }
            Three counters are used. counters[0] holds the usual value of b. counters[1] is increased when current value of b match the usual value and is decreased otherwise. counter[2] counts all execution of given statement.

            If b is a constant c for more than 50% of time, you will get counters[0]=c and counters[1] will give you conservative estimate how common the value actually is. GCC then simply checks counters[1]>counters[2]/2 and if so, it assumes that b==c most of the time.

            From common value to function identifiers

            Indirect call profiling was introduced by Zdeněk Dvořák in 2003 and after more than a decade I finaly reworked it for GCC 4.9 to be inter-module with LTO. Here I describe the new implementation.

            Getting common value of function pointer being called is an useless information without having a way to translate it back to the function identifier known to the compiler.  Doing so would require saving symbol table of the instrumented binary and dealing with a fact that one object file may link into many binaries and each binary may be loaded at a different address each time. Instead of that, GCC introduce a concept of profile_ids that are stable identifiers and can be easily mapped back to the symbol name within the compiler. Those are calculated as CRC of function's symbol name for external functions or as a combination of symbol name, source line and first global ID in the program for internal functions. This makes the unique in wast majority of cases.

            Lets consider a simple testcase with an indirect call:
            #include "stdio.h"
            int
            b(void)
            {
              return 1;..
            }

            int a(int (*b)(void))
            {
              return b();
            }

            m()
            {
              int i,j=0;
              for (i=0; i<1000000;i++)
               j+=a(b);
              printf ("%i\n",j);
            }
            Most compilers, with optimization enabled, will simplify m into:
            m()
            {
              printf ("%i\n", 1000000);
            }
            But this is not the case when you compile it as a shared library. This is because ELF interposition rules actually allows you to replace function a by something completely different, for example by the LD_PRELOAD mechanizm.

            GCC there correctly disables the inlining of all functions not declared inline that may be interposed with when compiling with -O2 -fpic. To my amusement, I just noticed that clang still optimize the code, I filled in PR19405 for that.

            With -O2 -fpic -fdump-tree-optimized one can see the instrumentation in the following form:
            a (int (*<T599>) (void) b)
            {
              int _4;
             

              __gcov_indirect_call_profiler_v2 (872192950, a);
              __gcov_indirect_call_callee = 0B;
              __gcov0.a[0]++
            ;
              __gcov_time_profiler (&__gcov8.a[0]);
              __gcov_indirect_call_counters = &__gcov5.a[0];
              __gcov_indirect_call_callee = b_2(D);
              _4 = b();
              __gcov0.a[1]++
            ;
              return _4;
            }
            b ()
            {

              __gcov_indirect_call_profiler_v2 (1092076645, b);
              __gcov_indirect_call_callee = 0B;
              __gcov0.b[0]++
            ;
              __gcov_time_profiler (&__gcov8.b[0]);
              return 1;
            }
            m ()
            {
              int i.j;
             

              __gcov_indirect_call_profiler_v2 (2089875789, m);
              __gcov_indirect_call_callee = 0B;
              __gcov_time_profiler (&__gcov8.m[0]);

              j=0;
             

            L3:
              __gcov0.m[1]++;

              j += a (b);
              i++;
              __gcov0.m[0]++
            ;
              if (i < 1000000)
                goto L3;
              __gcov0.m[2]++
            ;
              printf ("%i\n", i_30);
              __gcov0.m[3]++;
              c ();
              __gcov0.m[4]++;
              return;
            }

            There are quite few gcov0 increments which implements the edge profile. In fact more of them than one would expect. Most of the time, gcov0.a[0] will be the same as gcov0.a[1]. This is because GCC has to assume that each of the function calls whose target it does not see (by interposition this holds even for calls of a and b) may terminate program, invoke longjmp or throw an exception or be otherwise evil. Therefore after every function call that is not const/pure & nothrow a new counter is necessary.


            What is important for us is however call of __gcov_indirect_call_profiler_v2. This call is added into beginning of every function that may be called indirectly. The counter takes two paremeters. FIrst is the profile-id. Second is the address of current function.

            Additionally, at the indirect call site, in b(), two global variables are set:
              __gcov_indirect_call_counters = &__gcov5.a[0];
              __gcov_indirect_call_callee = b_2(D);
            Those are thread local global variables used by the indirect call profiler (version 2, introduced in GCC 4.9). This counter is implemented in libgcov as follows:

            void
            __gcov_indirect_call_profiler_v2 (gcov_type value, void* cur_func)
            {
               if (cur_func == __gcov_indirect_call_callee)
                __gcov_one_value_profiler (__gcov_indirect_call_counters, value);
            }

            The indirect call sequence then works as follows:
            1. Caller sets __gcov_indirect_call_counters and   __gcov_indirect_call_callee
            2. Called functions calls  __gcov_indirect_call_profiler_v2
            3. __gcov_indirect_call_profiler_v2 verifies that __gcov_indirect_call_callee corresponds to the real callee and if so it records the profile_id into counter selected by caller

              The verification is necessary to avoid confusion with another function in the middle that is not instrumented for profiling.

              While writting this blog, I noticed that in this case we ought to increase counters[2].  This will prevent us from devirtualizing function calls that mostly lands in a uninstrumented function. I will fix that in next release.

            Getting stable profile-id

            One of problems with profile-ids is that some symbol names may actually be random. In C++ ABI there exists public symbols that needs to be unique across whole program to be, for example, collected by linker into the static initialization lists. They are also used for anonymous namespaces and some other corner cases. Sadly, there is no safe way to invent global symbol that can not clash with anything else. GCC uses combination of first global symbol in the unit if it is present and if not a random number to get something like: _GLOBAL__I_65535_0_nsStaticXULComponents.o (ask c++filt what that means)
            Currently I am ignoring the problem for profile-id calculation assuming that these functions are not called indirectly.

            Profiling multiple targets

            One optimization GCC can't do is to speculate one call into multiple targets. According to Xinliang David Li (who has a lot of experience with profile feedback at Google), in some programs this makes an important difference as one polymorphic call typically have very few possible targets. This will need extension into one value profiler to support multiple values, something Google promised to submit for next GCC release.

            Reading and using profile feedback

            To actually use the profile, GCC goes through several steps.
            1. At compile file, profile stored in <objectfile>.gcda is read in. One can check the contents in the dump -fdump-ipa-profile:
              Indirect call value:1092076645 match:1000000 all:1000000.
              Indirect call -> direct call b_2(D)=> b transformation on insn postponned to ipa-profile: _4 = b_2(D) ();
              This mean that the indirect call was annotated with a value histogram to be used later.

              Alternative way to explore gcda files is to use gcov-dump tool.
            2. Either at a compile time or at a link time (with LTO) the ipa-profile pass is responsible for introducing the speculative call. This can be checked with -fdump-ipa-profile_estimate:
              Histogram:
                1000001: time:2 (5.56) size:2 (11.76)
                1000000: time:34 (100.00) size:11 (76.47)
                1: time:14 (100.00) size:4 (100.00)
              Overall time: 36000016
              GCOV min count: 1000000 Time:100.00% Size:76.47%
              Determined min count: 1000000 Time:100.00% Size:76.47%
              Indirect call -> direct call from other module a/12 => b/11, prob 1.00
              Introduced new external node (b.localalias.0/15).
              Indirect call -> speculative call a/12 => b.localalias.0/15

              The first part of dump is about determining where the program spends 99% of te execution time and deciding that every statement with less than 1000000 executions is not interesting (only the internal loop matters).

              The second part is about turning the indirect call into the speculative call. Here a new asymbol b.localalias is introduced. This is a static symbol that can be used to refer to body b in a way that can not be altered by ELF dynamic linking. If user decides to interpose b later, the speculative sequence will not match.
            3. Inliner inlines the call (as seen in -fdump-ipa-inline):
              Considering b/11 with 3 size
               to be inlined into a/12 in t.c:10
               Estimated badness is -1073741828, frequency 1.00.
               Called 1000000x
               Inlined into a which now has time 2 and size 7,net change of -4.
              Sadly the inliner won't inline call from m to a, because it knows that a may be interposed by different implementation. In 4.10 I plan to introduce speculative calls in this case, too.
            4. After all inline decisions are done, inliner removes speculative calls that seems not profitable.
            5. Local optimizations are performed. Those hopefully benefits from the inlined code sequences
            The resulting code looks as follows:
            int
            b(void)
            {
              return 1;..
            }

            static int b.localalias __attribute__ ((alias("b")));

            int a(int (*b)(void))
            {

              if (b == b.localalias)
                return 1;
              else
                return b();
            }

            m()
            {
              int i,j=0;
              for (i=0; i<1000000;i++)
               j+=a(b);
              printf ("%i\n",j);
            }
            This is clearly not an optimal result. A lot better would be:
            m()
            {
              int i,j=0;

              if (a == a.localalias && b == b.localalias)
                j = 1000000;
              else
                for (i=0; i<1000000;i++)
                  j+=a(b);
              printf ("%i\n",j);
            }

            This is something that I will implement for 4.10 :).

            I designed the testcase intentionally to play with the interposition to show the limits of current optimizations of -fPIC code. In less crappy cases, you will get pretty nice code. Still avoiding the indirect call and inlining a constant alone is about 30% runtime improvement. However main improvements come when later optimizations can do better work on the speculated sequence. Consider related testcase:
            int
            b(void)
            {
              return 1;
            }

             __attribute__ ((noinline))
            int a(int (*b)(void))
            {
              int i,j=0;
              for (i=0; i<1000000;i++)
               j+=b();
              printf ("%i\n",j);
            }

            main()
            {
               a(b);
            }
            GCC (without -fPIC) will go all the way optimizing a as:
             __attribute__ ((noinline))
            int a(int (*b)(void))
            {
              int i,j=0;
              if (a == b)
                j = 1000000;
              else
                for (i=0; i<1000000;i++)
                  j+=b();
              printf ("%i\n",j);
            }
            Here the important optimization is enabled by loop unswitching, which eventually splits the loop into two variants, one with known value of a and one with unknown and subsequently the first variant will be optimized to a constant store. No need to say that a() will now execute much faster.

            That's it; all the magic behind feedback directed devirtualization.

            Comparing feedback directed speculation to static methods on Firefox

            Firefox these days builds with the link time optimization (still needs a small patchset) as well as with the profile feedback. This makes it easy to get some data about effectivity of optimizations described.

            Update: The original results I presented in this section was wrong; my firefox crashed on startup during training.

            First lets look what -fdump-ipa-profile_estimate tells us while linking libxul.so:

            Histogram:
              39481368: time:3 (0.40) size:2 (0.00)
              22085009: time:2 (0.55) size:2 (0.00)
              22004637: time:10 (1.30) size:10 (0.00)
              22004631: time:10 (2.04) size:10 (0.00)
              21705795: time:3 (2.27) size:3 (0.00)
            ...
              114: time:9482 (99.89) size:4365 (7.45)
              113: time:2649 (99.89) size:1278 (7.46)
              112: time:5479 (99.89) size:2658 (7.50)
              111: time:2105 (99.89) size:993 (7.51)
              110: time:6411 (99.90) size:2819 (7.54)
              109: time:2463 (99.90) size:1279 (7.56)
              108: time:3762 (99.90) size:1893 (7.58)
              107: time:2062 (99.90) size:1017 (7.59)
              106: time:2868 (99.90) size:1364 (7.61)
              105: time:1161 (99.90) size:610 (7.62)
              104: time:5711 (99.90) size:2896 (7.65)
              103: time:1751 (99.90) size:825 (7.66)

            ...
              3: time:111016 (100.00) size:50936 (13.92)
              2: time:108097 (100.00) size:51390 (14.54)
              1: time:236334 (100.00) size:106379 (15.85)
              0: time:14095802 (100.00) size:6873694 (100.00)
            Overall time: 29481332351
            GCOV min count: 112 Time:99.89% Size:7.50%
            Determined min count: 104 Time:99.90% Size:7.65%
            In the first part GCC uses the knowledge of the whole program to determine where it spends 99.9% of its time. The training run used by firefox consume 29481332351 time units in GCC setimate.  The distribution of histogram suggest that statements executed over 104 times should be considered hot (in a way so hot code consume 99.9 execution time) and it is 7.66% of overall firefox binary size.

            It actually seem that the profiling coverage of firefox is quite low - only 15% of code is executed at all compared to about 48% executed during GCC's training.
            It is however quite common case that FDO pays back even with quite crappy training. Firefox is a difficult program to train due to its interactive nature. Life would be a lot easier, if all CLI was still ruling the world.

            In the following part of dump GCC complains that some of object files were not build with -fprofile-generate, that seems like a bug in the build machinery that forgets to pass profiling flags to some files (I even saw some files in firefox that are built without optimization on):
            Node getDynamicClassID/13759998 has no profile-id (profile feedback missing?)
            Node getStaticClassID/13759997 has no profile-id (profile feedback missing?)

            ...
            Node GetIsDebugBuild/45006 has no profile-id (profile feedback missing?)
            Node Release/45001 has no profile-id (profile feedback missing?)
            Node AddRef/45000 has no profile-id (profile feedback missing?)
             Next the actual conversions to speculative calls happen:
            Indirect call -> direct call from other module tryStatement/13609244 => bindLet/13609141, prob 1.00
            Indirect call -> speculative call tryStatement/13609244 => bindLet/13609141
            Indirect call -> direct call from other module variables/13609249 => bindVarOrConst/13609295, prob 0.60

            ....
            Indirect call -> speculative call GetBaseFile/11663 => AddRef/91845
            Indirect call -> direct call from other module __base_dtor /11645 => Release/91848, prob 1.00
            Indirect call -> speculative call __base_dtor /11645 => Release/91848
            29162 indirect calls trained.
            17227 (59.07%) have common target.
            0 (0.00%) targets was not found.
            1632 (5.60%) speculations seems useless.
            15595 (53.48%) speculations produced.
            Last week of checkout of firefox contains over 100000 polymorphic calls (not counting usual indirect calls), so about 29% of indirect calls are executed and we end up speculating 53% of those.  Not bad. It is interesting to observe that the checkout of firefox I analyzed in February had only about 60000 polymorphic calls. A 60% increase in 2 months, Firefox developers are busy!

            Next we can cross check with ipa-devirt data I reported last time, in -fdump-ipa-devirt and we get:
            109280 polymorphic calls, 0 devirtualized, 3184 speculatively devirtualized, 85942 cold
            13975 have multiple targets, 0 overwritable, 11247 already speculated (11247 agree, 0 disagree), 0 external, 0 not defined, 0 artificial

            OK compared to results from last time, the speculative devirtualization pass has saved some work (and code size) by skipping 85942 polymorphic calls that are cold by the profile. Out of the trained calls, it is able to determine the common destination in 11247 calls out of 15595 (72%) calls FDO speculation can handle. It also introduce 3184 speculative devitirtualizations that FDO did not handle: those are ones in the unprofiled code. Good news is that all speculations made by the static analysis actually agree with the runtime data!

            As I explained last time, the speculative calls not always survive to the final binary. GCC inliner has ability to undo the speculation when it seems useless. Process of this can be checked in -fdump-ipa-inline dump:
            Deciding on functions to be inlined into all callers and removing useless speculations:
            Removing speculative call operator!=/13963171 => _ZNK6icu_5217RuleBasedTimeZoneeqERKNS_8TimeZoneE.localalias.164/13766021
            Removing speculative call resolveConstructor/13963002 => GenericCreateConstructor/13369305
            Removing speculative call defaultValue/13961679 => XPC_WN_Shared_Convert/5769960
            Removing speculative call defaultValue/13961283 => XPC_WN_Shared_Convert/5769960
            Removing speculative call defaultValue/13961260 => XPC_WN_Shared_Convert/5769960

            ...
            Speculation is typically removed when called function is too large to be inlined or the caller was inlined and the offline copy became no longer interesting from performance point of view. This removal happens 9916 times. Mind that many of these speculative calls was in meantime duplicated as a result of function cloning or inlining. so this number can not easily compare with the number above. In fact Inliner informs you in a brief summary:
            Unit growth for small function inlining: 7232256->9220597 (27%)
            Inlined 183353 calls, eliminated 54652 functions
            It somewhat surprises me that 7% of firefox's code base that is considered hot bloats the code size by 27% after inlining; I will double check why it happens. Seems that the "hot" part of firefox managed to grow 4 times!

            I hacked inliner dump files to output statistic about individual calls weighted by their execution counts. In the charts bellow the percentages actually mean a probability that a call from unoptimized libxul execution (or more precisely libxul optimized only by early optimizations that are performed per-compilation unit only and include limited inlining, constant propagation and devirtualization) will be handled in a given way in the optimized binary.



            It means that 68.1% of all calls will disappear by (late feedback directed) inlining. 27.1% calls will not be inlined for whatever reason. The small remaining sections actually describe indirect calls that accunts only less than 5% of all calls executed during the train run. (Therefore Firefox's train run is probably not a good benchmarks to demonstrate benefits of inlining and devirtualization.)

            To make effects more visible, here is a graph omitting the direct calls:



            32% of all calls can be inlined via static analysis. 10% are handled only speculatively by algorithms described last time. Profile feedback enables inlining of 21.9% extra calls. About 50% of calls remains uninlined. This is not bad result - both for feedback directed devirtualization and the static speculative devirtualization.

            In the last graph I collected some data on why GCC decided to not inline a given call. Seems that actually majority of calls remaining are external, into other libraries.



            UPDATE: these numbers still needs revisiting.

             How many speculations then survive into final code? This is easy to check by in -fdump-tree-optimized. Every speculatively inlined function sequence starts with the following code:
              PROF_18 = this_3(D)->5;
              if (PROF_18 == __deleting_dtor )
                goto <bb 5>;
              else
                goto <bb 10>;
             "grep "PROF.* == " *optimized* | wc -l" tells that there are 5519 such functions.

            Further I hacked GCC to actually annotate the dumps with execution counts. This gives that 2038439 indirect calls executed in train run got inlined. 74189 polymorphic calls survived into optimized binary and are executed 316492 times. This means that GCC speculatively inlined 7% of all indirect calls (statically) that got rid of 86% indirect call executions in Firefox's the train run.

            Disabling the FDO devirtualization, the static devirtualization predicts correctly 3622 calls that are executed 857979 times.

            I shall note that these numbers are not 100% reliable, since they depends on GCC profile updating (profile is read early and then maintained thorough the optimizations), but they ought not be too far from reality. It may be interesting to confirm these results with valgrind and/or performance counters to get data from real execution.

            GCC LTO results

            While GCC does not really use any polymorphic calls, it is easy enough to profile so I did similar graphs as for firefox:






            And runtime improvements?

            All I was able to try this week was dromaeo benchmark comparing default build, LTO build and LTO build with FDO. The 7-8% improvement for LTO is actually very good given that dromeao is dominated by JIT generated code and thus it is a poor compiler benchmark.

            Update 1: 14% speedup for FDO and LTO together is acutally good, too. My initial data looked bad because the train run crashed and majority of Firefox was size optimized. I will get -Os data, too.

            Update 2:  I also run Dromaeo benchmarks comparing FDO+LTO build with link-time devirtualization disabled with the usual FDO+LTO build.  The results can be seen here. It seems that the devirtualization techniques described here accounts for about 2.7% speedup. This seems pretty good given how few indirect calls are really executed

            I am also looking for other good candidates for benchmarking, if you know of applications that are benchmarkable, buildable with FDO and needs devirtualization, I would like to know about it!

            I will definitely write a followup if I find some interesting benchmarks.

            Next time I plan to write a bit about experiments giving a feedback to users to improve devirtualization in their programs by annotating the source files.