Sunday, April 12, 2015

Link time and inter-procedural optimization improvements in GCC 5

GCC 5.1 release candidate 1 has just branched. Let's take a look at what has changed in the inter-procedural optimization (IPA) and link-time optimization (LTO) frameworks.

Link time optimization (LTO) in GCC 4.5 - GCC 4.9

Last year I wrote a quite detailed post on the history of LTO in GCC. Here is a quick summary of where we stand now.

Reliability and scalability

Link-time optimization was merged into GCC 4.5. At that time it was able to compile "simple" standard-conforming C/C++/Fortran programs (for example, the SPEC2006 benchmarks), but failed on many real-world applications due to both correctness and scalability problems.

GCC 4.6 added support for parallel link-time optimization (WHOPR), designed to deal with large applications. It took a couple of months to get GCC 4.6 to build applications as large as Firefox (with several hacks and workarounds). Even with WHOPR working as designed, compilation remained slow for large projects, with the main bottleneck being the time needed to stream in all types and declarations.

GCC 4.7 to 4.9 gradually improved, and today GCC is able to build a good portion of the GNU/Linux operating system with LTO enabled, including the Linux kernel, Firefox, Chromium and LibreOffice. GCC 4.7 was barely useful for building such large applications, while GCC 4.8/4.9 got build times and memory use under control. According to Andi Kleen, the Linux kernel build is now only about 2%-15% slower (the smaller number is for a usual configuration, the larger for allyesconfig). This is down from 108% with GCC 4.8.

Thanks to the WHOPR model, GCC at link time now typically requires less memory and wall time than LLVM, and even without the parallelism it typically consumes "just" about 2 or 3 times as much memory as the linker. There is still a lot to improve. For example, LTO is reported not to be scalable enough for Google's internal applications.

Google is now working towards similar goals on the LLVM side. ThinLTO is to be discussed at the next LLVM developer meeting and I am very curious about that presentation. Google drove the early stage of the WHOPR implementation (it has been almost completely rewritten since then) as well as the separate LIPO framework, and I am curious what comes next.

Code quality

LTO remains more expensive than the per-file model, especially in the compile-edit cycle. Without significant benefits there is not much motivation for developers to adopt it, which leads to a chicken-and-egg problem, because LTO will not improve without being used and tested.

See, for example, Linus' no to the kernel LTO changes last year. Linus' opinion did not make me happy, but it was somewhat justified. Andi Kleen's work on LTO for the kernel took 3 years and he greatly helped to get LTO from early prototypes into the shape it is in today. I am happy that most of the LTO changes have found their way into the kernel since then.

While the majority of the traditional inter-procedural optimizers in GCC were prepared to operate on the whole program, dropping a whole program into the optimizer is not necessarily a win. It is easy for heuristics to get out of hand and produce huge spaghetti that does not work any better and is a pain to debug.

The code quality benefits enabled by LTO require careful retuning of the whole optimization queue. My posts on Firefox and LibreOffice show the code quality improvements delivered by LTO and also the progress between GCC 4.8 and 4.9.

To briefly summarize my impressions: GCC's LTO is a quite good code size optimization: a 17% code size reduction on LibreOffice (and 36% if profile feedback is enabled) is not bad at all. The performance implications are harder to understand, though. The gains depend on how well a given application was hand-tuned for the per-file compilation model (technically, given enough effort, most link-time optimizations can be done by developers by carefully factoring the code into translation units and using inline code in headers). A 3-5% improvement on a SPEC benchmark may seem small to a non-compiler developer, but it actually represents an important code quality change. Adopting LTO thus means a change in development practices and focus. Manual optimization effort is good for small hot spots of the program, while LTO is very useful for improving quality over the whole binary. Hopefully, designing programs for LTO will in the longer run give developers more time to concentrate on the bigger picture rather than on micro-optimizations.
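
To illustrate the hand-tuning alternative mentioned above, here is a minimal sketch (the file names and identifiers are made up): keeping a small hot function inline in a header gives the per-file compiler the same inlining opportunity that LTO would otherwise have to discover across translation units.

```cpp
// point.h (hypothetical): with the accessor inline in the header, a
// per-file compiler can inline it into every caller; with LTO the same
// inlining happens even if the body lives in its own point.cpp.
struct point {
    int x, y;
    int norm1() const { return (x < 0 ? -x : x) + (y < 0 ? -y : y); }
};

// caller.cpp (hypothetical): after inlining, the abstraction collapses
// into a couple of register operations.
int distance_from_origin(point p) { return p.norm1(); }
```

The difference is purely organizational: LTO removes the need to expose the body in the header.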

Changes in GCC 5

GCC 5 is definitely an exciting release from the IPA and LTO point of view. With more than 1400 patches to the link-time and inter-procedural optimization framework, it fixes several long-standing problems, improves scalability and adds new features.

Let's take a look at the passes in the order they are executed. In green I also add some statistics about individual changes where I know them.

New Auto-FDO support

Feedback Directed Optimization (FDO) is a compilation model where the program is first compiled with profiling instrumentation and run on a sample of typical input (the train run). In the second step the binary is recompiled, enabling the compiler to use the profile gathered during the train run (to guide optimizations such as inlining, loop vectorization/unrolling and others). While FDO has been implemented in GCC for over a decade, it had a long way to go from a benchmarking toy to a practical tool. It found its use in places where performance matters (such as Firefox, GCC itself, databases, and interpreters; Google is known to depend greatly on FDO internally, using a custom LIPO translation model and contributing many improvements to GCC's FDO infrastructure over the years).
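
The two-step cycle can be sketched as follows; the flags are real GCC options, but the file name and the function are made-up examples of the kind of decision a profile drives.

```cpp
// Hypothetical FDO build cycle:
//   g++ -O2 -fprofile-generate clamp.cpp -o clamp
//   ./clamp < train-input            # train run, writes clamp.gcda
//   g++ -O2 -fprofile-use clamp.cpp -o clamp
//
// If the train run shows the branch below is almost never taken, GCC can
// lay out the cold path out of line - hand-annotated here with
// __builtin_expect purely for illustration:
int clamp_nonnegative(int x) {
    if (__builtin_expect(x < 0, 0))  // what -fprofile-use infers automatically
        return 0;
    return x;
}
```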

FDO is great for code size and performance. It is, however, held back because it is difficult to use. One of the main issues is that profile collection needs to be part of the build process (any source code, configuration or compiler change invalidates the profile). This makes it hard to deploy FDO in distribution builds. For example, Firefox needs a running X server to train (and will silently produce a slow binary if you build it without X available). Profile collection is more difficult in a cross-compilation environment, and running the profiling runtime on embedded systems may be impossible. The instrumented application also runs significantly slower, which may affect its behaviour and lead to a misguided profile.

These problems are addressed by AutoFDO, which uses low-overhead profiling on an uninstrumented program to collect the profile. The resulting profile has a rather simple format tied to source code line numbers and is thus not as precise as in the case of classical instrumentation. It is faster to collect and is more robust to source code and environment changes, which makes it possible to ship it with the application's source code and reuse it for future builds in individual distros.

This infrastructure was developed by Dehao Chen at Google and existed in Google's GCC branch for a while (used, for example, for Chromium builds). Profiling is currently limited to Intel chips (because it uses a profiling feature tracing branch history), but the profile can be applied to non-x86 builds too. I am happy it finally found its way into mainline GCC this release cycle. (It was also ported to LLVM 3.5, but LLVM still has some way to go to get a full FDO infrastructure in place.)

The classical FDO instrumentation measures more than basic block and edge counts. It also looks for expected values passed to certain instructions (such as divisions), collects information about execution order, and more. I think there is a lot of room for future improvements, possibly by combining both infrastructures. Clearly, the instrumented profile can also easily be lowered into the AutoFDO format and shipped with a binary.
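
As a concrete example of value profiling on divisions, the transformation it enables can be hand-written like this (a sketch with invented names; the real pass inserts the guard itself):

```cpp
#include <cstdint>

// If the train run shows that `d` is 8 in nearly all executions, the
// value profile lets GCC guard a cheap specialization and keep the
// general (now cold) division as a fallback.
uint64_t divide(uint64_t n, uint64_t d) {
    if (d == 8)          // guard derived from the value profile
        return n >> 3;   // shift instead of a hardware division
    return n / d;        // cold general case
}
```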

Google reports a 4.7% performance improvement on SPEC2000 with AutoFDO, compared to a 7.3% improvement given by classical FDO, both on an Intel CPU with LTO enabled.

Classic FDO

The FDO implementation in GCC also continues to evolve. GCC 5 has several fixes and improvements to profile quality, for example when profiling extern inline functions. With LTO, the profiles are now almost right for C++ inlines. Rong Xu (Google) also contributed the new gcov-tool utility that lets you play with the profiles. Several patches by Teresa Johnson (Google) and others also improve the code updating the profile after transformations (especially after jump threading).

One measurable improvement is the increase of successfully speculated indirect calls in Firefox from 22138 to 32101, that is, by 45%. I am a bit surprised by this. Partly it can be attributed to profile improvements and partly perhaps to the more aggressive early inlining discussed below.

New bounds checking infrastructure via Intel's MPX extension

The new bounds checking infrastructure enables GCC to use Intel's MPX extension. While there is no hardware supporting this extension available yet, it will hopefully soon serve as an interesting alternative to the address sanitizer implementation.

This pass was contributed by Ilya Enkovich (Intel) and implied some rather interesting changes across the inter-procedural optimization framework (which gave me a bit of a headache), adding support for functions that have two different calling conventions with one entry point. This will be useful elsewhere too; in particular, it will let us finally solve a long-standing LTO bug about false/missed warnings with FORTIFY_SOURCE.

Early optimization

Early optimization is a set of mostly intra-procedural optimizers that are still run at compile time even during LTO optimization. These optimizations include:
  1. early inlining, 
  2. pass to determine memory location sizes, 
  3. constant propagation, 
  4. scalar replacement of aggregates, 
  5. alias analysis, 
  6. global value numbering driven full redundancy elimination,
  7. dead code elimination with control dependency, 
  8. inter-procedural scalar replacement of aggregates,
  9. tail recursion optimization (turning tail recursion to loop in some cases),
  10. profile estimation,
  11. discovery of nothrow, pure and const functions, and
  12. unreachable function and variable removal.
These are mostly intra-procedural optimizations, and changes in them are out of the scope of my post (it is way too long anyway). Their purpose is to perform the "obvious" transformations (in the sense that they improve both size and runtime) and hammer down the abstraction penalty, which increases the chance that the later heuristics will not get lost and will do the right thing.
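
A tiny example of the abstraction penalty these passes are meant to hammer down (the names are made up): after early inlining and scalar replacement of aggregates, the `accum` object disappears entirely and the loop operates on a plain integer.

```cpp
// The accumulator abstraction costs nothing once the early optimizers run:
// the add/get calls are early-inlined and SRA replaces the struct by a
// scalar, so later passes see a plain counting loop.
struct accum {
    int total = 0;
    void add(int v) { total += v; }
    int get() const { return total; }
};

int sum_to(int n) {
    accum a;
    for (int i = 1; i <= n; ++i)
        a.add(i);
    return a.get();
}
```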

Noteworthy changes motivated by LTO include more aggressive inlining by the early inliner, stronger devirtualization, and better discovery of pure/const functions.

It is hard to quantify the effect of early optimizations in isolation. GCC's IPA profile pass computes the estimated size and runtime of the program early in the optimization queue. The statistics produced by GCC 5 while linking Firefox show 24062594864 units of estimated time and 7985344 units of estimated size. GCC 4.9 reports 27206581886 time and 8071488 size. So there seems to be about a 1% reduction in the number of instructions in the program representation at this point, but they are significantly faster (by about 12%). This is as expected - with more inlining one should see more, but faster, instructions. Improvements in the optimizations apparently completely masked the bloat caused by inlining.

OpenMP offloading

A new infrastructure for offloading was added, contributed by Jakub Jelinek (Red Hat), Bernd Schmidt and Thomas Schwinge (CodeSourcery), and Andrey Turetskiy, Ilya Verbin and Kirill Yukhin (Intel). This infrastructure makes it possible to stream out selected functions and load them into a separate compiler that produces code for an accelerator. More info is available on the GCC wiki page. The current implementation targets Xeon Phi, but the basic infrastructure is vendor independent.

Whole program visibility (privatization)

This is the first pass run at link time. For the first time the compiler sees the whole program and also gets a resolution file (produced by the linker plugin) specifying the usage of symbols. For each symbol it is now known whether it is used by the LTO code alone or whether it can be bound by the non-LTO portion of the program, either by static or dynamic linking. LTO-only symbols are turned into static symbols, enabling more optimizations to happen. Unreachable function/variable removal is re-run and typically removes a lot of stuff that could not be removed at compile time.

Whole program visibility improved in GCC 5 in several subtle ways, generally too technical to discuss here. Comdat groups are now dismantled, enabling unreachable code removal inside of them; references to global symbols are turned into local aliases where possible, speeding up dynamic linking, etc. One of the not-so-positive consequences is this bug, which shows that an attempt to mix incremental linking with LTO without binutils support fails more miserably now than it did with earlier releases.

The overall size of dynamic relocations was reduced by about 2% in the Firefox binary, but the improved visibility pass has other nice consequences. Other applications may improve more; Firefox already links everything into one large DSO and most of its relocations are IP-relative.

Write-only variables removal

GCC was always able to eliminate write-only local variables. Elimination of unused static variables was added about a decade ago. Elimination of write-only global variables, however, did not really fit the WHOPR structure and I never considered it important. I was proved wrong while looking into the data segment size of Firefox's JavaScript interpreter. A good part of it is a single write-only static variable that is used only if the DEBUG macro is defined. A similar story is also true of LibreOffice. I thus cooked up support for this.

Identical code folding

Identical code folding is a new pass (contributed by Martin Liška, SUSE) looking for functions with the same code and variables with the same constructors. If some are found, one copy is removed and replaced by an alias to the other where possible. This is especially important for C++ code bases that tend to contain duplicate functions as a result of template instantiations.

In its nature this pass is similar to Gold's identical code folding feature. It however runs a lot earlier - just before inlining and cloning. This makes the code reuse explicit to the code-duplicating passes and has a chance to lead to better optimization decisions. GCC is also better than Gold at proving which transformations are safe and which are not. On the other hand, proving that two functions are identical is much harder in the compiler than comparing binary blobs with relocations. Not only do the instructions need to match, but all the additional metadata maintained by the compiler needs to be matched and merged. This includes type-based alias analysis information, polymorphic call contexts, profile, loop dependencies and more. For this reason the pass does not replace Gold's feature. The initial implementation is rather simple (and even with that it was hard to get it working for GCC 5) and will see incremental improvements in the next releases.
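
A typical case the pass catches (a sketch with invented names): two template instantiations whose generated code is identical on a 64-bit target, so one body can be kept and the other symbol turned into an alias to it.

```cpp
#include <cstddef>
#include <vector>

// count_nonnull<int> and count_nonnull<double> differ only in the pointee
// type, which does not change the generated machine code: -fipa-icf can
// fold one instantiation into an alias of the other.
template <typename T>
std::size_t count_nonnull(const std::vector<T *> &v) {
    std::size_t n = 0;
    for (T *p : v)
        if (p)
            ++n;
    return n;
}
```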

ICF may also be annoying while debugging programs, because the backtraces point to seemingly nonsensical functions. There are DWARF extensions intended to handle this and some ideas are discussed in this PR. I however hope that developers will get used to the feature, like they did on some other platforms where it has been enabled by default for a long time. ICF can be controlled by -fipa-icf. More merging can be enabled by -fmerge-all-constants (provided that your program never compares the addresses of two static variables); this is the equivalent of Gold's aggressive --icf=all, but only for variables. An option controlling function merging will be added for GCC 6.
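
The caveat about address comparisons can be made concrete with a small sketch (names invented): the two arrays below are identical, so -fmerge-all-constants may give them a single address, which a program relying on distinct addresses must not do.

```cpp
// Without -fmerge-all-constants these identical arrays get distinct
// addresses; with the flag they may share storage, and `distinct` could
// start returning false. Code comparing such addresses is exactly why
// the flag is not enabled by default.
static const int table_a[3] = {1, 2, 3};
static const int table_b[3] = {1, 2, 3};

bool distinct() { return &table_a[0] != &table_b[0]; }
```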

ICF removes about 29000 functions out of the 282453 functions surviving the early unreachable code removal. This is about 10% of all functions. Sadly, removing 10% of functions does not translate into a 10% code size reduction. Usually only small functions get removed, and Firefox also links itself with the identical code folding optimization in Gold. There still seem to be some issues to be solved. While Gold's ICF reduces the Firefox binary by about 9%, GCC's ICF is less effective. GCC's ICF however seems to shine on Chromium, where it saves about 10% of code size in addition to Gold's ICF.


Devirtualization

The devirtualization pass is a special-purpose pass intended to turn indirect calls to virtual functions into direct calls. It consists of five main components:
  1. the type inheritance graph builder,
  2. an analyser of polymorphic call contexts that is able to track the real type of an instance,
  3. a new propagation engine enabling easy forward propagation of the polymorphic call contexts across the program,
  4. a type inheritance graph walker that, given the context, constructs a list of possible targets of a polymorphic call, and
  5. the ipa-devirt pass that devirtualizes speculatively when there is only one reasonable candidate on the list of possible call targets.
The details can be found in my series on devirtualization. Being a new pass, it received a lot of attention. Here are the main changes:

1) Stronger polymorphic context analysis: every polymorphic call has a type (the pointed-to type of the class pointer used to call the virtual method) and a token (an index into the virtual table). The purpose of the polymorphic context analysis is to annotate the call with additional information about the type of the instance the polymorphic call refers to.

GCC 4.9 basically looked up whether a particular instance is within a static or automatic variable and, if so, performed some further analysis of whether the type is possibly in construction and assigned types accordingly. The common case of allocation via new was cowardly ignored.

GCC 5 can now determine the type of an instance by looking up the constructor call. For example, it tracks the type in the following innocent testcase that was out of reach of earlier analysers:
int foo(void)
{
  struct A *a = new A;

  // This call will be devirtualized;
  // older GCC will devirtualize it only if the ctor of A is inlined.
  a->virtual_method ();

  external_call (a);

  // This call will be devirtualized only speculatively;
  // full devirtualization is not possible,
  // because external_call may use placement new on a.
  return a->virtual_method ();
}

The effect of this change can easily be checked by looking into the -fdump-ipa-devirt logs when compiling larger C++ programs. The dump contains a summary.

Summary building Firefox with GCC 4.9:

115554 polymorphic calls, 36028 speculatively devirtualized, 2980 cold, 75887 have multiple targets
... and the same summary with GCC 5:
112733 polymorphic calls, 42131 speculatively devirtualized, 2951 cold, 66637 have multiple targets

14% fewer calls remain without a speculation after the ipa-devirt pass. The number of speculative devirtualizations also improved by 16%. By removing now-useless polymorphic call targets, the number of functions drops from 221880 to 198467, by 11%.

Out of the 112k calls, 24k are calls on an instance passed as a function argument; those are further analyzed during constant propagation and inlining via the new propagation engine.

2) Type inheritance graph with non-polymorphic types: identifying types by the one definition rule in the link-time optimizer is a rather difficult job, because it is difficult to say what the name of a type is (especially in the context of templates, where template parameters are also part of the name). Jason suggested an alternative solution where the C++ mangler is used to store mangled names of all non-anonymous types and the link-time optimizer simply unifies them by string equality. This works great, even though it increases the amount of streamed data by about 2-3%, and it is now used by default by GCC (controlled by the -fodr-type-merging flag). Currently the infrastructure is used only for diagnostics, but I have a prototype implementing better type-based alias analysis on top of it.

3) Better diagnostics: I also put a lot of work into the new warning about One Definition Rule violations (-Wodr) that has already helped to chase some evil bugs out of Firefox and LibreOffice. Note that the quality of the warnings has significantly improved since my original blog post. This warning is more robust and informative than the --detect-odr-violations implemented by Gold, and safe enough to be enabled by default (you can use -Wno-odr to silence it). A similar option was also discussed for Clang, but as far as I know it has not been implemented yet.

Another new warning helps users annotate programs with the C++11 final keyword (controlled by -Wsuggest-final-types and -Wsuggest-final-methods).

4) Earlier unreachable code removal: devirtualization makes the most sense when the called method gets inlined. For this the compiler needs to hold the bodies of all virtual functions that may potentially be reached by a virtual call, in the hope that some of the calls will be devirtualized to them. Because these virtual functions may refer to other code that could otherwise be optimized out, this may easily turn into an expensive feature.

GCC 4.8 and earlier simply kept every virtual method alive until after inlining to make devirtualization possible. GCC 4.9 introduced the type inheritance graph and the construction of "may" edges in the callgraph - that is, the construction of a full list of possible targets of every polymorphic call - and removed those virtual functions that can not be called. Surprisingly enough, this turned out to be one of the major improvements to LTO scalability for Firefox.

The new GCC pushes this even further, both by stronger analysis (and thus shorter lists of possible targets of a call) and by dropping, just after the inter-procedural devirtualization pass, all methods that are known to never be devirtualized to by the later propagation passes. This is based on the observation that the inliner and the inter-procedural constant propagation pass can devirtualize only those calls that can be tracked to instances passed as function parameters.

This further reduces the amount of code tracked by the later inter-procedural passes by about 10% on Firefox.

Constant propagation pass

The constant propagation pass, as the name suggests, propagates constant values passed in parameters across the program. When it proves that some parameter is always a known constant, it eliminates it.

The implementation in GCC 4.9 is a lot smarter than that. It is also capable of function cloning in case the known constant comes only from a subset of all callers of the function, implements propagation of constants passed in aggregates (which is important for C++ classes and Fortran array descriptors), and does simple inter-procedural devirtualization by forward-propagating the known type of a particular instance.
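
A minimal sketch of the situation the pass exploits (the names are invented): every caller passes the same constant, so the parameter can be removed and the callee specialized; if only some callers passed it, a specialized clone would be created for those callers instead.

```cpp
// Both callers pass factor == 4, so IPA-CP can drop the parameter and
// specialize scale() to `x * 4` (which then strength-reduces to a shift).
static int scale(int x, int factor) { return x * factor; }

int left(int x)  { return scale(x, 4); }
int right(int x) { return scale(x, 4); }
```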

Martin Jambor (SUSE) reorganized the pass to be more modular and to allow propagation of various kinds of information across the program. Now the propagation is split into 3 propagators:
  1. propagation of constants and constants passed in aggregates
  2. propagation of polymorphic call contexts for inter-procedural devirtualization (replacing the old instance type propagation that was never that useful)
  3. propagation of known alignments of pointers to improve vectorization.
For the next release we hope to add propagation of value ranges and other tricks.
While compiling Firefox, the constant propagation pass in GCC 4.9 modifies 18412 function bodies (4917 requiring cloning), while GCC 5.0 modifies 50761 (6267 requiring cloning). This is 175% more. The number of devirtualized calls increased from 428 to 1204, a 182% improvement. GCC 5.0 also does an extra 941 devirtualizations as an effect of constant propagation.

The number of function parameters turned into constants and removed, however, dropped from 18412 to 16731; something I do not quite understand and need to look into.


Inliner

The inliner is by far the most important inter-procedural optimization pass of every modern compiler. This is because most optimization still happens intra-procedurally, and inlining is essential to make it trigger. Modern programs use a lot of function-call-based, wannabe zero-cost abstractions that constantly bring new challenges to inliner heuristic implementations.

The main problem of the inliner is that while its decisions have a huge impact, it is not really well informed about what it is doing. It is hard to predict the optimizations that arise from particular inlining decisions. Most of the time the inliner thinks it is doing a lot of code duplication just to eliminate the function call overhead, which does not match reality. Most inliners, including GCC's, are able to predict simple optimizations (such as unreachable code removal caused by constant propagation across function arguments), but they can't predict more complex patterns where, for example, all calls of methods on a particular instance must be inlined to allow the instance itself to be completely optimized out.

Since 2003 the main inliner in GCC has been organized as a simple greedy algorithm trying to do as many useful inlines as possible within the bounds given by several --param values. Those limit the size of inlined functions, the overall unit growth and the growth of large functions. (This differs from LLVM's inliner, which works exclusively in top-down order interleaved with other local optimizations, similarly to GCC's early inliner. A greedy inliner for LLVM is being discussed here, but it is more similar to the inliner in Open64.)
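
The greedy scheme can be sketched as follows. This is a toy model, not GCC's actual implementation (the real heuristics are far more involved): call edges are ordered by a badness metric (estimated size growth per estimated benefit) and inlined while the unit stays within a growth limit, in the spirit of --param inline-unit-growth.

```cpp
#include <queue>
#include <vector>

struct call_edge {
    int id;
    int size_growth;   // estimated code size growth of inlining this edge
    int benefit;       // estimated speedup from inlining this edge
};

struct worse_badness {
    bool operator()(const call_edge &a, const call_edge &b) const {
        // Order by growth/benefit without dividing: a is "worse" (lower
        // priority) when its growth-per-benefit ratio is higher.
        return a.size_growth * b.benefit > b.size_growth * a.benefit;
    }
};

// Returns the ids of the edges that got inlined, best badness first.
std::vector<int> greedy_inline(const std::vector<call_edge> &edges,
                               int unit_size, int max_growth_percent) {
    std::priority_queue<call_edge, std::vector<call_edge>, worse_badness>
        queue(edges.begin(), edges.end());
    const int limit = unit_size * (100 + max_growth_percent) / 100;
    std::vector<int> inlined;
    while (!queue.empty()) {
        call_edge e = queue.top();
        queue.pop();
        if (unit_size + e.size_growth > limit)
            continue;  // would exceed the allowed unit growth, skip it
        unit_size += e.size_growth;
        inlined.push_back(e.id);
    }
    return inlined;
}
```

With a 20% growth budget on a unit of size 100, the cheap high-benefit edges win and an expensive low-benefit one is rejected once the budget is spent.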

In the GCC 5 timeframe I put a lot of work into tuning the inliner for LTO builds (without ruining the performance of non-LTO builds), which has proved to be quite a hard job. I will report on this separately - while there are still several issues (and probably always will be), the size of LTO-optimized programs has reduced nicely.

Nice work was also done by Trevor Saunders (Mozilla) and Martin Liška (SUSE) on converting the inline metrics from the original fixed-point arithmetic to real numbers. This makes it a lot easier to work on them and chase away strange roundoff/capping artefacts. This is one of the good achievements of the C++ conversion - while I had considered it for a while, it seemed ugly to implement. GCC can not use the native float and double types, because these are not consistent across different hosts. Using hand-written reals in C is ugly, and moreover the Fibonacci heap was written to work only on integers. Implementing a C++ class for simple reals and turning the heap into a template did the trick. In the next release cycle I plan to follow up on this and turn more of the calculations into reals.

The code size of the Firefox binary built with -O3 -flto was reduced from 5MB (GCC 4.9) to 4.6MB (GCC 5). With FDO the reduction is from 4MB to 3.5MB. Firefox built with -O3 -flto by LLVM has 5.3MB of code segment, and with FDO 5.4MB (LLVM's inliner does not really know about profile feedback yet). I am still chasing some benchmark regressions compared to 4.9, and thus these numbers may or may not change in the final GCC 5 release.


IPA reference

IPA reference is a simple pass that collects, for each function, the list of static variables that are known to not be accessed/modified by it. This information is then passed down to the optimizers, which can then better eliminate loads and stores.

The problem with ipa-reference is that it uses quite simple-minded sparse bitmaps to represent the information. On large programs with many variables and many functions this easily ends up being quadratic. A simple change to prune out early those variables whose address is taken, and which thus can not be analysed in this simple way, saved a lot of memory and compile time on Chromium and the Linux kernel.

I really hope this pass will be replaced by a smarter, proper mod-ref pass in the foreseeable future.

Comdat localization pass

Inline functions in C++ headers end up being compiled in every translation unit that uses them. To avoid duplication in the final binary, they are stored in special sections (COMDAT groups) that are later merged by the linker. An artifact of this process appears when such a function calls another function outside of its COMDAT group and is the only user of it. Such functions will not be merged. The new pass simply adds such functions into the respective COMDAT groups.
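
A sketch of the situation being fixed (the identifiers are invented): `helper` is outside the COMDAT group of `api` but is only used by it; the linker keeps a single copy of `api`, yet every object file previously still carried its own copy of `helper`. The new pass pulls `helper` into api's group so it is deduplicated too.

```cpp
// Local helper: emitted outside the COMDAT group, but only reachable
// through api() below.
static int helper(int x) { return x * x; }

// Emitted as a COMDAT in every translation unit that includes this header;
// the linker keeps one copy of api, and the pass lets helper follow it.
inline int api(int x) {
    return helper(x) + 1;
}
```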

This reduces the non-LTO Firefox binary by 1% and LibreOffice by 0.5%. It also has some potential for future improvements by also tracking variables and constants used purely by the function. Those are very common, but I noticed the problem too late for GCC 5.

Other improvements

OK, I have completed describing the noteworthy changes in the individual optimization passes. There are however a number of changes in the LTO infrastructure in general. A few of them are worth mentioning:

Streaming command line options

Moving compilation to link time opens a huge can of worms with respect to command line options. These options control optimization, but also specify details of the semantics of the program (for example, -ftrapv says that the program should trap on overflow).

So far GCC behaved in such a way that some command line options were applied at compile time, some at link time, and some partly at both. This brings surprises and wrong-code issues. For example, some makefiles do not pass optimization options to the linker and thus the resulting binary is not optimized. Worse yet, some options are not optional, but mandatory for code correctness. Compiling a program with -fno-strict-aliasing and linking without it will make the LTO optimizer still assume that the program is strict-aliasing safe, and will lead to wrong code.
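
The kind of code behind that hazard looks like this (a made-up example): the punned version is only correct if its translation unit's -fno-strict-aliasing survives into the link-time recompilation, while the memcpy version needs no special options at all.

```cpp
#include <cstdint>
#include <cstring>

// Correct only under -fno-strict-aliasing; the option must therefore be
// streamed to link time, or the LTO recompilation may miscompile it.
uint32_t float_bits_punned(float f) {
    return *reinterpret_cast<uint32_t *>(&f);  // aliasing violation
}

// The portable rewrite; GCC compiles the memcpy away.
uint32_t float_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
```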

GCC 5 now addresses the problem by annotating every function with corresponding optimize and target attributes built from the compile-time options. Thus all optimization options are streamed from compile time and honored at link time. This is an important change in behaviour and will likely lead to some surprises (for example, cross-module inlining of a function built with -fno-strict-aliasing into a function built without it no longer happens), but it is also a huge step forward in the maturity and safety of the implementation.

The good news is that even without LTO the quality of the implementation of the target and optimize attributes has improved (because of all the extra testing they receive now) and you can almost expect them to work as expected when used ;)

A similar solution is apparently being discussed for LLVM, too.

Firefox is an example of a program that uses different command line options across different units. Most of Firefox is built with -Os and the JavaScript engine with -O3. Because the link-time compilation is invoked with -O3, the Firefox binary used to be "mostly -O3 optimized".

Firefox built with GCC 4.9 has 7.7MB of code, while GCC 5 gets 5.9MB, about 23% less. An LLVM 3.7 compiled binary has 6.6MB. Sadly, this also makes some Firefox benchmarks regress, because they exercise a lot of size-optimized code. I guess it is something for Firefox developers to judge, possibly revisiting the placement of optimization flags. My understanding is that some of the performance in those benchmarks is intentionally sacrificed to reduce the binary size.

Building the whole of Firefox with -O3 leads to code segments of 8.2MB (GCC 4.9), 7.7MB (GCC 5), and 8.3MB (clang 3.7). GCC sees more similarity between the two builds than clang, probably because most of the inlining with clang happens at compile time, while GCC inlines at compile time only for size and the -O3 inlining happens at link time.

Unfortunately, until GCC's IR is extended to handle most of the relevant options, this may also lead to a degradation of performance, because certain option mismatches prevent inlining. Disabling this feature reduces the Firefox code segment size by about 5%, so the correctness does not come without a cost. Maybe Firefox developers could stop using -fno-strict-aliasing in random portions of the project.

-fno-semantic-interposition flag

Symbol interposition is a feature of the ELF format whereby a symbol defined by one library may be interposed by a symbol defined in a different one. For example, memory debuggers often interpose malloc. While this feature is certainly useful, it is also expensive, because it represents a boundary for compiler optimization.

While comparing some benchmarks to clang, I noticed that clang actually ignores the ELF interposition rules. While this is a bug, I decided to add the -fno-semantic-interposition flag to GCC to get similar behaviour. If interposition is not desirable, ELF's official answer is to use hidden visibility and, if the symbol needs to be exported, define an alias. This is not always a practical thing to do by hand.
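The hidden-visibility-plus-alias idiom can be sketched as follows (the names foo/foo_impl are hypothetical; the attributes are the usual GCC/ELF ones). Calls within the library bind directly to the hidden foo_impl and remain inlinable, while the exported foo stays interposable:

```cpp
#include <cassert>

extern "C" {

// The real implementation: hidden, so it cannot be interposed and the
// compiler may optimize calls to it freely within this library.
__attribute__((visibility("hidden")))
int foo_impl(void) { return 42; }

// The exported, interposable entry point is just an alias, so exporting
// the symbol adds no extra call overhead.
int foo(void) __attribute__((alias("foo_impl")));

}
```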

Reducing memory use and streaming overhead

The LTO streamer is responsible for storing the intermediate language in the LTO object file and is traditionally one of the major bottlenecks of link-time optimization. The streamer itself is implemented quite efficiently, but it is fed a lot of data. Therefore it takes a long time to read everything and a lot of memory to store it.

One of the major improvements in GCC 4.9 was the reorganization of type merging (implemented by Richard Biener, SUSE); it was the change that made GCC 4.9 scalable enough for production Firefox builds.

GCC 5 saw many incremental improvements, mostly targeted at cleaning up the data structures and chasing unnecessary data away. A large reduction in /tmp use and an improvement in the partitionability of the program were achieved by pushing devirtualization earlier in the optimization queue.

In order to make the WHOPR model scalable to a large number of CPUs (or distributed compile nodes) we need to work on this area further. Important changes in this direction are underway on the early debug info branch (actively developed by Aldy Hernandez, Red Hat), which will clearly separate the information used by code generation from debug info and allow the GIMPLE representation of types to be optimized.

GCC 4.9 streams in 22,096,600 trees, while GCC 5.0 streams in 20,330,302: an 8% improvement.
GCC 4.9 needs 1,614,636k of memory to store the trees, while GCC 5.0 needs 1,481,199k, again an 8% improvement.

A much bigger change is seen in streaming out the modules for local optimization. These are the temporary .o files created in the /tmp directory during the final link. GCC 5 streams only about 50% of the types and declarations that GCC 4.9 did.

GCC now (barely) fits in a 4GB memory bound for Firefox and most kernel configurations. Chromium still shows important problems, requiring almost 10GB to build.

... And more

The IPA, symbol table and callgraph APIs got a facelift in GCC 5 by conversion to C++ classes (this hard work was mostly done by Martin Liška, SUSE, David Malcolm, Red Hat, Trevor Saunders, Mozilla, and a bit by myself). This makes the code more compact and also gave us a chance to clean up terminology and fix mistakes and oddities that crept in over the years.

As we became more aware of the problems of large applications, more work also went into making binary sizes smaller at -O2. As one interesting example, data alignment strategies were revisited. The x86-64 backend was especially eager to align everything to the size of the largest vector type (which went up with AVX), causing a lot of bloat. This was fixed by H.J. Lu (Intel), Jakub Jelínek (Red Hat) and Uroš Bizjak. Sadly, some older GCC versions assume that all static variables have this alignment, so it was considered unsafe to always drop it; there is a new -malign-data flag to control it. Data alignment was also completely dropped for virtual tables, because these are not accessed in a sequential manner.

I hope the new GCC will do a great job for your projects, and please report bugs if it does not!

Thursday, September 18, 2014

Devirtualization in C++, part 7 (Enforcing One Definition Rule)

This time I am going to write about a feature that I implemented as a side effect of the devirtualization machinery: the verification of the One Definition Rule on types.

The One Definition Rule (ODR) is one of the more interesting cases where C and C++ differ in a very subtle way. Wikipedia describes it as follows:
In short, the ODR states that:
  1. In any translation unit, a template, type, function, or object can have no more than one definition. Some of these can have any number of declarations. A definition provides an instance.
  2. In the entire program, an object or non-inline function cannot have more than one definition; if an object or function is used, it must have exactly one definition. You can declare an object or function that is never used, in which case you don't have to provide a definition. In no event can there be more than one definition.
  3. Some things, like types, templates, and extern inline functions, can be defined in more than one translation unit. For a given entity, each definition must be the same. Non-extern objects and functions in different translation units are different entities, even if their names and types are the same.
Some violations of the ODR must be diagnosed by the compiler. Other violations, particularly those that span translation units, are not required to be diagnosed.[1]
This means that C++ type names are global and should not be re-used in different ways in different source files. (In C doing so is perfectly valid and common; types across units are considered equivalent if their structure is.)

One of the motivations for the ODR is to make name mangling possible. Type names are used, for example, as program-wide identifiers to distinguish functions and methods of the same name that take different types of arguments (function overloading):
struct A {int a;};
int getval (struct A a) {return a.a;}
struct B {char a;};
int getval (struct B a) { return a.a;}
This compilation unit will result in exporting two function symbols: _Z6getval1A and _Z6getval1B. The A and B come from the names of the argument types.

Now, obviously, if another unit defines a completely different structure A and a completely different getval taking it as an argument, that function will also be called _Z6getval1A and things will fail miserably. General violations of the ODR cannot be diagnosed by a compiler working on a single translation unit in isolation. If one is lucky, the ODR violation turns into a linker error about a duplicated symbol name. In the presence of inline functions, however, this will not happen and one of the two functions will be eliminated from the program. This may result in the invocation of a function on the wrong data, crashing the program in random ways.
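The mangling scheme can be inspected directly with the libstdc++ demangler (a small sketch; the helper function demangle is mine, while abi::__cxa_demangle is the real interface declared in &lt;cxxabi.h&gt;):

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Decode an Itanium C++ ABI mangled name back into a human-readable
// signature; __cxa_demangle returns a malloc()ed buffer we must free.
std::string demangle(const char *mangled) {
  int status = 0;
  char *s = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  std::string result = (status == 0 && s != nullptr) ? s : "?";
  std::free(s);
  return result;
}
```

Demangling _Z6getval1A yields getval(A): the class name A really is part of the program-wide identifier, which is why reusing it for an unrelated type is so dangerous.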

-detect-odr-violations in gold

One ODR checking tool I am aware of is gold's -detect-odr-violations:
gold uses a heuristic to find potential ODR violations: if the same symbol is seen defined in two different input files, and the two symbols have different sizes, then gold looks at the debugging information in the input objects. If the debugging information suggests that the symbols were defined in different source files, gold reports a potential ODR violation. This approach has both false negatives and false positives. However, it is reasonably reliable at detecting problems when linking unoptimized code. It is much easier to find these problems at link time than to debug cases where the wrong symbol is used.

Ian Lance Taylor: New ELF linker, Proceedings of the GCC Developer's Summit 2008
This is a very simplistic analysis, but the linker can hardly do much better; it lacks knowledge of types. Moreover there are some valid cases where the difference can happen, so the warning can produce false positives.
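The heuristic can be modelled roughly like this (a toy sketch only, not gold's actual code; SymbolDef and all other names here are invented for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One definition of a symbol as the linker sees it: the mangled name,
// the size of the definition, and the source file from the debug info.
struct SymbolDef {
  std::string name;
  std::size_t size;
  std::string source_file;
};

// Flag a symbol when two inputs define it with different sizes and the
// debug info places the definitions in different source files.
std::vector<std::string> detect_odr_suspects(const std::vector<SymbolDef> &defs) {
  std::vector<std::string> suspects;
  for (std::size_t i = 0; i < defs.size(); ++i)
    for (std::size_t j = i + 1; j < defs.size(); ++j)
      if (defs[i].name == defs[j].name &&
          defs[i].size != defs[j].size &&
          defs[i].source_file != defs[j].source_file)
        suspects.push_back(defs[i].name);
  return suspects;
}
```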

-Wodr: detecting ODR violations in GCC

As discussed in an earlier post about the construction of the type inheritance graph, establishing type equivalence based on names is an important step in building a cross-compilation-unit type inheritance graph. It is tricky to do for templates, so in GCC 4.8 I used the simple trick of identifying types based on the mangled names of their associated virtual tables. This works only on types with virtual methods, so with the help of our C++ maintainer, Jason Merrill, I extended the mechanism to all class and union types: before streaming out LTO data I simply invoke a C++ front-end langhook to compute the mangled name of a type and save it into the LTO object file. At link time I can use the names to establish the equivalence of all named types. This way the middle-end does not need to know the details of C++ namespaces and templates, with the exception of being able to identify types in anonymous namespaces and disable merging for those.

This streaming is not free; it seems to add about 2% of extra memory use and compile time for the mangled names. One day I plan to use the mechanism to strengthen type-based alias analysis for C++ and also to reduce redundancies in the debug info produced (the second is actually done by LLVM to help fight memory usage explosion).

Once ODR-based type equivalence is established, I can actually compare the types that become equivalent as part of the LTO linking process. Doing so is a simple recursive walk over GCC's type representation (which is sufficiently close to the parse tree of the declarations). This step is not completely trivial, because one needs to get past some differences that are considered valid. In particular, complete and incomplete types of the same name are equivalent, and there are a few cases we want to tolerate, such as the use of attributes, and perhaps differences caused by command line options such as -fsigned-char/-funsigned-char, even though those are ABI breaking and could lead to wrong code.

Making the warning useful

My first attempt resulted in plenty of warnings looking as follows:
/aux/hubicka/firefox4/modules/libjar/zipstruct.h:38:16: warning: type ‘struct ZipCentral_’ violates one definition rule [-Wodr]
 typedef struct ZipCentral_
../../dist/include/zipstruct.h:38:0: note: a different type is defined in another translation unit
 typedef struct ZipCentral_
My best guess at the average Joe's attempt to understand the message is "those types are the same! Stupid GCC!". It is a common case that types declared the same way and in the same header are actually different, because the difference is carried in from earlier includes. Without being able to report why the types are different, the warning would probably be ignored and declared bogus by most developers. For this reason I ended up adding about 200 lines of code that just do a kind of structural diff and try to output useful information about the difference.

In this particular case it winds down to:
/aux/hubicka/firefox4/modules/libjar/zipstruct.h:38:16: warning: type ‘struct ZipCentral_’ violates one definition rule [-Wodr]
 typedef struct ZipCentral_
../../dist/include/zipstruct.h:38:0: note: a different type is defined in another translation unit
 typedef struct ZipCentral_
/aux/hubicka/firefox4/modules/libjar/zipstruct.h:47:25: note: the first difference of corresponding definitions is field ‘MOZ_Z_crc32’
   unsigned char crc32 [4];
../../dist/include/zipstruct.h:47:0: note: a field with different name is defined in another translation unit
   unsigned char crc32 [4];
A fair try, but still not perfect :). Luckily, the problem is now not hard to guess. There are two zlib implementations linked into Firefox and one of them redefines crc32 to MOZ_Z_crc32 (apparently to avoid ODR conflicts identified earlier by gold). I thus decided that such warnings are verbose enough and did not add further information for this particular case. I expect to improve the diagnostics as we get more experience with the individual errors reported.

Limitations of ODR detection

Even my ODR verifier is not a complete ODR nazi - it verifies violations only for types that survive into the actual program and only for types whose name matters (i.e. is part of type mangling). For example, if one unit defines
typedef int mytype;
and the other:
typedef char mytype;
the violation will not be reported. This is because typedefs are ignored for type mangling purposes and thus I have no way to figure out that those two types are intended to be equivalent.

The other problem is, of course, that the detection is limited to one binary. More conflicts can be introduced when linking with a non-LTO library, whether statically, dynamically or at runtime via dlopen.

Detecting virtual table ODR violations

The ODR relates not only to types, but also to functions and methods. It is very hard to compare two method bodies coming from different units, because they may differ in representation (each is compiled with different flags or optimized in a different context) while still having the same semantics. What I can easily do, however, is compare virtual tables. Mismatches in those are especially deadly, because they may result in dispatch to a wrong virtual function.

Here is an example of ODR violation in Firefox:
../../dist/include/mozilla/a11y/DocAccessible.h:40:0: warning: type ‘struct DocAccessible’ violates one definition rule [enabled by default]
 class DocAccessible : public HyperTextAccessibleWrap,
/aux/hubicka/firefox/accessible/src/generic/DocAccessible.h:40:0: note: a type with the same name but different virtual table is defined in another translation unit
 class DocAccessible : public HyperTextAccessibleWrap,
/aux/hubicka/firefox/accessible/src/generic/DocAccessible.cpp:1232:0: note: the first different method is ‘HandleAccEvent’
 DocAccessible::HandleAccEvent(AccEvent* aEvent)
/aux/hubicka/firefox/accessible/src/atk/AccessibleWrap.cpp:956:0: note: a conflicting method is ‘HandleAccEvent’
 AccessibleWrap::HandleAccEvent(AccEvent* aEvent)

How common are ODR violations?

Fairly common :) Out of the three programs (GCC, Firefox and LibreOffice) I tried to build with ODR checking, each had type ODR violations, and one of them (Firefox) had a violation in a virtual table. Today I learnt that the LLVM team has fixed ODR violations based on the new warning, too. What are the most common causes?

Random type name clashes

Each time you create a type with a short name in the global namespace (say hash_t), you risk a clash with other units doing the same for a completely unrelated reason. This is a good motivation to use namespaces and anonymous namespaces (though anonymous namespaces cause problems with GDB debugging, see PR16874). A Firefox example looks as follows:
/aux/hubicka/firefox4/rdf/base/nsInMemoryDataSource.cpp:149:0: warning: type ‘struct Entry’ violates one definition rule [-Wodr]
 struct Entry {
/aux/hubicka/firefox4/gfx/skia/trunk/src/core/SkFlattenable.cpp:66:0: note: a different type is defined in another translation unit
 struct Entry {
/aux/hubicka/firefox4/rdf/base/nsInMemoryDataSource.cpp:150:0: note: the first difference of corresponding definitions is field ‘mHdr’
     PLDHashEntryHdr mHdr;
/aux/hubicka/firefox4/gfx/skia/trunk/src/core/SkFlattenable.cpp:67:0: note: a field with different name is defined in another translation unit
     const char*             fName;
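The usual fix for such clashes is to give file-local helper types internal linkage via an anonymous namespace; a minimal sketch (the type and field names are borrowed from the warning above, everything else is invented):

```cpp
#include <cassert>

// An anonymous namespace gives 'Entry' internal linkage, so this name can
// never collide with an unrelated 'Entry' in another translation unit.
namespace {
struct Entry {
  int mHdr;
};
}  // namespace

int entry_value(const Entry &e) { return e.mHdr; }
```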

Copying code from one unit to another and adjusting the data structures for a new purpose is a common way to introduce name clashes. For example, in GCC we have plenty of structures expr and occr that all originate from gcse.c, which implemented the first dataflow-based global optimizer.

Again, a Firefox example:
/aux/hubicka/firefox4/netwerk/sctp/datachannel/DataChannel.h:57:0: warning: type ‘struct BufferedMsg’ violates one definition rule [-Wodr]
 class BufferedMsg
../../../../dist/include/mozilla/net/DataChannel.h:57:0: note: a different type is defined in another translation unit
/aux/hubicka/firefox4/netwerk/sctp/datachannel/DataChannel.h:64:26: note: the first difference of corresponding definitions is field ‘mSpa’
   struct sctp_sendv_spa *mSpa;
../../../../dist/include/mozilla/net/DataChannel.h:64:0: note: a field of same name but different type is defined in another translation unit
/aux/hubicka/firefox4/netwerk/sctp/datachannel/../src/usrsctp.h:198:8: note: type ‘struct sctp_sendv_spa’ should match type ‘struct sctp_sendv_spa’ but is defined in different namespace  
 struct sctp_sendv_spa {
../../../../dist/include/mozilla/net/DataChannel.h:60:0: note: the incompatible type is defined here
Here the type declaration is unchanged, but one of the types it is built from has changed. This case also shows a namespace clash.

"Upgrading" codebase from C to C++

Because C does not have the ODR, code that was originally written in C and later upgraded to C++ needs an audit for ODR violations. The warning found several such cases in GCC.

Linking multiple versions of given library into the program

This is a particularly common case for large programs. For example:
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav53/../FFmpegDecoderModule.h:18:0: note: a type with different memory representation is defined in another translation unit
 class FFmpegDecoderModule : public PlatformDecoderModule
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav53/include/libavcodec/avcodec.h:985:0: warning: type ‘struct AVFrame’ violates one definition rule [-Wodr]
 typedef struct AVFrame {
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav54/include/libavcodec/avcodec.h:989:0: note: a different type is defined in another translation unit
 typedef struct AVFrame {
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav53/include/libavcodec/avcodec.h:997:0: note: the first difference of corresponding definitions is field ‘data’
     uint8_t *data[AV_NUM_DATA_POINTERS];
is one of many warnings issued during Firefox build.

Conditional compilation

In some cases, which looked particularly dangerous to me, there are fields that are compiled conditionally (such as #ifdef DEBUG) and the condition is not set consistently across the whole program. This is also quite a common case in Firefox. It usually shows itself as a warning at the same position, but complaining about a mismatch of field names.
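A minimal sketch of the hazard (the struct, field and macro names are made up): two units including this header, one compiled with -DDEBUG and one without, see two different layouts under the same type name.

```cpp
#include <cassert>
#include <cstddef>

// A header whose layout depends on a macro: a unit built with -DDEBUG and
// a unit built without it disagree on the fields of 'Item', which is an
// ODR violation that -Wodr reports as a field mismatch at the same position.
struct Item {
#ifdef DEBUG
  int magic;    // present only in units compiled with -DDEBUG
#endif
  int value;
};

std::size_t item_size() { return sizeof(Item); }
```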

Non-LTO and whole program ODR checker

It seems that ODR violations are very common bugs in large C++ programs. The majority of the ODR problems found are "safe" in the sense that they will not result in wrong code: the types in question are never used in overloaded functions and do not have methods attached. There were, however, a few real bugs, and many of those random clashes have the potential to become real bugs later.

Maintainers of the programs usually welcomed the reported issues and fixed them quite promptly. An especially welcome response came from Michael Meeks:
Jan - wow - that is a nice error =) are there any other ODR issues ?
they habitually bite us hard so ... great to get libmerged debugged even
more. CC'ing the list too.
It seems to me it would make sense to implement ODR checking that does not depend on LTO and does not suffer from the limitations of gold, in the following way:
  1. Extend DWARF info by mangled names of types.
  2. Implement ODR based debug info merging in linker and warn on mismatches.

    Merging of ODR types would likely reduce debug info size considerably (judging from GCC and LLVM memory footprint issues).

    The linker also has a chance to examine shared libraries linked into the program, as long as they have the annotated debug info available. This is not possible in my LTO implementation.
  3. Implement a dynamic library that will optionally verify types at dynamic linking time. This way one can report issues caused by dlopen.
I am quite sure more interesting cases would be found. Sadly I do not think I will have time to implement it :(

Tuesday, September 9, 2014

Linktime optimization in GCC, part 3 - LibreOffice

After Firefox, I decided to look into LibreOffice performance with link time optimization (LTO) and profile feedback (FDO). This post has been in the works since April, so my setup is not exactly bleeding edge - a GCC 4.9.1 prerelease, LLVM 3.5 and a LibreOffice checkout from April. Some real world issues (like a wedding) interrupted my work; however, the data presented should still be relevant and interesting.

Update: I fixed the permissions for google drive, so you can see the charts.


In addition to all the dependencies described in Building LibreOffice on Linux: Tips and Tricks, the following is needed to get an LTO build (copied from the Firefox report):

  1. Binutils with plugin enabled. This can be checked by
    $ ld --help | grep plugin
      -plugin PLUGIN              Load named plugin
      -plugin-opt ARG             Send arg to last-loaded plugin
    I personally use gold from GNU Binutils, but any reasonably recent binutils with either GNU ld or gold should do the job. One however needs to enable the plugin while configuring, since it is off by default.
  2. GCC 4.7+ configured with plugin support (i.e. built with the plugin-enabled ld). For sane memory use, GCC 4.9 is recommended; it requires about 5 times less RAM.

    To test if GCC runs with plugin support, try to link simple hello world application and do:
    $ gcc -O2 t.c -flto --verbose 2>&1 | grep collect2.*plugin
    If the output is non-empty, you have the plugin set up correctly.
  3. Clang needs to be built with gold llvm plugin enabled, too. Follow the official instructions.
  4. ar/nm/ranlib wrappers on the path that dispatch to the real ar/nm/ranlib with plugin argument.

    This is a somewhat annoying issue. GCC ships with gcc-ar/gcc-nm/gcc-ranlib, but it is not very easy to convince the build script to use them. Adding to the path a simple script calling the gcc-* equivalent will lead to infinite recursion. I use the following hack:
    $ cat ar
    #! /bin/sh
    command=$1
    shift
    /aux/hubicka/binutils-install/bin/ar $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/ $*

    $ cat nm
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/nm --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/ $*

    $ cat ranlib
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/ranlib --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/ $*
    In the LLVM case I use the same wrappers, just the plugin path is /aux/hubicka/llvm-install/lib/

    Alternatively the plugin can be installed into the default plugin search path (/usr/lib/bfd-plugins). This solution has some limitations, however: one needs not-yet-released binutils to make this work with GCC, and one cannot easily install multiple plugins. I hope this situation will change in the not so distant future.

    Forgetting these steps will give you undefined symbols as soon as you start to link a static library built with -flto and GCC 4.9+, or with -flto -fno-fat-lto-objects and earlier compilers.

GCC 4.9.0 -Os -flto bug

GCC 4.9.0 will segfault when building LibreOffice with -Os -flto. I fixed it by this patch, which got only into GCC 4.9.1.

    LibreOffice configuration

    I used trunk checkout from Apr 28 2014. I had to use --without-doxygen --enable-python=no for builds to pass. I built with --disable-symbols to get builds faster, especially LLVM LTO needs a lot of extra memory and compile time for debug info.

    In some builds I enabled LTO by --enable-lto and merged libts by --enable-mergelibs.

    I had to disable some unit tests because they were failing in some configurations (the writer tests failed in LTO builds only, while the rest failed in non-LTO, too): CppunitTest_cppcanvas_emfplus, CppunitTest_dbaccess_firebird_test, CppunitTest_dbaccess_embeddeddb_performancetest, CppunitTest_services, CppunitTest_writerperfect_impress, CppunitTest_writerperfect_writer.

    Building with profile feedback

    It seems that nobody has tried it before. After some experimentation I got LibreOffice building and working with profile feedback. This is not completely trivial:
    1. First configure with profile collection enabled:
      CFLAGS="-O2 -fprofile-generate -fprofile-correction" CXXFLAGS="-O2 -fprofile-generate -fprofile-correction" LDFLAGS="-O2 -fprofile-generate -fprofile-correction" <srcdir> --enable-mergelibs --disable-symbols --without-doxygen --enable-python=no --enable-lto
    2. Unfortunately the tree fails to build; the dynamic linker fails with "cannot load any more object with static TLS". While profiling uses TLS, it uses the global-dynamic model, and I failed to figure out what introduces static TLS into the libraries. In my limited understanding, this problem ought to happen only when objects built without -fPIC are linked into a dynamic library and the dynamic library is loaded by dlopen. I build on x86-64, where in that case I would get warnings about wrong relocations being used, and I am not getting these.

      Anyway, there is an ugly workaround: preload the largest library. This will get the dynamic linker to allocate enough space for the tables before the application starts:

      CPPUNITTRACE="LD_PRELOAD=<builddir>/instdir/program/" make
      I still had to disable couple more unit tests to let the build finish. Something to debug later ;)

      It is a good idea to have the DISPLAY environment variable set to a working X server. In this case more testing of interactive code is done. I use vncserver to get things trained.
    3. To train the application, I started LibreOffice, went to the writer and quit. I assume that the unit testing has pretty good coverage and thus the application is reasonably trained once the build finishes. It is still good to be sure that the functions executed at startup are triggered. It may also be a good idea to do make check (with the LD_PRELOAD hack) for some additional testing, if it is not failing with even more unit tests.

      In fact there is an interesting problem with this - the code layout produced by the LTO build is likely more optimized for unit testing than for the actual application. It may make sense to start LibreOffice enough times that the normal startup prevails, or to avoid the unit testing and do some more extensive training like Firefox does.
    4. Archive profile files (this step is needed, since make clean will remove them):
      tar czvf /tmp/profile.tgz `find . -name '*.gcda'`
    5. Build with profile feedback:
      make clean
      sed s/profile-generate/profile-use/g -i
      tar xzvf /tmp/profile.tgz

    Compile time

    LibreOffice can build in two modes - default and mergedlibs. In mergedlibs most of the shared libraries are built into one library (libmerged) to reduce dynamic linking overhead. Of course mergedlibs is very interesting for LTO, so in most cases I compiled LibreOffice in both settings and did most of the testing only in mergedlibs mode. Since a build takes about an hour, I did not complete the full matrix and chose only the combinations I consider interesting.

    There seem to be no major surprises: LTO got significantly faster since GCC 4.8 times (by 32%), by disabling fat LTO files and thus not compiling everything twice. This time I did not enable LTO parallelism (-flto=16), because it is a bit hard to glue into the Makefile and, as shown later, it does not make that major a difference. Clearly there is thus low hanging fruit: LTO builds can get faster by getting the parallelism right. The good news is that even with this simple setup LTO is not significantly slower than normal compilation.

    GCC 4.7 cannot build LibreOffice with LTO.

    LLVM at -O2 is about 32% faster than GCC 4.9. Sadly, GCC has noticeably slowed down since GCC 4.7 (by 11%). This was not so visible for the Firefox compilation, and I believe it is caused, in large part, by the more expensive initialization at startup of IRA (the new register allocator introduced in 4.8) and by the bazillions of new AVX/SSE builtins introduced. GCC 4.8 was a busy release. I did some work on this problem for GCC 5; I will publish an update on the build times once I get LibreOffice building reliably with that compiler.

    This graph shows system and user times. Clearly GCC consumes more system time - by requiring a separate assembler and by making more pages dirty. Again, the cost of LTO is not really extreme. I did not investigate why the GCC 4.9 mergedlibs build did better than the others; it seems a bit odd.

    A good improvement to LTO builds would probably be to avoid the completely useless assembler stage at compile time - at the moment we produce LTO bytecode and dump it in ASCII format for the assembler to produce the slim LTO object. GCC has all the code needed to write directly into object files (it is used during the linking stage); all that needs to be done is to enable it for slim LTO compilation.

    Memory use during compilation

    GCC 4.9
    GCC 4.9 mergedlibs

    GCC 4.9 LTO

    GCC 4.9 LTO mergedlibs
    Here some extra memory (about 2GB) is needed, but it can be tweaked by reducing parallelism, as in the case of the Firefox builds. In my setup up to 180 parallel LTO optimization processes were run, something that is definitely not needed or optimal.

    The largest LibreOffice library is just a fraction of the size of the largest Firefox library, so memory usage does not change as significantly as I reported last time. Obviously the linking occupies the time after 2900s, where CPU utilization drops; that can be tuned on the Makefile side. While the LTO builds are not faster than normal builds yet, I think this milestone can be reached with a little bit of extra tuning.


    LLVM LTO, mergedlibs

    As seen in the second graph, GCC 4.9 needs about 1000 seconds for the final link stage with LTO. (Linking is easily visible as the drop in CPU usage caused by lost parallelism.) This is about 50% faster than LLVM's.

    LLVM uses a lot less memory while compiling and its compilation stage is also faster. For GCC 5 I reduced the startup cost of LTO compilation, so hopefully these differences will get smaller.

    GCC 4.8 LTO, mergedlibs

    While GCC 4.9 got faster at link time, it got slower at compile time. Again something I hope to track down in GCC 5.

    Binary size

    LTO with the merged library saves about 30MB from the installation directory. Not bad! FDO saves 14MB. Oddly enough these two are not additive, and the LTO+FDO build is about the same size as the LTO build. Perhaps there are better ways to train LibreOffice than by the default unit testing.

    GCC clearly needs some tuning for code size - LLVM builds smaller binaries with the default setting and bigger ones with -Os. Also observe that the code has grown in size for GCC 4.9. This may be partly my fault - for 4.9 I did not re-tune the inliner heuristics (because I left SUSE, where I had the setup). It is a good idea to re-tune each release - GCC measures function sizes after early optimizations, and as early optimizations get more aggressive each release, the functions get smaller (in GCC's perception) and thus become better inline candidates. With each release it is usually possible to trim down the limits somewhat without losing performance on the benchmarks. Doing so is however a tedious process, because a large body of benchmarking needs to be done (SPEC is not enough) and it usually takes me weeks. The difference is however big enough that there may be some other bug that went unnoticed into the release :(.

    This graph breaks down the size of libmergedlo (the largest library). LTO leads to a 17% code size reduction and a 15% reduction of the resulting binary. This is definitely nice; the reduction is 7MB overall. This code size difference is a lot better than the one I reported for Firefox (though smaller than the one reported here). I suspect this is because release builds of Firefox are done with identical code folding and section garbage collection in the linker, which achieves a lot of unreachable code removal. Perhaps something to try with LibreOffice, too.

    FDO works even better, with a 23% overall binary reduction and a 36% code size reduction.

    Finally, LTO+FDO has a slightly bigger code size, while the overall binary size shrinks through reductions in the data segment, relocations and EH.

    LLVM behaves somewhat differently - by default it produces smaller code, and LTO makes the code bigger. LLVM's -Os also needs tuning - the binary size is bigger than the GCC FDO-optimized build and definitely slower.

    Memory footprint after startup

    For this I simply opened LibreOffice Writer and measured memory in top. The memory usage corresponds roughly to the libmergedlo size, with the exception of the -Os builds, which consume more memory than the FDO builds. This is expected, because FDO enables an optimized code layout, requiring a smaller portion of the library to be read. Also, -Os is not good for code locality, as it skips a lot of inlining and alignment.

    GCC 4.9 mergedlibs saves 5MB, LTO an additional 2MB and FDO+LTO an additional 6MB. Overall 13MB (18%). Not bad. FDO alone gets close to the FDO+LTO build. I also plan to work on static code layout improvements for GCC, hopefully getting LTO closer to the FDO+LTO scores.

    LLVM with mergedlibs is 3MB smaller than GCC 4.9 with mergedlibs and 3MB bigger than GCC 4.9 with -Os and mergedlibs (again, the code size is probably to blame). LLVM with LTO and mergedlibs beats all GCC builds except the one with FDO.


    I am not an expert on LibreOffice benchmarking and I plan to update this section with more reliable data. I did a very simple test of converting the OpenDocument specification into PDF:
    time program/soffice --headless --convert-to pdf --outdir /tmp ~/OpenDocument-v1.2-os.odt

    Judging from the stdout output, this test consists of about 50% startup time and 50% conversion time. I measured both hot runs (after the first run) and cold runs (with the disk cache flushed).

    (Note that the graph does not start with 0)

    Overall I found LibreOffice less sensitive to compiler optimization than Firefox, probably because I found no way to measure any interesting internal loops.

    The mergedlibs build shows a small improvement from GCC 4.7 to 4.8. LTO seems to make no difference and FDO leads to just a small improvement. LTO+FDO together seem to do the job, improving performance by 5%. The graph also shows that -Os is not a good choice for overall performance.

    LLVM seems to be slightly slower than GCC 4.7, while its LTO build is on par with GCC 4.8+ without LTO.

    I also measured cold startup times. This graph has enough noise that it can only tell us that -Os does not make the startup faster.

    Beyond GCC 4.9

    As usual, when looking at a given problem for the first time, there is a lot of low hanging fruit. Here are a few things implemented for GCC 5.

    Removing write only global/static variables

    I never considered this important - having write-only variables in your program is stupid, and no developer would do that, right? Well, that is not the case in large C++ projects. Both Firefox and LibreOffice have quite a few write-only variables (in the case of Firefox they were accessed only in debugging mode) that occupy BSS space.  Implementing this simple transformation saved about 12% of BSS space.
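    To illustrate, here is the kind of pattern the pass cleans up (the names are made up for this sketch, not taken from Firefox or LibreOffice):

```cpp
#include <cstddef>
#include <cstdlib>

// A statistics counter that is only ever written, never read.
// Once the pass proves there is no reader, both the variable
// (its BSS slot) and the dead increments can be removed.
static long g_alloc_count;               // write-only: wasted BSS

void *tracked_malloc(std::size_t size) {
  ++g_alloc_count;                       // store with no readers
  return std::malloc(size);
}
```

With LTO the analysis works across the whole program, so even a non-static global can be proven write-only.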

    Reducing alignment of virtual tables

    The x86-64 ABI requires increasing the alignment of all arrays greater than 16 bytes to a 16 byte boundary (as opposed to the natural alignment of the data). This decision was made in the early 2000s in anticipation that autovectorization would become more important in the future. As an extension, GCC now also aligns every array greater than 32 bytes to a 32 byte boundary and handles structures the same way as arrays.

    This turns out to be extremely wasteful for C++ virtual tables (which are arrays of pointers, but their accesses will never be vectorized; rather, they are a kind of random access structure) and for RTTI.

    LLVM aligns to 16 bytes in both cases, which seems to cause over a 10% difference in data segment size. I updated GCC 5 to align virtual tables to pointer size only.
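    To get a feel for the waste, here is a back-of-the-envelope helper (my own illustration, not GCC code) computing the padding lost when a table of a given size is rounded up to a given alignment. A virtual table with 5 slots is 40 bytes with 8-byte pointers:

```cpp
#include <cstddef>

// Bytes of padding added when `bytes` is rounded up to `align`.
constexpr std::size_t padding(std::size_t bytes, std::size_t align) {
  return (bytes + align - 1) / align * align - bytes;
}

// A 40-byte vtable wastes 24 bytes at 32-byte alignment,
// 8 bytes at 16-byte alignment, and nothing at pointer (8-byte)
// alignment, which is what GCC 5 now uses for virtual tables.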

    Localizing symbols into comdat groups

    C++ inline functions are often duplicated into every translation unit that uses them. These are later merged by the linker to avoid duplicates in binaries.  What is however not handled by the linker is the case where an inline function accesses another function used purely by it.  Such functions may end up duplicated. For this reason I added a simple pass that makes such functions shared, too. This saves an additional 0.5%.
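    A minimal sketch of the situation (helper and api are made-up names). Imagine this lives in a header included from many translation units:

```cpp
// helper() has internal linkage, so before this pass every unit that
// kept a copy of api() also kept its own private copy of helper() -
// the linker deduplicates api()'s comdat, but not helper().
// The new pass places helper() into api()'s comdat group, so the
// linker keeps just one copy of the pair.
static int helper(int x) { return 2 * x; }        // used only by api()
inline int api(int x) { return helper(x) + 1; }
```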

    Reducing default data alignment

    Data alignment was bumped up to 16 bytes in the early 2000s to make it possible to safely use SSE accesses. This was done for all data structures above 32 bytes. With the arrival of larger vector types (AVX), the alignment was bumped up again and in turn caused noticeable bloat. As benchmarked by Intel engineers, this however does not bring any performance improvement - even a decade and a half later, accessing data by vector register is a rare enough event that it is not worth optimizing for by default. Moreover, the vectorizer now knows how to increase data alignment on demand. For this reason the default alignment was trimmed down.

    Code and data placement optimizations

    By default GCC puts little effort into code and data placement. The situation is better with FDO, where code is split into hot and cold sections, and moreover with FDO+LTO the code gets ordered by approximate execution order. I believe that LTO builds ought to start considerably faster and consume less memory than non-LTO builds (but they do not at the moment) and that we should look into code placement optimizations.

    Identical code folding

    Use of templates in C++ tends to produce a lot of identical functions. Martin Liška is working on an identical code folding pass in GCC that should be more effective than the linker implementation. I hope it will land in mainline soon.
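    A typical example of foldable instantiations (my own illustration): on typical targets, the signed and unsigned variants below lower to the same addition loop, so identical code folding can keep a single copy of the machine code:

```cpp
#include <cstddef>

// sum<int> and sum<unsigned> compile to byte-identical code on most
// targets - the loop and the additions do not depend on signedness -
// making the pair a candidate for identical code folding.
template <typename T>
T sum(const T *a, std::size_t n) {
  T s = 0;
  for (std::size_t i = 0; i < n; ++i)
    s += a[i];
  return s;
}
```

Doing the folding in the compiler rather than the linker lets the rest of the optimization pipeline benefit from the merged body.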


    Both LTO and FDO show good improvements on LibreOffice: 36% reduction of code size, 23% reduction of the overall binary, 18% in memory use after startup and 5% in a rather lame runtime benchmark. Build times and memory requirements seem to be within reasonable bounds, but some unit testing issues need to be analyzed before this can be declared production ready. Also FDO needs an ugly hack because of static TLS issues that definitely need to be fixed. It may be interesting to invest some effort into finding a decent train run for LibreOffice, even though the current unit testing seems to do quite well.

    I hope the situation will get better with GCC 5 and I also hope the LibreOffice build system will be updated to make FDO builds easy. LibreOffice has definitely turned out to be an interesting target to optimize for and I will try to keep an eye on it during the GCC 5 development cycle.

    GCC LTO alone is a good tool to reduce code size, but runtime performance and memory use do not change much. At least I saw no above-noise changes in the very limited benchmarking I did. With LLVM the situation is the opposite: LTO leads to a small code size increase but also to measurable performance improvements (so the performance gets similar to GCC without LTO).

    A more direct comparison of GCC and LLVM produced code would be interesting - LLVM seems to do better on code size at -O2 (by about 3%) and consequently needs less memory for the application to load. Even if the resulting code runs a bit slower, there may be more stupid and easy-to-fix issues that make GCC code bigger. A quick look at the produced binaries identified a few issues I fixed for GCC 5. We also may want to revisit our tuning tradeoffs, because most of them were set over a decade ago and tested on much smaller benchmarks. While the GCC 5 produced binary is now smaller than the GCC 4.8 one (resolving the regression), it is still bigger than LLVM's.

    LLVM also traditionally does better at compile time (by about 35%) and memory usage. Sadly there is a very noticeable regression in GCC 4.8 (increasing compile time by 11%). I am in the process of analyzing and fixing the compile time issues for GCC 5. On the other hand, GCC gets faster LTO link times when parallelism is enabled (or with debug symbols, which I did not analyze in detail in this post), making it more useful for day to day development.

    I will try to update on GCC 5 performance once I get all trees updated and working again.

    Wednesday, August 20, 2014

    Devirtualization in C++, part 6 - asking user for help

    In the previous posts of the series I described most of the new GCC devirtualization machinery. To cut the story short, GCC implements two kinds of devirtualization: full and speculative.

    Full devirtualization replaces a given polymorphic call by a direct call (that can get inlined later).  Speculative devirtualization is a weaker form, turning the polymorphic call into a conditional that tests whether the target is the predicted one, taking the direct call path if so and doing the indirect call otherwise.

    While performing speculative devirtualization the compiler can play somewhat unsafe and make assumptions that are not valid in 100% of cases. The main assumption made is that there are no new derived types introduced by code not visible to the compiler. With link time optimization this is usually true, but not necessarily so - new derived types can be introduced by libraries or plugins. While the compiler can generally attempt to track which instances may not be introduced by external code, this analysis is difficult and it is a question whether it can be implemented with reasonable precision within a statically optimizing compiler.

    The following example shows speculative devirtualization:
    struct A {virtual void foo() {}};
    void test (struct A *a)
    {
      a->foo ();
    }
     Here GCC 4.9 produces:
            movq    (%rdi), %rax
            movq    (%rax), %rax
            cmpq    $a::foo(), %rax
            jne     .L5
            jmp     *%rax
    Instead of:
            movq    (%rdi), %rax
            movq    (%rax), %rax
            jmp     *%rax
    If the target of the polymorphic call happens to be a::foo(), the speculatively devirtualized code is significantly faster, saving the call overhead (and enabling further optimizations).
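    The same guarded-call pattern can be sketched portably with plain function pointers (the names here are made up; the real transformation compares the value loaded from the virtual table):

```cpp
typedef int (*handler)(int);

int likely_target(int x) { return x + 1; }   // the predicted target
int other_target(int x) { return x * 2; }    // some other possible target

// Guarded indirect call: when the prediction holds, the direct call
// can be inlined and the branch is well predicted; otherwise we pay
// one extra compare on top of the original indirect call.
int dispatch(handler f, int x) {
  if (f == likely_target)
    return likely_target(x);  // direct, inlinable call
  return f(x);                // fallback: plain indirect call
}
```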

    As I measured in part 5, speculative devirtualization can capture a significant percentage of calls that full devirtualization can not, and it is correct about its prediction in the vast majority of cases.

    C++ provides four ways that may be used to inform the compiler that it has complete information on derived types:
    1. declaring a type in an anonymous namespace,
    2. declaring a type in local scope (within function body),
    3. using the final specifier on a type, or,
    4. using the final specifier on a method.
    Among these hints the first one is the most powerful. Declaring a type in an anonymous namespace not only ensures that no new derived types will appear in other translation units, but also makes it impossible for code in other compilation units to access the data stored in the given type. This enables other optimizations, such as changing the memory layout. This is however not implemented in GCC yet.
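    A small sketch of the first hint (the types are hypothetical): because the base type lives in an anonymous namespace, no other translation unit can derive from it, so the compiler sees the complete set of overriders and can fully devirtualize the call.

```cpp
namespace {  // Shape is invisible outside this translation unit,
             // so its inheritance tree is complete here.

struct Shape {
  virtual int area() const { return 0; }
  virtual ~Shape() {}
};

struct Square : Shape {
  int side;
  explicit Square(int s) : side(s) {}
  int area() const override { return side * side; }
};

}  // anonymous namespace

// s->area() can be resolved to Shape::area or Square::area without
// any speculation - no external derived type can exist.
int total_area(const Shape *s) { return s->area(); }
```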

    In this post I will focus on the C++11 final specifier and a way the compiler can aid the user in annotating the code to improve code quality.

    New warnings

    The example given can be "improved" either by annotating struct A as final:
    struct A final {virtual void foo() {}};
    void test (struct A *a)
    {
      a->foo ();
    }
    Or by annotating the method foo as final:
    struct A {virtual void foo() final {}};
    void test (struct A *a)
    {
      a->foo ();
    }
    Both allow full devirtualization. Compiling either example with GCC 4.9 will lead to an empty function test (the result of full devirtualization and inlining of the empty function A::foo).
    For this reason I added two warnings: -Wsuggest-final-types and -Wsuggest-final-methods.

    As the names suggest, the warnings tell users about types or methods where a final specifier would improve code quality. Both warnings are selective and do not complain about all types with no derived type, nor about all virtual methods with no override. They also do not warn in cases where the compiler can easily determine the optimization itself (thus they are necessarily sensitive to compiler version and optimization settings, and therefore not good candidates for -Werror).
    $ gcc -O2 -Wsuggest-final-types -Wsuggest-final-methods t.C
    t.C:1:8: warning: Declaring type ‘struct A’ final would enable devirtualization [-Wsuggest-final-types]
     struct A {virtual void foo() {}};
    t.C:1:24: warning: Declaring method ‘virtual void A::foo()’ final would enable devirtualization [-Wsuggest-final-methods]
     struct A {virtual void foo() {}};
    The implementation on top of the existing devirtualization machinery is quite straightforward. (Speculative) devirtualization is implemented by filling in a polymorphic call context for a given call.  This context consists of the outer type, which is the type of the instance the virtual table belongs to, the offset within the outer type where the virtual table pointer is located, and two flags.  The first (maybe_in_construction) specifies that the type may be in construction and thus virtual functions of base types need to be considered, while the second (maybe_derived_type) specifies that the instance may actually be of a type derived from the outer type. In most cases maybe_derived_type is set, because the type is determined from references in the source code.
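    Schematically, the context can be thought of as the following record (the field names follow my description above, not GCC's actual internal data structure):

```cpp
// Illustrative sketch only - not GCC's real representation.
struct polymorphic_call_context {
  const void *outer_type;      // type the virtual table belongs to
  long offset;                 // where the vtable pointer sits in outer_type
  bool maybe_in_construction;  // base types' virtual methods must be considered
  bool maybe_derived_type;     // instance may be of a type derived from outer_type
};
```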

    Once filled in, the polymorphic call context is turned into a list of methods that may be called by the given polymorphic call. This is done by a recursive walk of the type inheritance tree, as described in part 4 of the series. The resulting list is either complete or incomplete. Incomplete lists contain all possible targets within the current compilation unit, but the call may lead to virtual functions defined outside of it. Complete lists contain all possible targets. Incomplete lists are useful for speculation and profile prediction. Complete lists can drive full devirtualization as well as optimizations propagating information across the callgraph.

    When the new warnings are requested, the speculative devirtualization pass looks for contexts that have maybe_derived_type set and that contain precisely one method but are incomplete. If such a context is found, there are two options:
    1. If the outer type has no derived types, -Wsuggest-final-types will output a warning that the type may be declared final.
    2. If the method has no override, -Wsuggest-final-methods will suggest declaring it final.

    How useful are the warnings?

    The main purpose of the final keyword is not to assist code optimization. It is meant to be syntactic sugar making APIs more robust. Unlike in Java codebases, the use of the final keyword in C++ is not very widespread, because it was introduced to C++ only recently. There are large code bases that were developed in the pre-C++11 era, and programmers are not used to the new feature.

    I do not expect users to blindly annotate every type or method suggested. Instead I would advise to first try -Wsuggest-final-types and inspect the individual types found.  Once suitable ones are annotated, -Wsuggest-final-methods can be used to annotate suitable methods: annotating a type final renders final specifiers on its methods pointless.

    The warnings can be used with both normal compilation and with link time optimization.  The latter makes them a lot more precise: during link time optimization the compiler has more information about derivations of a given type. Moreover, more devirtualization and inlining happens in link time optimization builds.

    Prioritizing the importance

    As an experiment I produced warnings from a compilation of Firefox's libxul
    library.  -Wsuggest-final-types identified 1772 types and -Wsuggest-final-methods 7215 methods (because of a non-linearity in GCC's line information lookup, it takes about 5 minutes just to output them from the compiler, and it would take ages to inspect all of them by hand!).  It is no surprise that the Firefox developers were unimpressed and asked me whether I can prioritize the output somehow.

    For this reason I implemented simple accounting in the devirtualization pass and made it output the warnings in priority order, where priority is given by the number of calls that would be fully devirtualized by the annotation, either statically or dynamically when profile feedback is available.  The Firefox warnings now look as follows:
    /aux/hubicka/firefox/editor/libeditor/html/nsHTMLEditor.h:66:0: warning: Declaring type ‘struct nsHTMLEditor’ final would enable devirtualization of 575 calls [-Wsuggest-final-types]
     class nsHTMLEditor : public nsPlaintextEditor,
    /aux/hubicka/firefox/docshell/base/nsDocShell.h:124:0: warning: Declaring type ‘struct nsDocShell’ final would enable devirtualization of 216 calls [-Wsuggest-final-types]
     class nsDocShell : public nsDocLoader,
    /aux/hubicka/firefox/intl/icu/source/common/unicode/uniset.h:276:0: warning: Declaring type ‘struct UnicodeSet’ final would enable devirtualization of 146 calls [-Wsuggest-final-types]
     class U_COMMON_API UnicodeSet : public UnicodeFilter {
    ../../dist/include/mozilla/Selection.h:48:0: warning: Declaring type ‘struct Selection’ final would enable devirtualization of 131 calls [-Wsuggest-final-types]
    /aux/hubicka/firefox/intl/icu/source/common/unicode/unistr.h:245:0: warning: Declaring type ‘struct UnicodeString’ final would enable devirtualization of 115 calls [-Wsuggest-final-types]
     class U_COMMON_API UnicodeString : public Replaceable
    /aux/hubicka/firefox/intl/icu/source/i18n/unicode/decimfmt.h:663:0: warning: Declaring type ‘struct DecimalFormat’ final would enable devirtualization of 92 calls [-Wsuggest-final-types]
     class U_I18N_API DecimalFormat: public NumberFormat {
    /aux/hubicka/firefox/netwerk/protocol/http/nsHttpChannel.h:40:7: warning: Declaring type ‘struct nsHttpChannel’ final would enable devirtualization of 65 calls [-Wsuggest-final-types]
     class nsHttpChannel : public HttpBaseChannel
     Followed by method warnings:
    /aux/hubicka/firefox/dom/bindings/CallbackObject.cpp:32:0: warning: Declaring method ‘AddRef’ final would enable devirtualization of 148 calls [-Wsuggest-final-methods]
    /aux/hubicka/firefox/content/media/MediaDecoder.cpp:1440:0: warning: Declaring method ‘GetReentrantMonitor’ final would enable devirtualization of 108 calls [-Wsuggest-final-methods]
     ReentrantMonitor& MediaDecoder::GetReentrantMonitor() {
    /aux/hubicka/firefox/dom/base/nsGlobalWindow.cpp:9294:0: warning: Declaring method ‘GetExistingListenerManager’ final would enable devirtualization of 103 calls [-Wsuggest-final-methods]
     nsGlobalWindow::GetExistingListenerManager() const
    /aux/hubicka/firefox/dom/base/nsGlobalWindow.cpp:9281:0: warning: Declaring method ‘GetOrCreateListenerManager’ final would enable devirtualization of 101 calls [-Wsuggest-final-methods]
    /aux/hubicka/firefox/dom/bindings/CallbackObject.cpp:33:0: warning: Declaring method ‘Release’ final would enable devirtualization of 76 calls [-Wsuggest-final-methods]
    /aux/hubicka/firefox/intl/icu/source/common/unistr.cpp:398:0: warning: Declaring virtual destructor of ‘struct UnicodeString’ final would enable devirtualization of 68 calls [-Wsuggest-final-methods]
    /aux/hubicka/firefox/intl/icu/source/common/uvector.cpp:86:0: warning: Declaring virtual destructor of ‘struct UVector’ final would enable devirtualization of 63 calls [-Wsuggest-final-methods]
     UVector::~UVector() {
    /aux/hubicka/firefox/editor/libeditor/html/nsHTMLEditor.cpp:3189:0: warning: Declaring method ‘DeleteNode’ final would enable devirtualization of 49 calls [-Wsuggest-final-methods]
     nsHTMLEditor::DeleteNode(nsIDOMNode* aNode)
    Overall 9270 polymorphic calls (out of 68162 in the libxul binary) can be devirtualized by declaring types final, while annotating just the top 100 types would handle 3876 virtual calls.

    13952 calls (23% of all polymorphic calls in the libxul binary) can be devirtualized by declaring methods final. The distribution is however significantly flatter - declaring the first 500 methods final would devirtualize only 4976 calls.  As the quoted warnings suggest, however, many cases arise from macros and template declarations, so there is some hope these can be handled more centrally.

    Firefox experiment

    Will developers follow the warnings? Fortunately I did not need to wait too long to find out :)

    Based on warnings I posted to GCC IRC, Trevor Saunders prepared a patch adding 145 finals to the Firefox sources and filed PR1047696. I found it interesting to read the reactions of other developers wondering why the annotation is a good idea. Trevor found that roughly 3600 indirect calls are saved by his patch - not far from my estimate above. Reportedly the patch is on its way upstream.

    Using profile feedback

    Can we do better? Well, with profile feedback we can. I implemented this in this patch. Firefox has a profiled build mode where the browser trains itself on rendering various pages and is rebuilt with precise execution counts for every basic block.  With this GCC can prioritize the warnings by execution counts rather than by static counts. With this, only 10 types are identified as hot candidates:
    /aux/hubicka/firefox-martin/gecko-dev/docshell/base/nsDocShell.h:125:0: warning: Declaring type ‘struct nsDocShell’ final would enable devirtualization of 576 calls executed 8840 times [-Wsuggest-final-types]
     class nsDocShell : public nsDocLoader,
    /aux/hubicka/firefox-martin/gecko-dev/content/base/src/nsScriptLoader.h:38:0: warning: Declaring type ‘struct nsScriptLoader’ final would enable devirtualization of 64 calls executed 608 times [-Wsuggest-final-types]
     class nsScriptLoader : public nsIStreamLoaderObserver
    /aux/hubicka/firefox-martin/gecko-dev/xpcom/base/nsCycleCollector.cpp:2019:7: warning: Declaring type ‘struct GCGraphBuilder’ final would enable devirtualization of 20 calls executed 262 times [-Wsuggest-final-types]
     class GCGraphBuilder : public nsCycleCollectionTraversalCallback,
    /aux/hubicka/firefox-martin/gecko-dev/layout/base/nsCaret.h:23:0: warning: Declaring type ‘struct nsCaret’ final would enable devirtualization of 130 calls executed 228 times [-Wsuggest-final-types]
     class nsCaret : public nsISelectionListener
    /aux/hubicka/firefox-martin/gecko-dev/netwerk/protocol/http/nsHttpConnectionInfo.h:32:7: warning: Declaring type ‘struct nsHttpConnectionInfo’ final would enable devirtualization of 2 calls executed 100 times [-Wsuggest-final-types]
     class nsHttpConnectionInfo
    /aux/hubicka/firefox-martin/gecko-dev/layout/base/../../content/base/src/nsXMLHttpRequest.h:183:0: warning: Declaring type ‘struct nsXMLHttpRequest’ final would enable devirtualization of 124 calls executed 78 times [-Wsuggest-final-types]
     class nsXMLHttpRequest : public nsXHREventTarget,
    /aux/hubicka/firefox-martin/gecko-dev/parser/html/nsHtml5StreamListener.h:31:0: warning: Declaring type ‘struct nsHtml5StreamListener’ final would enable devirtualization of 8 calls executed 70 times [-Wsuggest-final-types]
     class nsHtml5StreamListener : public nsIStreamListener,
    /aux/hubicka/firefox-martin/gecko-dev/xpfe/appshell/src/nsWebShellWindow.h:26:0: warning: Declaring type ‘struct nsWebShellWindow’ final would enable devirtualization of 62 calls executed 12 times [-Wsuggest-final-types]
     class nsWebShellWindow : public nsXULWindow,
    /aux/hubicka/firefox-martin/gecko-dev/dom/base/nsScreen.h:20:0: warning: Declaring type ‘struct nsScreen’ final would enable devirtualization of 16 calls executed 2 times [-Wsuggest-final-types]
     class nsScreen : public mozilla::DOMEventTargetHelper
    and 31 methods, with only a few executed over 100 times:
    /aux/hubicka/firefox-martin/gecko-dev/docshell/base/nsDocShell.h:157:0: warning: Declaring method ‘_ZThn424_N10nsDocShell17GetIsBrowserOrAppEPb’ final would enable devirtualization of 4 calls executed 7878 times [-Wsuggest-final-methods]
    /aux/hubicka/firefox-martin/gecko-dev/image/src/imgRequestProxy.cpp:93:0: warning: Declaring method ‘AddRef’ final would enable devirtualization of 70 calls executed 4924 times [-Wsuggest-final-methods]
    /aux/hubicka/firefox-martin/gecko-dev/dom/base/nsGlobalWindow.h:386:0: warning: Declaring method ‘_ZThn32_N14nsGlobalWindow10GetRealTopEPP12nsIDOMWindow’ final would enable devirtualization of 6 calls executed 1590 times [-Wsuggest-final-methods]
    /aux/hubicka/firefox-martin/gecko-dev/layout/base/nsPresContext.cpp:332:0: warning: Declaring method ‘Release’ final would enable devirtualization of 302 calls executed 1112 times [-Wsuggest-final-methods]
    /aux/hubicka/firefox-martin/gecko-dev/dom/base/nsGlobalWindow.h:386:0: warning: Declaring method ‘_ZThn32_N14nsGlobalWindow13GetRealParentEPP12nsIDOMWindow’ final would enable devirtualization of 8 calls executed 826 times [-Wsuggest-final-methods]
    /aux/hubicka/firefox-martin/gecko-dev/dom/bindings/CallbackObject.cpp:33:0: warning: Declaring method ‘Release’ final would enable devirtualization of 3262 calls executed 670 times [-Wsuggest-final-methods]
    /aux/hubicka/firefox-martin/gecko-dev/content/base/src/nsScriptLoader.cpp:179:0: warning: Declaring method ‘AddRef’ final would enable devirtualization of 14 calls executed 608 times [-Wsuggest-final-methods]
     NS_IMPL_ISUPPORTS(nsScriptLoader, nsIStreamLoaderObserver)
    The execution counts are actually surprisingly small. This is partly because developers still tend to keep polymorphic calls out of hot paths, and partly because the train run is not very computationally intensive. I would be very interested to know whether the benefits can be measured in some practical benchmarks.

    Next time I plan to write about determining the dynamic types of heap allocated memory.