Honza Hubička's Blog

Thursday, May 2, 2019

GCC 9: Link-time and inter-procedural optimization improvements

GCC 9 was branched last Friday (see a partial list of changes) and if all goes well, it will be released tomorrow. It is thus time for me to write an update on recent developments in my favourite areas: inter-procedural analysis (IPA) and link-time optimization (LTO).

This continues the series I wrote on GCC 8, GCC 6, GCC 5, and GCC 4.9 (part 1 and 2). (Sorry, no GCC 7 blog post at all. I was lazy.)

Big picture

GCC originally optimized on a per-function basis (with a simple inliner being the only inter-procedural optimization done by the compiler). The inter-procedural framework dates back to 2003, starting as a simple hack to solve the compile-time explosion caused by template-heavy C++ code, which started to appear at that time. Since this framework is fully transparent to users, it has been enabled by default for many years.

The link-time optimization branch was created in 2005 and merged into mainline in 2009. It took many more years to turn LTO from a "benchmark toy" into something that can build a significant portion of the programs we run today.

At the time of my first report, in 2014, GCC 4.9 was able to link-time optimize real-world applications (I have been testing Firefox quite regularly since 2010 and keep my eye on other programs including LibreOffice, Chromium and the Linux kernel). The resulting binaries were better than non-LTO builds, but widespread adoption of LTO was still blocked by many "small" issues including debug info quality, linking times, memory use and various bugs showing up in more complicated projects. It took a while, but basically all those issues were addressed.

The main LTO related change of GCC 8 was the completion of the early debug infrastructure. This fixed the last real showstopper and enabled a number of cleanups in the implementation by properly separating the intermediate language used by GCC (Gimple) from the debug information (which is now turned into DWARF early and, in large part, copied through into the final executable).

In GCC 9 we were finally able to fine-tune the infrastructure by dropping a lot of data previously used for debug-info generation from the LTO streams. We plan to make LTO more mainstream now.

Enabling LTO by default for OpenSUSE Tumbleweed

LTO in GCC was always intended to become the default mode of compilation for release builds (rather than being limited to high performance computing and benchmarks). We believe that things are ready now (and basically have been since GCC 8).

At SUSE we are working on turning LTO on by default for OpenSUSE Tumbleweed. Martin Liška did a lot of work to get a working staging build with LTO and plans to do the switch after GCC 9 lands in Tumbleweed, which should not be too long after the release. Initial tests look promising: only about 150 packages need LTO disabled so far.

The resulting Linux distro works and passes testing. So far we just gathered data on the size of individual binaries in the system.

The overall size of installed binaries decreases by 5% and debug info size by 17%. Note that these numbers also include packages not built with LTO (for example go, which uses its own compiler) and packages that already use LTO by default (like LibreOffice). Thus the real savings from enabling LTO are higher than this.

Histogram of binary size changes comparing GCC 9 without LTO to GCC 9 with LTO.

To summarize:

65% of binaries (counted as a portion of overall size, not number of files) decrease in size by more than 1%. Some of them quite substantially.
12% of binaries do not change at all. This may be due to the fact that they are built without LTO, built by a compiler other than GCC, or there is a configure problem. I have removed the five biggest offenders before producing the histogram: go, postgres, libqt-declarative, boost-extra and skopeo.
8% of binaries increase in size by more than 1%. Some of them because of aggressive optimization flags, but there are cases which seem like build machinery issues. For example, all binaries in the python-qt package increase 27 times!

To avoid confusion, I should stress that prior to every major release of GCC, Tumbleweed is rebuilt and open-QA tested with the new compiler. The same testing is also done by RedHat and others. This has a very important effect on the quality of the compiler: building 10k packages is more thorough testing than bootstrapping the compiler itself and running the testsuite. This time we (well, mainly Martin Liška) did it, for the first time, also with LTO.

Similar experiments are also done by Gentoo. Mandriva also switched from GCC non-LTO to LLVM LTO for their builds, but there seems to be no quality data available. It would be very interesting to get some comparison done.

What is new in GCC 9?

Changes motivated by Firefox benchmarking

Around Christmas I spent quite a lot of time testing Firefox with GCC and comparing it to Clang builds using the official Firefox benchmark servers. My findings are discussed in this post.

Overall GCC 8 seemed to compare well to GCC 6, GCC 7, and Clang 6, 7 and 8 builds. Most importantly, it produced significantly smaller and measurably faster binaries than Clang 7 when profile feedback and LTO are used. This became the default for some distributions including RedHat. Funnily enough, SUSE still does not build with profile feedback, but LTO was enabled. What remains to be resolved is getting the Firefox train run working in the containers used by SUSE's build service.

Many interesting issues were uncovered during this project including:

GCC inliner settings are too restrictive for large C++ code-bases with LTO. Increasing the limits leads to significant runtime improvements.
GCC's -O2 does not auto-inline, which seems to be an important limiting factor for modern C++ code.
For example, Firefox has a function isHTMLWhitespace which is called very many times during parsing. It is not declared inline, and GCC-built binaries are thus slower unless -O3 or profile feedback is used. See PR88702.
Firefox uses Skia for rendering some of the fancier primitives. Skia contains SSE and AVX optimized rendering routines which are written using Clang-only vector extensions. It seems it would not be hard to port them to GNU vector extensions (supported by both compilers), and some work in this direction has already been done (for example, Jakub Jelínek added support for builtins for vector conversions). So far, however, I have not had a chance to work on this more.
Some bugs in GCC were fixed, most importantly a nasty wrong-code issue with devirtualizing calls in thunks.

All correctness fixes made it into GCC 9 and were also backported to GCC 8 and GCC 7. I also squeezed in some late changes to the inliner to improve code quality. However, -O2 and bigger inliner retuning will only be done for GCC 10.

Results of this retuning can be seen here (takes a while to load). Probably the most notable improvement is a 13% speedup of the tp5o responsiveness benchmark. This benchmark tests the response time while rendering the 50 most common pages from 2011. The GCC-built Firefox binary is also significantly smaller, as I show below.

LTO streaming cleanups and scalability improvements

Separating debug info from the intermediate language permits a lot of cleanups. We went through all streamed data and verified it is still relevant. The changes are too numerous to list in full, but let me describe the most important ones.

Type simplification

Probably the most surprising improvement was accomplished by simplifying types prior to streaming. GCC merges identical types at link-time, but it turned out that many duplicates were left over after merging, for example when combining:

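/* unit 1: struct a is complete here */
struct a {int a;};
struct b {struct a *a;};
void foo (struct b b) { }

with

/* unit 2: struct a is only forward-declared */
#include <stddef.h>
struct a;
struct b {struct a *a;};
extern void foo (struct b);
void bar (void) { struct b b = {NULL}; foo (b); }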

Here, in the first compilation unit, struct b contains a pointer to struct a, which is complete. In the second unit the representation of struct b has a pointer to the incomplete type struct a. This prevents the two types from being merged.

While this seems like a minor problem, it is not. Large projects contain complicated data structures, and the bigger a structure is, the more pointers it contains and thus the higher the chance that its representations will end up being different. For example, in GCC the core structure gimple, representing our intermediate language, ended up duplicated 320 times. As a consequence all declarations, constants and other objects having type gimple ended up being duplicated, too. This chain reaction turned out to be surprisingly expensive.

To solve this problem, GCC's free lang data pass was extended to produce an incomplete variant of every pointed-to type and to replace pointers to complete structures by pointers to incomplete structures.
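The effect can be illustrated with a toy model. The sketch below is not GCC's implementation: it is a minimal Python model which assumes that two types merge only when a structural key matches. The names Struct, Pointer, merge_key and simplify are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Pointer:
    target: "Struct"


@dataclass(frozen=True)
class Struct:
    name: str
    # None means the type is incomplete (declared but not defined).
    fields: Optional[Tuple[Tuple[str, object], ...]] = None


def merge_key(t):
    """Structural key: in this model two types merge only if keys are equal."""
    if isinstance(t, Pointer):
        return ("ptr", merge_key(t.target))
    if isinstance(t, Struct):
        if t.fields is None:
            return ("struct", t.name, "incomplete")
        return ("struct", t.name,
                tuple((n, merge_key(f)) for n, f in t.fields))
    return ("scalar", t)


def simplify(t):
    """Replace every pointed-to struct by its incomplete variant."""
    if isinstance(t, Pointer):
        return Pointer(Struct(t.target.name))
    if isinstance(t, Struct) and t.fields is not None:
        return Struct(t.name, tuple((n, simplify(f)) for n, f in t.fields))
    return t


# Unit 1 sees the complete struct a; unit 2 sees only a forward declaration.
a_complete = Struct("a", (("a", "int"),))
a_incomplete = Struct("a")
b_unit1 = Struct("b", (("a", Pointer(a_complete)),))
b_unit2 = Struct("b", (("a", Pointer(a_incomplete)),))

merged_before = merge_key(b_unit1) == merge_key(b_unit2)
merged_after = merge_key(simplify(b_unit1)) == merge_key(simplify(b_unit2))
print(merged_before, merged_after)
```

In this model the two variants of struct b get different keys before simplification and the same key afterwards, which is the property the pass relies on.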

Improved scalability of partitioning

Simplifying types enabled further improvements. GCC supports parallel (and distributed) link-time optimization (with -flto=n, where n is the number of threads, or with -flto=jobserver). This is implemented by:

Reading all global information (types, declarations, symbol table and optimization summaries) into the link-time optimizer (known as the WPA, or whole program analysis, stage) and performing inter-procedural optimization (unreachable code removal, inlining, identical code folding, devirtualization, constant propagation, cloning and more).
Partitioning the program, once inter-procedural optimization is done, into a fixed number of partitions and streaming them into temporary object files.
Optimizing every partition independently, applying the global decisions made at the WPA stage (this is known as the ltrans, or local transformation, stage).
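The shape of this pipeline can be sketched in a few lines. This is only a toy model under loud assumptions: the function names wpa, ltrans and link are hypothetical, "optimization" is reduced to dropping unused functions, and the round-robin split stands in for GCC's real partitioning, which is more careful than that.

```python
from concurrent.futures import ThreadPoolExecutor


def wpa(functions, partitions):
    """Serial stage: make global decisions, then split the program."""
    # Stand-in for a global decision: drop functions nothing refers to.
    reachable = [f for f in functions if f["used"]]
    # Round-robin split; the partition count does not depend on the host.
    parts = [reachable[i::partitions] for i in range(partitions)]
    return [p for p in parts if p]


def ltrans(partition):
    """Parallel stage: process one partition on its own."""
    return [f["name"] + ".o" for f in partition]


def link(functions, partitions=4, jobs=2):
    parts = wpa(functions, partitions)
    # Only the number of workers depends on the machine doing the build.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(ltrans, parts))
    return sorted(obj for part in results for obj in part)


funcs = [{"name": f"f{i}", "used": i % 3 != 0} for i in range(10)]
objects = link(funcs)
print(objects)
```

The point of the model is that the partition count is fixed in the serial stage, while the number of workers only changes how quickly the same partitions are processed.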

Clearly the serial stage is the main concern when it comes to scalability to a higher number of cores. However, until GCC 9 we also had important problems with increasing the number of partitions. This number needs to be fixed across build hosts (so it cannot depend on the actual parallelism available on the given machine) and is controlled by --param lto-partitions. I set this parameter to 32 in 2010. My testing machine had 16 threads and I thought we should set the bar a bit higher. While in 2010 you hardly had a CPU with more threads than that, this is definitely not true in 2019.

We managed to get two-digit improvements in streaming performance with each major release for several years, but it was still impossible to increase the default number of partitions without significantly penalizing build times on hosts with few cores.
Fortunately it turned out that the type simplification and a few other cleanups almost completely solved this for GCC 9.

I have omitted sizes for GCC 8 with a larger number of partitions to avoid re-scaling the graph and to keep the GCC 9 data visible.

As can be seen on the chart, increasing the number of partitions from 32 to 128 in GCC 8 almost doubled the amount of streamed data. With GCC 9 it grows by only 18%, and even on my 16-thread testing machine I now get speedups from increasing the default number of partitions past 16 because of improved memory locality during the ltrans compilation.

I have decided to set the new default to 128, but I do not think it would be a problem to go higher. In particular it should be manageable to keep partitions corresponding to the original object files and cache the files between LTO builds to improve the compile/edit cycle.

Faster WPA stage

As an effect of the mentioned (and other) changes, the WPA (whole program analysis) stage got significantly faster and more memory efficient. Comparing GCC 8 and 9 linking Firefox libxul.so with profile feedback I get:

Overall WPA time reduced from 128 to 103 seconds (24% improvement).
Time needed to stream in types and declarations reduced from 69 to 59 seconds (16% improvement), using 15% less memory. This is mostly because of the described streaming improvements enabled by the early debug work in GCC 8.
Inliner heuristics time reduced from 24.4 seconds to 20.4 seconds (17% improvement). This is mostly because of Martin Liška's work on a faster way to attach optimization summaries to the symbol table.
Time needed to stream out partitions reduced from 11 seconds to 7 (36% improvement), despite the fact that 128 partitions rather than 32 are now produced. The main win here is due to type simplification.
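A percentage improvement can be read relative to the old time or relative to the new one, and the two readings drift apart as the change gets bigger. The sketch below simply recomputes both from the timings listed above; the function names are mine and nothing here is new measurement data. For instance, the 24% quoted for overall WPA time matches the second reading (128/103), while the 36% quoted for streaming out matches the first ((11 - 7)/11).

```python
def reduction(before, after):
    """Percentage of the original time that was saved."""
    return 100.0 * (before - after) / before


def speedup(before, after):
    """Percentage by which the old time exceeds the new one."""
    return 100.0 * (before / after - 1.0)


# Timings in seconds as listed above (GCC 8, GCC 9).
timings = {
    "overall WPA": (128.0, 103.0),
    "stream in": (69.0, 59.0),
    "inliner heuristics": (24.4, 20.4),
    "stream out": (11.0, 7.0),
}

for stage, (before, after) in timings.items():
    print(f"{stage}: {reduction(before, after):.1f}% less time, "
          f"{speedup(before, after):.1f}% speedup")
```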

The WPA stage still remains a bottleneck. Fortunately there is still relatively low-hanging fruit here. In particular we plan to work on speeding up the identical code folding pass (which takes about 15% of WPA time) and the inliner heuristics, parallelizing the stream-in stage and improving the stream-out stage by using threads rather than fork. Hopefully GCC 10 will see significant improvements here again.

Benchmarks

LTO link-times and memory use

This is a brief summary of Firefox build times with link-time optimization and profile guided optimization (PGO) on my somewhat aged testing machine (an 8 core Bulldozer from 2014). The first thing to worry about when switching to LTO is the linking time and link-time memory use. There are noticeable improvements from GCC 8 to GCC 9:

While compile times look very good compared to Clang (and in general I think the compile times of both compilers are now largely comparable), GCC still consumes more memory. Partly this is due to a more ambitious implementation of LTO, which does more whole program analysis than Clang's ThinLTO.

Here is a graph showing memory and CPU use during the link. It clearly separates the serial WPA stage from the parallel ltrans stage. Note that Clang ThinLTO chooses to use 8 threads rather than 16. Increasing parallelism does not help build times because overall user time grows as well.

GCC 8 CPU and memory LTO optimizing Firefox libxul

GCC 9 shows memory use improvements by about 30%.

I am not at all happy about the peak at the end of the serial stage, which bumps overall memory use from 5GB to 7.5GB. This is caused by the streaming parallelism and can be controlled by --param lto-max-streaming-parallelism. This did not use to be a problem since memory use was dominated by local transformation. For GCC 10 I will work on eliminating this issue and perhaps backport the fix to GCC 9.2. In the bigger picture it is not a disaster, because the Firefox build consumes up to 10GB elsewhere. For this reason I decided not to rush a solution for GCC 9.1.

Clang 8 with ThinLTO makes fewer whole-program decisions, which shortens the serial stage of the build.

Overall I am very curious how the compile time and memory use of the two different LTO implementations in GCC and LLVM will continue evolving. It would make sense to implement a ThinLTO-like layer atop our WHOPR LTO compilation model. For now, however, I think the development effort is still better spent on optimizing the WPA stage.

Overall build times and code quality

I have done some performance comparison of various builds of Firefox at Christmas, see my previous blog post on it.

In summary, GCC-built binaries perform better than Clang 8 ones, but one needs to work around problems with Skia having vector-optimized graphics rendering routines only in Clang builds. The difference between GCC 8 and GCC 9 is not huge, but still noticeable. Most importantly, it turned out that the GCC 8 inliner was tuned a bit too low and in some benchmarks the LTO binaries ended up being slower than non-LTO ones. GCC 9 increases the limits and gets more consistent results. This, however, also led to some code size increases:

GCC with LTO produces significantly smaller binaries than Clang. This is consistent across bigger applications I tested, not only limited to Firefox (note that in my testing I used default optimization flags used by Firefox, which is a combination of -O2 and -O3). I think there are two main reasons:

With LTO I believe the main reason is the fact that GCC makes inlining decisions on a whole-program basis, while Clang's ThinLTO decides locally. This makes it possible for GCC to make better code size/performance tradeoffs.
With profile feedback another factor is the fact that GCC optimizes cold regions for size. It seems that LLVM folks have worked on a similar feature recently.

As not so good news, the non-PGO code segment increases by 7% between GCC 8 and GCC 9. Increasing code size, of course, also makes the compiler a bit slower. This is a result of the inliner re-tuning and was overall a painful decision to make, but I think a necessary one. It makes no sense for LTO binaries to be considerably slower than non-LTO ones, and across the compilers I tested, the GCC LTO implementation is the most code-size aware. These parameters were never seriously tuned except for SPEC benchmarks, which are too small to show the issues.

Testing was done on my 8 core 16 thread AMD Bulldozer machine. Both GCC and Clang were built with LTO and profile feedback. As in past years, the compile times are not very different. GCC tends to win without debug info and lose with debug info. Part of the reason is that GCC produces significantly more detailed debug info compared to Clang, and it thus takes longer to generate and assemble it. GCC 8 produces 2.3GB of debug info sections, GCC 9 2.5GB and LLVM 7 1.8GB. It is hard to qualitatively compare debug info, but this blog has some statistics.

SPEC 2017 benchmarks

In case you did not have enough numbers to look at, here is the performance of SPEC2017 normalized to GCC 8 with -Ofast -march=native, this time run on a more current Zen-based CPU. As is common with SPEC, bigger numbers are better.
This is based on benchmarks collected by Martin Jambor. Full LTO was used for Clang.

Overall, enabling LTO improves the integer part of the SPEC2017 suite by about 4%, profile guided optimization by about 2%, and together one gets a 6.5% improvement. GCC 9 outperforms Clang 8 by 6-10%. We used AOCC 1.3 to compile exchange, which is the only Fortran benchmark in the suite.

For the floating point part LTO brings about 3% improvement, as does PGO, and together one can expect about 6% improvement. Some benchmarks are missing with Clang 8 because they do not build (mostly due to the use of Fortran). Again we used AOCC 1.3 for bwaves, cam4 and roms.

Once we set up the new Fortran frontends for LLVM, we may be able to fill in the blanks. I am aware that PGI compares GCC to Flang in their slides. I believe the comparison is flawed because -march=native was not used for GCC. Without this flag SSE is used for floating point code generation, and this is significantly slower than AVX2. Also, Flang comes with its own math library that used to be faster than glibc and provides vectorized math functions. Glibc was improved significantly recently and vector functions were added. For serious math performance a glibc update is thus highly recommended :)

This is a summary of how SPEC numbers developed over the last 3 years, giving a little bit of extra context. Again the exchange benchmark for Clang 8 was replaced by one built by AOCC.

If you are not a compiler developer, single-digit SPEC differences may look small, but the benchmark suite is generally memory bound and the changes reported here are important. Overall LTO (and even more so LTO+PGO) can be a big win for complex programs with flat profiles (such as Perl or GCC in the SPEC testsuite). It is less important for projects where internal loops are rather small and can be hand-optimized and put into a single file (like, say, xz or exchange). Sometimes the compiler just gets lucky: for example the 50% speedup of Povray is mostly due to one fortunate partial inlining decision. This is something which could have been easily fixed at source level if the original developers had noticed the problem by profiling.

In the longer term I believe switching to an LTO-enabled build environment should make code bases easier to maintain. Letting the compiler do more work reduces the pressure to hand-optimize code placement (such as putting more code into inline functions in headers), adds a bit of extra robustness (such as ODR violation warnings) and therefore lets developers focus on more important issues. At the moment we compare performance of LTO optimized programs which were for ages tuned for non-LTO compilers. I would expect the importance of LTO to grow not only because new optimizations will be added but also because code bases will be re-tuned for the new environment.

I believe GCC 9 is a solid release. With the LTO framework mature enough, I hope we can finally see the performance and code size gains from enabling it by default for distribution builds. With a bit of luck this should happen for OpenSUSE Tumbleweed really soon, and hopefully others will follow. It took only 14 years to implement that :)

For GCC 10 I plan to continue working on improving WPA times and also again look into implementing new fancy optimizations.

Sunday, December 30, 2018

Even more fun with building and benchmarking Firefox with GCC and Clang

The recent switch of official Firefox builds to Clang on all platforms has triggered some questions about the quality of code produced by GCC compared to Clang. I use Firefox as an occasional testcase for GCC optimization and thus I am somewhat familiar with its build procedure and benchmarking. In my previous post I ran benchmarks of Firefox 64 comparing a GCC 8 built binary to official builds using Clang, which turned out to look quite good for GCC.

My post triggered a lot of useful discussion and thus I can write an update. Thanks to Nathan Froyd I obtained level 1 access to the Mozilla repository and Mozilla's try server, which allows me to do tests on the official setup. Mozilla's infrastructure features some very cool benchmarking tools that reliably separate useful data from the noise and run more tests than I did last time.

I am also grateful to Jakub Jelínek, who spent a lot of time bisecting a nasty misoptimization of spell checker initialization. I also did more joint work with Jakub, Martin Liška and Martin Stránský on enabling LTO and PGO on OpenSUSE and Fedora packages. (Based on LTO enabled git branch Martin Liška maintained for a while.)

Treeherder

Treeherder, the build system used by Mozilla developers, seems surprisingly flexible and kind of fun to use. I had the impression that it is a major pain to change its configuration and update to a newer toolchain. In fact it is as easy as one can hope for. I was surprised that with minimal rights I can do it myself. With an outline from Nathan I decided to do the following:

revert builds back to GCC 6,
update build system to GCC 8,
enable link-time-optimization (LTO) and compare performance to official builds

to see if there are code quality issues or bugs I can fix.

It took three days to get the configuration updated, and on Christmas Eve I got the first working LTO build with GCC 8. The build metrics looked great, but there was indeed a number of performance regressions (unlike in the tests I did two weeks ago, where basically all benchmarks looked good).

I had to watch a lot of dancing cat animations, sacrifice a few cows to St. IGNUcius and replace 5 object files with precompiled Clang assembler (these contain hand-optimized AVX vectorized rendering routines written using Clang-only extensions of the GNU vector extensions, which I explain later). Four days later I had resolved most of the performance problems (and enjoyed the festivities as well).

In this post I discuss what I learnt. I focus on builds with link-time optimization (LTO) and profile-guided optimization (PGO) only, because that is what official builds use now and because it makes it easier to get apples-to-apples comparisons. I plan to also write about performance with -O2 and -O2 -flto. Here the fact that the two compilers differ in their interpretation of optimization levels shows up.

Try server benchmarks: Talos comparison of GCC 8 and Clang 6 binaries built with link-time optimization (LTO) and profile guided optimization (PGO)

This dancing cat is overseeing Firefox's benchmarks.
If you do some Firefox performance work you had better be a cat person.

Treeherder has so far the best benchmarking infrastructure I have worked with. Being able to run benchmarks in a controlled and reproducible environment is very important, and I really like the ability to click on individual benchmarks and see the history and noise of Firefox mainline. It is also cute that one can mark individual regressions and link them with Bugzilla, and that there are performance sheriffs doing so.

GCC benchmarking has for years been ruled by simple scripts and gnuplot. Martin Liška recently switched to LNT, which has a lot of nice environments, but I guess LNT can borrow some ideas :)

This is a screenshot of the dancing cat page comparing binary sizes of GCC 8 LTO+PGO binary to Clang 6 LTO+PGO and other build metrics. Real version may take a while to load and will consider all changes as insignificant because it does not have data from multiple builds. Screenshot actually compares my build to trunk of 28th December 2018 which is not far from my branchpoint.

The "section sizes" values show the reduction in binary size. The largest is libxul.so, which goes down from 111MB to 82MB, a 26% reduction. Installer size reduces by 9.5%. What is not reported but is important is that the code segment reduces by 33%. Build time differences are within noise.

Number of warnings is red, but I guess more warnings are good when it comes to compiler comparison.

Screenshot of the dancing cat comparison of my GCC 8 LTO+PGO build to the official Clang 6 LTO+PGO build from the point where I branched. Dancing cat will give you a lot of extra information: you can look at sub-tests and it shows tests where changes are not considered important or are within noise. You can also click on a graph and see progress over time.

The following benchmarks see important improvements:

tp5o responsiveness (11.45%) tracks responsiveness of the browser on the tp5o page set. This is the 51 most popular webpages from 2011 according to Alexa, with 3 noisy ones removed. The list of webpages can be seen in the subtests of the tp5o benchmark.

There is an interesting discussion about it in Bugzilla. This is a complex test and I would like to believe that the win is caused by the careful choice of code size with respect to performance, but I have no real proof of that.
tps (5.09%) is a tab switching benchmark on the tp5o pageset.
dromaeo_dom (4.74%) is a synthetic benchmark testing manipulations with DOM trees. It consists of 4 subtests and thus the profile is quite flat. See subtests. Run it from the official dromaeo page.
sessionrestore (3.57%), as the name suggests, measures the time to restore a session. Again it looks like quite an interesting benchmark, exercising quite a large part of Firefox.
sessionrestore_no_auto_restore (3.05%) seems similar to the previous benchmark.
dromaeo_css (2.99%) is a synthetic benchmark testing CSS. It consists of 6 subtests and thus the profile is quite flat. See subtests. Run it from the official dromaeo page.
tp5o (2.43%) is a benchmark loading the tp5o webpages. This is a really nice overall performance test, I think. See subtests, which also list the pages. The improvements are quite uniform across the spectrum.

The following 3 benchmarks see important regressions:

displaylist_mutate (6.5%) measures the time taken to render a page after changing the display list. It looks like a good benchmark because its profile is very flat, which also made it hard for me to quickly pinpoint the problem. One thing I noticed is that the GCC-built binary has some processes showing up in the profile that do not show in Clang's, so it may be some kind of configuration difference.
You can run it yourself.
cpstartup (2.87%) tests the time to open a new tab (which starts a component process, I think) and get ready to load a page. This looks like an interesting benchmark, but since it is just 2.96% I did not run it locally. It may be just the fact that the train run does not really open/close many tabs and thus a lot of code is considered cold.
rasterflood_svg (2.7%) tests the speed of rendering square patterns. It spends significant time in hand-optimized vector rendering loops. I analysed the benchmark and reduced the regression as described below, since the profile is simple. I did not look at the remaining 2% difference. Run it yourself.

There are some additional changes considered unimportant but off-noise:

Improvements:

tpaint (4.77%)
tp5o_webext (2.59%)
tp6_facebook (2.53%)
tabpaint (2.43%)
about_preferences_basic (2.23%)
tp6_google (1.66%)
a11yr (1.66%)

Regressions:

tsvgx (1.28%)
tart (0.67%)

All these tests are described on the Talos webpage.

Out of 40 performance tests, 20 had off-noise changes, all but 5 in favour of GCC.
You can compare it with the report on the benefits of the switch from GCC 6 PGO to Clang 6 LTO+PGO in bug 1481721. Note that speedometer is no longer run as part of the Talos benchmarks. I ran it locally and the improvement over Clang was 5.3%. Clearly for both compilers LTO has an important effect on performance.

It is interesting to see that Firefox's official tests mix rather simple micro-benchmarks with more complex tests. This makes it a bit more difficult to actually understand the overall performance metrics.

Overall I see nothing fundamentally inferior in GCC's code generation and link-time optimization capabilities compared to Clang. In fact GCC's implementation of scalable LTO (originally called WHOPR, see also here) is more aggressive about whole program analysis (that is, it makes almost all inter-procedural decisions at whole-program scope) than Clang's ThinLTO (which by design does as much as possible at translation-unit scope, where translation workers may pick some code from other translation units as instructed by the linker). The ThinLTO design is inspired by the fact that almost all code quality benefits from LTO in today's compilers originate from unreachable code removal and inlining. On the other hand, optimizing at whole-program scope makes it possible to better balance code size and performance and to implement more transforms. I have spent a lot of time optimizing the compiler to get WHOPR scalable (which, of course, helped to clean up the middle-end in general). I am happy that so far the build times with GCC look very competitive and we have more room for experimenting with advanced LTO transformations.

Performance regressions turned out to be mostly compiler tuning issues that are easy to solve. An important exception is the problem with Clang-only extensions, which affects rasterflood_gradient and some tsvg subtests, explained in the section about Skia. Making the Skia vector code GCC compatible should not be terribly hard, as described later.

Update: I gave displaylist_mutate a second chance and found it is actually a missed inline. The GCC inliner is tuned a bit low for Firefox and can trade some more size for speed. Using --param inline-unit-growth=40 --param early-inlining-insns=20 fixes the regression and brings some really good improvements across the spectrum, while the binary is still 22% smaller than the Clang build. If I increase the limits even more, I get even more improvements. I will now celebrate the end of the year, and next year I will analyse this and write more.

I am in the process of fine-tuning the inliner for GCC 9, so I will take Firefox as an additional testcase.

Getting GCC 8 LTO+PGO builds to work.

Following Nathan's outline it was actually easy to update the configuration to fetch GCC 8 and build it instead of GCC 6.

I enabled LTO the same way as for the Clang build and added:

export AR="$topsrcdir/gcc/bin/gcc-ar"
export NM="$topsrcdir/gcc/bin/gcc-nm"
export RANLIB="$topsrcdir/gcc/bin/gcc-ranlib"

to the build configuration in build/unix/mozconfig.unix. This is needed to get LTO static libraries working correctly. Firefox already has the same defines for llvm-ar, llvm-nm and llvm-ranlib. Without this change one gets undefined symbols at link-time.

I added a patch to disable the watchdog to get profile data collected correctly. This is a problem I noticed previously and is now bug 1516081, which is my first experiment with the Firefox patch submission procedure (Phabricator), which I found particularly entertaining since it required me to sacrifice a few games from my phone in order to install an authenticating app.

The next problem to solve was an undefined symbol in the sandbox. This is fixed by the following patch taken from Martin Liška's Firefox RPM:

diff --git a/security/sandbox/linux/moz.build b/security/sandbox/linux/moz.build
--- a/security/sandbox/linux/moz.build
+++ b/security/sandbox/linux/moz.build
@@ -99,9 +99,8 @@ if CONFIG['CC_TYPE'] in ('clang', 'gcc')
# gcc lto likes to put the top level asm in syscall.cc in a different partition
# from the function using it which breaks the build. Work around that by
# forcing there to be only one partition.
-for f in CONFIG['OS_CXXFLAGS']:
-    if f.startswith('-flto') and CONFIG['CC_TYPE'] != 'clang':
-        LDFLAGS += ['--param lto-partitions=1']
+if CONFIG['CC_TYPE'] != 'clang':
+    LDFLAGS += ['--param', 'lto-partitions=1']

DEFINES['NS_NO_XPCOM'] = True
DisableStlWrapping()

The code to add necessary --param lto-partitions=1 already exists, but somehow it is not enabled correctly. I guess it was not updated for new --enable-lto. The problem here is that sandbox contains toplevel asm statement defining symbols. This is not supported for LTO (because there is no way to tell compiler that the symbol exists) and it is recommended to simply disable LTO in such cases. This is now bug 1516803.
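To make the sandbox problem more concrete, here is a minimal sketch of a top-level asm statement that defines a symbol for which the compiler never sees a definition. The names demo_asm_answer and call_asm_symbol are hypothetical and are not taken from syscall.cc; the assembly assumes x86-64 Linux with GCC or Clang, and other targets fall back to a plain C++ definition.

```cpp
#include <cassert>

#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
// Top-level asm: the assembler defines demo_asm_answer, but the compiler
// only ever sees the declaration below, never a definition.
asm(".pushsection .text\n"
    ".globl demo_asm_answer\n"
    ".type demo_asm_answer, @function\n"
    "demo_asm_answer:\n"
    "    movl $42, %eax\n"
    "    ret\n"
    ".size demo_asm_answer, .-demo_asm_answer\n"
    ".popsection\n");
#else
// Fallback for other targets so the sketch still builds (assumption).
extern "C" int demo_asm_answer() { return 42; }
#endif

extern "C" int demo_asm_answer();

// A function that uses the symbol defined by the asm block.
int call_asm_symbol()
{
    int value = demo_asm_answer();
    assert(value == 42);
    return value;
}
```

In a single translation unit without LTO this links fine. With LTO and several partitions, the asm block and call_asm_symbol may land in different partitions, which is the situation the --param lto-partitions=1 workaround avoids.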

Silencing new warnings

The official build uses -Werror, so compilation fails when warnings are produced. I had to disable a few warnings where GCC complains and Clang is happy.

I ended up disabling:

-Wodr. This is C++ One Definition Rule violation detector I wrote 5 years ago. It reports real bugs even though some of them may be innocent in practice.

In short, the C++ One Definition Rule (ODR) says that you should not have more than one definition of the same name. This is very hard to keep in a program of the size of Firefox unless you are very consistent with namespaces. An ODR violation can lead to surprises where, for example, a virtual method ends up being dispatched to a virtual method of a completely different class which happens to clash in name mangling. This is particularly dangerous when, as Firefox does, you link multiple versions of the same library into one binary.

These warnings are detected only with LTO. I started to look into fixes and found that GCC 8 is a bit too verbose. For example, it outputs an ODR violation on the class itself and then on every single method the class has (because the type of its this pointer parameter is mismatched). I silenced some of the warnings for GCC 9. GCC 9 now finds 23 violations, which are reported as bug 1516758. GCC 8 reported 97.
-Wlto-type-mismatch. This is warning about mismatched declarations across compilation units such as when you declare variable int in one unit but unsigned int in another. Those are real bugs and should be fixed. Again I reduced verbosity of this warning for GCC 9 so things are easier to analyse. Reported as bug 1516793.
-Walloc-size-larger-than=. This produces warnings when you try to allocate very large arrays. In case of Firefox the size is pretty gigantic.

GCC produces 20 warnings on Firefox and they do not seem particularly enlightening here.

audio_multi_vector_unittest.cc:36:68: warning: argument 1 value ‘18446744073709551615’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]

array_interleaved_ = new int16_t[num_channels_ * array_length()];
What it says is that the function was inlined and array_length ended up being a compile-time constant of 18446744073709551615=FFFFFFFFFFFFFFFF. It is a question why such a context exists.
-Wfree-nonheap-objects. As name suggests this warns when you call free on something which is clearly not on heap, such as automatic variable. It reported 4 places where this happens across quite large inline stacks so I did not debug them yet.
-Wstringop-overflow=. This warns when string operation would overrun its target. This detects 8 places which may or may not be possible buffer overflows. I did not analyse them.
-Wformat-overflow. This warns when e.g. the sprintf formatting string can lead to output larger than the destination buffer. It outputs diagnostics like:

video_capture_linux.cc:151:21: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 10 [-Wformat-overflow=]

sprintf(device, "/dev/video%d", (int) _deviceId);
Here someone apparently forgot about the 0 terminating the string. It triggers 15 times.
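The arithmetic behind the -Wformat-overflow diagnostic above can be checked with a short sketch. The helper names are hypothetical and this is not the code from video_capture_linux.cc; it only assumes a 32-bit int. The prefix "/dev/video" takes 10 bytes, %d may need up to 11 more for "-2147483648", and one more byte is needed for the terminating NUL.

```cpp
#include <cassert>
#include <climits>
#include <cstdio>
#include <string>

// Hypothetical helper: build "/dev/video<N>" without a fixed-size buffer.
std::string video_device_path(int device_id)
{
    // With a null buffer and size 0, snprintf only reports how many
    // characters are needed, excluding the terminating NUL.
    int needed = std::snprintf(nullptr, 0, "/dev/video%d", device_id);
    assert(needed > 0);

    // Reserve room for the characters plus the terminating NUL.
    std::string path(static_cast<std::size_t>(needed) + 1, '\0');
    std::snprintf(&path[0], path.size(), "/dev/video%d", device_id);
    path.resize(static_cast<std::size_t>(needed));  // drop the NUL again
    return path;
}

// Length of the longest possible output, excluding the terminating NUL.
int worst_case_length()
{
    return std::snprintf(nullptr, 0, "/dev/video%d", INT_MIN);
}
```

A fixed buffer is fine too, as long as it holds worst_case_length() + 1 bytes and the call uses snprintf with the buffer size.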
Martin Liška and Jakub Jelínek have patches which I hope will get upstream soon.
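Returning to the -Walloc-size-larger-than= diagnostic quoted above: the reported value 18446744073709551615 is exactly what -1 looks like once converted to a 64-bit size_t, and 9223372036854775807 is PTRDIFF_MAX on such a target. The sketch below only illustrates that arithmetic. The helper names are hypothetical, and that a -1 is what actually reaches the new expression in the Firefox code is an assumption, not something established here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: what a signed value becomes when used as a size.
std::size_t as_size(long long value)
{
    return static_cast<std::size_t>(value);
}

// True if a request of this many bytes is larger than PTRDIFF_MAX,
// which is the maximum object size named in the diagnostic.
bool exceeds_max_object_size(std::size_t bytes)
{
    return bytes > static_cast<std::size_t>(PTRDIFF_MAX);
}

// Hypothetical guarded count: refuse sizes that cannot be valid.
std::size_t checked_count(long long value)
{
    std::size_t count = as_size(value);
    assert(!exceeds_max_object_size(count));
    return count;
}
```

Here as_size(-1) equals SIZE_MAX, so any allocation sized by it is certain to fail, which is what the warning points out.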

Supplying old libstdc++ to pass ABI compatibility test

After getting first binary, I ran into problem that GCC 8 built binaries require new libstdc++ which was not accepted by ABI checks and if you disable those tests the benchmarking server will not run the binary.

Normally one can install multiple versions of GCC and use -I and -L to link against an older libstdc++. I did not want to spend too much time on figuring out how to get the official build system to install two GCC versions at once, so I simply made my own tarball of GCC 8.2 where I replaced libstdc++ by the one from GCC 6.4.

Working around the two GCC bugs

The first bug I wrote about previously. It is related to the way GCC remembers optimization options from individual compilation commands and combines them together during link-time optimization. There was an omission in the transformation which merges static constructors and destructors together, which made it pick random flags. By bad luck those flags happened to include -mavx2 and thus binaries crashed on the Bulldozer machine I use for remote builds (they would probably still work in Talos). It is easy to work around this by adding -fdisable-ipa-cdtor to LDFLAGS.

The second bug reproduces with GCC 8 and PGO builds only. Here GCC decides to inline into a thunk, which in turn triggers an omission in the analysis pass of devirtualization. Normally C++ methods take a pointer to the corresponding object as the this pointer. Thunks are special because they take a pointer adjusted by some offset. It needs a lot of bad luck to trigger this and get wrong code, and I am very grateful to Jakub Jelínek, who spent his afternoon isolating a testcase.
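For readers who have not met thunks before, the following sketch shows where the adjusted pointer comes from. The class names are hypothetical and unrelated to the spell checker code, and the layout described assumes the Itanium C++ ABI that GCC and Clang use on Linux. With multiple inheritance the second base sits at a non-zero offset, so a virtual call made through a pointer to that base has to shift this back before entering the overrider, and that small adjusting stub is the thunk.

```cpp
#include <cassert>
#include <cstddef>

struct BaseA {
    virtual ~BaseA() = default;
    virtual int a() { return 1; }
    long pad_a = 0;
};

struct BaseB {
    virtual ~BaseB() = default;
    virtual int b() { return 2; }
    long pad_b = 0;
};

// Derived overrides b(), which was introduced by the second base.
struct Derived : BaseA, BaseB {
    int b() override { return 3; }
};

// Byte offset of the BaseB subobject inside a Derived object.
std::ptrdiff_t base_b_offset(Derived& d)
{
    BaseB* as_b = &d;  // the implicit conversion adjusts the pointer
    return reinterpret_cast<char*>(as_b) - reinterpret_cast<char*>(&d);
}

// The call goes through BaseB's vtable slot; the pointer passed in points
// at the BaseB subobject, so it must be adjusted to reach Derived::b.
int call_through_base_b(BaseB* b)
{
    assert(b != nullptr);
    return b->b();
}
```

Calling call_through_base_b with the address of a Derived object still returns the Derived result, even though the pointer handed to the function is not the address of the Derived object itself.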

I use:

diff --git a/extensions/spellcheck/src/moz.build b/extensions/spellcheck/src/moz.build
--- a/extensions/spellcheck/src/moz.build
+++ b/extensions/spellcheck/src/moz.build
@@ -28,3 +28,8 @@ EXPORTS.mozilla += [

if CONFIG['CC_TYPE'] in ('clang', 'gcc'):
    CXXFLAGS += ['-Wno-error=shadow']
+
+# spell checker triggers bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88561
+# in gcc 7 and 8. It will be fixed in GCC 7.5 and 8.3
+if CONFIG['CC_TYPE'] in ('gcc'):
+    CXXFLAGS += ['-fno-devirtualize']

I am not happy that GCC bugs were triggered (and both mine), but they are clearly rare enough that they were never reported before.

Performance analysis

My first run of benchmarks has shown some notable regressions (note that this is with only one run of the tests, so smaller changes are considered noise).

Important regressions were:

rasterflood_gradient 75%. This renders some pink-orange blobs. Run it yourself.
rasterflood_svg 13%. This renders rotating squares (with squary hairs).
tsvg_static 18%. This consists of 4 subtests with regressions: rendering transparent copies of Kyoto and rendering rotated copies of Kyoto.
tsvgx 36%. This again consists of 7 subtests. A massive 493% regression is on the test drawing blue-orange blobs. The second test renders nothing in my version of Firefox (official Tumbleweed RPM) but it renders weird looking green-blue jigsaw pieces in Chrome. The last test which regresses renders a green map.

I have filed bug 1516791, but it may be caused by the fact that I need to add some JavaScript into the test to get it running out of Talos and I am no JavaScript expert.

Update: as explained in the bug, it was my mistake. This test needs the additional file smallcats.gif and that strange jigsaw puzzle is actually a broken image icon.
displaylist_mutate 8%. This benchmark I could not analyse easily. It consists of sub-tests that all look alike and have a flat profile.

Fortunately all those benchmarks except for displaylist_mutate are of a micro-benchmark nature and it is easy to analyse them. Overall I think it is a good sign about GCC 8 built performance out of the box: if you do not depend on rendering performance of shader animations, a GCC 8 built binary will probably perform well for you.

Skia (improving rasterflood_gradient and tsvgx)

Skia is a graphics rendering library which is shared by Firefox and Chrome. It is responsible for performance in two benchmarks: rasterflood_gradient and the massively regressing tsvgx subtest. I like pink-orange blobs better and thus looked into rasterflood_gradient. The official build profile was:

Samples: 155K of event 'cycles:uppp', Event count (approx.): 98151072755
Overhead Command          Shared Object                  Symbol
13.32% PaintThread      libxul.so                      [.] hsw::lowp::gradient
   7.88% PaintThread      libxul.so                      [.] S32A_Blend_BlitRow32_SSE2
   5.20% PaintThread      libxul.so                      [.] hsw::lowp::xy_to_radius
   4.14% PaintThread      libxul.so                      [.] hsw::lowp::matrix_scale_translate
   3.97% PaintThread      libxul.so                      [.] hsw::lowp::store_bgra
   3.77% PaintThread      libxul.so                      [.] hsw::lowp::seed_shader

while GCC profile was:

Samples: 151K of event 'cycles:uppp', Event count (approx.): 101825662252
Overhead Command          Shared Object               Symbol
17.64% PaintThread      libxul.so                   [.] hsw::lowp::gradient
   6.51% PaintThread      libxul.so                   [.] hsw::lowp::store_bgra
   6.36% PaintThread      libxul.so                   [.] hsw::lowp::xy_to_radius
   5.40% PaintThread      libxul.so                   [.] S32A_Blend_BlitRow32_SSE2
   4.73% PaintThread      libxul.so                   [.] hsw::lowp::matrix_scale_translate
   4.53% PaintThread      libxul.so                   [.] hsw::lowp::seed_shader

So there is only a few percent difference on my setup (as opposed to 75% on the try server), but it is clearly related to hsw::lowp::gradient, which promptly pointed me to the Skia library. Hsw actually stands for Haswell and it is hand-optimized vector rendering code which is used for my Skylake CPU.

#if defined(clang) issues

I looked into the sources and noticed two funny hacks. Someone enabled the always_inline attribute only for Clang, which I fixed by this patch. And there was apparently a leftover hack in the Firefox copy of Skia (which was never part of official Skia) disabling AVX optimization on all non-Clang compilers. Fixed by this patch. That patch also fixes another disabled always_inline with a comment about a compile time regression with GCC. Those did not reproduce for me. I also experimented with setting -mtune=haswell on those files since I suspected that AVX vectorization with generic tuning may be off. I never got the idea to test it, since I expected people to use -march=<cpu> in this case.

I was entertained to notice that Clang actually defines __GNUC__. Shall GCC also define __clang__?

With this the rasterflood_gradient regression reduced from 75% to 39% and the tsvgx subtest from 493% to 75%.
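Since Clang defines __GNUC__ as well, the order and choice of the preprocessor checks matters. The sketch below shows one way to write such macros so that GCC does not silently lose always_inline. The DEMO_ macro names and blend_channel are hypothetical and are not the macros used in Skia or Firefox.

```cpp
#include <cassert>

// Clang also defines __GNUC__, so test __clang__ first when the two
// compilers really need to be told apart.
#if defined(__clang__)
#  define DEMO_COMPILER_NAME "clang"
#elif defined(__GNUC__)
#  define DEMO_COMPILER_NAME "gcc"
#else
#  define DEMO_COMPILER_NAME "other"
#endif

// The always_inline attribute is understood by both GCC and Clang, so
// keying it on __GNUC__ covers both instead of excluding GCC.
#if defined(__GNUC__)
#  define DEMO_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#  define DEMO_ALWAYS_INLINE inline
#endif

// Hypothetical hot helper of the kind one would force inline.
DEMO_ALWAYS_INLINE int blend_channel(int src, int dst, int alpha)
{
    assert(alpha >= 0 && alpha <= 255);
    return (src * alpha + dst * (255 - alpha)) / 255;
}

const char* compiler_name()
{
    return DEMO_COMPILER_NAME;
}
```

Guarding a feature with a check for __clang__ alone, as the hacks described above did, turns it off for every other compiler even when that compiler supports it.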

AVX256 versus SSE2 code

From profiles I worked out that the rest of the difference is again caused by #ifdef machinery. This time it is, however, not so easy to undo. The Firefox version of Skia contains two implementations of the internal loops. One is Clang-only, using vector extensions, while the other is used for GCC and MSVC and uses Intel's official ?mmintrin.h API. The second version was never ported to AVX, and the avx/hsw loops still used 128-bit SSE vectors and API, just compiled with the new ISA enabled.

I have downloaded upstream Skia sources and realized that a few weeks ago the MSVC and GCC path was dropped completely and Skia now defaults to a scalar implementation on those compilers.

Its webpage mentions:

A note on software backend performance
A number of routines in Skia’s software backend have been written to run fastest when compiled by Clang. If you depend on software rasterization, image decoding, or color space conversion and compile Skia with GCC, MSVC or another compiler, you will see dramatically worse performance than if you use Clang.
This choice was only a matter of prioritization; there is nothing fundamentally wrong with non-Clang compilers. So if this is a serious issue for you, please let us know on the mailing list.

Clang's vector extensions are in fact GNU vector extensions and thus I concluded that it should not be that hard to port Skia to GCC again, and decided to give it a try. It is about 3000 lines of code. I got it to compile with GCC in an hour or two, but it turned out that more work would be necessary. It does not make sense to spend too much time on it if it can not be upstreamed, so I plan to discuss it with the Skia developers.

The problem is in:

template <typename T> using V = T __attribute__((ext_vector_type(size)));

This is not understood by GCC. ext_vector_type is a Clang-only extension of the GNU vector extensions, of which I found a brief mention in the Clang Language Extensions manual. There they are called "OpenCL vectors".

I also noticed that replacing ext_vector_type by the GNU equivalent vector_size makes GCC unhappy about using attributes on the right hand side, for which I filed PR88600, and Alexander advised me to use instead:

template <typename T> using V [[gnu::vector_size (sizeof(T)*8)]] = T;

This gives an equivalent vector type, but unfortunately not equivalent semantics. Clang, depending on the attribute (ext_vector_type or vector_size), accepts different kinds of code. In particular, the following constructions are rejected for gnu::vector_size by both GCC and Clang but accepted for ext_vector_type:

typedef __attribute__ ((ext_vector_type (4))) int int_vector;
typedef __attribute__ ((vector_size (16))) float float_vector;

int_vector a;
float_vector b;

void
test()
{
  a=1;  /* broadcast: a becomes [1,1,1,1] */
  a=b;  /* copies the bits of b, no float to int conversion */
}

Here I declare one OpenCL variable a consisting of 4 integers and one GNU vector extension variable b consisting of 4 floats (4*4=16 is the size of the vector type in bytes). The code has the semantics of writing the integer vector [1,1,1,1] to a and then assigning the float vector b to a without any conversion (the bits are reinterpreted, thus giving funny values).

Replacing the OpenCL vector by a GNU vector makes both compilers reject both statements. But one can fix it as:

typedef __attribute__ ((vector_size (16))) int int_vector;
typedef __attribute__ ((vector_size (16))) float float_vector;

int_vector a;
float_vector b;

void
test()
{
  a=(int_vector){} + 1;  /* zero vector plus scalar 1 gives [1,1,1,1] */
  a=(int_vector)b;       /* explicit cast reinterprets the bits of b */
}

The construct (int_vector){} + 1 was recommended to me by Jakub Jelínek: the first part builds a zero vector and then adds 1, which is a vector-scalar addition that is accepted by GNU vector extensions.

The explicit casts are required intentionally because the semantics contradicts the usual C meaning (without the vector attribute, an integer to float conversion would happen) and that is why users are required to write them by hand. It is funny that Clang actually also requires the cast when both vectors are OpenCL or both are GNU. It only accepts a=b if one vector is GNU and the other OpenCL. Skia uses these casts often in places where it calls ?mmintrin.h functions.
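The difference between reinterpreting bits and converting values is easy to mix up, so here is a small sketch that uses only GNU vector_size types and is meant to build with both GCC and Clang. The helper names are hypothetical, and the element-wise loop in to_float is just a stand-in for a value conversion, not the workaround used in my patch.

```cpp
#include <cassert>

typedef int   int_vector   __attribute__((vector_size(16)));
typedef float float_vector __attribute__((vector_size(16)));

// Broadcast a scalar: a zero vector plus a scalar gives [x,x,x,x].
int_vector splat(int x)
{
    return int_vector{} + x;
}

// Explicit casts between vector types of equal size reinterpret the bits.
float_vector bits_as_float(int_vector v) { return (float_vector)v; }
int_vector   bits_as_int(float_vector v) { return (int_vector)v; }

// Value conversion has to be spelled out separately, here lane by lane.
float_vector to_float(int_vector v)
{
    float_vector result = {};
    for (int i = 0; i < 4; ++i)
        result[i] = static_cast<float>(v[i]);
    return result;
}
```

With these helpers, splat(1) converted by to_float holds 1.0f in every lane, while bits_as_float(splat(1)) holds the float whose bit pattern is the integer 1, which is a tiny denormal rather than 1.0f.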

I thus ended up with a longish patch adding all the conversions. I also had to work around the non-existence of __builtin_convertvector, which would be a reasonable builtin to have (and there is PR85052 for that). I ended up with code that compiles and renders some stuff correctly but other stuff incorrectly.

I therefore decided to discuss this with the upstream maintainers first and made only a truly non-upstreamable hack. I replaced the source files SkOpts_avx.cpp, SkOpts_hsw.cpp, SkOpts_sse41.cpp, SkOpts_sse42.cpp and SkOpts_ssse3.cpp. This, of course, solved the two regressions but there is work ahead.

I also filed PR88602 to track the lack of ext_vector_type in GCC, but I am not quite sure whether it is desirable to have. I hope there is some kind of specification and/or design rationale somewhere.

2d rendering performance (improving rasterflood_svg)

The performance regression seems to be caused by the fact that GCC honors always_inline across translation units. I ended up disabling always_inline in mfbt/Attributes.h, which solved most of the regression. I will identify the wrong use and submit a patch later.

The rest of the performance difference turned out to be CPU tuning. Consider compiling:

#include <emmintrin.h>
__m128i test (short a,short b,short c,short d,short e,short f,short g)
{
  return _mm_set_epi16(a,b,c,d,e,f,g,255);
}

GCC 8 with generic tuning compiles this into a rather long sequence of integer operations, a store and vector loads, while Clang uses integer to SSE stores. This is because GCC 8 still optimizes for Bulldozer in its generic tuning model and integer to vector stores are expensive there. I did a pretty detailed re-tuning of the generic setting for GCC 8 and I remember that I decided to keep this flag as it was not hurting newer CPUs much and was important for Bulldozer, but I missed the effect on hand-vectorized code. I will change the tuning flag for GCC 9.

I added -mtune-ctrl=inter_unit_moves_to_vec to the compilation flags to change the bits. Clearly the benchmarking servers are not using Bulldozer, where indeed the GCC version runs about 6% faster.

Tweaking train run

While looking into the performance problems of SVG rendering I noticed that the code is optimized for size because it was never executed during the train run. Because Clang does not optimize cold regions for size, this problem is not very visible in Clang benchmarks, but I have added displaylist_mutate.html, rasterflood_svg.html and hixie-007.html into the train run. These are the same as used by Talos, just modified to run longer and out of the Talos scripting machinery. I thus did not file a bug report for this yet.

I have tested it and it seems to have only an in-noise effect on Clang, but the same issue was hit years ago with MSVC, as reported in bug 977658.

It seems there are some more instances where the train run can be improved, and probably those tests could be combined into one to not increase build times too much. Candidates are those two regressions in perf micro benchmarks, which I have verified to indeed execute cold code.

I have also noticed that training of hardware specific loops is done only on the architecture which is used by build server. This is now bug 1516809.

-fdata-sections -ffunction-sections

These two flags put every function and variable into a separate linker section. This lets the linker manipulate them independently, which is used to remove unreachable code and also for identical code folding in Gold. They however also have overhead, adding extra alignment padding and preventing the assembler from using the short form of branch instructions.

Without LTO the linker optimization saves about 5MB of binary size for the GCC build and 8MB for Clang. With LTO it however does not make much sense because the compiler can do these transforms itself. While GCC may not merge all functions with identical assembly that the linker can detect, and vice versa, it is a win to disable those flags, by about 1MB and better link times. This is now bug 1516842.

-fno-semantic-interposition

Clang ignores the fact that in ELF libraries symbols can be interposed by a different implementation. When comparing performance of PIC code between compilers, it is always good to use -fno-semantic-interposition in GCC to get an apples-to-apples comparison. The effect on Firefox is not great (about 100KB of binary size difference) because it declares many symbols as hidden, but it prevents more performance surprises.

Implementation monoculture

I consider Firefox an amazing test-case for link-time optimization development because:

it has huge and dynamically changing code-base where large part is in modern C++,
it encompasses different projects with divergent coding styles,
it has a lot of low level code and optimization hacks which use random extensions,
it has decent benchmarks, some of which have been hand optimized for a while,
it can be benchmarked from command line with reasonable confidence.

This makes it possible to do tests that can not be done by running usual SPEC benchmarks which are order of magnitude smaller, sort of standard compliant, and written many years ago. Real world testing is essential both to make individual features (such as LTO or PGO) production ready and to fine tune configuration of the toolchain for practical use.

For this reason I am a bit sad about the switch to a single compiler. It is not clear to me whether Firefox will stay portable next year. Nathan wrote an interesting blog post, "when an implementation monoculture might be the right thing". I understand his points, but on the other hand it seems that no super-human efforts were needed to get decent performance out of GCC builds. From my personal experience, maintaining a portable codebase may not always be fun but it pays back in the long term.

Note that although Chromium switched to Clang-only builds some time ago, it still builds with GCC (GCC is used to build Chromium for the Tumbleweed RPM package, for example), so there is some hope that the community will maintain compatibility, but I would not bet my shoes on it.

It would make sense to me to add both GCC and Clang builds into Mozilla nightly testing. This should:

Uncover portability issues and other bugs due to different warnings and C++ implementations in both compilers.
Make the code-base easier to maintain and port in long term
Keep GCC and LLVM toolchains developers interested in optimizing Firefox codebase by providing more real-world benchmarks than SPEC, Polyhedron and Phoronix benchmarks (which are way too small for LTO and often departed from reality).
Benefit from fact that toolchain bugs will be identified and fixed before new releases

At least in my short experiment I was easily able to identify problems in all three projects (and fix some of them).

Communication with maintainers of Firefox packages in individual Linux distros

It seems to me that it would be good to communicate better performance settings to the authors of packages used by individual distros. I think most of us do not download Firefox by hand and simply use the one provided by the particular Linux distribution. Seeing the communication I had with the Martins concerning SUSE's and Red Hat's packages, it seems very hard for packagers to reproduce the Firefox PGO build setup which is critical to get a good binary. One of the things that would improve the situation is to make the build system fail if the train run fails, which I filed as bug 1516872.

It would be nice to set up some page which lists things that are important for a quality build and which provides links to benchmarks checking that the distribution provided Firefox is comparable to official one.

It may sound funny, but even though I have been looking at Firefox performance since 2010, until this December it did not cross my mind that I can actually download the official package and benchmark against it. It is not usual that reproducing a quality build would need such effort, but it is a consequence of the complexity of the code-base.

Future plans

I already did some useful work on GCC 9 during last two weeks:

I plan to continue working on this by updating my setup to GCC 9 and making sure it will do well on Firefox once released. I will also look deeper into performance with -O2 and the remaining displaylist_mutate regression and try to find time to write another update.


This dancing cat is overseeing Firefox's benchmarks. If you do some Firefox performance work you had better be a cat person.