Adapts __ppc_get_timebase to the upcoming GCC 4.8 that provides
__builtin_ppc_get_timebase. Building applicationns with previous
versions of GCC will continue to use the internal implementation.
Initially based on the versions found in wcsmbs/* ; these files have
been changed by hand unrolling, and adding some additional variables
to allow some read-ahead to occur, which then relieves some of the
wait-for-increment/wait-for-load/wait-for-compare-results pressure
that was slowing down every iteration through the while-loop.
For 64-bit Power7, These changes give an approx 20% throughput boost
for the wcschr and wcsrchr functions; and approx 40% boost for the
wcscpy function. 32-bit improvements appear to be slightly better
with ~ %30 and ~ %45 respectively. Results for Power6 closely match
those for power7.
Assorted tweaking, twisting and tuning to squeeze a few additional cycles
out of the memchr code. Changes include bypassing the shift pairs
(sld,srd) when they are not required, and unrolling the small_loop that
handles short and trailing strings.
Per scrollpipe data measuring aligned strings for 64-bit, these changes
save between five and eight cycles (9-13% overall) for short strings (<32),
Longer aligned strings see slight improvement of 1-3% due to bypassing the
shifts and the instruction rearranging.
[BZ #13743]
A new class of installed headers has been documented for low-level
platform-specific functionality. PowerPC added the first instance with a
function to provide time base register access (__ppc_get_timebase). This
is required for applications that measure time at high frequencies with
high precision that can't afford a syscall.
Adjustments for libm ulps added with commit d8b82cad1b525bdcbfff88d218c7c45032e4a3af,
495fd99f3a119e5c0c542ccc6cf9c93b1fb9e892, and 5ba3cc691c856e5c67a7d4cd4713f20a79f7ba81.
I also adjusted some exp10 ulps definition that was higher than needed.
In the past the "-ftree-loop-linear" switch provided a measurable
improvement in performance for certain functions. At some point it
was assigned as the responsibility of Graphite in GCC. It has been
found that even with Graphite enabled these flags no longer perform
any appreciable improvement over the baseline.
Graphite now has some open bugs which need to be fixed in order for it
to provide measurable performance improvements but it lacks active
development. As a result some compiler distributors may disable
Graphite. If Graphite is disabled then building GLIBC will fail if
the "-ftree-loop-linear" switch is used.
This patch removes the use of "-ftree-loop-linear" as unnecessary.
This patch provides optimized logb (1.2x on PPC32 and 2.5x on PPC64),
logbf (1.1x on PPC32 and 2.2x on PPC64), and logbl (1.3x on PPC32 and
50% on PPC64) for the POWER7 processor.