1. 05 Apr, 2019 1 commit
    • Peter Zijlstra's avatar
      lib/int_sqrt: optimize initial value compute · 627f9c3a
      Peter Zijlstra authored
      commit f8ae107e upstream.
      
      The initial value (@m) compute is:
      
      	m = 1UL << (BITS_PER_LONG - 2);
      	while (m > x)
      		m >>= 2;
      
      Which is a linear search for the highest even bit smaller or equal to @x
      We can implement this using a binary search using __fls() (or better when
      its hardware implemented).
      
      	m = 1UL << (__fls(x) & ~1UL);
      
      Especially for small values of @x; which are the more common arguments
      when doing a CDF on idle times; the linear search is near to worst case,
      while the binary search of __fls() is a constant 6 (or 5 on 32bit)
      branches.
      
            cycles:                 branches:              branch-misses:
      
      PRE:
      
      hot:   43.633557 +- 0.034373  45.333132 +- 0.002277  0.023529 +- 0.000681
      cold: 207.438411 +- 0.125840  45.333132 +- 0.002277  6.976486 +- 0.004219
      
      SOFTWARE FLS:
      
      hot:   29.576176 +- 0.028850  26.666730 +- 0.004511  0.019463 +- 0.000663
      cold: 165.947136 +- 0.188406  26.666746 +- 0.004511  6.133897 +- 0.004386
      
      HARDWARE FLS:
      
      hot:   24.720922 +- 0.025161  20.666784 +- 0.004509  0.020836 +- 0.000677
      cold: 132.777197 +- 0.127471  20.666776 +- 0.004509  5.080285 +- 0.003874
      
      Averages computed over all values <128k using a LFSR to generate order.
      Cold numbers have a LFSR based branch trace buffer 'confuser' ran between
      each int_sqrt() invocation.
      
      Link: http://lkml.kernel.org/r/20171020164644.936577234@infradead.org
      
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Suggested-by: default avatarJoe Perches <joe@perches.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Anshul Garg <aksgarg1989@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: David Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Michael Davidson <md@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      627f9c3a
  2. 27 Mar, 2019 1 commit
    • Peter Zijlstra's avatar
      lib/int_sqrt: optimize small argument · e6008a05
      Peter Zijlstra authored
      commit 3f329570 upstream.
      
      The current int_sqrt() computation is sub-optimal for the case of small
      @x.  Which is the interesting case when we're going to do cumulative
      distribution functions on idle times, which we assume to be a random
      variable, where the target residency of the deepest idle state gives an
      upper bound on the variable (5e6ns on recent Intel chips).
      
      In the case of small @x, the compute loop:
      
      	while (m != 0) {
      		b = y + m;
      		y >>= 1;
      
      		if (x >= b) {
      			x -= b;
      			y += m;
      		}
      		m >>= 2;
      	}
      
      can be reduced to:
      
      	while (m > x)
      		m >>= 2;
      
      Because y==0, b==m and until x>=m y will remain 0.
      
      And while this is computationally equivalent, it runs much faster
      because there's less code, in particular less branches.
      
            cycles:                 branches:              branch-misses:
      
      OLD:
      
      hot:   45.109444 +- 0.044117  44.333392 +- 0.002254  0.018723 +- 0.000593
      cold: 187.737379 +- 0.156678  44.333407 +- 0.002254  6.272844 +- 0.004305
      
      PRE:
      
      hot:   67.937492 +- 0.064124  66.999535 +- 0.000488  0.066720 +- 0.001113
      cold: 232.004379 +- 0.332811  66.999527 +- 0.000488  6.914634 +- 0.006568
      
      POST:
      
      hot:   43.633557 +- 0.034373  45.333132 +- 0.002277  0.023529 +- 0.000681
      cold: 207.438411 +- 0.125840  45.333132 +- 0.002277  6.976486 +- 0.004219
      
      Averages computed over all values <128k using a LFSR to generate order.
      Cold numbers have a LFSR based branch trace buffer 'confuser' ran between
      each int_sqrt() invocation.
      
      Link: http://lkml.kernel.org/r/20171020164644.876503355@infradead.org
      Fixes: 30493cc9
      
       ("lib/int_sqrt.c: optimize square root algorithm")
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Suggested-by: default avatarAnshul Garg <aksgarg1989@gmail.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michael Davidson <md@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e6008a05
  3. 30 Apr, 2013 1 commit
    • Davidlohr Bueso's avatar
      lib/int_sqrt.c: optimize square root algorithm · 30493cc9
      Davidlohr Bueso authored
      Optimize the current version of the shift-and-subtract (hardware)
      algorithm, described by John von Newmann[1] and Guy L Steele.
      
      Iterating 1,000,000 times, perf shows for the current version:
      
       Performance counter stats for './sqrt-curr' (10 runs):
      
               27.170996 task-clock                #    0.979 CPUs utilized            ( +-  3.19% )
                       3 context-switches          #    0.103 K/sec                    ( +-  4.76% )
                       0 cpu-migrations            #    0.004 K/sec                    ( +-100.00% )
                     104 page-faults               #    0.004 M/sec                    ( +-  0.16% )
              64,921,199 cycles                    #    2.389 GHz                      ( +-  0.03% )
              28,967,789 stalled-cycles-frontend   #   44.62% frontend cycles idle     ( +-  0.18% )
         <not supported> stalled-cycles-backend
             104,502,623 instructions              #    1.61  insns per cycle
                                                   #    0.28  stalled cycles per insn  ( +-  0.00% )
              34,088,368 branches                  # 1254.587 M/sec                    ( +-  0.00% )
                   4,901 branch-misses             #    0.01% of all branches          ( +-  1.32% )
      
             0.027763015 seconds time elapsed                                          ( +-  3.22% )
      
      And for the new version:
      
      Performance counter stats for './sqrt-new' (10 runs):
      
                0.496869 task-clock                #    0.519 CPUs utilized            ( +-  2.38% )
                       0 context-switches          #    0.000 K/sec
                       0 cpu-migrations            #    0.403 K/sec                    ( +-100.00% )
                     104 page-faults               #    0.209 M/sec                    ( +-  0.15% )
                 590,760 cycles                    #    1.189 GHz                      ( +-  2.35% )
                 395,053 stalled-cycles-frontend   #   66.87% frontend cycles idle     ( +-  3.67% )
         <not supported> stalled-cycles-backend
                 398,963 instructions              #    0.68  insns per cycle
                                                   #    0.99  stalled cycles per insn  ( +-  0.39% )
                  70,228 branches                  #  141.341 M/sec                    ( +-  0.36% )
                   3,364 branch-misses             #    4.79% of all branches          ( +-  5.45% )
      
             0.000957440 seconds time elapsed                                          ( +-  2.42% )
      
      Furthermore, this saves space in instruction text:
      
         text    data     bss     dec     hex filename
          111       0       0     111      6f lib/int_sqrt-baseline.o
           89       0       0      89      59 lib/int_sqrt.o
      
      [1] http://en.wikipedia.org/wiki/First_Draft_of_a_Report_on_the_EDVAC
      
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr.bueso@hp.com>
      Reviewed-by: default avatarJonathan Gonzalez <jgonzlez@linets.cl>
      Tested-by: default avatarJonathan Gonzalez <jgonzlez@linets.cl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      30493cc9
  4. 07 Mar, 2012 1 commit
  5. 03 Feb, 2006 1 commit
  6. 16 Apr, 2005 1 commit
    • Linus Torvalds's avatar
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4