1. 29 Jul, 2019 1 commit
  2. 19 Jun, 2019 1 commit
    • Chris Wilson's avatar
      drm/i915: Make the semaphore saturation mask global · 44d89409
      Chris Wilson authored
      The idea behind keeping the saturation mask local to a context backfired
      spectacularly. The premise with the local mask was that we would be more
      proactive in attempting to use semaphores after each time the context
      idled, and that all new contexts would attempt to use semaphores
      ignoring the current state of the system. This turns out to be horribly
      optimistic. If the system state is still oversaturated and the existing
      workloads have all stopped using semaphores, the new workloads would
      attempt to use semaphores and be deprioritised behind real work. The
      new contexts would not switch off using semaphores until their initial
      batch of low priority work had completed. Given sufficient backload load
      of equal user priority, this would completely starve the new work of any
      GPU time.
      
      To compensate, remove the local tracking in favour of keeping it as
      global state on the engine -- once the system is saturated and
      semaphores are disabled, everyone stops attempting to use semaphores
      until the system is idle again. One of the reason for preferring local
      context tracking was that it worked with virtual engines, so for
      switching to global state we could either do a complete check of all the
      virtual siblings or simply disable semaphores for those requests. This
      takes the simpler approach of disabling semaphores on virtual engines.
      
      The downside is that the decision that the engine is saturated is a
      local measure -- we are only checking whether or not this context was
      scheduled in a timely fashion, it may be legitimately delayed due to user
      priorities. We still have the same dilemma though, that we do not want
      to employ the semaphore poll unless it will be used.
      
      v2: Explain why we need to assume the worst wrt virtual engines.
      
      Fixes: ca6e56f6
      
       ("drm/i915: Disable semaphore busywaits on saturated systems")
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
      Reviewed-by: default avatarTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190618074153.16055-8-chris@chris-wilson.co.uk
      44d89409
  3. 14 Jun, 2019 2 commits
  4. 28 May, 2019 1 commit
  5. 04 May, 2019 1 commit
    • Chris Wilson's avatar
      drm/i915: Disable semaphore busywaits on saturated systems · ca6e56f6
      Chris Wilson authored
      
      Asking the GPU to busywait on a memory address, perhaps not unexpectedly
      in hindsight for a shared system, leads to bus contention that affects
      CPU programs trying to concurrently access memory. This can manifest as
      a drop in transcode throughput on highly over-saturated workloads.
      
      The only clue offered by perf, is that the bus-cycles (perf stat -e
      bus-cycles) jumped by 50% when enabling semaphores. This corresponds
      with extra CPU active cycles being attributed to intel_idle's mwait.
      
      This patch introduces a heuristic to try and detect when more than one
      client is submitting to the GPU pushing it into an oversaturated state.
      As we already keep track of when the semaphores are signaled, we can
      inspect their state on submitting the busywait batch and if we planned
      to use a semaphore but were too late, conclude that the GPU is
      overloaded and not try to use semaphores in future requests. In
      practice, this means we optimistically try to use semaphores for the
      first frame of a transcode job split over multiple engines, and fail if
      there are multiple clients active and continue not to use semaphores for
      the subsequent frames in the sequence. Periodically, we try to
      optimistically switch semaphores back on whenever the client waits to
      catch up with the transcode results.
      
      With 1 client, on Broxton J3455, with the relative fps normalized by %cpu:
      
      x no semaphores
      + drm-tip
      * patched
      +------------------------------------------------------------------------+
      |                                                    *                   |
      |                                                    *+                  |
      |                                                    **+                 |
      |                                                    **+  x              |
      |                                x               *  +**+  x              |
      |                                x  x       *    *  +***x xx             |
      |                                x  x       *    * *+***x *x             |
      |                                x  x*   +  *    * *****x *x x           |
      |                         +    x xx+x*   + ***   * ********* x   *       |
      |                         +    x xx+x*   * *** +** ********* xx  *       |
      |    *   +         ++++*  +    x*x****+*+* ***+*************+x*  *       |
      |*+ +** *+ + +* + *++****** *xxx**********x***+*****************+*++    *|
      |                                   |__________A_____M_____|             |
      |                           |_______________A____M_________|             |
      |                                 |____________A___M________|            |
      +------------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x 120       2.60475       3.50941       3.31123     3.2143953    0.21117399
      + 120        2.3826       3.57077       3.25101     3.1414161    0.28146407
      Difference at 95.0% confidence
      	-0.0729792 +/- 0.0629585
      	-2.27039% +/- 1.95864%
      	(Student's t, pooled s = 0.248814)
      * 120       2.35536       3.66713        3.2849     3.2059917    0.24618565
      No difference proven at 95.0% confidence
      
      With 10 clients over-saturating the pipeline:
      
      x no semaphores
      + drm-tip
      * patched
      +------------------------------------------------------------------------+
      |                     ++                                        **       |
      |                     ++                                        **       |
      |                     ++                                        **       |
      |                     ++                                        **       |
      |                     ++                                    xx ***       |
      |                     ++                                    xx ***       |
      |                     ++                                    xxx***       |
      |                     ++                                    xxx***       |
      |                    +++                                    xxx***       |
      |                    +++                                    xx****       |
      |                    +++                                    xx****       |
      |                    +++                                    xx****       |
      |                    +++                                    xx****       |
      |                    ++++                                   xx****       |
      |                   +++++                                   xx****       |
      |                   +++++                                 x x******      |
      |                  ++++++                                 xxx*******     |
      |                  ++++++                                 xxx*******     |
      |                  ++++++                                 xxx*******     |
      |                  ++++++                                 xx********     |
      |                  ++++++                               xxxx********     |
      |                  ++++++                               xxxx********     |
      |                ++++++++                             xxxxx*********     |
      |+ +  +        + ++++++++                           xxx*xx**********x*  *|
      |                                                         |__A__|        |
      |                 |__AM__|                                               |
      |                                                            |__A_|      |
      +------------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x 120       2.47855        2.8972       2.72376     2.7193402   0.074604933
      + 120       1.17367       1.77459       1.71977     1.6966782   0.085850697
      Difference at 95.0% confidence
      	-1.02266 +/- 0.0203502
      	-37.607% +/- 0.748352%
      	(Student's t, pooled s = 0.0804246)
      * 120       2.57868       3.00821       2.80142     2.7923878   0.058646477
      Difference at 95.0% confidence
      	0.0730476 +/- 0.0169791
      	2.68622% +/- 0.624383%
      	(Student's t, pooled s = 0.0671018)
      
      Indicating that we've recovered the regression from enabling semaphores
      on this saturated setup, with a hint towards an overall improvement.
      
      Very similar, but of smaller magnitude, results are observed on both
      Skylake(gt2) and Kabylake(gt4). This may be due to the reduced impact of
      bus-cycles, where we see a 50% hit on Broxton, it is only 10% on the big
      core, in this particular test.
      
      One observation to make here is that for a greedy client trying to
      maximise its own throughput, using semaphores is the right choice. It is
      only the holistic system-wide view that semaphores of one client
      impacts another and reduces the overall throughput where we would choose
      to disable semaphores.
      
      The most noticeable negactive impact this has is on the no-op
      microbenchmarks, which are also very notable for having no cpu bus load.
      In particular, this increases the runtime and energy consumption of
      gem_exec_whisper.
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: default avatarTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190504070707.30902-1-chris@chris-wilson.co.uk
      ca6e56f6
  6. 26 Apr, 2019 4 commits
  7. 24 Apr, 2019 4 commits
  8. 19 Mar, 2019 2 commits
  9. 08 Mar, 2019 3 commits