1. 30 May, 2018 2 commits
  2. 09 May, 2018 2 commits
  3. 20 Jan, 2018 1 commit
  4. 19 Jan, 2018 1 commit
    • Joseph Qi's avatar
      blk-throttle: track read and write request individually · b889bf66
      Joseph Qi authored
      In mixed read/write workload on SSD, write latency is much lower than
      read. But now we only track and record read latency and then use it as
      threshold base for both read and write io latency accounting. As a
      result, write io latency will always be considered as good and
      bad_bio_cnt is much smaller than 20% of bio_cnt. That is to mean, the
      tg to be checked will be treated as idle most of the time and still let
      others dispatch more ios, even it is truly running under low limit and
      wants its low limit to be guaranteed, which is not we expected in fact.
      So track read and write request individually, which can bring more
      precise latency control for low limit idle detection.
      Signed-off-by: default avatarJoseph Qi <qijiang.qj@alibaba-inc.com>
      Reviewed-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b889bf66
  5. 18 Jan, 2018 1 commit
  6. 20 Dec, 2017 1 commit
    • Shaohua Li's avatar
      block-throttle: avoid double charge · 111be883
      Shaohua Li authored
      If a bio is throttled and split after throttling, the bio could be
      resubmited and enters the throttling again. This will cause part of the
      bio to be charged multiple times. If the cgroup has an IO limit, the
      double charge will significantly harm the performance. The bio split
      becomes quite common after arbitrary bio size change.
      
      To fix this, we always set the BIO_THROTTLED flag if a bio is throttled.
      If the bio is cloned/split, we copy the flag to new bio too to avoid a
      double charge. However, cloned bio could be directed to a new disk,
      keeping the flag be a problem. The observation is we always set new disk
      for the bio in this case, so we can clear the flag in bio_set_dev().
      
      This issue exists for a long time, arbitrary bio size change just makes
      it worse, so this should go into stable at least since v4.2.
      
      V1-> V2: Not add extra field in bio based on discussion with Tejun
      
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      111be883
  7. 21 Nov, 2017 1 commit
    • Kees Cook's avatar
      treewide: setup_timer() -> timer_setup() · e99e88a9
      Kees Cook authored
      This converts all remaining cases of the old setup_timer() API into using
      timer_setup(), where the callback argument is the structure already
      holding the struct timer_list. These should have no behavioral changes,
      since they just change which pointer is passed into the callback with
      the same available pointers after conversion. It handles the following
      examples, in addition to some other variations.
      
      Casting from unsigned long:
      
          void my_callback(unsigned long data)
          {
              struct something *ptr = (struct something *)data;
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, ptr);
      
      and forced object casts:
      
          void my_callback(struct something *ptr)
          {
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);
      
      become:
      
          void my_callback(struct timer_list *t)
          {
              struct something *ptr = from_timer(ptr, t, my_timer);
          ...
          }
          ...
          timer_setup(&ptr->my_timer, my_callback, 0);
      
      Direct function assignments:
      
          void my_callback(unsigned long data)
          {
              struct something *ptr = (struct something *)data;
          ...
          }
          ...
          ptr->my_timer.function = my_callback;
      
      have a temporary cast added, along with converting the args:
      
          void my_callback(struct timer_list *t)
          {
              struct something *ptr = from_timer(ptr, t, my_timer);
          ...
          }
          ...
          ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;
      
      And finally, callbacks without a data assignment:
      
          void my_callback(unsigned long data)
          {
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, 0);
      
      have their argument renamed to verify they're unused during conversion:
      
          void my_callback(struct timer_list *unused)
          {
          ...
          }
          ...
          timer_setup(&ptr->my_timer, my_callback, 0);
      
      The conversion is done with the following Coccinelle script:
      
      spatch --very-quiet --all-includes --include-headers \
      	-I ./arch/x86/include -I ./arch/x86/include/generated \
      	-I ./include -I ./arch/x86/include/uapi \
      	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
      	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
      	--dir . \
      	--cocci-file ~/src/data/timer_setup.cocci
      
      @fix_address_of@
      expression e;
      @@
      
       setup_timer(
      -&(e)
      +&e
       , ...)
      
      // Update any raw setup_timer() usages that have a NULL callback, but
      // would otherwise match change_timer_function_usage, since the latter
      // will update all function assignments done in the face of a NULL
      // function initialization in setup_timer().
      @change_timer_function_usage_NULL@
      expression _E;
      identifier _timer;
      type _cast_data;
      @@
      
      (
      -setup_timer(&_E->_timer, NULL, _E);
      +timer_setup(&_E->_timer, NULL, 0);
      |
      -setup_timer(&_E->_timer, NULL, (_cast_data)_E);
      +timer_setup(&_E->_timer, NULL, 0);
      |
      -setup_timer(&_E._timer, NULL, &_E);
      +timer_setup(&_E._timer, NULL, 0);
      |
      -setup_timer(&_E._timer, NULL, (_cast_data)&_E);
      +timer_setup(&_E._timer, NULL, 0);
      )
      
      @change_timer_function_usage@
      expression _E;
      identifier _timer;
      struct timer_list _stl;
      identifier _callback;
      type _cast_func, _cast_data;
      @@
      
      (
      -setup_timer(&_E->_timer, _callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, &_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, &_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
       _E->_timer@_stl.function = _callback;
      |
       _E->_timer@_stl.function = &_callback;
      |
       _E->_timer@_stl.function = (_cast_func)_callback;
      |
       _E->_timer@_stl.function = (_cast_func)&_callback;
      |
       _E._timer@_stl.function = _callback;
      |
       _E._timer@_stl.function = &_callback;
      |
       _E._timer@_stl.function = (_cast_func)_callback;
      |
       _E._timer@_stl.function = (_cast_func)&_callback;
      )
      
      // callback(unsigned long arg)
      @change_callback_handle_cast
       depends on change_timer_function_usage@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _origtype;
      identifier _origarg;
      type _handletype;
      identifier _handle;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *t
       )
       {
      (
      	... when != _origarg
      	_handletype *_handle =
      -(_handletype *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle =
      -(void *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle;
      	... when != _handle
      	_handle =
      -(_handletype *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle;
      	... when != _handle
      	_handle =
      -(void *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      )
       }
      
      // callback(unsigned long arg) without existing variable
      @change_callback_handle_cast_no_arg
       depends on change_timer_function_usage &&
                           !change_callback_handle_cast@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _origtype;
      identifier _origarg;
      type _handletype;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *t
       )
       {
      +	_handletype *_origarg = from_timer(_origarg, t, _timer);
      +
      	... when != _origarg
      -	(_handletype *)_origarg
      +	_origarg
      	... when != _origarg
       }
      
      // Avoid already converted callbacks.
      @match_callback_converted
       depends on change_timer_function_usage &&
                  !change_callback_handle_cast &&
      	    !change_callback_handle_cast_no_arg@
      identifier change_timer_function_usage._callback;
      identifier t;
      @@
      
       void _callback(struct timer_list *t)
       { ... }
      
      // callback(struct something *handle)
      @change_callback_handle_arg
       depends on change_timer_function_usage &&
      	    !match_callback_converted &&
                  !change_callback_handle_cast &&
                  !change_callback_handle_cast_no_arg@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _handletype;
      identifier _handle;
      @@
      
       void _callback(
      -_handletype *_handle
      +struct timer_list *t
       )
       {
      +	_handletype *_handle = from_timer(_handle, t, _timer);
      	...
       }
      
      // If change_callback_handle_arg ran on an empty function, remove
      // the added handler.
      @unchange_callback_handle_arg
       depends on change_timer_function_usage &&
      	    change_callback_handle_arg@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _handletype;
      identifier _handle;
      identifier t;
      @@
      
       void _callback(struct timer_list *t)
       {
      -	_handletype *_handle = from_timer(_handle, t, _timer);
       }
      
      // We only want to refactor the setup_timer() data argument if we've found
      // the matching callback. This undoes changes in change_timer_function_usage.
      @unchange_timer_function_usage
       depends on change_timer_function_usage &&
                  !change_callback_handle_cast &&
                  !change_callback_handle_cast_no_arg &&
      	    !change_callback_handle_arg@
      expression change_timer_function_usage._E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type change_timer_function_usage._cast_data;
      @@
      
      (
      -timer_setup(&_E->_timer, _callback, 0);
      +setup_timer(&_E->_timer, _callback, (_cast_data)_E);
      |
      -timer_setup(&_E._timer, _callback, 0);
      +setup_timer(&_E._timer, _callback, (_cast_data)&_E);
      )
      
      // If we fixed a callback from a .function assignment, fix the
      // assignment cast now.
      @change_timer_function_assignment
       depends on change_timer_function_usage &&
                  (change_callback_handle_cast ||
                   change_callback_handle_cast_no_arg ||
                   change_callback_handle_arg)@
      expression change_timer_function_usage._E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type _cast_func;
      typedef TIMER_FUNC_TYPE;
      @@
      
      (
       _E->_timer.function =
      -_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -(_cast_func)_callback;
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -(_cast_func)&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -&_callback;
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -(_cast_func)_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -(_cast_func)&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      )
      
      // Sometimes timer functions are called directly. Replace matched args.
      @change_timer_function_calls
       depends on change_timer_function_usage &&
                  (change_callback_handle_cast ||
                   change_callback_handle_cast_no_arg ||
                   change_callback_handle_arg)@
      expression _E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type _cast_data;
      @@
      
       _callback(
      (
      -(_cast_data)_E
      +&_E->_timer
      |
      -(_cast_data)&_E
      +&_E._timer
      |
      -_E
      +&_E->_timer
      )
       )
      
      // If a timer has been configured without a data argument, it can be
      // converted without regard to the callback argument, since it is unused.
      @match_timer_function_unused_data@
      expression _E;
      identifier _timer;
      identifier _callback;
      @@
      
      (
      -setup_timer(&_E->_timer, _callback, 0);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, 0L);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, 0UL);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0L);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0UL);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0L);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0UL);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0);
      +timer_setup(_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0L);
      +timer_setup(_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0UL);
      +timer_setup(_timer, _callback, 0);
      )
      
      @change_callback_unused_data
       depends on match_timer_function_unused_data@
      identifier match_timer_function_unused_data._callback;
      type _origtype;
      identifier _origarg;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *unused
       )
       {
      	... when != _origarg
       }
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      e99e88a9
  8. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  9. 10 Oct, 2017 1 commit
    • Jiufei Xue's avatar
      blk-throttle: fix null pointer dereference while throttling writeback IOs · 53cfdc10
      Jiufei Xue authored
      A null pointer dereference can occur when blkcg is removed manually
      with writeback IOs inflight. This is caused by the following case:
      
      Writeback kworker submit the bio and set bio->bi_cg_private to tg
      in blk_throtl_assoc_bio.
      Then we remove the block cgroup manually, the blkg and tg would be
      freed if there is no request inflight.
      When the submitted bio come back, blk_throtl_bio_endio() fetch the tg
      which was already freed.
      
      Fix this by increasing the refcount of blkg in funcion
      blk_throtl_assoc_bio() so that the blkg will not be freed until the
      bio_endio called.
      Reviewed-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJiufei Xue <jiufei.xjf@alibaba-inc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      53cfdc10
  10. 03 Oct, 2017 1 commit
    • Joseph Qi's avatar
      blk-throttle: fix possible io stall when upgrade to max · 4f02fb76
      Joseph Qi authored
      There is a case which will lead to io stall. The case is described as
      follows.
      /test1
        |-subtest1
      /test2
        |-subtest2
      And subtest1 and subtest2 each has 32 queued bios already.
      
      Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
      bios as follows:
      1) tg=subtest1, do nothing;
      2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
      left, no need to schedule next dispatch;
      3) tg=subtest2, do nothing;
      4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
      left, no need to schedule next dispatch;
      5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
      test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
      to /; note that test1 and test2 each still has 16 queued bios left;
      6) tg=/, try to schedule next dispatch, but since disptime is now
      (update in tg_update_disptime, wait=0), pending timer is not scheduled
      in fact;
      7) In throtl_upgrade_state it totally dispatches 32 queued bios and with
      32 left. test1 and test2 each has 16 queued bios;
      8) throtl_pending_timer_fn sees the left over bios, but could do
      nothing, because throtl_select_dispatch returns 0, and test1/test2 has
      no pending tg.
      
      The blktrace shows the following:
      8,32   0        0     2.539007641     0  m   N throtl upgrade to max
      8,32   0        0     2.539072267     0  m   N throtl /test2 dispatch nr_queued=16 read=0 write=16
      8,32   7        0     2.539077142     0  m   N throtl /test1 dispatch nr_queued=16 read=0 write=16
      
      So force schedule dispatch if there are pending children.
      Reviewed-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJoseph Qi <qijiang.qj@alibaba-inc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4f02fb76
  11. 23 Aug, 2017 1 commit
    • Shaohua Li's avatar
      blk-throttle: cap discard request size · ea0ea2bc
      Shaohua Li authored
      discard request usually is very big and easily use all bandwidth budget
      of a cgroup. discard request size doesn't really mean the size of data
      written, so it doesn't make sense to account it into bandwidth budget.
      Jens pointed out treating the size 0 doesn't make sense too, because
      discard request does have cost. But it's not easy to find the actual
      cost. This patch simply makes the size one sector.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ea0ea2bc
  12. 29 Jul, 2017 2 commits
    • Shaohua Li's avatar
      block: use standard blktrace API to output cgroup info for debug notes · 35fe6d76
      Shaohua Li authored
      Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
      way. Now we have standard blktrace API for this, so convert them to use
      it.
      
      Note, this changes the behavior a little bit. cgroup info isn't output
      by default, we only do this with 'blk_cgroup' option enabled. cgroup
      info isn't output as a string by default too, we only do this with
      'blk_cgname' option enabled. Also cgroup info is output in different
      position of the note string. I think these behavior changes aren't a big
      issue (actually we make trace data shorter which is good), since the
      blktrace note is solely for debugging.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      35fe6d76
    • Shaohua Li's avatar
      block: always attach cgroup info into bio · 007cc56b
      Shaohua Li authored
      blkcg_bio_issue_check() already gets blkcg for a BIO.
      bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
      operation. There is no point we don't attach the cgroup info into bio at
      blkcg_bio_issue_check. This also makes blktrace outputs correct cgroup
      info.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      007cc56b
  13. 07 Jun, 2017 2 commits
    • Shaohua Li's avatar
      blk-throttle: set default latency baseline for harddisk · 6679a90c
      Shaohua Li authored
      hard disk IO latency varies a lot depending on spindle move. The latency
      range could be from several microseconds to several milliseconds. It's
      pretty hard to get the baseline latency used by io.low.
      
      We will use a different stragety here. The idea is only using IO with
      spindle move to determine if cgroup IO is in good state. For HD, if io
      latency is small (< 1ms), we ignore the IO. Such IO is likely from
      sequential IO, and is helpless to help determine if a cgroup's IO is
      impacted by other cgroups. With this, we only account IO with big
      latency. Then we can choose a hardcoded baseline latency for HD (4ms,
      which is typical IO latency with seek).  With all these settings, the
      io.low latency works for both HD and SSD.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      6679a90c
    • Joseph Qi's avatar
      blk-throttle: fix NULL pointer dereference in throtl_schedule_pending_timer · a41b816c
      Joseph Qi authored
      I have encountered a NULL pointer dereference in
      throtl_schedule_pending_timer:
        [  413.735396] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
        [  413.735535] IP: [<ffffffff812ebbbf>] throtl_schedule_pending_timer+0x3f/0x210
        [  413.735643] PGD 22c8cf067 PUD 22cb34067 PMD 0
        [  413.735713] Oops: 0000 [#1] SMP
        ......
      
      This is caused by the following case:
        blk_throtl_bio
          throtl_schedule_next_dispatch  <= sq is top level one without parent
            throtl_schedule_pending_timer
              sq_to_tg(sq)->td->throtl_slice  <= sq_to_tg(sq) returns NULL
      
      Fix it by using sq_to_td instead of sq_to_tg(sq)->td, which will always
      return a valid td.
      
      Fixes: 297e3d85 ("blk-throttle: make throtl_slice tunable")
      Signed-off-by: default avatarJoseph Qi <qijiang.qj@alibaba-inc.com>
      Reviewed-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      a41b816c
  14. 22 May, 2017 4 commits
  15. 20 Apr, 2017 1 commit
  16. 28 Mar, 2017 17 commits
    • Shaohua Li's avatar
      blk-throttle: add latency target support · 53696b8d
      Shaohua Li authored
      One hard problem adding .low limit is to detect idle cgroup. If one
      cgroup doesn't dispatch enough IO against its low limit, we must have a
      mechanism to determine if other cgroups dispatch more IO. We added the
      think time detection mechanism before, but it doesn't work for all
      workloads. Here we add a latency based approach.
      
      We already have mechanism to calculate latency threshold for each IO
      size. For every IO dispatched from a cgorup, we compare its latency
      against its threshold and record the info. If most IO latency is below
      threshold (in the code I use 75%), the cgroup could be treated idle and
      other cgroups can dispatch more IO.
      
      Currently this latency target check is only for SSD as we can't
      calcualte the latency target for hard disk. And this is only for cgroup
      leaf node so far.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      53696b8d
    • Shaohua Li's avatar
      blk-throttle: add a mechanism to estimate IO latency · b9147dd1
      Shaohua Li authored
      User configures latency target, but the latency threshold for each
      request size isn't fixed. For a SSD, the IO latency highly depends on
      request size. To calculate latency threshold, we sample some data, eg,
      average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
      threshold of each request size will be the sample latency (I'll call it
      base latency) plus latency target. For example, the base latency for
      request size 4k is 80us and user configures latency target 60us. The 4k
      latency threshold will be 80 + 60 = 140us.
      
      To sample data, we calculate the order base 2 of rounded up IO sectors.
      If the IO size is bigger than 1M, it will be accounted as 1M. Since the
      calculation does round up, the base latency will be slightly smaller
      than actual value. Also if there isn't any IO dispatched for a specific
      IO size, we will use the base latency of smaller IO size for this IO
      size.
      
      But we shouldn't sample data at any time. The base latency is supposed
      to be latency where disk isn't congested, because we use latency
      threshold to schedule IOs between cgroups. If disk is congested, the
      latency is higher, using it for scheduling is meaningless. Hence we only
      do the sampling when block throttling is in the LOW limit, with
      assumption disk isn't congested in such state. If the assumption isn't
      true, eg, low limit is too high, calculated latency threshold will be
      higher.
      
      Hard disk is completely different. Latency depends on spindle seek
      instead of request size. Currently this feature is SSD only, we probably
      can use a fixed threshold like 4ms for hard disk though.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      b9147dd1
    • Shaohua Li's avatar
      blk-throttle: add interface for per-cgroup target latency · ec80991d
      Shaohua Li authored
      Here we introduce per-cgroup latency target. The target determines how a
      cgroup can afford latency increasement. We will use the target latency
      to calculate a threshold and use it to schedule IO for cgroups. If a
      cgroup's bandwidth is below its low limit but its average latency is
      below the threshold, other cgroups can safely dispatch more IO even
      their bandwidth is higher than their low limits. On the other hand, if
      the first cgroup's latency is higher than the threshold, other cgroups
      are throttled to their low limits. So the target latency determines how
      we efficiently utilize free disk resource without sacifice of worload's
      IO latency.
      
      For example, assume 4k IO average latency is 50us when disk isn't
      congested. A cgroup sets the target latency to 30us. Then the cgroup can
      accept 50+30=80us IO latency. If the cgroupt's average IO latency is
      90us and its bandwidth is below low limit, other cgroups are throttled
      to their low limit. If the cgroup's average IO latency is 60us, other
      cgroups are allowed to dispatch more IO. When other cgroups dispatch
      more IO, the first cgroup's IO latency will increase. If it increases to
      81us, we then throttle other cgroups.
      
      User will configure the interface in this way:
      echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low
      
      latency is in microsecond unit
      
      By default, latency target is 0, which means to guarantee IO latency.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ec80991d
    • Shaohua Li's avatar
      blk-throttle: ignore idle cgroup limit · fa6fb5aa
      Shaohua Li authored
      Last patch introduces a way to detect idle cgroup. We use it to make
      upgrade/downgrade decision. And the new algorithm can detect completely
      idle cgroup too, so we can delete the corresponding code.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      fa6fb5aa
    • Shaohua Li's avatar
      blk-throttle: add interface to configure idle time threshold · ada75b6e
      Shaohua Li authored
      Add interface to configure the threshold. The io.low interface will
      like:
      echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low
      
      idle is in microsecond unit.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ada75b6e
    • Shaohua Li's avatar
      blk-throttle: add a simple idle detection · 9e234eea
      Shaohua Li authored
      A cgroup gets assigned a low limit, but the cgroup could never dispatch
      enough IO to cross the low limit. In such case, the queue state machine
      will remain in LIMIT_LOW state and all other cgroups will be throttled
      according to low limit. This is unfair for other cgroups. We should
      treat the cgroup idle and upgrade the state machine to lower state.
      
      We also have a downgrade logic. If the state machine upgrades because of
      cgroup idle (real idle), the state machine will downgrade soon as the
      cgroup is below its low limit. This isn't what we want. A more
      complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
      when queue gets upgraded to lower state, other cgroups could dispatch
      more IO and this cgroup can't dispatch enough IO, so the cgroup is below
      its low limit and looks like idle (fake idle). In this case, the queue
      should downgrade soon. The key to determine if we should do downgrade is
      to detect if cgroup is truely idle.
      
      Unfortunately it's very hard to determine if a cgroup is real idle. This
      patch uses the 'think time check' idea from CFQ for the purpose. Please
      note, the idea doesn't work for all workloads. For example, a workload
      with io depth 8 has disk utilization 100%, hence think time is 0, eg,
      not idle. But the workload can run higher bandwidth with io depth 16.
      Compared to io depth 16, the io depth 8 workload is idle. We use the
      idea to roughly determine if a cgroup is idle.
      
      We treat a cgroup idle if its think time is above a threshold (by
      default 1ms for SSD and 100ms for HD). The idea is think time above the
      threshold will start to harm performance. HD is much slower so a longer
      think time is ok.
      
      The patch (and the latter patches) uses 'unsigned long' to track time.
      We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
      precision, should not a big deal.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      9e234eea
    • Shaohua Li's avatar
      blk-throttle: make bandwidth change smooth · 7394e31f
      Shaohua Li authored
      When cgroups all reach low limit, cgroups can dispatch more IO. This
      could make some cgroups dispatch more IO but others not, and even some
      cgroups could dispatch less IO than their low limit. For example, cg1
      low limit 10MB/s, cg2 limit 80MB/s, assume disk maximum bandwidth is
      120M/s for the workload. Their bps could something like this:
      
      cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80
      
      At T1, all cgroups reach low limit, so they can dispatch more IO later.
      Then cg1 dispatch more IO and cg2 has no room to dispatch enough IO. At
      T2, cg2 only dispatches 60M/s. Since We detect cg2 dispatches less IO
      than its low limit 80M/s, we downgrade the queue from LIMIT_MAX to
      LIMIT_LOW, then all cgroups are throttled to their low limit (T3). cg2
      will have bandwidth below its low limit at most time.
      
      The big problem here is we don't know the maximum bandwidth of the
      workload, so we can't make smart decision to avoid the situation. This
      patch makes cgroup bandwidth change smooth. After disk upgrades from
      LIMIT_LOW to LIMIT_MAX, we don't allow cgroups use all bandwidth upto
      their max limit immediately. Their bandwidth limit will be increased
      gradually to avoid above situation. So above example will became
      something like:
      
      cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
      -> 45/75 -> 22/98
      
      In this way cgroups bandwidth will be above their limit in majority
      time, this still doesn't fully utilize disk bandwidth, but that's
      something we pay for sharing.
      
      Scale up is linear. The limit scales up 1/2 .low limit every
      throtl_slice after upgrade. The scale up will stop if the adjusted limit
      hits .max limit. Scale down is exponential. We cut the scale value half
      if a cgroup doesn't hit its .low limit. If the scale becomes 0, we then
      fully downgrade the queue to LIMIT_LOW state.
      
      Note this doesn't completely avoid cgroup running under its low limit.
      The best way to guarantee cgroup doesn't run under its limit is to set
      max limit. For example, if we set cg1 max limit to 40, cg2 will never
      run under its low limit.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      7394e31f
    • Shaohua Li's avatar
      blk-throttle: detect completed idle cgroup · aec24246
      Shaohua Li authored
      cgroup could be assigned a limit, but doesn't dispatch enough IO, eg the
      cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
      we can't move the state machine to higher level and all cgroups will be
      throttled to their lower limit, so we waste bandwidth. Detecting idle
      cgroup is hard. This patch handles a simple case, a cgroup doesn't
      dispatch any IO. We ignore such cgroup's limit, so other cgroups can use
      the bandwidth.
      
      Please note this will be replaced with a more sophisticated algorithm
      later, but this demonstrates the idea how we handle idle cgroups, so I
      leave it here.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      aec24246
    • Shaohua Li's avatar
      blk-throttle: choose a small throtl_slice for SSD · d61fcfa4
      Shaohua Li authored
      The throtl_slice is 100ms by default. This is a long time for SSD, a lot
      of IO can run. To make cgroups have smoother throughput, we choose a
      small value (20ms) for SSD.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      d61fcfa4
    • Shaohua Li's avatar
      blk-throttle: make throtl_slice tunable · 297e3d85
      Shaohua Li authored
      throtl_slice is important for blk-throttling. It's called slice
      internally but it really is a time window blk-throttling samples data.
      blk-throttling will make decision based on the samplings. An example is
      bandwidth measurement. A cgroup's bandwidth is measured in the time
      interval of throtl_slice.
      
      A small throtl_slice meanse cgroups have smoother throughput but burn
      more CPUs. It has 100ms default value, which is not appropriate for all
      disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
      it tunable.
      
      Since throtl_slice isn't a time slice, the sysfs name
      'throttle_sample_time' reflects its character better.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      297e3d85
    • Shaohua Li's avatar
      blk-throttle: make sure expire time isn't too big · 06cceedc
      Shaohua Li authored
      cgroup could be throttled to a limit but when all cgroups cross high
      limit, queue enters a higher state and so the group should be throttled
      to a higher limit. It's possible the cgroup is sleeping because of
      throttle and other cgroups don't dispatch IO any more. In this case,
      nobody can trigger current downgrade/upgrade logic. To fix this issue,
      we could either set up a timer to wakeup the cgroup if other cgroups are
      idle or make sure this cgroup doesn't sleep too long. Setting up a timer
      means we must change the timer very frequently. This patch chooses the
      latter. Making cgroup sleep time not too big wouldn't change cgroup
      bps/iops, but could make it wakeup more frequently, which isn't a big
      issue because throtl_slice * 8 is already quite big.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      06cceedc
    • Shaohua Li's avatar
      blk-throttle: add downgrade logic · 3f0abd80
      Shaohua Li authored
      When queue state machine is in LIMIT_MAX state, but a cgroup is below
      its low limit for some time, the queue should be downgraded to lower
      state as one cgroup's low limit isn't met.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      3f0abd80
    • Shaohua Li's avatar
      blk-throttle: add upgrade logic for LIMIT_LOW state · c79892c5
      Shaohua Li authored
      When queue is in LIMIT_LOW state and all cgroups with low limit cross
      the bps/iops limitation, we will upgrade queue's state to
      LIMIT_MAX. To determine if a cgroup exceeds its limitation, we check if
      the cgroup has pending request. Since cgroup is throttled according to
      the limit, pending request means the cgroup reaches the limit.
      
      If a cgroup has limit set for both read and write, we consider the
      combination of them for upgrade. The reason is read IO and write IO can
      interfere with each other. If we do the upgrade based in one direction
      IO, the other direction IO could be severly harmed.
      
      For a cgroup hierarchy, there are two cases. Children has lower low
      limit than parent. Parent's low limit is meaningless. If children's
      bps/iops cross low limit, we can upgrade queue state. The other case is
      children has higher low limit than parent. Children's low limit is
      meaningless. As long as parent's bps/iops (which is a sum of childrens
      bps/iops) cross low limit, we can upgrade queue state.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      c79892c5
    • Shaohua Li's avatar
      blk-throttle: configure bps/iops limit for cgroup in low limit · b22c417c
      Shaohua Li authored
      each queue will have a state machine. Initially queue is in LIMIT_LOW
      state, which means all cgroups will be throttled according to their low
      limit. After all cgroups with low limit cross the limit, the queue state
      gets upgraded to LIMIT_MAX state.
      For max limit, cgroup will use the limit configured by user.
      For low limit, cgroup will use the minimal value between low limit and
      max limit configured by user. If the minimal value is 0, which means the
      cgroup doesn't configure low limit, we will use max limit to throttle
      the cgroup and the cgroup is ready to upgrade to LIMIT_MAX
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      b22c417c
    • Shaohua Li's avatar
      blk-throttle: add .low interface · cd5ab1b0
      Shaohua Li authored
      Add low limit for cgroup and corresponding cgroup interface. To be
      consistent with memcg, we allow users configure .low limit higher than
      .max limit. But the internal logic always assumes .low limit is lower
      than .max limit. So we add extra bps/iops_conf fields in throtl_grp for
      userspace configuration. Old bps/iops fields in throtl_grp will be the
      actual limit we use for throttling.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      cd5ab1b0
    • Shaohua Li's avatar
      blk-throttle: prepare support multiple limits · 9f626e37
      Shaohua Li authored
      We are going to support low/max limit, each cgroup will have 2 limits
      after that. This patch prepares for the multiple limits change.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      9f626e37
    • Shaohua Li's avatar
      blk-throttle: use U64_MAX/UINT_MAX to replace -1 · 2ab5492d
      Shaohua Li authored
      clean up the code to avoid using -1
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      2ab5492d
  17. 28 Feb, 2017 1 commit