1. 05 Apr, 2019 2 commits
    • cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · b8498a26
      Oleg Nesterov authored
      [ Upstream commit 51bee5ab ]
      
      The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
      needs pids_free() to uncharge the pid.
      
      However, ->free() is called from __put_task_struct()->cgroup_free() and this
      is too late. Even the trivial program which does
      
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			exit(0);
      	}
      
      can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU gp after the task/pid goes away and before the final put().
      
      Test-case:
      
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      
      Without this patch the forking process fails soon after migration.
      
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
      into the new helper, cgroup_release(), called by release_task() which actually
      frees the pid(s).
      
      Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b8498a26
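The accounting problem can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `pids_model`, `fork_loop()` and the `gp_every` parameter are hypothetical names, not kernel APIs, and an RCU grace period is crudely approximated as "once every `gp_every` rounds". It shows why uncharging when the child is reaped keeps a fork/wait loop within a limit of 2, while deferring the uncharge does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a pids controller: a counter with a hard limit. */
struct pids_model {
	int count;	/* pids currently charged */
	int limit;	/* plays the role of pids.max */
};

/* Charge one pid; fails when the limit is already reached. */
static bool pids_try_charge(struct pids_model *p)
{
	if (p->count >= p->limit)
		return false;
	p->count++;
	return true;
}

static void pids_uncharge(struct pids_model *p)
{
	assert(p->count > 0);
	p->count--;
}

/*
 * Run up to `rounds` fork/exit/wait rounds for a parent that is already
 * charged.  If `deferred` is true, each exited child stays charged until
 * the next modelled grace period (every `gp_every` rounds); otherwise it
 * is uncharged as soon as it is reaped.  Returns the number of forks
 * that succeeded.
 */
static int fork_loop(struct pids_model *p, int rounds, bool deferred,
		     int gp_every)
{
	int pending = 0;	/* exited children not yet uncharged */
	int ok = 0;

	assert(gp_every > 0);
	for (int i = 0; i < rounds; i++) {
		/* A modelled grace period ends: flush deferred uncharges. */
		if (i % gp_every == 0) {
			while (pending > 0) {
				pids_uncharge(p);
				pending--;
			}
		}
		if (!pids_try_charge(p))
			break;			/* fork() would fail here */
		ok++;
		if (deferred)
			pending++;		/* uncharge waits */
		else
			pids_uncharge(p);	/* uncharge at reap time */
	}
	return ok;
}
```

With `limit = 2` and one pid already charged for the parent, the immediate variant completes every round, while the deferred variant stops at the second fork.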
    • cgroup, rstat: Don't flush subtree root unless necessary · 55d7152d
      Tejun Heo authored
      [ Upstream commit b4ff1b44 ]
      
      cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups
      on flush.  While it was only visiting updated ones in the subtree, it
      was visiting @root unconditionally.  We can easily check whether @root
      is updated or not by looking at its ->updated_next just as with the
      cgroups in the subtree.
      
      * Remove the unnecessary cgroup_parent() test.  The system root cgroup
        is never updated and thus its ->updated_next is always NULL.  No
        need to test whether cgroup_parent() exists in addition to
        ->updated_next.
      
      * Terminate traverse if ->updated_next is NULL.  This can only happen
        for subtree @root and there's no reason to visit it if it's not
        marked updated.
      
      This reduces cpu consumption when reading a lot of rstat backed files.
      In a micro benchmark reading stat from ~1600 cgroups, the sys time was
      lowered by >40%.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      55d7152d
  2. 23 Mar, 2019 1 commit
    • fix cgroup_do_mount() handling of failure exits · 781bcac5
      Al Viro authored
      commit 399504e2 upstream.
      
      same story as with last May fixes in sysfs (7b745a4e
      "unfuck sysfs_mount()"); new_sb is left uninitialized
      in case of early errors in kernfs_mount_ns() and papering
      over it by treating any error from kernfs_mount_ns() as
      equivalent to !new_ns ends up conflating the cases when
      objects had never been transferred to a superblock with
      ones when that has happened and resulting new superblock
      had been dropped.  Easily fixed (same way as in sysfs
      case).  Additionally, there's a superblock leak on
      kernfs_node_dentry() failure *and* a dentry leak inside
      kernfs_node_dentry() itself - the latter on probably
      impossible errors, but the former not impossible to trigger
      (as a matter of fact, injecting allocation failures
      at that point *does* trigger it).
      
      Cc: stable@kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      781bcac5
  3. 28 Dec, 2018 3 commits
    • mm, oom: reorganize the oom report in dump_header · ef8444ea
      yuzhoujian authored
      The OOM report contains several sections.  The first one is the allocation
      context that has triggered the OOM.  Then we have the cpuset context,
      followed by the stack trace of the OOM path.  The third one is the OOM
      memory information, followed by the current memory state of all system
      tasks.  Finally, we show the oom eligible tasks and the information about
      the chosen oom victim.
      
      One thing that makes parsing more awkward than necessary is that we do not
      have a single and easily parsable line about the oom context.  This patch
      reorganizes the oom report into the following sections:
      
      1) who invoked oom and what was the allocation request
      
      [  515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
      
      2) OOM stack trace
      
      [  515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
      [  515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
      [  515.906821] Call Trace:
      [  515.908062]  dump_stack+0x5a/0x73
      [  515.909311]  dump_header+0x55/0x28c
      [  515.914260]  oom_kill_process+0x2d8/0x300
      [  515.916708]  out_of_memory+0x145/0x4a0
      [  515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
      [  515.919157]  __alloc_pages_nodemask+0x277/0x290
      [  515.920367]  filemap_fault+0x3d0/0x6c0
      [  515.921529]  ? filemap_map_pages+0x2b8/0x420
      [  515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
      [  515.923884]  __do_fault+0x20/0x80
      [  515.925032]  __handle_mm_fault+0xbc0/0xe80
      [  515.926195]  handle_mm_fault+0xfa/0x210
      [  515.927357]  __do_page_fault+0x233/0x4c0
      [  515.928506]  do_page_fault+0x32/0x140
      [  515.929646]  ? page_fault+0x8/0x30
      [  515.930770]  page_fault+0x1e/0x30
      
      3) OOM memory information
      
      [  515.958093] Mem-Info:
      [  515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
       active_file:4402672 inactive_file:483963 isolated_file:1344
       unevictable:0 dirty:4886753 writeback:0 unstable:0
       slab_reclaimable:148442 slab_unreclaimable:18741
       mapped:1347 shmem:1347 pagetables:58669 bounce:0
       free:88663 free_pcp:0 free_cma:0
      ...
      
      4) current memory state of all system tasks
      
      [  516.079544] [    744]     0   744     9211     1345   114688       82             0 systemd-journal
      [  516.082034] [    787]     0   787    31764        0   143360       92             0 lvmetad
      [  516.084465] [    792]     0   792    10930        1   110592      208         -1000 systemd-udevd
      [  516.086865] [   1199]     0  1199    13866        0   131072      112         -1000 auditd
      [  516.089190] [   1222]     0  1222    31990        1   110592      157             0 smartd
      [  516.091477] [   1225]     0  1225     4864       85    81920       43             0 irqbalance
      [  516.093712] [   1226]     0  1226    52612        0   258048      426             0 abrtd
      [  516.112128] [   1280]     0  1280   109774       55   299008      400             0 NetworkManager
      [  516.113998] [   1295]     0  1295    28817       37    69632       24             0 ksmtuned
      [  516.144596] [  10718]     0 10718  2622484  1721372 15998976   267219             0 panic
      [  516.145792] [  10719]     0 10719  2622484  1164767  9818112    53576             0 panic
      [  516.146977] [  10720]     0 10720  2622484  1174361  9904128    53709             0 panic
      [  516.148163] [  10721]     0 10721  2622484  1209070 10194944    54824             0 panic
      [  516.149329] [  10722]     0 10722  2622484  1745799 14774272    91138             0 panic
      
      5) oom context (constraints and the chosen victim).
      
      oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0
      
      An admin can easily get the full oom context on a single line, which
      makes parsing much easier.
      
      Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
      Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <yang.s@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef8444ea
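As an illustration of how the single `oom-kill:` line might be consumed, here is a minimal userspace sketch. The helper name `oom_field()` is hypothetical, and the code assumes that values contain no commas, which holds for the sample line above but is not guaranteed in general (for example, a task name or a `mems_allowed` list could contain one).

```c
#include <assert.h>
#include <string.h>

/*
 * Copy the value of `key` from the comma-separated key=value list that
 * follows "oom-kill:" into `out`.  Returns 0 on success, -1 if the
 * prefix or key is missing or `out` is too small.
 */
static int oom_field(const char *line, const char *key,
		     char *out, size_t outlen)
{
	size_t klen;
	const char *p;

	assert(line && key && out);
	klen = strlen(key);
	p = strstr(line, "oom-kill:");
	if (!p)
		return -1;
	p += strlen("oom-kill:");

	while (*p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		/* Match "key=" exactly, so "mems" does not hit "mems_allowed". */
		if (len > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
			size_t vlen = len - klen - 1;

			if (vlen >= outlen)
				return -1;
			memcpy(out, p + klen + 1, vlen);
			out[vlen] = '\0';
			return 0;
		}
		if (!end)
			break;
		p = end + 1;
	}
	return -1;
}
```

Calling `oom_field(line, "pid", buf, sizeof(buf))` on the sample line leaves the victim's pid in `buf` as a string; the caller decides how to convert it.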
    • cgroup: Add named hierarchy disabling to cgroup_no_v1 boot param · 3fc9c12d
      Tejun Heo authored
      It can be useful to inhibit all cgroup1 hierarchies especially during
      transition and for debugging.  cgroup_no_v1 can block hierarchies with
      controllers which leaves out the named hierarchies.  Expand it to
      cover the named hierarchies so that "cgroup_no_v1=all,named" disables
      all cgroup1 hierarchies.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Marcin Pawlowski <mpawlowski@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3fc9c12d
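The parameter format can be sketched as a comma-separated token list. The following is a simplified userspace model, not the kernel's parser: `no_v1_cfg` and `parse_no_v1()` are hypothetical names, and individual controller names are deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct no_v1_cfg {
	bool all_controllers;	/* "all": block every controller hierarchy */
	bool named;		/* "named": block named hierarchies as well */
};

/* True if the `len`-byte token at `tok` equals `name`. */
static bool token_is(const char *tok, size_t len, const char *name)
{
	return strlen(name) == len && strncmp(tok, name, len) == 0;
}

/* Parse a value such as "all,named" into `cfg`. */
static void parse_no_v1(const char *arg, struct no_v1_cfg *cfg)
{
	assert(arg && cfg);
	cfg->all_controllers = false;
	cfg->named = false;

	while (*arg) {
		size_t len = strcspn(arg, ",");

		if (token_is(arg, len, "all"))
			cfg->all_controllers = true;
		else if (token_is(arg, len, "named"))
			cfg->named = true;
		/* Individual controller names are ignored in this sketch. */

		arg += len;
		if (*arg == ',')
			arg++;
	}
}
```

In this model, only the combination "all,named" sets both flags, which corresponds to disabling every cgroup1 hierarchy.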
    • cgroup: fix parsing empty mount option string · e250d91d
      Ondrej Mosnáček authored
      This fixes the case where all mount options specified are consumed by an
      LSM and all that's left is an empty string. In this case cgroupfs should
      accept the string and not fail.
      
      How to reproduce (with SELinux enabled):
      
          # umount /sys/fs/cgroup/unified
          # mount -o context=system_u:object_r:cgroup_t:s0 -t cgroup2 cgroup2 /sys/fs/cgroup/unified
          mount: /sys/fs/cgroup/unified: wrong fs type, bad option, bad superblock on cgroup2, missing codepage or helper program, or other error.
          # dmesg | tail -n 1
          [   31.575952] cgroup: cgroup2: unknown option ""
      
      Fixes: 67e9c74b ("cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type")
      [NOTE: should apply on top of commit 5136f636 ("cgroup: implement "nsdelegate" mount option"), older versions need manual rebase]
      Suggested-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e250d91d
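The intended behaviour can be sketched in userspace as an option parser that treats an empty string the same way as no options at all. The function name `parse_cgroup2_opts()` is hypothetical, and the only option recognised here is "nsdelegate", which the note above mentions; this is a model of the idea, not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/*
 * Parse a comma-separated cgroup2 option string.  Returns 0 on success
 * and -EINVAL on an unknown option.  A NULL or empty string, as left
 * behind when an LSM has consumed every option, is accepted.
 */
static int parse_cgroup2_opts(const char *data, bool *nsdelegate)
{
	assert(nsdelegate);
	*nsdelegate = false;

	/* Nothing left to parse is not an error. */
	if (!data || *data == '\0')
		return 0;

	while (*data) {
		size_t len = strcspn(data, ",");

		if (len == strlen("nsdelegate") &&
		    strncmp(data, "nsdelegate", len) == 0)
			*nsdelegate = true;
		else
			return -EINVAL;	/* unknown (or empty) token */

		data += len;
		if (*data == ',')
			data++;
	}
	return 0;
}
```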
  4. 08 Dec, 2018 1 commit
  5. 03 Dec, 2018 1 commit
  6. 01 Dec, 2018 1 commit
  7. 20 Nov, 2018 1 commit
    • cgroup: fix CSS_TASK_ITER_PROCS · e9d81a1b
      Tejun Heo authored
      CSS_TASK_ITER_PROCS implements process-only iteration by making
      css_task_iter_advance() skip tasks which aren't threadgroup leaders;
      however, when an iteration is started css_task_iter_start() calls the
      inner helper function css_task_iter_advance_css_set() instead of
      css_task_iter_advance().  As the helper doesn't have the skip logic,
      when the first task to visit is a non-leader thread, it doesn't get
      skipped correctly as shown in the following example.
      
        # ps -L 2030
          PID   LWP TTY      STAT   TIME COMMAND
         2030  2030 pts/0    Sl+    0:00 ./test-thread
         2030  2031 pts/0    Sl+    0:00 ./test-thread
        # mkdir -p /sys/fs/cgroup/x/a/b
        # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
        # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
        # echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs
        # cat /sys/fs/cgroup/x/a/cgroup.threads
        2030
        2031
        # cat /sys/fs/cgroup/x/cgroup.procs
        2030
        # echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads
        # cat /sys/fs/cgroup/x/cgroup.procs
        2031
        2030
      
      The last read of cgroup.procs is incorrectly showing non-leader 2031
      in cgroup.procs output.
      
      This can be fixed by updating css_task_iter_advance() to handle the
      first advance and css_task_iter_start() to call
      css_task_iter_advance() instead of the inner helper.  After the fix,
      the same commands produce the following (correct) output:
      
        # ps -L 2062
          PID   LWP TTY      STAT   TIME COMMAND
         2062  2062 pts/0    Sl+    0:00 ./test-thread
         2062  2063 pts/0    Sl+    0:00 ./test-thread
        # mkdir -p /sys/fs/cgroup/x/a/b
        # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
        # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
        # echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs
        # cat /sys/fs/cgroup/x/a/cgroup.threads
        2062
        2063
        # cat /sys/fs/cgroup/x/cgroup.procs
        2062
        # echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads
        # cat /sys/fs/cgroup/x/cgroup.procs
        2062
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Fixes: 8cfd8147 ("cgroup: implement cgroup v2 thread support")
      Cc: stable@vger.kernel.org # v4.14+
      e9d81a1b
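The shape of the fix can be shown with a toy iterator over an array. All names here (`task_model`, `task_iter`, `iter_start()`, `iter_advance()`, `iter_next()`) are hypothetical and the data structure is far simpler than the kernel's css_set lists; the point is only that the start routine goes through the same advance routine, so the first element is subject to the leader-only filter too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_model {
	int id;
	bool leader;	/* true for a thread group leader */
};

struct task_iter {
	const struct task_model *tasks;
	size_t n;
	size_t pos;		/* index of the current task, n when done */
	bool procs_only;	/* skip non-leaders */
	bool started;
};

/* Step to the next task; also handles the very first step. */
static void iter_advance(struct task_iter *it)
{
	if (!it->started) {
		it->started = true;
		it->pos = 0;
	} else if (it->pos < it->n) {
		it->pos++;
	}
	/* The skip logic lives here, so it covers the first task as well. */
	while (it->procs_only && it->pos < it->n &&
	       !it->tasks[it->pos].leader)
		it->pos++;
}

static void iter_start(struct task_iter *it, const struct task_model *tasks,
		       size_t n, bool procs_only)
{
	assert(it && tasks);
	it->tasks = tasks;
	it->n = n;
	it->pos = 0;
	it->procs_only = procs_only;
	it->started = false;
	iter_advance(it);	/* not a separate helper without the filter */
}

/* Return the current task and move on, or NULL when exhausted. */
static const struct task_model *iter_next(struct task_iter *it)
{
	const struct task_model *t;

	if (it->pos >= it->n)
		return NULL;
	t = &it->tasks[it->pos];
	iter_advance(it);
	return t;
}
```

With the task list ordered as non-leader 2031 followed by leader 2030, a process-only walk in this model yields only 2030.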
  8. 13 Nov, 2018 2 commits
  9. 08 Nov, 2018 11 commits
    • cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug · 5cf8114d
      Waiman Long authored
      For debugging purposes, it is useful to expose the content of
      subparts_cpus as a read-only file to see if the code works correctly.
      However, subparts_cpus will not be used at all in most use cases. So
      adding a new cpuset file that clutters the cgroup directory may not be
      desirable.  This is now being done by using the hidden "cgroup_debug"
      kernel command line option to expose a new "cpuset.cpus.subpartitions"
      file.
      
      That option was originally used by the debug controller to expose
      itself when configured into the kernel. This is now extended to set an
      internal flag used by cgroup_addrm_files(). A new CFTYPE_DEBUG flag
      can now be used to specify that a cgroup file should only be created
      when the "cgroup_debug" option is specified.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5cf8114d
    • cpuset: Use descriptive text when reading/writing cpuset.sched.partition · bb5b553c
      Waiman Long authored
      Currently, cpuset.sched.partition returns the values, 0, 1 or -1 on
      read. A person who is not familiar with the partition code may not
      understand what they mean.
      
      In order to make cpuset.sched.partition more user-friendly, it will
      now display the following descriptive text on read:
      
        "root" - A partition root (top cpuset of a partition)
        "member" - A non-root member of a partition
        "root invalid" - An invalid partition root
      
      Note that there is at least one partition in the whole cgroup hierarchy.
      The top cpuset is the root of that partition.  Every other cpuset is
      either a root, if it starts a new partition, or a member of a partition.
      
      The cpuset.sched.partition file will now also accept "root" and
      "member" besides 1 and 0 as valid input values. The "root invalid"
      value is internal only and cannot be written to the file.
      
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bb5b553c
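The read and write mappings described in this commit can be sketched as a pair of small helpers. The enum and function names below are hypothetical, and the parser assumes the input has already been stripped of surrounding whitespace.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for the three states: -1, 0 and 1. */
enum part_state {
	PART_INVALID = -1,
	PART_MEMBER = 0,
	PART_ROOT = 1,
};

/* Text shown when the file is read. */
static const char *part_state_text(enum part_state s)
{
	switch (s) {
	case PART_ROOT:
		return "root";
	case PART_MEMBER:
		return "member";
	default:
		return "root invalid";
	}
}

/*
 * Parse a value written by the user.  Accepts "root"/"1" and
 * "member"/"0".  Returns 0 and stores the state, or -1 for anything
 * else, including "root invalid", which is internal only.
 */
static int part_state_parse(const char *buf, enum part_state *out)
{
	assert(buf && out);
	if (!strcmp(buf, "root") || !strcmp(buf, "1"))
		*out = PART_ROOT;
	else if (!strcmp(buf, "member") || !strcmp(buf, "0"))
		*out = PART_MEMBER;
	else
		return -1;
	return 0;
}
```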
    • cpuset: Expose cpus.effective and mems.effective on cgroup v2 root · 5776cecc
      Waiman Long authored
      Because setting "cpuset.sched.partition" in a direct child of root can
      remove CPUs from the root's effective CPU list, it makes sense to know
      what CPUs are left in the root cgroup for scheduling purposes. So the
      "cpuset.cpus.effective" control file is now exposed in the v2 cgroup
      root.
      
      For consistency, the "cpuset.mems.effective" control file is exposed
      as well.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5776cecc
    • cpuset: Make generate_sched_domains() work with partition · 0ccea8fe
      Waiman Long authored
      The generate_sched_domains() function is modified to make it work
      correctly with the newly introduced subparts_cpus mask for scheduling
      domains generation.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      0ccea8fe
    • cpuset: Make CPU hotplug work with partition · 4b842da2
      Waiman Long authored
      When there is a cpu hotplug event (CPU online or offline), the partitions
      may need to be reconfigured and regenerated. So code is added to the
      hotplug functions to make them work with new subparts_cpus mask to
      compute the right effective_cpus for each of the affected cpusets.
      It may also change the state of a partition root from real one to an
      erroneous one or vice versa.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      4b842da2
    • cpuset: Track cpusets that use parent's effective_cpus · 4716909c
      Waiman Long authored
      In the default hierarchy, a cpuset will use the parent's effective_cpus
      if none of the requested CPUs can be granted from the parent. That can
      be a problem if a parent is a partition root with children partition
      roots. Changes to a parent's effective_cpus list due to changes in a
      child partition root may not be properly reflected in a child cpuset
      that uses the parent's effective_cpus because the cpu_exclusive rule of a
      partition root will not guard against that.
      
      In order to avoid the mismatch, two new tracking variables are added to
      the cpuset structure to track if a cpuset uses parent's effective_cpus
      and the number of children cpusets that use its effective_cpus. So
      whenever cpumask changes are made to a parent, it will also check to
      see if it has other children cpusets that use its effective_cpus and
      call update_cpumasks_hier() if that is the case.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      4716909c
    • cpuset: Add an error state to cpuset.sched.partition · 3881b861
      Waiman Long authored
      When external events like CPU offlining or user events like changing
      the cpu list of an ancestor cpuset happen, update_cpumasks_hier()
      will be called to update the effective cpus of each of the affected
      cpusets. That will then call update_parent_subparts_cpumask() if
      partitions are impacted.
      
      Currently, these events may cause update_parent_subparts_cpumask()
      to return error if none of the requested cpus are available or it will
      consume all the cpus in the parent partition root. Handling these errors
      is problematic as the states may become inconsistent.
      
      Instead of letting update_parent_subparts_cpumask() return error, a new
      error state (-1) is added to the partition_root_state flag to designate
      the fact that the partition is no longer valid. IOW, it is no longer a
      real partition root, but the CS_CPU_EXCLUSIVE flag will still be set
      as it can be changed back to a real one if favorable change happens
      later on.
      
      This new error state is set internally and users cannot write this new
      value to "cpuset.sched.partition".
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3881b861
    • cpuset: Add new v2 cpuset.sched.partition flag · ee8dde0c
      Waiman Long authored
      A new cpuset.sched.partition boolean flag is added to cpuset v2.
      This new flag, if set, indicates that the cgroup is the root of a
      new scheduling domain or partition that includes itself and all its
      descendants except those that are scheduling domain roots themselves
      and their descendants.
      
      With this new flag, one can directly create as many partitions as
      necessary without ever using the v1 trick of turning off load balancing
      in specific cpusets to create partitions as a side effect.
      
      This new flag is owned by the parent and will cause the CPUs in the
      cpuset to be removed from the effective CPUs of its parent.
      
      This is implemented internally by adding a new subparts_cpus mask that
      holds the CPUs belonging to child partitions so that:
      
              subparts_cpus | effective_cpus = cpus_allowed
              subparts_cpus & effective_cpus = 0
      
      This new flag can only be turned on in a cpuset if its parent is a
      partition root itself. The state of this flag cannot be changed if the
      cpuset has children.
      
      Once turned on, further changes to "cpuset.cpus" are allowed as long
      as there is at least one CPU left that can be granted from the parent
      and a child partition root cannot use up all the CPUs in the parent's
      effective_cpus.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ee8dde0c
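The two mask relations stated in this commit can be checked with plain bit operations. The sketch below models each mask as a 64-bit word; `cpuset_model`, `masks_consistent()` and `give_to_child_partition()` are hypothetical names, and the model assumes a parent whose effective CPUs start out equal to its allowed CPUs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct cpuset_model {
	uint64_t cpus_allowed;
	uint64_t effective_cpus;
	uint64_t subparts_cpus;	/* CPUs handed to child partitions */
};

/* subparts | effective == allowed, and subparts & effective == 0. */
static bool masks_consistent(const struct cpuset_model *cs)
{
	return (cs->subparts_cpus | cs->effective_cpus) == cs->cpus_allowed &&
	       (cs->subparts_cpus & cs->effective_cpus) == 0;
}

/*
 * Move `cpus` from the parent's effective set to a child partition.
 * Refuses an empty request, CPUs the parent cannot grant, and a request
 * that would use up all of the parent's effective CPUs.
 */
static bool give_to_child_partition(struct cpuset_model *parent,
				    uint64_t cpus)
{
	assert(parent);
	if (cpus == 0 || (cpus & ~parent->effective_cpus) != 0)
		return false;
	if (cpus == parent->effective_cpus)
		return false;
	parent->effective_cpus &= ~cpus;
	parent->subparts_cpus |= cpus;
	return true;
}
```

Starting from a parent with CPUs 0-3, giving CPUs 0-1 to a child leaves CPUs 2-3 effective in the parent and keeps both relations true; a later request for CPUs 2-3 is refused because it would leave the parent with nothing.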
    • cpuset: Simply allocation and freeing of cpumasks · bf92370c
      Waiman Long authored
      The previous commit introduces a new subparts_cpus mask into the cpuset
      data structure and a new tmpmasks structure.  Managing the allocation
      and freeing of those cpumasks is becoming more complex.
      
      So a number of helper functions are added to simplify and streamline
      the management of those cpumasks. To make it simple, all the cpumasks
      are now pre-cleared on allocation.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bf92370c
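A userspace analogue of such helpers might look as follows. This is a sketch only: `mask_model`, `tmpmasks_model`, `alloc_tmpmasks()` and `free_tmpmasks()` are hypothetical names, `MODEL_NR_CPUS` is an arbitrary size, and `calloc()` stands in for whatever zeroing allocator the kernel helpers use. It shows the two properties the message mentions: all-or-nothing allocation and masks that are already cleared when handed out.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 256	/* arbitrary size for this model */

struct mask_model {
	unsigned char bits[MODEL_NR_CPUS / 8];
};

/* A group of scratch masks allocated and freed together. */
struct tmpmasks_model {
	struct mask_model *addmask;
	struct mask_model *delmask;
	struct mask_model *new_cpus;
};

/* Free every mask in the group; safe on partially allocated groups. */
static void free_tmpmasks(struct tmpmasks_model *tmp)
{
	assert(tmp);
	free(tmp->addmask);
	free(tmp->delmask);
	free(tmp->new_cpus);
	tmp->addmask = tmp->delmask = tmp->new_cpus = NULL;
}

/*
 * Allocate all masks, pre-cleared.  Returns 0 on success; on failure
 * nothing stays allocated and -1 is returned.
 */
static int alloc_tmpmasks(struct tmpmasks_model *tmp)
{
	assert(tmp);
	tmp->addmask = calloc(1, sizeof(*tmp->addmask));
	tmp->delmask = calloc(1, sizeof(*tmp->delmask));
	tmp->new_cpus = calloc(1, sizeof(*tmp->new_cpus));
	if (!tmp->addmask || !tmp->delmask || !tmp->new_cpus) {
		free_tmpmasks(tmp);
		return -1;
	}
	return 0;
}
```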
    • cpuset: Define data structures to support scheduling partition · 58b74842
      Waiman Long authored
      From a cpuset point of view, a scheduling partition is a group of
      cpusets with their own set of exclusive CPUs that are not shared by
      other tasks outside the scheduling partition.
      
      In the legacy hierarchy, scheduling partitions are supported indirectly
      via the right use of the load balancing and the exclusive CPUs flag
      which is not intuitive and can be hard to use.
      
      To fully support the concept of scheduling partitions in the default
      hierarchy, we need to add some new fields into the cpuset structure as
      well as a new tmpmasks structure that is used to pre-allocate cpumasks
      at the top level cpuset functions to avoid memory allocation in inner
      functions as memory allocation failure in those inner functions may
      cause a cpuset to have inconsistent states.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      58b74842
    • cpuset: Enable cpuset controller in default hierarchy · 4ec22e9c
      Waiman Long authored
      Given the fact that thread mode had been merged into 4.14, it is now
      time to enable cpuset to be used in the default hierarchy (cgroup v2)
      as it is clearly threaded.
      
      The cpuset controller had experienced feature creep since its
      introduction more than a decade ago. Besides the core cpus and mems
      control files to limit cpus and memory nodes, there are a bunch of
      additional features that can be controlled from the userspace. Some of
      the features are of doubtful usefulness and may not be actively used.
      
      This patch enables cpuset controller in the default hierarchy with
      a minimal set of features, namely just the cpus and mems and their
      effective_* counterparts.  We can certainly add more features to the
      default hierarchy in the future if there is a real user need for them
      later on.
      
      Alternatively, with the unified hierarchy, it may make more sense
      to move some of those additional cpuset features, if desired, to
      the memory controller or maybe to the cpu controller instead of
      keeping them in cpuset.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      4ec22e9c
  10. 05 Nov, 2018 1 commit
  11. 02 Nov, 2018 1 commit
  12. 26 Oct, 2018 1 commit
  13. 04 Oct, 2018 1 commit
    • cgroup: Fix dom_cgrp propagation when enabling threaded mode · 479adb89
      Tejun Heo authored
      A cgroup which is already a threaded domain may be converted into a
      threaded cgroup if the prerequisite conditions are met.  When this
      happens, all threaded descendants should also have their ->dom_cgrp
      updated to the new threaded domain cgroup.  Unfortunately, this
      propagation was missing, leading to the following failure.
      
        # cd /sys/fs/cgroup/unified
        # cat cgroup.subtree_control    # show that no controllers are enabled
      
        # mkdir -p mycgrp/a/b/c
        # echo threaded > mycgrp/a/b/cgroup.type
      
        At this point, the hierarchy looks as follows:
      
            mycgrp [d]
      	  a [dt]
      	      b [t]
      		  c [inv]
      
        Now let's make node "a" threaded (and thus "mycgrp" is made "domain threaded"):
      
        # echo threaded > mycgrp/a/cgroup.type
      
        By this point, we now have a hierarchy that looks as follows:
      
            mycgrp [dt]
      	  a [t]
      	      b [t]
      		  c [inv]
      
        But, when we try to convert the node "c" from "domain invalid" to
        "threaded", we get ENOTSUP on the write():
      
        # echo threaded > mycgrp/a/b/c/cgroup.type
        sh: echo: write error: Operation not supported
      
      This patch fixes the problem by
      
      * Moving the opencoded ->dom_cgrp save and restoration in
        cgroup_enable_threaded() into cgroup_{save|restore}_control() so
        that multiple cgroups can be handled.
      
      * Updating all threaded descendants' ->dom_cgrp to point to the new
        dom_cgrp when enabling threaded mode.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Reported-by: Amin Jamali <ajamali@pivotal.io>
      Reported-by: Joao De Almeida Pereira <jpereira@pivotal.io>
      Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
      Fixes: 454000ad ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
      Cc: stable@vger.kernel.org # v4.14+
      479adb89
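The propagation step can be modelled with a small tree in userspace. Everything here is a hypothetical simplification: `cg_model`, `set_dom_recursive()` and `enable_threaded()` are made-up names, and the real code also validates preconditions and saves state for rollback, which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

struct cg_model {
	struct cg_model *parent;
	struct cg_model *children[MAX_CHILDREN];
	size_t nr_children;
	bool threaded;
	struct cg_model *dom;	/* domain this cgroup belongs to */
};

/* Point `cg` and all of its threaded descendants at `dom`. */
static void set_dom_recursive(struct cg_model *cg, struct cg_model *dom)
{
	cg->dom = dom;
	for (size_t i = 0; i < cg->nr_children; i++)
		if (cg->children[i]->threaded)
			set_dom_recursive(cg->children[i], dom);
}

/*
 * Make `cg` threaded.  Its new domain is its parent's domain, and that
 * domain must also be propagated to already-threaded descendants.
 */
static void enable_threaded(struct cg_model *cg)
{
	assert(cg && cg->parent);
	cg->threaded = true;
	set_dom_recursive(cg, cg->parent->dom);
}
```

Replaying the sequence from the commit message (make "b" threaded, then "a", then "c") ends with "b" and "c" both pointing at "mycgrp" in this model, which is the state the last write depends on.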
  14. 22 Sep, 2018 1 commit
  15. 21 Jul, 2018 1 commit
  16. 11 Jul, 2018 1 commit
    • cgroup/tracing: Move taking of spin lock out of trace event handlers · e4f8d81c
      Steven Rostedt (VMware) authored
      It is unwise to take spin locks from the handlers of trace events,
      mainly because doing so can introduce lockups by taking locks in
      places that are normally not tested. Worse yet, because trace events
      are tucked away in the include/trace/events/ directory, locks that are
      taken there are forgotten about.
      
      As a general rule, I tell people never to take any locks in a trace
      event handler.
      
      Several cgroup trace event handlers call cgroup_path() which eventually
      takes the kernfs_rename_lock spinlock. This injects the spinlock in the
      code without people realizing it. It also can cause issues for the
      PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
      handlers are called with preemption disabled.
      
      By moving the calculation of the cgroup_path() out of the trace event
      handlers and into a macro (surrounded by a
      trace_cgroup_##type##_enabled() check), we can place the cgroup path
      into a string and pass that to the trace event. Not only does this
      remove the taking of the spinlock from the trace event handler, but
      it also means that cgroup_path() only needs to be called once. It
      is currently called twice: once to get the length to reserve the
      buffer for, and once again to get the path itself.
      
      Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e4f8d81c
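The pattern described above, computing the path in the caller only when tracing is enabled and then passing the finished string to the handler, can be sketched in userspace. All names are hypothetical stand-ins: `trace_enabled` plays the role of the per-event enabled check, `build_path()` the role of the path lookup that takes a lock, and `trace_handler()` the role of the trace event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool trace_enabled;	/* stand-in for the per-event enabled check */
static int path_calls;		/* how often the path was computed */
static char last_event[128];	/* what the handler last recorded */

/* Stand-in for the path lookup; imagine it takes a spinlock. */
static void build_path(int id, char *buf, size_t len)
{
	assert(len > 0);
	path_calls++;
	snprintf(buf, len, "/cg/%d", id);
}

/* The handler only consumes a ready-made string and takes no locks. */
static void trace_handler(const char *type, const char *path)
{
	snprintf(last_event, sizeof(last_event), "%s: %s", type, path);
}

/* Compute the path once, outside the handler, and only if enabled. */
#define TRACE_CGROUP_PATH(type, id)					\
	do {								\
		if (trace_enabled) {					\
			char path_[64];					\
			build_path((id), path_, sizeof(path_));		\
			trace_handler(#type, path_);			\
		}							\
	} while (0)
```

When `trace_enabled` is false, `TRACE_CGROUP_PATH(mkdir, 7)` does no path work at all; when it is true, the path is built exactly once per event.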
  17. 15 Jun, 2018 1 commit
  18. 12 Jun, 2018 2 commits
    • treewide: Use array_size() in vmalloc() · 42bc47b3
      Kees Cook authored
      The vmalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vmalloc(a * b)
      
      with:
      
              vmalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vmalloc(a * b * c)
      
      with:
      
              vmalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vmalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
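
The reason for wrapping the factors is overflow: a plain `a * b` can wrap around to a small value, whereas an overflow-checked product can saturate so that the allocation fails instead. A minimal userspace sketch of that idea follows. The `_model` names are hypothetical and are not the kernel helpers themselves, and `__builtin_mul_overflow()` is a GCC/Clang builtin, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a * b, saturating to SIZE_MAX if the product overflows. */
static size_t array_size_model(size_t a, size_t b)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* a * b * c, saturating to SIZE_MAX if any step overflows. */
static size_t array3_size_model(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes))
		return SIZE_MAX;
	if (__builtin_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}
```

An allocator handed `SIZE_MAX` cannot satisfy the request, so the caller sees an allocation failure rather than an undersized buffer.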
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vmalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vmalloc(C1 * C2 * C3, ...)
      |
        vmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vmalloc(C1 * C2, ...)
      |
        vmalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      42bc47b3
    • Kees Cook's avatar
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook authored
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:

              kmalloc_array(a, b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
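
      The difference between the two forms can be sketched in userspace.
      This assumes a 2-factor allocator that returns NULL when the
      product overflows; kmalloc_array_sketch() and open_coded_bytes()
      are hypothetical names, malloc() stands in for kmalloc(), and the
      gfp argument is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* What the open-coded form computes: the product silently wraps. */
static size_t open_coded_bytes(size_t n, size_t size)
{
	return n * size;
}

/* 2-factor form: refuse the allocation if n * size overflows. */
static void *kmalloc_array_sketch(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;
	return malloc(bytes);
}
```

      For n = SIZE_MAX / 4 + 2 and size = 4, the open-coded product
      wraps around to 4 bytes, whereas the 2-factor form fails the
      allocation.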
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      6da2ec56
  19. 06 Jun, 2018 1 commit
    • Kees Cook's avatar
      treewide: Use struct_size() for kmalloc()-family · acafe7e3
      Kees Cook authored
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct foo {
          int stuff;
          void *entry[];
      };
      
      instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
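
      As a rough userspace illustration of what such a helper computes,
      the sketch below derives the size from the pointer's own type and
      the trailing member, and saturates to SIZE_MAX on overflow. The
      names struct_size_sketch(), struct_size_helper(), and foo_alloc()
      are hypothetical, and malloc() stands in for kmalloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {
	int stuff;
	void *entry[];
};

/* base + elem * count, or SIZE_MAX if either step overflows. */
static size_t struct_size_helper(size_t base, size_t elem, size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(elem, count, &bytes))
		return SIZE_MAX;
	if (__builtin_add_overflow(bytes, base, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Sizes come from the types of *p and p->member, never typed by hand. */
#define struct_size_sketch(p, member, count)				\
	struct_size_helper(sizeof(*(p)), sizeof(*(p)->member), (count))

/* Allocates a struct foo with room for "count" trailing entries. */
static struct foo *foo_alloc(size_t count)
{
	struct foo *instance;
	size_t bytes = struct_size_sketch(instance, entry, count);

	if (bytes == SIZE_MAX)
		return NULL;
	instance = malloc(bytes);
	if (instance)
		instance->stuff = 0;
	return instance;
}
```

      Because the element size is taken from the member itself, changing
      the type of entry[] cannot leave a stale sizeof() behind at the
      call site.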
      
      This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
      uses. It was done via automatic conversion with manual review for the
      "CHECKME" non-standard cases noted below, using the following Coccinelle
      script:
      
      // pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
      //                      sizeof *pkey_cache->table, GFP_KERNEL);
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      identifier VAR, ELEMENT;
      expression COUNT;
      @@
      
      - alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
      + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
      
      // mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      identifier VAR, ELEMENT;
      expression COUNT;
      @@
      
      - alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
      + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
      
      // Same pattern, but can't trivially locate the trailing element name,
      // or variable name.
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      expression SOMETHING, COUNT, ELEMENT;
      @@
      
      - alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
      + alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)
      Signed-off-by: Kees Cook <keescook@chromium.org>
      acafe7e3
  20. 23 May, 2018 1 commit
    • Tejun Heo's avatar
      cgroup: css_set_lock should nest inside tasklist_lock · d8742e22
      Tejun Heo authored
      cgroup_enable_task_cg_lists() incorrectly nests the non-irq-safe
      tasklist_lock inside the irq-safe css_set_lock, triggering the
      following lockdep warning.
      
        WARNING: possible irq lock inversion dependency detected
        4.17.0-rc1-00027-gb37d049 #6 Not tainted
        --------------------------------------------------------
        systemd/1 just changed the state of lock:
        00000000fe57773b (css_set_lock){..-.}, at: cgroup_free+0xf2/0x12a
        but this lock took another, SOFTIRQ-unsafe lock in the past:
         (tasklist_lock){.+.+}
      
        and interrupts could create inverse lock ordering between them.
      
        other info that might help us debug this:
         Possible interrupt unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(tasklist_lock);
      				 local_irq_disable();
      				 lock(css_set_lock);
      				 lock(tasklist_lock);
          <Interrupt>
            lock(css_set_lock);
      
         *** DEADLOCK ***
      
      The condition is highly unlikely to actually happen, especially
      given that the path is executed only once per boot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Boqun Feng <boqun.feng@gmail.com>
      d8742e22
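
      The nesting rule in this commit's title can be expressed as a toy
      rank check. This is only a sketch of lock-order validation under
      assumed ranks. It is not how lockdep works, and it does not model
      interrupts, which are what make the inversion in the report
      dangerous. All names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ranks: a lock may only be taken while holding lower ranks. */
enum lock_rank { RANK_TASKLIST = 1, RANK_CSS_SET = 2 };

#define MAX_HELD 8
static enum lock_rank held[MAX_HELD];
static int nr_held;

/* Returns false, taking nothing, if the acquisition breaks the order. */
static bool rank_lock(enum lock_rank rank)
{
	if (nr_held == MAX_HELD)
		return false;
	if (nr_held > 0 && held[nr_held - 1] >= rank)
		return false;
	held[nr_held++] = rank;
	return true;
}

/* Locks must be released in reverse order of acquisition. */
static void rank_unlock(enum lock_rank rank)
{
	assert(nr_held > 0 && held[nr_held - 1] == rank);
	nr_held--;
}

/* Takes outer then inner, releases both, and reports whether it was legal. */
static bool take_in_order(enum lock_rank outer, enum lock_rank inner)
{
	if (!rank_lock(outer))
		return false;
	if (!rank_lock(inner)) {
		rank_unlock(outer);
		return false;
	}
	rank_unlock(inner);
	rank_unlock(outer);
	return true;
}
```

      Taking tasklist_lock first and css_set_lock second passes the
      check; the reverse order, which is what the warning reports, is
      rejected.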
  21. 16 May, 2018 1 commit
  22. 07 May, 2018 1 commit
  23. 26 Apr, 2018 3 commits
    • Tejun Heo's avatar
      cgroup: Make cgroup_rstat_updated() ready for root cgroup usage · c43c5ea7
      Tejun Heo authored
      cgroup_rstat_updated() ensures that the cgroup's rstat is linked to
      the parent.  If there's no parent, it never gets linked and the
      function ends up grabbing and releasing the cgroup_rstat_lock each
      time for no reason, which can be expensive.
      
      This hasn't been a problem till now because nobody was calling the
      function for the root cgroup but rstat is gonna be exposed to
      controllers and use cases, so let's get ready.  Make
      cgroup_rstat_updated() a no-op for the root cgroup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      c43c5ea7
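
      A small model of the behavior described in this commit, assuming a
      counter in place of the real lock and a boolean in place of the
      real list linkage. The root_noop parameter exists only to compare
      the old and new behavior; every name here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cgroup_sketch {
	struct cgroup_sketch *parent;	/* NULL for the root cgroup */
	bool linked;			/* already linked to the parent */
};

static int lock_acquisitions;	/* stand-in for cgroup_rstat_lock traffic */

static void rstat_updated_sketch(struct cgroup_sketch *cgrp, bool root_noop)
{
	/* New behavior: the root has nothing to link to, so bail out early. */
	if (root_noop && !cgrp->parent)
		return;

	/* Fast path: an already linked cgroup needs no locking. */
	if (cgrp->linked)
		return;

	lock_acquisitions++;		/* lock taken and released */
	if (cgrp->parent)
		cgrp->linked = true;	/* only non-root cgroups get linked */
}
```

      Without the early return, every call on the root pays for the
      lock because the root never becomes linked; a child pays only on
      its first update.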
    • Tejun Heo's avatar
      cgroup: Add memory barriers to plug cgroup_rstat_updated() race window · 9a9e97b2
      Tejun Heo authored
      cgroup_rstat_updated() has a small race window where an update
      signal can race with a flush and be lost until the next update.
      This wasn't a problem for the existing usages, but we plan to use
      rstat to track counters which need to be accurate.
      
      This patch plugs the race window by synchronizing
      cgroup_rstat_updated() and flush path with memory barriers around
      cgroup_rstat_cpu->updated_next pointer.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      9a9e97b2
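
      One way to picture the pairing of barriers is the C11 sketch
      below. It is an assumption about the general shape of such a fix,
      not the patch itself: a single flag stands in for the updated_next
      linkage, full fences stand in for the kernel's memory barriers,
      and all names are hypothetical. The updater publishes its delta
      before checking the flag, and the flusher clears the flag before
      reading the delta, so at least one side observes the other.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct rstat_cpu_sketch {
	atomic_long pending;		/* delta not yet flushed */
	_Atomic(void *) updated_next;	/* non-NULL while queued for flush */
};

static char on_list;	/* any non-NULL marker value */

static void rstat_init_sketch(struct rstat_cpu_sketch *rc)
{
	atomic_init(&rc->pending, 0);
	atomic_init(&rc->updated_next, NULL);
}

/* Updater: publish the delta, then check whether we are already queued. */
static bool rstat_updated_sketch(struct rstat_cpu_sketch *rc, long delta)
{
	atomic_fetch_add_explicit(&rc->pending, delta, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* pairs with flush */
	if (atomic_load_explicit(&rc->updated_next, memory_order_relaxed))
		return false;	/* already queued, the flusher will see us */
	atomic_store_explicit(&rc->updated_next, &on_list,
			      memory_order_relaxed);
	return true;		/* newly queued */
}

/* Flusher: unqueue first, then collect whatever delta is visible. */
static long rstat_flush_sketch(struct rstat_cpu_sketch *rc)
{
	atomic_store_explicit(&rc->updated_next, NULL, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* pairs with updater */
	return atomic_exchange_explicit(&rc->pending, 0,
					memory_order_relaxed);
}
```

      With both fences in place, either the updater sees the cleared
      flag and queues itself again, or the flusher sees the new delta.
      Dropping either fence reopens the window in which an update can
      go unnoticed until the next one.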
    • Tejun Heo's avatar
      cgroup: Add cgroup_subsys->css_rstat_flush() · 8f53470b
      Tejun Heo authored
      This patch adds cgroup_subsys->css_rstat_flush().  If a subsystem has
      this callback, its csses are linked on cgrp->css_rstat_list and rstat
      will call the function whenever the associated cgroup is flushed.
      Flush is also performed when such csses are released so that residual
      counts aren't lost.
      
      Combined with the rstat API previous patches factored out, this allows
      controllers to plug into rstat to manage their statistics in a
      scalable way.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      8f53470b
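
      The callback arrangement described in this commit can be sketched
      with reduced, hypothetical structures: a css is linked on its
      cgroup's list only if its subsystem provides the flush hook, and
      flushing the cgroup walks that list. The field names, the two-CPU
      array, and demo_flush() are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct css_sketch;

/* Reduced subsystem: the flush hook is optional. */
struct subsys_sketch {
	void (*css_rstat_flush)(struct css_sketch *css, int cpu);
};

struct css_sketch {
	struct subsys_sketch *ss;
	struct css_sketch *rstat_next;	/* link on the cgroup's list */
	long percpu_delta[2];		/* pending counts, one per CPU */
	long total;			/* flushed total */
};

struct cgroup_sketch {
	struct css_sketch *css_rstat_list;
};

/* Link a css only if its subsystem implements the callback. */
static void css_link_sketch(struct cgroup_sketch *cgrp, struct css_sketch *css)
{
	if (!css->ss->css_rstat_flush)
		return;
	css->rstat_next = cgrp->css_rstat_list;
	cgrp->css_rstat_list = css;
}

/* Flushing a cgroup invokes the hook of every linked css. */
static void cgroup_flush_sketch(struct cgroup_sketch *cgrp, int cpu)
{
	struct css_sketch *css;

	for (css = cgrp->css_rstat_list; css; css = css->rstat_next)
		css->ss->css_rstat_flush(css, cpu);
}

/* Example controller hook: fold one CPU's delta into the total. */
static void demo_flush(struct css_sketch *css, int cpu)
{
	css->total += css->percpu_delta[cpu];
	css->percpu_delta[cpu] = 0;
}
```

      A controller without the hook is never linked and so costs nothing
      at flush time, while one with the hook has its per-CPU deltas
      folded in whenever the cgroup is flushed.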