1. 18 Oct, 2020 1 commit
    • Jann Horn's avatar
      binfmt_elf: take the mmap lock around find_extend_vma() · b2767d97
      Jann Horn authored
      
      create_elf_tables() runs after setup_new_exec(), so other tasks can
      already access our new mm and do things like process_madvise() on it.  (At
      the time I'm writing this commit, process_madvise() is not in mainline
      yet, but has been in akpm's tree for some time.)
      
      While I believe that there are currently no APIs that would actually allow
      another process to mess up our VMA tree (process_madvise() is limited to
      MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
      under which no syscalls have been executed yet), this seems like an
      accident waiting to happen.
      
      Let's make sure that we always take the mmap lock around GUP paths as long
      as another process might be able to see the mm.
      
      (Yes, this diff looks suspicious because we drop the lock before doing
      anything with `vma`, but that's because we actually don't do anything with
      it apart from the NULL check.)
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b2767d97
  2. 16 Oct, 2020 4 commits
    • Jann Horn's avatar
      binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot · a07279c9
      Jann Horn authored
      
      In both binfmt_elf and binfmt_elf_fdpic, use a new helper
      dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
      VMA, if we have one) while protected by the mmap_lock, and then use that
      snapshot instead of walking the VMA list without locking.
      
      An alternative approach would be to keep the mmap_lock held across the
      entire core dumping operation; however, keeping the mmap_lock locked while
      we may be blocked for an unbounded amount of time (e.g.  because we're
      dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
      blocks things like the ->release handler of userfaultfd, and we don't
      really want critical system daemons to grind to a halt just because
      someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
      something like that.
      
      Since both the normal ELF code and the FDPIC ELF code need this
      functionality (and if any other binfmt wants to add coredump support in
      the future, they'd probably need it, too), implement this with a common
      helper in fs/coredump.c.
      
      A downside of this approach is that we now need a bigger amount of kernel
      memory per userspace VMA in the normal ELF case, and that we need O(n)
      kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
      be terribly bad.
      
      There currently is a data race between stack expansion and anything that
      reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
      mitigate that for core dumping, take the mmap_lock in write mode when
      taking a snapshot of the VMA hierarchy.  (If we only took the mmap_lock in
      read mode, we could end up with a corrupted core dump if someone does
      get_user_pages_remote() concurrently.  Not really a major problem, but
      taking the mmap_lock either way works here, so we might as well avoid the
      issue.) (This doesn't do anything about the existing data races with stack
      expansion in other mm code.)
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.com
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a07279c9
    • Jann Horn's avatar
      coredump: rework elf/elf_fdpic vma_dump_size() into common helper · 429a22e7
      Jann Horn authored
      
      At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
      different code to figure out which VMAs should be dumped, and if so,
      whether the dump should contain the entire VMA or just its first page.
      
      Eliminate duplicate code by reworking the binfmt_elf version into a
      generic core dumping helper in coredump.c.
      
      As part of that, change the heuristic for detecting executable/library
      header pages to check whether the inode is executable instead of looking
      at the file mode.
      
      This is less problematic in terms of locking because it lets us avoid
      get_user() under the mmap_sem.  (And arguably it looks nicer and makes
      more sense in generic code.)
      
      Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
      only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
      has been written to.
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-5-jannh@google.com
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      429a22e7
    • Jann Horn's avatar
      coredump: refactor page range dumping into common helper · afc63a97
      Jann Horn authored
      
      Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of
      pages into the coredump file.  Extract that logic into a common helper.
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-4-jannh@google.com
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      afc63a97
    • Chris Kennelly's avatar
      fs/binfmt_elf: use PT_LOAD p_align values for suitable start address · ce81bb25
      Chris Kennelly authored
      Patch series "Selecting Load Addresses According to p_align", v3.
      
      The current ELF loading mechancism provides page-aligned mappings.  This
      can lead to the program being loaded in a way unsuitable for file-backed,
      transparent huge pages when handling PIE executables.
      
      While specifying -z,max-page-size=0x200000 to the linker will generate
      suitably aligned segments for huge pages on x86_64, the executable needs
      to be loaded at a suitably aligned address as well.  This alignment
      requires the binary's cooperation, as distinct segments need to be
      appropriately paddded to be eligible for THP.
      
      For binaries built with increased alignment, this limits the number of
      bits usable for ASLR, but provides some randomization over using fixed
      load addresses/non-PIE binaries.
      
      This patch (of 2):
      
      The current ELF loading mechancism provides page-aligned mappings.  This
      can lead to the program being loaded in a way unsuitable for file-backed,
      transparent huge pages when handling PIE executables.
      
      For binaries built with increased alignment, this limits the number of
      bits usable for ASLR, but provides some randomization over using fixed
      load addresses/non-PIE binaries.
      
      Tested by verifying program with -Wl,-z,max-page-size=0x200000 loading.
      
      [akpm@linux-foundation.org: fix max() warning]
      [ckennelly@google.com: augment comment]
        Link: https://lkml.kernel.org/r/20200821233848.3904680-2-ckennelly@google.com
      
      Signed-off-by: default avatarChris Kennelly <ckennelly@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Hugh Dickens <hughd@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Fangrui Song <maskray@google.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: https://lkml.kernel.org/r/20200820170541.1132271-1-ckennelly@google.com
      Link: https://lkml.kernel.org/r/20200820170541.1132271-2-ckennelly@google.com
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce81bb25
  3. 27 Jul, 2020 2 commits
    • Al Viro's avatar
      kill elf_fpxregs_t · 7a896028
      Al Viro authored
      
      all uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
      been defined on any architecture since 2010
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      7a896028
    • Al Viro's avatar
      introduction of regset ->get() wrappers, switching ELF coredumps to those · b4e9c954
      Al Viro authored
      
      Two new helpers: given a process and regset, dump into a buffer.
      regset_get() takes a buffer and size, regset_get_alloc() takes size
      and allocates a buffer.
      
      Return value in both cases is the amount of data actually dumped in
      case of success or -E...  on error.
      
      In both cases the size is capped by regset->n * regset->size, so
      ->get() is called with offset 0 and size no more than what regset
      expects.
      
      binfmt_elf.c callers of ->get() are switched to using those; the other
      caller (copy_regset_to_user()) will need some preparations to switch.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      b4e9c954
  4. 05 Jun, 2020 1 commit
  5. 03 Jun, 2020 1 commit
  6. 28 May, 2020 1 commit
  7. 21 May, 2020 1 commit
    • Eric W. Biederman's avatar
      exec: Generic execfd support · b8a61c9e
      Eric W. Biederman authored
      Most of the support for passing the file descriptor of an executable
      to an interpreter already lives in the generic code and in binfmt_elf.
      Rework the fields in binfmt_elf that deal with executable file
      descriptor passing to make executable file descriptor passing a first
      class concept.
      
      Move the fd_install from binfmt_misc into begin_new_exec after the new
      creds have been installed.  This means that accessing the file through
      /proc/<pid>/fd/N is able to see the creds for the new executable
      before allowing access to the new executables files.
      
      Performing the install of the executables file descriptor after
      the point of no return also means that nothing special needs to
      be done on error.  The exiting of the process will close all
      of it's open files.
      
      Move the would_dump from binfmt_misc into begin_new_exec right
      after would_dump is called on the bprm->file.  This makes it
      obvious this case exists and that no nesting of bprm->file is
      currently supported.
      
      In binfmt_misc the movement of fd_install into generic code means
      that it's special error exit path is no longer needed.
      
      Link: https://lkml.kernel.org/r/87y2poyd91.fsf_-_@x220.int.ebiederm.org
      
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      b8a61c9e
  8. 07 May, 2020 2 commits
  9. 05 May, 2020 2 commits
  10. 07 Apr, 2020 4 commits
  11. 16 Mar, 2020 2 commits
  12. 31 Jan, 2020 8 commits
  13. 05 Dec, 2019 2 commits
  14. 15 Nov, 2019 1 commit
    • Arnd Bergmann's avatar
      y2038: elfcore: Use __kernel_old_timeval for process times · e2bb80d5
      Arnd Bergmann authored
      
      We store elapsed time for a crashed process in struct elf_prstatus using
      'timeval' structures. Once glibc starts using 64-bit time_t, this becomes
      incompatible with the kernel's idea of timeval since the structure layout
      no longer matches on 32-bit architectures.
      
      This changes the definition of the elf_prstatus structure to use
      __kernel_old_timeval instead, which is hardcoded to the currently used
      binary layout. There is no risk of overflow in y2038 though, because
      the time values are all relative times, and can store up to 68 years
      of process elapsed time.
      
      There is a risk of applications breaking at build time when they
      use the new kernel headers and expect the type to be exactly 'timeval'
      rather than a structure that has the same fields as before. Those
      applications have to be modified to deal with 64-bit time_t anyway.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      e2bb80d5
  15. 06 Oct, 2019 1 commit
    • Linus Torvalds's avatar
      elf: don't use MAP_FIXED_NOREPLACE for elf executable mappings · b212921b
      Linus Torvalds authored
      In commit 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map") we
      changed elf to use MAP_FIXED_NOREPLACE instead of MAP_FIXED for the
      executable mappings.
      
      Then, people reported that it broke some binaries that had overlapping
      segments from the same file, and commit ad55eac7
      
       ("elf: enforce
      MAP_FIXED on overlaying elf segments") re-instated MAP_FIXED for some
      overlaying elf segment cases.  But only some - despite the summary line
      of that commit, it only did it when it also does a temporary brk vma for
      one obvious overlapping case.
      
      Now Russell King reports another overlapping case with old 32-bit x86
      binaries, which doesn't trigger that limited case.  End result: we had
      better just drop MAP_FIXED_NOREPLACE entirely, and go back to MAP_FIXED.
      
      Yes, it's a sign of old binaries generated with old tool-chains, but we
      do pride ourselves on not breaking existing setups.
      
      This still leaves MAP_FIXED_NOREPLACE in place for the load_elf_interp()
      and the old load_elf_library() use-cases, because nobody has reported
      breakage for those. Yet.
      
      Note that in all the cases seen so far, the overlapping elf sections
      seem to be just re-mapping of the same executable with different section
      attributes.  We could possibly introduce a new MAP_FIXED_NOFILECHANGE
      flag or similar, which acts like NOREPLACE, but allows just remapping
      the same executable file using different protection flags.
      
      It's not clear that would make a huge difference to anything, but if
      people really hate that "elf remaps over previous maps" behavior, maybe
      at least a more limited form of remapping would alleviate some concerns.
      
      Alternatively, we should take a look at our elf_map() logic to see if we
      end up not mapping things properly the first time.
      
      In the meantime, this is the minimal "don't do that then" patch while
      people hopefully think about it more.
      Reported-by: default avatarRussell King <linux@armlinux.org.uk>
      Fixes: 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map")
      Fixes: ad55eac7
      
       ("elf: enforce  MAP_FIXED on overlaying elf segments")
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b212921b
  16. 26 Sep, 2019 1 commit
  17. 24 Sep, 2019 1 commit
  18. 17 Jul, 2019 1 commit
  19. 21 May, 2019 1 commit
  20. 15 May, 2019 3 commits
    • Kees Cook's avatar
      binfmt_elf: move brk out of mmap when doing direct loader exec · bbdc6076
      Kees Cook authored
      Commmit eab09532 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"),
      made changes in the rare case when the ELF loader was directly invoked
      (e.g to set a non-inheritable LD_LIBRARY_PATH, testing new versions of
      the loader), by moving into the mmap region to avoid both ET_EXEC and
      PIE binaries.  This had the effect of also moving the brk region into
      mmap, which could lead to the stack and brk being arbitrarily close to
      each other.  An unlucky process wouldn't get its requested stack size
      and stack allocations could end up scribbling on the heap.
      
      This is illustrated here.  In the case of using the loader directly, brk
      (so helpfully identified as "[heap]") is allocated with the _loader_ not
      the binary.  For example, with ASLR entirely disabled, you can see this
      more clearly:
      
      $ /bin/cat /proc/self/maps
      555555554000-55555555c000 r-xp 00000000 ... /bin/cat
      55555575b000-55555575c000 r--p 00007000 ... /bin/cat
      55555575c000-55555575d000 rw-p 00008000 ... /bin/cat
      55555575d000-55555577e000 rw-p 00000000 ... [heap]
      ...
      7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
      7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
      7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
      7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
      7ffff7ffe000-7ffff7fff000 rw-p 00000000 ...
      7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]
      
      $ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps
      ...
      7ffff7bcc000-7ffff7bd4000 r-xp 00000000 ... /bin/cat
      7ffff7bd4000-7ffff7dd3000 ---p 00008000 ... /bin/cat
      7ffff7dd3000-7ffff7dd4000 r--p 00007000 ... /bin/cat
      7ffff7dd4000-7ffff7dd5000 rw-p 00008000 ... /bin/cat
      7ffff7dd5000-7ffff7dfc000 r-xp 00000000 ... /lib/x86_64-linux-gnu/ld-2.27.so
      7ffff7fb2000-7ffff7fd6000 rw-p 00000000 ...
      7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
      7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
      7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
      7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
      7ffff7ffe000-7ffff8020000 rw-p 00000000 ... [heap]
      7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]
      
      The solution is to move brk out of mmap and into ELF_ET_DYN_BASE since
      nothing is there in the direct loader case (and ET_EXEC is still far
      away at 0x400000).  Anything that ran before should still work (i.e.
      the ultimately-launched binary already had the brk very far from its
      text, so this should be no different from a COMPAT_BRK standpoint).  The
      only risk I see here is that if someone started to suddenly depend on
      the entire memory space lower than the mmap region being available when
      launching binaries via a direct loader execs which seems highly
      unlikely, I'd hope: this would mean a binary would _not_ work when
      exec()ed normally.
      
      (Note that this is only done under CONFIG_ARCH_HAS_ELF_RANDOMIZATION
      when randomization is turned on.)
      
      Link: http://lkml.kernel.org/r/20190422225727.GA21011@beast
      Link: https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=Omfhx_p0nCKPSjA@mail.gmail.com
      Fixes: eab09532
      
       ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reported-by: default avatarAli Saidi <alisaidi@amazon.com>
      Cc: Ali Saidi <alisaidi@amazon.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bbdc6076
    • Alexey Dobriyan's avatar
      elf: init pt_regs pointer later · 249b08e4
      Alexey Dobriyan authored
      Get "current_pt_regs" pointer right before usage.
      
      Space savings on x86_64:
      
      	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-180 (-180)
      	Function                           old     new   delta
      	load_elf_binary                   5806    5626    -180 !!!
      
      Looks like the compiler doesn't know that "current_pt_regs" is stable
      pointer (because it doesn't know ->stack isn't) even though it knows
      that "current" is stable pointer.  So it saves it in the very beginning
      and then tries to carry it through a lot of code.
      
      Here is what happens here:
      
      load_elf_binary()
      		...
      	mov	rax,QWORD PTR gs:0x14c00
      	mov	r13,QWORD PTR [rax+0x18]	r13 = current->stack
      	call	kmem_cache_alloc		# first kmalloc
      
      		[980 bytes later!]
      
      	# let's spill that sucker because we need a register
      	# for "load_bias" calculations at
      	#
      	#	if (interpreter) {
      	#		load_bias = ELF_ET_DYN_BASE;
      	#		if (current->flags & PF_RANDOMIZE)
      	#			load_bias += arch_mmap_rnd();
      	#		elf_flags |= elf_fixed;
      	#	}
      	mov	QWORD PTR [rsp+0x68],r13
      
      If this is not _the_ root cause it is still eeeeh.
      
      After the patch things become much simpler:
      
      	mov	rax, QWORD PTR gs:0x14c00	# current
      	mov	rdx, QWORD PTR [rax+0x18]	# current->stack
      	movq	[rdx+0x3fb8], 0			# fill pt_regs
      		...
      	call finalize_exec
      
      Link: http://lkml.kernel.org/r/20190419200343.GA19788@avx2
      
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Tested-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      249b08e4
    • Alexey Dobriyan's avatar
      fs/binfmt_elf.c: extract PROT_* calculations · d8e7cb39
      Alexey Dobriyan authored
      There are two places where mapping protections are calculated: one for
      executable, another one for interpreter -- take them out.
      
      ELF read and execute permissions are interchanged with Linux PROT_READ
      and PROT_EXEC, microoptimizations are welcome!
      
      Link: http://lkml.kernel.org/r/20190417213413.GB26474@avx2
      
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d8e7cb39