  1. 21 Nov, 2018 1 commit
    • Remove WITH OIDS support, change oid catalog column visibility. · 578b2297
      Andres Freund authored
      Previously tables declared WITH OIDS, including a significant fraction
      of the catalog tables, stored the oid column not as a normal column,
      but as part of the tuple header.
      This special column was not shown by default, which was somewhat odd,
      as it's often (consider e.g. pg_class.oid) one of the more important
      parts of a row.  Neither pg_dump nor COPY included the contents of the
      oid column by default.
      The fact that the oid column was not an ordinary column necessitated a
      significant amount of special case code to support oid columns. That
      was already painful for the existing code, but upcoming work aiming to
      make table storage pluggable would have required expanding and
      duplicating that "specialness" significantly.
      WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
      Remove it.
      Removing includes:
      - CREATE TABLE and ALTER TABLE syntax for declaring the table to be
        WITH OIDS has been removed (WITH (oids[ = true]) will error out)
      - pg_dump does not support dumping tables declared WITH OIDS and will
        issue a warning when dumping one (and ignore the oid column).
      - restoring a pg_dump archive with pg_restore will warn when
        restoring a table with oid contents (and ignore the oid column)
      - COPY will refuse to load a binary dump that includes oids.
      - pg_upgrade will error out when encountering tables declared WITH
        OIDS; they have to be altered to remove the oid column first.
      - Functionality to access the oid of the last inserted row (like
        plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
      The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
      for CREATE TABLE) is still supported. While that requires a bit of
      support code, it seems unnecessary to break applications / dumps that
      do not use oids, and are explicit about not using them.
      The biggest user of WITH OID columns was postgres' catalog. This
      commit changes all 'magic' oid columns to be columns that are normally
      declared and stored. To reduce unnecessary query breakage all the
      newly added columns are still named 'oid', even if a table's column
      naming scheme would indicate 'reloid' or such.  This obviously
      requires adapting a lot of code, mostly replacing oid access via
      HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
      The bootstrap process now assigns oids for all oid columns that do not
      have an explicit value (starting at the largest oid previously used);
      only oids assigned later will be above FirstBootstrapObjectId. As the
      oid column now is a normal column, the special bootstrap syntax for
      oids has been removed.
      Oids are not automatically assigned during insertion anymore; all
      backend code explicitly assigns oids with GetNewOidWithIndex(). For
      the rare case that insertions into the catalog via SQL are called for,
      the new pg_nextoid() function can be used (which only works on catalog
      tables).
      The fact that oid columns on system tables are now normal columns
      means that they will be included in the set of columns expanded
      by * (i.e. SELECT * FROM pg_class will now include the table's oid,
      previously it did not). It'd not technically be hard to hide the oid
      column by default, but that'd mean confusing behavior would either
      have to be carried forward forever, or it'd cause breakage down the
      line.
      While it's not unlikely that further adjustments are needed, the
      scope/invasiveness of the patch makes it worthwhile to merge this
      now. It's painful to maintain externally, too complicated to commit
      after the code freeze, and a dependency of a number of other
      Catversion bump, for obvious reasons.
      Author: Andres Freund, with contributions by John Naylor
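      The explicit-assignment rule described above can be pictured with a toy
      model. This is only a sketch in Python, not the backend code: ToyCatalog,
      new_oid_with_index and the starting value 1000 are hypothetical names
      and numbers chosen for illustration. It assumes the allocator simply
      draws candidates from a counter until it finds one that is not yet in
      the table's unique index on oid.

```python
from itertools import count


class ToyCatalog:
    """A catalog table whose 'oid' is an ordinary, explicitly supplied column."""

    def __init__(self):
        # Stands in for the unique index on the oid column.
        self.rows = {}

    def insert(self, row):
        # Nothing is assigned automatically: the caller must provide the oid.
        if "oid" not in row:
            raise ValueError("oid must be assigned explicitly before insertion")
        if row["oid"] in self.rows:
            raise ValueError(f"duplicate oid {row['oid']}")
        self.rows[row["oid"]] = row


def new_oid_with_index(catalog, counter):
    """Draw candidate oids until one is not already present in the index."""
    for candidate in counter:
        if candidate not in catalog.rows:
            return candidate


# Usage: the caller asks for an oid first, then inserts a complete row.
catalog = ToyCatalog()
oid_counter = count(1000)  # hypothetical starting point
catalog.insert({"oid": new_oid_with_index(catalog, oid_counter), "relname": "t1"})
```

      Because the oid is just another key in the row, expanding every column
      of a row in this model naturally includes it, which mirrors the SELECT *
      behavior change noted in the commit message.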
  2. 20 Nov, 2018 7 commits
  3. 19 Nov, 2018 14 commits
    • Add needed #include. · cb09903f
      Tom Lane authored
      Per POSIX, WIFSIGNALED and related macros are provided by <sys/wait.h>.
      Apparently on Linux they're also pulled in by some other inclusions,
      but BSD-ish systems are pickier.  Fixes portability issue in ffa4cbd6.
      Per buildfarm.
    • Handle EPIPE more sanely when we close a pipe reading from a program. · ffa4cbd6
      Tom Lane authored
      Previously, any program launched by COPY TO/FROM PROGRAM inherited the
      server's setting of SIGPIPE handling, i.e. SIG_IGN.  Hence, if we were
      doing COPY FROM PROGRAM and closed the pipe early, the child process
      would see EPIPE on its output file and typically would treat that as
      a fatal error, in turn causing the COPY to report error.  Similarly,
      one could get a failure report from a query that didn't read all of
      the output from a contrib/file_fdw foreign table that uses file_fdw's
      PROGRAM option.
      To fix, ensure that child programs inherit SIG_DFL not SIG_IGN processing
      of SIGPIPE.  This seems like an all-around better situation since if
      the called program wants some non-default treatment of SIGPIPE, it would
      expect to have to set that up for itself.  Then in COPY, if it's COPY
      FROM PROGRAM and we stop reading short of detecting EOF, treat a SIGPIPE
      exit from the called program as a non-error condition.  This still allows
      us to report an error for any case where the called program gets SIGPIPE
      on some other file descriptor.
      As coded, we won't report a SIGPIPE if we stop reading as a result of
      seeing an in-band EOF marker (e.g. COPY BINARY EOF marker).  It's
      somewhat debatable whether we should complain if the called program
      continues to transmit data after an EOF marker.  However, it seems like
      we should avoid throwing error in any questionable cases, especially in a
      back-patched fix, and anyway it would take additional code to make such
      an error get reported consistently.
      Back-patch to v10.  We could go further back, since COPY FROM PROGRAM
      has been around awhile, but AFAICS the only way to reach this situation
      using core or contrib is via file_fdw, which has only supported PROGRAM
      sources since v10.  The COPY statement per se has no feature whereby
      it'd stop reading without having hit EOF or an error already.  Therefore,
      I don't see any upside to back-patching further that'd outweigh the
      risk of complaints about behavioral change.
      Per bug #15449 from Eric Cyr.
      Patch by me, review by Etsuro Fujita and Kyotaro Horiguchi
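      The policy described above (child gets SIG_DFL for SIGPIPE; a SIGPIPE
      exit is tolerated only when the reader stopped short of EOF) can be
      sketched outside the server. The following is a Python illustration,
      not the COPY code: read_prefix_from_program is a hypothetical helper,
      and the sketch assumes a POSIX system. Python itself runs with SIGPIPE
      ignored, and subprocess.Popen's restore_signals option resets it to the
      default in the child, which parallels the fix.

```python
import signal
import subprocess


def read_prefix_from_program(argv, max_lines):
    """Read up to max_lines lines from a child program, then close the pipe."""
    # restore_signals=True is the POSIX default; it is spelled out because it
    # is what resets SIGPIPE from SIG_IGN to SIG_DFL in the child process.
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, restore_signals=True)
    lines = []
    hit_eof = False
    for _ in range(max_lines):
        line = proc.stdout.readline()
        if not line:
            hit_eof = True
            break
        lines.append(line)
    proc.stdout.close()  # may happen before the child has finished writing
    status = proc.wait()

    # A negative status means the child was terminated by that signal number.
    killed_by_sigpipe = status == -signal.SIGPIPE
    # Tolerate SIGPIPE only if we are the ones who stopped reading early.
    if status != 0 and not (killed_by_sigpipe and not hit_eof):
        raise RuntimeError(f"program failed with status {status}")
    return lines, status
```

      Calling it with a program that writes forever, such as the `yes`
      utility (assumed to be installed), returns the requested lines and a
      SIGPIPE status instead of raising an error.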
    • psql: Describe partitioned tables/indexes as such · d56e0fde
      Alvaro Herrera authored
      In \d and \z, instead of conflating partitioned tables and indexes with
      plain ones, set the "type" column and table title differently to make
      the distinction obvious.  A simple ease-of-use improvement.
      Author: Pavel Stehule, Michaël Paquier, Álvaro Herrera
      Reviewed-by: Amit Langote
    • Update config/ax_pthread.m4 to latest upstream version. · df303aff
      Tom Lane authored
      This change doesn't fix any bugs that we've heard about, but it seems
      like a good idea on general principles to track upstream occasionally.
    • Postpone LLVM-related uses of AC_CHECK_DECLS. · 640a4ba0
      Tom Lane authored
      Calling AC_CHECK_DECLS before we've finished setting up the compiler's
      CFLAGS seems like a pretty risky proposition, especially now that the
      first use of that macro will result in a test to see whether the compiler
      gives warning or error for undeclared built-in functions.  That answer
      could very easily get changed later than where PGAC_LLVM_SUPPORT is
      called; furthermore, it's hardly unlikely that flags such as -D_GNU_SOURCE
      could change visibility of declarations.  Hence, be a little less cavalier
      about where to do LLVM-related tests.  This results in v11 and HEAD doing
      the warning-or-error check at the same place in the script as older
      branches are doing it, which seems like a good thing.
      Per further thought about commits 0b59b0e8 and 16fbac39.
    • psql: Show IP address in \conninfo · 6e5f8d48
      Alvaro Herrera authored
      When hostaddr is given, the actual IP address that psql is connected to
      can be totally unexpected for the given host.  The more verbose output
      we now generate makes things clearer.  Since the "host" and "hostaddr"
      parts of the conninfo could come from different sources (say, one of
      them is in the service specification or a URI-style conninfo and the
      other is not), this is not as silly as it may first appear.  This is
      also definitely useful if the hostname resolves to multiple addresses.
      Author: Fabien Coelho
      Reviewed-by: Pavel Stehule, Arthur Zakirov
    • Reduce unnecessary list construction in RelationBuildPartitionDesc. · 7ee5f88e
      Robert Haas authored
      The 'partoids' list which was constructed by the previous version
      of this code was necessarily identical to 'inhoids'.  There's no
      point to duplicating the list, so avoid that.  Instead, construct
      the array representation directly from the original 'inhoids' list.
      Also, use an array rather than a list for 'boundspecs'.  We know
      exactly how many items we need to store, so there's really no
      reason to use a list.  Using an array instead reduces the number
      of memory allocations we perform.
      Patch by me, reviewed by Michael Paquier and Amit Langote, the
      latter of whom also helped with rebasing.
    • Fix configure's AC_CHECK_DECLS tests to work correctly with clang. · 16fbac39
      Tom Lane authored
      The test case that Autoconf uses to discover whether a function has
      been declared doesn't work reliably with clang, because clang reports
      a warning not an error if the name is a known built-in function.
      On some platforms, this results in a lot of compile-time warnings about
      strlcpy and related functions not having been declared.
      There is a fix for this (by Noah Misch) in the upstream Autoconf sources,
      but since they've not made a release in years and show no indication of
      doing so anytime soon, let's just absorb their fix directly.  We can
      revert this when and if we update to a newer Autoconf release.
      Back-patch to all supported branches.
    • Disallow COPY FREEZE on partitioned tables · 5c9a5513
      Alvaro Herrera authored
      This didn't actually work: COPY would fail to flush the right files, and
      instead would try to flush a non-existing file, causing the whole
      transaction to fail.
      Cope by raising an error as soon as the command is sent instead, to
      avoid a nasty later surprise.  Of course, it would be much better to
      make it work, but we don't have a patch for that yet, and we don't know
      if we'll want to backpatch one when we do.
      Reported-by: Tomas Vondra
      Author: David Rowley
      Reviewed-by: Amit Langote, Steve Singer, Tomas Vondra
    • PANIC on fsync() failure. · 9ccdd7f6
      Thomas Munro authored
      On some operating systems, it doesn't make sense to retry fsync(),
      because dirty data cached by the kernel may have been dropped on
      write-back failure.  In that case the only remaining copy of the
      data is in the WAL.  A subsequent fsync() could appear to succeed,
      but not have flushed the data.  That means that a future checkpoint
      could apparently complete successfully but have lost data.
      Therefore, violently prevent any future checkpoint attempts by
      panicking on the first fsync() failure.  Note that we already
      did the same for WAL data; this change extends that behavior to
      non-temporary data files.
      Provide a GUC data_sync_retry to control this new behavior, for
      users of operating systems that don't eject dirty data, and possibly
      forensic/testing uses.  If it is set to on and the write-back error
      was transient, a later checkpoint might genuinely succeed (on a
      system that does not throw away buffers on failure); if the error is
      permanent, later checkpoints will continue to fail.  The GUC defaults
      to off, meaning that we panic.
      Back-patch to all supported releases.
      There is still a narrow window for error-loss on some operating
      systems: if the file is closed and later reopened and a write-back
      error occurs in the intervening time, but the inode has the bad
      luck to be evicted due to memory pressure before we reopen, we could
      miss the error.  A later patch will address that with a scheme
      for keeping files with dirty data open at all times, but we judge
      that to be too complicated to back-patch.
      Author: Craig Ringer, with some adjustments by Thomas Munro
      Reported-by: Craig Ringer
      Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
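      The decision described above can be reduced to a few lines. This is a
      Python sketch of the policy only, not of the server code:
      sync_data_file and DataSyncPanic are hypothetical names, the
      data_sync_retry parameter stands in for the GUC, and the fsync argument
      exists only so that a failure can be injected for demonstration.

```python
import os


class DataSyncPanic(Exception):
    """Stand-in for a PANIC: the caller is not expected to catch and retry."""


def sync_data_file(fd, data_sync_retry=False, fsync=os.fsync):
    """Flush fd; on failure either allow a retry or escalate, per the setting."""
    try:
        fsync(fd)
    except OSError as err:
        if data_sync_retry:
            # Ordinary error: a later checkpoint may try again. Only safe on
            # systems that keep dirty buffers after a write-back failure.
            raise
        # Default: a second fsync() might falsely report success, so refuse
        # to continue and rely on the WAL for recovery.
        raise DataSyncPanic(f"could not fsync file: {err}") from err
```

      With the default setting a failed flush escalates immediately; with
      data_sync_retry enabled the original OSError propagates and the caller
      may try again later.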
    • Don't forget about failed fsync() requests. · 1556cb2f
      Thomas Munro authored
      If fsync() fails, md.c must keep the request in its bitmap, so that
      future attempts will try again.
      Back-patch to all supported releases.
      Author: Thomas Munro
      Reviewed-by: Amit Kapila
      Reported-by: Andrew Gierth
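      The bookkeeping rule above, forgetting a request only once its fsync()
      has succeeded, can be sketched as follows. This is a Python
      illustration under simplifying assumptions: a plain set of file
      descriptors stands in for md.c's bitmap, and process_sync_requests is
      a hypothetical name.

```python
import os


def process_sync_requests(pending, fsync=os.fsync):
    """Try to fsync every pending fd; drop a request only after success."""
    failed = {}
    # sorted() iterates over a copy, so removing from the set is safe.
    for fd in sorted(pending):
        try:
            fsync(fd)
        except OSError as err:
            # Leave fd in `pending` so the next pass tries it again.
            failed[fd] = err
        else:
            pending.discard(fd)
    return failed
```

      After a pass in which one descriptor fails, that descriptor is still in
      the pending set, so a later pass will attempt it again.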
    • Rework documentation of pg_promote · 74171f8c
      Michael Paquier authored
      This clarifies the behavior of how the "wait" flag works, which is
      something that the previous version of the documentation failed to do.
      Author: Ian Barwick
    • Remove unnecessary memcpy when reading WAL record fitting on page · 285bd0ac
      Michael Paquier authored
      When reading a WAL record, its contents are copied into an intermediate
      buffer.  However, doing so is not necessary if the record fits fully
      into the current page, saving one memcpy for each such record.  The
      allocation handling of the intermediate buffer is also now done only
      when a record crosses a page boundary, shaving some extra cycles when
      reading a WAL record.
      Author: Andrey Lepikhov
      Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas
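      The idea above, handing back the record in place when it fits on the
      current page and only assembling a copy when it crosses a boundary, can
      be sketched like this. It is a simplified Python model, not the WAL
      reader: read_record is a hypothetical function, pages are plain byte
      strings, and page headers on continuation pages are ignored.

```python
def read_record(pages, page_no, offset, length, scratch):
    """Return the record bytes; copy only if the record crosses a page."""
    page = pages[page_no]
    if offset + length <= len(page):
        # Whole record is on the current page: return a zero-copy view.
        return memoryview(page)[offset:offset + length]

    # Record crosses a page boundary: grow the scratch buffer only now.
    if len(scratch) < length:
        scratch.extend(bytes(length - len(scratch)))
    copied = 0
    while copied < length:
        page = pages[page_no]
        # Slicing stops at the end of the page automatically.
        chunk = page[offset:offset + (length - copied)]
        scratch[copied:copied + len(chunk)] = chunk
        copied += len(chunk)
        page_no += 1
        offset = 0
    return memoryview(scratch)[:length]
```

      In this model the scratch buffer stays empty for as long as every
      record fits on a single page.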
  4. 18 Nov, 2018 3 commits
  5. 17 Nov, 2018 5 commits
    • Add valgrind suppressions for wcsrtombs optimizations · d3bbc4b9
      Tomas Vondra authored
      wcsrtombs (called through wchar2char from common functions like lower,
      upper, etc.) uses various optimizations that may look like access to
      uninitialized data, triggering valgrind reports.
      For example AVX2 instructions load data in 256-bit chunks, and gconv
      does something similar with 32-bit chunks.  This is faster than accessing
      the bytes one by one, and the uninitialized part of the buffer is not
      actually used. So suppress the bogus reports.
      The exact stack depends on possible optimizations - it might be AVX, SSE
      (as in the report by Aleksander Alekseev) or something else. Hence the
      last frame is wildcarded, to deal with this.
      Backpatch all the way back to 9.4.
      Author: Tomas Vondra
    • Avoid defining SIGTTIN/SIGTTOU on Windows. · 37afc079
      Tom Lane authored
      Setting them to SIG_IGN seems unlikely to have any beneficial effect
      on that platform, and given the signal numbering collision with SIGABRT,
      it could easily have bad effects.
      Given the lack of field complaints that can be traced to this, I don't
      presently feel a need to back-patch.
    • Leave SIGTTIN/SIGTTOU signal handling alone in postmaster child processes. · 125f551c
      Tom Lane authored
      For reasons lost in the mists of time, most postmaster child processes
      reset SIGTTIN/SIGTTOU signal handling to SIG_DFL, with the major exception
      that backend sessions do not.  It seems like a pretty bad idea for any
      postmaster children to do that: if stderr is connected to the terminal,
      and the user has put the postmaster in background, any log output would
      result in the child process freezing up.  Hence, switch them all to
      doing what backends do, ie, nothing.  This allows them to inherit the
      postmaster's SIG_IGN setting.  On the other hand, manually-launched
      processes such as standalone backends will have default processing,
      which seems fine.
      In passing, also remove useless resets of SIGCONT and SIGWINCH signal
      processing.  Perhaps the postmaster once changed those to something
      besides SIG_DFL, but it doesn't now, so these are just wasted (and
      confusing) syscalls.
      Basically, this propagates the changes made in commit 8e2998d8 from
      backends to other postmaster children.  Probably the only reason these
      calls now exist elsewhere is that I missed changing pgstat.c along with
      postgres.c at the time.
      Given the lack of field complaints that can be traced to this, I don't
      presently feel a need to back-patch.
    • Make TupleTableSlots extensible, finish split of existing slot type. · 4da597ed
      Andres Freund authored
      This commit completes the work prepared in 1a0586de, splitting the
      old TupleTableSlot implementation (which could store buffer, heap,
      minimal and virtual slots) into four different slot types.  As
      described in the aforementioned commit, this is done with the goal of
      making tuple table slots extensible, to allow for pluggable table
      access methods.
      To achieve runtime extensibility for TupleTableSlots, operations on
      slots that can differ between types of slots are performed using the
      TupleTableSlotOps struct provided at slot creation time.  That
      includes information from the size of TupleTableSlot struct to be
      allocated, initialization, deforming etc.  See the struct's definition
      for more detailed information about the TupleTableSlotOps callbacks.
      I decided to rename TTSOpsBufferTuple to TTSOpsBufferHeapTuple and
      ExecCopySlotTuple to ExecCopySlotHeapTuple, as that seems more
      consistent with other naming introduced in recent patches.
      There's plenty of optimization potential in the slot implementation, but
      according to benchmarking the state after this commit has similar
      performance characteristics to before this set of changes, which seems
      There's a few changes in execReplication.c that currently need to poke
      through the slot abstraction, that'll be repaired once the pluggable
      storage patchset provides the necessary infrastructure.
      Author: Andres Freund and Ashutosh Bapat, with changes by Amit Khandekar
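      The dispatch scheme described above, a table of callbacks attached to
      each slot at creation time, can be sketched in a few lines. This is a
      Python illustration, not the executor's structs: SlotOps, Slot,
      make_slot and the comma-separated "packed" tuple format are
      hypothetical stand-ins. The small check in slot_getsomeattrs is meant
      to mirror the inlined hot path that avoids the callback when enough
      attributes are already available.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class SlotOps:
    """Per-slot-type callbacks, chosen once when the slot is created."""
    name: str
    init: Callable
    getsomeattrs: Callable


@dataclass
class Slot:
    ops: SlotOps
    natts: int
    values: List[Any] = field(default_factory=list)
    nvalid: int = 0              # number of leading attributes already usable
    raw: Optional[str] = None    # packed tuple, used by the "packed" type only


def _init(slot):
    slot.nvalid = 0


def _virtual_getsomeattrs(slot, attnum):
    # Virtual slots hold values directly, so there is nothing to deform.
    raise AssertionError("virtual slots never need deforming")


def _packed_getsomeattrs(slot, attnum):
    # "Deform" the packed representation up to the requested attribute.
    fields = slot.raw.split(",")
    for i in range(slot.nvalid, attnum):
        slot.values[i] = fields[i]
    slot.nvalid = attnum


VIRTUAL_OPS = SlotOps("virtual", _init, _virtual_getsomeattrs)
PACKED_OPS = SlotOps("packed", _init, _packed_getsomeattrs)


def make_slot(ops, natts):
    slot = Slot(ops=ops, natts=natts, values=[None] * natts)
    ops.init(slot)
    return slot


def store_virtual(slot, values):
    slot.values = list(values)
    slot.nvalid = len(slot.values)


def store_packed(slot, raw):
    slot.raw = raw
    slot.nvalid = 0


def slot_getsomeattrs(slot, attnum):
    # Fast path: only call into the type-specific callback when needed.
    if slot.nvalid < attnum:
        slot.ops.getsomeattrs(slot, attnum)
```

      Code that consumes slots calls slot_getsomeattrs() without knowing the
      slot type; a new kind of slot only has to supply its own SlotOps.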
  6. 16 Nov, 2018 10 commits
    • Avoid re-typedef'ing PartitionTupleRouting · 0201d79a
      Alvaro Herrera authored
      Apparently, gcc on macOS (?) doesn't like it.
      Per buildfarm.
    • pgbench: introduce a RandomState struct · 40923191
      Alvaro Herrera authored
      This becomes useful when used to retry a transaction after a
      serialization error or deadlock abort.  (We don't yet have that feature,
      but this is preparation for it.)
      While at it, use separate random state for thread administrivia such
      as deciding which script to run, how long to delay for throttling, or
      whether to log a message when sampling; this not only makes these tasks
      independent of each other, but makes the actual thread run
      Author: Marina Polyakova
      Reviewed-by: Fabien Coelho
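      One way to picture why separate random state helps with retries is the
      sketch below. It is a Python illustration resting on an assumption the
      commit message does not spell out: that a retried transaction should
      see the same random values as the failed attempt. RandomState here is
      a hypothetical wrapper around Python's random.Random, not pgbench's
      struct.

```python
import random


class RandomState:
    """A self-contained random stream that can be saved and restored."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def save(self):
        return self._rng.getstate()

    def restore(self, state):
        self._rng.setstate(state)

    def uniform_int(self, lo, hi):
        return self._rng.randint(lo, hi)


def run_transaction(rs):
    # Stand-in for a script that draws a few random parameters.
    return [rs.uniform_int(1, 100000) for _ in range(3)]


def run_with_retry(txn_rs, admin_rs, fail_first):
    saved = txn_rs.save()
    first = run_transaction(txn_rs)
    # Administrative decisions (throttle delay, log sampling, ...) draw from
    # their own stream, so they cannot disturb the transaction's values.
    admin_rs.uniform_int(0, 9)
    if fail_first:
        txn_rs.restore(saved)
        return first, run_transaction(txn_rs)
    return first, None
```

      With the state restored, the second attempt draws exactly the values
      the first one did, regardless of how many administrative draws happened
      in between.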
    • Inline hot path of slot_getsomeattrs(). · a7aa608e
      Andres Freund authored
      This yields a minor speedup, which roughly balances the loss from the
      upcoming introduction of callbacks to do some operations on slots.
      Author: Andres Freund
    • Redesign initialization of partition routing structures · 3f2393ed
      Alvaro Herrera authored
      This speeds up write operations (INSERT, UPDATE, DELETE, COPY, as well
      as the future MERGE) on partitioned tables.
      This changes the setup for tuple routing so that it does far less work
      during the initial setup and pushes more work out to when partitions
      receive tuples.  PartitionDispatchData structs for sub-partitioned
      tables are only created when a tuple gets routed through them.  The
      possibly large arrays in the PartitionTupleRouting struct have largely
      been removed.  The partitions[] array remains but now never contains any
      NULL gaps.  Previously the NULLs had to be skipped during
      ExecCleanupTupleRouting(), which could add a large overhead to the
      cleanup when the number of partitions was large.  The partitions[] array
      is allocated small to start with and only enlarged when we route tuples
      to enough partitions that it runs out of space. This allows us to keep
      simple single-row partition INSERTs running quickly.  Redesign
      The arrays in PartitionTupleRouting which stored the tuple translation maps
      have now been removed.  These have been moved out into a
      PartitionRoutingInfo struct which is an additional field in ResultRelInfo.
      The find_all_inheritors() call still remains by far the slowest part of
      ExecSetupPartitionTupleRouting(). This commit just removes the other slow
      In passing also rename the tuple translation maps from being ParentToChild
      and ChildToParent to being RootToPartition and PartitionToRoot. The old
      names mislead you into thinking that a partition of some sub-partitioned
      table would translate to the rowtype of the sub-partitioned table rather
      than the root partitioned table.
      Authors: David Rowley and Amit Langote, heavily revised by Álvaro Herrera
      Testing help from Jesper Pedersen and Kato Sho.
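      The lazy setup described above can be illustrated with a small model.
      This is a Python sketch, not the executor code: TupleRouting and its
      two callbacks are hypothetical. It keeps a dense list of only those
      partitions that have actually received a tuple, so cleanup never has to
      skip empty entries.

```python
class TupleRouting:
    """Open per-partition state only when the first tuple is routed there."""

    def __init__(self, choose_partition, open_partition):
        self._choose = choose_partition  # row -> partition name
        self._open = open_partition      # partition name -> row container
        self.partitions = []             # dense: no gaps for unused partitions
        self._index_of = {}              # partition name -> index in the list

    def route(self, row):
        name = self._choose(row)
        idx = self._index_of.get(name)
        if idx is None:
            # First tuple for this partition: pay the setup cost now.
            idx = len(self.partitions)
            self.partitions.append(self._open(name))
            self._index_of[name] = idx
        self.partitions[idx].append(row)
        return name

    def cleanup(self):
        # Only partitions that were actually used need to be visited.
        used = len(self.partitions)
        self.partitions.clear()
        self._index_of.clear()
        return used
```

      A single-row insert into a table with many partitions therefore opens
      exactly one partition in this model.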
    • Fix slot type assumptions for nodeGather[Merge]. · a387a3df
      Andres Freund authored
      The assumption made in 1a0586de was wrong, as evidenced by
      buildfarm failure on locust, which runs with
      force_parallel_mode=regress.  The tuples accessed in either node are
      in the outer slot, and we can't trivially rely on the slot type being
      known because the leader might execute the subsidiary node directly,
      or via the tuple queue on a worker. In the latter case the tuple will
      always be a heaptuple slot, but in the former, it'll be whatever the
      subsidiary node returns.
    • Add dummy field to currently empty struct TupleTableSlotOps. · f92cd739
      Andres Freund authored
      Per MSVC complaint on buildfarm member dory.
    • Don't generate tuple deforming functions for virtual slots. · 7ef04e4d
      Andres Freund authored
      Virtual tuple table slots never need tuple deforming. Therefore, if we
      know at expression compilation time that a certain slot will always
      be virtual, there's no need to create a tuple deforming routine for
      it.
      Author: Andres Freund
    • Verify that expected slot types match returned slot types. · 15d8f831
      Andres Freund authored
      This is important so JIT compilation knows what kind of tuple slot the
      deforming routine can expect. There's also optimization potential for
      expression initialization without JIT compilation. It e.g. seems
      plausible to elide EEOP_*_FETCHSOME ops entirely when dealing with
      virtual slots.
      Author: Andres Freund
    • Compute information about EEOP_*_FETCHSOME at expression init time. · 675af5c0
      Andres Freund authored
      Previously this information was computed when JIT compiling an
      expression.  But the information is also useful for assertions in the
      non-JIT case, therefore it makes sense to move it.
      This will, in a followup commit, allow treating different slot types
      differently. E.g. for virtual slots there's no need to generate a JIT
      function to deform the slot.
      Author: Andres Freund
    • Introduce notion of different types of slots (without implementing them). · 1a0586de
      Andres Freund authored
      Upcoming work intends to allow pluggable ways of storing table data.
      Accessing those table access methods from the executor requires
      TupleTableSlots to carry tuples in the native
      format of such storage methods; otherwise there'll be a significant
      conversion overhead.
      Different access methods will require different data to store tuples
      efficiently (just like virtual, minimal, heap already require fields
      in TupleTableSlot). To allow that without requiring additional pointer
      indirections, we want to have different structs (embedding
      TupleTableSlot) for different types of slots.  Thus different types of
      slots are needed, which requires adapting creators of slots.
      The slot that most efficiently can represent a type of tuple in an
      executor node will often depend on the type of slot a child node
      uses. Therefore we need to track the type of slot that is returned by
      nodes, so parent nodes can create slots based on that.
      Relatedly, JIT compilation of tuple deforming needs to know which type
      of slot a certain expression refers to, so it can create an
      appropriate deforming function for the type of tuple in the slot.
      But not all nodes will only return one type of slot, e.g. an append
      node will potentially return different types of slots for each of its
      Therefore add a function that allows querying the type of a node's
      result slot, and whether it'll always be the same type (whether it's
      fixed). This can be queried using ExecGetResultSlotOps().
      The scan, result, inner, outer type of slots are automatically
      inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(),
      left/right subtrees respectively. If that's not correct for a node,
      that can be overwritten using new fields in PlanState.
      This commit does not introduce the actually abstracted implementation
      of different kinds of TupleTableSlots; that will be left for a followup
      commit.  The different types of slots introduced will, for now, still
      use the same backing implementation.
      While this already partially invalidates the big comment in
      tuptable.h, it seems to make more sense to update it later, when the
      different TupleTableSlot implementations actually exist.
      Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar