Cut the size of zip.File structures in memory
Summary
I'm not sure whether this can be fixed easily; I'm just capturing my research here.
We spend a lot of memory on each zip.File we cache, and we probably don't use many of its fields. For example, about 25% of all Pages memory is taken by timezone information stored for every file in zip archives.
It would be nice to see whether we can get rid of some of this memory.
Here's how the 1% memory profile for Pages looks on gitlab.com:
About 80% of it is zip.File objects that we store in:
if file.Mode().IsDir() {
	a.directories[file.Name] = &file.FileHeader
} else {
	a.files[file.Name] = file
}
zip.File and zip.FileHeader are quite heavy structures:
type File struct {
	FileHeader
	zip *Reader
	zipr io.ReaderAt
	headerOffset int64
	zip64 bool // zip64 extended information extra field presence
	descErr error // error reading the data descriptor during init
}

type FileHeader struct {
	// Name is the name of the file.
	//
	// It must be a relative path, not start with a drive letter (such as "C:"),
	// and must use forward slashes instead of back slashes. A trailing slash
	// indicates that this file is a directory and should have no data.
	//
	// When reading zip files, the Name field is populated from
	// the zip file directly and is not validated for correctness.
	// It is the caller's responsibility to sanitize it as
	// appropriate, including canonicalizing slash directions,
	// validating that paths are relative, and preventing path
	// traversal through filenames ("../../../").
	Name string
	// Comment is any arbitrary user-defined string shorter than 64KiB.
	Comment string
	// NonUTF8 indicates that Name and Comment are not encoded in UTF-8.
	//
	// By specification, the only other encoding permitted should be CP-437,
	// but historically many ZIP readers interpret Name and Comment as whatever
	// the system's local character encoding happens to be.
	//
	// This flag should only be set if the user intends to encode a non-portable
	// ZIP file for a specific localized region. Otherwise, the Writer
	// automatically sets the ZIP format's UTF-8 flag for valid UTF-8 strings.
	NonUTF8 bool
	CreatorVersion uint16
	ReaderVersion uint16
	Flags uint16
	// Method is the compression method. If zero, Store is used.
	Method uint16
	// Modified is the modified time of the file.
	//
	// When reading, an extended timestamp is preferred over the legacy MS-DOS
	// date field, and the offset between the times is used as the timezone.
	// If only the MS-DOS date is present, the timezone is assumed to be UTC.
	//
	// When writing, an extended timestamp (which is timezone-agnostic) is
	// always emitted. The legacy MS-DOS date field is encoded according to the
	// location of the Modified time.
	Modified time.Time
	ModifiedTime uint16 // Deprecated: Legacy MS-DOS date; use Modified instead.
	ModifiedDate uint16 // Deprecated: Legacy MS-DOS time; use Modified instead.
	CRC32 uint32
	CompressedSize uint32 // Deprecated: Use CompressedSize64 instead.
	UncompressedSize uint32 // Deprecated: Use UncompressedSize64 instead.
	CompressedSize64 uint64
	UncompressedSize64 uint64
	Extra []byte
	ExternalAttrs uint32 // Meaning depends on CreatorVersion
}
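To put rough numbers on "heavy", a sketch like the following prints the shallow size of each struct using unsafe.Sizeof. The helper name shallowSizes is made up for illustration. The figures exclude anything behind a pointer (the Name bytes, the Extra slice, the Location that Modified points to), so the real per-entry cost is higher than what this reports.

```go
package main

import (
	"archive/zip"
	"fmt"
	"time"
	"unsafe"
)

// shallowSizes returns the fixed (shallow) size in bytes of zip.File,
// zip.FileHeader and time.Time on the current platform. Heap data that
// these structs point to is not counted.
func shallowSizes() (file, header, modified uintptr) {
	return unsafe.Sizeof(zip.File{}),
		unsafe.Sizeof(zip.FileHeader{}),
		unsafe.Sizeof(time.Time{})
}

func main() {
	file, header, modified := shallowSizes()
	fmt.Printf("zip.File:       %d bytes\n", file)
	fmt.Printf("zip.FileHeader: %d bytes\n", header)
	fmt.Printf("time.Time:      %d bytes\n", modified)
}
```

The exact values depend on the Go version and architecture, so they are best read from a build that matches production.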
The most interesting part is the Modified time.Time field:
type Time struct {
	// wall and ext encode the wall time seconds, wall time nanoseconds,
	// and optional monotonic clock reading in nanoseconds.
	//
	// From high to low bit position, wall encodes a 1-bit flag (hasMonotonic),
	// a 33-bit seconds field, and a 30-bit wall time nanoseconds field.
	// The nanoseconds field is in the range [0, 999999999].
	// If the hasMonotonic bit is 0, then the 33-bit field must be zero
	// and the full signed 64-bit wall seconds since Jan 1 year 1 is stored in ext.
	// If the hasMonotonic bit is 1, then the 33-bit field holds a 33-bit
	// unsigned wall seconds since Jan 1 year 1885, and ext holds a
	// signed 64-bit monotonic clock reading, nanoseconds since process start.
	wall uint64
	ext int64
	// loc specifies the Location that should be used to
	// determine the minute, hour, month, day, and year
	// that correspond to this Time.
	// The nil location means UTC.
	// All UTC times are represented with loc==nil, never loc==&utcLoc.
	loc *Location
}

type Location struct {
	name string
	zone []zone
	tx []zoneTrans
	// The tzdata information can be followed by a string that describes
	// how to handle DST transitions not recorded in zoneTrans.
	// The format is the TZ environment variable without a colon; see
	// https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html.
	// Example string, for America/Los_Angeles: PST8PDT,M3.2.0,M11.1.0
	extend string
	// Most lookups will be for the current time.
	// To avoid the binary search through tx, keep a
	// static one-element cache that gives the correct
	// zone for the time when the Location was created.
	// if cacheStart <= t < cacheEnd,
	// lookup can return cacheZone.
	// The units for cacheStart and cacheEnd are seconds
	// since January 1, 1970 UTC, to match the argument
	// to lookup.
	cacheStart int64
	cacheEnd int64
	cacheZone *zone
}
The profile above shows that these locations are created by this function:
func FixedZone(name string, offset int) *Location {
	l := &Location{
		name: name,
		zone: []zone{{name, offset, false}},
		tx: []zoneTrans{{alpha, 0, false, false}},
		cacheStart: alpha,
		cacheEnd: omega,
	}
	l.cacheZone = &l.zone[0]
	return l
}
So it looks like 25% of all Pages memory usage comes from storing timezones for individual files in zip archives.
Similar to #432 (closed), but the points suggested there are already done.
Steps to reproduce
Example Project
What is the current bug behavior?
What is the expected correct behavior?
Relevant logs and/or screenshots
Output of checks
Possible fixes
~"devops::release" ~"group::release" Category:Pages