.match() on large file leading to 'heap out of memory' errors
Hi Phil!
Thank you so much for the library, we love it, though we have recently come across a potential limitation.
We are using some one-off scripts to perform an ETL, and this is the core of our 'Transform' portion:
const input = argv.input
const output = argv.output
const readStream = fs.createReadStream(input)
const payees = []

async function makeUniquePayees() {
  return await new Promise((resolve, reject) => {
    try {
      const dataStream = bfj.match(readStream, 'payee', { ndjson: true })
      console.log(`Starting transformation - this might take a while for larger files.`)
      dataStream.on('data', item => payees.push(item))
      dataStream.on('end', () => {
        let uniquePayees = new Set(payees)
        let output = {
          payees: Array.from(uniquePayees).map(payee => ({
            payee: payee,
          })),
        }
        return resolve(output)
      })
      dataStream.on('error', err => {
        throw err
      })
    } catch (err) {
      console.log(err)
      return reject(err)
    }
  })
}

async function transform() {
  let uniquePayees = await makeUniquePayees()
  let uniquePayeesParsed = JSON.stringify(uniquePayees)
  fs.writeFileSync(output, uniquePayeesParsed)
  console.log(`File written to ${output} with ${Object.keys(uniquePayees.payees).length} entries.\nFinished!!!`)
  return
}

transform()
Short version: We have a ton of duplicate keys in one of our fields, and as we want to use it as a searchable key, we're paring it down to use in its own index in Elasticsearch. That's what the Set is being used for at the moment =]
This has worked well in my test runs with smaller files. However, with larger JSON files (>500MB), I run into the following error:
<--- Last few GCs --->
[18658:0x102aa6000] 113493 ms: Scavenge 2034.8 (2049.7) -> 2030.9 (2049.7) MB, 1.7 / 0.0 ms (average mu = 0.270, current mu = 0.251) allocation failure
[18658:0x102aa6000] 113496 ms: Scavenge 2034.8 (2049.7) -> 2031.0 (2049.7) MB, 1.6 / 0.0 ms (average mu = 0.270, current mu = 0.251) allocation failure
[18658:0x102aa6000] 113500 ms: Scavenge 2034.9 (2049.7) -> 2031.1 (2057.7) MB, 1.9 / 0.0 ms (average mu = 0.270, current mu = 0.251) allocation failure
<--- JS stacktrace --->
==== JS stack trace =========================================
0: ExitFrame [pc: 0x10097d5b9]
Security context: 0x3af7bf4008d1 <JSObject>
1: tryConvertToPromise(aka tryConvertToPromise) [0x3af7e9a0ac99] [/Users/<user>/Documents/project/lib/search/util/ETL/node_modules/bluebird/js/release/thenables.js:~7] [pc=0x218a6f90e920](this=0x3af7d30c04b1 <undefined>,0x3af7d30c04b1 <undefined>,0x3af7d30c04b1 <undefined>)
2: step(aka step) [0x3af74b717751] [/Us...
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
1: 0x1010248bd node::Abort() (.cold.1) [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
2: 0x100084c4d node::FatalError(char const*, char const*) [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
3: 0x100084d8e node::OnFatalError(char const*, char const*) [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
4: 0x100186477 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
5: 0x100186417 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
6: 0x1003141c5 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
7: 0x100315a3a v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
8: 0x10031246c v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
9: 0x10031026e v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
10: 0x10030f2b1 v8::internal::Heap::HandleGCRequest() [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
11: 0x1002d4551 v8::internal::StackGuard::HandleInterrupts() [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
12: 0x10063e79c v8::internal::Runtime_StackGuard(int, unsigned long*, v8::internal::Isolate*) [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
13: 0x10097d5b9 Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit [/Users/<user>/.nvm/versions/node/v12.16.2/bin/node]
14: 0x218a6f90e920
15: 0x218a6f914e70
Abort trap: 6
I imagine there is a more efficient way to do what I'm doing, and perhaps it is user error rather than anything in bfj, but I wanted to reach out for any thoughts!
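One variation that might be worth trying is sketched here. It is not a confirmed fix. It adds each item straight to a Set as it arrives, so the intermediate array of duplicates is never held in memory, and it rejects the promise from the 'error' handler instead of throwing. The helper name collectUnique and the demo input are hypothetical. The sketch assumes the matched payee values are primitives such as strings, since a Set compares objects by identity. It only needs an object-mode readable stream, so the stream returned by bfj.match could be passed in place of the stand-in. It would help only if the duplicates array is what fills the heap, and it does nothing about memory used inside bfj itself.

```javascript
const { Readable } = require('stream')

// Hypothetical helper (not part of bfj): gathers distinct values from any
// object-mode readable stream without buffering the duplicates first.
function collectUnique(dataStream) {
  return new Promise((resolve, reject) => {
    const unique = new Set()

    // Deduplicate incrementally; assumes items are primitives (e.g. strings).
    dataStream.on('data', item => unique.add(item))

    // Build the same { payees: [{ payee }, ...] } shape as the original script.
    dataStream.on('end', () => {
      resolve({ payees: Array.from(unique, payee => ({ payee })) })
    })

    // Reject here: a throw inside an event handler would not be caught
    // by a try/catch wrapped around the handler registration.
    dataStream.on('error', reject)
  })
}

// Stand-in input for illustration only; in the script above this would be
// the stream returned by bfj.match(readStream, 'payee', { ndjson: true }).
const demo = collectUnique(Readable.from(['acme', 'globex', 'acme']))
demo.then(result => console.log(`${result.payees.length} unique payees`))
```

If the heap still fills up with this shape, that would suggest the memory is going somewhere other than the payees array.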
Also, we are using the latest version of bfj: ^7.0.2 in package.json.
Thank you again =] -Alex
Edit:
With some poking around, it looks like it is not making it to the dataStream.on('end') section, so it appears to be happening before we get there =]