
Core dump from data validation of larger files #348

Open
zaneselvans opened this issue Sep 15, 2018 · 3 comments
Milestone: Backlog

Comments

zaneselvans commented Sep 15, 2018

When attempting to validate my data package using the CLI command:

data validate datapackage.json

I first get a warning about a memory leak:

(node:21287) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 121 end listeners added. Use emitter.setMaxListeners() to increase limit

followed by a core dump an hour or more later. The data package I'm working with can be found on datahub, and it currently consists of two tabular data resources. One (mines) contains ~30MB of CSV data; it triggers the memory leak warning but validates successfully in under a minute. The other (employment-production-quarterly) is ~160MB of CSV data; it also triggers the memory leak warning, then runs for many minutes at ~100-150% of a CPU, slowly and continuously increasing its memory footprint (though only up to ~10% of available memory), before eventually failing with the following error:

<--- Last few GCs --->

[21287:0x5610d38d7aa0]  2412787 ms: Mark-sweep 2011.0 (2121.7) -> 2011.0 (2091.2) MB, 1581.1 / 0.0 ms  last resort GC in old space requested
[21287:0x5610d38d7aa0]  2414404 ms: Mark-sweep 2011.0 (2091.2) -> 2011.0 (2091.7) MB, 1615.7 / 0.0 ms  last resort GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x74ce6898fe1 <JSObject>
    1: push(this=0x1115d9486161 <JSArray[1956331]>)
    2: _callee2$ [/home/zane/anaconda3/lib/node_modules/data-cli/node_modules/tableschema/lib/table.js:~469] [pc=0x27385caf7d07](this=0x1115d94826e9 <Table map = 0x224e9de6491>,_context2=0x1115d9482689 <Context map = 0x224e9de1211>)
    3: tryCatch(aka tryCatch) [/home/zane/anaconda3/lib/node_modules/data-cli/node_modules/regenerator-runtime/run...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [node]
 2: 0x5610d228a3b3 [node]
 3: v8::Utils::ReportOOMFailure(char const*, bool) [node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
 5: v8::internal::Factory::NewUninitializedFixedArray(int) [node]
 6: 0x5610d1e698a5 [node]
 7: 0x5610d1e69a9f [node]
 8: v8::internal::JSObject::AddDataElement(v8::internal::Handle<v8::internal::JSObject>, unsigned int, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyAttributes, v8::internal::Object::ShouldThrow) [node]
 9: v8::internal::Object::AddDataProperty(v8::internal::LookupIterator*, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyAttributes, v8::internal::Object::ShouldThrow, v8::internal::Object::StoreFromKeyed) [node]
10: v8::internal::Object::SetProperty(v8::internal::LookupIterator*, v8::internal::Handle<v8::internal::Object>, v8::internal::LanguageMode, v8::internal::Object::StoreFromKeyed) [node]
11: v8::internal::Runtime::SetObjectProperty(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::LanguageMode) [node]
12: v8::internal::Runtime_SetProperty(int, v8::internal::Object**, v8::internal::Isolate*) [node]
13: 0x27385c8040bd

From within Python, using goodtables.validate() on the same data package (including all ~2 million records), validation completes successfully and takes about 10 minutes.
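
For reference, the Python side is roughly the following (a sketch from memory; I don't recall the exact options I passed, and the report keys are written as I remember them from goodtables-py):

import goodtables

# Validate the whole data package descriptor (datapackage.json in the
# package directory); the 'datapackage' preset tells goodtables to check
# every tabular resource listed in the descriptor.
report = goodtables.validate('datapackage.json', preset='datapackage')

print(report['valid'])       # overall pass/fail
print(report['warnings'])    # e.g. any row-limit warnings
for table in report['tables']:
    print(table['source'], table['row-count'], table['error-count'])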

I am running Ubuntu 18.04.1 on a Thinkpad T470S with two 2-thread cores and 24GB of RAM. The versions of node (v8.11.1) and npm (v6.4.1) I'm using are the ones distributed with the current anaconda3 distribution (v5.2). The version of data is 0.9.5.

zaneselvans (Author) commented

Core dump aside, it seems like the data validation could happen much faster. Is it going through the data record by record, or is it working on vectorized columns?

zelima added this to the Backlog milestone Oct 3, 2018

ezwelty commented Sep 4, 2019

@zaneselvans When running goodtables.validate(), are you setting row_limit= to a large enough number to scan the whole table? At least on my system, the default limit is 1000. Checking because I am suspicious of your speed results (2 million records in 10 minutes); I would have expected it to be much slower based on testing with my own data...

"warnings": ["Table table.csv inspection has reached 1000 row(s) limit"]

At least goodtables-py is checking line by line. I agree this could probably be done much faster by working with vectorized columns.
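
Roughly what I mean by vectorized, as a toy pandas sketch (the column name is made up, and this is not how goodtables is actually implemented):

import pandas as pd

df = pd.read_csv('employment-production-quarterly.csv')  # illustrative path

# Row-by-row: one Python-level check per record, like a cell-by-cell inspector.
bad_rows = [i for i, v in df['average_employees'].items()
            if pd.isna(v) or v < 0]

# Vectorized: a single columnar operation over all ~2 million values at once.
bad_mask = df['average_employees'].isna() | (df['average_employees'] < 0)
bad_rows_vec = df.index[bad_mask].tolist()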

zaneselvans (Author) commented

It's been a while! I'm not sure I remember whether I had the row limit set; initially, at least, I was trying to validate everything. In the PUDL project we are now, in theory, going to try to use goodtables programmatically, but it's not yet able to test all the things we want to test structurally about the data, so we're only running it on a few thousand rows, and the main structural validation we're doing happens by actually pulling all the data into an SQLite database.
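
The SQLite step is basically just declaring the schema up front and appending, roughly like this (a sketch; the table and column names here are made up, not the actual PUDL schema):

import sqlite3
import pandas as pd

con = sqlite3.connect('pudl.sqlite')
con.execute('PRAGMA foreign_keys = ON')
con.execute("""
    CREATE TABLE IF NOT EXISTS employment_production_quarterly (
        mine_id            INTEGER NOT NULL,
        year               INTEGER NOT NULL,
        quarter            INTEGER NOT NULL CHECK (quarter BETWEEN 1 AND 4),
        avg_employee_count REAL
    )
""")

df = pd.read_csv('employment-production-quarterly.csv')
# Appending fails loudly if the CSV has columns the table doesn't declare,
# and the NOT NULL / CHECK constraints reject malformed values.
df.to_sql('employment_production_quarterly', con, if_exists='append', index=False)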
