add split processor to standard library #109

OriHoch · 2017-12-24T13:12:40Z

As a dpp user, I want to split or shard data from one or more resources into one or more other resources, based on certain conditions.

Use cases:

get a sample of rows for preview from a large datapackage
allow parallel processing of resources by sharding the source data
provide more user-friendly datapackages for large datasets, split into smaller chunks that users can open in a spreadsheet
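
To make the first two use cases concrete, here is a minimal sketch in plain Python. It does not use the dpp processor API; `sample_rows` and `shard_rows` are hypothetical helper names, and rows are assumed to be plain dicts. A real processor would presumably stream shards out rather than hold them in memory.

```python
from itertools import islice


def sample_rows(rows, limit):
    """Return an iterator over at most `limit` rows (hypothetical helper)."""
    return islice(rows, limit)


def shard_rows(rows, num_shards):
    """Distribute rows round-robin into `num_shards` lists (hypothetical helper).

    Shards are kept in memory here for simplicity.
    """
    shards = [[] for _ in range(num_shards)]
    for index, row in enumerate(rows):
        shards[index % num_shards].append(row)
    return shards


rows = [{"id": i} for i in range(10)]
preview = list(sample_rows(rows, 3))   # first 3 rows, for a preview
shards = shard_rows(rows, 3)           # 3 shards for parallel processing
```

Round-robin assignment keeps shard sizes within one row of each other without needing to know the total row count in advance.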

see documentation and tests for this suggested processor here

akariv · 2017-12-31T08:02:12Z

At least one of these use cases could be covered by adding a parameter to the filter processor that makes it write the filtered rows to a new resource instead of modifying the source resource.
A second could be covered by a 'group-by' style processor, which takes a sorted stream and splits it into multiple streams based on a "key" (composed of the values in specific columns).
The main problem with the latter is that you need to know the list of distinct values in the data in advance (so that you can modify the resource list in the datapackage), which significantly complicates the implementation.
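
A rough sketch of the 'group-by' idea, in plain Python rather than the dpp API, might look like the following. `split_sorted_rows` is a hypothetical name, and the rows are assumed to be dicts already sorted by the key columns. Note that the keys only become known while the stream is consumed, which is exactly the difficulty described above.

```python
from itertools import groupby
from operator import itemgetter


def split_sorted_rows(rows, key_columns):
    """Yield (key, group_iterator) pairs from rows pre-sorted by key_columns.

    Hypothetical helper: with one key column the key is a scalar,
    with several columns it is a tuple.
    """
    key_func = itemgetter(*key_columns)
    for key, group in groupby(rows, key=key_func):
        yield key, group


rows = [
    {"city": "Haifa", "value": 1},
    {"city": "Haifa", "value": 2},
    {"city": "Tel Aviv", "value": 3},
]

# Each group must be consumed before advancing to the next key.
result = {key: list(group) for key, group in split_sorted_rows(rows, ["city"])}
```

Because `groupby` only merges adjacent rows, unsorted input would produce several groups for the same key.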

akariv added the discussion label Dec 31, 2017
