add split processor to standard library #109

Open
OriHoch opened this issue Dec 24, 2017 · 1 comment

Comments

@OriHoch
Contributor

OriHoch commented Dec 24, 2017

As a dpp user, I want to split or shard data from one or more resources into one or more other resources, based on certain conditions.

Use cases:

  • get a sample of rows for preview from a large datapackage
  • allow parallel processing of resources by sharding the source data
  • provide more user-friendly datapackages for large datasets, split into smaller chunks which users can open in a spreadsheet
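
To make the three use cases above concrete, here is a minimal sketch in plain Python. It does not use the dpp API and is not the suggested processor; the function names `sample_rows`, `shard_rows` and `chunk_rows` are hypothetical, and rows are assumed to be dicts coming from any iterable.

```python
from itertools import islice


def sample_rows(rows, limit):
    """Return an iterator over at most `limit` rows, for a quick preview."""
    return islice(rows, limit)


def shard_rows(rows, num_shards):
    """Distribute rows round-robin into `num_shards` lists.

    Note: this collects every shard in memory; a streaming processor
    would write each row to its target resource instead.
    """
    shards = [[] for _ in range(num_shards)]
    for index, row in enumerate(rows):
        shards[index % num_shards].append(row)
    return shards


def chunk_rows(rows, chunk_size):
    """Yield consecutive lists of at most `chunk_size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


rows = [{"id": i} for i in range(10)]
preview = list(sample_rows(rows, 3))   # first 3 rows only
shards = shard_rows(rows, 3)           # 3 shards for parallel processing
chunks = list(chunk_rows(rows, 4))     # spreadsheet-sized pieces
```

In the sharding and chunking cases the number of output resources is known up front (or can be derived from a row count), which is what makes them simpler than splitting on data values.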

see documentation and tests for this suggested processor here

@akariv
Member

akariv commented Dec 31, 2017

At least one use case for this could be accomplished by adding a parameter to the filter processor that makes it create a new resource for the filtered rows instead of modifying the source resource.
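
A rough illustration of that idea, in plain Python rather than the actual filter processor: the `to_new_resource` flag and the resource names `source` and `filtered` are assumptions made up for this sketch, not existing dpp parameters.

```python
def filter_rows(rows, predicate, to_new_resource=False):
    """Filter rows, optionally keeping the source resource untouched.

    Returns a dict mapping a (hypothetical) resource name to its rows.
    """
    rows = list(rows)
    matched = [row for row in rows if predicate(row)]
    if to_new_resource:
        # Source stays as-is; matching rows go to an additional resource.
        return {"source": rows, "filtered": matched}
    # Filtering in place: the source resource is narrowed down.
    return {"source": matched}


rows = [{"amount": 5}, {"amount": 20}, {"amount": 50}]
in_place = filter_rows(rows, lambda row: row["amount"] > 10)
split = filter_rows(rows, lambda row: row["amount"] > 10, to_new_resource=True)
```

Because the output has a fixed number of resources either way, the resource list in the datapackage can be declared before any rows are read.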
A second one could be accomplished by using a 'group-by' sort of processor, which takes a sorted stream and splits it into multiple streams based on a "key" (composed of values in specific columns).
The main problem with the latter is that you need to know the list of distinct values in the data in advance (so that you can modify the resource list in the datapackage), which significantly complicates the implementation.
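
The 'group-by' approach, and the difficulty with it, can be sketched with `itertools.groupby` from the Python standard library. This is only an illustration under the assumption that rows are dicts already sorted by the key columns; `split_by_key` is a hypothetical helper, not a dpp processor.

```python
from itertools import groupby
from operator import itemgetter


def split_by_key(sorted_rows, key_columns):
    """Yield (key, rows) pairs from a stream already sorted by `key_columns`.

    Each group's rows must be consumed before moving on to the next group.
    """
    key_func = itemgetter(*key_columns)
    for key, group in groupby(sorted_rows, key=key_func):
        yield key, group


rows = [
    {"country": "FR", "city": "Lyon"},
    {"country": "FR", "city": "Paris"},
    {"country": "IL", "city": "Haifa"},
]

# The set of output "resources" only becomes known while the data is read.
resources = {key: list(group) for key, group in split_by_key(rows, ["country"])}
resource_names = list(resources)
```

The keys in `resource_names` are discovered only after streaming through the rows, whereas the datapackage descriptor has to list its resources before rows are emitted. A processor along these lines would therefore need either a pre-declared list of key values or an extra pass over the data.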
