
Should add support for metrics collection from all processors and pipelines #86

OriHoch opened this issue Oct 1, 2017 · 0 comments


Scenario 1 - long running processor

  • Run a long-running processor
  • Want to see the progress of the processor as it runs

expected

  • should have a way to quickly, reliably and consistently check the progress
  • Some example questions:
    • how many rows were processed so far?
    • which processor is currently running? for how long?

actual

  • try to track the output data - but it might be committed in batches, and it's not reliable or consistent anyway
  • check the processor log (add logging.info calls) - also not reliable or consistent
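The logging workaround above can be sketched as a plain wrapper around a row iterator. This is not a datapackage-pipelines API - just an illustration of the ad-hoc approach and its limits (progress is only visible in the log, at whatever granularity is hard-coded here):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline-progress")


def count_rows(rows, every=1000):
    """Yield rows unchanged, logging progress every `every` rows.

    This is the workaround described above: progress is only observable
    by tailing the log, and there is no consistent place to query it.
    """
    start = time.time()
    count = 0
    for row in rows:
        count += 1
        if count % every == 0:
            log.info("processed %d rows in %.1fs", count, time.time() - start)
        yield row
    log.info("done: %d rows in %.1fs", count, time.time() - start)


# drive the wrapper over a dummy row stream
processed = list(count_rows(iter(range(2500)), every=1000))
```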

Scenario 2 - BI / reporting

  • Have a complex environment with many pipelines
  • Want to see reports showing which pipelines / processors ran and when / show statistics

expected

  • should have a way to generate reports over time
  • example questions:
    • what's the average processing time per pipeline / per processor / per row
    • when did a pipeline last run? for how long? how many rows were yielded?

actual

  • only possible by looking at the output data
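If run-level metrics were collected somewhere, the reporting questions above become simple aggregations. A sketch, assuming hypothetical run records (no such store exists in datapackage-pipelines today):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records a metrics store could collect per processor run.
runs = [
    {"pipeline": "clean", "processor": "dedup", "seconds": 12.0, "rows": 1000},
    {"pipeline": "clean", "processor": "dedup", "seconds": 18.0, "rows": 1500},
    {"pipeline": "load",  "processor": "dump",  "seconds": 5.0,  "rows": 200},
]


def avg_seconds_per(records, key):
    """Average run time grouped by an arbitrary key (pipeline, processor, ...)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["seconds"])
    return {k: mean(v) for k, v in groups.items()}


per_pipeline = avg_seconds_per(runs, "pipeline")
# per_pipeline == {"clean": 15.0, "load": 5.0}
```

The same grouping over `"processor"` (or seconds divided by rows) answers the per-processor and per-row questions.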

Scenario 3 - alerting

  • Have a complex environment with many pipelines
  • it's hard to track pipelines manually

expected

  • should have a way to alert in real-time based on pipeline processing
  • example alerts:
    • a pipeline takes too long to run (compared to previous average running time)
    • a pipeline is "stuck" and stopped yielding new rows in a timely manner

actual

  • no way to have real-time alerts
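The two example alerts above only need two small predicates once metrics exist. A sketch with hypothetical names and thresholds:

```python
def is_too_slow(current_seconds, history, factor=2.0):
    """Flag a run taking longer than `factor` x the historical average.

    `history` is a list of previous run durations; with no history there
    is no baseline, so we don't alert.
    """
    if not history:
        return False
    avg = sum(history) / len(history)
    return current_seconds > factor * avg


def is_stuck(last_row_at, now, max_idle_seconds=300):
    """Flag a pipeline that has not yielded a new row for too long."""
    return now - last_row_at > max_idle_seconds
```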

Suggested solution

The datapackage-pipelines-metrics plugin provides most of the required features, but it has two major problems:

  • it's not pretty to integrate - you have to rename all pipeline specs to the plugin's source spec filename
    • or - manually add it to your own source spec, or to each pipeline
  • it aggregates metrics per datapackage or resource - there is no way to analyze a specific step or a specific processor

AFAIK the current plugin framework doesn't support this use-case - which might be a good thing (I wouldn't like to see general hooks / events that allow modifying how the system works)

but - I think that for metrics, it is needed

perhaps a generic pluggable metrics architecture is in order?
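One possible shape for such an architecture - purely a sketch, every name here is hypothetical: backends implement a small observer interface, so they can collect per-step metrics but cannot modify how the pipeline runs (which addresses the concern above about general hooks):

```python
import time
from typing import Protocol


class MetricsBackend(Protocol):
    """Hypothetical plugin interface: backends only observe runs."""

    def on_processor_start(self, pipeline: str, processor: str) -> None: ...
    def on_row(self, pipeline: str, processor: str) -> None: ...
    def on_processor_end(self, pipeline: str, processor: str, seconds: float) -> None: ...


class InMemoryBackend:
    """Simplest backend: keep counts and durations in dicts."""

    def __init__(self):
        self.rows = {}
        self.durations = {}

    def on_processor_start(self, pipeline, processor):
        self.rows.setdefault((pipeline, processor), 0)

    def on_row(self, pipeline, processor):
        self.rows[(pipeline, processor)] += 1

    def on_processor_end(self, pipeline, processor, seconds):
        self.durations[(pipeline, processor)] = seconds


def run_processor(rows, pipeline, processor, backend):
    """How the framework could drive a backend around one processor's rows."""
    backend.on_processor_start(pipeline, processor)
    start = time.time()
    for row in rows:
        backend.on_row(pipeline, processor)
        yield row
    backend.on_processor_end(pipeline, processor, time.time() - start)


backend = InMemoryBackend()
out = list(run_processor(iter([1, 2, 3]), "my-pipeline", "my-step", backend))
```

A Prometheus or database backend could implement the same interface, which would cover the progress, reporting, and alerting scenarios without touching pipeline specs.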
