
Should add support for metrics collection from all processors and pipelines #86

OriHoch opened this issue Oct 1, 2017 · 0 comments


Scenario 1 - long running processor

  • Run a long-running processor
  • Want to see the progress of the processor as it runs

expected

  • should have a way to quickly, reliably and consistently check the progress
  • Some example questions:
    • how many rows were processed so far?
    • which processor is currently running? for how long?

actual

  • try to track the output data - but it might be committed in batches, and it's not reliable or consistent anyway
  • check the processor log (add logging.info calls) - also not reliable or consistent
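The logging workaround above can be sketched as a plain wrapper around a row iterator. This is not a datapackage-pipelines API - just an illustration of the ad-hoc approach and its limits (progress is only visible in the log, at whatever granularity is hard-coded here):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline-progress")


def count_rows(rows, every=1000):
    """Yield rows unchanged, logging progress every `every` rows.

    This is the workaround described above: progress is only observable
    by tailing the log, and there is no consistent place to query it.
    """
    start = time.time()
    count = 0
    for row in rows:
        count += 1
        if count % every == 0:
            log.info("processed %d rows in %.1fs", count, time.time() - start)
        yield row
    log.info("done: %d rows in %.1fs", count, time.time() - start)


# drive the wrapper over a dummy row stream
processed = list(count_rows(iter(range(2500)), every=1000))
```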

Scenario 2 - BI / reporting

  • Have a complex environment with many pipelines
  • Want to see reports showing which pipelines / processors ran and when / show statistics

expected

  • should have a way to generate reports over time
  • example questions:
    • what's the average processing time per pipeline / per processor / per row
    • when did a pipeline last run? for how long? how many rows were yielded?

actual

  • only possible by looking at the output data
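If run-level metrics were collected somewhere, the reporting questions above become simple aggregations. A sketch, assuming hypothetical run records (no such store exists in datapackage-pipelines today):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records a metrics store could collect per processor run.
runs = [
    {"pipeline": "clean", "processor": "dedup", "seconds": 12.0, "rows": 1000},
    {"pipeline": "clean", "processor": "dedup", "seconds": 18.0, "rows": 1500},
    {"pipeline": "load",  "processor": "dump",  "seconds": 5.0,  "rows": 200},
]


def avg_seconds_per(records, key):
    """Average run time grouped by an arbitrary key (pipeline, processor, ...)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["seconds"])
    return {k: mean(v) for k, v in groups.items()}


per_pipeline = avg_seconds_per(runs, "pipeline")
# per_pipeline == {"clean": 15.0, "load": 5.0}
```

The same grouping over `"processor"` (or seconds divided by rows) answers the per-processor and per-row questions.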

Scenario 3 - alerting

  • Have a complex environment with many pipelines
  • it's hard to track pipelines manually

expected

  • should have a way to alert in real-time based on pipeline processing
  • example alerts:
    • a pipeline takes too long to run (compared to previous average running time)
    • a pipeline is "stuck" and stopped yielding new rows in a timely manner

actual

  • no way to have real-time alerts
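The two example alerts above only need two small predicates once metrics exist. A sketch with hypothetical names and thresholds:

```python
def is_too_slow(current_seconds, history, factor=2.0):
    """Flag a run taking longer than `factor` x the historical average.

    `history` is a list of previous run durations; with no history there
    is no baseline, so we don't alert.
    """
    if not history:
        return False
    avg = sum(history) / len(history)
    return current_seconds > factor * avg


def is_stuck(last_row_at, now, max_idle_seconds=300):
    """Flag a pipeline that has not yielded a new row for too long."""
    return now - last_row_at > max_idle_seconds
```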

Suggested solution

The datapackage-pipelines-metrics plugin provides most of the required features, but it has two major problems:

  • it's not pretty to integrate - you have to rename all pipeline specs to the plugin's source spec filename
    • or - manually add it to your own source spec, or to each pipeline
  • it aggregates metrics per datapackage or resource - there is no way to analyze a specific step or a specific processor

AFAIK the current plugin framework doesn't support this use-case - which might be a good thing (I wouldn't like to see general hooks / events that allow modifying how the system works)

but - I think that for metrics, it is needed

perhaps a generic pluggable metrics architecture is in order?
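One possible shape for such an architecture - purely a sketch, every name here is hypothetical: backends implement a small observer interface, so they can collect per-step metrics but cannot modify how the pipeline runs (which addresses the concern above about general hooks):

```python
import time
from typing import Protocol


class MetricsBackend(Protocol):
    """Hypothetical plugin interface: backends only observe runs."""

    def on_processor_start(self, pipeline: str, processor: str) -> None: ...
    def on_row(self, pipeline: str, processor: str) -> None: ...
    def on_processor_end(self, pipeline: str, processor: str, seconds: float) -> None: ...


class InMemoryBackend:
    """Simplest backend: keep counts and durations in dicts."""

    def __init__(self):
        self.rows = {}
        self.durations = {}

    def on_processor_start(self, pipeline, processor):
        self.rows.setdefault((pipeline, processor), 0)

    def on_row(self, pipeline, processor):
        self.rows[(pipeline, processor)] += 1

    def on_processor_end(self, pipeline, processor, seconds):
        self.durations[(pipeline, processor)] = seconds


def run_processor(rows, pipeline, processor, backend):
    """How the framework could drive a backend around one processor's rows."""
    backend.on_processor_start(pipeline, processor)
    start = time.time()
    for row in rows:
        backend.on_row(pipeline, processor)
        yield row
    backend.on_processor_end(pipeline, processor, time.time() - start)


backend = InMemoryBackend()
out = list(run_processor(iter([1, 2, 3]), "my-pipeline", "my-step", backend))
```

A Prometheus or database backend could implement the same interface, which would cover the progress, reporting, and alerting scenarios without touching pipeline specs.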
