[epic][uber] v1 of CKAN DataStore Load using AirCan #85

Open
7 of 28 tasks
rufuspollock opened this issue Jul 31, 2020 · 1 comment

rufuspollock commented Jul 31, 2020

This is the uber-epic for the complete evolution of CKAN DataStore load to AirCan.

Acceptance

  • We are using AirCan in production for data loading to datastore
    • (central?) AirCan service in our cluster
    • CKAN instances updated with connector for AirCan
    • Monitoring / Debugging working i.e. we can see what is happening and if there are issues
  • New UI for CKAN instances for data loading experience ...

Tasks

  • v0.1 MVP DataStore load working including
  • v0.2 - errors and logging [epic] v0.2 Error and Logging #65
    • Refactor DAGs, ckanext-aircan etc to take a run_id that is passed in to the DAG and used in its logging when it runs, so we can reliably track logs. Also move AirFlow status info into the logs (so we don't depend on the AirFlow API).
      • Research how others solve the problem of getting unique run ids per DAG run in AirFlow (and how we could pass this info down into Stackdriver so that we can filter logs). The goal is a reliable aircan_status(run_id) function that can be turned into an API in CKAN (or elsewhere).
  • v0.3 - UI integration into CKAN [epic] v0.3 #89
  • v0.4 - improved datastore load e.g. more formats
    • Loads XLSX ok (uses types)
    • Load google sheets
  • v0.5 - harvesting MVP
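
To make the v0.2 run_id task above more concrete, here is a minimal sketch of what an aircan_status(run_id) lookup over log records could look like. Everything except the name aircan_status(run_id) is an assumption for illustration: the LogRecord shape, the state names, the new_run_id helper, and the in-memory list standing in for whatever log store (e.g. Stackdriver) is actually used.

```python
import uuid
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class LogRecord:
    """Hypothetical structured log entry emitted by a DAG task."""
    run_id: str
    timestamp: float
    state: str      # assumed states: "queued", "running", "success", "failed"
    message: str


def new_run_id() -> str:
    """Generate a unique id on the caller side, to be passed in to the DAG run."""
    return uuid.uuid4().hex


def aircan_status(run_id: str, records: Iterable[LogRecord]) -> Optional[dict]:
    """Return the latest known state for run_id, derived from logs only."""
    matching = [r for r in records if r.run_id == run_id]
    if not matching:
        return None  # nothing has been logged for this run yet
    latest = max(matching, key=lambda r: r.timestamp)
    return {
        "run_id": run_id,
        "state": latest.state,
        "message": latest.message,
        "events": len(matching),
    }
```

Because the status is computed purely from records tagged with the run_id, a function like this would not need to call the AirFlow API, which is the point of moving status info into the logs.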

Plan of work (from 4 nov)

  • Test instance of CKAN + ckanext-aircan (+ AirFlow) https://ckan.aircan.dev.datopian.com/
    • Move this into the "dev/test cluster" @cuducos
  • Instance of Google Cloud Composer and a way to update DAGs there.
    • Should it be a test instance OR could we use production? (We think production is OK, in part because we can create new DAGs if we need to, so we don’t interfere with existing ones. E.g. suppose we want to update datastore_load_dag and it is being used by production CKAN instances: we can create datastore_load_dag_v2.) ANS: Use Production
    • Shut down all other Cloud Composer instances
  • Integration test for ckanext-aircan etc: start with a simple CSV. Scripted test to upload a file and check it is imported ckanext-aircan#26 🔥
    • then with some large files [generated automatically], e.g. to check whether the AirFlow DAG has an issue or is very slow …
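
For the large-file part of the integration test, the input files could be generated by a small script along these lines. This is only a sketch using the Python standard library: the function names, column layout and seed are assumptions, and the upload to CKAN and the check against the DataStore are deliberately left out because they depend on the test instance.

```python
import csv
import random


def generate_csv(path: str, rows: int, columns: int = 5, seed: int = 0) -> int:
    """Write a CSV with a header, an id column and random integer columns."""
    rng = random.Random(seed)  # seeded so repeated test runs get identical files
    header = ["id"] + [f"col_{i}" for i in range(1, columns)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row_id in range(rows):
            writer.writerow(
                [row_id] + [rng.randint(0, 10**6) for _ in range(columns - 1)]
            )
    return rows


def count_data_rows(path: str) -> int:
    """Count rows excluding the header, to compare with what was imported."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return sum(1 for _ in reader)
```

A scripted test could then compare the number of generated rows with the row count reported after the load, and record how long the DAG run took for each file size.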

FUTURE after this

graph TD

v1[v0.1 CSV load working, CI/CD setup with rich tests]
v2[v0.2 errors, logging and UI integration]
v3[UI integration]
v3[v0.3 expand the tasks, e.g. xlsx, google sheets loading]
v4[v0.4 harvesting ...]


v1 --> v2
v2 --> v3
v3 --> v4

Detailed

graph TD

deploytotest[Deploy DAGs to test GCC]
deploydags[Deploy DAGs into this AirFlow<br/>starting with CKAN data load]
deploygcc[Deploy Airflow<br/>i.e. Google Cloud Composer]

nhsdag[NHS DAG for loading to bigquery]
nhs[NHS Done: instance updated<br/>with extension and working in production]

logging[Logging]
reporting[Reporting]

othersite["Other Site Done"]

start[Start] --> deploygcc

start --> logging
multinodedag --> deploytotest

subgraph General Dev of AirCan
  errors[Error Handling]
  aircanlib[AirCan lib refactoring]
  multinodedag[Multi Node DAG]
  logging --> reporting
end

subgraph Deploy into Datopian Cluster
  deploytotest[Deploy DAGs to test GCC] --> deploydags
  deploygcc --> deploydags
end

subgraph CKAN Integration
  setschema[Set Schema from Resource]
  endckan[End CKAN work]
  setschema --> endckan
end

deploydags --> nhsdag
deploydags --> othersite
endckan --> nhs

subgraph NHS
  nhsdag --> nhs
end

classDef done fill:#21bf73,stroke:#333,stroke-width:1px;
classDef nearlydone fill:lightgreen,stroke:#333,stroke-width:1px;
classDef inprogress fill:orange,stroke:#333,stroke-width:1px;
classDef next fill:lightblue,stroke:#333,stroke-width:1px;

class multinodedag done;
class versioning nearlydone;
class setschema,errors,deploydags,nhsdag,deploygcc inprogress;
@rufuspollock

@hannelita can you update this with current state.
