This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

[stub] Scrape and harvest dataset documentation #116

Open
4 tasks
nightsh opened this issue Apr 14, 2020 · 0 comments


nightsh commented Apr 14, 2020

In our CKAN extension, dataset documentation is a package that:

  • has type = documentation
  • has a relationship with a package of type dataset
  • has at least one resource
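
As a rough illustration of the criteria above, the check could be sketched as a plain predicate over a package dict. The `type` and `resources` keys follow the usual CKAN package dict layout; the `related_packages` key and its shape are hypothetical, since the issue does not say how the relationship is stored.

```python
from typing import Any, Dict


def is_documentation_package(pkg: Dict[str, Any]) -> bool:
    """Return True if pkg meets the criteria for dataset documentation."""
    # The package must be of type "documentation".
    if pkg.get("type") != "documentation":
        return False

    # Hypothetical key: a list of dicts, each describing a related package.
    # The real extension may store relationships differently.
    related = pkg.get("related_packages", [])
    has_dataset_link = any(r.get("type") == "dataset" for r in related)

    # The package must carry at least one resource.
    has_resource = len(pkg.get("resources", [])) > 0

    return has_dataset_link and has_resource
```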

We can attempt to automatically extract data profile documentation from HTML pages using the following rules:

  • if a page has at least one resource, extract it as a dataset (as we already do)
  • if there's a dataset on the page, check for document type files (e.g. PDFs)
  • from all the document type files, identify which ones to keep (this is the problematic bit, as we might collect many false positives)
  • attempt to extract a description for the documentation based on the context of the most relevant document files
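
A minimal sketch of the document-detection step, using only the Python standard library, might look like the following. The extension list, the function and class names, and the use of the link text as the description are all assumptions for illustration; the sketch does not solve the false-positive problem mentioned above, it only collects candidates.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed set of "document type" extensions; adjust as needed.
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx"}


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_document_links(html):
    """Return candidate documentation files found in an HTML page."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    candidates = []
    for href, text in collector.links:
        # Ignore query strings and fragments when reading the extension.
        ext = os.path.splitext(urlparse(href).path)[1].lower()
        if ext in DOCUMENT_EXTENSIONS:
            # The link text stands in for a context-based description.
            candidates.append({"url": href, "description": text})
    return candidates
```

Deciding which candidates to keep would need an extra filtering or scoring step on top of this, for example based on keywords in the link text or proximity to the dataset's resources.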

In addition, we need to add the resulting documentation to:

  • the scraper output bucket
    • in a new documentation subdirectory
  • the resulting final data.json file
    • with a new @type value
    • with new metadata for the Documentation type:
      • dataset - a link to the dataset it provides documentation for
  • the datajson extension, which needs to be able to process the documentation it finds in the data.json file
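
To make the output changes more concrete, here is one possible shape for a documentation entry and its location in the output bucket. The `@type` value is a placeholder (the issue leaves it undecided), the helper names are hypothetical, and reusing the `distribution` / `downloadURL` fields from the existing data.json dataset schema is an assumption rather than an agreed design.

```python
import posixpath

# Placeholder: the actual @type value is still to be decided.
DOCUMENTATION_TYPE = "Documentation"
# Name of the new subdirectory in the scraper output bucket.
DOCUMENTATION_SUBDIR = "documentation"


def documentation_entry(title, dataset_identifier, resource_urls):
    """Build a data.json-style entry for one piece of documentation."""
    return {
        "@type": DOCUMENTATION_TYPE,
        "title": title,
        # Link to the dataset this documentation describes.
        "dataset": dataset_identifier,
        # Assumption: resources are listed the same way as for datasets.
        "distribution": [{"downloadURL": url} for url in resource_urls],
    }


def documentation_output_key(scraper_prefix, filename):
    """Return the bucket key for a documentation dump file."""
    return posixpath.join(scraper_prefix, DOCUMENTATION_SUBDIR, filename)
```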

Tasks:

  • amend the scraper output bucket and the output traversal functions to use the new structure
  • implement the new parsing rules to get documentation dumps
  • implement the new type in the datajson schema
  • add the documentation found to the datajson output

Acceptance criteria

  • the scraping process generates (and links) documentation for some of the produced datasets