This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

(New Office) Office of Finance and Operations: create scraper for this office (Phase 1) #161

Open
9 tasks
higorspinto opened this issue May 26, 2020 · 0 comments

Comments

Contributor

higorspinto commented May 26, 2020

The Office of Finance and Operations is one of the new offices whose datasets need to be ingested into the data portal. To make that possible, we need a new scraper that crawls and parses the office's available webpages, starting from:

https://www2.ed.gov/about/offices/list/ofo/index.html

Acceptance Criteria

  • We have a functional crawler that crawls through the webpages of the office
  • We have a functional parser that understands the page structures and generates structured data
  • Datasets are produced when the scraper is run
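
As a rough illustration of the crawler criterion, here is a minimal sketch of the link-discovery step using only the Python standard library. It is not the project's actual crawler: `ALLOWED_PREFIX`, `LinkCollector`, and `in_scope_links` are hypothetical names, and the assumption that all of the office's pages live under the `/about/offices/list/ofo/` path is inferred from the start URL only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

START_URL = "https://www2.ed.gov/about/offices/list/ofo/index.html"
# Assumption: pages belonging to the office share this path prefix.
ALLOWED_PREFIX = "/about/offices/list/ofo/"


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def in_scope_links(html, base_url=START_URL):
    """Return de-duplicated links that stay on the office's part of the site."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    result = []
    for link in collector.links:
        parsed = urlparse(link)
        if parsed.netloc != base_host:
            continue
        if not parsed.path.startswith(ALLOWED_PREFIX):
            continue
        # Drop "#fragment" so the same page is not queued twice.
        clean = parsed._replace(fragment="").geturl()
        if clean not in result:
            result.append(clean)
    return result
```

A real crawler would fetch each returned URL and repeat the process until no new in-scope links appear; fetching is left out here so the sketch stays self-contained.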

Tasks

  • Identify the possible page structures in the target site
  • Write one or multiple parsers that cover as many cases as possible
  • Test that the scraper runs correctly within the pipeline
  • Ensure the datasets produced have description metadata
  • Ensure the datasets have publisher metadata
  • Improve other metadata (use defaults where available)
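
To make the metadata tasks more concrete, the following is a minimal sketch of a parser that turns one page into a dataset record with description and publisher fields. It makes several assumptions that are not confirmed by this issue: the record is a plain dictionary, the field names (`source_url`, `title`, `description`, `publisher`) are illustrative, the description comes from the page's `<meta name="description">` tag with the page title as a fallback, and `DEFAULT_PUBLISHER` is a hypothetical default. The real page structures still need to be identified as described in the first task.

```python
from html.parser import HTMLParser

# Assumption: the office name is an acceptable default publisher.
DEFAULT_PUBLISHER = "Office of Finance and Operations"


class PageMetadataParser(HTMLParser):
    """Extracts the <title> text and the meta description from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_dataset(html, source_url):
    """Build a dataset record, falling back to defaults for missing metadata."""
    parser = PageMetadataParser()
    parser.feed(html)
    title = parser.title.strip()
    return {
        "source_url": source_url,
        "title": title,
        # Fall back to the title so the description is never left empty
        # when the page has a title.
        "description": parser.description or title,
        "publisher": DEFAULT_PUBLISHER,
    }
```

Keeping the defaults in one place means that improving other metadata later only requires changing `build_dataset`, not each page-specific parser.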

Jira Card

higorspinto changed the title from "(New Office) Office of Finance and Operations: create scraper for this office" to "(New Office) Office of Finance and Operations: create scraper for this office (Phase 1)" on May 27, 2020