This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

(New Office) White House Initiatives: create scraper for these offices (Phase 1) #163

Open
6 tasks
higorspinto opened this issue May 26, 2020 · 0 comments

higorspinto commented May 26, 2020

The White House Initiatives are among the new offices whose datasets need to be ingested into the data portal. To make that possible, we need a new scraper that crawls and parses the available webpages of these offices:

https://sites.ed.gov/hispanic-initiative/
https://sites.ed.gov/whieeaa/
https://sites.ed.gov/whhbcu/

Acceptance Criteria

  • We have a functional crawler that crawls through the webpages of the offices
  • We have a functional parser that understands the page structures and generates structured data
  • Datasets are produced when the scraper is run
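
To make the first criterion concrete, here is a minimal sketch of a crawler that stays within one initiative's site. It uses only the Python standard library, not necessarily the framework this project uses. The names `LinkCollector`, `fetch`, `crawl`, and the `max_pages` limit are hypothetical and chosen for illustration only.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch(url):
    """Default fetcher. Assumption: pages are HTML that decodes as UTF-8."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl limited to URLs that start with the seed URL."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(seed) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("https://sites.ed.gov/whhbcu/")` would return a mapping of page URL to HTML for that office. Passing a custom `fetch_page` makes the crawl testable without network access.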

Tasks

  • Identify the possible page structures in the target site
  • Write one or multiple parsers that cover as many cases as possible
  • Test that the scraper runs correctly within the pipeline
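
As a starting point for the parser task, the sketch below turns one page into structured records by picking out links to downloadable files. The extension list and the record fields (`name`, `url`, `format`, `source_url`) are assumptions for illustration, not the data portal's actual schema. The real page structures still need to be identified first.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: file types that count as dataset resources.
DATA_EXTENSIONS = {".csv", ".xls", ".xlsx", ".pdf", ".zip", ".json"}


class ResourceParser(HTMLParser):
    """Builds one record per <a> tag that links to a data file."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.resources = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = urljoin(self.page_url, self._href)
            # Use the URL path only, so query strings do not hide the extension.
            ext = os.path.splitext(urlparse(url).path)[1].lower()
            if ext in DATA_EXTENSIONS:
                self.resources.append({
                    "name": " ".join("".join(self._text).split()),
                    "url": url,
                    "format": ext.lstrip("."),
                    "source_url": self.page_url,
                })
            self._href = None


def parse_page(page_url, html):
    """Returns a list of resource records found in one page."""
    parser = ResourceParser(page_url)
    parser.feed(html)
    parser.close()
    return parser.resources
```

Pages with a different layout (for example, tables of reports or embedded documents) would need their own parser, which is why the task list allows for more than one.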

Jira Card

@higorspinto changed the title from "(New Office) White House Initiatives on: create scraper for these offices" to "(New Office) White House Initiatives: create scraper for these offices" on May 26, 2020
@higorspinto changed the title from "(New Office) White House Initiatives: create scraper for these offices" to "(New Office) White House Initiatives: create scraper for these offices (Phase 1)" on May 27, 2020