
Improve BIG file dependencies #11

Open
ruohoruotsi opened this issue Jul 7, 2019 · 2 comments
Labels: bug, enhancement, help wanted


@ruohoruotsi (Member)

We need to fix the big file dependencies in this project:

  • The pre-trained ADR model (binary) is an 88 MB file living in the model folder. This makes for a very heavy upload/download from PyPI.
  • The torch dependency in requirements.txt pulls down the GPU build of torch by default. This makes integration with Heroku and RTD difficult/impossible because of hard size limits. It would be better to integrate and use a CPU-only build (see the sketch below for one way to pin it). Is this compatible with Travis CI and requirements.txt?
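
For illustration, one way to pin a CPU-only build in requirements.txt is PyTorch's stable wheel index; the version pin below is only an example, not the project's actual requirement:

```
# requirements.txt (sketch; the version pin is illustrative)
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.1.0+cpu
```

Since pip honors -f lines inside requirements files, the same file should work unchanged on Travis CI.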

To facilitate all this:

  • all the ADR pre-trained models live in this Bintray artifactory
  • Is there some clever way (or post-install script) that we can use to download them locally as needed? (A sketch of the on-demand approach follows this list.)
  • The upside is that the iranlowo download stays fast/small, and you can then separately pull down the models to do inference/prediction.
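
A minimal sketch of the on-demand approach, assuming a hypothetical Bintray download URL and a local cache under ~/.iranlowo (both are placeholders, not the project's actual layout):

```python
import os
import urllib.request

# Placeholder values, not the project's real settings.
MODEL_URL = "https://dl.bintray.com/example/iranlowo/adr_model.pt"
MODEL_PATH = os.path.join(os.path.expanduser("~"), ".iranlowo", "adr_model.pt")

def fetch_adr_model():
    """Download the pre-trained ADR model on first use; reuse the cached copy after."""
    if not os.path.exists(MODEL_PATH):
        os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
        urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
    return MODEL_PATH
```

With something like this, pip install iranlowo stays small, and the 88 MB model only moves over the wire when someone actually runs inference/prediction.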
ruohoruotsi added the bug, enhancement, and help wanted labels on Jul 7, 2019
@Olamyy (Contributor) commented Jul 9, 2019

A possible workaround here would be to have a standalone repository for models.
So, if a user needs any functionality tied to a model, a check runs to see if they've cloned/downloaded the model; if not, an error is raised (see the sketch below). This is how I've seen a lot of projects handle this challenge. On the Travis end, we can have it clone that same repository each time a test needs to run. The major challenge here is having the user do multiple installs.
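
A sketch of that check, with the path and error text as placeholders:

```python
import os

def require_model(path):
    """Fail fast with instructions if the user hasn't fetched the model yet."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"ADR model not found at {path}. "
            "Clone/download the models repository first (see the README)."
        )
    return path
```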

I'm not very familiar with torch, as I use keras more, but why haven't we considered zipping the file yet? Is that going to reduce performance somehow? If not, it would solve the challenge of having to do multiple installs.
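
The compression question is easy to measure directly; a quick experiment along these lines (the file path is a placeholder) would report the exact factor:

```python
import gzip
import os
import shutil

src = "model/adr_model.pt"  # placeholder path to the 88 MB model
dst = src + ".gz"

# Write a max-compression gzip copy and compare sizes.
with open(src, "rb") as f_in, gzip.open(dst, "wb", compresslevel=9) as f_out:
    shutil.copyfileobj(f_in, f_out)

print(f"compression factor: {os.path.getsize(src) / os.path.getsize(dst):.2f}x")
```

Trained float weights are close to incompressible, so the factor tends to be modest.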

@ruohoruotsi (Member, Author)

Let me tackle matters in order:

  • Regarding "standalone repository for models", I've been saving the models here because pre-optimization (April 2019 time frame), the models were 200MB and too big for github. I listed the link in the top post above ☝️

"all the ADR pre-trained models live in this Bintray artifactory"

  • Regarding zipping the file: I already optimized the size of the pytorch model, see this issue. That optimization basically removed the "intermediate back-propagation information" needed to continue training from a particular model checkpoint (the first sketch after this list shows the idea). I don't think additional optimization will gain much, but that is another experiment, to see what the exact compression factor is.

  • Finally, back in April, when I was trying to get things started, the 200 MB model wasn't going to go onto GitHub and I was using the Bintray artifactory. I thought perhaps I could add a pre-install step to setup.py, so I asked this question on the repo of the setupmeta project we use to ease setup. The answer is yes: you can use a pre/post-install step to programmatically download from the artifactory, so that is the path I think we need to explore. I can tackle this next week; I think it'll take some experimentation (trial & error) to ensure that things work smoothly. (The second sketch after this list outlines the setup.py hook.)

    This is the StackOverflow thread with more details/instructions on how to implement it.
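
For the checkpoint-slimming point above, the usual pytorch pattern is to save only the weights and drop the optimizer state; the file names and checkpoint dict layout here are assumptions, not the project's actual code:

```python
import torch

# Assumed layout: a training checkpoint that bundles the weights with the
# optimizer state (the "intermediate back-propagation information").
ckpt = torch.load("model/adr_checkpoint.pt", map_location="cpu")

# Keep only what inference needs; dropping the optimizer state is where
# the big size reduction comes from.
torch.save(ckpt["model_state_dict"], "model/adr_model_slim.pt")
```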
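
And for the pre/post-install idea, a sketch of a setuptools install hook (URL and paths are placeholders; one possible shape, not the final implementation):

```python
import os
import urllib.request
from setuptools import setup
from setuptools.command.install import install

MODEL_URL = "https://dl.bintray.com/example/iranlowo/adr_model.pt"  # placeholder

class PostInstall(install):
    """Run the standard install, then pull the model from the artifactory."""
    def run(self):
        install.run(self)
        target = os.path.join(os.path.expanduser("~"), ".iranlowo", "adr_model.pt")
        if not os.path.exists(target):
            os.makedirs(os.path.dirname(target), exist_ok=True)
            urllib.request.urlretrieve(MODEL_URL, target)

setup(
    name="iranlowo",
    cmdclass={"install": PostInstall},
)
```

One caveat worth covering in the trial & error: pip skips custom install commands when installing from a wheel, so a hook like this only fires for sdist installs.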
