
Improve BIG file dependencies #11

Open
ruohoruotsi opened this issue Jul 7, 2019 · 2 comments
Labels: bug, enhancement, help wanted


@ruohoruotsi (Member)

We need to fix the big file dependencies in this project:

  • The pre-trained ADR model (binary) is an 88 MB file living in the model folder. This makes for a very heavy upload/download from PyPI.
  • The torch dependency in requirements.txt pulls down the GPU build of torch by default. This makes integration with Heroku and RTD difficult/impossible because of hard size limits. It would be better to integrate and use a CPU-only build (see the sketch below for one way to pin it). Is this compatible with Travis CI and requirements.txt?
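
For illustration, one way to pin a CPU-only build in requirements.txt is PyTorch's stable wheel index; the version pin below is only an example, not the project's actual requirement:

```
# requirements.txt (sketch; the version pin is illustrative)
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.1.0+cpu
```

Since pip honors -f lines inside requirements files, the same file should work unchanged on Travis CI.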

To facilitate all this:

  • all the ADR pre-trained models live in this Bintray artifactory
  • Is there some clever way (or post-install script) that we can use to download them locally as needed? (A sketch of the on-demand approach follows this list.)
  • The upside is that the iranlowo download stays fast/small, and you can then separately pull down the models to do inference/prediction.
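
A minimal sketch of the on-demand approach, assuming a hypothetical Bintray download URL and a local cache under ~/.iranlowo (both are placeholders, not the project's actual layout):

```python
import os
import urllib.request

# Placeholder values, not the project's real settings.
MODEL_URL = "https://dl.bintray.com/example/iranlowo/adr_model.pt"
MODEL_PATH = os.path.join(os.path.expanduser("~"), ".iranlowo", "adr_model.pt")

def fetch_adr_model():
    """Download the pre-trained ADR model on first use; reuse the cached copy after."""
    if not os.path.exists(MODEL_PATH):
        os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
        urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
    return MODEL_PATH
```

With something like this, pip install iranlowo stays small, and the 88 MB model only moves over the wire when someone actually runs inference/prediction.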
ruohoruotsi added the bug, enhancement, and help wanted labels on Jul 7, 2019
@Olamyy (Contributor) commented Jul 9, 2019

A possible workaround here would be to have a standalone repository for models.
So, if a user needs any functionality tied to a model, a check runs to see if they've cloned/downloaded the model; if not, an error is raised (see the sketch below). This is how I've seen a lot of projects handle this challenge. On the Travis end, we can have it clone that same repository each time a test needs to run. The major challenge here is having the user do multiple installs.
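
A sketch of that check, with the path and error text as placeholders:

```python
import os

def require_model(path):
    """Fail fast with instructions if the user hasn't fetched the model yet."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"ADR model not found at {path}. "
            "Clone/download the models repository first (see the README)."
        )
    return path
```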

I'm not very familiar with torch, as I use keras more, but why haven't we considered zipping the file yet? Is that going to reduce performance somehow? If not, it would solve the challenge of having to do multiple installs.
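
The compression question is easy to measure directly; a quick experiment along these lines (the file path is a placeholder) would report the exact factor:

```python
import gzip
import os
import shutil

src = "model/adr_model.pt"  # placeholder path to the 88 MB model
dst = src + ".gz"

# Write a max-compression gzip copy and compare sizes.
with open(src, "rb") as f_in, gzip.open(dst, "wb", compresslevel=9) as f_out:
    shutil.copyfileobj(f_in, f_out)

print(f"compression factor: {os.path.getsize(src) / os.path.getsize(dst):.2f}x")
```

Trained float weights are close to incompressible, so the factor tends to be modest.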

@ruohoruotsi (Member, Author)

Let me tackle matters in order:

  • Regarding "standalone repository for models", I've been saving the models here because pre-optimization (April 2019 time frame), the models were 200MB and too big for github. I listed the link in the top post above ☝️

"all the ADR pre-trained models live in this Bintray artifactory"

  • Regarding zipping the file: I already optimized the size of the pytorch model, see this issue. That optimization basically removed the "intermediate back-propagation information" needed to continue training from a particular model checkpoint (the first sketch after this list shows the idea). I don't think additional optimization will gain much, but that is another experiment, to see what the exact compression factor is.

  • Finally, back in April, when I was trying to get things started, the 200 MB model wasn't going to go onto GitHub and I was using the Bintray artifactory. I thought perhaps I could add a pre-install step to setup.py, so I asked this question on the repo of the setupmeta project we use to ease setup. The answer is yes: you can use a pre/post-install step to programmatically download from the artifactory, so that is the path I think we need to explore. I can tackle this next week; I think it'll take some experimentation (trial & error) to ensure that things work smoothly. (The second sketch after this list outlines the setup.py hook.)

    This is the StackOverflow thread with more details/instructions on how to implement it.
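
For the checkpoint-slimming point above, the usual pytorch pattern is to save only the weights and drop the optimizer state; the file names and checkpoint dict layout here are assumptions, not the project's actual code:

```python
import torch

# Assumed layout: a training checkpoint that bundles the weights with the
# optimizer state (the "intermediate back-propagation information").
ckpt = torch.load("model/adr_checkpoint.pt", map_location="cpu")

# Keep only what inference needs; dropping the optimizer state is where
# the big size reduction comes from.
torch.save(ckpt["model_state_dict"], "model/adr_model_slim.pt")
```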
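
And for the pre/post-install idea, a sketch of a setuptools install hook (URL and paths are placeholders; one possible shape, not the final implementation):

```python
import os
import urllib.request
from setuptools import setup
from setuptools.command.install import install

MODEL_URL = "https://dl.bintray.com/example/iranlowo/adr_model.pt"  # placeholder

class PostInstall(install):
    """Run the standard install, then pull the model from the artifactory."""
    def run(self):
        install.run(self)
        target = os.path.join(os.path.expanduser("~"), ".iranlowo", "adr_model.pt")
        if not os.path.exists(target):
            os.makedirs(os.path.dirname(target), exist_ok=True)
            urllib.request.urlretrieve(MODEL_URL, target)

setup(
    name="iranlowo",
    cmdclass={"install": PostInstall},
)
```

One caveat worth covering in the trial & error: pip skips custom install commands when installing from a wheel, so a hook like this only fires for sdist installs.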
