Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache engine for knitr #167

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

cache engine for knitr #167

wants to merge 5 commits into from

Conversation

tmastny
Copy link

@tmastny tmastny commented Feb 26, 2018

This is my attempt to add Python session caching between knitr chunks in reticulate. The pull request addresses knitr issue #1505.

First, note that the cache engine depends on a new cache engine API seen in knitr pull request #1518. The basic idea is that we need to add this line to the setup chunk:

knitr::cache_engines$set(python = reticulate::cache_eng_python)

My implementation depends on the Python module dill. This is a well-regarded extension to basic Python pickling. As you can see in the Github issues there are known limitations, but dill is actually able to save more types of objects than the standard pickle. This includes session variables (with names) and imported modules. I have a dill unit test showcasing some basic functionality.

The main purpose of the pull request as described in knitr issue #1505 is tested in the cache engine unit test.

@jjallaire
Copy link
Member

I'll let @yihui comment on the knitr PR and @kevinushey comment on how this fits with our knitr engine. One thing I'd like to see is that the dll module isn't actually required but instead used if it's available. It also seems like there should be an option toggling pickle vs. dll.

@tmastny
Copy link
Author

tmastny commented Feb 26, 2018

Thanks for the feedback. If you look at line 185 of R/knitr-engine.R, if the dill module is not installed it drops a warning that dill is needed for full caching functionality. Then it continues as normal.

As for pickle vs. dill, that is a more complicated problem. dill is able to save more types of objects than pickle in an easier way (one call to dill.dump_session(...). This includes loaded modules and named variables which pickle can't handle, but are very important for knitr caching.

It would be much more difficult to setup the same system using pickle, and I worry I would be duplicating the existing dill functionality.

@jjallaire
Copy link
Member

Okay, that's a fair point about duplicating work. I think that the warning as you have it is fine.

R/knitr-engine.R Outdated
}})
if (dill == "No dill") return()

py_run_string("globals().pop('r')")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs to ensure that the r / R object is truly the 'shim' R object introduced by reticulate, since a user could in theory overwrite it.

Copy link
Author

@tmastny tmastny Mar 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I'm not actually sure how to check if the class R is the one introduced by reticulate or the user. Take a look at my latest commit and see if you think it is sufficient. If not, I can try to find a work around.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks sufficient to me, but you're entirely right that we should have a way of making this more deterministic. Maybe that class should instead be defined within our rpytools module?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. Then we could possibly do something like this: https://stackoverflow.com/questions/14570802/python-check-if-object-is-instance-of-any-class-from-a-certain-module

I'll see if I can implement something, although I haven't messed with that side of the package before.

R/knitr-engine.R Outdated
stop(module$message)
}
} else {
py_run_string("import dill")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line necessary? IIUC the above import attempt should ensure the module is loaded if needed. Or is this done to put dill into the main module?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it isn't necessary. This was a hold over from a previous caching attempt. I just added a commit removes it. The new commit passes all tests on my end.

@kevinushey
Copy link
Collaborator

This PR looks good to me, but I think we might want to hold this until the next release of reticulate just because we want to stabilize soon.

The knitr PR also looks nice and clean but maybe we can give Yihui a nudge later to see if he has any extra comments.

It looks like there are some merge conflicts on this PR (sorry, probably from my recent merge). Would you be able to fix those up?

@yihui
Copy link
Member

yihui commented Mar 6, 2018

@tmastny I don't quite see the usefulness of the object cache_engines in knitr (I admit I didn't think very carefully about it due to lack of time). The engine function registered in knit_engines$set() can be responsible for caching. For example, the Rcpp engine handles caching by itself, and knitr does not need to know anything about it (just pass knitr's cache dir to this engine):

https://github.com/yihui/knitr/blob/0da648bff63/R/engine.R#L194-L199

@tmastny
Copy link
Author

tmastny commented Mar 6, 2018

@yihui I will carefully review Rcpp's cache engine and see what I can learn.

However, the reason I choose to add a cache_engines object in knitr rather than handle it in eng_python is because the current flow of the knitr engine prevented the python engine from handling the cache. I alluded to this in my issue, but here's a diagram I think will help. This is the current flow of block.R as I understand it, with the green box representing the cache_engines object.

image
As you can see, if a cache currently exists the function call_block loads the chunk, returns, and loops to the next chunk. It never calls the python engine, so the python engine never has a chance to load a python cache for that chunk.

I may be wrong, but I believe what you are suggesting (allowing the engines to handle caching) would require some additional refactoring on the knitr side to get something like this:

image

@yihui
Copy link
Member

yihui commented Mar 6, 2018

@tmastny Thanks for the diagrams! For the Rcpp engine, I actually make an exception here: https://github.com/yihui/knitr/blob/0da648bff6/R/block.R#L72 Perhaps we need to do the same thing for the python engine. In the long run, knitr needs a mechanism to let engines handle caching by themselves, which seems to be the goal of your PR yihui/knitr#1518. I was wondering if it was possible not to introduce a new object cache_engines.

@eddelbuettel
Copy link
Contributor

Not to state the obious, and I only glanced at this, but I also think that knitr benefits from Rcpp::sourceCpp() having a native cache logic -- sourceCpp() knows it won't need to recompile unchanged code thanks to JJ putting that logic in there in the first place.

@tmastny
Copy link
Author

tmastny commented Mar 6, 2018

@yihui Ah, thanks for pointing that out. So if I understand correctly, when the engine is Rcpp the standard R caching no longer executes. I don't believe this would work for my implementation.

My solution only caches the python environment and depends on R to cache everything else. I'd have to go back to the drawing board to make the python engine cache everything.

Edit: This could work if R has nothing important to cache when the chunk is ran through the python engine, but seeing as how reticulate can exchange data with R, I think R should be allowed to cache.

@yihui
Copy link
Member

yihui commented Mar 6, 2018

@tmastny Okay, I'll spend more time on understanding your PR when I have a chance. Thanks!

@yihui
Copy link
Member

yihui commented Apr 19, 2018

FYI I have merged the knitr PR yihui/knitr#1505.

@cderv
Copy link
Contributor

cderv commented Jan 21, 2021

@kevinushey is this something you will consider ?

The knitr PR is merged but the following part in reticulate wasn't. I am wondering if there is a reason or just because time flies. 😄

@kevinushey
Copy link
Collaborator

The PR has some conflicts and it's a bit stale -- @tmastny, sorry for letting this drop to the wayside but would you have time to bring this PR up-to-date?

@tmastny
Copy link
Author

tmastny commented Jan 22, 2021

Yes I could revisit this in a couple of weeks.

@leogama
Copy link

leogama commented Apr 22, 2022

Working on this issue: uqfoundation/dill#463, yihui/knitr#1505

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants