
The CI uses up a lot of resources #212

Open
TheKnarf opened this issue Aug 1, 2024 · 20 comments
Labels: documentation (Improvements or additions to documentation) · enhancement (New feature or request) · ready for implementation (The problem was discussed and is ready for a PR)
Milestone: Bevy Jam 6

Comments

@TheKnarf

TheKnarf commented Aug 1, 2024

Screenshot 2024-08-01 at 14 41 38

A single game jam was all it took.

I realized too late that I should have disabled the Windows/Linux/macOS builds, since I only really needed the web build for the game jam. I think the template should probably disable those by default and require people to enable them if they need them.

Maybe there are other things one could do to reduce the number of minutes it uses? Maybe running cargo fmt and cargo test isn't necessary by default on main. Maybe cargo test could run without also processing the docs folder? Are there other improvements one could make?
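
For illustration, trimming the release build matrix down to just the web target might look roughly like this; I'm guessing at the job layout and target names here rather than quoting the template's actual workflow:

# Hypothetical excerpt of a release workflow; names and structure are illustrative only.
jobs:
  build:
    strategy:
      matrix:
        include:
          - platform: web
            target: wasm32-unknown-unknown
          # Native builds commented out to save CI minutes on a private repo;
          # re-enable whichever platforms you actually release to.
          # - platform: linux
          #   target: x86_64-unknown-linux-gnu
          # - platform: windows
          #   target: x86_64-pc-windows-msvc
          # - platform: macos
          #   target: x86_64-apple-darwin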

@TheKnarf
Author

TheKnarf commented Aug 1, 2024

Screenshot 2024-08-01 at 14 47 00

CI builds took almost 30 minutes to run.

Screenshot 2024-08-01 at 14 47 39

The release build took over 30 minutes when building for all targets, and less than 15 minutes after disabling everything except the web build.

@janhohenheim
Member

janhohenheim commented Aug 1, 2024

We intentionally made it do as much as possible, assuming that most users would open source their code by default (which means unlimited GitHub Actions minutes) or be ready to deal with the consequences of having a private repository, i.e. fiddling with the workflow.

That said, it would be fairly easy to enable or disable the builds based on a cargo generate flag. How would you feel about that?

The CI builds should take like a minute at most when cached, and an uncached release takes about 10 minutes for me. Did something go wrong there? Do private repos get less powerful CI agents? @TimJentzsch, cargo-cache works on private repos, right?
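
To sketch the cargo generate idea (the placeholder name and job layout below are made up for illustration, not the template's real workflow): the flag would be declared as a boolean placeholder in cargo-generate.toml, and the native release jobs in the templated workflow could then be wrapped in a Liquid conditional:

# Illustrative excerpt of a templated release workflow; "native_builds" is a
# hypothetical bool placeholder that cargo-generate would prompt for.
jobs:
  build-web:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --release --target wasm32-unknown-unknown
{% if native_builds %}
  build-windows:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --release
{% endif %}

One thing to watch out for: workflow files that use ${{ ... }} expressions would likely need Liquid raw blocks or similar escaping so cargo-generate doesn't try to expand them itself.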

@TimJentzsch

Under the hood it uses the GitHub Actions cache; I'm not aware of any restrictions for private repositories.

@benfrankel
Collaborator

benfrankel commented Aug 1, 2024

Yeah I'm curious if caching wasn't working, or otherwise why it was taking so long. Looking at my own game's CI, I see this error when trying to save the cache (https://github.com/benfrankel/blobo_party/actions/runs/10139032444):

zstd: error 70 : Write error : cannot write block : No space left on device 
/usr/bin/tar: cache.tzst: Cannot write: Broken pipe
/usr/bin/tar: Child returned status 70
/usr/bin/tar: Error is not recoverable: exiting now
Warning: Failed to save: "/usr/bin/tar" failed with error: The process '/usr/bin/tar' failed with exit code 2

This started happening on the same CI run that spontaneously started recompiling from scratch, making CI take 5x longer (~15 minutes), which is very strange because the cargo --version / Cargo.toml / Cargo.lock weren't touched as part of that commit. Maybe related:

2024-08-01_1722530283_1284x74

This is with 5 saved caches at 3 GB each.

On the other hand, my releases were all under 10 minutes because I reused the same tag every time and hit the cache.

@janhohenheim
Member

janhohenheim commented Aug 1, 2024

@mockersf thanks! That explains the performance issues then. Nothing we can fix, but it might be good to point it out in the docs.
That leaves the spontaneously broken cache, which sounds like a bug in either GitHub's cache or cargo cache. @TimJentzsch, what do you think? Is there something we can do about this?

Edit: oh, looking at the logs, the machine simply ran out of space! Yeah, I've had that happen to me as well a couple of times. Don't think there's much we can do there other than restarting a build. Although you're right, this shouldn't invalidate the cache. Hmm.

@TimJentzsch

Hmm, that's interesting.
GitHub should automatically delete old caches when the cache limit is exceeded, but maybe it has a delay.

Maybe we can start optimizing the size of the cache though.

@benfrankel
Collaborator

benfrankel commented Aug 1, 2024

So I noticed that in the CI run I linked, the Docs and Tests jobs actually restored their cache from the Release workflow that ran right before as a fallback, then Docs was able to save its cache but Tests was not, so every subsequent run of Tests used the Docs cache instead. It seems like cache thrashing / out of cache space in my particular case.

@TimJentzsch

> So I noticed that in the CI run I linked, the Docs and Tests jobs actually restored their cache from the Release workflow that ran right before as a fallback, then Docs was able to save its cache but Tests was not, so every subsequent run of Tests used the Docs cache instead. It seems like cache thrashing / out of cache space in my particular case.

Ah shoot. We should probably remove the last cache key fallback. It doesn't make sense for Check to fall back to a cache from Test, for example, because they need different builds anyway; it just increases the cache size for nothing.
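
In terms of the underlying actions/cache inputs (we actually go through the cargo-cache action, so take this purely as a sketch of the idea, with made-up paths and key names), the change would be roughly:

# Illustrative cache step; the real workflows use Leafwing-Studios/cargo-cache,
# so the paths and key names here are hypothetical.
- uses: actions/cache@v4
  with:
    path: |
      ~/.cargo/registry
      ~/.cargo/git
      target/
    # Exact key: this job plus the lockfile hash.
    key: ${{ github.job }}-${{ hashFiles('**/Cargo.lock') }}
    # Fall back only to caches from the same job...
    restore-keys: |
      ${{ github.job }}-
    # ...and drop any broader "last resort" fallback that could restore a cache
    # from an unrelated job such as Release or Docs.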

@benfrankel
Collaborator

benfrankel commented Aug 1, 2024

Well I think the fallback itself is okay and shouldn't increase cache size, the problem is that Tests was unable to save its own cache afterwards for some reason.

Unless using a fallback means that when the cache is saved at the end of the job, it will also include any irrelevant stuff it downloaded from the fallback, thus increasing cache sizes every time a fallback occurs? If that's the case... maybe it would be better to have no fallback at all.

@TimJentzsch

> Well I think the fallback itself is okay and shouldn't increase cache size, the problem is that Tests was unable to save its own cache afterwards for some reason.
>
> Unless using a fallback means that when the cache is saved at the end of the job, it will also include any irrelevant stuff it downloaded from the fallback, thus increasing cache sizes every time a fallback occurs? If that's the case... maybe it would be better to have no fallback at all.

Yes, the stuff from the fallbacks will also be saved again, because it's in the same folder.
But without fallbacks I think we would get a cache miss too often...
If we only use fallbacks from the same job, it should be fine.
Unrelated jobs can double the cache size in the worst case.

@benfrankel
Collaborator

benfrankel commented Aug 1, 2024

Could there be a situation where cache Z is a fallback from cache Y, which is a fallback from cache X, ... causing an ever-ballooning cache size over time? I wonder if there's a way to prune unused stuff from target/, but I would be surprised.
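
If it came to that, an existing tool like cargo-sweep could in principle prune stale artifacts right before the cache is saved; a rough sketch (this step is hypothetical and not something the template currently does):

# Hypothetical pruning step, run just before the cache is saved.
- name: Prune stale build artifacts
  run: |
    cargo install cargo-sweep --locked
    # Remove target/ artifacts that haven't been touched in the last 7 days.
    cargo sweep --time 7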

@TimJentzsch

Yes, I think in theory that's possible.
We should adjust the fallback in cargo-cache a bit.

Otherwise I guess it's a good idea to clear your caches in CI occasionally.
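
Clearing them doesn't have to be manual, for what it's worth; assuming a gh CLI new enough to ship the gh cache subcommand, a small scheduled workflow could do it (purely illustrative, not something the template ships):

# Hypothetical maintenance workflow that wipes the Actions caches weekly.
name: Clear caches
on:
  schedule:
    - cron: '0 4 * * 0'   # Sundays at 04:00 UTC
  workflow_dispatch:
jobs:
  clear:
    runs-on: ubuntu-latest
    permissions:
      actions: write      # required to delete Actions caches
    steps:
      - run: gh cache delete --all --repo "$GITHUB_REPOSITORY"
        env:
          GH_TOKEN: ${{ github.token }}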

@janhohenheim
Member

Is it alright if I move this issue to cargo cache?

@benfrankel
Collaborator

benfrankel commented Aug 2, 2024

IMO that makes sense for cache size specifically, but this issue should remain here as well since we ought to consider the fact that private runners have worse performance + are not free.

@janhohenheim
Member

Would a solution be to ask you in cargo generate which builds you want?

@benfrankel
Collaborator

Yes, and a note in the workflows doc that mentions the issue with private runners, so users hopefully see that when they're setting up CI.

@benfrankel benfrankel added documentation Improvements or additions to documentation question Further information is requested labels Aug 2, 2024
@benfrankel benfrankel added this to the Bevy Jam 6 milestone Aug 2, 2024
@TimJentzsch

I created an issue for this on the cargo-cache side: Leafwing-Studios/cargo-cache#22

@janhohenheim janhohenheim added the ready for implementation The problem was discussed and is ready for a PR label Aug 5, 2024
@benfrankel
Collaborator

Here's another issue on the cargo-cache side for the root cause / optimizing the cache size down even further: Leafwing-Studios/cargo-cache#24

benfrankel pushed a commit that referenced this issue Aug 28, 2024
Addresses part of #212.

I recently merged a feature into
[Leafwing-Studios/cargo-cache](https://github.com/Leafwing-Studios/cargo-cache)
v2.3.0 that automatically deletes unused build artifacts before creating
a new cache, to prevent cache sizes from snowballing.

It's disabled by default for backward-compatibility reasons, so this
PR enables it. Cache sweeping should keep cache sizes small and workflow
runs fast.

(As an aside, should a cache be used at all in the release build? Caching
prevents builds from being reproducible, and can introduce the
occasional issue when building.)
@benfrankel
Collaborator

benfrankel commented Aug 29, 2024

Remaining tasks:

  • Add a warning about private runners (costs, quota, smaller / slower) + suggest disabling platforms you don't want to release to
  • Add a flag to opt out of caching in the release workflow to save cache space (rough sketch below)
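
For the second task, the rough sketch I have in mind is gating the cache step behind a workflow_dispatch input, something like this (the input and step names are placeholders, and the cargo-cache inputs the template normally passes are omitted):

# Rough sketch only; not the template's actual release workflow.
on:
  workflow_dispatch:
    inputs:
      use-cache:
        description: Reuse the cargo cache for this release build
        type: boolean
        default: false
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Restore cargo cache
        if: ${{ inputs.use-cache }}
        uses: Leafwing-Studios/cargo-cache@v2
      - run: cargo build --release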

@benfrankel benfrankel added enhancement New feature or request and removed question Further information is requested labels Aug 29, 2024