-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Restore for OperatorField data #1646
base: main
Are you sure you want to change the base?
Conversation
ToDo: Track down all the CPU memory leaks, add GPU restores, language wrapers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question is whether the extra check during the restore is worth the new API, or just force the users to call *Destroy
on the objects obtained from CeedOperatorFieldGet*
.
I'm not sure what bugs the extra CeedCheck(*vec == op_field->vec, ...)
would catch.
See later comment for the fluids code that drove me to the above question.
2361dc6
to
54e15cd
Compare
Ok, rebased to work off the branch with the lingering Operator memory fix. Impl for CPU ref is in, rest in progress |
4502c7f
to
02da2a3
Compare
CPU blocked and ref in, GPU non-gen next. (would need to rebase in another open PR to do gen) OCCA will be annoying so I'm saving it for last |
fc5c85c
to
b377dfe
Compare
6feaff7
to
5862f7b
Compare
Ok, should be good to review now |
65d2d14
to
dd9ae86
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll run on Sunspot real quick to verify the SYCL changes.
|
||
CeedCallBackend(CeedOperatorFieldGetVector(output_fields[i], &vec)); | ||
if (vec == CEED_VECTOR_ACTIVE) { | ||
CeedCallBackend(CeedOperatorFieldGetBasis(output_fields[i], &basis_out)); | ||
CeedCallBackend(CeedOperatorFieldGetElemRestriction(output_fields[i], &rstr_out)); | ||
CeedCheck(!rstr_out || rstr_out == rstr_in, ceed, CEED_ERROR_BACKEND, "Backend does not implement multi-field non-composite operator assembly"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the change in check here deliberate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obviously it aligns more with the CUDA and HIP backends, but just making sure we're at least mostly certain that the check isn't necessary anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SYCL backend is a huge problem IMHO. During development, changes were made from the CUDA/HIP source that didn't have well captured reasons, and since then the CUDA/HIP source has changed, so its really tricky to make good changes here, especially with the limited access to a good development environment. I don't see a good fix in the near term
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wonder if it's worthwhile getting you access to at least sunspot (even if it isn't Aurora right now) just for the backend support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jedbrown Thoughts on getting Jeremy access to Sunspot? CI would be nice too, but these kind of dev changes would be easier in a shorter debug-loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Giving me a way to build and test would be very helpful - I'd like to make a number of code quality fixes and transfer some CUDA and HIP improvements over
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have authority to grant Sunspot access (cc: @KennethEJansen ). I would consider buying a (modestly priced) Intel GPU to put in Noether for dev/testing. Cursory web search hasn't been good at finding actual vendors and I don't know what pricing on the Max 1100 looks like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, one way or another it would be hugely helpful to the health of those backends to get me access to Sunspot or to get an Intel card somewhere ourselves, otherwise they run the risk of becoming like the OCCA backend
I've pushed up more fixes to
where it's
The rest are correctness failures in assembly. I'd try limiting the SYCL changes to the absolute minimum required to get this PR merged. |
Those changes in that branch look good, so feel free to push them to this branch |
8a95c7e
to
be77d12
Compare
@jrwrigh I think I put back the pieces that were accidentally lost |
Needed to apply the following diff to compile: diff --git i/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp w/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp
index 9d2ecec6..a784e1e5 100644
--- i/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp
+++ w/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp
@@ -662,7 +662,7 @@ static inline int CeedOperatorAssembleDiagonalSetup_Sycl(CeedOperator op) {
CeedBasis basis;
CeedCallBackend(CeedOperatorFieldGetElemRestriction(op_fields[i], &elem_rstr));
- CeedCheck(!rstr_in || rstr_in == rstr, ceed, CEED_ERROR_BACKEND,
+ CeedCheck(!rstr_in || rstr_in == elem_rstr, ceed, CEED_ERROR_BACKEND,
"Backend does not implement multi-field non-composite operator diagonal assembly");
if (!rstr_in) CeedCallBackend(CeedElemRestrictionReferenceCopy(elem_rstr, &rstr_in));
CeedCallBackend(CeedOperatorFieldGetBasis(op_fields[i], &basis)); But I'm still seeing correctness failures for:
All assembly errors though, so you managed to squash some of the issues. |
I'll keep plunking away, thanks for the cross-check! |
c759d50
to
9a34422
Compare
|
Note
will give the details on the failures for just those |
Fixed t506, but the others are still failing: sorted this time
They fail with all the SYCL backends (ref, shared, and gen). Not too surprising since the assembly always goes through the ref backend, but just another sanity check. |
Its the same routine for all of them, so now you only need to look at |
Found another difference. Any useful error messages in the test suite output with |
I've been running with
or similar. |
Interesting, I wouldn't have expected getting 0s |
Yeah, all the errors are in the form |
9fddab1
to
067190e
Compare
Huzzah, at long last. I think this is good to add now, and I can spin up the companion PR to MFEM. Mildly disruptive but not a huge deal as it just leaks if not fixed downstream |
067190e
to
a071b1c
Compare
Fixes #1639
What do we think @jrwrigh. You can see how this does get a bit invasive.