Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43960: [R] fix str_sub binding to properly handle negative end values #44141

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

coussens
Copy link

@coussens coussens commented Sep 17, 2024

First-time contributor here, so let me know where I can improve!

Rationale for this change

The str_sub binding in arrow was not handling negative end values properly. The problem was two-fold:

  1. When end values were negative (and less than the start value, which might be positive), str_sub would improperly return an empty string.
  2. When end values were < -1 but the end position was still to the right of the start position, str_sub failed to return the final character in the substring, since it did not account for the fact that end is counted exclusively in the underlying C++ function (utf8_slice_codeunits), but inclusively in R.

See discussion/examples at #43960 for details.

What changes are included in this PR?

  1. The removal of lines from r/R/dplyr-funcs-string.R that previously set end= 0 when start < end, which meant if the user was counting backwards from the end of the string (with a negative end value), an empty string would [wrongly] be returned. It appears that the case that the previous code was trying to address is already handled properly by the underlying C++ function (utf8_slice_codeunits).
  2. Addition of lines to r/R/dplyr-funcs-string.R in order to account the difference in between R's inclusive end and C++'s exclusive end when end is negative.
  3. The addition of a test (described below) to r/tests/testthat/test-dplyr-funcs-string.R to test for these cases.

Are these changes tested?

Yes, I ran all tests in r/tests/testthat/test-dplyr-funcs-string.R, including one which I added (see attached commit), which explicitly tests the case where end is negative (-3) and less than the start value (1). This also tests the case where end < -1 and to the right of the start position.

Are there any user-facing changes?

No.

This PR contains a "Critical Fix". Previously:

  • When end values were negative (and less than the start value, which might be positive), str_sub would improperly return an empty string.
  • When end values were < -1 but the end position was still to the right of the start position, str_sub failed to return the final character in the substring, since it did not account for the fact that end is counted exclusively in the underlying C++ function (utf8_slice_codeunits), but inclusively in R.

Copy link

⚠️ GitHub issue #43960 has been automatically assigned in GitHub to PR creator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant