Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for UTF8 string literals #2940

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

iamdmitrij
Copy link

@iamdmitrij iamdmitrij commented May 22, 2024

Fixes #2923.

Implemented the same logic as in plain strings:

  • ""u8 -> "Stryker was here!"u8
  • ""something8 -> ""u8

SyntaxFactory.Literal(string) method used for string mutations doesn't work with UTF8 string literal type (ReadOnlySpan<byte>), so I've taken inspiration on how to make it work from dotnet/roslyn analyzers' code.

Tasks:

  • Modify StringMutator to support UTF8 string literals
  • Unit tests
  • Integration tests
  • Edit docs

@iamdmitrij iamdmitrij marked this pull request as ready for review May 22, 2024 07:56
Copy link
Member

@richardwerkman richardwerkman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your first contribution! I've got some suggestions

""")]
public void ShouldMutateUtf8StringLiteral(string original, string expected)
{
var syntaxTree = CSharpSyntaxTree.ParseText($"""var test = "{original}"u8;""");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you add the u8 token here? I'd put it in the inline data to keep it transparent what is being mutated.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, good idea. Will add to inline data.

{
public ReadOnlySpan<byte> HelloWorld()
{
return "Hello"u8 + " "u8 + "World!"u8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run the integration test locally and saw this line isn't mutated correctly. The mutation causes compile errors. I think this isn't related to the mutator but related to the mutation placing logic. I'd say we create a separate defect for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be fixed in this PR, not as a separate defect.

Copy link
Author

@iamdmitrij iamdmitrij Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've investigated why this test is failing. It fails during mutated code compilation. Stryker by default add ternary expression to each operand:

So this statement

return "Hello"u8 + " "u8 + "World!"u8;

is transformed into this:

return (StrykerqBEu3bP1rxHLxTW.MutantControl.IsActive(99)?""u8 :"Hello"u8 )+ (StrykerqBEu3bP1rxHLxTW.MutantControl.IsActive(100)?""u8 :" "u8 )+ (StrykerqBEu3bP1rxHLxTW.MutantControl.IsActive(101)?""u8:"World!"u8);

It boxes each UTF-8 value to ReadOnlySpan<byte>() type. The problem lies with C# and how it allows to concatenate UTF-8 string literals. It allows to concatenate UTF-8 string directly: "Hello"u8 + " "u8 + "World!"u8, but not when they are represented as ReadOnlySpan<byte>: new ReadOnlySpan<byte>() + new ReadOnlySpan<byte>().

Hence, mutated code fails to compile with:

CS9047	Operator '+' cannot be applied to operands of type 'ReadOnlySpan<byte>' and 'ReadOnlySpan<byte>' that are not UTF-8 byte representations

Do you think it's worth fixing this bug in current PR? It also touches core MutantControl injection logic. If so, any new ideas are welcome here. So far, I've only thought of converting UTF-8 strings into arrays and concatenating them manually using LINQ when ternary conditions are used:

public ReadOnlySpan<byte> HelloWorld()
{
    return (true ? ""u8 : "Hello"u8).ToArray()
        .Concat((true ? ""u8 : " "u8).ToArray()
            .Concat((true ? ""u8 : "World!"u8).ToArray()))
        .ToArray();
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. In that case, we could maybe include some extra code in the compilation that contains an operator overload for + on ReadOnlySpan<byte>. During the compilation we already pass some custom code to be compiled together with the source project, so that should be fairly easy to do.

Do you think it's worth fixing this bug in current PR?

Yes, as this seems like an easy fix we should add to this PR. Otherwise, we risk that this will never be fixed once this PR has been merged.

Copy link
Member

@dupdob dupdob Jul 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure adding this operator is needed.

  1. "Hello"u8 + " "u8 + "World!"u8; is actually a compile time constant so the compiler concatenate the strings and does not compile any expression. See the actual result:
    internal static readonly __StaticArrayInitTypeSize=13 5A09E8FA9C77807B24E99C9CF999DEBFAD8441E269EB960E201F61FC3DE20D5A/* Not supported: data(48 65 6C 6C 6F 20 57 6F 72 6C 64 21 00) */;
    An array of 13 bytes. This is seen as a single string here. As such, it does not really make sense to mutate sub parts.
  2. In any case, there is no interest in mutating this expression more than once. It is almost certain that if removing Hello is killed by a failed test, that very same test will kill removing the space or world!. So there is no benefit keeping each mutation. Note that this remarks is also valid for classical constant string concatenations.
  3. As such, it would make more sense to mutate the whole expression once, which would remove any problems with the missing operator.
  4. Adding this operator requires injecting a dedicated using statement at the start of any files needing it; meaning it would have to be added everywhere. Furthermore, it could result in compilation errors (ambiguous call) that would require a new rollback logic (to detect it and remove the unnecessary using statement). It does not appear easy to me.
  5. The only consequence of not adding this operand is simply that concatenated u8 strings will not be mutated; which looks alright to me.

TLDR;
I recommend: doing nothing for now and contemplate detecting concatenated constant strings (u8 or otherwise) to be mutated at once. But it requires specific logic within the ExpressionOrchestrator

test = ""u8;
}

public bool IsNullOrEmpty(ReadOnlySpan<byte> myString)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has nothing to do with utf8 strings and can be removed

@richardwerkman
Copy link
Member

I tested your PR on the following code and it broke:

        public string Test()
        {
            return "Hello " + " " + "World";
        }

So there is some issue with the mutator logic

@iamdmitrij
Copy link
Author

iamdmitrij commented May 28, 2024

I tested your PR on the following code and it broke:

        public string Test()
        {
            return "Hello " + " " + "World";
        }

So there is some issue with the mutator logic

Can you specify what exactly fails here?

I have noticed it doesn't compile when concatenation operator + is mutated to -.

var a1 = "Hello"u8 + " "u8 + "World"u8; // OK
var a2 = "Hello"u8 - " "u8 - "World"u8; // Doesn't compile

var a3 = "Hello" + " " + "World"; // OK
var a4 = "Hello" - " " - "World"; // Doesn't compile

If that's the case, should it be fixed somehow? Because I don't see how current or previous string mutator code has solution for that. My best guess would be to fix BinaryExpressionMutator implementation to avoid this.

@richardelekta
Copy link

I have noticed it doesn't compile when concatenation operator + is mutated to -.

This isn't the problem, we prevent this mutation from being placed.

var a1 = "Hello"u8 + " "u8 + "World"u8; // OK
var a2 = ""u8 + " "u8 + "World"u8; // Doesn't compile

var a3 = "Hello" + " " + "World"; // OK
var a4 = "" + " " + "World"; // OK

The above displays the issue. The interesting part is that the code should compile, but it doesn't because of how stryker places the mutation. I haven't investigate yet what exactly goes wrong, but my guess is that there is a flaw in our mutation placing logic

@iamdmitrij
Copy link
Author

I have noticed it doesn't compile when concatenation operator + is mutated to -.

This isn't the problem, we prevent this mutation from being placed.

var a1 = "Hello"u8 + " "u8 + "World"u8; // OK
var a2 = ""u8 + " "u8 + "World"u8; // Doesn't compile

var a3 = "Hello" + " " + "World"; // OK
var a4 = "" + " " + "World"; // OK

The above displays the issue. The interesting part is that the code should compile, but it doesn't because of how stryker places the mutation. I haven't investigate yet what exactly goes wrong, but my guess is that there is a flaw in our mutation placing logic

Thank you. Now it's clear, I will look into this use-case.

@richardwerkman
Copy link
Member

@iamdmitrij The integration tests have been updated and now also include my previous example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UTF-8 string literals are not mutated
5 participants