PDFString supports only one-byte characters #1649

NikitaKemarskiy · 2024-07-04T11:52:11Z

What were you trying to do?

I was trying to add a comment to a PDF with Cyrillic letters.

How did you attempt to do it?

const commentAnnotRef = this.pdfDocument.context.register(
  this.pdfDocument.context.obj({
    Type: 'Annot',
    Subtype: 'Text',
    Open: true,
    Name: 'Comment', // Determines the icon to place in the document
    T: PDFString.of('abc абві äüöß'), // Comment title
    Contents: PDFString.of('abc абві äüöß'), // Comment main text
    // The position of the annotation
    Rect: [
      xCoordinate,
      pageHeight - yCoordinate,
      xCoordinate,
      pageHeight - yCoordinate,
    ],
  })
)

What actually happened?

It turned out that one byte per character is used under the hood (see the result in the screenshot).
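
A minimal sketch of why one byte per character breaks non-Latin text, assuming the string is serialized by writing each UTF-16 code unit straight into a Uint8Array (I believe this is roughly what happens internally, but I have not verified it for this version). `writeOneBytePerChar` is a hypothetical helper for illustration, not part of pdf-lib:

```typescript
// Hypothetical helper (not pdf-lib API): writes one byte per UTF-16 code unit.
function writeOneBytePerChar(text: string): Uint8Array {
  const buffer = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    // A Uint8Array element keeps only the low 8 bits of the assigned value.
    buffer[i] = text.charCodeAt(i);
  }
  return buffer;
}

const asciiBytes = writeOneBytePerChar('a'); // U+0061 fits in one byte
const cyrillicBytes = writeOneBytePerChar('б'); // U+0431 does not
console.log(asciiBytes[0].toString(16)); // 61
console.log(cyrillicBytes[0].toString(16)); // 31, the high byte 0x04 is lost
```

Under that assumption, 'б' ends up as the single byte 0x31, which a viewer shows as the digit '1'.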

What did you expect to happen?

I expected UTF-8 characters to work correctly.

How can we reproduce the issue?

Try to add a comment to a PDF file using the code I've provided.

Version

1.17.1

What environment are you running pdf-lib in?

Node

Checklist

My report includes a Short, Self Contained, Correct (Compilable) Example.
I have attached all PDFs, images, and other files needed to run my SSCCE.

Additional Notes

No response

NikitaKemarskiy · 2024-07-04T13:16:34Z

I tried to come up with a custom PDFUnicodeString class, but it didn't work out:

export class PDFUnicodeString extends PDFObject {
  // The PDF spec allows newlines and parens to appear directly within a literal
  // string. These character _may_ be escaped. But they do not _have_ to be. So
  // for simplicity, we will not bother escaping them.
  static of = (value: string) => new PDFUnicodeString(value);

  private readonly value: string;

  private constructor(value: string) {
    super();
    this.value = value;
  }

  asBytes(): Uint8Array {
    return new TextEncoder().encode(this.value)
  }

  asString(): string {
    return this.value;
  }

  clone(): PDFUnicodeString {
    return PDFUnicodeString.of(this.value);
  }

  toString(): string {
    return `(${this.value})`;
  }

  sizeInBytes(): number {
    return new TextEncoder().encode(this.value).length + 2;
  }

  copyBytesInto(buffer: Uint8Array, offset: number): number {
    buffer[offset++] = 40;
    const encodedValue = new TextEncoder().encode(this.value);
    buffer.set(encodedValue, offset);
    offset += encodedValue.length;
    buffer[offset++] = 41;
    
    return encodedValue.length + 2;
  }
}
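
A likely reason this attempt fails, sketched under the assumption that the viewer follows the PDF text string rules: a string is decoded as UTF-16BE only if it starts with the byte order mark FE FF (PDF 2.0 also accepts UTF-8 behind the mark EF BB BF), and otherwise falls back to PDFDocEncoding. TextEncoder emits UTF-8 without any mark, so the bytes get read one character per byte. `detectTextStringEncoding` is a hypothetical helper, not part of pdf-lib:

```typescript
type TextStringEncoding = 'UTF-16BE' | 'UTF-8' | 'PDFDocEncoding';

// Hypothetical helper: picks the decoding a reader would apply to a text string.
function detectTextStringEncoding(bytes: Uint8Array): TextStringEncoding {
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return 'UTF-16BE';
  }
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return 'UTF-8'; // only defined for PDF 2.0 readers
  }
  return 'PDFDocEncoding';
}

// 'б' as plain UTF-8, which is what TextEncoder produces (no byte order mark).
const rawUtf8 = new Uint8Array([0xd0, 0xb1]);
// 'б' as UTF-16BE with the leading byte order mark.
const utf16WithBom = new Uint8Array([0xfe, 0xff, 0x04, 0x31]);

console.log(detectTextStringEncoding(rawUtf8)); // PDFDocEncoding
console.log(detectTextStringEncoding(utf16WithBom)); // UTF-16BE
```

Separately, a literal string written between parentheses would still need unbalanced parentheses and backslashes escaped, which the class above does not do.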

NikitaKemarskiy · 2024-07-10T07:24:08Z

UPD: The PDFHexString class solves the problem: PDFHexString.fromText(YOUR_TEXT)
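
For anyone wondering why the hex string works: as far as I understand, the text is encoded as UTF-16BE with a leading byte order mark and written as hex digits between angle brackets, so no byte is truncated and nothing needs escaping. A standalone sketch of that encoding follows; `toPdfHexString` is a hypothetical helper written for illustration, not the pdf-lib implementation:

```typescript
// Hypothetical helper (not pdf-lib API): encodes text as a PDF hex string
// holding UTF-16BE code units preceded by the byte order mark FEFF.
function toPdfHexString(text: string): string {
  let hex = 'FEFF';
  for (let i = 0; i < text.length; i++) {
    // charCodeAt returns UTF-16 code units, so surrogate pairs pass through as-is.
    const unit = text.charCodeAt(i).toString(16).toUpperCase();
    hex += ('0000' + unit).slice(-4); // left-pad to four hex digits
  }
  return '<' + hex + '>';
}

console.log(toPdfHexString('aб')); // <FEFF00610431>
```

In the snippet from the original report, this means replacing PDFString.of(...) with PDFHexString.fromText(...) for both T and Contents.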

NikitaKemarskiy added the bug and needs-triage labels on Jul 4, 2024
