Unicode in titles/usernames #35

Open
3 of 4 tasks
jdpage opened this issue Dec 23, 2015 · 3 comments
@jdpage
Owner

jdpage commented Dec 23, 2015

Right now, page titles are required to match /^[a-z][a-z0-9]*(?:\/[a-z][a-z0-9]*)*$/i, i.e. they must be one or more parts separated by slashes, where each part consists of an ASCII letter followed by zero or more ASCII letters or digits. Usernames are further restricted and must match /^[a-z][a-z0-9]*$/i, which is equivalent to a single "part" above. (See also issue #33, which will introduce spaces in titles and forever ban underscores.)
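For illustration, the current rules can be sketched in Python as below. This is not the project's Perl code; the names `is_valid_title` and `is_valid_username` are hypothetical. `re.ASCII` is added because Python's case-insensitive `[a-z]` would otherwise also match a few non-ASCII letters, and `fullmatch` stands in for the `^...$` anchors (unlike Perl's `$`, it does not tolerate a trailing newline).

```python
import re

# Patterns taken from the issue text; IGNORECASE mirrors the /i flag.
# ASCII keeps [a-z] from matching non-ASCII case variants under IGNORECASE.
TITLE_RE = re.compile(r"[a-z][a-z0-9]*(?:/[a-z][a-z0-9]*)*", re.IGNORECASE | re.ASCII)
USERNAME_RE = re.compile(r"[a-z][a-z0-9]*", re.IGNORECASE | re.ASCII)


def is_valid_title(title: str) -> bool:
    """True if the whole string is one or more slash-separated parts."""
    return TITLE_RE.fullmatch(title) is not None


def is_valid_username(name: str) -> bool:
    """True if the whole string is exactly one 'part' (no slashes)."""
    return USERNAME_RE.fullmatch(name) is not None
```

Under these rules `Main/Page2` is a valid title, while `2fast`, `Main//Page`, and `café` are all rejected.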

This policy is obviously horrifically Anglocentric. I went ahead and implemented it because I wasn't up on Perl 5 Unicode support, and it's easier to expand the character-set allowed for titles than it is to contract it. But the question remains: what should the title format be?

  • Research Perl 5 Unicode support (initial reading suggests 5.18 supports it well, throughout)
  • Research SQLite3 Unicode support (seems to be passthrough)
  • Choose a title format
  • Implementation

References:

@jdpage jdpage added this to the 0.1 Aromaticity milestone Dec 23, 2015
@jdpage
Owner Author

jdpage commented Dec 23, 2015

  • Wikipedia disallows # < > [ ] | { }, discourages . and /, and bans sequences of three or more consecutive ~ characters.
  • Unless we change the link syntax, the : | [ ] characters are likely to cause problems for us.
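As a rough sketch of how such a character blacklist could be checked (in Python rather than the project's Perl, with hypothetical names throughout), the two character sets from the list above can be combined and intersected with a candidate title:

```python
import re

# Character sets copied from the two bullets above.
WIKIPEDIA_DISALLOWED = set("#<>[]|{}")
LINK_SYNTAX_CHARS = set(":|[]")

# Three or more consecutive tildes, which Wikipedia also bans.
TILDE_RUN_RE = re.compile(r"~{3,}")


def problem_characters(title: str) -> list:
    """Return the distinct problematic characters in title, sorted by code point."""
    return sorted(set(title) & (WIKIPEDIA_DISALLOWED | LINK_SYNTAX_CHARS))


def has_tilde_run(title: str) -> bool:
    """True if the title contains a run of three or more '~' characters."""
    return TILDE_RUN_RE.search(title) is not None
```

A whitelist-style title format makes a check like this redundant, since none of these characters are letters or digits.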

@jdpage
Owner Author

jdpage commented Jan 10, 2016

My current thinking is that page titles follow this format:

<page-title> ::= <title-part> { "/" <page-title> }
<title-part> ::= <letter-character> { <title-character> }
<title-character> ::= <letter-character>
                    | <decimal-digit-character>
                    | <combining-character>
                    | <formatting-character>
<letter-character> ::= /\pL/  ; uppercase, lowercase, titlecase, modifier, other
<decimal-digit-character> ::= /\p{Nd}/
<combining-character> ::= /[\p{Mn}\p{Mc}]/  ; non-spacing, spacing-combining
<formatting-character> ::= /\p{Cf}/

i.e. they match the regular expression:
/^\pL[\pL\p{Nd}\p{Mn}\p{Mc}\p{Cf}]*(?:\/\pL[\pL\p{Nd}\p{Mn}\p{Mc}\p{Cf}]*)*$/
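A rough Python equivalent of this format, applied to the whole title, might look like the sketch below. Python's built-in `re` module has no `\p{...}` classes, so this checks Unicode general categories through `unicodedata` instead; the function names are hypothetical, and results for recently added characters depend on the Unicode version bundled with the interpreter (which may differ from Perl's).

```python
import unicodedata

# General categories corresponding to \pL (letters).
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}
# Letters plus \p{Nd}, \p{Mn}, \p{Mc}, \p{Cf}.
TITLE_CHAR_CATEGORIES = LETTER_CATEGORIES | {"Nd", "Mn", "Mc", "Cf"}


def is_valid_part(part: str) -> bool:
    """A part is a letter followed by zero or more title characters."""
    if not part:
        return False
    if unicodedata.category(part[0]) not in LETTER_CATEGORIES:
        return False
    return all(unicodedata.category(ch) in TITLE_CHAR_CATEGORIES for ch in part[1:])


def is_valid_unicode_title(title: str) -> bool:
    """A title is one or more valid parts separated by single slashes."""
    return all(is_valid_part(part) for part in title.split("/"))
```

For example, `Straße/Übersicht` and `日本語` pass, while `1page`, `a_b`, and a part that begins with a combining mark do not.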

Furthermore, all page titles, slugs, and the like should be normalized to NFKC (per http://unicode.org/faq/normalization.html#2) before any other processing occurs. This folds compatibility variants (fullwidth letters, ligatures, and so on) into their ordinary equivalents, which removes one class of homograph attack; it does not help with cross-script lookalikes such as Cyrillic а versus Latin a. It also means that some titles which, as input by the user, would not be accepted by the above pattern will be transformed into titles that are, e.g. titles containing Roman numeral characters such as Ⅳ, which NFKC decomposes into ASCII letters. More practically, titles copy-pasted from sources that use the actual ﬃ ligature character will be stored with plain "ffi".
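The effect of NFKC on the cases mentioned here can be seen with Python's standard `unicodedata` module. This is only an illustration (the helper name `normalize_title` is hypothetical); in Perl the corresponding facility would be the core `Unicode::Normalize` module.

```python
import unicodedata


def normalize_title(raw: str) -> str:
    """Return the NFKC form of a user-supplied title."""
    return unicodedata.normalize("NFKC", raw)


# U+2163 ROMAN NUMERAL FOUR is a letter-number (category Nl), not a letter,
# so it would fail a \pL check until NFKC turns it into the letters "IV".
roman_four = "\u2163"
ffi_ligature = "\ufb03"      # LATIN SMALL LIGATURE FFI
fullwidth_a = "\uff21"       # FULLWIDTH LATIN CAPITAL LETTER A
cyrillic_a = "\u0430"        # CYRILLIC SMALL LETTER A, untouched by NFKC

normalized = {
    "roman_four": normalize_title(roman_four),
    "ffi_ligature": normalize_title(ffi_ligature),
    "fullwidth_a": normalize_title(fullwidth_a),
    "cyrillic_a": normalize_title(cyrillic_a),
}
```

Normalizing once at the input boundary, before validation and lookup, means every later comparison sees the same form of the title.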

@jdpage jdpage modified the milestones: 0.1 Aromaticity, 0.2 Benzene Jan 10, 2016
@jdpage
Owner Author

jdpage commented Jan 10, 2016

Note that the above pattern still allows for a #33-style implementation of spaces.
