Unicode in titles/usernames #35

jdpage · 2015-12-23T22:40:07Z

Right now, page titles are required to match /^[a-z][a-z0-9]*(?:\/[a-z][a-z0-9]*)*$/i, i.e. they must be one or more parts separated by slashes, where each part consists of an ASCII letter followed by zero or more ASCII letters or digits. Usernames are further restricted and must match /^[a-z][a-z0-9]*$/i, equivalent to one "part" above. (See also issue #33, which will introduce spaces in titles and forever ban underscores.)
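
For illustration, here is a minimal Python sketch of the current rules (the project itself is Perl; the function names `is_valid_title` and `is_valid_username` are hypothetical). It assumes `re.ASCII` is wanted so that case-insensitive matching stays within the 52 ASCII letters.

```python
import re

# Patterns taken from the issue text. re.IGNORECASE mirrors the /i flag;
# re.ASCII keeps [a-z] from matching non-ASCII case variants such as U+212A.
FLAGS = re.IGNORECASE | re.ASCII
TITLE_RE = re.compile(r"[a-z][a-z0-9]*(?:/[a-z][a-z0-9]*)*", FLAGS)
USERNAME_RE = re.compile(r"[a-z][a-z0-9]*", FLAGS)


def is_valid_title(title: str) -> bool:
    """True if the whole string is one or more slash-separated parts."""
    # fullmatch anchors both ends, like ^...$ in the original pattern.
    return TITLE_RE.fullmatch(title) is not None


def is_valid_username(name: str) -> bool:
    """True if the whole string is exactly one 'part'."""
    return USERNAME_RE.fullmatch(name) is not None
```

Under these rules `Foo/bar2` is a valid title, while `2foo`, `foo//bar`, and `café` are rejected.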

This policy is obviously horrifically Anglocentric. I went ahead and implemented it because I wasn't up on Perl 5 Unicode support, and it's easier to expand the character-set allowed for titles than it is to contract it. But the question remains: what should the title format be?

Research Perl 5 Unicode support (initial reading suggests 5.18 supports it well, throughout)
Research SQLite 3 Unicode support (seems to be passthrough)
Choose a title format
Implementation

jdpage · 2015-12-23T22:45:06Z

Wikipedia disallows # < > [ ] | { }, discourages . and /, and bans sequences of three or more consecutive ~ characters.
Unless we change the link syntax, the : | [ ] characters are likely to cause problems for us.
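
As a rough sketch of how such a conflict check could look (in Python rather than the project's Perl; `LINK_SYNTAX_CHARS` and `link_syntax_conflicts` are hypothetical names, and the character set is just the four characters listed above):

```python
# The four characters named above as likely to clash with the link syntax.
LINK_SYNTAX_CHARS = frozenset(":|[]")


def link_syntax_conflicts(title: str) -> list[str]:
    """Return the problematic characters present in `title`, sorted by code point."""
    return sorted(set(title) & LINK_SYNTAX_CHARS)
```

An empty list means the title contains none of the four characters.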

jdpage · 2016-01-10T09:18:10Z

My current thinking is that page titles follow this format:

<page-title> ::= <title-part> { "/" <title-part> }
<title-part> ::= <letter-character> { <title-character> }
<title-character> ::= <letter-character>
                    | <decimal-digit-character>
                    | <combining-character>
                    | <formatting-character>
<letter-character> ::= /\pL/  ; uppercase, lowercase, titlecase, modifier, other
<decimal-digit-character> ::= /\p{Nd}/
<combining-character> ::= /[\p{Mn}\p{Mc}]/  ; non-spacing, spacing-combining
<formatting-character> ::= /\p{Cf}/

i.e. they match the regular expression:
/^\pL[\pL\p{Nd}\p{Mn}\p{Mc}\p{Cf}]*(?:\/\pL[\pL\p{Nd}\p{Mn}\p{Mc}\p{Cf}]*)*$/
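
A minimal Python sketch of the same check, for experimenting with the proposal outside Perl. Python's `re` module has no `\p{...}` classes, so this version tests general categories through `unicodedata` instead. The function names are hypothetical, and it assumes the pattern is meant to match the whole title.

```python
import unicodedata

# \pL: uppercase, lowercase, titlecase, modifier, and other letters.
LETTER_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo"})
# Letters plus \p{Nd}, \p{Mn}, \p{Mc}, and \p{Cf}.
TITLE_CHAR_CATEGORIES = LETTER_CATEGORIES | {"Nd", "Mn", "Mc", "Cf"}


def is_valid_part(part: str) -> bool:
    """A part is one letter followed by zero or more title characters."""
    if not part:
        return False
    if unicodedata.category(part[0]) not in LETTER_CATEGORIES:
        return False
    return all(unicodedata.category(ch) in TITLE_CHAR_CATEGORIES for ch in part[1:])


def is_valid_title(title: str) -> bool:
    """A title is one or more valid parts separated by single slashes."""
    # An empty string, or a leading, trailing, or doubled slash, yields
    # an empty part, which is_valid_part rejects.
    return all(is_valid_part(part) for part in title.split("/"))
```

With this sketch, `日本語/Ünïcødé` is accepted, while `foo bar`, `foo_bar`, and any part starting with a digit are rejected. Results for recently added characters depend on the Unicode version bundled with the Python interpreter.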

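Furthermore, all page titles, slugs, what-have-you should be put into NFKC (per http://unicode.org/faq/normalization.html#2) before any processing occurs. This means that compatibility variants of the same character (e.g. fullwidth Latin letters) collapse to a single form and cannot be used to create look-alike titles. NFKC alone does not unify look-alike letters from different scripts (such as Latin "a" and Cyrillic "а"), so it is not a complete defence against homograph attacks. It also means that some titles which (as input by the user) would not be accepted by the above pattern will be transformed into titles that are, e.g. titles containing Roman numeral characters, which NFKC decomposes into ASCII letters. More practically, titles copy-pasted from sources where the actual ffi ligature character was used are folded to the plain three-letter sequence.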
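
To make the effect of NFKC concrete, here is a small Python sketch using the standard `unicodedata` module (`normalize_title` is a hypothetical helper name, not something from the codebase):

```python
import unicodedata


def normalize_title(raw: str) -> str:
    """Put a user-supplied title into NFKC before any validation or lookup."""
    return unicodedata.normalize("NFKC", raw)


# U+2163 ROMAN NUMERAL FOUR is category Nl (not a letter), but folds to "IV".
roman = normalize_title("\u2163")
# U+FB03 LATIN SMALL LIGATURE FFI folds to the three letters "ffi".
ligature = normalize_title("o\ufb03ce")
# U+FF21 FULLWIDTH LATIN CAPITAL LETTER A folds to ASCII "A".
fullwidth = normalize_title("\uff21")
# U+0430 CYRILLIC SMALL LETTER A is left unchanged; it is not folded to Latin "a".
cyrillic = normalize_title("\u0430")
```

Normalizing first and validating second is what lets the Roman numeral case through: the raw character is not in any letter category, but its NFKC form consists of ordinary ASCII letters.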

jdpage · 2016-01-10T09:40:58Z

Note that the above pattern still allows for a #33-style implementation of spaces.

jdpage added the question label Dec 23, 2015

jdpage added this to the 0.1 Aromaticity milestone Dec 23, 2015

jdpage added the bikeshedding label Dec 31, 2015

jdpage modified the milestones: 0.2 Benzene, 0.1 Aromaticity Dec 31, 2015

jdpage modified the milestones: 0.1 Aromaticity, 0.2 Benzene Jan 10, 2016
