é is not é: The same glyph can have different Unicode representations.

posted 2012-Jan-25

Did you know that é is not the same as é? No, seriously.

The first is a single Unicode character, “Latin small letter e with acute” (U+00E9), encoded as 0xC3 0xA9 in UTF-8.

The second is two characters: a “Latin small letter e” (U+0065, 0x65 in both ASCII and UTF-8) followed by a “Combining Acute Accent” (U+0301, 0xCC 0x81 in UTF-8). The combining accent is zero-width and is drawn over the letter that precedes it.
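
To see the difference concretely, here is a minimal Python sketch (Python is my choice for illustration and `describe` is a made-up helper name). It builds both strings from escapes, so nothing depends on how your editor saved the file, and prints their code points and UTF-8 bytes.

```python
import unicodedata

precomposed = "\u00e9"   # one character: e with acute
decomposed = "e\u0301"   # two characters: e, then combining acute accent

def describe(s):
    """Return (code point, Unicode name) pairs for each character in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s]

print(describe(precomposed))
print(describe(decomposed))
print(precomposed.encode("utf-8"))   # b'\xc3\xa9'
print(decomposed.encode("utf-8"))    # b'e\xcc\x81'
print(precomposed == decomposed)     # False
```

Both strings usually render identically, but a plain string comparison treats them as different.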

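Why does this matter? Well, OS X stores file names in the decomposed form, while text typed on Windows typically uses the precomposed character. So if you use OS X to name and upload a file to your web server and then later try to navigate to the file by typing its address in Windows, the request will fail: the bytes you typed do not match the bytes in the file name.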

Making matters worse, if you then browse the directory listing on the web server and click the link, you get a file name that looks exactly like what you typed, but that works (unlike what you typed). [Edit: I’ve actually put files with both names in the directory.]
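
The mismatch is easy to see once the two names are percent-encoded the way they would appear in a URL. This is only a sketch: the file name `résumé.html` is a hypothetical example, not the actual file on this server.

```python
from urllib.parse import quote

# Hypothetical file name, written with escapes so the form is unambiguous.
typed = "r\u00e9sum\u00e9.html"       # precomposed, as typically typed on Windows
stored = "re\u0301sume\u0301.html"    # decomposed, as OS X would name the file

print(quote(typed))    # r%C3%A9sum%C3%A9.html
print(quote(stored))   # re%CC%81sume%CC%81.html
```

The two URLs look the same in an address bar once decoded for display, but the server is asked for two different byte sequences.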

Unicode is hrrd.

joemppe
10:32AM ET
2012-Mar-01

Unicode does provide a solution to this, but it looks like Windows doesn’t implement Unicode canonical equivalence correctly.

Might also be nice if OS X stored file names in Unicode normalized form.
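
For reference, the canonical equivalence mentioned here can be checked by normalizing both strings before comparing them. The following is a minimal Python sketch using the standard `unicodedata` module; `same_text` is a made-up helper name, and using NFC rather than NFD for the comparison is an arbitrary choice for the example.

```python
import unicodedata

precomposed = "\u00e9"   # e with acute as one character
decomposed = "e\u0301"   # e followed by combining acute accent

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)            # False
print(same_text(precomposed, decomposed))   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

A server or application that normalized incoming names this way before looking them up would treat the two spellings as the same file.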

Phrogz.net