[BRLTTY] Re: A strange Unicode translation

Dave Mielke dave at mielke.cc
Tue Jul 3 22:28:13 EDT 2012


[quoted lines by Lee Maschmeyer on 2012/07/03 at 14:07 -0400]

>It turns out the problem is in the lynx dump of the xml file. Although that 
>file contains protégé, when lynx finishes dumping it it becomes:
>
>protégé
>
>which seems to be 0xc383c2a9 

Okay, two things are at work, here.

First: The file is encoded in UTF-8, but your lynx isn't configured to detect 
this, so it assumes that the file is encoded in Latin1 (ISO 8859-1).

Second: Your lynx is configured to use ASCII equivalents for non-ASCII 
characters. Usually this means displaying a letter without its accent.

So, it goes like this: The original character, é (lowercase e with acute), is 
E9 (U+00E9). Encoded in UTF-8, it appears in your file as the two bytes C3A9. 
Lynx, assuming the file is encoded in Latin1, sees these as the two separate 
characters C3 and A9. It's displaying C3, which is an uppercase A with a tilde, 
as a plain uppercase A, and it's displaying A9, which is the copyright symbol, 
as itself - which is why you're seeing A©.
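
For anyone who wants to reproduce this, here is a minimal Python sketch of the 
same chain of events. The strip_accents helper is my own stand-in for lynx's 
ASCII-equivalents behaviour (an assumption, not what lynx actually runs); it 
just drops combining accents after Unicode decomposition.

```python
import unicodedata

original = "\u00e9"                       # é, lowercase e with acute (E9)
utf8_bytes = original.encode("utf-8")     # the bytes stored in the file: C3 A9
misread = utf8_bytes.decode("latin-1")    # what a Latin1 reader sees: Ã and ©

def strip_accents(text):
    """Hypothetical approximation of 'use ASCII equivalents': drop accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

displayed = strip_accents(misread)        # Ã loses its tilde, © has no accent

print(utf8_bytes.hex())   # hex of the UTF-8 encoding
print(misread)            # the two Latin1 characters
print(displayed)          # what ends up on the screen
```

The copyright symbol survives because it has no accent to remove, which is 
consistent with it being shown as itself.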

Now for the next mystery - why you're then seeing C383C2A9: C383 is UTF-8 for 
C3, and C2A9 is UTF-8 for A9, whereas C3A9 is UTF-8 for the original character, 
é, which is E9. In other words, something you did treated the two bytes in your 
file, which together are the UTF-8 encoding of the single character é, as two 
separate characters, and then encoded each of those two separate characters in 
UTF-8.
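
The double encoding can be checked the same way. This is only a sketch of the 
byte arithmetic, assuming the intermediate step read the bytes as Latin1; the 
variable names are mine. It also shows that, as long as nothing was stripped 
along the way, reversing the steps recovers the original character.

```python
original = "\u00e9"                                  # é (E9)

once = original.encode("utf-8")                      # correct encoding: C3 A9
twice = once.decode("latin-1").encode("utf-8")       # each byte re-encoded

print(once.hex())    # single UTF-8 encoding
print(twice.hex())   # double UTF-8 encoding

# Undo the damage: decode UTF-8, map the characters back to bytes via Latin1,
# then decode those bytes as the UTF-8 they originally were.
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)
```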

-- 
Dave Mielke           | 2213 Fox Crescent | The Bible is the very Word of God.
Phone: 1-613-726-0014 | Ottawa, Ontario   | 2011 May 21 is the End of Salvation.
EMail: dave at mielke.cc | Canada  K2A 1H7   | http://Mielke.cc/now.html
http://FamilyRadio.com/                   | http://Mielke.cc/bible/

