[BRLTTY] [SPAM] Re: [SPAM] Re: A strange Unicode translation
Dave Mielke
dave at mielke.cc
Tue Jul 3 22:28:13 EDT 2012
[quoted lines by Lee Maschmeyer on 2012/07/03 at 14:07 -0400]
>It turns out the problem is in the lynx dump of the xml file. Although that
>file contains protégé, when lynx finishes dumping it it becomes:
>
>protégé
>
>which seems to be 0xc383c2a9
Okay, two things are at work, here.
First: The file is encoded in UTF-8, but your lynx isn't configured to
detect this. Your lynx believes that the file is encoded in Latin1 (ISO
8859-1).
Second: Your lynx is configured to use ASCII equivalents for non-ASCII
characters. Usually this means displaying a letter without its accent.
So, it goes like this: The original character, é (lowercase e with acute), is
E9. This character, encoded in UTF-8, appears in your file as C3A9. Lynx,
assuming the file is encoded in Latin1, sees this as the two separate
characters C3 and A9. It's displaying C3, which is an uppercase A with a tilde,
as a plain uppercase A, and it's displaying A9, which is the copyright symbol,
as itself - which is why you're seeing A©.
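The sequence above can be reproduced with a short Python sketch. The accent stripping here is only a stand-in for whatever lynx does internally when it substitutes ASCII equivalents; the helper name strip_accents is hypothetical, and it is not meant to match lynx's actual tables.

```python
import unicodedata

original = "\u00e9"                      # e with acute, code point E9
utf8_bytes = original.encode("utf-8")    # the two bytes C3 A9, as stored in the file
misread = utf8_bytes.decode("latin-1")   # read as Latin1: two characters, C3 and A9

def strip_accents(text):
    # Assumption: approximates "ASCII equivalents" by decomposing each
    # character and dropping the combining marks (the accents).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

shown = strip_accents(misread)           # A tilde loses its tilde, A9 has no accent to drop

print(utf8_bytes.hex())                       # c3a9
print([f"{ord(ch):02X}" for ch in misread])   # ['C3', 'A9']
print([f"{ord(ch):02X}" for ch in shown])     # ['41', 'A9']
```

The last line shows 41 (a plain uppercase A) followed by A9 (the copyright symbol), which matches what appears on the screen.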
Now for the next mystery - why you're then seeing C383C2A9: C383 is UTF-8 for
C3, and C2A9 is UTF-8 for A9. C3A9 is UTF-8 for the original character, é,
which is E9. In other words, something you did assumed that the two characters
in your file, which represent the UTF-8 encoding for the single character é,
were two separate characters, and then encoded each of those two separate
characters in UTF-8.
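As a rough illustration, this double encoding can be reproduced in Python. The last step shows one possible way to undo it, assuming the data went through exactly one extra Latin1-to-UTF-8 round and nothing else was altered; that repair is a suggestion for experimenting, not something your tools do for you.

```python
original = "\u00e9"                        # e with acute, code point E9

once = original.encode("utf-8")            # C3 A9: the correct UTF-8 encoding
# Treat those two bytes as two Latin1 characters, then encode each in UTF-8.
twice = once.decode("latin-1").encode("utf-8")

print(once.hex())                          # c3a9
print(twice.hex())                         # c383c2a9

# Assumption: exactly one extra round of encoding happened, so reversing
# it once recovers the original character.
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")
print(f"{ord(repaired):02X}")              # E9
```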
--
Dave Mielke | 2213 Fox Crescent | The Bible is the very Word of God.
Phone: 1-613-726-0014 | Ottawa, Ontario | 2011 May 21 is the End of Salvation.
EMail: dave at mielke.cc | Canada K2A 1H7 | http://Mielke.cc/now.html
http://FamilyRadio.com/ | http://Mielke.cc/bible/