[BRLTTY] Contracted English braille error and patch: O'clock

Dave Mielke dave at mielke.cc
Fri Apr 11 22:52:21 EDT 2014


[quoted lines by Lee Maschmeyer on 2014/04/11 at 19:25 -0400]

>In my naiveté I thought a font had to do with how a character was
>displayed on the screen, not with what the character itself is; and I
>thought brltty snagged the character before it went to the font. I
>sure _hope_ that's the way it is. :-))

That's not exactly how it works with Unicode. Maybe, though, my use of the term 
"font" isn't technically correct. I'm not sure what the Unicode term for it is.

All those different types of letters, for example, as listed in my previous 
message, are defined within distinct Unicode codepoint ranges. Each of them 
defines a particular style to be used for the letters. Unicode knows that 
they're compatible sets of characters, and there's a way to do what Unicode 
calls string normalization. That, in theory, is the solution to this problem, 
except for one (to us) very important limitation.
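
As a rough illustration of what that normalization does (a Python sketch, not
BRLTTY code), the following assumes the mathematical bold letters are one of
the styled ranges in question and uses the compatibility form NFKC. The helper
name fold_styled_letters is made up for the example.

```python
import unicodedata

def fold_styled_letters(text: str) -> str:
    """Fold styled compatibility letters to their plain equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)

# "clock" written with MATHEMATICAL BOLD SMALL letters (U+1D41A is bold a).
bold_clock = "\U0001D41C\U0001D425\U0001D428\U0001D41C\U0001D424"

print(fold_styled_letters(bold_clock))  # clock
```

Plain ASCII text passes through unchanged, so the call is harmless on text
that contains no styled letters.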

Unicode can define a character in one of two ways: composed and decomposed. 
This is particularly significant, for example, for languages which use letters 
with accents. You used the word "naiveté", for example, above. The last letter 
in that word is a lowercase e with an acute accent. Its composed form is the 
single codepoint U+00E9, and its decomposed form is the two-codepoint sequence 
U+0065 U+0301 (a plain lowercase e, followed by a "combining" acute accent).
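
The two forms can be checked with a short Python sketch (illustrative only).
It uses U+0301 COMBINING ACUTE ACCENT as the combining mark and the standard
NFC (composed) and NFD (decomposed) normalization forms.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"   # plain e followed by COMBINING ACUTE ACCENT

# The two strings look identical when rendered but are different sequences.
print(composed == decomposed)        # False
print(len(composed), len(decomposed))  # 1 2

# NFD splits the composed form; NFC joins the decomposed form.
to_decomposed = unicodedata.normalize("NFD", composed)
to_composed = unicodedata.normalize("NFC", decomposed)
print(to_decomposed == decomposed, to_composed == composed)  # True True
```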

The text on the screen, or the text in a file, can use either scheme for any 
character. In other words, if the text contains more than one composite 
character, some may be composed while others are decomposed. What we'd need to 
do is to normalize the text before contracting it by forcing all the characters 
to be composed.

That's easy enough to do with standard functions except that those functions 
don't return offset information. In other words, they don't make it easy for us 
to map the start of a decomposed character in the source text to its 
corresponding composed character in the normalized text. Unless I find a way, 
what I may end up having to do is figure out how to do our own normalization so 
that we can keep track of the offset information.
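
To make the missing-offset problem concrete, here is a small Python sketch
(Python's unicodedata.normalize stands in for whatever standard function would
actually be used). The call returns only the new string, so the position of
anything after a decomposed character shifts with no record of by how much.

```python
import unicodedata

# Both accented letters are stored in decomposed form.
source = "re\u0301sume\u0301 ok"
normalized = unicodedata.normalize("NFC", source)

# Two combining marks were absorbed, so the string is two codepoints shorter.
print(len(source), len(normalized))                # 11 9

# The same word now starts at a different offset in each string.
print(source.index("ok"), normalized.index("ok"))  # 9 7
```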

Something I think I'll experiment with is skipping over combining characters in 
both strings after the normalization is done since, in theory, both strings 
should contain exactly the same number of base characters. An added efficiency, 
which would be the common case by far, might be to compare the two strings, 
and, if they're the same, skip the (expensive) offset mapping bit.
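
A minimal Python sketch of that experiment follows. It is only a sketch under
the assumption stated above, namely that both strings have the same number of
base (non-combining) characters; it raises an error rather than guessing when
that does not hold. The names base_offsets and map_base_offsets are made up
for the example.

```python
import unicodedata

def base_offsets(text: str) -> list[int]:
    """Offsets of the characters that are not combining marks."""
    return [i for i, ch in enumerate(text) if unicodedata.combining(ch) == 0]

def map_base_offsets(source: str):
    """Return (normalized, pairs), pairing source and normalized offsets.

    pairs is None when normalization changed nothing (offsets are identical),
    which is the cheap common case.
    """
    normalized = unicodedata.normalize("NFC", source)
    if normalized == source:
        return normalized, None  # fast path: skip the offset mapping

    src = base_offsets(source)
    dst = base_offsets(normalized)
    if len(src) != len(dst):
        # The equal-count assumption failed for this text.
        raise ValueError("base character counts differ after normalization")
    return normalized, list(zip(src, dst))

normalized, pairs = map_base_offsets("re\u0301sume\u0301 ok")
print(pairs)
# [(0, 0), (1, 1), (3, 2), (4, 3), (5, 4), (6, 5), (8, 6), (9, 7), (10, 8)]
```

Each pair is (offset in source, offset in normalized), so the list can be read
in either direction.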

If this approach works then all we should need to do is define our tables 
using only composed and compatibility characters, which, I suspect, is already 
the case. Contraction table compilation could check for this anyway, though, 
just to be sure.
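
Such a check could look something like the Python sketch below. It assumes NFC
is the form the tables should be in and that table entries are available as
plain strings; the function name non_composed_entries is hypothetical, and
this is not how the contraction table compiler is actually written.

```python
import unicodedata

def non_composed_entries(entries):
    """Return the entries that change under NFC, i.e. are not fully composed."""
    return [e for e in entries if unicodedata.normalize("NFC", e) != e]

# One composed entry, one decomposed entry, one plain ASCII entry.
entries = ["caf\u00e9", "cafe\u0301", "o'clock"]
flagged = non_composed_entries(entries)
print(len(flagged))  # 1
```

A table compiler could report the flagged entries as warnings instead of
rejecting the table outright.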

-- 
Dave Mielke           | 2213 Fox Crescent | The Bible is the very Word of God.
Phone: 1-613-726-0014 | Ottawa, Ontario   | http://Mielke.cc/bible/
EMail: dave at mielke.cc | Canada  K2A 1H7   | http://FamilyRadio.com/

