[Dragaera] Fun with language and OCR
Alexx Kay
alexx at panix.com
Thu Aug 14 15:33:51 PDT 2008
My wife (who is blind) recently scanned and OCRed my old paperback of
_Brokedown Palace_ for our mutual enjoyment. I knew going in that I
would have to do a massive proofreading pass, due to the high proportion
of accents, interesting proper names, Hungarian words, and italics. (I
don't have accents in this program, so I will be approximating them here.)
The winner for most corrections needed is La'szlo'. He is arguably the
second-most important character, and has two separate accents in his
name.
A sampling: Liszlo, Laszko, L4szl6, LaszkS, Ldszld, L&szlo, LaszUS
A close runner-up is Miklo's. Only one accent, but the main protagonist.
Samples: Mik16s, Mikl<s, MikKSs, Miklds, Mikkis, MikJ6s, MiKLbs
Third place surprised me. "Brust" appears on every other page, and has a
remarkable number of variations. The font used for his name in this
edition has a fancy cursive "s", which is very prone to errors.
Some of them include: Bruat, Brudt, Brwit, Brivt, Brtuit, Brujt, Bnut,
Brtwt, Briuft, Brtvt, Bnu>t...
Optical Character Recognition has come a long way in the last few years,
but it's still not up to some challenges :-)
Alexx
Opinions expressed are my own and not necessarily those of my employers.
alexx at panixSPAMBL@CK.com http://www.panix.com/~alexx
"Have you read about the 5000 year old caveman they found frozen
under a Swiss glacier? The *men* are all wondering how he came to
be lost up there. "We women" all know that if he'd brought his
wife along *she* would have asked for directions." -- Nurse Jones