[Dragaera] Fun with language and OCR

Alexx Kay alexx at panix.com
Thu Aug 14 15:33:51 PDT 2008


My wife (who is blind) recently scanned and OCRed my old paperback of 
_Brokedown Palace_ for our mutual enjoyment.  I knew going in that I 
would have to do a massive proofreading pass, due to the high proportion
of accents, interesting proper names, Hungarian words, and italics.  (I 
don't have accents in this program, so I will be approximating them here.)

The winner for most corrections needed is La'szlo'.  Arguably the
second-most important character, and with two separate accents in his name.
A sampling: Liszlo, Laszko, L4szl6, LaszkS, Ldszld, L&szlo, LaszUS

A close runner-up is Miklo's.  Only one accent, but the main protagonist.
Samples: Mik16s, Mikl<s, MikKSs, Miklds, Mikkis, MikJ6s, MiKLbs

Third place surprised me.  "Brust" appears on every other page, and has a
remarkable number of variations.  The font used for his name in this
edition has a fancy cursive "s", which is very prone to errors.
Some of them include: Bruat, Brudt, Brwit, Brivt, Brtuit, Brujt, Bnut,
Brtwt, Briuft, Brtvt, Bnu>t...

Optical Character Recognition has come a long way in the last few years,
but it's still not up to some challenges :-)

Alexx


Opinions expressed are my own and not necessarily those of my employers.
alexx at panixSPAMBL@CK.com                http://www.panix.com/~alexx
"Have you read about the 5000 year old caveman they found frozen 
 under a Swiss glacier?  The *men* are all wondering how he came to 
 be lost up there.  "We women" all know that if he'd brought his 
 wife along *she* would have asked for directions."   -- Nurse Jones


