Sunday, October 05, 2008

Google and TGN

Google stays mum on plans for public documents, points to OCR hurdle | The Industry Standard:
"However, Sullivan described a major drawback related to census forms and many other public records: Handwritten documents are notoriously difficult to read using optical character recognition (OCR) software.

Sullivan described old census forms from the 18th, 19th, and early 20th centuries as 'a hugely diverse collection of handwritten records,' in the sense that the handwriting styles and the quality of the source documents varied greatly. Remember, census forms from 1930 and earlier were filled out by a multitude of individuals going door to door across America, and their handwriting styles varied greatly. OCR tools, which are used to convert books and other printed documents into online text that can be searched and indexed, are 'not even close' to being able to read handwritten records, Sullivan says.

So how did The Generations Network import the data from millions of old census forms into its online database? Sullivan says the company spent about $75 million over 10 years to build its 'content assets' including the census data, and much of that cost went into partnering with Chinese firms whose employees read the data and entered it into [the company]'s database. The Chinese staff are specially trained to read the cursive and other handwriting styles from digitized paper records and microfilm. The task is ongoing with other handwritten records, at a cost of approximately $10 million per year, he adds.

When asked about Google, Sullivan said, 'We view that their mission and ours is quite complementary.' However, he declined to discuss the nature of communications between the two companies."

And Google Search is useful for finding the static pages of obscure databases, and much more.

