Automatic Generation of Training Character Samples for OCR Systems

  • Le, Ha (School of Electronics and Computer Engineering, Chonnam National University)
  • Kim, Soo-Hyung (School of Electronics and Computer Engineering, Chonnam National University)
  • Na, In-Seop (School of Electronics and Computer Engineering, Chonnam National University)
  • Do, Yen (School of Electronics and Computer Engineering, Chonnam National University)
  • Park, Sang-Cheol (Samsung Medison)
  • Jeong, Sun-Hwa (Electronics and Telecommunications Research Institute)

  • Received: 2012.07.25
  • Accepted: 2012.09.10
  • Published: 2012.09.28

Abstract

In this paper, we propose a novel method that automatically generates real character images so that existing OCR systems can be adapted to new fonts. First, we generate synthetic character images using a simple degradation model. The synthetic data is used to train an OCR engine, and the trained engine then recognizes and labels real character images segmented from ideal document images. Because the OCR engine cannot recognize every real character image accurately, a substring matching method corrects wrongly labeled characters by comparing two strings: the string formed by the recognized characters in an ideal document image, and the ordered string of the characters we intend to train and recognize. Based on this method, we build a system that automatically generates samples of the 2,350 most common Korean characters and 117 alphanumeric characters from new fonts. The ideal document images used in the system are postal envelope images with characters printed in ascending order of their codes. The proposed system achieved a labeling accuracy of 99%. We therefore believe that the system is effective in facilitating the generation of numerous character samples, which can improve the recognition rate of existing OCR systems on fonts they have not been trained on.
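
The label-correction step can be illustrated with a minimal sketch. The abstract does not specify the matching algorithm, so the code below is an assumption: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in aligner, and the function name `correct_labels` is hypothetical. It aligns the string of OCR outputs with the known ordered string printed on the page and takes the label of each segmented character from the aligned position.

```python
from difflib import SequenceMatcher


def correct_labels(recognized, expected):
    """Assign a label to each recognized character by aligning with `expected`.

    recognized: string of OCR outputs, one character per segmented image.
    expected:   the ordered string known to be printed on the page.
    Returns a list with one entry per recognized character; an entry is
    None where the alignment gives no unambiguous label.
    """
    labels = [None] * len(recognized)
    # autojunk=False disables the popularity heuristic for long sequences.
    matcher = SequenceMatcher(None, recognized, expected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # Matching run, or a misread run of equal length:
            # trust the expected string for the label.
            for k in range(i2 - i1):
                labels[i1 + k] = expected[j1 + k]
        # "delete": extra segments (e.g. noise) with no expected character.
        # "insert": expected characters that were never segmented.
        # Unequal-length "replace": ambiguous, so left unlabeled (None).
    return labels


if __name__ == "__main__":
    expected = "가각간갇갈"    # characters printed in ascending code order
    recognized = "가각긴갇갈"  # hypothetical OCR output with one misread
    print(correct_labels(recognized, expected))
```

In this sketch a misread character between two correctly recognized neighbors receives the expected label, while a spurious segment is left as `None` so it can be discarded rather than stored under a wrong label. A real system would likely need additional handling for merged or split characters, which this sketch leaves unlabeled.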
