A Method for Automatic Detection of Character Encoding of Multi Language Document File
Seo, Min Ji; Kim, Myung Ho
 Abstract
Character encoding is the process of converting a document into a binary file, using a code table, so that it can be stored on a computer. To decode such a file back into readable text, one must know which code table was applied when the file was encoded. Identifying that code table is therefore an essential part of decoding. In this paper, we propose a method for automatically detecting the character encoding of a given binary document file. The method combines several techniques to increase the detection rate: character code range detection, escape character detection, character code characteristic detection, and commonly used word detection. Commonly used word detection draws on multiple word databases, which allows the method to achieve a much higher detection rate on multi-language files than other methods. When a language accounts for less than 20% of a document, the conventional method recognizes the encoding in about 50% of cases. The proposed method recognizes the encoding in up to 96% of cases, regardless of the proportion of each language.
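As a rough illustration of how three of the techniques named in the abstract might fit together, the following Python sketch first checks for escape sequences, then discards candidate encodings whose byte ranges do not fit the data, and finally scores the survivors by counting commonly used words from several languages. It is not the authors' implementation: the candidate list, the tiny word list, the function names, and the scoring rule are all assumptions made for this example, and strict decoding is used here as a stand-in for character code range detection.

```python
from typing import Optional

# Hypothetical tables for illustration only; the paper's databases are not reproduced here.
ESCAPE_SIGNATURES = {
    b"\x1b$)C": "iso2022_kr",  # ISO-2022-KR designator sequence
    b"\x1b$B": "iso2022_jp",   # ISO-2022-JP switch to JIS X 0208
}

# Order matters: on a tied score, the earlier candidate wins.
CANDIDATES = ["utf_8", "euc_kr", "shift_jis", "latin_1"]

# A tiny stand-in for a multi-language word database.
COMMON_WORDS = [
    "the", "and", "of",       # English
    "그리고", "있다", "것",     # Korean
    "これ", "です", "ます",     # Japanese
]


def detect_by_escape(data: bytes) -> Optional[str]:
    """Escape character detection: look for known ISO-2022 designators."""
    for signature, encoding in ESCAPE_SIGNATURES.items():
        if signature in data:
            return encoding
    return None


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Range check stand-in: strict decoding fails on out-of-range bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


def word_score(text: str) -> int:
    """Commonly used word detection: count substring hits across all languages."""
    lowered = text.lower()
    return sum(lowered.count(word) for word in COMMON_WORDS)


def detect_encoding(data: bytes) -> Optional[str]:
    escaped = detect_by_escape(data)
    if escaped is not None:
        return escaped
    best, best_score = None, -1
    for encoding in CANDIDATES:
        text = try_decode(data, encoding)
        if text is None:
            continue  # eliminated by the range check
        score = word_score(text)
        if score > best_score:
            best, best_score = encoding, score
    return best


sample = "이것은 한국어 문장이다. 그리고 the rest of the file is English."
print(detect_encoding(sample.encode("euc_kr")))
```

Because `latin_1` accepts every byte sequence, it acts as a fallback in this sketch, and only the word score separates it from the correct encoding. In the mixed Korean and English sample, the Korean words are matched only when the bytes are decoded as EUC-KR, which is why checking words from several languages at once helps on multi-language input.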
 Keywords
character encoding; automatic recognition; word database; multiple language document
 Language
Korean