An Algorithm for Predicting the Relationship between Lemmas and Corpus Size

Yang, Dan-Hee; Gomez, Pascual Cantos; Song, Man-Suk

ETRI Journal

Volume 22 Issue 2 / Pages 20-31 / 2000 / 1225-6463 (pISSN) / 2233-7326 (eISSN)

Electronics and Telecommunications Research Institute (한국전자통신연구원)

Yang, Dan-Hee (Department of Computer Engineering, Samchok National University);
Gomez, Pascual Cantos (University of Murcia);
Song, Man-Suk (Department of Computer Science, Yonsei University)

Received : 2000.01.31
Published : 2000.06.30

Abstract

Much research in natural language processing (NLP), computational linguistics and lexicography relies on linguistic corpora. In recent years, many organizations around the world have been constructing their own large corpora to achieve corpus representativeness and/or linguistic comprehensiveness. However, there is no reliable guideline as to how large a machine-readable corpus should be in order to develop practical NLP software and/or complete dictionaries for human and computational use. To shed new light on this issue, we reveal the flaws of several previous studies that aim to predict corpus size, especially those using pure regression or curve-fitting methods. To overcome these flaws, we devise a new mathematical tool, a piecewise curve-fitting algorithm, and then suggest how to determine the tolerance error of the algorithm for good prediction, using a specific corpus. Finally, we illustrate experimentally that the presented algorithm is valid, accurate and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, compiling methodology, corpus representativeness and linguistic comprehensiveness.
