An Algorithm for Predicting the Relationship between Lemmas and Corpus Size

  • Received : 2000.01.31
  • Published : 2000.06.30


Much research in natural language processing (NLP), computational linguistics, and lexicography relies on linguistic corpora. In recent years, many organizations around the world have been building their own large corpora to achieve corpus representativeness and/or linguistic comprehensiveness. However, there is no reliable guideline for how large a machine-readable corpus must be in order to develop practical NLP software and/or complete dictionaries for human and computational use. To shed new light on this issue, we first expose the flaws of several previous studies that aim to predict corpus size, especially those using pure regression or curve-fitting methods. To overcome these flaws, we devise a new mathematical tool, a piecewise curve-fitting algorithm, and then suggest how to determine the tolerance error of the algorithm for good prediction, using a specific corpus. Finally, we show experimentally that the presented algorithm is valid, accurate, and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, compiling methodology, corpus representativeness, and linguistic comprehensiveness.
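The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a piecewise curve fit with a tolerance error might look like, not the authors' method. It assumes that each piece is a power law y = k * x**b (x being corpus size in tokens, y the number of distinct lemmas), that pieces are found greedily, and that the tolerance is the maximum relative error within a piece. All function names and the sample data are hypothetical.

```python
import math

def fit_power_law(points):
    """Least-squares fit of y = k * x**b in log-log space; returns (k, b)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    k = math.exp(my - b * mx)
    return k, b

def max_rel_error(points, k, b):
    """Largest relative deviation of the fitted curve from the observations."""
    return max(abs(k * x ** b - y) / y for x, y in points)

def piecewise_fit(points, tol):
    """Greedy segmentation (an assumption): grow each piece until the
    fit over that piece would exceed the tolerance error `tol`."""
    segments = []
    start = 0
    while start < len(points) - 1:
        end = start + 2  # a piece needs at least two points
        k, b = fit_power_law(points[start:end])
        while end < len(points):
            k2, b2 = fit_power_law(points[start:end + 1])
            if max_rel_error(points[start:end + 1], k2, b2) > tol:
                break
            k, b = k2, b2
            end += 1
        segments.append((points[start][0], points[end - 1][0], k, b))
        start = end - 1  # adjacent pieces share their boundary point
    return segments

def predict(segments, corpus_size):
    """Extrapolate with the last piece, which reflects the latest growth rate."""
    _, _, k, b = segments[-1]
    return k * corpus_size ** b

# Hypothetical observations: (corpus size in tokens, distinct lemmas seen).
observations = [(10, 10.0), (100, 100.0), (1000, 1000.0),
                (10000, 1000.0 * 10 ** 0.5), (100000, 10000.0)]
pieces = piecewise_fit(observations, tol=0.01)
estimate = predict(pieces, 1_000_000)
```

Extrapolating from only the last piece is one possible design choice: a single curve fitted to the whole range is pulled toward the early, fast-growing part of the data, whereas the final piece tracks the slower growth seen at larger corpus sizes. A smaller `tol` yields more, shorter pieces.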


