Proceedings of the KSPS Conference (대한음성학회 학술대회논문집)
The Korean Society of Phonetic Sciences and Speech Technology - Semi-Annual
Domain
- Linguistics > Linguistics, General
2005.04a
-
This study constructs a spoken corpus based on current radio broadcasts in order to find prototypes and characteristics of the intonation of modern Korean predicate statements ending in 「-a/e, tʃijo」 and 「-p/simnida」. The results are as follows: (1) a balanced spoken corpus and a standard for determining rhythmic boundaries are needed for an intonation model for speech synthesis; (2) Korean intonation units have a split word tone that includes the nuclear tone, and the pre-nuclear tone makes the nuclear tone more detailed; (3) separate male and female intonation models were built through t-tests in SPSS; (4) the standard intonation model is divided into an '-ajo' type and a '-nida' type.
-
The aim of this paper is to analyze vowel lengthening in Korean, whose function is distinctive at the word level. I examined two acoustic parameters, vowel length and formants (F1 and F2), to distinguish or identify a long vowel and its short counterpart, for example /a:/ and /a/. Based on the results of the experimental analysis and the discussion of vowel length and its influence on the Korean phonological system, I regard vowel lengthening as a prosodeme, that is, a prosodic element in the Korean phonological system.
-
The aim of this study is to provide information on the frequencies of occurrence of Korean phonemes and syllables by analysing spontaneous speech from 3- to 8-year-old Korean children. Forty-nine Korean children (7~10 children per age group) served as subjects. Speech data were recorded and phonemically transcribed; 120 utterances per child were selected for analysis, except for one child whose data comprised only 91 utterances. The data for the present study comprised 5,971 utterances, 51,554 syllables, and 105,491 phonemes. Among the 19 consonants, /n/ showed the highest frequency, and the frequency rate of the four most frequent consonants was over 50% for all age groups. Among the 18 vowels, /a/ was the most frequent, with /i/ and /ʌ/ second and third respectively. The most frequently occurring syllable types were in most cases part of a grammatical word, and only 5~6% of syllable types covered 50% of the speech. -
This paper presents durational characteristics of Korean Lombard speech using data consisting of 500 Lombard utterances and 500 normal utterances from 10 speakers (5 male and 5 female). Each file was segmented and labeled manually, and the duration of each segment and each word was extracted. The durational change under the Lombard effect relative to normal speech was analyzed statistically. The results show that word duration increases under the Lombard effect, and that the average unvoiced consonantal duration is reduced while the average vocalic duration is increased. Female speakers show a stronger tendency toward lengthening in Lombard speech, though without statistical significance. Finally, this study also shows that speakers of Lombard speech can be classified according to their different duration rates.
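The statistical comparison of Lombard and normal durations can be sketched with a two-sample (Welch) t statistic. The duration values below are hypothetical illustrations, not the study's data:

```python
import math

def t_statistic(a, b):
    """Two-sample Welch t statistic for comparing mean durations."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Hypothetical word durations in ms: Lombard speech vs. normal speech.
lombard = [520, 555, 540, 565, 530, 550]
normal  = [480, 470, 495, 460, 475, 490]

t = t_statistic(lombard, normal)
print(round(t, 2))  # a large positive t indicates longer Lombard durations
```

In practice the t value would be compared against the t distribution's critical value for the given degrees of freedom to decide significance.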
-
The purpose of this paper is to represent the values of acoustic cues for Korean oral stops in a multi-dimensional space, and to attempt to find possible relationships among the cues through correlation coefficient analyses. The acoustic cues used to differentiate the three types of Korean stops are closure duration, voice onset time, and the fundamental frequency of the vowel following the stop. The values of these cues are plotted in two- and three-dimensional space to see which cues are critical for complete separation of the different stop types. Correlation coefficient analyses show statistically significant relationships among the acoustic cues, but these are not strong enough to support the conjecture that there is an articulatory relationship among the mechanisms underlying them.
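A correlation analysis of this kind reduces to computing Pearson's r between pairs of cue measurements. The sketch below uses invented cue values for a handful of stop tokens, purely to illustrate the computation:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical measurements per token:
# closure duration (ms), voice onset time (ms), post-stop F0 (Hz).
closure = [92, 110, 78, 135, 101, 88]
vot     = [18, 65, 12, 80, 24, 15]
f0      = [195, 240, 180, 255, 205, 190]

r_vot_f0 = pearson_r(vot, f0)
print(round(r_vot_f0, 3))
```

A significant but moderate r, as the abstract reports, would warrant exactly the caution expressed: correlation alone does not establish a shared articulatory mechanism.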
-
This study aims to establish a speech corpus of Korean as a foreign language (L2 Korean Speech Corpus, L2KSC) and to examine foreign learners' acquisition of the phonetic and phonological systems of Korean. In the first year of this project, L2KSC will be established through reading-list organization, recording, and slicing; the second year includes an in-depth study of foreign learners' Korean acquisition and a contrastive analysis of phonetic and phonological systems. The expectation is that this project will provide a significant basis for a variety of fields such as Korean language education, academic research, and speech technology development.
-
Recently, owing to the rapid development of corpus-based speech synthesis, the performance of TTS systems, which convert text into speech, has improved, and such systems are applied in various fields. However, a procedure for the objective assessment of system performance is not well established in Korea. Such a procedure is essential both for developers assessing the systems they build and as a standard for users choosing a suitable system. In this paper we report the results of basic research toward a systematic standard for the objective assessment of Korean TTS systems, with reference to various related efforts in Korea and other countries.
-
This study concerns a Chinese tone evaluation system for Korean learners based on speech technology. The Chinese pronunciation system consists of initials, finals, and tones; initials and finals are at the segmental level, while tones are at the suprasegmental level, so different methods can be used to assess Korean users' Chinese. Unlike segmental-level recognition methods, we chose a pattern matching method for evaluating Chinese tones. We first estimated each speaker's own pitch range and produced standard tonal patterns relative to that range, and then compared users' input patterns with the reference patterns.
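The range-normalized pattern matching described above can be sketched as follows. The reference contour and the learner's F0 values are invented for illustration and are not taken from the paper:

```python
def normalize(contour, f0_min, f0_max):
    """Map an F0 contour into the speaker's own pitch range -> [0, 1]."""
    span = f0_max - f0_min
    return [(f - f0_min) / span for f in contour]

def pattern_distance(user, ref):
    """Mean squared distance between two equal-length normalized contours."""
    return sum((u - r) ** 2 for u, r in zip(user, ref)) / len(ref)

# Hypothetical reference pattern for a high-falling tone (e.g. Mandarin
# tone 4), already expressed in normalized [0, 1] units.
ref_tone4 = [1.0, 0.8, 0.55, 0.3, 0.1]

# A learner's contour in Hz, with the learner's own estimated pitch range.
user_hz = [280, 262, 240, 215, 196]
user_norm = normalize(user_hz, f0_min=190, f0_max=290)

d = pattern_distance(user_norm, ref_tone4)
print("distance:", round(d, 4))
```

Normalizing by each speaker's own range is what lets one reference pattern serve both high-pitched and low-pitched learners; the distance would then be thresholded or scored for feedback.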
-
This study concerns the construction of an L2 pronunciation correction system for L1 speakers using speech technology. The Chinese pronunciation system consists of initials, finals, and tones; initials and finals are at the segmental level, while tones are at the suprasegmental level, so different methods can be used to assess Korean users' Chinese. With the standard acoustic model, the recognition rate is 81.9% for initials and 68.7% for finals. Unlike native speech recognition, nonnative speech recognition can be improved by additional modeling using L2 speakers' speech. As a first step toward this task, we analysed nonnative speech and then set a strategy for modeling Korean speakers' Chinese.
-
Cepstral Mean Subtraction (CMS) effectively compensates for channel distortion, but it has shortcomings such as distortion of the feature parameters and the need to wait for the whole utterance. Assuming that the silence parts carry the channel characteristics, we consider channel normalization by subtracting cepstral means obtained from the silence regions only. If this technique compensates for the channel successfully, the proposed method can be used in real-time processing environments or other time-critical applications. In the experimental results, however, the performance of our method is not as good as that of CMS. From an analysis of the results, we see potential in the proposed method and will try to find techniques that reduce the gap between CMS and ours.
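The silence-based variant can be sketched in a few lines: estimate the cepstral mean from silence frames only (as flagged by some voice activity detector) and subtract it from every frame. The toy vectors below, with a constant channel offset, are hypothetical:

```python
def silence_cms(frames, is_silence):
    """Subtract the cepstral mean estimated from silence frames only.

    frames     : list of cepstral vectors (lists of floats)
    is_silence : parallel list of booleans from a VAD / endpoint detector
    """
    sil = [f for f, s in zip(frames, is_silence) if s]
    dim = len(frames[0])
    mean = [sum(f[i] for f in sil) / len(sil) for i in range(dim)]
    return [[f[i] - mean[i] for i in range(dim)] for f in frames]

# Toy 2-dimensional cepstra with a constant channel offset of (0.5, -0.3);
# the first and last frames are silence, so they carry only the channel.
frames = [[0.5, -0.3], [1.5, 0.7], [2.5, 1.7], [0.5, -0.3]]
is_silence = [True, False, False, True]

compensated = silence_cms(frames, is_silence)
print(compensated)
```

Because the mean needs only the silence preceding (or surrounding) speech, the subtraction can begin before the utterance ends, which is the real-time advantage the abstract targets; the cost is a noisier mean estimate, consistent with the reported gap to full-utterance CMS.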
-
In this paper, we propose a transformation-based robust adaptation technique that uses maximum mutual information (MMI) estimation for the objective function and linear spectral transformation (LST) for adaptation. LST is an adaptation method that handles environmental noise in the linear spectral domain, so that a small number of parameters can be used for fast adaptation. The proposed technique, called MMI-LST, is evaluated on the TIMIT and FFMTIMIT corpora to show that it is advantageous when only a small amount of adaptation speech is available.
-
One's mother tongue can affect the learning of a foreign language, especially its pronunciation. Investigating and comparing the English vowels /ɛ/ and /æ/ and their supposedly corresponding Korean vowels /에/ and /애/, this study addresses the following questions: Can Koreans distinguish /에/ and /애/? Can they distinguish English /ɛ/ and /æ/? And what is the relationship between the Korean and the English vowels, that is, is the conventional correspondence /에/-/ɛ/ and /애/-/æ/ appropriate? The results showed that most Korean students distinguish neither Korean /에/ and /애/ nor English /ɛ/ and /æ/. While not distinguishable within a language, Korean /에/ and /애/ still form a group separate from English /ɛ/ and /æ/. Therefore the correspondence /에/-/ɛ/ and /애/-/æ/ is not appropriate, and strategies for teaching English pronunciation should be designed accordingly. -
Sasang Constitution medicine, a branch of oriental medicine, claims that people can be classified into four 'constitutions': Taeyang, Taeum, Soyang, and Soeum. This study investigates whether the classification can be made accurately from the voice alone, by analyzing data from 46 voices whose constitutions had already been determined. Seven source-related parameters and four filter-related parameters were analyzed phonetically, and a GMM (Gaussian mixture model) was applied to the data. Both the phonetic analyses and the GMM showed that all the parameters except one failed to distinguish the constitutions successfully, and even the single exception, the bandwidth of F2, did not provide sufficient grounds for distinction. This result suggests one of two conclusions: either the Sasang constitutions cannot be substantiated from the phonetic characteristics of people's voices with reliable accuracy, or we need to find other parameters that have not been conventionally proposed.
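Classification with Gaussian models of voice parameters amounts to scoring each vector under per-class densities and picking the best. The sketch below uses single-component diagonal Gaussians as a simplified stand-in for the paper's GMM; the class names are from the abstract, but the parameter values are invented:

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-likelihood of vector x under a diagonal Gaussian."""
    ll = 0.0
    for xi, mi, vi in zip(x, mean, var):
        ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return ll

# Hypothetical per-constitution models over two voice parameters
# (e.g. F2 bandwidth in Hz, jitter in %): (mean vector, variance vector).
models = {
    "Taeum":  ([250.0, 1.2], [400.0, 0.1]),
    "Soyang": ([310.0, 1.8], [400.0, 0.1]),
}

def classify(x):
    """Assign x to the class with the highest log-likelihood."""
    return max(models, key=lambda c: gaussian_loglik(x, *models[c]))

print(classify([300.0, 1.7]))
```

The abstract's negative result corresponds to the class means sitting so close together, relative to the variances, that such a classifier performs near chance.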
-
The purpose of this paper is to compare the duration of Korean medial fortis consonants between Korean native speakers and Japanese native speakers who study Korean. For this purpose, words with a medial fortis were selected from the SITEC DB. The Korean medial fortis of Japanese speakers tends to have a longer closure/friction duration than that of Korean native speakers in three-syllable words, while there is no distinct difference in two-syllable words. This may be due to the different timing units of Korean and Japanese.
-
We can easily recognize voices already known to us, but what about unknown voices? Is there any relationship between voices and the images they trigger? This question has been partly addressed by Moon (2000, 2002). The current study aims to shed more light on the topic by investigating the relationship between unknown foreign voices and the images they trigger. Speech samples from 16 American males and females (8 each) were recorded, and 180 Korean subjects with no knowledge of the American speakers were asked to match the voices with the corresponding photos. The number of correct matches between voices and pictures was smaller than in the Korean-speaker, Korean-listener case. In terms of majority matches, however, regardless of correctness, the present study showed a similar trend: there is a more-than-chance relationship between voices and the images they trigger.
-
The aim of this paper was to analyze lexical effects on the spoken word recognition of Korean monosyllabic words. The lexical factors examined were word frequency, density, and lexical familiarity. The analysis showed that frequency was a significant predictor of the spoken word recognition score of monosyllabic words, while the other factors were not significant. This result suggests that word frequency should be considered in speech perception tests.
-
The purpose of the present study was to determine the effects of the yawn-sigh technique on the voice quality of a child with cleft palate. A 9-year-old child with cleft palate participated in the study three times a week for a month. The assessments were done with Dr. Speech (Version 4.0, Tiger DRS) on F0, jitter, shimmer, and NNE. The results showed a tendency for the voice to improve in terms of NNE; however, it did not reach statistical significance. -
Today's state-of-the-art speech recognition systems typically use continuous-density hidden Markov models with mixtures of Gaussian distributions. To obtain higher recognition accuracy, such models typically require a huge number of Gaussians, so these systems need too much memory and are too slow for large applications. Many approaches have been proposed for the design of compact acoustic models; one of them is the subspace distribution clustering hidden Markov model, which represents the original full-space distributions as combinations of a small number of subspace distribution codebooks. How to build the codebook is therefore an important issue in this approach. In this paper, we report experimental results on various quantization methods for building more accurate models.
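Codebook construction of this kind is, at its core, a clustering problem: the subspace parameters pooled from many Gaussians are quantized into a small set of codewords. The sketch below uses plain k-means on scalars as a minimal stand-in (the paper compares more elaborate quantization methods); the pooled values are invented:

```python
def kmeans_1d(values, k, iters=20):
    """Plain k-means on scalars: a minimal stand-in for codebook design."""
    # Seed centers by sampling the sorted values at regular intervals.
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Move each center to its cluster mean (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Hypothetical subspace means pooled from many Gaussians of one cepstral
# dimension; quantizing them into a 2-entry codebook.
subspace_means = [0.1, 0.2, 0.15, 2.0, 2.1, 1.9]
codebook = kmeans_1d(subspace_means, k=2)
print(sorted(codebook))
```

After quantization, each original Gaussian stores only codebook indices per subspace instead of full parameters, which is where the memory saving comes from.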
-
The performance of speech recognition is degraded by the mismatch between training and test environments. Many methods have been presented to compensate for additive noise and channel effects in the cepstral domain, with Cepstral Mean Subtraction (CMS) the representative one. Recently, high-order cepstral moment normalization has been introduced to improve recognition accuracy. In this paper, we apply high-order moment normalization together with a smoothing filter for real-time processing. In experiments on the Aurora2 DB, the proposed algorithm obtained an error rate reduction of 49.7% in comparison with the baseline system.
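The real-time flavor of such normalization can be sketched with a recursive (exponentially smoothed) running mean: instead of waiting for the whole utterance, the mean estimate is updated frame by frame and subtracted immediately. This first-order version illustrates only the smoothing idea, not the paper's full high-order moment scheme, and the stream values are invented:

```python
def running_mean_norm(cepstra, alpha=0.95, init=0.0):
    """Real-time style normalization: subtract an exponentially smoothed
    running mean instead of waiting for the whole utterance."""
    out, mean = [], init
    for c in cepstra:
        mean = alpha * mean + (1 - alpha) * c  # recursive smoothing filter
        out.append(c - mean)
    return out

# One cepstral coefficient carrying a constant channel offset of about 3.0.
stream = [3.0, 3.2, 2.8, 3.1, 2.9, 3.0, 3.0, 3.0]
print([round(v, 2) for v in running_mean_norm(stream)])
```

Higher moments (variance, skew) would be tracked with the same kind of recursive estimate and used to scale and reshape the features; the smoothing constant alpha trades adaptation speed against estimate stability.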
-
Many utterance verification (UV) algorithms have been studied to reject out-of-vocabulary (OOV) words in speech recognition systems. Most conventional confidence measures for UV are based on the log likelihood ratio test, but these measures take considerable time to evaluate the alternative-hypothesis, or anti-model, likelihood. We propose a novel confidence measure that makes use of the momentarily best-scored state sequence during the Viterbi search. Our approach is more efficient than conventional LRT-based algorithms because it needs neither an anti-model nor an alternative-hypothesis computation. The proposed confidence measure shows better performance on additive-noise-corrupted speech as well as on clean speech.
-
Research efforts have been made on out-of-vocabulary word rejection to improve the confidence of speech recognition systems, but little attention has been paid to the rejection of non-recognition sentences. With the appearance of pronunciation correction systems based on speech recognition technology, non-recognition sentences must be rejected to provide users with more accurate and robust results. In this paper, we introduce a standard-phoneme-based sentence rejection system that needs no special filler models; instead, we use the word spotting ratio to decide whether an input sentence should be accepted or rejected. Experimental results show comparable performance, in terms of the average of FRR and FAR, using only a standard phoneme-based recognition network.
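A word-spotting-ratio decision rule can be sketched as follows: compute the fraction of the expected prompt's words that the spotter actually found, and accept only above a threshold. The word lists and the 0.6 threshold are illustrative assumptions, not values from the paper:

```python
def word_spotting_ratio(spotted_words, prompt_words):
    """Fraction of the prompt's words found by the word spotter."""
    spotted = set(spotted_words)
    hits = sum(1 for w in prompt_words if w in spotted)
    return hits / len(prompt_words)

def accept(spotted_words, prompt_words, threshold=0.6):
    """Accept the utterance only if enough prompt words were spotted."""
    return word_spotting_ratio(spotted_words, prompt_words) >= threshold

prompt = ["the", "quick", "brown", "fox", "jumps"]
print(accept(["the", "quick", "fox"], prompt))  # 3/5 spotted
print(accept(["hello", "there"], prompt))       # 0/5 spotted
```

The threshold directly controls the FRR/FAR trade-off the abstract evaluates: raising it rejects more non-recognition sentences at the cost of falsely rejecting valid ones.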
-
The goal of our research is to build a text-independent speaker recognition system that can be used in any condition without an additional adaptation process. The performance of speaker recognition systems can be severely degraded under unknown, mismatched microphone and noise conditions. In this paper, we show that PCA (principal component analysis) without dimension reduction can greatly increase the performance, to a level close to that of the matched condition. The error rate is reduced further by the proposed augmented PCA, which augments the feature vectors of the most confusable pairs of speakers with an extra axis before PCA.
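Full-rank PCA as used here is a decorrelating rotation rather than a compression step, and "augmenting" amounts to appending, before PCA, each vector's projection onto an axis joining the confusable speakers. The sketch below is one plausible reading of that idea, with synthetic features standing in for real speaker data:

```python
import numpy as np

def pca_transform(X):
    """Full-rank PCA: decorrelate features without discarding dimensions."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    _, vecs = np.linalg.eigh(cov)  # eigenvectors of the covariance matrix
    return Xc @ vecs, vecs

def augmented_pca(X, mean_a, mean_b):
    """Append each vector's projection onto the axis joining the two most
    confusable speakers' means, then apply full-rank PCA."""
    axis = mean_a - mean_b
    axis = axis / np.linalg.norm(axis)
    extra = X @ axis                       # score along the confusable axis
    X_aug = np.hstack([X, extra[:, None]])
    return pca_transform(X_aug)[0]

# Synthetic 2-D feature vectors for two "confusable" speakers.
rng = np.random.default_rng(0)
spk_a = rng.normal([0.0, 0.0], 0.1, size=(20, 2))
spk_b = rng.normal([0.2, 0.1], 0.1, size=(20, 2))
X = np.vstack([spk_a, spk_b])

Z = augmented_pca(X, spk_a.mean(axis=0), spk_b.mean(axis=0))
print(Z.shape)  # one extra, decorrelated dimension per vector
```

Because no dimensions are dropped, the rotation loses no information while making the features decorrelated, which is what allows performance close to the matched condition; the extra axis then gives the confusable pair more room to separate.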