Proceedings of the KSPS conference (대한음성학회:학술대회논문집)
The Korean Society Of Phonetic Sciences And Speech Technology
- Semi Annual
Domain
- Linguistics > Linguistics, General
2003.05a
-
This paper presents the segmental labeling conventions proposed by SiTEC (Speech Information Technology Engineering Center) in 2002 and proposes new directions for revising them into a simpler version.
-
The purpose of this paper is to review one of the prosody labelling conventions for Korean, the K-ToBI convention (ver. 3.1), and to propose a couple of modifications and suggestions.
-
K-SEC (Korean-Spoken English Corpus) is a speech database currently under construction by the authors of this paper. This article discusses the need for K-SEC expressed by various academic disciplines and industrial circles, and introduces the characteristics of the K-SEC design, its catalogue, and the contents of the recorded database, with examples of the Korean and English phonetic and phonological factors taken into consideration. K-SEC marks the beginning of a parallel speech corpus, and it is suggested that similar corpora be expanded for future advances in experimental phonetics and speech information technology.
-
This paper presents a common speech database collection for telecommunication applications. Over a three-year project, we will construct very large-scale speech and text databases for speech recognition, speech synthesis, and speaker identification. The common speech database takes into account various communication environments and the distribution of speakers' sex, age, and region. It consists of Korean continuous digits, isolated words, and sentences that reflect Korean phonetic coverage, in various pronunciation styles such as read speech, dialogue speech, and semi-spontaneous speech. Thanks to the common speech databases, duplication of resources across the Korean speech industry is avoided, which encourages domestic speech companies and activates the domestic speech technology market.
-
To support the speech information technology industry effectively, the construction and distribution of standardized speech corpora for product and technology development is essential. This paper introduces the status and future plans of the speech corpora constructed by the Speech Information Technology Industry Support Center (SiTEC) during the first and second project years (May 1, 2001 to April 30, 2003). The corpora described include a car-noise corpus and a large-scale multi-channel in-car speech corpus for spreading speech information technology to traditional industries, various foreign-language speech corpora for export support, corpora recorded in a soundproof room for recognition and prosody/synthesis research, a dictation speech corpus, and a children's speech corpus.
-
This paper describes our recent work on developing a baseline platform for Korean spoken dialogue recognition. We have collected a speech corpus of about 65 hours with auditory transcriptions. Linguistic information on various levels, such as morphology, syntax, semantics, and discourse, is attached to the speech database using automatic or semi-automatic tagging tools.
-
The purpose of this paper is to correct the errors in the isolated-word speech database recorded in a PC environment and to analyze the various error types. The importance and procedures of error detection are also described.
-
This paper is a phonetic study of $F_{0}$ range and boundary tones in Mandarin Chinese. Production data from six Chinese speakers show declination, pitch resetting, and tonal variation of the boundary tone. In declarative sentences, $F_{0}$ declines gradually over the utterance, but a mid-sentence boundary prevents the $F_{0}$ of the following syllable from declining because of pitch resetting. The $F_{0}$ range of a syllable is expanded before mid- and final-sentence boundaries. In interrogative sentences, $F_{0}$ ascends gradually over the utterance, and a mid-sentence boundary makes the $F_{0}$ of the following syllable rise further. The $F_{0}$ range of the sentence-final syllable is expanded and its $F_{0}$ contour shows a rising curve.
-
The purpose of this paper is to investigate the realization of /h/ between sonorant sounds. For this purpose, we analyzed the speech of five speakers of standard Korean. We find that /h/ is more likely to be deleted when the speech rate is high, when the AP has more syllables, and when /h/ is farther from the AP-initial position, whereas the position of the AP or IP has no relation to the realization of /h/. Deletion of /h/ occurs more often in the following order. Preceding segments: lateral > nasal > vowel; following segments: vowel > glide. There is no change in the duration of the following vowel after /h/ deletion.
-
Most Korean students who have not studied Japanese pronounce the Japanese phoneme /k/ as Korean /kk/, regardless of sex. However, an analysis that takes multiple phonemic environments into account gives different results. Although the medial syllable that comes after 'the joon' does not show any specific distinction, in the remaining cases half of the subjects pronounced it as /kk/ and the other half as /k/. To draw firm conclusions, further studies are needed.
-
The purpose of this paper is to investigate voice imitation. Voice imitation changes various phonetic features, and our experimental results show that it primarily involves prosodic differences. To imitate a voice, imitators mainly change their fundamental-frequency range: they shift their high fundamental frequencies effectively while maintaining their low fundamental frequencies. The excellent group is also distinctly superior to the common group in imitating prosodic patterns. That is, changes in F0 range and in prosodic patterns are significant in voice imitation, while the low F0 is maintained by all speakers.
-
This experimental study shows that, in the reading of enumerated English isolated words, the release of the word-final stop is employed to signal enumeration, together with the well-known intonational pattern for it. Furthermore, this study tries to identify the phonetic correlates of releasing the word-final stop, focusing on the association of stop release/non-release with (i) the place of articulation (POA) of the word-final stop, (ii) the quality of the vowel preceding the final stop, and (iii) the voicing of the word-final stop.
-
In this paper, we present experimental results from a computer-based English pronunciation correction system for Korean speakers. The aim of the system is to detect incorrectly pronounced phonemes in spoken words and to give corrective comments to users. Speech data were collected from 254 native speakers and 411 Koreans and used for phoneme modeling and testing. We built two types of acoustic phoneme models: a native-speaker model and a Korean-speaker model. We also built language models that reflect Koreans' commonly occurring mispronunciations. The detection rate was over 90% for insertions, deletions, and replacements of phonemes, but under 75% for diphthong splits and accent errors.
-
In this paper, we describe a cross-morpheme pronunciation variation model that is especially useful for constructing a morpheme-based pronunciation lexicon for Korean LVCSR. Many pronunciation variations occur at morpheme boundaries in continuous speech. Since phonemic context, morphological category, and morpheme-boundary information jointly affect Korean pronunciation variations, we distinguish pronunciation variation rules according to location: within a morpheme, across a morpheme boundary in a compound noun, across a morpheme boundary in an eojeol, and across an eojeol boundary. In a 33K-morpheme Korean CSR experiment, an absolute improvement of 1.16% in WER over the baseline of 23.17% WER is achieved by modeling cross-morpheme pronunciation variations with a context-dependent multiple-pronunciation lexicon.
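As a rough illustration of conditioning variation rules on boundary type, here is a minimal sketch; the romanized phone symbols, the rule table, and the boundary labels are hypothetical stand-ins, not the paper's actual rule set.

```python
# Sketch: apply pronunciation-variation rules conditioned on boundary type.
# The rule table and romanized phone symbols below are illustrative only.

RULES = {
    # (left_phone, right_phone, boundary) -> (new_left, new_right)
    ("k", "n", "eojeol_internal"): ("ng", "n"),   # e.g. nasalization across a morpheme boundary
    ("p", "n", "eojeol_internal"): ("m", "n"),
    ("t", "i", "compound_noun"):   ("c", "i"),    # e.g. palatalization in a compound
}

def apply_variation(left_morph, right_morph, boundary):
    """Return surface phone strings for two adjacent morphemes."""
    left, right = list(left_morph), list(right_morph)
    key = (left[-1], right[0], boundary)
    if key in RULES:
        left[-1], right[0] = RULES[key]
    return "".join(left), "".join(right)

# A boundary-internal variant versus the unchanged cross-eojeol case:
print(apply_variation("hak", "nyen", "eojeol_internal"))  # ('hang', 'nyen')
print(apply_variation("hak", "nyen", "eojeol_boundary"))  # ('hak', 'nyen')
```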
-
In this paper, we propose a Background Model Set (BMS) algorithm for speaker verification that addresses the computational shortcomings of the conventional confidence measure (CM). The CM expresses the relative likelihood between the recognized models and the unrecognized models, the latter being known as anti-phone models. The conventional approach computes likelihoods and standard deviations over all phonemes when composing the anti-phone model, which degrades the anti-phone CM and increases recognition time. To address this problem, we study a method that reconstitutes the means and standard deviations used in the CM calculation with a BMS built from the phonemes nearest to the phoneme being scored.
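For reference, the relative-likelihood confidence measure can be sketched as follows; the background-set selection and the numbers are hypothetical, and only illustrate why scoring a small set of nearby phones is cheaper than scoring all phones.

```python
import math

def confidence_measure(target_loglik, background_logliks, n_frames):
    """Frame-normalized log-likelihood ratio between the recognized (target)
    phone model and a background (anti-phone) set, approximated by averaging
    the background likelihoods in the log domain."""
    # log of the mean likelihood over the background set (log-sum-exp minus log N)
    m = max(background_logliks)
    log_mean_bg = m + math.log(sum(math.exp(b - m) for b in background_logliks)
                               / len(background_logliks))
    return (target_loglik - log_mean_bg) / n_frames

# Using a small background set (e.g. the phones nearest the target) instead of
# all phones is the essence of reducing the anti-model computation:
full_set = [-420.0, -415.0, -500.0, -510.0, -480.0, -470.0]
near_set = full_set[:2]                       # hypothetical "nearest" phones
print(confidence_measure(-400.0, full_set, n_frames=50))
print(confidence_measure(-400.0, near_set, n_frames=50))
```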
-
This paper presents our style-based language model adaptation for Korean conversational speech recognition. Korean conversational speech exhibits various characteristics of content and style, such as filled pauses, word omission, and contraction, compared with written text corpora. For style-based language model adaptation, we report two approaches. Both focus on improving the estimation of domain-dependent n-gram models by relevance-weighting out-of-domain text data, where style is represented by n-gram-based tf*idf similarity. In addition to relevance weighting, we use disfluencies as predictors of the neighboring words. The best result reduces the word error rate by 6.5% absolute and shows that n-gram-based relevance weighting captures style differences well and that disfluencies are good predictors.
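A minimal sketch of relevance weighting, assuming plain n-gram cosine similarity as a stand-in for the paper's tf*idf-based measure; the toy data and function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n=1):
    """Bag of n-grams as a count vector (unigrams by default)."""
    return Counter(zip(*[tokens[i:] for i in range(n)]))

def cosine(a, b):
    num = sum(a[k] * b.get(k, 0) for k in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def relevance_weighted_counts(in_domain_tokens, out_of_domain_docs, n=1):
    """Weight each out-of-domain document's n-gram counts by its similarity to
    the in-domain data before pooling them for language model estimation."""
    target = ngram_counts(in_domain_tokens, n)
    pooled = Counter(target)
    for doc in out_of_domain_docs:
        vec = ngram_counts(doc, n)
        w = cosine(target, vec)          # style similarity in [0, 1]
        for gram, c in vec.items():
            pooled[gram] += w * c
    return pooled

# Hypothetical data: conversational in-domain text vs. two out-of-domain documents.
in_dom = "uh I mean I want to go go there".split()
ood = ["the committee approved the annual budget report".split(),
       "I want to go there tomorrow I mean it".split()]
pooled = relevance_weighted_counts(in_dom, ood)
print(round(pooled[("go",)], 2), round(pooled[("budget",)], 2))
```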
-
In order to produce high-quality synthesized speech, it is very important to obtain accurate grapheme-to-phoneme conversion and a prosody model from text using natural language processing; robust preprocessing of non-Korean characters is also required. In this paper, we analyze Korean texts using a morphological analyzer, a part-of-speech tagger, and a syntactic chunker. We present a new grapheme-to-phoneme conversion method, a dictionary-based and rule-based hybrid, for unlimited-vocabulary Korean TTS, and we construct a prosody model using a probabilistic method and a decision-tree-based method.
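The dictionary-plus-rule idea can be sketched as follows; the lexicon entries, romanizations, and fallback rules are invented for illustration and are not the paper's actual conversion rules.

```python
# Sketch of a dictionary-first, rule-fallback grapheme-to-phoneme converter.
# The romanized lexicon entries and the letter-to-phone rules are illustrative.

LEXICON = {
    "있다": "it-tta",        # exception words are looked up directly
    "좋다": "jo-ta",
}

RULES = {                     # naive per-character fallback (hypothetical)
    "가": "ga", "나": "na", "다": "da", "라": "ra",
}

def g2p(word):
    """Return a phone string: dictionary entry if present, else rule-by-rule."""
    if word in LEXICON:
        return LEXICON[word]
    return "-".join(RULES.get(ch, ch) for ch in word)

print(g2p("있다"))   # dictionary hit: 'it-tta'
print(g2p("가나다"))  # rule fallback: 'ga-na-da'
```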
-
The present study investigated differences in cortical activation when naming pictures in Korean and English. The experimental design was a 2 (language: Korean, English) x 4 (distractor: no distractor, semantically related distractor, semantically unrelated distractor, corresponding distractor) design; language was a between-subjects factor and distractor was a within-subjects factor. The result was that the Korean naming condition showed less cortical activation than the English naming condition. The activated regions are reported for each condition.
-
The present study investigated the cortical activation correlated with producing morphologically complex Korean verbs using fMRI. Two derivational affixes and two inflectional affixes were selected: the pre-final ending and the final ending as inflectional affixes, and the passive affix and the causative affix as derivational affixes. Two experiments were conducted. The results suggest the possibility that processing of the pre-final ending differs from that of the final ending.
-
The purpose of this study was to compare acoustic differences in the fricative /s/ between dysarthric and normal subjects. In addition, the subjects' speech was evaluated in terms of the intelligibility of words containing /s/ and perceptual severity. The acoustic parameters were the duration, peak frequency, and intensity of /s/. The results showed that, first, the peak frequency and intensity of /s/ differed significantly between dysarthric and normal subjects, and second, the perceptual parameters also differed significantly between the two groups. The Pearson correlation coefficient was used to determine the relationship between the acoustic and perceptual data, and showed a strong correlation between the perceptual parameters and the peak frequency of /s/.
-
The purpose of this paper was to evaluate the effect of auditory feedback on fundamental frequency in prelingually deaf children. Participants were sixty children in three groups: deaf children with cochlear implants (CI), deaf children with hearing aids (HA), and children with normal hearing (NH). Fundamental frequency was measured during sustained phonation of /a/. There were statistically significant differences in fundamental frequency across the groups (p<.01). In post hoc analysis, the HA and NH groups showed statistically significant differences, but the CI group did not. In a correlation analysis between F0 and chronological age, there were significant negative tendencies in the CI and NH groups, but not in the HA group. The fundamental-frequency characteristics of the CI group were more similar to those of the NH group than to those of the HA group. This can be understood as the effect of relatively sufficient auditory feedback after cochlear implantation.
-
This paper aims to develop sample paragraphs for voice pitch assessment specifically designed for Koreans. Recently, the demand for such a battery of sample sentences has steadily increased among Korean speech therapists. In this paper, different sample paragraphs (two conventionally used paragraphs and three newly developed ones consisting mainly of sonorant sounds and different sentence types), different software packages (Dr. Speech, Wavesurfer, Praat), and different techniques (automatic measurement and detailed measurement, in which the researcher controls many aspects that might influence the pitch measurement) are compared for measuring fundamental frequency.
-
Channel distortion and coarticulation in connected-digit telephone speech make recognition difficult and degrade performance in the telephone environment. In this paper, as basic research toward improving the recognition of Korean connected digits over the telephone, error patterns are investigated and analyzed. The telephone digit speech database released by SiTEC is used for the recognition experiments with an HTK-based system. DWFBA and MRTCN are used for feature extraction and channel compensation, respectively. The experimental results and our findings are discussed.
-
In general, triangular filters are used in the filter bank when MFCCs are computed from the spectrum of a speech signal. In [1], a new feature extraction approach is proposed that uses filter shapes derived from the spectra of training speech data: principal component analysis is applied to the training spectra to obtain the filter coefficients. In this paper, we carry out speech recognition experiments with the approach of [1] on a large amount of telephone speech, namely the Korean connected-digit telephone speech database released by SiTEC. The experimental results and our findings are discussed.
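A minimal sketch of the data-driven filter-bank idea, assuming the PCA is taken over mean-removed training power spectra and followed by the usual log and DCT steps; the exact recipe in [1] may differ, and the dimensions here are arbitrary.

```python
import numpy as np
from scipy.fftpack import dct

def pca_filterbank(train_spectra, n_filters=23):
    """Derive data-driven 'filter' shapes as the leading principal components
    of mean-removed training power spectra, instead of triangular filters."""
    X = train_spectra - train_spectra.mean(axis=0)      # (n_frames, n_bins)
    eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
    return eigvec[:, ::-1][:, :n_filters].T             # (n_filters, n_bins)

def cepstra_from_pca(power_spectrum, filters, n_ceps=13):
    """Project a power spectrum onto the PCA filters, then take log and DCT."""
    energies = filters @ power_spectrum
    log_e = np.log(np.maximum(np.abs(energies), 1e-10)) # projections can be negative
    return dct(log_e, type=2, norm="ortho")[:n_ceps]

# Hypothetical usage with random data standing in for training power spectra.
rng = np.random.default_rng(0)
train = np.abs(rng.standard_normal((500, 129))) ** 2
fb = pca_filterbank(train)
print(cepstra_from_pca(train[0], fb).shape)             # (13,)
```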
-
We discuss how to reduce the number and dimensions of the inverse matrices required in the MLLR framework for speaker adaptation. To find a smaller set of variables with less redundancy, we employ PCA (principal component analysis) and ICA (independent component analysis), which give as good a representation as possible. The additional computation introduced by PCA or ICA is small enough to be disregarded. The dimension of the HMM parameters is reduced to about 1/3 to 2/7 of the SI (speaker-independent) model parameter dimension, while the speech recognition system achieves a word recognition rate comparable to the ordinary MLLR framework. If the dimension of the SI model parameters is n, the amount of computation for the inverse matrix in MLLR is proportional to O($n^4$), so, compared with ordinary MLLR, the total computation required for speaker adaptation is reduced to about 1/80 to 1/150.
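To illustrate why working in a reduced subspace shrinks the matrix inversion, here is a simplified stand-in: PCA of the SI mean vectors and a global least-squares transform estimated in the subspace. This is not the paper's MLLR formulation (no occupation counts, no bias term, no ICA variant); the dimensions and data are arbitrary.

```python
import numpy as np

def pca_basis(si_means, k):
    """Leading k principal directions of the SI model mean vectors."""
    X = si_means - si_means.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k]                                    # (k, n)

def adapt_means(si_means, adapted_targets, k):
    """Estimate a global linear transform in the k-dim PCA subspace
    (least squares), then map all SI means through it."""
    V = pca_basis(si_means, k)
    Z = si_means @ V.T                               # (m, k) reduced SI means
    T = adapted_targets @ V.T                        # (m, k) reduced targets
    # W solves Z W ~= T; the normal-equation inverse is only k x k.
    W = np.linalg.solve(Z.T @ Z, Z.T @ T)
    return si_means + (Z @ W - Z) @ V                # shift along the subspace

# Hypothetical numbers: 200 Gaussian means of dimension 39, reduced to k=13.
rng = np.random.default_rng(1)
si = rng.standard_normal((200, 39))
targets = si + 0.3                                   # pretend adaptation statistics
print(adapt_means(si, targets, k=13).shape)          # (200, 39)
```
-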
This paper describes an efficient method for unsupervised speaker adaptation. The method selects a subset of speakers who are acoustically close to the test speaker and calculates adapted model parameters from the previously stored sufficient HMM statistics of the selected speakers' data. Only a small amount of unsupervised test-speaker data is required for adaptation, and since the sufficient HMM statistics of the selected speakers are reused, the adaptation is fast. Compared with a pre-clustering method, the proposed method obtains a more nearly optimal speaker cluster because the clustering result is determined on-line from the test speaker's data. Experimental results show that the proposed method attains a larger improvement over the speaker-independent model than MLLR, while using only one unsupervised sentence utterance, whereas MLLR usually requires more than ten supervised sentence utterances.
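A minimal sketch of the cohort-selection idea under simplifying assumptions: Euclidean distance to stored speaker means stands in for an acoustic likelihood criterion, and only mean statistics are combined; the stored statistics here are random placeholders.

```python
import numpy as np

def select_cohort(test_stats, speaker_stats, n_select=5):
    """Pick the speakers whose stored mean statistics are closest to the
    test speaker's data (Euclidean distance as a stand-in for likelihood)."""
    d = [np.linalg.norm(test_stats - s["mean_sum"] / s["count"]) for s in speaker_stats]
    order = np.argsort(d)
    return [speaker_stats[i] for i in order[:n_select]]

def adapt_from_sufficient_stats(cohort):
    """Combine pre-stored zeroth/first-order statistics of the cohort into
    adapted mean parameters -- no re-estimation over raw data is needed."""
    total_count = sum(s["count"] for s in cohort)
    total_sum = sum(s["mean_sum"] for s in cohort)
    return total_sum / total_count

# Hypothetical stored statistics for 20 training speakers (dimension 39).
rng = np.random.default_rng(2)
speakers = [{"count": 1000.0, "mean_sum": 1000.0 * rng.standard_normal(39)}
            for _ in range(20)]
test_mean = rng.standard_normal(39)                 # from one unsupervised utterance
cohort = select_cohort(test_mean, speakers)
print(adapt_from_sufficient_stats(cohort).shape)    # (39,)
```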
-
This paper describes an efficient algorithm for generating compact and complete prompt lists for a connected-spoken-digit database. In building a connected-spoken-digit recognizer, we have to acquire speech data in various contexts, but in many speech databases the prompt lists are produced with random generators. We provide an efficient algorithm that generates compact and complete lists of digits in various contexts, and we include a proof of the optimality and completeness of the algorithm.
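One classical way to make a digit prompt list both compact and complete at the pair level is to cover every ordered digit-to-digit transition exactly once, which is an Eulerian-circuit construction; the sketch below shows that construction and is not necessarily the algorithm proved optimal in the paper.

```python
def digit_prompts_covering_all_pairs(digits="0123456789"):
    """Build one digit string that contains every ordered digit pair
    (including repeats like '77') exactly once: an Eulerian circuit on the
    complete digraph with self-loops, found with Hierholzer's algorithm."""
    # adjacency: from each digit there is one unused edge to every digit
    remaining = {d: list(digits) for d in digits}
    stack, circuit = [digits[0]], []
    while stack:
        v = stack[-1]
        if remaining[v]:
            stack.append(remaining[v].pop())
        else:
            circuit.append(stack.pop())
    circuit.reverse()
    return "".join(circuit)

seq = digit_prompts_covering_all_pairs()
pairs = {seq[i:i + 2] for i in range(len(seq) - 1)}
print(len(seq), len(pairs))    # 101 digits covering all 100 ordered pairs
```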
-
In this paper, we propose a robust endpoint detection algorithm for speaker verification. The proposed algorithm uses energy and cepstral-distance parameters, and it replaces the detected endpoints with the endpoints of voiced speech when the estimated signal-to-noise ratio (SNR) is low. Experimental results show that the proposed algorithm is superior to an energy-based endpoint detection algorithm.
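A rough sketch of the idea, with assumed thresholds and a crude stand-in for voiced-frame detection (a stricter energy test); the paper's actual decision rules and parameters are not specified here.

```python
import numpy as np

def endpoints(frames, noise_frames=10, snr_floor_db=10.0):
    """Frame-level endpoint detection: energy and cepstral distance against a
    noise template; if the estimated SNR is low, fall back to frames flagged
    as voiced (here approximated by a stricter energy test)."""
    energy = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    ceps = np.fft.irfft(np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-12), axis=1)[:, :13]
    noise_e = energy[:noise_frames].mean()
    noise_c = ceps[:noise_frames].mean(axis=0)
    cep_dist = np.linalg.norm(ceps - noise_c, axis=1)

    speech = (energy > noise_e + 6.0) | (cep_dist > 2.0 * cep_dist[:noise_frames].mean())
    snr = energy.max() - noise_e
    if snr < snr_floor_db:              # unreliable: keep only strongly voiced frames
        speech = energy > noise_e + 12.0
    idx = np.flatnonzero(speech)
    return (int(idx[0]), int(idx[-1])) if idx.size else (None, None)

# Hypothetical signal: noise, then a louder 'speech' burst, then noise.
rng = np.random.default_rng(3)
sig = np.concatenate([0.01 * rng.standard_normal(4000),
                      0.5 * np.sin(2 * np.pi * 200 * np.arange(8000) / 8000),
                      0.01 * rng.standard_normal(4000)])
frames = sig[: len(sig) // 160 * 160].reshape(-1, 160)
print(endpoints(frames))
```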
-
In this paper, we compare formant-frequency extraction algorithms under various conditions and show their performance. The formant frequency is the resonance frequency determined by the vocal tract; it is related to the phoneme and to the physical state of the vocal tract. Since the speech signal is shaped by both the sound source and the vocal tract, it is difficult to calculate exact formant frequencies. Many studies on formant-frequency extraction have already been carried out, yet new formant-frequency extraction algorithms are rarely proposed these days.
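For context, the most common baseline among such algorithms is LPC root-finding; a minimal sketch follows, assuming librosa is available for the LPC fit, with a synthetic two-resonance signal and no bandwidth filtering of the candidate roots.

```python
import numpy as np
import librosa

def formants_lpc(y, sr, order=None, n_formants=3):
    """Estimate formants from the roots of an LPC polynomial: keep complex
    roots with positive angle and convert the angle to Hz."""
    order = order or int(2 + sr / 1000)            # common rule of thumb
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])     # pre-emphasis
    a = librosa.lpc(y * np.hamming(len(y)), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    return freqs[freqs > 90][:n_formants]          # drop near-DC artefacts

# Hypothetical vowel-like signal: two damped resonances near 700 Hz and 1200 Hz.
sr = 8000
t = np.arange(0, 0.03, 1 / sr)
y = np.exp(-60 * t) * (np.sin(2 * np.pi * 700 * t) + 0.7 * np.sin(2 * np.pi * 1200 * t))
print(formants_lpc(y, sr))
```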
-
In this paper, we carried out a comparative study of various feature parameters for effective speaker recognition, such as LPC, LPCC, MFCC, log area ratios, reflection coefficients, inverse sine coefficients, and delta parameters. We also adopted cepstral liftering and cepstral mean subtraction to check their usefulness. Our recognition system is HMM-based and uses a 4-connected-Korean-digit speech database. The various experimental results will help in selecting the most effective parameters for speaker recognition.
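Two of the post-processing steps mentioned, cepstral liftering and cepstral mean subtraction, plus simple delta coefficients, can be sketched as follows; the lifter length, delta window, and data are assumptions, not the paper's settings.

```python
import numpy as np

def lifter(cepstra, L=22):
    """Sinusoidal cepstral liftering (the standard HTK-style lifter)."""
    n = np.arange(cepstra.shape[1])
    w = 1 + (L / 2) * np.sin(np.pi * n / L)
    return cepstra * w

def cms(cepstra):
    """Cepstral mean subtraction: remove the per-utterance channel offset."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def deltas(cepstra, N=2):
    """Simple delta (regression) coefficients over a +/-N frame window."""
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode="edge")
    num = sum(n * (padded[N + n:len(cepstra) + N + n] - padded[N - n:len(cepstra) + N - n])
              for n in range(1, N + 1))
    return num / (2 * sum(n * n for n in range(1, N + 1)))

# Hypothetical MFCC matrix (frames x 13) standing in for one digit utterance.
mfcc = np.random.default_rng(4).standard_normal((120, 13))
feat = np.hstack([cms(lifter(mfcc)), deltas(mfcc)])
print(feat.shape)        # (120, 26)
```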
-
In this paper, we present a performance comparison of feature parameters and classifiers for speech/music discrimination. Experiments were carried out on six feature parameters and three classifiers. It turns out that the three classifiers show similar performance. The feature set that captures the temporal and spectral structure of the signal yields good performance, while the phone-based feature set shows relatively inferior performance.
-
In this paper, we study advanced audio/voice information processing techniques with the aim of producing more human-friendly audio and voice; the work is still at an early stage. First, we use well-known time-domain methods such as moving average, differentiation, interpolation, and decimation, along with some variations of them and envelope contour modification. We also use MOS tests to evaluate subjective listening factors. In the long term, the user's preference, mood, and environmental conditions will be considered, and we hope that our future technique can adapt speech and audio signals to them automatically.
-
In this paper I investigate the intervening-consonant constraint on umlaut in Korean. It is generally known that if a palatalized consonant, i.e. /ㅅ/ (s) or /ㅈ/ (tʃ), etc., intervenes in an umlaut environment, the expected umlaut process is blocked. But there are quite a few words, such as wensu, that are thought to have undergone umlaut diachronically. If we assume that these words were formed as a result of umlaut, we must explain why the intervening-consonant constraint is violated; if we instead assume that they were formed by other phonological processes, we must explain each of them with an ad hoc rule. In this paper I argue, on the basis of historical and dialectal evidence, that these words, among others, have undergone the umlaut process.
-
Intonation, as a suprasegmental phonetic feature, conveys meaning at the postlexical or utterance level in a linguistically structured way. It includes three aspects: tunes, relative prominence, and intonational phrasing. In this article, I examine how prosodic phrasing is functionally related to listening comprehension of English by analysing students' listening-comprehension errors. When utterance meaning is conveyed, the utterance is realized as a sequence of intonational phrases. The small intonational phrase is regarded as an intermediate phrase, which has a primary accent and a phrase tone or audible break. Most of the students' listening errors occurred with linked pronunciation within the intermediate phrases of fast speech. Thus, through this smallest tune-bearing unit, we can help students improve their English pronunciation and listening ability.
-
This study is about modeling the pronunciation dictionaries necessary for PLU (phoneme-like unit) level word recognition. Recognizing nonnative speakers' pronunciation enables the automatic diagnosis and error detection that are the core of an English pronunciation tutoring system. Such a system needs two pronunciation dictionaries: one representing standard English pronunciation, and the other representing Korean speakers' English pronunciation. Both dictionaries are integrated to generate pronunciation networks for the variants.
-
This paper investigates the phonation types of Malay plosives and compares them with Korean plosives in terms of VOT, F0, closure duration, and the durations of the preceding and following vowels. The study is significant in that it specifies the phonetic characteristics of the phonation types of the two languages and provides a phonetic basis for teaching and learning either of them. The results show that Malay voiceless plosives are higher than voiced ones in VOT, F0, and closure duration, but the reverse holds for the durations of the preceding and following vowels. The distribution of VOT suggests that Malay voiceless plosives are close to Korean fortis plosives.
-
This study was conducted to extract a feature set of English writing errors for designing an adequate English writing program and building an automated scoring system. The most frequently committed errors and the errors across levels of writing proficiency are reported, as well as the correlation between error type and native speakers' rating scores.
-
Speech recognition performance depends on various factors, one of which is the characteristics of the microphone used when the speech data are collected. In the present experiment, test speech databases were therefore created with different types of microphones. Acoustic models were then built from these databases, and each acoustic model was assessed on the data to determine how recognition performance depends on the microphone.
-
The purpose of this paper is to investigate the phonetic parameters used in voice imitation. First of all, the fundamental frequency is imitated effectively, and distinctive prosodic patterns are used repeatedly. Speaking rate is used as an additional cue when the target speaker has an extraordinary speaking rate, and formant frequencies are also imitated in various ways. In sum, the distinctive characteristics perceived by listeners are the ones exploited in voice imitation.
-
The aim of this paper is to analyze the phonetic features of disguised voice. We examined features such as phonation type, pitch range, speech rate, intonation type, and boundary tones. The results of the analysis are as follows: (1) phonation type is a very important aspect of voice disguise for male subjects; (2) pitch range and average pitch value are very important cues for speaker verification; (3) pitch contour, speech rate, and boundary tones can serve as secondary cues for speaker verification.
-
The results of this thesis show that Seoul dialect speakers neutralize /$\varepsilon$/ and /e/ to /E/ about 80% of the time, and that the older generation pronounces /ㅐ/ and /ㅔ/ distinctly more often than the younger generation does.
-
In this paper, we employ adaptive comb filtering for effective noise reduction in a mobile communication environment. Adaptive comb filtering is a well-known noise reduction method, but it requires the correct pitch period and must be applied only to voiced speech frames. To satisfy these requirements we use two kinds of information extracted from the speech packets: the pitch period measured precisely by the speech coder, and the frame-rate information related to the speech/silence decision. Speech recognition experiments confirm the efficiency of this method: feature parameters computed with it give better performance in noisy environments than those extracted directly from the output speech.
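A minimal sketch of a pitch-synchronous comb filter applied only when a pitch period is available, assuming an FIR form with two harmonically delayed taps; the tap count and weights are illustrative, not the paper's configuration.

```python
import numpy as np

def comb_filter(x, pitch_period, taps=2, alpha=0.8):
    """FIR comb filter: average the current sample with samples one and two
    pitch periods earlier, which reinforces harmonics and suppresses
    aperiodic noise. Applied only when a valid pitch period is supplied
    (i.e. the coder marked the frame as voiced)."""
    if pitch_period <= 0:
        return x                                   # unvoiced/silence: pass through
    y = np.copy(x).astype(float)
    norm = 1.0
    for k in range(1, taps + 1):
        w = alpha ** k
        y[k * pitch_period:] += w * x[:-k * pitch_period]
        norm += w
    return y / norm

# Hypothetical voiced frame: 100 Hz periodic signal at 8 kHz plus white noise.
sr, f0 = 8000, 100
n = np.arange(800)
clean = np.sin(2 * np.pi * f0 * n / sr)
noisy = clean + 0.5 * np.random.default_rng(5).standard_normal(n.size)
enhanced = comb_filter(noisy, pitch_period=sr // f0)   # period = 80 samples
print(np.std(noisy - clean), np.std(enhanced - clean)) # residual noise is reduced
```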
-
In this paper, combinations of speech enhancement techniques are investigated. Specifically, spectral subtraction, KLT-based comb filtering, and their combinations are applied to the Aurora2 database. The results show that recognition accuracy improves when KLT-based comb filtering is applied after spectral subtraction.
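For reference, the spectral-subtraction front end of such a pipeline can be sketched as follows; the over-subtraction factor, spectral flooring, and noise-estimation segment are assumptions, and the KLT-based comb-filtering stage is not shown.

```python
import numpy as np

def spectral_subtraction(noisy, sr, noise_ms=100, frame=256, hop=128,
                         over_sub=1.5, floor=0.02):
    """Magnitude spectral subtraction with over-subtraction and spectral
    flooring; the noise spectrum is estimated from the leading noise-only
    segment. Overlap-add resynthesis with a Hann window."""
    win = np.hanning(frame)
    n_noise = int(sr * noise_ms / 1000) // hop
    out = np.zeros(len(noisy))
    noise_mag = None
    for i, start in enumerate(range(0, len(noisy) - frame, hop)):
        spec = np.fft.rfft(noisy[start:start + frame] * win)
        mag, phase = np.abs(spec), np.angle(spec)
        if i < n_noise:                                  # accumulate noise estimate
            noise_mag = mag if noise_mag is None else noise_mag + mag
            continue
        est = noise_mag / n_noise
        clean_mag = np.maximum(mag - over_sub * est, floor * mag)
        out[start:start + frame] += np.fft.irfft(clean_mag * np.exp(1j * phase)) * win
    return out

# Hypothetical noisy signal: leading noise-only segment, then a tone plus noise.
sr = 8000
rng = np.random.default_rng(6)
sig = np.concatenate([np.zeros(800), np.sin(2 * np.pi * 440 * np.arange(8000) / sr)])
noisy = sig + 0.3 * rng.standard_normal(sig.size)
print(spectral_subtraction(noisy, sr).shape)
```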
-
In general, speech recognition performance is degraded by noise. Korean connected-digit recognition over the telephone network is one of the difficult areas of speech recognition, because the recognition rate drops due to coarticulation and because the telephone channel distorts the spectral envelope and limits the bandwidth of the speech signal. In this paper, to reduce the influence of noise, we use a 2-stage Wiener filter (2WF), SNR-dependent waveform processing (SWP), and cepstrum mean normalization (CMN). The 2WF reduces not only overall additive noise but also dynamic additive noise while causing little distortion of the formant structure of the speech signal. SWP improves the overall SNR by emphasizing the portions of the waveform with relatively high SNR. CMN improves recognition performance by normalizing the effect of channel noise in the feature vectors. Experiments on a telephone Korean connected-digit database show that these methods reduce the influence of noise while minimizing distortion of the speech signal, improving digit recognition performance over the telephone network.
-
In this paper we apply PMC (parallel model combination) to a speech recognition system on-line. As a representative model-based noise compensation technique, PMC compensates for environmental mismatch by combining pretrained clean-speech models with noise information estimated in real time. This approach is very effective for compensating severe environmental mismatch, but its heavy computational cost makes it unsuitable for on-line systems. To reduce the computational cost and apply PMC on-line, we exploit a noise-masking effect, namely that the energy in a frequency band is dominated either by the clean speech energy or by the noise energy, in the model compensation process. Experiments on artificially produced noisy speech data confirm that the proposed technique is fast and effective for on-line model compensation.
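The masking idea can be illustrated in the log filter-bank domain as follows; this sketch only compares the element-wise "log-max" approximation with the exact log-add of mean energies, and it omits the variance terms and cepstral transforms of full PMC, so it is not the paper's exact procedure.

```python
import numpy as np

def masked_pmc_mean(clean_log_fbank_mean, noise_log_fbank, g=1.0):
    """Cheap on-line model compensation under the masking assumption:
    in each filter-bank channel the noisy energy is dominated by either the
    clean-speech energy or the noise energy, so the compensated log-energy
    mean is approximately the element-wise maximum (the 'log-max' rule)
    instead of the exact log-add used in full PMC."""
    return np.maximum(clean_log_fbank_mean, noise_log_fbank + np.log(g))

def exact_logadd_mean(clean_log, noise_log, g=1.0):
    """Reference: log-add combination of the two energies (no variance terms)."""
    return np.log(np.exp(clean_log) + g * np.exp(noise_log))

# Hypothetical 23-channel log filter-bank means for one Gaussian and one noise estimate.
rng = np.random.default_rng(7)
clean = rng.uniform(1.0, 6.0, 23)
noise = rng.uniform(2.0, 4.0, 23)
approx, exact = masked_pmc_mean(clean, noise), exact_logadd_mean(clean, noise)
print(np.max(np.abs(approx - exact)))    # masking error is at most log(2) per channel
```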
-
In this paper, we study the use of phoneme duration information for rejecting garbage sentences. First, we build a phoneme duration model in a speech recognition system based on decision-tree state tying, assuming that phone durations follow a Gamma distribution. Next, we build a verification module that uses a word-level confidence measure. Finally, we make a comparative study of phoneme duration with a speech DB obtained from the live system, which consists of OOT (out-of-task) and ING (in-grammar) utterances. Using phone duration information improves the OOT recognition rate by 46%, and a further 8.4% of the error rate is reduced when it is combined with the utterance verification module.
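A minimal sketch of Gamma duration modeling and a duration-based score, assuming a method-of-moments fit and an average per-phone log-likelihood; the training durations and the way such a score would be combined with the word-level confidence measure are hypothetical.

```python
import numpy as np
from math import lgamma

def fit_gamma(durations):
    """Method-of-moments fit of a Gamma distribution to phone durations
    (in frames): shape k = mean^2/var, scale theta = var/mean."""
    d = np.asarray(durations, dtype=float)
    mean, var = d.mean(), d.var()
    return mean * mean / var, var / mean

def gamma_log_pdf(x, k, theta):
    return (k - 1) * np.log(x) - x / theta - k * np.log(theta) - lgamma(k)

def duration_score(phone_durations, models):
    """Average per-phone duration log-likelihood for one recognized hypothesis;
    low scores suggest an out-of-task (garbage) utterance."""
    return np.mean([gamma_log_pdf(d, *models[p]) for p, d in phone_durations])

# Hypothetical training durations (frames) and two test hypotheses.
rng = np.random.default_rng(8)
models = {"a": fit_gamma(rng.gamma(9, 1.2, 500)),   # vowel-like: longer
          "k": fit_gamma(rng.gamma(4, 1.0, 500))}   # stop-like: shorter
print(duration_score([("a", 11), ("k", 4)], models))    # plausible durations
print(duration_score([("a", 1), ("k", 40)], models))    # implausible durations
```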