Proceedings of the KSPS conference (대한음성학회:학술대회논문집)
The Korean Society Of Phonetic Sciences And Speech Technology
- Semi Annual
Domain
- Linguistics > Linguistics, General
2005.11a
-
This paper gives a brief outline of STiLL-related research and its practical applications in Europe, the U.S.A., and Japan, with a view to encouraging Korean speech scientists and developers, as well as linguists and pedagogues, to cooperate further for more active participation in this field. The state-of-the-art technologies, academic groups, and conferences, as well as the major STiLL software packages, are introduced, followed by considerations on the current state and problems of STiLL in Korea and some suggestions for future development.
-
As language learning that utilizes speech and information processing technology is becoming popular, the Speech Information Technology & Promotion Center (SiTEC) has created and is distributing speech corpora for STiLL in order to support basic research and product development. We introduce the corpus for Korean and those for English which we have created and are distributing.
-
The purpose of this project is to develop a device that can automatically measure the pronunciation of English speech produced by Korean learners of English. Pronunciation proficiency will be measured largely in two areas: suprasegmental and segmental. In the suprasegmental area, intonation and word stress will be traced and compared with those of native speakers by statistical methods using tilt parameters. Durations of phones are also examined to measure the naturalness of speakers' pronunciations; to this end, statistical duration modeling from a large speech database using CART will be considered. For segmental measurement of pronunciation, the acoustic probability of a phone, a byproduct of forced alignment, will be the basis for scoring the pronunciation accuracy of that phone. The final score will be fed back to the learners to help them improve their pronunciation.
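The segmental scoring step can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the forced aligner returns a total acoustic log-likelihood and a frame count per phone, and the `floor` value is an arbitrary choice for mapping scores onto a 0-100 scale.

```python
def phone_pronunciation_score(log_likelihood, n_frames, floor=-10.0):
    """Map a phone's duration-normalized acoustic log-likelihood
    (a byproduct of forced alignment) onto a 0-100 score.

    `floor` is the per-frame log-likelihood treated as worst case:
    anything at or below it scores 0, while 0.0 scores 100.
    """
    avg = log_likelihood / max(n_frames, 1)   # normalize by duration
    avg = min(max(avg, floor), 0.0)           # clamp to [floor, 0]
    return 100.0 * (avg - floor) / (0.0 - floor)
```

An utterance-level score would then average these per-phone scores before presenting feedback to the learner.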
-
This paper analyzes the vocal tract area estimation algorithm used as part of a speech analysis program that helps deaf children correct their pronunciation by comparing their vocal tract shape with that of normal-hearing children. Assuming that the vocal tract is a concatenation of cylindrical tubes with different cross sections, we compute the relative vocal tract area of each tube using the reflection coefficients obtained from linear predictive coding. We then obtain the absolute vocal tract area by computing the height of the lip opening with a formula modified for children's speech. Using speech data for five Korean vowels (/a/, /e/, /i/, /o/, and /u/), we investigate the effects of sampling frequency, frame size, and model order, and compare vocal tract shapes obtained from deaf and normal-hearing children's speech.
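The tube-model computation can be sketched as below. This is a generic illustration of the standard lossless-tube relation, not the paper's code; the sign convention of the reflection coefficients varies between LPC implementations, so a real system must match the convention of its analysis library.

```python
def vocal_tract_areas(reflection_coeffs, lip_area=1.0):
    """Relative cross-sectional areas of a concatenated-tube vocal tract
    from LPC reflection coefficients, working back from the lip opening.

    Uses the lossless-tube relation A[m] = A[m+1] * (1 + k[m]) / (1 - k[m]).
    Returns areas ordered from glottis to lips.
    """
    areas = [lip_area]
    for k in reversed(reflection_coeffs):
        if abs(k) >= 1.0:
            raise ValueError("reflection coefficient must satisfy |k| < 1")
        areas.append(areas[-1] * (1.0 + k) / (1.0 - k))
    areas.reverse()
    return areas
```

Scaling `lip_area` by a measured lip-opening height, as the paper does with a child-specific formula, turns these relative areas into absolute ones.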
-
Music summarization extracts the representative section of a song, such as the chorus or motif. In previous work, the length of the summary was fixed, and the threshold used to determine the chorus section was so sensitive that tuning was needed; rapid changes of rhythm or variations in sound effects also caused chorus extraction errors. We suggest linear regression for extracting a variable-length summary and for minimizing the effect of threshold variation. The experimental results show that the proposed method outperforms the conventional one.
-
Voice activity detection (VAD) is important in many areas of speech processing technology. Speech/nonspeech discrimination in noisy environments is a difficult task because the feature parameters used for VAD are sensitive to the surrounding environment, so VAD performance is severely degraded at low signal-to-noise ratios (SNRs). In this paper, a new VAD algorithm is proposed based on the degree of voicing and the Quantile SNR (QSNR). These two feature parameters are more robust in noisy environments than other features such as energy and spectral entropy. The effectiveness of the proposed algorithm is evaluated under diverse noisy environments using the Aurora2 DB. According to our experiments, the proposed VAD outperforms the ETSI Advanced Front-end VAD.
-
For large-corpus-based TTS, the consistency of the speech corpus is very important, because inconsistency of speech quality in the corpus may result in distortion at concatenation points; because of this inconsistency, a large corpus must be tuned repeatedly. One of the causes of corpus inconsistency is the differing glottal characteristics of the sentences in the corpus. In this paper, we adjust the glottal characteristics of the speech in the corpus to prevent this distortion, and the experimental results are presented.
-
Voice conversion (VC) is a technique for modifying the speech signal of a source speaker so that it sounds as if it were spoken by a target speaker. Most previous VC approaches used a linear transformation function based on a GMM to convert the source spectral envelope to the target spectral envelope. In this paper, we propose several nonlinear GMM-based transformation functions in an attempt to deal with the over-smoothing effect of linear transformation. In order to obtain high-quality modifications of speech signals, our VC system is implemented using the Harmonic plus Noise Model (HNM) analysis/synthesis framework. Experimental results are reported on the English corpus MOCHA-TIMIT.
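The linear baseline that the paper improves on can be sketched in one dimension. The parameters here are illustrative placeholders; a real system estimates them from aligned source-target spectral envelopes and operates on multi-dimensional cepstral vectors rather than scalars.

```python
import math

def gmm_convert(x, weights, means, variances, biases, slopes):
    """Baseline GMM-based linear conversion:
    y = sum_i p_i(x) * (b_i + a_i * (x - mu_i)),
    where p_i(x) is the posterior of mixture component i given the
    source feature x. Posterior-weighted averaging over components
    is one source of the over-smoothing effect."""
    likes = [w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(likes)
    posts = [l / total for l in likes]
    return sum(p * (b + a * (x - m))
               for p, b, a, m in zip(posts, biases, slopes, means))
```

The paper's contribution is to replace the per-component linear term with nonlinear functions; the posterior-weighting machinery stays the same.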
-
The purpose of this study is to observe how Korean learners of English perceive the weak syllable in words with a WS syllable pattern. In an automated discrimination task using E-Prime, the proportion of correct answers for stimuli with same-word pairs (a-a, b-b) was higher, and the reaction time faster, than for different-word pairs (a-b, b-a). Specifically, for a-b or b-a stimuli, the familiarity (word frequency) of the stressed word following the weak syllable, and whether the weak syllable had a coda, were two important factors in distinguishing between words with and without the weak syllable. Even though Koreans with high English proficiency had faster reaction times than those with low English proficiency, all Korean learners had some difficulty perceiving the weak syllable at the beginning of the word.
-
In this paper, we compare normal speech with emotional speech (happy, sad, and angry states) through changes in fundamental frequency. The distribution charts of normal and emotional speech show distinctive cues such as the range of the distribution, average, maximum, and minimum. On the whole, the range of the fundamental frequency is extended in happy and angry states, while sad states make the range relatively narrow. Nevertheless, the f0 ranges in sad states are wider than in normal speech. In addition, we verify that ending boundary tones reflect information about the whole utterance.
-
The aim of this paper is to investigate the prosodic characteristics of Korean distant speech. Thirty-six two-syllable words produced by 4 speakers (2 males and 2 females) in both distant-talking and normal environments were used. The results showed that the ratios of the second syllable to the first syllable in vowel duration and vowel energy were significantly larger in the distant-talking environment than in the normal environment, and the f0 range was also bigger in the distant-talking environment. In addition, an 'HL%' contour boundary tone on the second syllable and/or an 'L+H' contour tone on the first syllable was used in the distant-talking environment.
-
The aim of this paper is to investigate the effects of postvocalic voicing (contrasting voiceless fricatives and affricates with voiced ones) on vowel duration. In particular, we focused on the durational differences between vowels followed by voiceless and voiced consonants across three groups of speakers: English speakers, English bilinguals, and Korean learners of English. The results of Experiment I showed that the durations of vowels preceding voiced fricatives and affricates, as well as voiced stops, are significantly longer than those preceding their voiceless counterparts. Experiment II indicated that the longer the subjects were exposed to an English-speaking society, the more similar their pronunciation became to that of native English speakers.
-
The aim of the present study is to compare vowel formants between generations of Daegu dialect speakers. Twenty Daegu dialect speakers participated in this study: 10 were in their 40s, and the other 10 in their 20s. As a result, the distances between /ㅣ/ and /ㅐ/, and between /ㅡ/ and /ㅓ/, are greater for speakers in their 20s than for those in their 40s, while the distance between /ㅗ/ and /ㅜ/ is smaller. It seems reasonable to conclude that vowels in the Daegu dialect are changing to occupy their own stable space, except for /ㅗ/ and /ㅜ/.
-
In this paper, we propose a language modeling approach to improve the performance of a large vocabulary continuous speech recognition system. The proposed approach is based on an active learning framework that helps select a text corpus from the large amount of text data required for language modeling. Perplexity is used as the measure for corpus selection in the active learning. In recognition experiments on a continuous Korean speech task, the speech recognition system employing the language model built by the proposed approach reduces the word error rate by about 6.6%, with less computational complexity than a system using a language model constructed from randomly selected texts.
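Perplexity-driven selection can be illustrated with a toy sketch. This is not the paper's system: it uses an add-one-smoothed unigram model where a real system would use an n-gram model, and the seed text, candidate sentences, and selection criterion (lowest perplexity = most in-domain) are illustrative assumptions.

```python
import math
from collections import Counter

def unigram_perplexity(sentence, counts, total, vocab_size):
    """Perplexity of a whitespace-tokenized sentence under an
    add-one-smoothed unigram model built from `counts`."""
    words = sentence.split()
    log_prob = sum(math.log((counts.get(w, 0) + 1) / (total + vocab_size))
                   for w in words)
    return math.exp(-log_prob / len(words))

def select_sentences(candidates, seed_text, k):
    """Pick the k candidate sentences whose perplexity under a
    seed-domain unigram model is lowest, i.e. the most in-domain text."""
    counts = Counter(seed_text.split())
    total = sum(counts.values())
    vocab = len(counts)
    return sorted(candidates,
                  key=lambda s: unigram_perplexity(s, counts, total, vocab))[:k]
```

The selected subset is then used to train the language model instead of a random sample of the full text collection.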
-
Language model adaptation for continuous speech recognition merges a baseline language model with an adapted language model built from an adaptation corpus containing domain-specific information. In this paper, an information retrieval technique based on language models was used to construct the adaptation corpus using only the corpus already held by the recognition system, without any additional data. To merge the adapted language model built from the retrieved corpus with the baseline language model, we propose a method that segments the input speech and finds the optimal dynamic interpolation coefficient for each segment. The proposed adaptation-corpus retrieval method and dynamic interpolation coefficients improved Korean broadcast news recognition performance by an absolute 3.6% over the baseline language model, and by a relative 13.6% over static interpolation coefficients estimated from held-out data.
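The interpolation step can be made concrete with a sketch. This is my illustration, not the paper's estimation procedure: it finds a per-segment weight by a simple grid search over the likelihood of the segment's word probabilities, where the paper derives the coefficient dynamically.

```python
import math

def best_interpolation_weight(probs_base, probs_adapt, grid=21):
    """Grid-search the weight lambda maximizing the likelihood of one
    speech segment under the interpolated model
    p(w) = lambda * p_adapt(w) + (1 - lambda) * p_base(w).

    `probs_base` / `probs_adapt` are the per-word probabilities the two
    language models assign to the segment's word sequence."""
    best_lam, best_ll = 0.0, float("-inf")
    for i in range(grid):
        lam = i / (grid - 1)
        ll = sum(math.log(lam * pa + (1 - lam) * pb)
                 for pb, pa in zip(probs_base, probs_adapt))
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam
```

A static scheme would fix one lambda for the whole test set; the per-segment search above is what "dynamic" interpolation buys.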
-
In this paper, we present POSSDM (POSTECH Situation-Based Dialogue Manager), a spoken dialogue system module that uses new example- and situation-based dialogue management techniques for effective generation of appropriate system responses. A spoken dialogue system should generate cooperative responses to smoothly control the dialogue flow with its users. We introduce a new dialogue management technique incorporating dialogue examples and situation-based rules for the EPG (Electronic Program Guide) domain. For system response inference, we automatically construct and index a dialogue example database from a dialogue corpus, and the best dialogue example is retrieved for a proper system response using a query built from the dialogue situation, including the current user utterance, dialogue act, and discourse history. When the dialogue corpus does not sufficiently cover the domain, we also apply manually constructed situation-based rules, mainly for meta-level dialogue management.
-
This paper suggests an algorithm that can estimate the direction of a sound source in real time. Our intelligent service robot, WEVER, is used to implement the proposed method in a home environment. The algorithm uses the time difference and sound intensity information among the sound signals recorded by four microphones. Also, to deal with the noise of the robot itself, a Kalman filter is implemented. The proposed method takes a shorter execution time than an existing algorithm, fitting the real-time service robot. The results show a relatively small error, within a range of ±7 degrees.
-
In this paper, we propose a sound localization algorithm for two simultaneous speakers. Because speech is a wide-band signal, there are many frequency sub-bands in which the two speech sounds are mixed. However, in some sub-bands one speech sound is more dominant than the other; in such sub-bands, the dominant speech sound suffers little interference from the other speech or from noise. In speech, the overtones of the fundamental frequency have large amplitude, forming what is called the 'harmonic structure' of speech, and sub-bands belonging to the harmonic structure are more likely to be dominant. Therefore, the proposed localization algorithm is based on the harmonic structure of each speaker. First, the sub-bands belonging to the harmonic structure of each speech signal are selected; then the two speakers are localized using the selected sub-bands. Simulation results show that localization using the selected sub-bands is more efficient and precise than localization using all sub-bands.
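The localization step within each selected sub-band ultimately rests on estimating the inter-microphone time difference. A minimal cross-correlation sketch of that generic technique is given below; it is not the authors' algorithm, and it operates on raw sample sequences where a real system would use band-passed signals per sub-band.

```python
def estimate_delay(ref, sig, max_lag):
    """Estimate the delay (in samples) of `sig` relative to `ref`
    by maximizing the cross-correlation over lags in [-max_lag, max_lag].
    The sign tells which microphone the wavefront reached first."""
    best_lag, best_corr = 0, float("-inf")
    n = len(ref)
    for lag in range(-max_lag, max_lag + 1):
        corr = sum(ref[i] * sig[i + lag]
                   for i in range(n)
                   if 0 <= i + lag < n)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag
```

Given the microphone spacing and sampling rate, the estimated lag converts to an arrival-angle estimate for the speaker dominating that sub-band.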
-
Foreign accent syndrome refers to segmental and suprasegmental changes in speech characteristics following a brain lesion, which listeners perceive as a foreign accent. A change of dialect after a stroke, however, has rarely been reported. We describe a patient who showed a prominent change of accent from one Korean dialect to another, and discuss the alteration of prosodic patterns and the changes at the segmental level of speech.
-
Previous studies on bilinguals' lexical selection have suggested some evidence in favor of the language-specific hypothesis. The purpose of this study was to see whether Korean-English bilinguals' semantic systems for Korean and English are shared or separated between the two languages. In a series of picture-word interference tasks, participants named pictures in Korean or in English with distractor words printed in either Korean or English. The distractor words were semantically identical, related, or unrelated to the picture, or absent. Naming response times were facilitated when distractor words were semantically identical to the picture, for both same-language pairs (naming pictures in English/Korean with English/Korean distractor words) and different-language pairs (naming pictures in English with Korean distractor words, and vice versa), but this facilitation effect was stronger when naming was in the participants' native language, in this case Korean. An inhibitory effect was also found when the picture and its distractor word were semantically related, in both same- and different-language conditions. These results show that bilinguals' two lexicons compete to some extent when selecting the target word. From this viewpoint, it can be concluded that the lexicons of the two languages may be partly, though not entirely, overlapping in bilinguals.
-
In this study we investigated the cognitive neuropsychological characteristics and underlying mechanism of a letter-by-letter reading dyslexic patient after a cerebral infarct of the left posterior cerebral artery, using fMRI. In the cognitive neuropsychological assessment, visual perception was appropriate, and the semantic categorization, picture naming, and picture-word matching tasks were each above 83% correct. However, the patient performed very poorly in the lexical decision task. The selective reading impairment is thought to result from disruption of the left occipitotemporal region, including the fusiform gyrus. In the fMRI results, the activation level increased in the right occipitotemporal region, including the fusiform gyrus, compared with the normal group, in compensation for the left-side impairment, and increased more in the pseudoword reading task than in word reading, on account of familiarity.
-
The purpose of this study is to identify acoustic changes according to age and to provide evaluation criteria for the elderly voice. A total of 120 Korean adults (three age groups × two sex groups) produced three sustained vowels, read a part of 'Taking a walk', and described a picture. The data were analyzed acoustically with the MDVP of CSL. The results showed that: 1) F0 showed the most statistically significant changes with sex and age among the parameters, while shimmer showed no significant change; 2) the acoustic parameters changed from young adulthood to old age, and different patterns of change with aging were observed in men and women.
-
The performance of speech recognition in a car environment is severely degraded when there is music or news coming from a radio or a CD player. Since reference signals are available from the audio unit in the car, it is possible to remove them with an adaptive filter. In this paper, we present experimental results of speech recognition in a car environment using an echo canceller. For this, we generate test speech signals by adding music or news to the noisy car speech from the Aurora2 DB. An HTK-based continuous HMM system is used as the recognition system. In addition, the MMSE-STSA method is applied to the output of the echo canceller to further remove residual noise.
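The adaptive-filter idea can be sketched with a normalized LMS (NLMS) canceller. This is a generic single-channel illustration, not the paper's configuration; the filter order and step size are placeholder values, and a real canceller would model a much longer acoustic echo path.

```python
def nlms_echo_cancel(reference, mic, order=4, mu=0.5, eps=1e-8):
    """Normalized LMS adaptive filter: models the path from the audio
    unit's reference signal to the microphone and subtracts the
    estimated echo, returning the error (echo-removed) signal."""
    w = [0.0] * order
    out = []
    for n in range(len(mic)):
        # Most recent `order` reference samples (zero-padded at start).
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(order)]
        y = sum(wi * xi for wi, xi in zip(w, x))   # estimated echo
        e = mic[n] - y                             # echo-removed sample
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out
```

In the paper's pipeline, this output is additionally processed with MMSE-STSA enhancement before being passed to the recognizer.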
-
In the multi-channel microphone environment that is essential for improving recognition accuracy in distant speech recognition, we aim to control devices in the surrounding environment, such as the TV and lighting, by voice, using far-field microphones distributed around a room. To obtain the optimal result by combining the recognition results of the individual channels, a Bayesian network is trained on each channel's N-best results and the frame-normalized likelihood values of the hypotheses in those N-best lists, and is used to decide the best overall result. This improves distant speech recognition performance and points toward making hands-free applications practical.
-
Keyword spotting is effective for finding keywords in continuously pronounced speech. However, non-keywords may be accepted as keywords when environmental noise occurs or the speaker changes. To overcome this performance degradation, utterance rejection techniques using confidence measures on the recognition result have been developed. In this paper, we apply DTW to an HMM-based broadcasting news keyword spotting system to reject non-keywords. Experimental results show that the false acceptance rate is decreased to 50%.
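The DTW rejection stage rests on the standard dynamic time warping distance, sketched below. This is the textbook recursion, not the paper's implementation; it compares 1-D feature sequences, whereas a real system aligns multi-dimensional frame vectors, and the rejection threshold is a placeholder.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences,
    using the classic (insertion, deletion, match) recursion."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def accept_keyword(candidate, template, threshold=1.0):
    """Reject a spotted candidate whose length-normalized DTW distance
    to the keyword template exceeds the (placeholder) threshold."""
    return dtw_distance(candidate, template) / len(candidate) <= threshold
```

Here a candidate flagged by the HMM spotter is accepted only if it also aligns well with the keyword template, which is what suppresses false acceptances.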
-
Speaker identification is the process of automatically identifying who is speaking on the basis of information obtained from speech waves. In the training phase, each speaker's model is trained using that speaker's speech data. GMMs (Gaussian Mixture Models), which have been successfully applied to speaker modeling in text-independent speaker identification, are not efficient when training data are insufficient. This paper proposes a speaker modeling method using MLLR (Maximum Likelihood Linear Regression), which is used for speaker adaptation in speech recognition: we build an SD-like model using MLLR adaptation instead of a speaker-dependent (SD) model. The proposed system outperforms GMMs in the small-training-data condition.
-
We propose a speech emotion recognition method for a natural human-robot interface. In the proposed method, emotion is classified into six classes: angry, bored, happy, neutral, sad, and surprised. Features for an input utterance are extracted from statistics of phonetic and prosodic information. The phonetic information includes log energy, shimmer, formant frequencies, and Teager energy; the prosodic information includes pitch, jitter, duration, and rate of speech. Finally, a pattern classifier based on Gaussian-kernel support vector machines decides the emotion class of the utterance. We recorded speech commands and dialogs uttered 2 m away from the microphones in five different directions. Experimental results show that the proposed method yields 59% classification accuracy, while human classifiers achieve about 50% accuracy, confirming that the proposed method performs comparably to humans.
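The feature extraction stage can be illustrated with a small sketch. This is not the authors' code: the statistics computed here, including a simplified jitter defined as the mean absolute frame-to-frame pitch change, stand in for the fuller phonetic/prosodic feature set the paper lists, and the Gaussian-kernel SVM classifier itself is omitted.

```python
def utterance_features(pitch, energy):
    """Utterance-level summary statistics of the kind fed to the
    emotion classifier: pitch mean and range, a simplified jitter
    (mean absolute frame-to-frame pitch change), and mean energy."""
    def mean(xs):
        return sum(xs) / len(xs)

    jitter = mean([abs(pitch[i] - pitch[i - 1]) for i in range(1, len(pitch))])
    return {
        "pitch_mean": mean(pitch),
        "pitch_range": max(pitch) - min(pitch),
        "jitter": jitter,
        "energy_mean": mean(energy),
    }
```

Such a fixed-length statistics vector is what lets variable-length utterances be classified by a standard vector classifier like an SVM.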
-
The aim of this paper is to analyze the phonetic features of lexical contrastive focus and segmental contrastive focus. I used two variables to study the realization of contrastive focus: one is the three phonation types of Korean plosives (lenis, fortis, and aspirated); the other is the position of the segmental contrastive focus syllable within a word. I examined pitch, duration, intensity, VOT, formants, and so on. The realization of focus differs by phonation type and by the position of the focused syllable.
-
We present a new method for the measurement and analysis of the volume of the vocal tract using 3D magnetic resonance images. The relative ratios of volumes A, B, and C, which are divided by the two constriction points formed on the horizontal and vertical planes of the vocal tract, play a decisive role in discriminating Korean monophthongs. Together with F1-F2 and the minimum cross-sectional area of the vocal tract, the relative ratios of the regional volumes proved to be statistically significant parameters.
-
This study investigates the asymmetry between acoustics and perception. The examined cues are closure duration, closure voicing, VOT, release, pre-vowel duration, and post-vowel duration. Five native speakers of English and 30 Korean college students participated in the present study. The results showed that high-level Korean learners of English paralleled native speakers in their responses, while mid- and low-level Korean learners differed substantially from the natives.
-
The aim of this paper is to examine why nouns with /kh, ph, ts, th/ as the final phoneme have changed. Assuming that these changes are related to aspects of word usage, we collected the word frequencies and phonetic forms of the words. The results are as follows: ① The realization of the standard phonetic form is related to the frequency of non-omissible case markers combined with the word. ② The change of a coronal consonant into /s/ is related to the case marker [i].
-
We present a survey of evaluation methods for speech recognition technology and propose a procedure for evaluating Korean speech recognition systems. Currently, various evaluation campaigns are conducted by NIST and ELDA every year. In this paper, we introduce these activities and propose an evaluation procedure for Korean speech recognition systems. In designing the procedure, we consider the characteristics of the Korean language as well as trends in the Korean speech technology industry.
-
This paper suggests guidelines for evaluating Korean text-to-speech systems in various aspects. Guidelines are suggested in terms of text analysis, intelligibility testing, and naturalness testing, covering both general and system-specific criteria.
-
This paper introduces the current status of, and standardization activities on, XML-based metadata for industrial speech DBs. Industrial speech DBs require a great deal of time and money to build, and developing high-quality speech processing systems (recognition/synthesis/verification) requires as much speech data as possible. The standardization of industrial speech DB metadata aims to facilitate the sharing and reuse of speech DBs built by different organizations; requirements analysis began in September 2004, and a draft was completed in March 2005. The draft standard defines the structure of speech DB metadata in XML; standardization targets outside that structure, such as speech file names, speaker identifiers, and phoneme symbols, are not covered. ETRI and SiTEC [5] had already proposed an XML-based metadata structure and content standard, but the structure proposed in [5] is flat and suffers from drawbacks such as content duplication; to remedy this, we designed the speech DB data model in an object-oriented fashion.
-
Kwon, Oh-Wook; Kwon, Suk-Bong; Jang, Gyu-Cheol; Yun, Sung-rack; Kim, Yong-Rae; Jang, Kwang-Dong; Kim, Hoi-Rin; Yoo, Chang-Dong; Kim, Bong-Wan; Lee, Yong-Ju
This paper reports the current status of development of the Korean speech recognition platform (ECHOS). We implemented new modules including ETSI feature extraction, backward search with a trigram, and utterance verification. The ETSI feature extraction module was implemented by converting the public software into an object-oriented program. We show that trigram language modeling in the backward search pass reduces the word error rate from 23.5% to 22% on a large-vocabulary continuous speech recognition task. We confirmed the utterance verification module by examining word graphs with confidence scores.
-
This paper describes the design and implementation of the language processing module for a Korean TTS system. The implemented module performs morphological analysis, part-of-speech tagging, and grapheme-to-phoneme conversion, outputting the most appropriate pronunciation sequence for a given sentence together with the part of speech corresponding to each phoneme. The program is implemented in standard C and has been confirmed to run on both Windows and Linux. A morpheme dictionary was built from a manually POS-tagged corpus of 45,000 eojeols (Korean word units); assuming all words are registered in the dictionary, the eojeol-level error rate was 3.25% on 488 test sentences.
-
This paper summarizes some recent findings with respect to how prosodic structure is manifested in fine-grained phonetic details and how such phonetic manifestation of prosodic structure may be exploited in spoken word recognition.