Proceedings of the KSPS conference (대한음성학회:학술대회논문집)
The Korean Society Of Phonetic Sciences And Speech Technology
- Semi Annual
2006.11a
-
This paper proposes a confidence measure for utterance verification in noisy environments. Most conventional approaches estimate a proper threshold for the confidence measure and apply that value to utterance rejection during recognition. As such, their performance may degrade for noisy speech, since the appropriate threshold can change in noisy environments. This paper presents a more robust confidence measure based on a multi-pass confidence measure. Experimental results on isolated word recognition demonstrate that the proposed method outperforms conventional approaches as an utterance verifier.
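A minimal sketch of how such a verifier operates, using a generic frame-normalized log-likelihood-ratio confidence measure; the abstract does not specify the actual multi-pass measure, so the formula, names, and threshold here are illustrative assumptions:

```python
def confidence_measure(loglik_word, loglik_filler, num_frames):
    """Frame-normalized log-likelihood ratio between the recognized
    word model and a filler/anti-model: a standard confidence measure
    for utterance verification (illustrative, not the paper's exact
    multi-pass measure)."""
    return (loglik_word - loglik_filler) / num_frames

def verify(loglik_word, loglik_filler, num_frames, threshold=0.5):
    """Accept the recognition hypothesis only if the confidence exceeds
    a threshold -- the quantity whose noise sensitivity the paper targets."""
    return confidence_measure(loglik_word, loglik_filler, num_frames) >= threshold
```

The point of the paper is precisely that a fixed `threshold` degrades in noise, motivating a measure that is more stable across conditions.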
-
This paper presents a probabilistic method for measuring the degree to which noise is masked, based on a speech model. We describe how to compute a 'noise masking probability' as the criterion for measuring the degree of noise masking, and examine its characteristics. We then apply the noise masking probability to improving the performance of speech recognition feature vectors in noisy environments. The proposed method was evaluated on the Aurora2 database, proposed by ETSI as a standard test set for speech recognition, and achieved a 16.58% performance improvement over the existing algorithm.
-
The present paper focuses on the interaction between lexical-semantic information and affective prosody. More specifically, we explore whether affective prosody influences the evaluation of the affective meaning of a word. To this end, we asked participants to listen to words recorded with affective prosody and to evaluate their emotional content. Results showed that, first, emotional evaluation was slower when the word meaning was negative than when it was positive. Second, evaluation was faster when the prosody of a word was negative than when it was neutral or positive. Finally, response times were faster when the affective meaning of the word and its prosody were congruent than when they were incongruent.
-
The role of a dialogue manager is to select proper actions based on the observed environment and the inferred user intention. This paper presents a stochastic dialogue manager based on a Markov decision process (MDP). To build a mixed-initiative dialogue manager, we used the accumulated user utterances, the previous act of the dialogue manager, and domain-dependent knowledge as inputs to the MDP. We also used a dialogue corpus to train an automatically optimized MDP policy with a reinforcement learning algorithm. States that have unique and intuitive actions were removed from the MDP design by using the domain knowledge. The dialogue manager incorporates natural language understanding and a response generator to provide short-message-based remote control of home-networked appliances.
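The policy-optimization step described above can be sketched with tabular Q-learning on a toy slot-filling task; the states, actions, rewards, and environment below are invented for illustration and are not the paper's actual MDP design:

```python
import random
from collections import defaultdict

# Toy dialogue MDP: the state is whether the device slot is filled;
# actions and rewards are illustrative, not the paper's design.
ACTIONS = ["ask_device", "confirm", "execute"]

def step(state, action):
    """Deterministic toy environment: +10 for executing once the slot
    is filled, small penalties otherwise. Returns (next_state, reward, done)."""
    if action == "ask_device":
        return True, -1, False          # asking fills the slot at a small cost
    if action == "execute" and state:
        return state, 10, True          # task success ends the episode
    return state, -2, False             # useless action

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1):
    """Epsilon-greedy tabular Q-learning over the toy dialogue MDP."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = False, False
        for _ in range(10):
            a = random.choice(ACTIONS) if random.random() < eps else \
                max(ACTIONS, key=lambda x: Q[(state, x)])
            nxt, r, done = step(state, a)
            target = r + (0.0 if done else gamma * max(Q[(nxt, x)] for x in ACTIONS))
            Q[(state, a)] += alpha * (target - Q[(state, a)])
            state = nxt
            if done:
                break
    return Q
```

After training, the greedy policy asks for the device when the slot is empty and executes once it is filled, which is the mixed-initiative behavior a corpus-trained MDP policy is meant to learn.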
-
This paper compares prosodic phrasing in Korean spontaneous speech and read speech. For this comparison, the subjects read transcriptions of their own spontaneous speech. The number of intonational phrases (IPs) is greater in spontaneous speech than in read speech, while the number of accentual phrases (APs) shows no difference between them. An accentual phrase in spontaneous speech contains fewer syllables than one in read speech.
-
This paper describes how a domain-dependent pronunciation lexicon is generated and optimized for Korean large vocabulary continuous speech recognition (LVCSR). At the lexicon level, pronunciation variation is usually modeled by adding pronunciation variants to the lexicon. We propose two criteria for selecting appropriate pronunciation variants: (i) likelihood and (ii) frequency. Our experiment is conducted in three steps. First, variants are generated with knowledge-based rules. Second, we generate domain-dependent lexica that include varying numbers of pronunciation variants based on the proposed criteria. Finally, the WERs and RTFs are examined for each lexicon. In the experiment, a 0.72% WER reduction is obtained by introducing the variant pruning criteria. Furthermore, the RTF does not deteriorate even though the average number of variants is higher than in the compared lexica.
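The two selection criteria can be sketched as a single pruning pass over a lexicon; the data layout (variant, alignment count, log-likelihood), the thresholds, and the example pronunciations are illustrative assumptions, not the paper's actual values:

```python
import math

def prune_variants(lexicon, min_count=2, min_rel_likelihood=0.1):
    """Keep a pronunciation variant only if it satisfies both criteria
    from the abstract: it is frequent enough (count) and its likelihood
    is not far below the word's best variant. Each lexicon entry is a
    list of (pronunciation, count, log_likelihood) tuples; thresholds
    are illustrative."""
    pruned = {}
    for word, variants in lexicon.items():
        best_ll = max(ll for _, _, ll in variants)
        floor = best_ll + math.log(min_rel_likelihood)   # relative cutoff
        kept = [(p, c, ll) for p, c, ll in variants
                if c >= min_count and ll >= floor]
        # Always keep at least the single best variant for the word.
        pruned[word] = kept or [max(variants, key=lambda v: v[2])]
    return pruned
```

Pruning with both criteria is what lets the lexicon carry more useful variants on average without increasing the RTF.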
-
Research in the Center for Pediatric Auditory and Speech Sciences (CPASS) is attempting to characterize or phenotype children with speech delays based on acoustic-phonetic evidence and to relate those phenotypes to chromosome loci believed to be related to language and speech. To achieve this goal we have adopted a highly interdisciplinary approach that merges fields as diverse as automatic speech recognition, human genetics, neuroscience, epidemiology, and speech-language pathology. In this presentation I will trace the background of this project and the rationale for our approach. Analyses based on a large amount of speech recorded from 18 children with speech delays will be presented to illustrate the approach we will be taking to characterize the acoustic-phonetic properties of disordered speech in young children. The ultimate goal of our work is to develop non-invasive and objective measures of speech development that can be used to better identify which children with apparent speech delays are most in need of, or would receive the most benefit from, the delivery of therapeutic services.
-
Much work has been done in the field of retrieving audio segments that contain human speech without captions. To retrieve newly coined words and proper nouns, subwords have commonly been used as indexing units in conjunction with query or document expansion. Among these, document expansion with subwords has the serious drawback of a large computational overhead. In this paper, we therefore propose an Expected Matching Score based document expansion that effectively reduces the computational overhead without much loss in retrieval precision. Experiments have shown a 13.9-fold speed-up at a loss of only 0.2% in retrieval precision.
-
In this study, we investigate the ways focus is realized in English utterances produced by native speakers of English and by Korean learners. Compared to previous studies, which deal mainly with functional aspects of focus as a part of intonational structure, we attempt to provide more quantitative information on F0 and to discover the extent to which Korean learners distinguish focus types in their English utterance production. On test sentences designed to be disambiguated by correct focus realization, we find that, unlike native speakers, even advanced-level Korean learners hardly employ F0 to clarify the specific meaning of English utterances.
-
The purpose of this paper is to build a pronunciation lexicon whose phonetic-rule likelihoods are estimated from actual phonetic realizations, and thereby to improve the performance of CSR using this dictionary. In the baseline system, the phonetic rules and their application probabilities are defined using knowledge of Korean phonology and experimental tuning. The advantage of this approach is that the phonetic rules are easy to implement and give stable results on general domains. A possible drawback, however, is that it is hard to reflect the characteristics of the phonetic realizations in a specific domain. To make the system reflect actual phonetic realizations, the likelihoods of the phonetic rules are reestimated from the statistics of the realized phonemes obtained by forced alignment. In our experiment, we generate new lexica that include pronunciation variants created by the reestimated phonetic rules, and test their performance with 12-Gaussian-mixture HMMs and back-off bigrams. The proposed method reduced the WER by 0.42%.
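The reestimation step reduces to counting, over forced-alignment results, how often each rule actually applied where it could have; the observation format and the rule names in the example are hypothetical:

```python
from collections import Counter

def reestimate_rule_probs(alignments):
    """Re-estimate each phonetic rule's application probability from
    forced-alignment outcomes. `alignments` is a list of
    (rule_id, was_applied) observations: one entry per context in which
    the rule could fire, with was_applied=True when the aligner chose
    the variant produced by the rule. Relative-frequency estimate."""
    applied, total = Counter(), Counter()
    for rule_id, was_applied in alignments:
        total[rule_id] += 1
        if was_applied:
            applied[rule_id] += 1
    return {rule: applied[rule] / total[rule] for rule in total}
```

These corpus-driven probabilities then replace the hand-tuned ones when generating pronunciation variants for the domain lexicon.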
-
It has been claimed that the Daegu dialect does not have /ㅆ/ as a phoneme. However, /ㅅ/ and /ㅆ/ appear to be phonemically distinctive for the younger generation. In this paper, we investigate the realization of /ㅅ/ and /ㅆ/ by Daegu dialect speakers in their 20s, and compare them with /ㅅ/ and /ㅆ/ produced by Seoul dialect speakers in their 20s. The results show that /ㅅ/ and /ㅆ/ did not differ significantly between the Daegu and Seoul dialects except in pitch. Therefore, in the Daegu dialect, /ㅅ/ and /ㅆ/ are phonemically distinctive for the younger generation, just as they are in the Seoul dialect.
-
Steered response power (SRP) based algorithms localize sound sources with a focused beamformer that steers the array to various locations and searches for a peak in the output power. SRP-PHAT, a phase-transformed variant of SRP, shows high accuracy but requires a large amount of computation time. This paper proposes an algorithm that clusters the search space in advance to reduce the computation time of SRP-based algorithms.
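For a single microphone pair, the SRP-PHAT functional reduces to evaluating the PHAT-weighted cross-correlation at the candidate delay implied by each candidate location; this sketch assumes that reduction (candidate positions collapsed to inter-microphone delays in samples):

```python
import numpy as np

def gcc_phat(x1, x2):
    """PHAT-weighted generalized cross-correlation of two mic signals:
    whiten the cross-spectrum so only phase (i.e. delay) information
    remains, which is what makes SRP-PHAT robust to reverberation."""
    n = len(x1) + len(x2)                       # zero-pad for linear correlation
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12              # phase transform weighting
    return np.fft.irfft(cross, n)

def srp_phat(x1, x2, candidate_delays):
    """Score each candidate delay d (mic 2 receiving the source d samples
    after mic 1) by the steered response power; the true delay gives the
    peak. A full SRP-PHAT sums such scores over all microphone pairs."""
    cc = gcc_phat(x1, x2)
    return [cc[-d % len(cc)] for d in candidate_delays]
```

The search over `candidate_delays` is the per-location loop whose cost the proposed clustering of the search space is designed to cut down.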
-
The purpose of this study is to find acoustic parameters in the frequency domain that distinguish the Korean nasals /m, n, ng/ from one another. Since it is not easy to characterize the antiformant in the frequency domain, we suggest new parameters calculated from the LTAS (long-term average spectrum). The maximum energy value and its frequency, and the minimum (zero) energy value and its frequency, are obtained from the spectrum. In addition, slope1, slope2, total energy, centroid, skewness, and kurtosis are suggested as new parameters. The parameters revealed to be statistically significantly different are, roughly, peak1_a, zero_f, slope_1, slope_2, highENG, zero_ENG, and centroid.
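The LTAS and its spectral-moment parameters (centroid, skewness, kurtosis) can be computed as follows; this uses one common formulation, treating the spectrum as a distribution over frequency, which may differ in detail from the paper's exact definitions:

```python
import numpy as np

def ltas(frames):
    """Long-term average spectrum: magnitude spectrum averaged over
    all analysis frames of the segment."""
    return np.mean([np.abs(np.fft.rfft(f)) for f in frames], axis=0)

def spectral_moments(spectrum, sample_rate):
    """Centroid, skewness, and kurtosis of a magnitude spectrum,
    treating normalized magnitude as a probability mass over frequency."""
    freqs = np.linspace(0, sample_rate / 2, len(spectrum))
    p = spectrum / spectrum.sum()
    centroid = (freqs * p).sum()
    var = ((freqs - centroid) ** 2 * p).sum()
    skew = ((freqs - centroid) ** 3 * p).sum() / var ** 1.5
    kurt = ((freqs - centroid) ** 4 * p).sum() / var ** 2
    return centroid, skew, kurt
```

The peak/zero parameters (peak1_a, zero_f, etc.) would then be read off the same LTAS as the locations and amplitudes of its maxima and minima.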
-
Three experiments were conducted to determine the exact locus of the word frequency effect in speech production. In Experiment 1, a picture naming task was used to test whether the word frequency effect is due to the processes involved in lexical access. A robust word frequency effect of 31 ms was obtained. The question addressed in Experiment 2 was whether the word frequency effect originates at the level where a lemma is selected. To this end, using a picture-word interference task, we tested the significance of the interactions among the effects of target frequency, distractor frequency, and semantic relatedness. The interaction between distractor frequency and semantic relatedness was significant, and the interaction between target and distractor frequency showed a significant tendency. The results of Experiment 2 thus suggest that the mechanism underlying the word frequency effect is encoded as different resting activation levels of lemmas. Experiment 3 explored whether the word frequency effect can instead be attributed to the lexeme level, where the phonological information of words is represented. The methodological logic of Experiment 3 was the same as that of Experiment 2. No interaction was significant. In conclusion, the present study obtained evidence supporting two assumptions: (a) the locus of the word frequency effect lies in the processes involved in lemma selection, and (b) the mechanism of the word frequency effect is encoded as different resting activation levels of lemmas. To explain the word frequency effect obtained in this study, the core assumptions of current production models need to be modified.
-
This paper presents a dialogue interface using a dialogue management system as a method for controlling home appliances in Home Network Services. In order to realize this type of dialogue interface, we first investigated user requirements for Home Network Services by analyzing dialogues entered by users. Based on this analysis, we extracted 15 user intentions and 22 semantic components. In our study, example dialogues were collected in a WOZ (Wizard-of-Oz) environment to implement a reasoning model that generates meaningful responses for an example-based dialogue modeling technique. An overview of the Home Network Control System using the proposed dialogue interface is presented. Lastly, we show that the Dialogue Management System trained on our collected dialogues behaves properly in its task of controlling Home Network appliances, going through the steps of natural language understanding, response reasoning, and response generation.
-
In this paper, we implement an audio playback system for virtual reality that provides 3D audio effects to listeners. In general, such a 3D audio playback system uses a sound localization technique based on the head-related transfer function (HRTF) to generate the 3D audio effect. However, the 3D audio effect is degraded by crosstalk in a stereo loudspeaker environment. To enhance the 3D sound effect, we implement the crosstalk cancellation technique proposed by Atal and Schroeder and apply it to the 3D audio system.
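At its core, the Atal-Schroeder canceller inverts the 2x2 matrix of loudspeaker-to-ear frequency responses at every frequency bin; this sketch assumes a symmetric listening setup (equal ipsilateral paths and equal contralateral paths), a common simplification rather than the paper's exact implementation:

```python
import numpy as np

def crosstalk_canceller(h_same, h_cross):
    """Frequency-domain crosstalk canceller for a symmetric setup:
    invert the acoustic path matrix [[h_same, h_cross],
    [h_cross, h_same]] per frequency bin, so each binaural signal
    reaches only its intended ear. h_same / h_cross are the complex
    ipsilateral / contralateral loudspeaker-to-ear responses."""
    det = h_same ** 2 - h_cross ** 2     # matrix determinant per bin
    c_same = h_same / det                # diagonal canceller filter
    c_cross = -h_cross / det             # anti-diagonal canceller filter
    return c_same, c_cross
```

Multiplying the path matrix by the canceller matrix yields the identity at every bin, i.e. the contralateral (crosstalk) component is driven to zero; in practice `det` must be regularized where the paths nearly cancel.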
-
This paper discusses the importance of silent pauses in the perception of prosodic boundaries in Korean speech. It is suggested that in speech in general, and in spontaneous speech in particular, silent pauses are neither necessary nor sufficient for the perception of prosodic boundaries. In read speech, however, there is a high correlation between the presence of a pause and the perception of a boundary. An experiment was carried out to determine whether removing the silent pause from an extract of speech had a significant effect on the perception of boundaries in Korean read speech. Results suggest that while the presence of a silent pause slightly reinforces the perception of a prosodic boundary, subjects are in general capable of perceiving the boundary without the silent pause.
-
Ahn, Se-Yeol;Park, Sung-Chan;Park, Seong-Soo;Koo, Myung-Wan;Jeong, Yeong-Joon;Kim, Myung-Sook
The provision of personalized user interfaces for mobile devices is expected to support different devices with a wide variety of capabilities and interaction modalities. In this paper, we implemented a multimodal context-aware middleware incorporating XML-based languages such as XHTML, VoiceXML, and SCXML. SCXML uses parallel states to invoke both XHTML and VoiceXML content, as well as to gather composite multimodal inputs and synchronize modalities through man-machine I/O. We developed a home networking service named "HomeN" based on our middleware framework. It demonstrates that users can maintain multimodal scenarios in a clear, concise, and consistent manner under various user interactions.
-
The purpose of this study is to examine the differences in acoustic features between Younger Voices and Aged Voices that actually come from the same age group. Twelve female subjects in their thirties participated, recording a sustained vowel /a/, connected speech, and reading. Their voices were divided into Younger Voices and Aged Voices, i.e., voices that sound like a younger person and voices that sound their age or older. Praat 4.4.22 was used to record the speech and analyze acoustic features such as F0, SFF, jitter, shimmer, HNR, and pitch range. Six female listeners guessed the subjects' ages and judged whether they sounded younger or their actual age. We used an independent t-test to find significant differences between the two groups' acoustic features. The results show significant differences in F0 and SFF. These results and previous studies indicate that the group that sounds younger or baby-like has acoustic features similar to those of actually young people.
-
The quality of narrowband speech (0~4 kHz) can be enhanced by bandwidth expansion techniques, which estimate the high-band components. This paper proposes a bandwidth expansion method using spline-codebook-based spectral folding. For the performance evaluation, PESQ (Perceptual Evaluation of Speech Quality) scores are measured as the objective measure. In addition, MOS (Mean Opinion Score) and preference tests are performed as subjective measures. The results show that our proposed method outperforms the existing spline-based one.
-
In this paper, we introduce a method for extracting grapheme-to-phoneme conversion rules from the transcriptions of a speech synthesis database, and a prosody modeling method using a light version of ToBI, for a Korean conversational-style TTS. We focused on representing the characteristics of the conversational speech style, and the experimental results show that our proposed methods are suitable for developing a Korean conversational-style TTS.
-
Korean college students' experience studying English overseas plays a significant role in their perception, but not in their production. The Korean group with almost one year of foreign residence shows a production pattern for the pre-vowel signal similar to that of its counterpart, the non-experienced Korean group. In contrast, the experienced Korean group shows a perceptual pattern for word-final unreleased stops similar to that of native speakers.
-
Discrimination between speech and music is important in many multimedia applications. We previously proposed a new parameter for speech/music discrimination, the mean of minimum cepstral distances (MMCD), which outperformed conventional parameters. One weakness is that its performance depends on the range of candidate frames used to compute the minimum cepstral distance, so the range must be selected optimally by experiment. In this paper, to alleviate this problem, we propose a multi-dimensional MMCD parameter consisting of multiple MMCDs with different candidate-frame ranges. Experimental results show that the multi-dimensional MMCD parameter yields an error rate reduction of 22.5% compared with the optimally chosen one-dimensional MMCD parameter.
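One plausible formulation of the MMCD and its multi-dimensional extension is sketched below; the distance measure (Euclidean over cepstral vectors), the forward-looking candidate convention, and the example ranges are assumptions, since the abstract does not fix them:

```python
import numpy as np

def mmcd(cepstra, lo=5, hi=20):
    """Mean of minimum cepstral distances over one candidate range:
    for each frame t, take the minimum Euclidean distance between its
    cepstral vector and those of frames t+lo .. t+hi, then average.
    Speech (fast spectral change) tends to score higher than music.
    `cepstra` is a (frames x coefficients) array."""
    dists = []
    for t in range(len(cepstra) - hi):
        cands = cepstra[t + lo : t + hi + 1]
        dists.append(np.min(np.linalg.norm(cands - cepstra[t], axis=1)))
    return float(np.mean(dists))

def multi_mmcd(cepstra, ranges=((2, 8), (5, 20), (10, 40))):
    """Multi-dimensional MMCD: stack the MMCDs computed with several
    candidate-frame ranges into one feature vector, so no single range
    has to be tuned optimally (the ranges here are illustrative)."""
    return [mmcd(cepstra, lo, hi) for lo, hi in ranges]
```

A classifier over the stacked vector can then exploit whichever range is informative for a given signal, which is the motivation for the multi-dimensional parameter.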