Proceedings of the KSPS conference (대한음성학회:학술대회논문집)
The Korean Society Of Phonetic Sciences And Speech Technology
- Semi Annual
Domain
- Linguistics > Linguistics, General
2006.05a
-
The purpose of this paper is to investigate the actual pronunciation of words of foreign origin on TV news programs and to review the related regulations. The frequency data of the National Korean Language Institute is used as the subject of investigation. There is a wide gap between the actual pronunciation and the orthography of words of foreign origin, and a received pronunciation of foreign words is needed to teach and learn Korean efficiently. I suggest that the pronunciation of foreign words be marked in Korean dictionaries instead of revising the related regulations.
-
This paper describes a speech recognizer implemented on PDAs. The recognizer consists of a feature extraction module, a search module, and an utterance verification module. It can recognize 37 words used in a telematics application, and fixed-point operations are performed for real-time processing. Simulation results show that recognition accuracy is 94.5% for in-vocabulary words and 56.8% for out-of-task words.
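The fixed-point idea can be sketched as follows. This is only an illustration of Q15 arithmetic in Python (the paper's PDA implementation is not described in detail); the function names and the Q15 format choice are assumptions.

```python
import numpy as np

Q = 15  # Q15 fixed-point format: 1 sign bit, 15 fractional bits

def float_to_q15(x):
    """Quantize floats in [-1, 1) to signed 16-bit Q15 integers."""
    return np.clip(np.round(np.asarray(x) * (1 << Q)), -32768, 32767).astype(np.int16)

def q15_to_float(x):
    """Convert Q15 integers back to floats."""
    return np.asarray(x, dtype=np.int64) / float(1 << Q)

def q15_mul(a, b):
    """Fixed-point multiply: widen to 32 bits, then shift back to Q15."""
    return ((np.int32(a) * np.int32(b)) >> Q).astype(np.int16)

# Example: two feature values multiplied entirely in fixed point
c = float_to_q15(0.5)
g = float_to_q15(0.25)
print(q15_to_float(q15_mul(c, g)))  # 0.125
```

Replacing floating-point multiplies with integer shift-and-multiply operations like this is what makes real-time processing feasible on processors without a floating-point unit.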
-
In this paper, we propose a VQ codebook design for speech recognition feature parameters to improve the performance of a distributed speech recognition system. For context-dependent HMMs, the VQ codebook should be correlated with the phonetic distribution of the HMM training data. Thus, for an efficient VQ codebook design, we focus on a method for selecting training data based on phonetic distribution instead of using all the training data. In speech recognition experiments on the Aurora 4 database, the distributed speech recognition system employing a VQ codebook designed by the proposed method reduced the word error rate (WER) by 10% compared with the system using a VQ codebook trained on the whole training data.
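The codebook-design step itself can be sketched with the generalized Lloyd (k-means) algorithm; the phonetically motivated data selection the paper proposes would determine which feature vectors are passed in. This is a generic sketch, not the authors' implementation:

```python
import numpy as np

def train_vq_codebook(features, codebook_size, n_iter=20, seed=0):
    """Design a VQ codebook with the generalized Lloyd (k-means) algorithm.

    `features` is an (N, D) array of feature vectors, e.g. MFCCs drawn
    from a phonetically balanced subset of the training data."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), codebook_size, replace=False)]
    for _ in range(n_iter):
        # Assign each vector to its nearest codeword (squared Euclidean).
        d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Update each codeword to the centroid of its cell.
        for k in range(codebook_size):
            cell = features[labels == k]
            if len(cell):
                codebook[k] = cell.mean(0)
    return codebook
```

Selecting the training subset by phonetic distribution, rather than feeding all data to this loop, is what aligns the resulting codewords with the HMMs' phonetic coverage.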
-
The purpose of state tying is to reduce the number of models and to use relatively reliable output probability distributions. There are two approaches: top-down clustering and bottom-up clustering. For seen data, the bottom-up approach performs better than the top-down approach. In this paper, we propose a new clustering technique that can enhance clustering performance for undertrained triphones. The basic idea is to tie unreliable triphones before clustering, where an unreliable triphone is one that appears in the training data too infrequently for its model to be trained accurately. We propose using a monophone distance to preprocess these unreliable triphones. A pilot experiment has shown that the proposed method reduces the error rate significantly.
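The preprocessing step can be sketched as follows. Note this is a simplified stand-in: the Euclidean distance between model mean vectors below is only a proxy for the paper's monophone distance, and the count threshold is an assumption.

```python
import numpy as np

def tie_unreliable_triphones(models, counts, min_count=10):
    """Tie each infrequent triphone to the nearest reliable one.

    `models` maps triphone name -> mean vector (a stand-in for the
    model's output distribution); `counts` maps triphone name -> number
    of occurrences in the training data."""
    reliable = [t for t in models if counts[t] >= min_count]
    tying = {}
    for t, mu in models.items():
        if counts[t] >= min_count:
            tying[t] = t  # a reliable triphone keeps its own model
        else:
            # Map the unreliable triphone to the closest reliable model.
            tying[t] = min(reliable,
                           key=lambda r: np.linalg.norm(models[r] - mu))
    return tying
```

Tying the undertrained triphones to well-trained neighbors before clustering prevents their poorly estimated distributions from distorting the cluster tree.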
-
As speech recognition systems are used in many emerging applications, robust performance under extremely noisy conditions becomes more important. Voice activity detection (VAD) has been regarded as one of the important factors for robust speech recognition. In this paper, we review conventional VAD algorithms and analyze the strong and weak points of each.
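The simplest of the conventional algorithms is energy-based VAD, which can be sketched as follows (a minimal illustration; the frame length and threshold are arbitrary choices, and real VADs add smoothing and hangover logic):

```python
import numpy as np

def energy_vad(signal, frame_len=160, threshold_db=-30.0):
    """Minimal energy-based VAD: a frame is speech when its log-energy
    is within `threshold_db` dB of the loudest frame."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    log_e = 10.0 * np.log10(energy + 1e-12)
    return log_e > (log_e.max() + threshold_db)  # boolean speech mask

# Example: 5 frames of silence followed by 5 frames of a tone
sig = np.concatenate([np.zeros(800),
                      0.5 * np.sin(2 * np.pi * 440 / 8000 * np.arange(800))])
print(energy_vad(sig))  # first 5 frames False, last 5 frames True
```

Its weakness, as the survey-style comparison in this paper would show, is that a fixed energy threshold breaks down at low SNR, which motivates the statistical and spectral VAD variants.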
-
The performance of a sound source localization system degrades severely in reverberant and noisy environments. In addition, the restriction on the distance between microphones imposed by portable devices also lowers system performance. This paper compares sound source localization algorithms based on time delay of arrival that are robust to reverberation and noise, considering the microphone sensor distance. In addition, a post filter that outputs the maximum-count time delay is adopted to increase accuracy.
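A representative member of this algorithm family is GCC-PHAT, sketched below (a generic illustration, not necessarily one of the exact variants compared in the paper):

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Estimate the time delay of y relative to x with GCC-PHAT; the
    phase-transform weighting whitens the cross-spectrum, which makes
    the estimator comparatively robust to reverberation."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = np.conj(X) * Y                          # cross-spectrum
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)
    max_shift = n // 2
    cc = np.concatenate([cc[-max_shift:], cc[:max_shift]])
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # delay in seconds

# Example: y lags x by 20 samples at 8 kHz -> 20/8000 = 0.0025 s
rng = np.random.default_rng(1)
x = rng.standard_normal(2048)
y = np.roll(x, 20)
print(gcc_phat(x, y, fs=8000))  # 0.0025
```

With a small inter-microphone distance, the measurable delays shrink toward a fraction of a sample, which is precisely why sensor distance matters in the comparison.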
-
The aim of this paper is to investigate the effect of a postvocalic consonant cluster (contrasting nasal-stop clusters with single stops) on vowel duration. In particular, we focused on the ratio of vowel duration within the words (Experiment I) and the tendency toward unreleased voiceless stops at the end of the words (Experiment II). Experiment I showed that vowels preceding single voiceless stops are significantly longer than those preceding their nasal-stop counterparts, and that the ratio for English native speakers was larger than that for Korean learners of English. Experiment II indicated that unreleased stop consonants occurred more frequently with single voiceless stops than with nasal-stop clusters, and that Korean learners of English produced unreleased stops more frequently than English natives.
-
Young children's speech is compared to adult-to-adult speech and adult-to-child speech by measuring the durations and variability of each segment in CVC words. The results demonstrate that child speech exhibits an inconsistent timing relationship between consonants and vowels within a word. In contrast, consonant and vowel durations in adult-to-adult speech and adult-to-child speech exhibit significant relationships across segments, despite segmental variability when speaking rate is decreased. The results suggest that the temporal patterns of young children are quite different from those of adults, and provide some evidence for a lack of motor control capability and great variance in articulatory coordination.
-
The present study investigates the vowels of Bahasa Malaysia and Bahasa Indonesia in terms of the first two formant frequencies. For this study, we recruited 30 male native speakers of Bahasa Malaysia and Bahasa Indonesia (15 each) and recorded the 6 vowels (i, e, a, o, u, ə) in various contexts. The present study provides a three-dimensional vowel space by plotting F1, F2, and the frequency of datapoints. This study is significant in that this geometry of the vowel space presents yet another view of it.
-
Prosody can be used to resolve the syntactic ambiguity of a sentence. The English relative clause construction with a complex NP (the N1, N2, and RC sequence) is syntactically ambiguous: the clause can be interpreted as modifying N1 (high attachment) or N2 (low attachment). Speakers and listeners can disambiguate such sentences based on prosody. In this paper, we investigate Korean English learners' production of the prosodic structure of the English relative clause construction. The production experiment shows that beginner learners use phrasing frequently, while advanced learners depend on both phrasing and accent. One characteristic of the Korean English learners' intonation is that the Korean accentual phrase tone pattern LHa is transferred to their production.
-
The purpose of this study is to observe how Korean listeners detect a target phoneme under 'Focus' realized by prosodic prominence and by question-induced semantic emphasis. In an automated phoneme detection task using E-Prime, Korean listeners detected phoneme targets more rapidly when the target-bearing words were in prominence position or in question-induced position. However, response times were much faster for targets in prominence position than in question-induced position. The results suggest that prosodic prominence, an explicit method of focus representation, is more effective for phoneme detection than question-inducing, an implicit method.
-
This paper presents a technique for imposing the prosodic features of a native speaker's utterance onto the same sentence uttered by a non-native speaker. Three acoustic aspects of the prosodic features were considered: the fundamental frequency (F0) contour, segmental durations, and the intensity contour. The fundamental frequency contour and the segmental durations of the native speaker's utterance were imposed on the non-native speaker's utterance by using the PSOLA (pitch-synchronous overlap-add) algorithm [1] implemented in Praat [2]. The intensity contour transfer was also done in Praat. The technique of transferring one or more of these prosodic features is elaborated, and its implications for language education are discussed.
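The duration-transfer part of this pipeline can be illustrated very roughly as follows. The paper uses Praat's PSOLA, which places analysis windows pitch-synchronously to avoid pitch artifacts; the plain overlap-add stretch below is only a simplified stand-in to show the window-and-resynthesize idea, with arbitrary frame and hop sizes:

```python
import numpy as np

def ola_stretch(x, rate, frame=512, hop=128):
    """Crude overlap-add time stretch: analysis windows are taken every
    `hop` samples and laid down every `hop / rate` samples, so rate=0.5
    roughly doubles the duration."""
    win = np.hanning(frame)
    out_hop = int(round(hop / rate))
    n_frames = (len(x) - frame) // hop
    y = np.zeros(n_frames * out_hop + frame)
    norm = np.zeros_like(y)
    for i in range(n_frames):
        y[i * out_hop: i * out_hop + frame] += x[i * hop: i * hop + frame] * win
        norm[i * out_hop: i * out_hop + frame] += win
    return y / np.maximum(norm, 1e-8)  # compensate window overlap

# Stretch a 1 s tone at 16 kHz to roughly 2 s
x = np.sin(2 * np.pi * 220 / 16000 * np.arange(16000))
y = ola_stretch(x, rate=0.5)
```

In the actual technique, per-segment rates would be derived from the native speaker's segmental durations rather than a single global rate, and PSOLA's pitch-synchronous windowing would preserve voice quality.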
-
Generally, news subtitles are assumed to be free of faults, but in fact they contain faults in many respects. Among these, I made a special study of phonological faults: alternation of graphemes, insertion of graphemes, deletion of graphemes, and the orthography of loanwords. It is very surprising that news subtitles contain many faults against Korean orthography. We must try to get rid of the faults in news subtitles.
-
This paper presents the ETRI broadcast news speech recognition system. There are two major issues in broadcast news speech recognition: 1) real-time processing and 2) out-of-vocabulary handling. For real-time processing, we devised a dual-decoder architecture. The input speech signal is segmented at the long pauses between utterances, and the two decoders process the speech segments alternately. One decoder can start recognizing the current speech segment without waiting for the other decoder to finish recognizing the previous segment, so the processing delay is not accumulated. For out-of-vocabulary handling, we updated both the vocabulary and the language model based on recent news articles on the internet. By updating the language model as well as the vocabulary, we improved the performance by up to 17.2% ERR.
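The dual-decoder scheduling idea can be sketched with a two-worker pool; `decode` below is a hypothetical stand-in for a real decoder, and this sketch only illustrates the overlap of segment processing, not ETRI's actual architecture:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def decode(segment):
    """Hypothetical decoder stand-in: pretend recognition takes a while."""
    time.sleep(0.05)
    return f"hyp({segment})"

# Pause-delimited speech segments arriving in order
segments = [f"seg{i}" for i in range(6)]

# With two workers, decoding of segment i+1 starts while segment i is
# still being recognized, so per-segment latency does not accumulate.
with ThreadPoolExecutor(max_workers=2) as pool:
    hyps = list(pool.map(decode, segments))
print(hyps)
```

`pool.map` preserves segment order, mirroring how the two decoders alternate over consecutive pause-delimited segments while the output transcript stays in order.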
-
Prompter software is used behind the camera to scroll the script for a TV narrator. So far it has been operated manually by an assistant, who scrolls the caption following the narrator's speech. In this project, we investigated automating this procedure using speech recognition technology. The developed auto-scrolling software was tested both offline and online, and its performance was good enough to replace the existing prompter software. This paper describes the whole development process and the issues to be considered.
-
The present study examined two acoustic characteristics (duration and intensity) of vowels produced by 4 adults with cerebral palsy and 4 nondisabled adults in conversational and clear speech. In this study, clear speech means: (1) slowing one's speech rate just a little, and (2) articulating all phonemes accurately while increasing vocal volume. The speech material included 10 bisyllabic real words in frame sentences. Temporal-acoustic analysis showed that vowels produced by both speaker groups in clear speech (in this case, more accurate and louder speech) were significantly longer than vowels in conversational speech. In addition, the intensity of vowels produced by the speakers with cerebral palsy was higher in clear speech than in conversational speech.
-
The purpose of this study was to investigate the diadochokinetic characteristics of patients with spastic cerebral palsy (CP) by severity. The diadochokinetic characteristics were measured in terms of rate, regularity, accuracy, and consistency. The subjects were 27 persons with spastic CP (9 mild, 9 moderate, 9 severe) and 9 normal persons, all aged 11-20 years. The results showed a significant difference in AMR rate between all spastic groups and the normal group, and a significant difference in SMR rate between the normal and mild groups on the one hand and the moderate and severe groups on the other. In regularity of the diadochokinetic task, the severe group differed significantly from the other groups. Finally, accuracy and consistency of the diadochokinetic task exhibited significant differences between all spastic groups and the normal group.
-
This study investigated the acoustic characteristics of phonatory offset-onset mechanisms and compares non-stutterers (N=3) with a stutterer (N=1). Phonatory offset-onset refers to laryngeal articulatory behavior in connected speech. In the phonetic context (V_V), pattern 0 (no change) appeared in all subjects, and pattern 4 (a trace of glottal fry and closure in the spectrogram) appeared only in the stutterer. In high vowels (/i/, /u/), patterns 3 and 4 appeared only in the stutterer. Although there was no common pattern among the non-stutterers, individual preference patterns were found. This study offers a key to understanding the physiological movement in a stuttering block.
-
This research investigated how moderately proficient bilingual subjects represent the lexicon of their second language. Although most research has focused only on highly proficient bilinguals, we analyzed how moderate bilinguals who have learned English mostly in school represent the prototype of a verb and its inflected forms. The results of a lexical decision task showed that the moderate bilingual subjects used different mental representations depending on whether the verb has a regular or irregular conjugation. With regular verbs, the identification of an inflected form was affected by both the frequency of its prototype and that of the inflected form; with irregular verbs, it was affected only by the frequency of the inflected form.
-
This paper examined two hypotheses. First, if the first syllable of a word plays an important role in visual word recognition, it may be the unit of word neighborhood. Second, if the first syllable is the unit of lexical access, the neighborhood size effect and the neighborhood frequency effect should appear in a lexical decision task (LDT) and a form-primed lexical decision task. We conducted two experiments. Experiment 1 showed that words with large neighborhoods produced an inhibitory effect in the LDT. Experiment 2 showed an interaction between the neighborhood frequency effect and word-form similarity in the form-primed LDT. We conclude that the first syllable in Korean words may be the unit of word neighborhood and play a central role in lexical access.
-
One of the biggest unsolved problems in emotional speech acquisition is how to create or find a situation that elicits a state close to the natural or desired one in humans. We propose a method for collecting emotional speech data using scripted context. Several contexts were chosen from drama scripts by experts in the area and were divided into 6 classes according to their contents. Two actors, one male and one female, read the text after internalizing the emotional situations in the script.
-
This paper suggests an algorithm that can estimate the direction of a sound source with three microphones arranged on a circle. The algorithm is robust to the microphones' gains because it uses only the time differences between microphones. To make this possible, a cost function that normalizes the microphones' gains is utilized, and a procedure to detect the rough position of the sound source is also proposed. In our experiments, we obtained a significant performance improvement compared with an energy-based localizer.
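The gain-robustness argument can be demonstrated with a small sketch (not the paper's algorithm; simple cross-correlation stands in for the proposed cost function): the peak location of a cross-correlation, and hence the estimated time difference, is unchanged when one microphone's gain is scaled.

```python
import numpy as np

def tdoa_samples(x, y):
    """Delay (in samples) of y relative to x via cross-correlation."""
    cc = np.correlate(y, x, mode="full")
    return np.argmax(cc) - (len(x) - 1)

rng = np.random.default_rng(2)
src = rng.standard_normal(4000)
mic1 = src
mic2 = 0.2 * np.roll(src, 15)   # same wavefront: 15-sample delay, low gain

# The estimated delay is unaffected by the per-microphone gain, which
# is why a time-difference localizer is robust to gain mismatch, unlike
# an energy-based localizer that compares amplitudes directly.
print(tdoa_samples(mic1, mic2))          # 15
print(tdoa_samples(mic1, 5.0 * mic2))    # still 15
```

With three microphones on a circle, three such pairwise time differences over-determine the source direction, which is where a normalized cost function over candidate directions comes in.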
-
Narrowband speech over the telephone network lacks the low-band (0-300 Hz) and high-band (3400-8000 Hz) information found in wideband speech (0-8000 Hz). As a result, narrowband speech is characterized by reduced intelligibility, muffled quality, and degraded speaker identification. Spectral folding is the easiest way to reconstruct the missing high band; however, the reconstructed speech still sounds band-limited because of the absence of low-band and mid-band frequency components. To compensate, we propose to combine the spectral folding method with a GMM transformation method, a statistical method for reconstructing wideband speech. In the reconstructed wideband speech, the absent frequency components were filled in with relatively low spectral mismatch, and in subjective speech quality evaluations the proposed method was preferred to the other methods.
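The spectral folding step is simply zero-insertion upsampling, which mirrors the narrowband spectrum into the high band. A minimal sketch (the GMM transformation and the filtering a real system would apply afterward are omitted):

```python
import numpy as np

def spectral_fold(nb):
    """Spectral folding: zero-insertion upsampling by 2 mirrors the
    narrowband spectrum (0-4 kHz) into the 4-8 kHz band."""
    wb = np.zeros(2 * len(nb))
    wb[::2] = nb          # insert a zero between consecutive samples
    return wb

# A 1 kHz narrowband tone (8 kHz rate) folds to an image at 8 - 1 = 7 kHz
fs = 8000
nb = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
wb = spectral_fold(nb)                        # now at 16 kHz rate
spec = np.abs(np.fft.rfft(wb))
freqs = np.fft.rfftfreq(len(wb), d=1 / 16000)
peaks = sorted(freqs[np.argsort(spec)[-2:]])
print(peaks)  # [1000.0, 7000.0]
```

The mirrored image fills the high band for free, but it carries narrowband spectral shape rather than true wideband detail, which is the mismatch the GMM transformation is meant to correct statistically.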
-
The communication method between human and robot is an important part of human-robot interaction, and speech is an easy and intuitive communication method for human beings. By using speech as the communication method, we can interact with a robot in a familiar way. In this paper, we developed a TTS system for human-robot interaction. The synthesis algorithms were modified to utilize the robot's restricted resources efficiently, and the synthesis database was reconstructed for efficiency. As a result, we could reduce the computation time with only a slight degradation of speech quality.
-
Speaker verification systems can be implemented using speaker adaptation methods if the amount of speech available for each target speaker is too small to train the speaker model. This paper shows experimental results using well-known adaptation methods, namely Maximum A Posteriori (MAP) and Maximum Likelihood Linear Regression (MLLR). Experimental results using Korean speech show that MLLR is more effective than MAP for short enrollment utterances.
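The MAP mean update can be sketched as follows. This is the generic relevance-factor formulation, not necessarily the exact variant used in these experiments; the relevance factor `tau` and the responsibility matrix are assumptions of the sketch:

```python
import numpy as np

def map_adapt_means(means, data, resp, tau=10.0):
    """MAP adaptation of GMM mean vectors: each mean is shifted toward
    the enrollment data in proportion to the soft count of frames
    assigned to that component.

    means: (K, D) prior (background-model) means
    data:  (N, D) enrollment frames
    resp:  (N, K) component responsibilities of the frames
    tau:   relevance factor controlling the prior's weight."""
    n_k = resp.sum(axis=0)                       # soft counts per component
    ex_k = resp.T @ data                         # (K, D) weighted data sums
    alpha = (n_k / (n_k + tau))[:, None]         # adaptation coefficients
    safe_n = np.maximum(n_k, 1e-10)[:, None]     # avoid division by zero
    return alpha * (ex_k / safe_n) + (1 - alpha) * means

# Example: prior means at 0 and 10; all 40 enrollment frames sit at 1.0
# and belong to the first component.
means = np.array([[0.0], [10.0]])
data = np.full((40, 1), 1.0)
resp = np.tile([1.0, 0.0], (40, 1))
print(map_adapt_means(means, data, resp))  # first mean -> 0.8, second stays 10
```

Because components unseen in the short enrollment utterance keep their prior means (alpha near 0), MAP adapts conservatively, which is consistent with MLLR's global transforms being more effective when enrollment data is scarce.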
-
Music is now digitally produced and distributed via the internet, and we face a huge amount of music every day. Music summarization technology has been studied to help people concentrate on the most impressive section of a song, so that one can skim a song by listening to the climax (chorus, refrain) only. Recent studies try to find the climax section using various methods, such as finding diagonal line segments or kernel-based segmentation. All these methods fail to capture the inherent structure of music due to its polyphonic and noisy nature. In this paper, by applying a moving average filter along the time axis of the MFCC/chroma features, we achieved a remarkable result in capturing the music structure.
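The smoothing step, together with the self-similarity matrix in which repeated sections appear as diagonal stripes, can be sketched as follows (a generic illustration with random features standing in for real MFCC/chroma frames; window size is an assumption):

```python
import numpy as np

def smooth_features(feats, win=5):
    """Moving-average filter along the time axis of a (T, D) feature
    matrix (e.g. MFCC or chroma frames), suppressing frame-level noise
    before structure analysis."""
    kernel = np.ones(win) / win
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="same"), 0, feats)

def self_similarity(feats):
    """Cosine self-similarity matrix; repeated sections (e.g. the
    chorus) show up as diagonal stripes in this matrix."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.maximum(norms, 1e-12)
    return unit @ unit.T

T, D = 100, 12
feats = np.random.default_rng(3).standard_normal((T, D))
S = self_similarity(smooth_features(feats))
print(S.shape)  # (100, 100)
```

Smoothing before computing the matrix is what lets the repetition stripes survive the frame-level noise of polyphonic audio.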