Phonetics and Speech Sciences
The Korean Society of Speech Sciences
Volume 7, Issue 4 - Dec 2015
Input Dimension Reduction based on Continuous Word Vector for Deep Neural Network Language Model
Kim, Kwang-Ho ; Lee, Donghyun ; Lim, Minkyu ; Kim, Ji-Hwan ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 3~8
DOI : 10.13064/KSSS.2015.7.4.003
In this paper, we investigate an input dimension reduction method that uses continuous word vectors in a deep neural network language model. In the proposed method, continuous word vectors were generated from a large training corpus with Google's Word2Vec so as to satisfy the distributional hypothesis. 1-of-N coded discrete word vectors were replaced with their corresponding continuous word vectors. In our implementation, the input dimension was reduced from 20,000 to 600 when a tri-gram language model was used with a vocabulary of 20,000 words. The total training time was reduced from 30 days to 14 days for the Wall Street Journal training corpus (corpus length: 37M words).
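The replacement described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: `word_vector` is a hypothetical stand-in for a trained Word2Vec lookup, and the split of the 600 input dimensions into two history words of 300 dimensions each is an assumption, since the abstract does not state how the 600 is composed.

```python
import random

VOCAB_SIZE = 20000   # vocabulary size stated in the abstract
DIM_PER_WORD = 300   # assumption: not stated in the abstract


def one_hot(index, vocab_size):
    """Discrete 1-of-N vector: all zeros except a single 1 at the word index."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def word_vector(index, dim):
    """Hypothetical stand-in for a trained Word2Vec lookup.

    Returns a deterministic pseudo-random dense vector for the word index.
    """
    rng = random.Random(index)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def continuous_input(history_indices, dim):
    """Concatenate the dense vectors of the history words into one input."""
    vec = []
    for idx in history_indices:
        vec.extend(word_vector(idx, dim))
    return vec


history = [17, 4242]  # hypothetical word indices of a tri-gram history
discrete = one_hot(history[-1], VOCAB_SIZE)
dense = continuous_input(history, DIM_PER_WORD)
print(len(discrete), len(dense))
```

The dense input length depends only on the vector dimension and the number of history words, not on the vocabulary size.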
Performance Comparison of Deep Feature Based Speaker Verification Systems
Kim, Dae Hyun ; Seong, Woo Kyeong ; Kim, Hong Kook ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 9~16
DOI : 10.13064/KSSS.2015.7.4.009
In this paper, several experiments are performed with deep neural network (DNN) based features to compare the performance of speaker verification (SV) systems. To this end, input features for a DNN, such as mel-frequency cepstral coefficients (MFCC), linear-frequency cepstral coefficients (LFCC), and perceptual linear prediction (PLP), are first compared in terms of SV performance. After that, the effect of the DNN training method and the structure of the DNN hidden layers on SV performance is investigated for each type of feature. The performance of an SV system is then evaluated with either an i-vector or a probabilistic linear discriminant analysis (PLDA) scoring method. The SV experiments show that a tandem feature combining the DNN bottleneck feature and the MFCC feature gives the best performance when the DNNs are configured with a rectangular type of hidden layers and trained with a supervised training method.
Improvement of convergence speed in FDICA algorithm with weighted inner product constraint of unmixing matrix
Quan, Xingri ; Bae, Keunsung ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 17~25
DOI : 10.13064/KSSS.2015.7.4.017
For blind source separation of convolutive mixtures, FDICA (Frequency Domain Independent Component Analysis) algorithms are generally used. Since FDICA algorithms such as Sawada's FDICA and IVA (Independent Vector Analysis) work on a per-frequency-bin basis with a natural gradient descent method, they take a long time to converge. In this paper, we propose a new method to improve the convergence speed of FDICA algorithms. The proposed method drastically reduces the number of iterations in the natural gradient descent process by applying a weighted inner product constraint to the unmixing matrix. Experimental results show that the proposed method achieves a large improvement in convergence speed without degrading the separation performance of the baseline algorithms.
Audio Event Classification Using Deep Neural Networks
Lim, Minkyu ; Lee, Donghyun ; Kim, Kwang-Ho ; Kim, Ji-Hwan ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 27~33
DOI : 10.13064/KSSS.2015.7.4.027
This paper proposes an audio event classification method using Deep Neural Networks (DNN). The proposed method applies a Feed Forward Neural Network (FFNN) to generate event probabilities of ten audio events (dog barks, engine idling, and so on) for each frame. For each frame, mel scale filter bank features of its consecutive frames are used as the input vector of the FFNN. These event probabilities are accumulated per event, and the classification result is the event with the highest accumulated probability. For the same dataset, the best accuracy reported in previous studies was about 70%, obtained with a Support Vector Machine (SVM). The proposed method achieves a best accuracy of 79.23% on the UrbanSound8K dataset when 80 mel scale filter bank features from each of 7 consecutive frames (560 in total) are used as the input vector for an FFNN with two hidden layers and 2,000 neurons per hidden layer. In this configuration, the rectified linear unit is suggested as the activation function.
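The input construction and the decision rule in this abstract can be sketched as follows. This is an illustration only: the function names, the three example labels, and the probability values are hypothetical, the FFNN itself is omitted, and the abstract does not say how the 7 frames are positioned relative to the current frame, so a simple sliding window is assumed.

```python
def stack_frames(features, context=7):
    """Concatenate `context` consecutive frames into one input vector.

    Assumption: a plain sliding window; the abstract does not specify
    whether the window is centered on the current frame.
    """
    return [
        [value for frame in features[start:start + context] for value in frame]
        for start in range(len(features) - context + 1)
    ]


def classify_clip(frame_probs, labels):
    """Accumulate per-frame event probabilities and pick the largest total."""
    totals = [0.0] * len(labels)
    for probs in frame_probs:
        for i, p in enumerate(probs):
            totals[i] += p
    best = max(range(len(labels)), key=lambda i: totals[i])
    return labels[best], totals


# 10 frames of 80 filter bank features each (dummy values).
features = [[0.0] * 80 for _ in range(10)]
inputs = stack_frames(features)
print(len(inputs), len(inputs[0]))

# Hypothetical network outputs for four frames and three event classes.
labels = ["dog_bark", "engine_idling", "siren"]
frame_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.4, 0.5, 0.1],
]
event, totals = classify_clip(frame_probs, labels)
print(event)
```

Accumulating over frames lets a class that is consistently plausible win over one that dominates only a few frames.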
A Speech Waveform Forgery Detection Algorithm Based on Frequency Distribution Analysis
Heo, Hee-Soo ; So, Byung-Min ; Yang, IL-Ho ; Yu, Ha-Jin ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 35~40
DOI : 10.13064/KSSS.2015.7.4.035
We propose a speech waveform forgery detection algorithm based on the flatness of the frequency distribution. We devise a new measure of flatness which emphasizes local changes in the frequency distribution. Our measure calculates the sum of the differences between the energies of neighboring frequency bands. We compare the proposed measure with conventional flatness measures using a large set of test sounds. We also compare the proposed method with conventional detection algorithms based on spectral distances. The results show that the proposed method gives a lower equal error rate on the test set than the conventional methods.
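A measure of the kind described above might look like the sketch below. It follows only the one-sentence description in the abstract: taking the absolute value of each difference is an assumption, and any normalization or log scaling of the band energies used in the paper is not reproduced. The function name and the example energies are hypothetical.

```python
def neighbor_difference_flatness(band_energies):
    """Sum of differences between energies of neighboring frequency bands.

    Assumption: absolute differences are summed, so that rises and falls
    do not cancel each other out.
    """
    return sum(abs(b - a) for a, b in zip(band_energies, band_energies[1:]))


smooth = [1.0, 1.0, 1.0, 1.0]   # perfectly flat distribution
jagged = [0.1, 2.0, 0.2, 1.7]   # large local changes between bands

print(neighbor_difference_flatness(smooth))
print(neighbor_difference_flatness(jagged))
```

Under this reading, a smooth distribution scores near zero and a distribution with strong local changes scores high.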
A Study on Word Vector Models for Representing Korean Semantic Information
Yang, Hejung ; Lee, Young-In ; Lee, Hyun-jung ; Cho, Sook Whan ; Koo, Myoung-Wan ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 41~47
DOI : 10.13064/KSSS.2015.7.4.041
This paper examines whether the Global Vector model is applicable to Korean data as a universal learning algorithm. The main purpose of this study is to compare the global vector model (GloVe) with word2vec models, namely the continuous bag-of-words (CBOW) model and the skip-gram (SG) model. For this purpose, we conducted an experiment with an evaluation corpus consisting of 70 target words for word similarity and 819 pairs of Korean words for word analogy. In the word similarity task, the Pearson correlation coefficient with human judgement was 0.3133 for GloVe, 0.2637 for CBOW, and 0.2177 for SG. In the word analogy task, the overall accuracy on semantic and syntactic relations was 67% for GloVe, 66% for CBOW, and 57% for SG.
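The word similarity evaluation reported above rests on the Pearson correlation between model scores and human ratings. A minimal sketch of that computation follows. The ratings and scores are made-up illustrative values, not data from the paper, and using cosine similarity for the model score is an assumption about the setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical human similarity ratings and model scores for four word pairs.
human = [9.0, 7.5, 3.0, 1.0]
model = [0.82, 0.64, 0.35, 0.10]
r = pearson(human, model)
print(round(r, 4))
```

A coefficient near 1 means the model ranks and spaces the pairs much as the human raters do; the values reported in the abstract (0.22 to 0.31) indicate a much weaker agreement.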
Korean Semantic Similarity Measures for the Vector Space Models
Lee, Young-In ; Lee, Hyun-jung ; Koo, Myoung-Wan ; Cho, Sook Whan ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 49~55
DOI : 10.13064/KSSS.2015.7.4.049
It is argued in this paper that, in determining semantic similarity, Korean words should be recategorized with a focus on the semantic relation to ontology in light of cross-linguistic morphological variations. It is proposed, in particular, that Korean semantic similarity should be measured on three tracks: a human judgements track, a relatedness track, and a cross-part-of-speech relations track. As demonstrated in Yang et al. (2015), GloVe, an unsupervised learning model of semantic similarity, is applicable to Korean when its performance is compared with human judgement results. Based on this compatibility, it was further hypothesized that the model's performance would most likely vary with different kinds of specific relations in different languages. An attempt was made to analyze them in terms of two major Korean-specific categories involved in their lexical and cross-POS relations. It is concluded that languages must be analyzed by varying methods so that semantic components across languages may allow varying semantic distance in the vector space models.
Comparison of Voice Characteristics Before and After High-Caffeine Intake
Lee, Areum ; Kim, Eunyun ; Yoo, Hyunji ; Choi, Yaelin ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 59~65
DOI : 10.13064/KSSS.2015.7.4.059
This study was conducted to identify differences in voice characteristic variables before and after the intake of a fixed amount of a high-caffeine product. A Linear PCM-M10 recorder (SONY) was used for recording, and fundamental frequency (F0), frequency fluctuation rate (jitter), amplitude fluctuation rate (shimmer), and signal-to-noise ratio (SNR) were measured using TF32 (University of Wisconsin-Madison, USA). First, in the prolonged phonation of /ah/, male subjects showed significantly increased shimmer values (p<.05) and significantly decreased SNR values (p<.05) after the intake compared with before. Female subjects, on the other hand, showed no statistically significant differences in any variable. Second, in the 'Autumn' paragraph, male subjects showed significantly increased shimmer values after the intake at /ah/ in the syllables 'na' and 'ra' (p<.05), and significantly increased jitter values at /ah/ in 'ah' (p<.05). Female subjects again showed no statistically significant differences in any variable. The results indicate that high-caffeine intake affects male subjects more than female subjects. In male subjects, shimmer and SNR changed in prolonged phonation of the vowel /ah/, and shimmer and SNR at /na/ and /ra/ and jitter at /ah/ in the 'Autumn' paragraph could be identified as variables that reflect the voice change.
Comparison of Self-Reporting Voice Evaluations between Professional and Non-Professional Voice Users with Voice Disorders by Severity and Type
Kim, Jaeock ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 67~76
DOI : 10.13064/KSSS.2015.7.4.067
The purpose of this study was to compare professional (Pro) and non-professional (Non-pro) voice users with voice disorders in self-reported voice evaluation using the Korean-Voice Handicap Index (K-VHI) and Korean-Voice Related Quality of Life (K-VRQOL). In addition, the scores were compared by voice quality and voice disorder type. Ninety-four Pro and 106 Non-pro voice users were asked to fill out the K-VHI and K-VRQOL, were perceptually evaluated on the GRBAS scales, and were divided into three types of voice disorders (functional, organic, and neurologic) by an experienced speech-language pathologist and an otolaryngologist. The results showed that the functional (F) and physical (P) scores of the K-VHI in the Pro group were significantly higher than those in the Non-pro group. As the voice quality evaluated on the G scale got worse, the scores of all aspects except emotional (E) of the K-VHI and social-emotional (SE) of the K-VRQOL were higher. All scores of the K-VHI and K-VRQOL in neurologic voice disorders were significantly higher than those in functional and organic voice disorders. In conclusion, professional voice users are more sensitive to the functional and physical handicap resulting from their voice problems, and this holds even more strongly for patients with severe and neurologic voice disorders.
Acoustic Analysis of Voice Change According to Extent of Thyroidectomy
Kang, Young Ae ; Koo, Bon Seok ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 77~83
DOI : 10.13064/KSSS.2015.7.4.077
Voice complications can occur after thyroidectomy even without laryngeal nerve injury. The purpose of this study is to investigate voice changes according to the extent of thyroidectomy using acoustic analysis. Thirty-five female patients with papillary thyroid carcinoma underwent voice evaluation before thyroidectomy and at 1 month and 3 months after it. The acoustic analysis parameters were speaking fundamental frequency (SFF), min F0, dynamic range F0, jitter, shimmer, noise-to-harmonic ratio (NHR), and cepstral peak prominence (CPP). Repeated-measures analysis of variance was applied. Time-related voice changes showed significant differences in all parameters except NHR. At 1 month after surgery, voice quality was worse and pitch was decreasing, but voice quality and pitch were improving at the 3-month follow-up. Voice changes according to the extent of surgery were found in SFF, max F0, and dynamic range F0. A time-by-surgery interaction in voice change existed only in min F0. The results showed that the severity of voice complications depended on the extent of thyroidectomy, which had a negative impact on F0-related parameters. The deterioration of voice quality at 1 month after thyroidectomy may be affected by the loss of thyroid hormone in the blood. The descent of F0-related parameters may be caused by laryngeal fixation due to adhesion at the surgical site.
Acoustic Features of Oral Vowels in the Esophagus Speakers
Yun, Eunmi ; Mok, Eunhee ; Minh, Phan huu Ngoc ; Hong, Kihwan ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 85~92
DOI : 10.13064/KSSS.2015.7.4.085
This study aimed to establish voice and speech characteristics of esophageal speech through fundamental frequency analysis. Eight esophageal speakers and 10 control subjects were selected. MDVP (Multi-Dimensional Voice Program, Model 4800, USA, 2001) and Multi-Speech (Model 3700, KayPENTAX, USA, 2008) were used as experimental equipment. The speech samples selected for evaluation were vowels and sentences (both declarative and interrogative). For acoustic analysis, f0, jitter, energy, shimmer, HNR, and the intonation patterns of the speech samples were measured. The results were as follows. First, the fundamental frequency of prolonged vowels in the esophageal speech group was lower than in the normal voice group. In particular, the frequency difference for the high vowel /i/ was much greater than that for the low vowel /a/. Second, the jitter values of the esophageal speech group were higher than those of the control group. In particular, there was a large difference between the jitter values for /a/ and /i/, with the jitter values being highest for /i/. Third, there was no significant difference in vocal intensity between the esophageal speech group and the control group. Fourth, the shimmer values of the esophageal speech group were higher than those of the control group. In particular, there was a large difference in shimmer values for the low vowel /a/. Fifth, the HNR values of the esophageal speech group were significantly lower than those of the control group. In particular, the largest difference in HNR values between the two groups was for the high vowel /i/. Sixth, the pitch contours of interrogative and declarative sentences in the esophageal speech group either showed a different form from those of the normal voice group or differed from them only slightly, thus presenting an inconsistent pattern.
Comparison of Aerodynamic Variables according to the Execution Methods of KayPENTAX Phonatory Aerodynamic System Model 6600
Ko, Hyeju ; Choi, Hong-Shik ; Lim, Sung-Eun ; Choi, Yaelin ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 93~99
DOI : 10.13064/KSSS.2015.7.4.093
In the PAS test, air sometimes leaks even though the mask is tightly attached to the face, which makes the measured values unreliable. This study therefore aimed to support clinical practice by suggesting a PAS test method without air leakage. Healthy subjects over 19 years old (12 males and 12 females) performed the voicing efficiency protocol of the PAS Model 6600 with three methods: first, the subject holds the handle of the PAS with both hands and attaches the mask tightly to the face (Method 1); second, the subject holds the handle with one hand and pushes the body of the PAS firmly with the other hand (Method 2); and third, the subject holds the handle with both hands while the tester pushes the upper part of the mask against the subject's face (Method 3). In males, the mean negative pressure, the mean phonogram, subglottic air pressure, and voicing efficiency differed significantly depending on the method (p<.05). In females, only the target airflow rate differed significantly depending on the method (p<.001). In conclusion, in males Method 2 enhanced the noise level and strength, while Method 1 was more likely to leak air than the other two methods. In females, Method 1 showed significant air leakage. To prevent air leakage without affecting the outcome of the PAS test, it is most useful for the tester to push the mask tightly against the subject's face (Method 3).
A Study on the Vowel Duration of the Buckeye Corpus
Chung, Hyejung ; Yoon, Kyuchul ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 103~110
DOI : 10.13064/KSSS.2015.7.4.103
The purpose of this study is to assess vowel properties by examining the duration of the American English vowels found in the Buckeye corpus. The vowel durations were analyzed in terms of various linguistic factors, including the number of syllables of the word containing the vowel, the location of the vowel in the word, types of stress, function versus content word, the word frequency in the corpus, and the speech rate calculated from three consecutive words. The findings from this work agreed mostly with those from earlier studies, but with some exceptions. The relationship between the speech rate and the vowel duration proved non-linear.
Perceptual Boundary on a Synthesized Korean Vowel /o/-/u/ Continuum by Chinese Learners of Korean Language
Yun, Jihyeon ; Kim, EunKyung ; Seong, Cheoljae ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 111~121
DOI : 10.13064/KSSS.2015.7.4.111
The present study examines the auditory boundary between Korean /o/ and /u/ on a synthesized vowel continuum as perceived by Chinese learners of Korean. Previous studies have reported that Chinese learners have difficulty pronouncing the Korean monophthongs /o/ and /u/. In this experiment, a nine-step continuum was resynthesized using Praat from a vowel token recorded in isolation by a male announcer. F1 and F2 were synchronously shifted in equal steps in qtone (quarter tone), while F3 and F4 values were held constant across all stimuli. A forced choice identification task was performed by advanced learners who speak Mandarin Chinese as their native language. Their experimental data were compared with those of a native Korean group. ROC (Receiver Operating Characteristic) analysis and logistic regression were performed to estimate the perceptual boundary. The results indicated that the learner group has a different auditory criterion on the continuum from the native Korean group. This suggests that more importance should be placed on hearing and listening training in order to acquire the phoneme categories of the two vowels.
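The equal quarter-tone stepping of F1 and F2 described above can be sketched numerically. A quarter tone is 1/24 of an octave, so a shift of n quarter tones multiplies a frequency by 2^(n/24). Everything else in the sketch is an assumption for illustration: the starting formant values, the step size of one quarter tone, the downward direction, and the function names are hypothetical, and the Praat resynthesis itself is not shown.

```python
def shift_by_quarter_tones(freq_hz, n_qtones):
    """Shift a frequency by n quarter tones (24 quarter tones per octave)."""
    return freq_hz * 2.0 ** (n_qtones / 24.0)


def build_continuum(f1_hz, f2_hz, steps=9, qtones_per_step=-1.0):
    """Return (F1, F2) pairs shifted synchronously in equal quarter-tone steps.

    Assumptions: both formants move by the same number of quarter tones
    per step, and the step size and direction are illustrative only.
    """
    return [
        (
            shift_by_quarter_tones(f1_hz, i * qtones_per_step),
            shift_by_quarter_tones(f2_hz, i * qtones_per_step),
        )
        for i in range(steps)
    ]


# Hypothetical starting formants for the first stimulus of the continuum.
continuum = build_continuum(400.0, 800.0)
for step, (f1, f2) in enumerate(continuum, start=1):
    print(f"step {step}: F1={f1:.1f} Hz, F2={f2:.1f} Hz")
```

Because the steps are equal on a logarithmic scale, the ratio between neighboring stimuli is constant rather than the difference in Hz.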
Gender difference in the sound change of lexical pitch accents of South Kyungsang Korean
Lee, Hyunjung ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 123~130
DOI : 10.13064/KSSS.2015.7.4.123
Given a recent finding that female speakers of South Kyungsang Korean are undergoing a sound change in the lexical pitch accent, this study tested whether the change is also reflected in male speech. This study compared F0 scaling and timing properties of accent words produced by younger female and male speakers of South Kyungsang Korean. The results indicated clear gender-related differences, showing more distinct acoustic properties across the accent words in male production than in female production. Despite the better distinction, however, younger male speakers showed peak delay, in which the F0 peaks are located further to the right than in conservative speakers' production. Therefore, it might be suggested that younger male speakers' accent productions lie between conservative and innovative phonetic forms.
An Analysis of Short and Long Syllables of Sino-Korean Words Produced by College Students with Kyungsang Dialect
Yang, Byunggon ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 131~138
DOI : 10.13064/KSSS.2015.7.4.131
The initial syllables of a pair of Sino-Korean words are generally differentiated in their meaning by either short or long durations. They are realized differently depending on the dialect and generation of the speakers. Recent research has reported that the temporal distinction has gradually faded away. The aim of this study is to examine, using the statistical method of Mixed Effects Models, whether college students with Kyungsang dialect make the distinction temporally. Thirty students participated in the recording of five pairs of Korean words in clear or casual speaking styles. The author then measured the durations of the initial syllables of the words and made a descriptive analysis of the data, followed by applying Mixed Effects Models with gender, length, and style as fixed effects and subject and syllable as random effects, and tested their effects on the initial syllable durations. Results showed, first, that college students with Kyungsang dialect did not produce the long and short syllables distinctively, with no statistically significant difference between them. Secondly, there was a significant difference in the duration of the initial syllables between male and female students. Thirdly, there was also a significant difference in the duration of the initial syllables produced in the clear and casual styles. The author concluded that college students with Kyungsang dialect do not produce long and short Sino-Korean syllables distinctively, and that any statistical analysis of the temporal aspect should be made carefully, considering both fixed and random effects. Further studies would be desirable to examine production and perception of the initial syllables by speakers of various dialects, generations, and age groups.
Effects of syllable structure and prominence on the alignment and the scaling of the phrase-initial rising tone in Seoul Korean: A preliminary study
Kim, Sahyang ;
Phonetics and Speech Sciences, volume 7, issue 4, 2015, Pages 139~145
DOI : 10.13064/KSSS.2015.7.4.139
The present study investigates the effects of syllable structure and prosodic prominence on the patterns of tonal alignment and scaling of the phrase-initial rise in Seoul Korean. Two syllable structures (Onset (/#CVC.../ as in minsa) vs. No-onset (/#VC.../ as in insa)) and two prominence conditions (Focus vs. Neutral) were considered. Results showed that the alignment of the L and the H tones in the phrase-initial rise was affected by syllable structure but not by prominence. The time of L was before the vowel onset of the first syllable in the Onset condition (i.e., within the onset consonant) and after the vowel onset in the No-onset condition. The difference was attributable to the fact that the initial L was anchored at a fixed distance from the phrase boundary, about 30 ms after the onset of the syllable in both cases. The time of H was also consistently observed about 20 ms after the second vowel onset (i.e., /a/ in minsa/insa). Moreover, the rise time (the duration from the L to the H tone) became longer as the local syllable duration became longer due to different syllable structure and prominence conditions. Taken together, the results provide support for the segmental anchoring hypothesis, which claims that both the beginning and the end of F0 movement are consistently aligned with segmental 'anchor' points with relatively high stability (Ladd et al., 1999). Results also showed that the scaling of the early rise was slightly influenced by syllable structure but not by prominence. The differences between the results of the current study and a previous study (Cho, 2011) are further discussed.