• Title/Summary/Keyword: t-SNE

A review on the t-distributed stochastic neighbors embedding (t-SNE에 대한 요약)

  • Kipoong Kim;Choongrak Kim
    • The Korean Journal of Applied Statistics / v.36 no.2 / pp.167-173 / 2023
  • This paper reviews several methods for visualizing high-dimensional data in a low-dimensional space. First, principal component analysis and multidimensional scaling are briefly introduced as linear approaches, and then kernel principal component analysis, self-organizing map, locally linear embedding, Isomap, Laplacian Eigenmaps, and local multidimensional scaling are introduced as nonlinear approaches. In particular, t-SNE, which is widely used but relatively unfamiliar in the field of statistics, is described in more detail. We also present a simple example comparing several of these methods, including t-SNE. Finally, we review several recent studies that point out the limitations of t-SNE and discuss the future research problems they raise.

Violation Pattern Analysis for Good Manufacturing Practice for Medicine using t-SNE Based on Association Rule and Text Mining (우수 의약품 제조 기준 위반 패턴 인식을 위한 연관규칙과 텍스트 마이닝 기반 t-SNE분석)

  • Jun-O, Lee;So Young, Sohn
    • Journal of Korean Society for Quality Management / v.50 no.4 / pp.717-734 / 2022
  • Purpose: The purpose of this study is to effectively detect simultaneous violations of Good Manufacturing Practice that were concealed by drug manufacturers. Methods: In this study, we present a framework for analyzing regulatory violation patterns using Association Rule Mining (ARM), Text Mining, and t-distributed Stochastic Neighbor Embedding (t-SNE) to increase the effectiveness of on-site inspection. Results: A number of simultaneous violation patterns were discovered by applying Association Rule Mining to FDA's inspection data collected from October 2008 to February 2022. Among them were 'concurrent violation patterns' derived from similar regulatory ranges of two or more regulations. These patterns do not help to predict violations that appear simultaneously but belong to different regulations. Those unnecessary patterns were excluded by applying t-SNE based on text mining. Conclusion: Our proposed approach enables the recognition of simultaneous violation patterns during on-site inspection. It is expected to decrease detection time by increasing the likelihood of finding intentionally concealed violations.

Cluster Analysis of Daily Electricity Demand with t-SNE

  • Min, Yunhong
    • Journal of the Korea Society of Computer and Information / v.23 no.5 / pp.9-14 / 2018
  • For efficient management of the electricity market and power systems, accurate forecasts of electricity demand are essential. Since many factors, known or unknown, determine the realized loads, it is difficult to forecast demand from the past time series alone. In this paper we perform a cluster analysis on electricity demand data collected from Jan. 2000 to Dec. 2017. We cluster the demand data in the expectation that each cluster will consist of data whose latent variables take the same or similar values. If the data are properly clustered, it is then possible to develop an accurate forecasting model for each cluster separately. To validate the feasibility of this approach for building better forecasting models, we clustered the data with t-SNE. To apply t-SNE to time series data effectively, we adopt dynamic time warping as a similarity measure. From the experimental results, we found that several clusters are clearly observed and that each cluster can be interpreted as a mix of well-known factors, such as trends, seasonality, and holiday effects, and other unknown factors. These findings can motivate approaches that build a forecasting model for each cluster independently.
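
The use of dynamic time warping (DTW) as a similarity measure mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the absolute-difference local cost, the function names `dtw_distance` and `pairwise_dtw`, and the toy `profiles` are all assumptions made here. The resulting pairwise distance matrix is the kind of input that a t-SNE implementation accepting precomputed distances could consume.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference local cost (an assumption)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def pairwise_dtw(series):
    """Symmetric matrix of DTW distances between all pairs of series."""
    k = len(series)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(series[i], series[j])
    return dist


# Hypothetical toy load profiles (not data from the paper)
profiles = [
    [1.0, 2.0, 3.0, 2.0, 1.0],
    [1.0, 1.0, 2.0, 3.0, 2.0, 1.0],  # same shape as the first, shifted in time
    [3.0, 3.0, 3.0, 3.0, 3.0],       # flat profile
]
D = pairwise_dtw(profiles)
```

Because DTW allows elastic alignment, the first two profiles, which differ only by a time shift, end up at distance zero, while the flat profile stays far from both.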

Extra-tidal stars around globular clusters NGC 5024 and NGC 5053 and their chemical abundances

  • Chun, Sang-Hyun;Lee, Jae-Joon
    • The Bulletin of The Korean Astronomical Society / v.43 no.2 / pp.40.2-40.2 / 2018
  • NGC 5024 and NGC 5053 are among the most metal-poor globular clusters in the Milky Way. Both globular clusters are considered to have been accreted from dwarf galaxies (such as the Sagittarius dwarf galaxy or the Magellanic Clouds), and a common stellar envelope and tidal tails between the clusters have also been detected. We present a search for extra-tidal cluster member candidates around these globular clusters using APOGEE survey data. Using 20 chemical elements (e.g., Fe, C, Mg, Al) and radial velocities, we explored t-distributed stochastic neighbour embedding (t-SNE), which identifies an optimal mapping of a high-dimensional space into fewer dimensions, and we find that globular cluster stars are well separated from the field stars in the 2-dimensional t-SNE map. We also find that some stars selected in the t-SNE map lie outside the tidal radius of the clusters. The proper motion of the stars outside the tidal radius is also comparable to that of the globular clusters, which suggests that these stars are tidally decoupled from the globular clusters. We manually measure chemical abundances for the clusters and extra-tidal stars, and discuss the association of the extra-tidal stars with the clusters.

Physiological Signal-Based Emotion Recognition in Conversations Using T-SNE (생체신호 기반의 T-SNE 를 활용한 대화 내 감정 인식 )

  • Subeen Leem;Byeongcheon Lee;Jihoon Moon
    • Proceedings of the Korea Information Processing Society Conference / 2023.05a / pp.703-705 / 2023
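  • This study proposes a more accurate and more broadly applicable technique for emotion recognition that uses physiological signal data recorded during conversations. To this end, we first equalize the number of measurements across conversations of different lengths. Then, to compare and analyze effective combinations of physiological signals, we use the dimensionality reduction technique T-SNE (T-distributed Stochastic Neighbor Embedding) to examine the distribution of emotion labels. In addition, we apply AutoML (Automated Machine Learning) to the reduced data to classify emotions and to predict arousal and valence, and thereby find the combination of physiological signals that recognizes emotion best.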

Phenolic Composition, Fermentation Profile, Protozoa Population and Methane Production from Sheanut (Butryospermum Parkii) Byproducts In vitro

  • Bhatta, Raghavendra;Mani, Saravanan;Baruah, Luna;Sampath, K.T.
    • Asian-Australasian Journal of Animal Sciences / v.25 no.10 / pp.1389-1394 / 2012
  • Sheanut cake (SNC), expeller (SNE) and solvent extraction (SNSE) samples were evaluated to determine their suitability for animal feeding. The CP content was highest in SNSE (16.2%) followed by SNE (14.7%) and SNC (11.6%). However, metabolizable energy (ME, MJ/kg) was maximum in SNC (8.2) followed by SNE (7.9) and SNSE (7.0). The tannin phenol content was about 7.0 per cent and mostly in the form of hydrolyzable tannin (HT), whereas condensed tannin (CT) was less than one per cent. The in vitro gas production profiles indicated similar $y_{max}$ (maximum potential of gas production) among the 3 by-products. However, the rate of degradation (k) was maximum in SNC followed by SNE and SNSE. The $t_{1/2}$ (time taken to reach half the asymptote) was lowest in SNC (14.4 h) followed by SNE (18.7 h) and SNSE (21.9 h). The increment in the in vitro gas volume (ml/200 mg DM) with the addition of PEG (polyethylene glycol)-6000 (as a tannin binder) was 12.0 in SNC, 9.6 in SNE and 11.0 in SNSE, respectively. The highest ratio of $CH_4$ (ml) reduction per ml of the total gas, an indicator of the potential of tannin, was recorded in SNE (0.482) followed by SNC (0.301) and SNSE (0.261). There was a significant (p<0.05) reduction in the entodinia population and the total protozoa population. Differential protozoa counts revealed that Entodinia populations increased to a greater extent than Holotricha when PEG was added. This is the first report on the antimethanogenic property of sheanut byproducts. It could be concluded that all three forms of SN byproducts are medium sources of protein and energy for ruminants. There is great potential for SN by-products to be incorporated in ruminant feeding not only as a source of energy and protein, but also to protect the protein from rumen degradation and suppress enteric methanogenesis.

Research Trends of Ergonomics in Occupational Safety and Health through MEDLINE Search: Focus on Abstract Word Modeling using Word Embedding (MEDLINE 검색을 통한 산업안전보건 분야에서의 인간공학 연구동향 : 워드임베딩을 활용한 초록 단어 모델링을 중심으로)

  • Kim, Jun Hee;Hwang, Ui Jae;Ahn, Sun Hee;Gwak, Gyeong Tae;Jung, Sung Hoon
    • Journal of the Korean Society of Safety / v.36 no.5 / pp.61-70 / 2021
  • This study aimed to analyze research trends in the abstracts of ergonomic studies registered in MEDLINE, a medical bibliographic database, using word embedding. Medical-related ergonomic studies mainly focus on work-related musculoskeletal disorders, and there are no studies that analyze words as data using natural language processing techniques such as word embedding. In this study, the abstract data of ergonomic studies were extracted with a program written in Python using the selenium and BeautifulSoup modules. Word embedding of the abstract data was performed using the word2vec model, which vectorized the words found in the abstracts. The vectorized data were visualized in two dimensions using t-Distributed Stochastic Neighbor Embedding (t-SNE). The word "ergonomics" and ten of the most frequently used words in the abstracts were selected as keywords. The results revealed that the most frequently used words in the abstracts of ergonomics studies include "use", "work", and "task". In addition, the t-SNE technique revealed that words such as "workplace", "design", and "engineering" exhibited the highest relevance to ergonomics. The keywords observed in the abstracts of ergonomic studies using t-SNE were classified into four groups. Ergonomics studies registered with MEDLINE have investigated the risk factors associated with workers performing an operation or task using tools, and in this study, ergonomics studies were characterized by the relationships between keywords using word embedding. The results of this study will provide useful and diverse insights into future research directions for ergonomic studies.

Decision support system for underground coal pillar stability using unsupervised and supervised machine learning approaches

  • Kamran, Muhammad;Shahani, Niaz Muhammad;Armaghani, Danial Jahed
    • Geomechanics and Engineering / v.30 no.2 / pp.107-121 / 2022
  • Coal pillar assessment is of broad importance to underground engineering structures, as pillar failure can lead to enormous disasters. Because of the highly non-linear correlation between pillar failure and its influential attributes, conventional forecasting techniques cannot generate accurate outcomes. To approximate the complex behavior of coal pillars, this paper presents a new approach to forecasting underground coal pillar stability using combined unsupervised-supervised learning. To build the study database, a total of 90 pillar cases were collected from real engineering structures. A state-of-the-art dimensionality reduction method, t-distributed stochastic neighbor embedding (t-SNE), was employed to reduce the dimensionality of the original data features. Next, an unsupervised machine learning technique, K-means clustering, was applied to the t-SNE dimensionality-reduced data in order to assign a relative class to each coal pillar case. Following that, the reassigned dataset was divided into two parts: 70 percent for the training dataset and 30 percent for the testing dataset. The accuracy of the predicted data was then examined using support vector classifier (SVC) model performance measures such as precision, recall, and f1-score. As a result, the proposed model can be employed for properly predicting the pillar failure class in a variety of underground rock engineering projects.
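
The sequence of steps in the abstract above (t-SNE reduction, K-means class assignment, a 70/30 split, and SVC evaluation with precision, recall, and f1-score) might look roughly like the following scikit-learn sketch. Everything specific here is an assumption made for illustration: the synthetic 90 x 5 data stand in for the pillar cases, and the number of clusters, the perplexity, the default SVC settings, and the choice to train the classifier on the t-SNE coordinates are not stated in the abstract.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for 90 pillar cases with 5 features (NOT the paper's data)
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [6.0] * 5, [-6.0] * 5])
X = np.vstack([c + rng.normal(size=(30, 5)) for c in centers])

# Step 1: reduce to two dimensions with t-SNE (perplexity is an assumed value)
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)

# Step 2: assign a class to each case by K-means on the embedding (3 clusters assumed)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)

# Step 3: 70 percent training, 30 percent testing
X_train, X_test, y_train, y_test = train_test_split(
    embedding, labels, test_size=0.3, random_state=0
)

# Step 4: fit a support vector classifier and report precision, recall, f1-score
clf = SVC().fit(X_train, y_train)
report = classification_report(y_test, clf.predict(X_test), zero_division=0)
print(report)
```

Note that the class labels in such a pipeline come from the clustering itself, so the reported scores measure how well the classifier reproduces the K-means partition rather than agreement with independently observed outcomes.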

A study on intrusion detection performance improvement through imbalanced data processing (불균형 데이터 처리를 통한 침입탐지 성능향상에 관한 연구)

  • Jung, Il Ok;Ji, Jae-Won;Lee, Gyu-Hwan;Kim, Myo-Jeong
    • Convergence Security Journal / v.21 no.3 / pp.57-66 / 2021
  • As the detection performance of deep learning and machine learning in the intrusion detection field has been verified, their use is increasing day by day. However, it is difficult to collect the data required for learning, and it is difficult to carry machine learning performance over to real settings because the collected data are imbalanced. Therefore, this paper proposes a mixed sampling technique using t-SNE visualization for imbalanced data processing as a solution to this problem. To do this, intrusion detection events, including the payload, are separated into fields according to their characteristics, and TF-IDF-based features are extracted for the separated fields. After the mixed sampling technique is applied to the extracted features, a data set optimized for intrusion detection with imbalanced data is obtained through data visualization using t-SNE. Nine sampling techniques were applied to the open intrusion detection dataset CSIC2012, and it was verified through the F-score and G-mean evaluation indicators that the proposed sampling technique improves detection performance.
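
The two evaluation indicators named in the abstract above can be computed from a binary confusion matrix as in the sketch below. The abstract does not give its exact definitions, so this assumes two common ones: the F-score as the F1 harmonic mean of precision and recall, and the G-mean as the geometric mean of sensitivity and specificity. The function name `imbalance_metrics` and the example counts are hypothetical.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """F1 and G-mean from confusion-matrix counts (assumed definitions, see lead-in)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity / true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    g_mean = math.sqrt(recall * specificity)
    return {"precision": precision, "recall": recall, "f1": f1, "g_mean": g_mean}


# Hypothetical imbalanced case: 10 attack events among 100 total events
scores = imbalance_metrics(tp=8, fp=2, tn=88, fn=2)
print(scores)
```

Both indicators are preferred over plain accuracy on imbalanced data because a classifier that always predicts the majority class gets a recall of zero, which drives the F1 and the G-mean to zero.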

Research Trends in Record Management Using Unstructured Text Data Analysis (비정형 텍스트 데이터 분석을 활용한 기록관리 분야 연구동향)

  • Deokyong Hong;Junseok Heo
    • Journal of Korean Society of Archives and Records Management / v.23 no.4 / pp.73-89 / 2023
  • This study aims to analyze the frequency of keywords used in Korean abstracts, which are unstructured text data in the domestic record management research field, using text mining techniques to identify domestic record management research trends through distance analysis between keywords. To this end, 1,157 keywords of 77,578 journals were visualized by extracting 1,157 articles from 7 journal types (28 types) searched by major category (complex study) and middle category (literature informatics) from the institutional statistics (registered site, candidate site) of the Korean Citation Index (KCI). Analysis of t-Distributed Stochastic Neighbor Embedding (t-SNE) and Scattertext using Word2vec was performed. As a result of the analysis, first, it was confirmed that keywords such as "record management" (889 times), "analysis" (888 times), "archive" (742 times), "record" (562 times), and "utilization" (449 times) were treated as significant topics by researchers. Second, Word2vec analysis generated vector representations between keywords, and similarity distances were investigated and visualized using t-SNE and Scattertext. In the visualization results, the research area for record management was divided into two groups, with keywords such as "archiving," "national record management," "standardization," "official documents," and "record management systems" occurring frequently in the first group (past). On the other hand, keywords such as "community," "data," "record information service," "online," and "digital archives" in the second group (current) were garnering substantial focus.