A Comparative Analysis of the Pre-Processing in the Kaggle Titanic Competition

  • Tai-Sung, Hur;Suyoung, Bang
    • Journal of the Korea Society of Computer and Information / v.28 no.3 / pp.17-24 / 2023
  • Using 'Titanic - Machine Learning from Disaster', a representative competition on Kaggle, a platform that poses and solves data science challenges, we examine how data preprocessing and model construction affect prediction accuracy and score. We select seven top-ranked, high-scoring solutions, excluding those that use redundant models or ensemble techniques, and compare and analyze their features. We found that most of the preprocessing approaches have unique, differentiated characteristics, and that even where the preprocessing was almost the same, scores differed depending on the type of model. By clarifying the characteristics and analysis flow of the top scorers' preprocessing methods, this comparative analysis is expected to help Kaggle competition participants and data science beginners.

PLS Path Modeling to Investigate the Relations between Competencies of Data Scientist and Big Data Analysis Performance : Focused on Kaggle Platform (데이터 사이언티스트의 역량과 빅데이터 분석성과의 PLS 경로모형분석 : Kaggle 플랫폼을 중심으로)

  • Han, Gyeong Jin;Cho, Keuntae
    • Journal of Korean Institute of Industrial Engineers / v.42 no.2 / pp.112-121 / 2016
  • This paper focuses on the competencies and behavioral intention of data scientists that affect big data analysis performance. We examined nine core factors required of data scientists. To investigate them, we surveyed 103 data scientists who had participated in big data competitions on the Kaggle platform and analyzed the responses using factor analysis and PLS-SEM. The results show that some key competency factors have a significant effect on big data analysis performance. By analyzing the structural relationship between individual competencies and performance, this study aims to provide a new theoretical basis for related research and, in practical terms, to identify the priorities among the core competencies that data scientists must have.

Improvement Method of Classification Rate in ML Antivirus systems using Kaggle Datasets (캐글 데이터셋을 이용한 머신러닝 악성코드 분류시스템에서 분류정확도 향상방법)

  • Kim, Kyungshin
    • Proceedings of the Korean Society of Computer Information Conference / 2019.07a / pp.49-52 / 2019
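  • Most machine-learning-based malware classification systems measure classification accuracy using the 10,868 samples of the Kaggle dataset. The virus bytecode in this dataset contains an excessive number of undefined fields. For certain labels in the Kaggle dataset, the proportion of undefined fields exceeds 75%. In such cases, how the undefined fields are handled has the greatest impact on system performance. This study proposes a method for handling the undefined fields in the Kaggle dataset and examines the resulting classification accuracy. By measuring accuracy across a variety of handling methods, we demonstrate the validity of the proposed approach.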

Multiple image classification using label mapping (레이블 매핑을 이용한 다중 이미지 분류)

  • Jeon, Seung-Je;Lee, Dong-jun;Lee, DongHwi
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.05a / pp.367-369 / 2022
  • In this paper, we implement multi-class image classification and check the predicted results by label mapping for each class, so that accurate results can be confirmed for images that the trained model failed to classify. A CNN model was built and trained on Kaggle's Intel Image Classification dataset. The images of the test dataset were label-mapped, and the mapped label values for the multiple classes were compared with the values classified by the model.

Image Scene Classification of Multiclass (다중 클래스의 이미지 장면 분류)

  • Shin, Seong-Yoon;Lee, Hyun-Chang;Shin, Kwang-Seong;Kim, Hyung-Jin;Lee, Jae-Wan
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.10a / pp.551-552 / 2021
  • In this paper, we present a multi-class image scene classification method based on transfer learning. We classify multiple classes of natural scene images using network models pre-trained on the large ImageNet image dataset. In the experiment, the optimized ResNet model obtained excellent results on Kaggle's Intel Image Classification dataset.

Analysis of YouTube Trending Video Dataset by Country and Category (YouTube 인기 급상승 동영상 데이터셋의 국가별-카테고리별 분석)

  • Jung, Jimin;Kim, Seungjin;Jung, Sungwook;Lee, Dongyun
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.05a / pp.209-211 / 2022
  • YouTube, a video platform used by millions of people worldwide, provides a rapidly growing video service. This study aims to understand the characteristics and cultural differences of each country using a Kaggle dataset, one of the available public datasets, and to show the usefulness of public datasets. To this end, we analyze data covering 11 countries, 15 categories, and about 1.1 million trending videos. Using Python, we compute the number of videos per category, the period for which videos remain selected as trending, and the ratio of unique videos. In future work, we plan to use machine learning to predict the likelihood and duration of selection, to help diagnose individual videos and establish channel operation plans and strategies.

On Building the Solar Dataset Form using the Kaggle Platform: The applicability of Machine Learning (캐글 플랫폼 활용한 태양광 데이터셋 형태 구축: 머신 러닝의 적용 가능성)

  • Ko, Ju-won;Park, Jung-jin;Park, Jin-woo;Oh, Do-hee;Kim, Mincheol
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.05a / pp.255-258 / 2022
  • As environmental pollution continues, attention to renewable energy has been rising steadily. Although various kinds of renewable energy, such as solar, wind, and biomass energy, are generated in Jeju, cases of opening and analyzing the related data seem insufficient. This study is therefore being conducted to identify the variables most strongly related to solar panel output and to understand which machine learning methods can be applied to solar power generation data, using the Kaggle platform, which is actively used by many scientists. Based on this investigation of applicable machine learning methods, we plan to propose a form for a solar power generation dataset. Specifically, by analyzing solar power generation data on the Kaggle platform, this study will suggest improvements to the collection of solar power data in Jeju. The study is expected to be used in data analysis for developing Jeju's solar power industry, and to reveal the room for improvement in Jeju's existing open datasets so that they can be constructed in a form suitable for machine learning and AI analytics. Through this process, we anticipate preparing a method to increase the efficiency of solar power generation.

Correlation Analysis of Airline Customer Satisfaction using Random Forest with Deep Neural Network and Support Vector Machine Model

  • Hong, Sang Hoon;Kim, Bumsu;Jung, Yong Gyu
    • International Journal of Internet, Broadcasting and Communication / v.12 no.4 / pp.26-32 / 2020
  • Airline customer evaluation data are plentiful, but they are insufficient for predicting customer satisfaction in practice. In particular, little has been done to verify the value of the data or to develop a customer satisfaction prediction model based on customer evaluation data. In this paper, we analyze airline customer satisfaction through a correlation analysis experiment on customer evaluation data provided by Google's Kaggle. Accuracy varied across three variable sets: all variables, the top 4 variables, and the top 8 variables with the highest correlation. To build an airline customer satisfaction prediction model, we apply these sets to three classification algorithms (Random Forest, SVM, and DNN) and conduct a classification experiment. The data are split into training and validation data at a ratio of 7:3. As a result, the DNN model showed the lowest accuracy at 86.4%, the SVM model reached 89%, and the Random Forest model showed the highest accuracy and performance at 95.7%.

Smart Mirror for Facial Expression Recognition Based on Convolution Neural Network (컨볼루션 신경망 기반 표정인식 스마트 미러)

  • Choi, Sung Hwan;Yu, Yun Seop
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.05a / pp.200-203 / 2021
  • This paper introduces a smart mirror technology that uses image classification, one of several artificial intelligence technologies, to recognize a person's facial expression and present it in a mirror. Five types of facial expression images are used for training. When someone looks at the smart mirror, the mirror recognizes that person's expression and shows the recognized result. The fer2013 dataset provided by Kaggle, which contains the faces of many people separated by facial expression, is used. For image classification, the network is trained using a convolutional neural network (CNN). The face is recognized and the result is presented on the smart mirror's screen using an embedded board such as the Raspberry Pi 4.

Image Classification Method Using Learning (학습을 이용한 영상 분류 방법)

  • Shin, Seong-Yoon;Lee, Hyun-Chang;Shin, Kwang-Seong
    • Proceedings of the Korean Society of Computer Information Conference / 2021.01a / pp.285-286 / 2021
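  • In this paper, we propose a multi-class image scene classification method based on transfer learning. Multi-class natural scene images were classified using a network model pre-trained on the large ImageNet image dataset. In the experiment, the optimized ResNet model was applied to Kaggle's Intel Image Classification dataset and obtained excellent results.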
