• Title/Summary/Keyword: imputation

Search Result 243, Processing Time 0.021 seconds

Household, personal, and financial determinants of surrender in Korean health insurance

  • Shim, Hyunoo;Min, Jung Yeun;Choi, Yang Ho
    • Communications for Statistical Applications and Methods
    • /
    • v.28 no.5
    • /
    • pp.447-462
    • /
    • 2021
  • In insurance, the surrender rate is an important variable that threatens the sustainability of insurers and determines the profitability of the contract. Unlike other actuarial assumptions that determine the cash flow of an insurance contract, however, it is characterized by endogenous variables such as people's economic, social, and subjective decisions. Therefore, a microscopic approach is required to identify and analyze the factors that determine the lapse rate. Specifically, micro-level characteristics including the individual, demographic, microeconomic, and household characteristics of policyholders are necessary for the analysis. In this study, we select panel survey data of Korean Retirement Income Study (KReIS) with many diverse dimensions to determine which variables have a decisive effect on the lapse and apply the lasso regularized regression model to analyze it empirically. As the data contain many missing values, they are imputed using the random forest method. Among the household variables, we find that the non-existence of old dependents, the existence of young dependents, and employed family members increase the surrender rate. Among the individual variables, divorce, non-urban residential areas, apartment type of housing, non-ownership of homes, and bad relationship with siblings increase the lapse rate. Finally, among the financial variables, low income, low expenditure, the existence of children that incur child care expenditure, not expecting to bequest from spouse, not holding public health insurance, and expecting to benefit from a retirement pension increase the lapse rate. Some of these findings are consistent with those in the literature.

An Intelligent Framework for Feature Detection and Health Recommendation System of Diseases

  • Mavaluru, Dinesh
    • International Journal of Computer Science & Network Security
    • /
    • v.21 no.3
    • /
    • pp.177-184
    • /
    • 2021
  • All over the world, people are affected by many chronic diseases and medical practitioners are working hard to find out the symptoms and remedies for the diseases. Many researchers focus on the feature detection of the disease and trying to get a better health recommendation system. It is necessary to detect the features automatically to provide the most relevant solution for the disease. This research gives the framework of Health Recommendation System (HRS) for identification of relevant and non-redundant features in the dataset for prediction and recommendation of diseases. This system consists of three phases such as Pre-processing, Feature Selection and Performance evaluation. It supports for handling of missing and noisy data using the proposed Imputation of missing data and noise detection based Pre-processing algorithm (IMDNDP). The selection of features from the pre-processed dataset is performed by proposed ensemble-based feature selection using an expert's knowledge (EFS-EK). It is very difficult to detect and monitor the diseases manually and also needs the expertise in the field so that process becomes time consuming. Finally, the prediction and recommendation can be done using Support Vector Machine (SVM) and rule-based approaches.

Synthetic data generation by probabilistic PCA (주성분 분석을 활용한 재현자료 생성)

  • Min-Jeong Park
    • The Korean Journal of Applied Statistics
    • /
    • v.36 no.4
    • /
    • pp.279-294
    • /
    • 2023
  • It is well known to generate synthetic data sets by the sequential regression multiple imputation (SRMI) method. The R-package synthpop are widely used for generating synthetic data by the SRMI approaches. In this paper, I suggest generating synthetic data based on the probabilistic principal component analysis (PPCA) method. Two simple data sets are used for a simulation study to compare the SRMI and PPCA approaches. Simulation results demonstrate that pairwise coefficients in synthetic data sets by PPCA can be closer to original ones than by SRMI. Furthermore, for the various data types that PPCA applications are well established, such as time series data, the PPCA approach can be extended to generate synthetic data sets.

A Big Data-Driven Business Data Analysis System: Applications of Artificial Intelligence Techniques in Problem Solving

  • Donggeun Kim;Sangjin Kim;Juyong Ko;Jai Woo Lee
    • The Journal of Bigdata
    • /
    • v.8 no.1
    • /
    • pp.35-47
    • /
    • 2023
  • It is crucial to develop effective and efficient big data analytics methods for problem-solving in the field of business in order to improve the performance of data analytics and reduce costs and risks in the analysis of customer data. In this study, a big data-driven data analysis system using artificial intelligence techniques is designed to increase the accuracy of big data analytics along with the rapid growth of the field of data science. We present a key direction for big data analysis systems through missing value imputation, outlier detection, feature extraction, utilization of explainable artificial intelligence techniques, and exploratory data analysis. Our objective is not only to develop big data analysis techniques with complex structures of business data but also to bridge the gap between the theoretical ideas in artificial intelligence methods and the analysis of real-world data in the field of business.

Methods for screening time series data according to data quality and statistical status (품질 및 조건 기반 시계열 데이터 선별 활용 방법)

  • Moon, JaeWon;Yu, MiSeon;Oh, SeungTaek;Kum, SeungWoo;Hwang, JiSoo;Lee, JiHoon
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2022.01a
    • /
    • pp.399-402
    • /
    • 2022
  • 본 논문에서는 불완전한 시계열 데이터를 활용하기 전 데이터를 선별하여 활용하는 방법을 소개한다. 시계열 데이터의 품질은 수집 네트워크와 수집 기기의 시간적 변화와 같은 가변적 상황에 의존적이므로 불규칙적으로 이상 혹은 누락 데이터가 발생한다. 이때 에러를 포함하였다는 이유로 일괄적으로 데이터를 제거하여 활용하지 않거나, 혹은 누락 데이터의 구간을 조건 없이 복원하여 활용한다면 원하지 않는 결과를 초래할 수 있다. 제안하는 방법은 시계열 데이터의 구간에 대한 누락 데이터의 통계적 정보를 축출하고 이에 기반하여 활용 목적과 활용 가능한 품질의 기준에 부합하지 않는다면 활용 불가능한 데이터라고 판별하고 미리 분석 등의 데이터 활용 시 자동 제외하는 구조를 제안하고 실험하였다. 제안하는 방법은 활용 목적과 상황에 적응적으로 누락 값을 포함하는 데이터의 빠른 활용 판단이 가능하며 보다 나은 분석 결과를 얻을 수 있다.

  • PDF

Development of Truck Axle Load Estimation Model Using Weigh-In-Motion Data (WIM 자료를 활용한 화물차량의 축중량 추정 모형 개발에 관한 연구)

  • Oh, Ju Sam
    • KSCE Journal of Civil and Environmental Engineering Research
    • /
    • v.31 no.4D
    • /
    • pp.511-518
    • /
    • 2011
  • Truck weight data are essential for road infrastructure design, maintenance and management. WIM (Weigh-In-Motion) system provides highway planners, researchers and officials with statistical data. Recently high speed WIM data also uses to support a vehicle weight regulation and enforcement activities. This paper aims at developing axle load estimating models with high speed WIM data collected from national highway. We also suggest a method to estimate axle load using simple regression model for WIM system. The model proposed by this paper, resulted in better axle load estimation in all class of vehicle than conventional model. The developed axle load estimating model will used for on-going or re-calibration procedures to ensure an adequate level of WIM system performance. This model can also be used for missing axle load data imputation in the future.

Association of Genetic Polymorphism of IL-2 Receptor Subunit and Tuberculosis Case

  • Lee, Sang-In;Jin, Hyun-Seok;Park, Sangjung
    • Biomedical Science Letters
    • /
    • v.24 no.2
    • /
    • pp.94-101
    • /
    • 2018
  • Tuberculosis (TB) is infectious disease caused by Mycobacterium tuberculosis (MTB) infection. It is known that not only the property of microorganism but also the genetic susceptibility of infected patients is controlled. Interleukin 2 (IL-2) is a cytokine belonging to type 1 T helper (Th1) activity. In addition, IL-2, when infected with MTB, binds IL-2 receptor and promotes T cell replication and is involved in granuloma formation. The aim of this study was to investigate the genetic polymorphisms of the IL-2 receptor gene in tuberculosis patients and normal individuals. We analyzed 22 SNPs in three genes using the genotype data of 443 tuberculosis cases and 3,228 healthy controls from the Korea Association Resource for their correlation with tuberculosis case. IL2RA, IL2RB, and IL2RG genes were genotyped of 16, 4, and 2 SNPs, respectively. Among three genes, only IL2RA gene polymorphisms showed statistically significant association with tuberculosis case. 6 SNPs with high significance were identified in the IL2RA gene. In addition, the linkage disequilibrium (LD) structure of IL2RA gene was confirmed. SNP imputation of IL2RA gene was performed, it was confirmed that more SNPs were significant between case and control. If we look at the results of IL2RA gene analysis above, we can see that genetic polymorphism in the gene expressing $IL-2R{\alpha}$ will regulate the expression level of $IL-2R{\alpha}$, and the change in the immune system involved in $IL-2R{\alpha}$. In this study, genetic polymorphism that may affect host immunity suggests that susceptibility to tuberculosis may be controlled.

Performance Comparison of Two Gene Set Analysis Methods for Genome-wide Association Study Results: GSA-SNP vs i-GSEA4GWAS

  • Kwon, Ji-Sun;Kim, Ji-Hye;Nam, Doug-U;Kim, Sang-Soo
    • Genomics & Informatics
    • /
    • v.10 no.2
    • /
    • pp.123-127
    • /
    • 2012
  • Gene set analysis (GSA) is useful in interpreting a genome-wide association study (GWAS) result in terms of biological mechanism. We compared the performance of two different GSA implementations that accept GWAS p-values of single nucleotide polymorphisms (SNPs) or gene-by-gene summaries thereof, GSA-SNP and i-GSEA4GWAS, under the same settings of inputs and parameters. GSA runs were made with two sets of p-values from a Korean type 2 diabetes mellitus GWAS study: 259,188 and 1,152,947 SNPs of the original and imputed genotype datasets, respectively. When Gene Ontology terms were used as gene sets, i-GSEA4GWAS produced 283 and 1,070 hits for the unimputed and imputed datasets, respectively. On the other hand, GSA-SNP reported 94 and 38 hits, respectively, for both datasets. Similar, but to a lesser degree, trends were observed with Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets as well. The huge number of hits by i-GSEA4GWAS for the imputed dataset was probably an artifact due to the scaling step in the algorithm. The decrease in hits by GSA-SNP for the imputed dataset may be due to the fact that it relies on Z-statistics, which is sensitive to variations in the background level of associations. Judicious evaluation of the GSA outcomes, perhaps based on multiple programs, is recommended.

Estimation of Freeway Accident Likelihood using Real-time Traffic Data (실시간 교통자료 기반 고속도로 교통사고 발생 가능성 추정 모형)

  • Park, Joon-Hyung;Oh, Cheol;NamKoong, Seong
    • Journal of Korean Society of Transportation
    • /
    • v.26 no.2
    • /
    • pp.157-166
    • /
    • 2008
  • This study proposed a model to estimate traffic accident likelihood using real-time traffic data obtained from freeway traffic surveillance systems. Traffic variables representing spatio-temporal variations of traffic conditions were utilized as independent variables in the proposed models. Binary logistics regression modelings were conducted to correlate traffic variables and accident data that were collected from the Seohaean freeway during recent three years, from 2004 to 2006. To apply more reliable traffic variables, outlier filtering and data imputation were also performed. The outcomes of the model that are actually probabilistic measures of accident occurrence would be effectively utilized not only in designing warning information systems but also in evaluating the effectiveness of various traffic operations strategies in terms of traffic safety.

Exploiting Patterns for Handling Incomplete Coevolving EEG Time Series

  • Thi, Ngoc Anh Nguyen;Yang, Hyung-Jeong;Kim, Sun-Hee
    • International Journal of Contents
    • /
    • v.9 no.4
    • /
    • pp.1-10
    • /
    • 2013
  • The electroencephalogram (EEG) time series is a measure of electrical activity received from multiple electrodes placed on the scalp of a human brain. It provides a direct measurement for characterizing the dynamic aspects of brain activities. These EEG signals are formed from a series of spatial and temporal data with multiple dimensions. Missing data could occur due to fault electrodes. These missing data can cause distortion, repudiation, and further, reduce the effectiveness of analyzing algorithms. Current methodologies for EEG analysis require a complete set of EEG data matrix as input. Therefore, an accurate and reliable imputation approach for missing values is necessary to avoid incomplete data sets for analyses and further improve the usage of performance techniques. This research proposes a new method to automatically recover random consecutive missing data from real world EEG data based on Linear Dynamical System. The proposed method aims to capture the optimal patterns based on two main characteristics in the coevolving EEG time series: namely, (i) dynamics via discovering temporal evolving behaviors, and (ii) correlations by identifying the relationships between multiple brain signals. From these exploits, the proposed method successfully identifies a few hidden variables and discovers their dynamics to impute missing values. The proposed method offers a robust and scalable approach with linear computation time over the size of sequences. A comparative study has been performed to assess the effectiveness of the proposed method against interpolation and missing values via Singular Value Decomposition (MSVD). The experimental simulations demonstrate that the proposed method provides better reconstruction performance up to 49% and 67% improvements over MSVD and interpolation approaches, respectively.