• Title/Summary/Keyword: privacy-preserving data publishing


Development of a Privacy-Preserving Big Data Publishing System in Hadoop Distributed Computing Environments

  • Kim, Dae-Ho; Kim, Jong Wook
    • Journal of Korea Multimedia Society / v.20 no.11 / pp.1785-1792 / 2017
  • Generally, big data contains sensitive information about individuals, and thus directly releasing it for public use may violate existing privacy requirements. Therefore, privacy-preserving data publishing (PPDP) has been actively researched as a way to share big data containing personal information for public use while protecting the privacy of individuals with minimal data modification. Recently, with the increasing demand for big data sharing in various areas, there is also growing interest in the development of software that supports privacy-preserving data publishing. Thus, in this paper, we develop a system that aims to effectively and efficiently support privacy-preserving data publishing. In particular, the system developed in this paper enables data owners to select an appropriate anonymization level by providing them with an information loss matrix. Furthermore, the developed system is able to achieve high performance in data anonymization by using distributed Hadoop clusters.
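
As a rough illustration of the anonymization-level selection described in this abstract, the sketch below (not the developed system; the loss measure, the "age" attribute, and its generalization hierarchy are assumptions for illustration) reports a simple information-loss figure for each candidate generalization level, which is the kind of feedback a data owner could use to pick a level.

```python
# A minimal sketch, not the developed system: report an information-loss figure
# for each candidate generalization level so a data owner can choose one.
# The loss measure (normalized interval width) and the hypothetical "age"
# hierarchy below are assumptions for illustration only.

def interval_loss(lo, hi, domain_lo, domain_hi):
    """Loss of a generalized numeric value: width of its interval relative to
    the full attribute domain (0 = exact value, 1 = fully suppressed)."""
    return (hi - lo) / (domain_hi - domain_lo)

# Hypothetical generalization hierarchy for an "age" attribute with domain 0-99.
levels = {
    0: (34, 34),   # original value
    1: (30, 39),   # ten-year band
    2: (20, 59),   # forty-year band
    3: (0, 99),    # fully generalized
}

for level, (lo, hi) in levels.items():
    print(f"level {level}: information loss = {interval_loss(lo, hi, 0, 99):.2f}")
```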

Models for Privacy-preserving Data Publishing : A Survey

  • Kim, Jongseon; Jung, Kijung; Lee, Hyukki; Kim, Soohyung; Kim, Jong Wook; Chung, Yon Dohn
    • Journal of KIISE / v.44 no.2 / pp.195-207 / 2017
  • In recent years, data are actively exploited in various fields. Hence, there is a strong demand for sharing and publishing data. However, sensitive information regarding people can breach the privacy of an individual. To publish data while protecting an individual's privacy with minimal information distortion, privacy-preserving data publishing (PPDP) has been explored. PPDP assumes various attacker models and has been developed according to privacy models, which are principles for protecting against privacy-breaching attacks. In this paper, we first present the concept of privacy-breaching attacks. Subsequently, we classify the privacy models according to these privacy-breaching attacks. We further clarify the differences and requirements of each privacy model.

A Solution to Privacy Preservation in Publishing Human Trajectories

  • Li, Xianming; Sun, Guangzhong
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.8 / pp.3328-3349 / 2020
  • With the rapid development of ubiquitous computing and location-based services (LBSs), human trajectory data and associated activities are increasingly easily recorded. Inappropriately publishing trajectory data may leak users' privacy. Therefore, we study publishing trajectory data while preserving privacy, denoted privacy-preserving activity trajectories publishing (PPATP). We propose S-PPATP to solve this problem. S-PPATP comprises three steps: modeling, algorithm design and algorithm adjustment. During modeling, two user models describe users' behaviors: one based on a Markov chain and the other based on the hidden Markov model. We assume a potential adversary who intends to infer users' privacy, defined as a set of sensitive information. An adversary model is then proposed to define the adversary's background knowledge and inference method. Additionally, privacy requirements and a data quality metric are defined for assessment. During algorithm design, we propose two publishing algorithms corresponding to the user models and prove that both algorithms satisfy the privacy requirement. Then, we perform a comparative analysis on utility, efficiency and speedup techniques. Finally, we evaluate our algorithms through experiments on several datasets. The experimental results verify that our proposed algorithms preserve users' privacy. We also test utility and discuss the privacy-utility tradeoff that real-world data publishers may face.
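
As a toy illustration of the kind of first-order Markov chain user model and adversary inference described in this abstract (a sketch under assumed locations and transition probabilities, not the paper's S-PPATP algorithms), the snippet below computes how much probability mass an adversary who knows the transition matrix would assign to a sensitive next location.

```python
# A minimal sketch under assumed toy data: locations are states of a first-order
# Markov chain, and an adversary with the transition matrix as background
# knowledge estimates how likely a user is to visit a sensitive location next.
# The locations, probabilities, and sensitive set are hypothetical.
transitions = {
    "home":   {"home": 0.1, "office": 0.7, "clinic": 0.2},
    "office": {"home": 0.6, "office": 0.3, "clinic": 0.1},
    "clinic": {"home": 0.9, "office": 0.1, "clinic": 0.0},
}
sensitive = {"clinic"}

def inference_risk(current_location):
    """Probability mass the adversary assigns to sensitive next locations."""
    return sum(p for loc, p in transitions[current_location].items()
               if loc in sensitive)

for loc in transitions:
    print(f"from {loc}: P(next location is sensitive) = {inference_risk(loc):.2f}")
```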

Efficient K-Anonymization Implementation with Apache Spark

  • Kim, Tae-Su; Kim, Jong Wook
    • Journal of the Korea Society of Computer and Information / v.23 no.11 / pp.17-24 / 2018
  • Today, we are living in the era of data and information. With the advent of the Internet of Things (IoT), the popularity of social networking sites, and the development of mobile devices, a large amount of data is being produced in diverse areas. The collection of such data generated in various areas is called big data. As the importance of big data grows, there has been a growing need to share big data containing information regarding individual entities. As big data contains sensitive information about individuals, directly releasing it for public use may violate existing privacy requirements. Thus, privacy-preserving data publishing (PPDP) has been actively studied to share big data containing personal information for public use while preserving the privacy of individuals. K-anonymity, which is the most popular method in the area of PPDP, transforms each record in a table such that at least k records have the same values for the given quasi-identifier attributes, and thus each record is indistinguishable from the other records in the same class. As the size of big data continues to grow, there is an increasing demand for methods that can efficiently anonymize vast amounts of data. Thus, in this paper, we develop an efficient k-anonymity method using the Spark distributed framework. Experimental results show that significant gains in processing time can be achieved with the developed method.
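
As a concrete reading of the k-anonymity condition stated in this abstract, the following PySpark sketch (not the paper's implementation; the column names, example rows, and value of k are assumptions) groups a table by its quasi-identifier attributes and checks that the smallest equivalence class contains at least k records.

```python
# A minimal sketch, not the paper's method: check whether a table satisfies
# k-anonymity by computing equivalence-class sizes with PySpark.
# "age_range" and "zip_prefix" are hypothetical quasi-identifier columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("k-anonymity-check").getOrCreate()

df = spark.createDataFrame(
    [("20-29", "134**", "flu"),
     ("20-29", "134**", "cold"),
     ("30-39", "135**", "asthma"),
     ("30-39", "135**", "flu")],
    ["age_range", "zip_prefix", "diagnosis"],
)

quasi_identifiers = ["age_range", "zip_prefix"]
k = 2

# Each distinct combination of quasi-identifier values is one equivalence class;
# the table is k-anonymous iff the smallest class holds at least k records.
class_sizes = df.groupBy(quasi_identifiers).count()
min_size = class_sizes.agg(F.min("count")).first()[0]
print(f"k-anonymous for k={k}:", min_size >= k)

spark.stop()
```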

A Study on Performing Join Queries over K-anonymous Tables

  • Kim, Dae-Ho; Kim, Jong Wook
    • Journal of the Korea Society of Computer and Information / v.22 no.7 / pp.55-62 / 2017
  • Recently, there has been an increasing need for the sharing of microdata containing information regarding an individual entity. As microdata usually contains sensitive information on an individual, releasing it directly for public use may violate existing privacy requirements. Thus, to avoid the privacy problems that occur through the release of microdata for public use, extensive studies have been conducted in the area of privacy-preserving data publishing (PPDP). The k-anonymity algorithm, which is the most popular method, guarantees that, for each record, there are at least k-1 other records included in the released data that have the same values for a set of quasi-identifier attributes. Given an original table, the corresponding k-anonymous table is obtained by generalizing each record in the table into an indistinguishable group, called an equivalence class, by replacing the specific values of the quasi-identifier attributes with more general values. However, query processing over the anonymized data is a very challenging task due to the generalized attribute values. In particular, the problem becomes more challenging with an equi-join query (the most common type of query in data analysis tasks) over k-anonymous tables, since with generalized attribute values it is hard to determine whether two records are joinable. Thus, to address this challenge, in this paper we develop a novel scheme that is able to effectively perform an equi-join between k-anonymous tables. The experimental results show that, through the proposed method, significant gains in accuracy over a naive scheme can be achieved.
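
To make the equi-join difficulty described in this abstract concrete, the sketch below (a naive overlap-based approach for illustration, not the paper's proposed scheme; the tables and column names are made up) treats two generalized join-key values as potentially joinable whenever their intervals overlap, which shows how spurious matches can arise.

```python
# A minimal sketch, not the paper's scheme: with generalized join keys, a naive
# approach can only say two records *may* join when their intervals overlap,
# which admits spurious matches. Tables and attribute names are hypothetical.

def parse_range(value):
    """Parse a generalized value such as '20-29' into an inclusive interval."""
    lo, hi = value.split("-")
    return int(lo), int(hi)

def may_join(gen_a, gen_b):
    """Two generalized join-key values may refer to the same original value
    only if their intervals overlap."""
    a_lo, a_hi = parse_range(gen_a)
    b_lo, b_hi = parse_range(gen_b)
    return a_lo <= b_hi and b_lo <= a_hi

# Hypothetical k-anonymous tables sharing a generalized join key "age".
table_r = [{"age": "20-29", "diagnosis": "flu"}]
table_s = [{"age": "25-34", "job": "engineer"}]

naive_join = [(r, s) for r in table_r for s in table_s
              if may_join(r["age"], s["age"])]
print(naive_join)  # matched on overlap, even though the true ages may differ
```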

Enhanced Regular Expression as a DGL for Generation of Synthetic Big Data

  • Kai, Cheng; Keisuke, Abe
    • Journal of Information Processing Systems / v.19 no.1 / pp.1-16 / 2023
  • Synthetic data generation is generally used in performance evaluation and function tests in data-intensive applications, as well as in various areas of data analytics, such as privacy-preserving data publishing (PPDP) and statistical disclosure limit/control. A significant amount of research has been conducted on tools and languages for data generation. However, existing tools and languages have been developed for specific purposes and are unsuitable for other domains. In this article, we propose a regular expression-based data generation language (DGL) for flexible big data generation. To achieve a general-purpose and powerful DGL, we enhanced the standard regular expressions to support the data domain, type/format inference, sequence and random generation, probability distributions, and resource reference. To efficiently implement the proposed language, we propose caching techniques for both the intermediate and database queries. We evaluated the proposed improvement experimentally.
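
As a very small illustration of regular-expression-driven data generation (not the proposed DGL and none of its enhancements; only character classes and {n} repetition are supported here, and the template is hypothetical), the following snippet generates synthetic strings from a pattern.

```python
# A minimal sketch, not the proposed DGL: generate synthetic strings from a
# regular-expression-like template. Only a tiny subset of syntax is supported:
# literal characters, [a-z]-style character classes, and {n} repetition.
import random
import re

def expand_class(cls):
    """Expand a class body such as 'A-Z0-9' into a list of characters."""
    chars = []
    i = 0
    while i < len(cls):
        if i + 2 < len(cls) and cls[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(cls[i]), ord(cls[i + 2]) + 1))
            i += 3
        else:
            chars.append(cls[i])
            i += 1
    return chars

def generate(pattern):
    """Generate one string matching the restricted pattern syntax above."""
    out = []
    # Each token is a character class with optional repetition, or a literal.
    for cls, rep, lit in re.findall(r"\[([^\]]+)\](?:\{(\d+)\})?|(.)", pattern):
        if cls:
            chars = expand_class(cls)
            out.append("".join(random.choice(chars) for _ in range(int(rep or 1))))
        else:
            out.append(lit)
    return "".join(out)

# Hypothetical template for a synthetic record id: three letters, a dash, four digits.
print(generate("[A-Z]{3}-[0-9]{4}"))
```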

Privacy Disclosure and Preservation in Learning with Multi-Relational Databases

  • Guo, Hongyu; Viktor, Herna L.; Paquet, Eric
    • Journal of Computing Science and Engineering / v.5 no.3 / pp.183-196 / 2011
  • There has recently been a surge of interest in relational database mining that aims to discover useful patterns across multiple interlinked database relations. It is crucial for a learning algorithm to explore the multiple inter-connected relations so that important attributes are not excluded when mining such relational repositories. However, from a data privacy perspective, it becomes difficult to identify all possible relationships between attributes from the different relations, considering a complex database schema. That is, seemingly harmless attributes may be linked to confidential information, leading to data leaks when building a model. Thus, we are at risk of disclosing unwanted knowledge when publishing the results of a data mining exercise. For instance, consider a financial database classification task to determine whether a loan is considered high risk. Suppose that we are aware that the database contains another confidential attribute, such as income level, that should not be divulged. One may thus choose to eliminate, or distort, the income level from the database to prevent potential privacy leakage. However, even after distortion, a learning model trained on the modified database may still accurately determine the income level values. It follows that the database is still unsafe and may be compromised. This paper demonstrates this potential for privacy leakage in multi-relational classification and illustrates how such potential leaks may be detected. We propose a method to generate a ranked list of subschemas that maintains the predictive performance on the class attribute, while limiting the disclosure risk, and predictive accuracy, of confidential attributes. We illustrate and demonstrate the effectiveness of our method against a financial database and an insurance database.
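
The leakage check described in this abstract can be made concrete with the following sketch (not the authors' subschema-ranking method; the data, attribute names, and classifier choice are assumptions): remove the confidential attribute, then measure how accurately a model trained on the remaining attributes can still predict it.

```python
# A minimal sketch, not the authors' method: after removing the confidential
# attribute, train a model on the remaining (published) attributes and use its
# accuracy in predicting the confidential attribute as a disclosure-risk signal.
# The synthetic "financial" data and column meanings are made up for illustration.
import random
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

random.seed(0)
# Hypothetical records in which loan_amount and account balance happen to be
# strongly correlated with the confidential income_level.
rows = []
for _ in range(200):
    income_level = random.randint(0, 2)                    # confidential attribute
    loan_amount = income_level * 10000 + random.randint(0, 5000)
    balance = income_level * 2000 + random.randint(0, 1000)
    rows.append((loan_amount, balance, income_level))

X = [(r[0], r[1]) for r in rows]   # published (non-confidential) attributes
y = [r[2] for r in rows]           # confidential attribute, removed before publishing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# High accuracy here signals that the published attributes still disclose the
# confidential attribute, i.e. the schema remains unsafe.
print("inference accuracy:", accuracy_score(y_test, model.predict(X_test)))
```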

Secure Healthcare Management: Protecting Sensitive Information from Unauthorized Users

  • Ko, Hye-Kyeong
    • International Journal of Internet, Broadcasting and Communication / v.13 no.1 / pp.82-89 / 2021
  • Recently, the importance of security for published documents has been increasing across applications. This paper deals with data publishing in which publishers must state the sensitive information that they need to protect. If a document containing such sensitive information is accidentally posted, users can apply common-sense reasoning to infer unauthorized information. Recent studies of peer-to-peer databases have examined the security of data belonging to various groups. In this paper, we propose a security framework that fundamentally blocks user inference about sensitive information that may be leaked by XML constraints and prevents sensitive information from leaking to general users. The proposed framework protects sensitive information from disclosure through encryption technology. Moreover, the proposed framework achieves query-view security without any of the three types of XML constraints. As a result, the proposed framework is mathematically proven to prevent leakage of user information through data inference better than the existing method.