• Title/Summary/Keyword: Data Similarity

Similarity Measure Design on High Dimensional Data

  • Nipon, Theera-Umpon;Lee, Sanghyuk
    • Journal of the Korea Convergence Society / v.4 no.1 / pp.43-48 / 2013
  • A similarity measure for high-dimensional data is designed. Similarity between high-dimensional data is evaluated by analysing neighbor information with respect to the data sets. The result can be applied to big data, because big data has multiple characteristics compared with a simple data set; the analysis of high-dimensional data can therefore serve as a preliminary study for big data. The high-dimensional analysis is also compared with conventional similarity measures: a traditional similarity measure on overlapped data is illustrated, and an application to non-overlapped data is carried out. The usefulness of the measure is established by mathematical proof and verified by computing the similarity for an artificial data example.

A Method of Reducing the Processing Cost of Similarity Queries in Databases (데이터베이스에서 유사도 질의 처리 비용 감소 방법)

  • Kim, Sunkyung;Park, Ji Su;Shon, Jin Gon
    • KIPS Transactions on Software and Data Engineering / v.11 no.4 / pp.157-162 / 2022
  • Today, most data is stored in databases (DBs). In a DB environment, users ask the DB to find the data they want. A similarity query has a predicate expressed in terms of a similarity. When processing a similarity query, however, it is difficult to use an index that narrows the range of records to be processed, so the similarity must be computed for every record in the table each time, which is costly. To solve this problem, this paper defines a lightweight similarity function. The lightweight similarity function filters data less accurately than the original similarity function, but it is cheaper to compute. We present a method for reducing similarity query processing cost by exploiting these properties. Chebyshev distance is then presented as a lightweight similarity function for the Euclidean distance function, and the processing cost of a query using the existing similarity function is compared with that of a query using the lightweight one. Experiments confirm that the similarity query processing cost is reduced when Chebyshev distance is applied as a lightweight similarity function for Euclidean similarity.
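
The filtering idea in this abstract can be sketched as a two-stage range query. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the in-memory list of tuples, and the filter-then-refine structure are all hypothetical. It relies only on the fact that Chebyshev distance never exceeds Euclidean distance, so a record whose Chebyshev distance is already larger than the radius cannot satisfy the Euclidean predicate.

```python
import math

def chebyshev(a, b):
    """Largest per-coordinate absolute difference (cheap: no squares, no root)."""
    return max(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Ordinary Euclidean distance (the more expensive similarity function)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(records, query, radius):
    """Return the records within `radius` of `query` in Euclidean distance.

    Assumed filter-and-refine scheme: Chebyshev <= Euclidean, so any record
    with Chebyshev distance > radius is discarded without the costly check.
    """
    hits = []
    for rec in records:
        if chebyshev(rec, query) > radius:
            continue  # rejected by the lightweight function
        if euclidean(rec, query) <= radius:
            hits.append(rec)  # survivors still need the exact check
    return hits

# Hypothetical two-dimensional records.
records = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.5), (0.5, 0.9)]
result = range_query(records, (0.0, 0.0), 1.2)
```

Here `(3.0, 0.5)` is rejected by the cheap test alone, while `(1.0, 1.0)` passes the cheap test but fails the exact one, which illustrates why the lightweight function filters less accurately than the full one.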

Transactions Clustering based on Item Similarity (아이템의 유사도를 고려한 트랜잭션 클러스터링)

  • 이상욱;김재련
    • Proceedings of the Korea Intelligent Information System Society Conference / 2002.11a / pp.250-257 / 2002
  • Clustering is a data mining method that discovers interesting data distributions in very large databases. In traditional data clustering, the similarity of a cluster of objects is measured by the pairwise similarity of the objects in that cluster. In view of the nature of clustering transactions, this paper devises a novel measurement called item similarity and uses it to perform clustering. With this item similarity measurement, we develop an efficient clustering algorithm for target marketing in each group.

Clustering method for similar user with Mixed Data in SNS

  • Song, Hyoung-Min;Lee, Sang-Joon;Kwak, Ho-Young
    • Journal of the Korea Society of Computer and Information / v.20 no.11 / pp.25-30 / 2015
  • The enormous growth of data driven by the development of information technology makes it hard for internet users to find information suited to their needs. In this changing environment, information filtering methods, which provide sorted-out information to users, are becoming important. Data on the internet exists in various types. However, the similarity calculation algorithms frequently used in existing collaborative filtering methods tend to suit numeric data. In addition, for categorical data they yield extreme similarity values, as in Boolean algebra. In this paper, we compute the similarity of SNS users' information, which consists of mixed data, using Gower's similarity coefficient. We also suggest a method that is softer than an all-or-nothing value of 0 or 1 for categorical data. The clustering method using this algorithm can be used in SNS or in various recommendation systems.
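
For orientation, the standard Gower similarity coefficient for mixed data can be sketched as below. This shows only the textbook form, in which a categorical field scores exactly 0 or 1; the softer categorical scoring that the abstract proposes is not reproduced, because the abstract does not specify it. The field names, sample users, and value ranges are made up for illustration.

```python
def gower_similarity(a, b, numeric_ranges):
    """Textbook Gower similarity of two records given as dicts.

    numeric_ranges maps each numeric field to (max - min) over the data set;
    any field missing from it is treated as categorical.
    """
    scores = []
    for key in a:
        if key in numeric_ranges:
            span = numeric_ranges[key]
            # Numeric field: 1 minus the range-normalised absolute difference.
            scores.append(1.0 - abs(a[key] - b[key]) / span if span else 1.0)
        else:
            # Categorical field: the all-or-nothing score the abstract criticises.
            scores.append(1.0 if a[key] == b[key] else 0.0)
    return sum(scores) / len(scores)

# Hypothetical SNS user profiles with mixed numeric and categorical fields.
u1 = {"age": 25, "posts": 100, "city": "Jeju", "gender": "F"}
u2 = {"age": 35, "posts": 60, "city": "Jeju", "gender": "M"}
ranges = {"age": 40, "posts": 200}
sim = gower_similarity(u1, u2, ranges)
```

With these assumed values, the per-field scores are 0.75, 0.8, 1 and 0, so the coefficient is their mean, 0.6375.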

Information Quantification Application to Management with Fuzzy Entropy and Similarity Measure

  • Wang, Hong-Mei;Lee, Sang-Hyuk
    • International Journal of Fuzzy Logic and Intelligent Systems / v.10 no.4 / pp.275-280 / 2010
  • Fuzzy entropy and a similarity measure for efficient data management are discussed and verified by applying them to a reliable data selection problem and to numerical data similarity evaluation. To calculate certainty or uncertainty, a fuzzy entropy and a similarity measure are designed and proved. The designed fuzzy entropy and similarity are regarded as a dissimilarity measure and a similarity measure, respectively, and the relation between the two measures is explained through graphical illustration. The obtained measures are useful for applications in decision theory and mutual information analysis problems. Extensions of the data quantification results based on the proposed measures are applicable to decision making and fuzzy game theory.
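
As a rough illustration of what a fuzzy entropy quantifies, the sketch below uses the well-known linear index of fuzziness, which is the normalised Hamming distance from a fuzzy set to its nearest crisp set, scaled to the interval [0, 1]. This is an assumed stand-in: the abstract does not give the entropy the authors designed, and the function name and sample values are hypothetical.

```python
def fuzziness(memberships):
    """Linear index of fuzziness of a fuzzy set given by membership degrees.

    Returns 0 for a crisp set (all degrees 0 or 1) and 1 when every degree
    is 0.5, the most uncertain case.
    """
    n = len(memberships)
    return 2.0 / n * sum(min(m, 1.0 - m) for m in memberships)

crisp = fuzziness([0.0, 1.0, 1.0, 0.0])      # no uncertainty
vague = fuzziness([0.5, 0.5, 0.5, 0.5])      # maximal uncertainty
mixed = fuzziness([0.2, 0.9, 0.5, 1.0])      # somewhere in between
```

A measure of this kind grows as membership degrees move away from 0 and 1, which is why it can be read as the uncertainty of the data.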

Similarity measurement based on Min-Hash for Preserving Privacy

  • Cha, Hyun-Jong;Yang, Ho-Kyung;Song, You-Jin
    • International Journal of Advanced Culture Technology / v.10 no.2 / pp.240-245 / 2022
  • Because of the importance of information, encryption algorithms are heavily used. Encrypted raw data is secure, but problems arise when the decryption key is exposed. In particular, large-scale Internet sites such as Facebook and Amazon suffer serious damage when user data is exposed. Recently, research into a new fourth-generation encryption technology that can protect user-related data without the use of a key required for encryption has been attracting attention, as has data clustering technology using encryption. In this paper, we try to reduce key exposure by using homomorphic encryption. In addition, we want to maintain privacy through similarity measurement. Exhaustive similarity measurement, as in previously studied methods, becomes time-consuming and expensive as the size and scope of the data increase. Min-Hash, by contrast, has been studied as a way to efficiently estimate the similarity between two sets from their signatures. Min-Hash is widely used for anti-plagiarism, graph and image analysis, and genetic analysis. Therefore, this paper addresses privacy using homomorphic encryption and presents a model for efficient similarity measurement using Min-Hash.
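
The Min-Hash estimate mentioned in this abstract can be sketched on plain (unencrypted) sets as follows. The homomorphic encryption part of the paper's model is not shown, and every name here, the use of seeded SHA-1 as the hash family, and the signature length of 256 are assumptions made for illustration. The fraction of signature positions on which two sets agree estimates their Jaccard similarity.

```python
import hashlib

def _hash(seed, item):
    """Seeded 64-bit hash; a different seed acts as a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=256):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Share of positions where the two signatures hold the same minimum."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)

# Two hypothetical sets that share 50 of their 150 distinct elements.
a = set(range(0, 100))
b = set(range(50, 150))
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
exact = jaccard(a, b)
```

Comparing two fixed-length signatures costs the same regardless of how large the underlying sets are, which is the efficiency gain the abstract refers to; the price is that `estimate` only approximates `exact`.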

Information Management by Data Quantification with Fuzzy Entropy and Similarity Measure

  • Siang, Chua Hong;Lee, Sanghyuk
    • Journal of the Korea Convergence Society / v.4 no.2 / pp.35-41 / 2013
  • Data management with fuzzy entropy and a similarity measure is discussed and verified by applying it to a reliable data selection problem. To calculate the certainty or uncertainty of data, a fuzzy entropy and a similarity measure are designed and proved. The proposed fuzzy entropy and similarity are regarded as a dissimilarity measure and a similarity measure, respectively, and the relation between the two measures is explained through graphical illustration. The obtained measures are useful for applications in decision theory and mutual information analysis problems. Extensions of the data quantification results based on the proposed measures are applicable to decision making and fuzzy game theory.

Reliable Data Selection using Similarity Measure (유사측도를 이용한 신뢰성 있는 데이터의 추출)

  • Ryu, Soo-Rok;Lee, Sang-Hyuk
    • Journal of the Korean Institute of Intelligent Systems / v.18 no.2 / pp.200-205 / 2008
  • For data analysis, fuzzy entropy is introduced as a measure of fuzziness, and a similarity measure is constructed to represent the similarity between data. The similarity measure between fuzzy membership functions is constructed through a distance measure, and the proposed similarity measure is proved. The proposed similarity measure is also applied to an example of reliable data selection. The results are compared with previous results obtained through fuzzy entropy and statistical knowledge.
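
One common way to build a similarity measure from a distance measure is to take the complement of a normalised distance. The sketch below assumes the normalised Hamming distance between two membership vectors sampled at the same points; the abstract does not state which distance or which construction the authors use, so the formula, names, and values here are illustrative only.

```python
def hamming_distance(a, b):
    """Normalised Hamming distance between two membership vectors, in [0, 1]."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def similarity(a, b):
    """Assumed construction: similarity as the complement of the distance."""
    return 1.0 - hamming_distance(a, b)

# Hypothetical membership degrees of two fuzzy sets over four sample points.
mu_a = [0.2, 0.8, 0.5, 1.0]
mu_b = [0.4, 0.6, 0.5, 0.0]
s = similarity(mu_a, mu_b)
```

Under this construction, identical membership functions have similarity 1 and fully complementary crisp sets have similarity 0.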

Relation between Certainty and Uncertainty with Fuzzy Entropy and Similarity Measure

  • Lee, Sanghyuk;Zhai, Yujia
    • Journal of the Korea Convergence Society / v.5 no.4 / pp.155-161 / 2014
  • We survey the relation between the fuzzy entropy measure and the similarity measure. Each measure represents features of data uncertainty and certainty between comparative data groups. With the help of their one-to-one correspondence, the distance measure and the similarity measure are expressed through their complementary characteristics. We construct a similarity measure using a distance measure and prove its usefulness. Furthermore, the analysis of the similarity measure derived from the fuzzy entropy measure is also discussed.

Evaluation of Positioning Effectiveness Based on the Preference and Similarity Data Derived from Consumers' Choice from Different Choice Sets (선택집합의 변화를 통하여 도출된 선호도 및 유사성 정보를 활용한 포지셔닝 우위 평가)

  • Won, Jee-Sung
    • Korean Management Science Review / v.28 no.1 / pp.61-74 / 2011
  • Not only preference data but also similarity data can be used to develop effective marketing strategies. Hahn et al.[10] propose a methodology for representing the competitors of a brand (the focal brand) in a single map, called the Preference-Similarity Map, according to their preference relative to, and similarity with, the focal brand. They also propose a way to derive the relative preference and similarity values from a survey that collects choice data from differing choice sets. This study identifies the limitations of the preference and similarity measures proposed by Hahn et al.[10] and shows how these measures can be revised. This study also proposes how to implement the revised measures and analyze brands' positioning strategies. Based on the results of previous studies on the effect of inter-brand similarity on brand evaluations, this study assumes that, when evaluating the effectiveness of a brand's positioning in the market, it is important to analyze how much the brand is preferred to its close competitors. This study applies the proposed measures to the data used in Hahn et al.[10] and also shows how the proposed measures are related to the parameters of the choice model proposed by Batsell and Polking[1].