Efficient Mining of Frequent Itemsets in a Sparse Data Set

An Efficient Technique for Mining Frequent Itemsets in a Sparse Data Set

  • 박인창 (Samsung Electronics) ;
  • 장중혁 (Software Application Research Institute, Yonsei University) ;
  • 이원석 (Department of Computer Science, Yonsei University)
  • Published : 2005.12.01

Abstract

The main research problems in mining frequent itemsets are reducing the memory usage and the processing time of the mining process. Most previous algorithms for finding frequent itemsets are multi-scan algorithms based on the Apriori property, and their processing time increases greatly as the length of the maximal frequent itemset grows. To overcome this drawback, other approaches have been actively proposed in previous research to reduce the processing time. However, they are not efficient on a sparse data set. This paper proposes an efficient mining algorithm for finding frequent itemsets. It introduces a novel tree structure, called an $L_2$-tree, together with an efficient algorithm for mining frequent itemsets using the $L_2$-tree, called the $L_2$-traverse algorithm. An $L_2$-tree is constructed from $L_2$, i.e., the set of frequent itemsets of size 2, and the $L_2$-traverse algorithm finds its mining result in a short time by traversing the $L_2$-tree once. To reduce the processing time further, this paper also proposes an optimized algorithm, $C_3$-traverse, which removes in advance any itemset in $L_2$ that cannot be part of a frequent itemset of size 3. Various experiments verify that the proposed algorithms are efficient on a sparse data set.

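The main research goals in frequent itemset mining are to reduce memory usage during processing and to obtain the mining result in a short time, and many methods for finding frequent itemsets are multi-scan methods based on the Apriori algorithm. These methods also have the drawback that mining time increases sharply as the maximal frequent pattern becomes longer. To overcome this, previous studies proposed various methods for shortening mining time. However, many of these methods perform somewhat inefficiently on sparse data sets. This paper proposes an efficient method for finding frequent itemsets. First, it proposes the $L_2$-tree, a new tree structure for finding frequent itemsets, along with the $L_2$-traverse algorithm, which finds frequent itemsets using the $L_2$-tree. The $L_2$-tree is built from $L_2$, the set of frequent itemsets of length 2, and is therefore very small; the $L_2$-traverse algorithm obtains all frequent itemsets quickly by traversing the $L_2$-tree just once. To shorten processing time further, the paper also proposes the $C_3$-traverse algorithm, which removes in advance the $L_2$ patterns that cannot become part of $L_3$, the set of frequent itemsets of length 3. Various experiments verified that the proposed methods outperform existing methods, particularly on sparse data sets where $L_2$ is relatively small.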
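
The abstract gives only the outline of the approach, so the following is a minimal Python sketch of the general idea rather than the authors' $L_2$-tree or their exact algorithms. It assumes that $L_2$ is stored as a plain adjacency map instead of a tree, that longer candidates are grown depth-first only along items pairwise connected in $L_2$, and that supports are verified by rescanning the transactions. The pruning step is one plausible reading of the $C_3$ idea: a pair is dropped from the traversal when no third item forms frequent pairs with both of its members. All function and parameter names (`frequent_pairs`, `prune_pairs_without_triangle`, `mine`, `min_count`, `use_c3_pruning`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def frequent_pairs(transactions, min_count):
    """Return L2 as {(a, b): support} with a < b, counting only frequent items."""
    item_counts = defaultdict(int)
    for t in transactions:
        for item in set(t):
            item_counts[item] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_count}

    pair_counts = defaultdict(int)
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_count}


def prune_pairs_without_triangle(l2):
    """Assumed C3-style pruning: keep a pair {a, b} only if some item c makes
    both {a, c} and {b, c} frequent, since otherwise no frequent 3-itemset
    can contain {a, b}."""
    neighbours = defaultdict(set)
    for a, b in l2:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {(a, b) for (a, b) in l2 if neighbours[a] & neighbours[b]}


def mine(transactions, min_count, use_c3_pruning=True):
    """Return {frozenset: support} for all frequent itemsets of size >= 2."""
    l2 = frequent_pairs(transactions, min_count)
    result = {frozenset(p): c for p, c in l2.items()}
    edges = prune_pairs_without_triangle(l2) if use_c3_pruning else set(l2)

    # children[a] holds every larger item b such that (a, b) survived pruning.
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)

    tsets = [frozenset(t) for t in transactions]

    def extend(prefix, candidates):
        # Every candidate is already paired in L2 with each item of the prefix.
        for item in sorted(candidates):
            itemset = prefix | {item}
            count = sum(1 for t in tsets if itemset <= t)
            if count >= min_count:
                result[itemset] = count
                linked = children.get(item, set())
                extend(itemset, {c for c in candidates if c > item and c in linked})

    for a in sorted(children):
        for b in sorted(children[a]):
            extend(frozenset((a, b)), children[a] & children.get(b, set()))
    return result
```

In this sketch, pruning never changes the result: every pair inside a frequent 3-itemset has the third item as a common neighbour, so it survives. Pruning only shrinks the structure that is traversed, which matters most when $L_2$ is small, as in the sparse setting the abstract targets.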
