Implementation of Parallel Local Alignment Method for DNA Sequence using Apache Spark

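  • Bo-Seong Kim (IGLOO SECURITY, Inc.) ;
  • Jin-Su Kim (Department of Computer Engineering, Korea National University of Transportation) ;
  • Do-Jin Choi (Department of Information and Communication Engineering, Chungbuk National University) ;
  • Sang-Su Kim (Human Resources Development Service of Korea) ;
  • Seok-Il Song (Department of Computer Engineering, Korea National University of Transportation)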
  • Received : 2016.09.26
  • Accepted : 2016.10.11
  • Published : 2016.10.28

Abstract

The Smith-Waterman (SW) algorithm is a local alignment algorithm and one of the important operations in DNA sequence analysis. SW finds the optimal local alignment with respect to the scoring system in use, but it requires a long execution time. To address this problem, several methods that run SW in a distributed and parallel manner have been proposed. ADAM, a distributed and parallel processing framework for DNA sequences, includes a parallel SW. However, ADAM's parallel SW does not take into account that SW is a dynamic programming method, which limits its performance. In this paper, we propose a method to enhance the parallel SW of ADAM. The proposed parallel SW (PSW) runs in two phases. In the first phase, PSW splits a DNA sequence into a number of partitions and assigns them to multiple nodes; the original Smith-Waterman algorithm is then performed in parallel at each node. In the second phase, PSW estimates the portions of the sequence that must be recalculated and performs the recalculation on those portions in parallel at each node. In the experiments, we compare the proposed PSW with the parallel SW of ADAM to show the superiority of PSW.
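
As a rough illustration of why partitioning interacts badly with dynamic programming, the following Python sketch computes a Smith-Waterman score over a whole sequence and over independently scored partitions. It is not the PSW method of this paper, nor ADAM's implementation: the function names, the scoring values (match 2, mismatch -1, linear gap -1), and the even chunking are all assumptions made for the example, and everything runs on a single machine rather than on Spark.

```python
def sw_score(query, target, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local-alignment score (linear gap penalty).

    The scoring values are illustrative assumptions, not the paper's.
    Only two rows of the dynamic programming matrix are kept in memory.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def chunked_sw_score(query, target, parts):
    """Hypothetical naive split: score each partition on its own, keep the max.

    Alignments that cross a partition boundary are invisible to this
    function, which is the kind of loss a recalculation step would
    have to repair.
    """
    size = -(-len(target) // parts)  # ceiling division
    chunks = [target[i:i + size] for i in range(0, len(target), size)]
    return max(sw_score(query, chunk) for chunk in chunks)


if __name__ == "__main__":
    target = "TTTTACGTACGTTTTT"
    query = "ACGTACGT"
    print("whole sequence:", sw_score(query, target))
    print("two partitions:", chunked_sw_score(query, target, 2))
```

In this example the best match straddles the midpoint of the target, so the partitioned score comes out lower than the whole-sequence score. A partition is a substring of the target, so its score can never exceed the whole-sequence score; the only question is how much is lost at the boundaries.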

Keywords

DNA Sequence; Local Alignment; Parallel Processing; Apache Spark; Smith-Waterman

Acknowledgement

Grant: BK21 Plus

Supported by: Korea National University of Transportation
