Comparison of Distributed and Parallel NGS Data Analysis Methods based on Cloud Computing

  • Kang, Hyungil (Dept. of Semiconductor Electronics Engineering, Chungbuk Health & Science University)
  • Kim, Sangsoo (Dept. of Course-based Qualification Exam Team2, Human Resources Development Service of Korea)
  • Received : 2018.01.03
  • Accepted : 2018.04.02
  • Published : 2018.03.28

Abstract

With the rapid growth of genomic data, new requirements have emerged that are difficult to meet with existing big data storage and analysis techniques. Regardless of its size, an institution performing genomic data analysis finds it increasingly difficult to build its own computing environment for storing and analyzing genomic data. Cloud computing has recently emerged as a computing environment that meets these new requirements. In this paper, we analyze and compare existing distributed and parallel NGS (Next Generation Sequencing) data analysis methods based on cloud computing environments, as a basis for future research.

Fig. 1. Cost of DNA analysis

Fig. 2. NGS analysis process of [28]

Fig. 3. Overall Architecture of Halvade [24]

Fig. 4. Workflow of SparkGA

Table 1. Tools for each NGS step

References

  1. M. Choi, "Development Trends of Medical Genomics Using Next Generation Sequencing Techniques," Molecular Cell Biology Newsletter, Apr. 2014.
  2. https://www.genome.gov/sequencingcostsdata/
  3. M. C. Schatz, B. Langmead, and S. L. Salzberg, "Cloud Computing and the DNA Data Race," Nature Biotechnology, vol. 28, no. 7, 2010, pp. 691-693. https://doi.org/10.1038/nbt0710-691
  4. M. Baker, "Next-generation Sequencing: Adjusting to Data Overload," Nature Methods, vol. 7, no. 7, 2010, pp. 495-499. https://doi.org/10.1038/nmeth0710-495
  5. B. Calabrese and M. Cannataro, "Bioinformatics and Microarray Data Analysis on the Cloud," Methods in Molecular Biology, vol. 1375, 2016, pp. 25-39.
  6. http://ngenebio.com/
  7. C. Lee, "Bioinformatics Analysis of Next-Generation Sequence Data," BRIC View Trend Report, 2016.
  8. G. A. Van der Auwera, M. O. Carneiro, C. Hartl, R. Poplin, G. del Angel, A. Levy-Moonshine, T. Jordan, K. Shakir, D. Roazen, J. Thibault, E. Banks, K. V. Garimella, D. Altshuler, S. Gabriel, and M. A. DePristo, "From FastQ Data to High Confidence Variant Calls: the Genome Analysis Toolkit Best Practices Pipeline," Current Protocols in Bioinformatics, vol. 43, 2013, pp. 11.10.1-11.10.33.
  9. https://www.bioin.or.kr/board.do?cmd=view&bid=tech&num=216321
  10. BWA, https://github.com/lh3/bwa
  11. GATK, https://software.broadinstitute.org/gatk/
  12. B. Langmead, C. Trapnell, M. Pop, and S. Salzberg, "Ultrafast and Memory-efficient Alignment of Short DNA Sequences to the Human Genome," Genome Biology, vol. 10, no. 3, 2009.
  13. http://broadinstitute.github.io/picard/
  14. https://github.com/GregoryFaust/samblaster
  15. https://github.com/broadinstitute/mutect
  16. https://hpc.nih.gov/apps/MutSig.html
  17. https://github.com/ekg/freebayes
  18. https://github.com/WGLab/doc-ANNOVAR/
  19. https://www.ensembl.org/vep
  20. https://gencore.bio.nyu.edu/variant-calling-pipeline/
  21. https://wikis.utexas.edu/display/bioiteam/DNAseq+Variant+Calling+Pipeline
  22. https://hadoop.apache.org/
  23. https://spark.apache.org/
  24. D. Decap, J. Reumers, C. Herzeel, P. Costanza, and J. Fostier, "Halvade: Scalable Sequence Analysis with MapReduce," Bioinformatics, vol. 31, no. 15, 2015, pp. 2482-2488. https://doi.org/10.1093/bioinformatics/btv179
  25. https://github.com/citiususc/BigBWA
  26. https://github.com/citiususc/SparkBWA
  27. J. Lee, H. Lee, J. Moon, H. Kang, S. Song, and S. Yu, "Parallel and Distributed PCR Duplication Marking Algorithm Integrated with Genome Sequence Alignment by Using Streaming Technology," Proceedings of TBC 2017, 2017.
  28. H. Mushtaq and Z. Al-Ars, "Cluster-based Apache Spark Implementation of the GATK DNA Analysis Pipeline," In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2015, pp. 1471-1477.
  29. H. Mushtaq, F. Liu, C. Costa, G. Liu, P. Hofstee, and Z. Al-Ars, "SparkGA: A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale," In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2017, pp. 148-157.