K-mer Based RNA-seq Read Distribution Method For Accelerating De Novo Transcriptome Assembly

  • Kwon, Hwijun (School of Computer Science and Engineering, Kyungpook National University)
  • Jung, Inuk (School of Computer Science and Engineering, Kyungpook National University)
  • Received : 2020.07.28
  • Accepted : 2020.08.13
  • Published : 2020.08.31

Abstract

In this paper, we propose a gene-family-based RNA-seq read distribution method that reduces the overall computation time of de novo transcriptome assembly. To evaluate the method, we tested four types of Arabidopsis thaliana data sets: Whole Unclassified Reads (WUR), Family-Classified Reads, Model-Classified Reads, and Randomly Classified Reads. When de novo transcript assembly was run on distributed nodes using the model-classified data, 95% of the generated gene contigs matched the contigs generated from WUR, and the execution time was 4.2 times shorter than in a single-node environment using the same resources.

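This paper presents a method that uses gene family information to distribute RNA-seq reads across multiple nodes in order to shorten the running time of de novo transcriptome assembly. To measure the performance of the proposed transcriptome sequence data distribution technique, we ran experiments on Arabidopsis reads organized into four data sets (whole unclassified reads, fully classified reads, model-classified reads, and randomly classified reads). The generated gene contigs matched 95% of those from the whole unclassified data, and the assembly running time in the proposed distributed environment was 4.2 times shorter than on a single node using the same resources.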
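
The abstract does not spell out how reads are assigned to gene families, so the following is only a minimal sketch of one plausible k-mer voting scheme, not the authors' implementation. The function names (`kmers`, `build_index`, `assign_read`, `distribute`), the k-mer length `K = 5`, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

K = 5  # illustrative k-mer length (assumption; not taken from the paper)


def kmers(seq, k=K):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(family_refs, k=K):
    """Map each k-mer to the set of families whose references contain it."""
    index = defaultdict(set)
    for family, seqs in family_refs.items():
        for seq in seqs:
            for km in kmers(seq, k):
                index[km].add(family)
    return index


def assign_read(read, index, k=K):
    """Pick the family sharing the most k-mers with the read, or None."""
    votes = Counter()
    for km in kmers(read, k):
        for family in index.get(km, ()):
            votes[family] += 1
    if not votes:
        return None
    # Break ties deterministically by family name.
    return min(votes, key=lambda f: (-votes[f], f))


def distribute(reads, index, k=K):
    """Group reads into per-family bins; unmatched reads go to 'unclassified'."""
    bins = defaultdict(list)
    for read in reads:
        bins[assign_read(read, index, k) or "unclassified"].append(read)
    return dict(bins)


# Toy, made-up sequences (hypothetical; not from the paper's data set).
family_refs = {
    "famA": ["ACGTACGTTAGC"],
    "famB": ["GGGCCCATATAT"],
}
reads = ["ACGTACG", "CCCATAT", "TTTTTTT"]
bins = distribute(reads, build_index(family_refs))
print(bins)
```

In a distributed setting, each bin could then be handed to a separate node and assembled independently, which is where a speedup over a single node would come from under this scheme.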
