A Protein Sequence Prediction Method by Mining Sequence Data

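  • 조순이 (Department of Computer Science and Statistics, Graduate School, Chonnam National University);
  • 이도헌 (Department of BioSystems, Korea Advanced Institute of Science and Technology);
  • 조광휘 (Department of Bioinformatics, Soongsil University);
  • 원용관 (School of Electronics, Computer, and Information and Communication Engineering, Chonnam National University);
  • 김병기 (Department of Computer Science, Chonnam National University)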
  • Published : 2003.04.01

Abstract

A protein, a linear polymer of amino acids, is one of the most important biomolecules: proteins make up biological structures and regulate biochemical reactions. Because the characteristics and functions of a protein are determined, in principle, by its amino acid sequence, determining that sequence is the starting point of protein function studies. This paper proposes a protein sequence prediction method based on data mining techniques that can overcome the limitations of previous biochemical sequencing methods. Multiple proteases are applied to obtain overlapping protein fragments, and candidate sequences for each fragment are identified by comparing the fragment's mass with peptide databases. To determine the full protein sequence by assembling suitable candidate sequences, we propose a method that constructs a multipartite graph and searches it for maximal paths. Experimental results on SWISS-PROT, a representative protein sequence database, are presented to show the validity of the proposed method.
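The pipeline outlined in the abstract (digest with several proteases, match fragment masses against a peptide database, assemble candidates that agree with every digest) can be illustrated with a toy sketch. This is not the authors' algorithm: the multipartite graph and maximal path search are replaced here by exhaustive enumeration, which is only feasible for very small inputs. The function names, the peptide list, the mass tolerance, and the second cleavage rule (cut after F) are all assumptions made for illustration, and the trypsin-like rule (cut after K or R) ignores real-world exceptions.

```python
from itertools import permutations, product

# Monoisotopic residue masses in Da (standard values, rounded to 3 decimals).
# Only the residues used in this toy example are listed.
RESIDUE_MASS = {
    "G": 57.021, "A": 71.037, "S": 87.032, "V": 99.068, "L": 113.084,
    "D": 115.027, "K": 128.095, "E": 129.043, "F": 147.068, "R": 156.101,
}
WATER = 18.011  # one water per free peptide


def peptide_mass(pep):
    """Mass of a peptide: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[a] for a in pep) + WATER


def digest(seq, cleave_after):
    """Cut seq after every residue found in cleave_after (simplified rule)."""
    frags, start = [], 0
    for i, a in enumerate(seq):
        if a in cleave_after:
            frags.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        frags.append(seq[start:])
    return frags


def candidates(mass, peptide_db, tol):
    """Database peptides whose mass lies within tol of the observed mass."""
    return [p for p in peptide_db if abs(peptide_mass(p) - mass) <= tol]


def same_masses(frags, masses, tol):
    """True if the fragments reproduce the observed masses (order ignored)."""
    got = sorted(peptide_mass(f) for f in frags)
    want = sorted(masses)
    return len(got) == len(want) and all(
        abs(g - w) <= tol for g, w in zip(got, want))


def assemble(masses_a, masses_b, peptide_db, cleave_a, cleave_b, tol=0.01):
    """Brute-force stand-in for graph path search: try every candidate choice
    and ordering for digest A, keep sequences consistent with both digests."""
    cand_sets = [candidates(m, peptide_db, tol) for m in masses_a]
    found = set()
    for choice in product(*cand_sets):
        for order in permutations(choice):
            seq = "".join(order)
            if (same_masses(digest(seq, cleave_a), masses_a, tol)
                    and same_masses(digest(seq, cleave_b), masses_b, tol)):
                found.add(seq)
    return found


# Toy data (hypothetical): the "unknown" protein is used only to simulate masses.
true_seq = "AGKVFLRDEG"
peptide_db = ["AGK", "VFLR", "FVLR", "LVFR", "DEG", "SSK"]
masses_a = [peptide_mass(f) for f in digest(true_seq, "KR")]  # trypsin-like
masses_b = [peptide_mass(f) for f in digest(true_seq, "F")]   # assumed rule
result = assemble(masses_a, masses_b, peptide_db, "KR", "F")
print(result)
```

In this toy input, VFLR, FVLR, and LVFR have the same mass, so the first digest alone cannot tell them apart; the second digest cuts inside that fragment and leaves only the ordering that matches both sets of masses. Candidates that differ only in regions where no second cut falls would remain ambiguous under this scheme.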
