An Approach for Integrated Modeling of Protein Data using a Fact Constellation Schema and a Tree based XML Model

Laboratory-Level Protein Data Integration Using a Fact Constellation Schema and a Tree-Based XML Model

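  • 박성희 (Department of Computer Science, Graduate School, Chungbuk National University)
  • 이영화 (Computer Science, Yanbian University)
  • 류근호 (Department of Computer Science, Chungbuk National University)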
  • Published : 2004.06.01

Abstract

With the explosion of bioinformatics data such as proteins and genes, biologists need an integrated system to organize and analyze large, heterogeneous biological datasets, because biological function is determined by complex interactions among genes and proteins. In this paper, we propose an integration system based on a mediated data warehouse architecture and an XML model, designed to combine protein-related data at the level of a biology laboratory. The system uses a fact constellation model as the common model for integrating heterogeneous sources, and the integrated schema is translated into an XML schema. In addition, to track source changes and the provenance of data in the integrated database, the system employs incremental updates and version management of sequences. We demonstrate the integration modeling process by applying the system to protein structures (PDB), sequences (Swiss-Prot), and domain classification data (CATH).
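As a rough illustration of the idea in the abstract, the sketch below models a small fact constellation (several fact tables sharing dimension tables) and renders it as an XML tree. It is a minimal sketch under stated assumptions, not the authors' implementation: every table, column, and element name (`structure_fact`, `dimensionRef`, and so on) is hypothetical, and the paper's actual schema translation rules are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical dimension tables; names and columns are illustrative only.
DIMENSIONS = {
    "protein": ["protein_id", "name", "organism"],
    "source": ["source_id", "database", "release"],
    "fold_class": ["class_id", "cath_code"],
}

# Hypothetical fact tables. In a fact constellation, several fact
# tables reference a common set of dimension tables.
FACTS = {
    "structure_fact": {
        "dimensions": ["protein", "source"],
        "measures": ["resolution"],
    },
    "classification_fact": {
        "dimensions": ["protein", "source", "fold_class"],
        "measures": ["domain_count"],
    },
}


def shared_dimensions(facts):
    """Return the sorted names of dimensions used by more than one fact table."""
    counts = {}
    for fact in facts.values():
        for dim in fact["dimensions"]:
            counts[dim] = counts.get(dim, 0) + 1
    return sorted(dim for dim, n in counts.items() if n > 1)


def to_xml_tree(facts, dimensions):
    """Build an XML tree in which each fact refers to dimensions by name."""
    root = ET.Element("constellation")
    dims_el = ET.SubElement(root, "dimensions")
    for name, columns in dimensions.items():
        dim_el = ET.SubElement(dims_el, "dimension", name=name)
        for column in columns:
            ET.SubElement(dim_el, "attribute", name=column)
    for name, fact in facts.items():
        fact_el = ET.SubElement(root, "fact", name=name)
        for dim in fact["dimensions"]:
            # A reference, not a copy, so shared dimensions are stored once.
            ET.SubElement(fact_el, "dimensionRef", ref=dim)
        for measure in fact["measures"]:
            ET.SubElement(fact_el, "measure", name=measure)
    return root
```

Calling `ET.tostring(to_xml_tree(FACTS, DIMENSIONS), encoding="unicode")` serializes the tree. Storing each dimension once and referring to it by name is one way to keep shared dimensions from being duplicated under every fact element.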
