A Time-Redundant Recovery Scheme for TMR Failures Using Retry and Rollback Techniques

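  • 강명석 (Department of Electrical and Electronic Engineering, Graduate School, Yonsei University);
  • 손병희 (Department of Electrical and Electronic Engineering, Graduate School, Yonsei University);
  • 김학배 (Department of Electrical and Electronic Engineering, Yonsei University)
  • Published: 2006.10.30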

Abstract

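To secure high reliability in increasingly complex control computers, this paper proposes a method that suitably combines retry and rollback, two forms of time redundancy, within a TMR structure. Retry and rollback can be used complementarily to recover from TMR failures caused by transient faults, at the cost of only a small amount of additional time and without reconfiguration. To this end, the probabilities of all possible fault states of the system at the time a failure is detected are estimated, and from these the optimal numbers of retries and rollbacks that minimize the mean execution time of the whole task are derived. The mean execution time of the proposed method is also compared quantitatively with that of other failure-recovery techniques to verify its advantage.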

This paper proposes an integrated recovery approach that applies retry and rollback techniques to recover from TMR failures. Combining these time-redundancy techniques with a TMR system is effective for recovering from TMR failures (or masked errors) caused primarily by transient faults. These policies require fewer reconfigurations, at the cost of the extra time needed by the time-redundant schemes. The optimal numbers of retries and rollbacks that minimize the mean execution time of tasks are derived for the proposed method by computing the likelihoods of all possible states of the failed system. The effectiveness of the proposed method is validated through numerical examples and simulations conducted over a variety of parameters governing environmental characteristics.
