RELIABILITY ANALYSIS OF CHECKPOINTING MODEL WITH MULTIPLE VERIFICATION MECHANISM

  • Lee, Yutae (Department of Information and Communications Engineering, Dong-eui University)
  • Received : 2018.11.07
  • Accepted : 2019.08.23
  • Published : 2019.11.30

Abstract

We consider a checkpointing model for silent errors in which a checkpoint is taken after every fixed number of verifications. Assuming that the inter-occurrence times of errors are i.i.d. with a general distribution, we derive the reliability of the model as a function of the number of verifications between two checkpoints and the duration of the work interval between two verifications.
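
To make the pattern concrete, the following is a minimal Monte Carlo sketch, not the paper's derivation. It assumes that the error clock restarts at each checkpoint, that errors strike only during work, that verification and checkpoint costs are zero, and that an error occurring in work interval k is caught by verification k. The names `first_detection_index`, `estimate_error_free_probability`, `sample_gap`, `n`, and `w` are hypothetical, and the estimated quantity (the probability that one checkpoint-to-checkpoint pattern sees no error) is only an illustrative stand-in for the reliability measure defined in the paper.

```python
import math
import random


def first_detection_index(sample_gap, n, w, rng):
    """Return the verification (1..n) that catches the first error, or None.

    Assumptions (ours, not the paper's): the error clock restarts at the
    checkpoint, and verification and checkpoint costs are ignored.
    """
    t = sample_gap(rng)       # time of the first error after the checkpoint
    if t >= n * w:
        return None           # no error before the next checkpoint
    return int(t // w) + 1    # an error in interval k is caught by verification k


def estimate_error_free_probability(sample_gap, n, w, trials=100_000, seed=0):
    """Estimate P(no error within n work intervals of length w) by simulation."""
    rng = random.Random(seed)
    clean = sum(
        first_detection_index(sample_gap, n, w, rng) is None
        for _ in range(trials)
    )
    return clean / trials


if __name__ == "__main__":
    # Exponential inter-occurrence times with rate 0.01 (an arbitrary choice).
    est = estimate_error_free_probability(
        lambda rng: rng.expovariate(0.01), n=5, w=10.0
    )
    print(est, math.exp(-0.01 * 5 * 10.0))
```

For exponential inter-occurrence times the estimate should be close to exp(-0.5), roughly 0.607. Any other sampler that takes a `random.Random` instance, for example `lambda rng: rng.weibullvariate(100.0, 0.7)`, can be passed as `sample_gap` to try a non-exponential distribution.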
