• Title/Summary/Keyword: checkpointing

An Asynchronous Checkpointing Algorithm Using Virtual Checkpointing On Distributed Systems (분산시스템에서 가상 체크포인팅을 이용한 비동기화 체크포인팅 알고리즘)

  • Kim, Do-Hyung;Park, Chang-Soon;Kim, Jong
    • The Transactions of the Korea Information Processing Society / v.6 no.5 / pp.1203-1211 / 1999
  • Checkpointing is a fault-tolerance technique for recovering from faults and restarting jobs quickly. Checkpointing algorithms in distributed systems have been studied for many years and can be classified into synchronous and asynchronous checkpointing algorithms. In this paper, we propose an independent checkpointing algorithm that has a minimum checkpoint count, equal to that of the periodic checkpointing algorithm, and a relatively short rollback distance in a faulty situation. The checkpoint count directly affects task completion time in a fault-free situation, and the rollback distance directly affects task completion time in a faulty situation. The proposed algorithm is compared by simulation with previously proposed asynchronous checkpointing algorithms. In the simulation, the proposed algorithm yields better task completion times than the other algorithms in both fault-free and faulty situations.

Taking Point Decision Mechanism of Page-level Incremental Checkpointing based on Cost Analysis of Process Execution Time (프로세스 수행 시간의 비용 분석에 기반을 둔 페이지 단위 점진적 검사점의 작성 시점 결정 기법)

  • Yi Sang-Ho;Heo Jun-Young;Hong Ji-Man
    • The KIPS Transactions:PartA / v.13A no.4 s.101 / pp.289-294 / 2006
  • Checkpointing is an effective mechanism that allows a process interrupted by a system failure to resume execution without restarting from the beginning. In particular, page-level incremental checkpointing saves only the modified pages of a process to minimize the checkpointing overhead. As a result, the time consumed by an incremental checkpoint varies with the number of modified pages, so an efficient checkpointing interval must be determined at run time. In this paper, we present an efficient and adaptive page-level incremental checkpointing facility based on a cost analysis of process execution time. Our simulation results show that the proposed mechanism significantly reduces the average process execution time compared with existing fixed-interval page-level incremental checkpointing.
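
The idea of saving only modified pages can be illustrated with a minimal sketch. This is not the authors' facility: the `PagedMemory` class, the `restore` helper, and the 4096-byte page size are all hypothetical names and values chosen for illustration, and the adaptive decision of when to checkpoint is not modeled because the abstract does not specify it.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (illustrative)


class PagedMemory:
    """Toy process memory that tracks pages written since the last checkpoint."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()  # page numbers modified since the last checkpoint

    def write(self, page_no, data):
        """Overwrite one page and mark it dirty."""
        self.pages[page_no] = data
        self.dirty.add(page_no)

    def incremental_checkpoint(self):
        """Return only the pages modified since the previous checkpoint."""
        delta = {n: self.pages[n] for n in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(num_pages, checkpoints):
    """Rebuild memory by replaying incremental checkpoints, oldest first."""
    pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
    for delta in checkpoints:
        for page_no, data in delta.items():
            pages[page_no] = data
    return pages
```

In this sketch the size of each checkpoint is proportional to the number of dirty pages, which is why the cost of a checkpoint varies from one interval to the next.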

A Dynamic Checkpoint Scheduling Scheme for Fault Tolerant Distributed Computing Systems (결함 내성 분산 시스템에서의 동적 검사점 스케쥴링 기법)

  • Park, Tae-Soon
    • Journal of KIISE:Computer Systems and Theory / v.29 no.2 / pp.75-86 / 2002
  • Selecting the optimal checkpointing interval is a critical issue in implementing a checkpointing recovery scheme for a fault-tolerant distributed system. This paper presents a new scheme that allows a process to select a proper checkpointing interval dynamically. A process in the system evaluates the cost of checkpointing and of a possible rollback for each checkpointing interval and selects the proper time interval for the next checkpoint. Unlike other schemes, the cost evaluation considers the overhead incurred by both checkpointing and rollback, and the selection of the checkpointing interval reflects the current communication pattern. Moreover, the proposed scheme requires no extra message communication for the checkpointing interval selection and can easily be incorporated into existing checkpointing coordination schemes.
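
The trade-off between checkpointing cost and rollback cost can be sketched with a textbook first-order model. This is not the cost function of the paper above, which the abstract does not give, and it ignores communication patterns. It assumes a fixed checkpoint cost, exponentially distributed failures with a known mean time between failures, and an average loss of half an interval per failure. The function names `overhead_fraction` and `young_interval` are illustrative; the second implements the classic approximation usually attributed to Young.

```python
import math


def overhead_fraction(interval, ckpt_cost, mtbf):
    """First-order fraction of time lost for a given checkpointing interval.

    ckpt_cost / interval    : time spent writing checkpoints
    interval / (2 * mtbf)   : expected re-execution, assuming a failure
                              loses half an interval on average
    """
    return ckpt_cost / interval + interval / (2.0 * mtbf)


def young_interval(ckpt_cost, mtbf):
    """Interval that minimizes overhead_fraction (Young's approximation)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)
```

Under these assumptions, a checkpoint cost of 2 time units and a mean time between failures of 100 give an interval of 20. Shorter intervals pay more for checkpoints, and longer ones pay more for rollback.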

An Adaptive Checkpointing Scheme for Fault Tolerance of Real-Time Control Systems (실시간 제어 시스템의 결함 허용성을 위한 적응형 체크포인팅 기법)

  • Ryu, Sang-Moon
    • Journal of Institute of Control, Robotics and Systems / v.15 no.6 / pp.598-603 / 2009
  • Checkpointing is a well-known technique for coping with transient faults in digital systems. This paper proposes an adaptive checkpointing scheme that improves the reliability of real-time control systems. The proposed scheme builds on previous work on the reliability of an equidistant checkpointing scheme. To derive the adaptive scheme, we introduce conditions that must be satisfied for an equidistant checkpointing scheme to improve reliability. Numerical data show that the proposed adaptive scheme outperforms the equidistant scheme from a reliability point of view.

An Adaptive Checkpointing Scheme for Fault Tolerance of Real-Time Control Systems with Concurrent Fault Detection (동시 결함 검출 기능이 있는 실시간 제어 시스템의 결함 허용성을 위한 적응형 체크포인팅 기법)

  • Ryu, Sang-Moon
    • Journal of Institute of Control, Robotics and Systems / v.17 no.1 / pp.72-77 / 2011
  • Checkpointing is a well-known technique for coping with transient faults in digital systems. This paper proposes an adaptive checkpointing scheme that improves the reliability of real-time control systems with concurrent fault detection capability. With concurrent fault detection, the effects of transient faults are assumed to be detected with no latency. The proposed adaptive checkpointing scheme is based on the reliability analysis of an equidistant checkpointing scheme. Numerical data show that the proposed adaptive scheme outperforms the equidistant scheme from a reliability point of view.

A Multistriped Checkpointing Scheme for the Fault-tolerant Cluster Computers (다중 분할된 구조를 가지는 클러스터 검사점 저장 기법)

  • Chang, Yun-Seok
    • The KIPS Transactions:PartA / v.13A no.7 s.104 / pp.607-614 / 2006
  • A checkpointing scheme should reduce process delay by managing each node's checkpoints to fit the network load, thereby improving the performance of processes running on a cluster system that writes checkpoints to its global stable storage. For this reason, a cluster system with a single I/O space on a distributed RAID needs a suitable checkpointing scheme to obtain maximum I/O performance and the best rollback recovery efficiency. In this paper, we improve the striped checkpointing scheme by adapting the stripe group size dynamically to the network bandwidth variation at the time of checkpointing. To analyze the performance of the multistriped checkpointing scheme, we ran the Linpack HPC benchmark with MPI on our own cluster system with up to 512 virtual nodes. The benchmark results show that the multistriped checkpointing scheme outperforms the striped checkpointing scheme in checkpoint writing efficiency and rollback recovery under heavy system load.

Performance Analysis of Checkpointing and Dual Modular Redundancy for Fault Tolerance of Real-Time Control System (실시간 제어 시스템의 결함 극복을 위한 이중화 구조와 체크포인팅 기법의 성능 분석)

  • Ryu, Sang-Moon
    • Journal of Institute of Control, Robotics and Systems / v.14 no.4 / pp.376-380 / 2008
  • This paper presents a performance analysis of real-time control systems that employ dual modular redundancy (DMR) to detect transient errors and checkpointing to tolerate them. Transient errors are caused by transient faults and are the most significant type of error in reliable computer systems. Transient faults are assumed to occur according to a Poisson process and to be detected by a dual modular redundant structure. In addition, an equidistant checkpointing strategy is considered. We derive the probability of successful task completion in a real-time control system that performs periodic checkpointing during the execution of a real-time control task. Numerical examples show how the checkpointing scheme influences the probability of task completion. In addition, the result of the analysis is compared with simulation results.
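
A simplified model shows how the number of equidistant checkpoints can influence the probability of finishing before a deadline. This is not the derivation from the paper above; it is a sketch under stated assumptions: faults follow a Poisson process, an error is noticed only at the comparison performed at the end of each segment, a failed segment is re-executed in full, and rollback itself costs no time. The function name `completion_probability` and all of its parameters are hypothetical.

```python
import math


def completion_probability(exec_time, deadline, n_segments, ckpt_cost, fault_rate):
    """Probability that a task finishes by its deadline (illustrative model).

    The task is split into n_segments equal segments, each followed by a
    checkpoint of cost ckpt_cost. One attempt at a segment succeeds if no
    Poisson fault (rate fault_rate) occurs while it runs.
    """
    seg = exec_time / n_segments + ckpt_cost  # duration of one attempt
    p_ok = math.exp(-fault_rate * seg)        # attempt is fault-free
    attempts = int(deadline // seg)           # attempts that fit before the deadline
    if attempts < n_segments:
        return 0.0
    # The task finishes iff at least n_segments attempts are fault-free.
    return sum(
        math.comb(attempts, k) * p_ok**k * (1.0 - p_ok) ** (attempts - k)
        for k in range(n_segments, attempts + 1)
    )
```

Evaluating the function for several values of `n_segments` with the other parameters fixed exposes the trade-off in this model: more checkpoints shorten each re-execution but add checkpoint overhead to every segment.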

Combining replication and checkpointing redundancies for reducing resiliency overhead

  • Motallebi, Hassan
    • ETRI Journal / v.42 no.3 / pp.388-398 / 2020
  • We herein propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault-tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in two consecutive phases. First, to compute the replication vector (number of task replicas), we apportion the set of provisioned resources among concurrently executing tasks according to their needs. Subsequently, we obtain the optimal checkpointing interval for each task as a function of the number of replicas and characteristics of tasks and computational environment. We formulate the problem of obtaining the optimal checkpointing interval for replicated tasks in situations where checkpoint files can be exchanged among computational resources. The results of our simulation experiments, on both randomly generated workflow graphs and real-world applications, demonstrated that both the proposed replication vector computation algorithm and the proposed checkpointing scheme reduced the resiliency overhead.

A Checkpointing Framework for Dependable Real-Time Systems (고신뢰 실시간 시스템을 위한 체크포인팅 프레임워크)

  • Lee, Hyo-Soon;Shin, Heonshik-Sin
    • Journal of KIISE:Computer Systems and Theory / v.29 no.4 / pp.176-184 / 2002
  • We provide a checkpointing framework reflecting both the timeliness and the dependability in order to make checkpointing applicable to dependable real-time systems. The predictability of real-time tasks with checkpointing is guaranteed by the worst case execution time (WCET) based on the allocated number of checkpoints and the permissible number of failures. The permissible number of failures is derived from fault tolerance requirements, thus guaranteeing the dependability of tasks. Using the WCET and the permissible number of failures of tasks, we develop an algorithm that determines the minimum number of checkpoints allocated to each task in order to guarantee the schedulability of a task set. Since the framework is based on the amount of time redundancy caused by checkpointing, it can be extended to other time redundancy techniques.

Analysis of Checkpointing Model with Instantaneous Error Detection (즉각적 오류 감지가 가능한 경우의 체크포인팅 모형 분석)

  • Lee, Yutae
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.1 / pp.170-175 / 2022
  • Reactive failure management techniques are required to mitigate the impact of errors in high performance computing. Checkpointing is the standard recovery technique for coping with errors. An application employing checkpoints periodically saves its state, so that when an error occurs while some task is executing, the application is rolled back to its last checkpointed task and resumes execution from that task onward. In this paper, assuming that the times to error are mutually independent and generally distributed, we analyze the checkpointing model with instantaneous error detection. The conventional assumption that at most one error occurs between two consecutive checkpoints is removed. Given the checkpointing time, down-time, and recovery time, we derive the reliability of the checkpointing model. When the time to error follows an exponential distribution, we obtain the optimal checkpointing interval that achieves the maximum reliability.