Application checkpointing

Checkpointing is a technique that provides fault tolerance for computing systems. It consists of saving a snapshot of an application's state so that the application can restart from that point after a failure. This is particularly important for long-running applications executed on failure-prone computing systems.

Checkpointing in distributed systems

In distributed computing environments, checkpointing helps tolerate failures that would otherwise force a long-running application to restart from the beginning. The most basic way to implement checkpointing is to stop the application, copy all the required data from memory to reliable storage (e.g., a parallel file system), and then continue execution.[1] After a failure, the restarted application does not need to start from scratch. Instead, it reads the latest state ("the checkpoint") from stable storage and resumes execution from there. While there is ongoing debate about whether checkpointing is the dominant I/O workload on distributed computing systems, there is general consensus that it is one of the major I/O workloads.[2][3]
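
The stop, copy, and continue cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular library: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the use of `pickle`, and the toy computation are all assumptions made for the example. A local file stands in for reliable storage.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write `state` to `path` via a temporary file and a rename,
    so a crash mid-write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to the storage device
        os.replace(tmp_path, path)  # rename over the previous checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the latest saved state, or `default` if none exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Toy long-running job: sum the step numbers 0..total_steps-1.
    `fail_at` (hypothetical) injects a failure to demonstrate restart."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["acc"]
```

If a run fails partway through, calling `run` again with the same path resumes from the last saved step rather than from step 0. Only the work done since that checkpoint is repeated.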

There are two main approaches to checkpointing in distributed computing systems: coordinated checkpointing and uncoordinated checkpointing. In the coordinated approach, processes must ensure that their checkpoints are consistent, which is usually achieved with some form of two-phase commit protocol. In uncoordinated checkpointing, each process checkpoints its own state independently. Simply forcing processes to checkpoint at fixed time intervals is not sufficient to ensure global consistency. Establishing a consistent state (i.e., one with no missing or duplicated messages) may force other processes to roll back to their checkpoints, which in turn may cause still other processes to roll back to even earlier checkpoints. In the most extreme case, the only consistent state found is the initial state (the so-called domino effect).[4][5]
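
As a rough illustration of the coordinated approach, the sketch below simulates a two-phase checkpoint inside a single Python program. It is a simplification under stated assumptions: the `Process` class, its `healthy` flag, and the `coordinated_checkpoint` function are hypothetical names, and the handling of in-flight messages that a real protocol needs is left out.

```python
class Process:
    """A simulated process whose whole state is one integer."""

    def __init__(self, name, state=0, healthy=True):
        self.name = name
        self.state = state
        self.healthy = healthy   # hypothetical: can this process checkpoint?
        self.tentative = None    # checkpoint taken in phase 1
        self.committed = None    # last globally agreed checkpoint

    def prepare(self):
        """Phase 1: take a tentative checkpoint and vote yes or no."""
        if not self.healthy:
            return False
        self.tentative = self.state
        return True

    def commit(self):
        """Phase 2, success: the tentative checkpoint becomes permanent."""
        self.committed = self.tentative
        self.tentative = None

    def abort(self):
        """Phase 2, failure: discard the tentative checkpoint."""
        self.tentative = None


def coordinated_checkpoint(processes):
    """Commit a new global checkpoint only if every process voted yes."""
    votes = [p.prepare() for p in processes]
    if all(votes):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False
```

Because a checkpoint is committed everywhere or nowhere, the set of committed checkpoints always belongs to the same round. That is why this approach avoids the cascading rollbacks of the domino effect.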

Implementations for applications

Save State

One of the original, and now most common, means of application checkpointing is the "save state" feature of interactive applications. The user can save the state of all variables and other data to a storage medium and then either continue working or exit the application and, at a later time, restart it and restore the saved state. This was implemented through a "save" command or menu option in the application. In many cases it became standard practice, when a user exited with unsaved work, to ask whether they wanted to save it first.

This sort of functionality became extremely important for usability in applications where the work could not be completed in one sitting (such as playing a video game expected to take dozens of hours, or writing a book or long document amounting to hundreds or thousands of pages) or where the work was done over a long period of time, such as entering rows of data into a spreadsheet.

The problem with save state is that it requires the operator of a program to request the save. For non-interactive programs, including automated or batch-processed workloads, checkpointing also had to be automated.

Checkpoint/Restart

As batch applications began to handle tens to hundreds of thousands of transactions, where each transaction might process one record from one file against several different files, it became imperative that an application be restartable at some intermediate point rather than having to rerun the entire job from scratch. Thus the "checkpoint/restart" capability was born, in which a "snapshot" or "checkpoint" of the state of the application could be taken after a number of transactions had been processed. If the application failed before the next checkpoint, it could be restarted by giving it the checkpoint information and the last place in the transaction file where a transaction had successfully completed. The application could then restart at that point.

Checkpointing tends to be expensive, so it was generally not done with every record, but at some reasonable compromise between the cost of a checkpoint vs. the value of the computer time needed to reprocess a batch of records. Thus the number of records processed for each checkpoint might range from 25 to 200, depending on cost factors, the relative complexity of the application and the resources needed to successfully restart the application.
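
The record-count trade-off can be illustrated with a small sketch. It is hypothetical throughout: `process_batch`, the `store` dictionary standing in for stable storage, and the running total used as application state are assumptions for the example, not part of any historical system.

```python
def process_batch(records, store, interval=100, fail_at=None):
    """Process `records`, taking a checkpoint every `interval` records.

    `store` is a dict standing in for stable storage. The checkpoint holds
    the application state (a running total) and the position in the input
    at which processing should resume.
    `fail_at` (hypothetical) injects a failure before that record index.
    """
    checkpoint = store.get("checkpoint", {"position": 0, "total": 0})
    position, total = checkpoint["position"], checkpoint["total"]

    for index in range(position, len(records)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError(f"simulated failure at record {index}")
        total += records[index]          # the "transaction"
        done = index + 1
        if done % interval == 0:         # checkpoint only every N records
            store["checkpoint"] = {"position": done, "total": total}
    return total
```

With `interval=100`, a failure at record 250 loses only the 50 records processed since the checkpoint at record 200. A smaller interval would lose less work but would pay the checkpoint cost more often.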

Fault Tolerance Interface (FTI)

FTI is a library that aims to provide computational scientists with an easy way to perform checkpoint/restart in a scalable fashion.[6] FTI leverages local storage together with replication and erasure-coding techniques to provide several levels of reliability and performance. It provides application-level checkpointing that lets users select which data needs to be protected, which improves efficiency and avoids wasting space, time, and energy. It offers a direct data interface so that users do not need to deal with file or directory names. All metadata is managed by FTI transparently to the user. If desired, users can dedicate one process per node to overlap the fault tolerance workload with scientific computation, so that post-checkpoint tasks are executed asynchronously.

Berkeley Lab Checkpoint/Restart (BLCR)

The Future Technologies Group at Lawrence Berkeley National Laboratory has developed a hybrid kernel/user implementation of checkpoint/restart called BLCR. Its goal is to provide a robust, production-quality implementation that checkpoints a wide range of applications without requiring changes to application code.[7] BLCR focuses on checkpointing parallel applications that communicate through MPI, and on compatibility with the software suite produced by the SciDAC Scalable Systems Software ISIC. The work is divided into four main areas: Checkpoint/Restart for Linux (CR), checkpointable MPI libraries, a resource management interface to checkpoint/restart, and the development of process management interfaces.

DMTCP

DMTCP (Distributed MultiThreaded Checkpointing) is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets.[8] It does not modify the user's program or the operating system. Among the applications supported by DMTCP are Open MPI, Python, Perl, and many other programming languages and shell scripting languages. With the use of TightVNC, it can also checkpoint and restart X Window applications, as long as they do not use extensions (e.g., no OpenGL or video). Among the Linux features supported by DMTCP are open file descriptors, pipes, sockets, signal handlers, process ID and thread ID virtualization (which ensures that old pids and tids continue to work upon restart), ptys, fifos, process group IDs, session IDs, terminal attributes, and mmap/mprotect (including mmap-based shared memory). DMTCP supports the OFED API for InfiniBand on an experimental basis.[9]

Collaborative checkpointing

Some recent protocols perform collaborative checkpointing by storing fragments of the checkpoint on nearby nodes.[10] This is helpful because it avoids the cost of writing to a parallel file system (which often becomes a bottleneck for large-scale systems) and uses storage that is closer.[citation needed] The approach has found use particularly in large-scale supercomputing clusters. The challenge is to ensure that, when a checkpoint is needed to recover from a failure, the nearby nodes holding its fragments are available.[citation needed]
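
The availability challenge can be made concrete with a small simulation. The sketch below is an assumption-laden illustration rather than any published protocol: `scatter` and `gather` are hypothetical names, nodes are plain strings, and each fragment is simply replicated on consecutive neighbors so that it survives the loss of some of them.

```python
def scatter(checkpoint, neighbors, fragment_count, replicas=2):
    """Split `checkpoint` (bytes) into fragments and place each fragment
    on `replicas` distinct neighbors.

    Returns a placement map: node -> {fragment_index: fragment_bytes}.
    """
    if replicas > len(neighbors):
        raise ValueError("not enough neighbors for the requested replicas")
    size = -(-len(checkpoint) // fragment_count)  # ceiling division
    fragments = [checkpoint[i * size:(i + 1) * size]
                 for i in range(fragment_count)]
    placement = {node: {} for node in neighbors}
    for index, fragment in enumerate(fragments):
        for r in range(replicas):
            node = neighbors[(index + r) % len(neighbors)]
            placement[node][index] = fragment
    return placement


def gather(placement, available, fragment_count):
    """Rebuild the checkpoint from the nodes that are still reachable.

    Raises LookupError if some fragment has no surviving copy.
    """
    parts = []
    for index in range(fragment_count):
        for node in available:
            if index in placement.get(node, {}):
                parts.append(placement[node][index])
                break
        else:
            raise LookupError(f"fragment {index} is unavailable")
    return b"".join(parts)
```

With four neighbors and two replicas per fragment, recovery succeeds as long as every fragment keeps at least one reachable copy. Losing two adjacent holders of the same fragment makes the checkpoint unrecoverable, which is the availability problem noted above.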

Docker

Docker and the underlying technology contain a checkpoint and restore mechanism.[11]

CRIU

CRIU (Checkpoint/Restore In Userspace) is a Linux tool that checkpoints and restores running processes from user space.

Implementation for embedded and ASIC devices

Mementos

Mementos is a software system that transforms general-purpose tasks into interruptible programs for platforms subject to frequent interruptions such as power outages. It was designed for batteryless embedded devices, such as RFID tags and smart cards, which rely on harvesting energy from ambient sources. Mementos frequently senses the available energy in the system and decides whether to checkpoint the program because power loss is imminent or to continue computing. When it checkpoints, the data is stored in non-volatile memory. Once enough energy is available to reboot, the data is retrieved from non-volatile memory and the program continues from the stored state. Mementos has been implemented on the MSP430 family of microcontrollers. It is named after Christopher Nolan's film Memento.[12]
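
The energy-driven decision can be sketched as a simulation. This is not the Mementos implementation or its API. The function name, the voltage thresholds, and the list of simulated voltage readings are all assumptions chosen for illustration, and a dictionary stands in for non-volatile memory.

```python
def run_power_cycle(work_items, voltages, nvm, threshold=2.0, brownout=1.8):
    """Run a task for one power cycle on a simulated harvesting device.

    `voltages` holds the supply voltage sensed before each work item.
    `nvm` is a dict standing in for non-volatile memory.
    Below `threshold` the program checkpoints before continuing; below
    `brownout` power is lost and all volatile state disappears.
    Returns True if the task finished, False if power ran out first.
    """
    saved = nvm.get("checkpoint", {"next": 0, "acc": 0})
    index, acc = saved["next"], saved["acc"]   # restore after reboot

    for voltage in voltages:
        if index >= len(work_items):
            break
        if voltage < brownout:
            return False                       # volatile index/acc are lost
        if voltage < threshold:
            # Energy is running low: save progress before doing more work.
            nvm["checkpoint"] = {"next": index, "acc": acc}
        acc += work_items[index]
        index += 1

    if index >= len(work_items):
        nvm["result"] = acc
        return True
    return False                               # readings ended: power lost
```

Calling the function once per power cycle with the same `nvm` dictionary resumes from the last checkpoint. Work done after that checkpoint and before the power loss is repeated in the next cycle.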

Idetic

Idetic is a set of automatic tools that helps application-specific integrated circuit (ASIC) developers embed checkpoints in their designs automatically. It targets high-level synthesis tools and adds the checkpoints at the register-transfer level (Verilog code). It uses a dynamic programming approach to locate low-overhead points in the design's state machine. Because checkpointing at the hardware level involves sending the data of dependent registers to non-volatile memory, the optimal points are those with the fewest registers to store. Idetic has been deployed and evaluated on an energy-harvesting RFID tag device.[13]

References

  1. ^ Plank, J. S., Beck, M., Kingsley, G., & Li, K. (1994). Libckpt: Transparent checkpointing under unix. Computer Science Department.
  2. ^ Wang, Teng; Snyder, Shane; Lockwood, Glenn; Carns, Philip; Wright, Nicholas; Byna, Suren (Sep 2018). "IOMiner: Large-Scale Analytics Framework for Gaining Knowledge from I/O Logs". 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE. pp. 466–476. doi:10.1109/CLUSTER.2018.00062. ISBN 978-1-5386-8319-4. S2CID 53235850.
  3. ^ "Comparative I/O workload characterization of two leadership class storage clusters Logs" (PDF). ACM. Nov 2015.
  4. ^ Bouteiller, B., Lemarinier, P., Krawezik, K., & Cappello, F. (2003, December). Coordinated checkpoint versus message log for fault tolerant MPI. In Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on (pp. 242-250). IEEE.
  5. ^ Elnozahy, E. N., Alvisi, L., Wang, Y. M., & Johnson, D. B. (2002). A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3), 375-408.
  6. ^ Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., & Matsuoka, S. (2011, November). FTI: high performance fault tolerance interface for hybrid systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (p. 32). ACM.
  7. ^ Hargrove, P. H., & Duell, J. C. (2006, September). Berkeley lab checkpoint/restart (blcr) for linux clusters. In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.
  8. ^ Ansel, J., Arya, K., & Cooperman, G. (2009, May). DMTCP: Transparent checkpointing for cluster computations and the desktop. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on (pp. 1-12). IEEE.
  9. ^ "GitHub - DMTCP/DMTCP: DMTCP: Distributed MultiThreaded CheckPointing". GitHub. 2019-07-11.
  10. ^ Walters, J. P.; Chaudhary, V. (2009-07-01). "Replication-Based Fault Tolerance for MPI Applications". IEEE Transactions on Parallel and Distributed Systems. 20 (7): 997–1010. CiteSeerX 10.1.1.921.6773. doi:10.1109/TPDS.2008.172. ISSN 1045-9219. S2CID 2086958.
  11. ^ "Docker - CRIU".
  12. ^ Benjamin Ransford, Jacob Sorber, and Kevin Fu. 2011. Mementos: system support for long-running computation on RFID-scale devices. ACM SIGPLAN Notices 47, 4 (March 2011), 159-170. DOI=10.1145/2248487.1950386 http://doi.acm.org/10.1145/2248487.1950386
  13. ^ Mirhoseini, A.; Songhori, E.M.; Koushanfar, F., "Idetic: A high-level synthesis approach for enabling long computations on transiently-powered ASICs," Pervasive Computing and Communications (PerCom), 2013 IEEE International Conference on , vol., no., pp.216,224, 18–22 March 2013 URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6526735&isnumber=6526701

Further reading

  • Yibei Ling, Jie Mi, Xiaola Lin: A Variational Calculus Approach to Optimal Checkpoint Placement. IEEE Trans. Computers 50(7): 699-708 (2001)
  • R.E. Ahmed, R.C. Frazier, and P.N. Marinos, " Cache-Aided Rollback Error Recovery (CARER) Algorithms for Shared-Memory Multiprocessor Systems", IEEE 20th International Symposium on Fault-Tolerant Computing (FTCS-20), Newcastle upon Tyne, UK, June 26–28, 1990, pp. 82–88.
