Programs that run for a long time face plenty of risks that can cause them to fail. This is especially true of large-scale programs that rely on multiple computers, any of which could crash at any time. To limit the consequences of these failures, computer experts have been working on a technique called application checkpointing.
The goal of checkpointing is to save the state of the application so that the application can return to that state after a restart or crash. But while it may seem like a simple enough idea, application checkpointing has many implications that make it tricky to implement in real-world scenarios.
This Q&A, part one of two, features highlights from a conversation with Gene Cooperman, a professor in the College of Computer and Information Science at Northeastern University in Boston. He is one of the main members of a team working to run what is known as transparent checkpointing on one of the world's largest supercomputers, called Stampede. In the conversation, Cooperman spoke about some of the biggest challenges of checkpointing, what makes checkpointing transparent and the implications of growing application and data diversity. In part two, he discusses the potential security effects of this technique.
What is the biggest challenge of application checkpointing?
Gene Cooperman: If the program is very large and complicated, to figure out all the different pieces of state for the code is very time-consuming, error-prone and difficult. And every time the program gets updated, you have to go back and figure out a system or state [you] should be saving. So, what we do is transparent checkpointing: We basically copy everything about the program from the outside and there's no need for the programmer to figure out where all the little pieces of state are in their code. We just copy everything whether we need it or not. So, essentially, the simplest view of this is that a program is residing in the memory of the computer, in RAM, and we just copy every single byte in memory, all of it to disk. That becomes the checkpoint image file.
This becomes more difficult [with] a large supercomputer. In the past, the computer network people would often use something like Ethernet. There's a protocol that runs on it called TCP [Transmission Control Protocol], which is relatively simple: It's just a single channel; you send the information in a rising order and you're done. But on supercomputers, that's not what they do … They need a faster network, so they use tricks like [writing] the information directly into the memory of the other computer.
So, now it's not so simple to just say, 'Stop the conversation between these two computers, but remember where you are in the conversation.' We've developed some techniques to handle this transparently by working at a low level very close to the operating system itself. And, initially, we had tested it on the traditional … cores which would typically be something like 16 computers only, and it worked fine. But as part of this experiment, we wanted to test on as large a cluster of computers as possible, because you get [a] new and different phenomenon when you try to scale up and the problem becomes much harder.
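To make the "copy every single byte in memory" idea from this answer concrete, here is a minimal Python sketch. It is not the Northeastern team's implementation. It only parses text in the format of Linux's /proc/<pid>/maps file, which lists a process's memory regions, and adds up how many bytes a copy-everything checkpoint would have to write. The sample text and the function names are made up for illustration.

```python
# Illustrative sketch only; not the team's checkpointing code.
# SAMPLE_MAPS is made-up text in the format of Linux's /proc/<pid>/maps.
SAMPLE_MAPS = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
00651000-00652000 rw-p 00051000 08:02 173521 /usr/bin/example
7ffc1a2b0000-7ffc1a2d1000 rw-p 00000000 00:00 0 [stack]
"""


def parse_regions(maps_text):
    """Return a (start, end, perms, label) tuple for each mapped region."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        start_hex, end_hex = fields[0].split("-")
        perms = fields[1]
        # The sixth field (file path or pseudo-name) is optional.
        label = fields[5] if len(fields) > 5 else ""
        regions.append((int(start_hex, 16), int(end_hex, 16), perms, label))
    return regions


def bytes_to_save(regions):
    """Total size of all regions: what 'copy everything' would write to disk."""
    return sum(end - start for start, end, _perms, _label in regions)


regions = parse_regions(SAMPLE_MAPS)
total = bytes_to_save(regions)
print(f"{len(regions)} regions, {total} bytes to save")
```

The point of the sketch is that nothing in it depends on what the program is doing: the regions are discovered from the outside, which is what makes the approach transparent.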
Why is the transparent aspect of application checkpointing important?
Cooperman: So, the other type of checkpointing would be called application-specific. It's [traditionally] been the responsibility of the person who wrote that program to figure out what they need to save and what they don't need to save. Today, they still do [checkpointing] in an application-specific way, which means that everybody who writes a new program has to also write a new checkpointing routine.
If we can prove this technology helped and that it works well, then when people write a large program in the future, they can just write the program. And they can automatically benefit from checkpointing by using our checkpointing program -- they don't have to write a special checkpointing routine themselves. So, that's the benefit of allowing people to switch from their application-specific method to just using our transparent method.
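For contrast, the application-specific approach Cooperman describes might look like the following Python sketch, in which the programmer decides by hand which pieces of state to save. The function names, the JSON file format and the toy computation are all assumptions made for illustration, not code from any real application.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place,
    so a crash in mid-write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, steps, crash_at=None):
    """Toy long-running job: sums 0..steps-1, checkpointing after each step.
    crash_at is a hypothetical knob used only to simulate a failure."""
    # The programmer had to decide that "step" and "total" are the state.
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling run() again with the same path after a crash resumes from the last saved step instead of starting over. The cost is the one described in the interview: every new piece of state added to the program must also be added to the checkpoint routine by hand.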
Does this extend at all into mobile computing?
Cooperman: Yes. At this time, we're primarily supporting Linux. We have done some work on the internet of things. So, these are very small devices and probably the most common operating system they would be using is Linux. And we demonstrated that our work works there, too.
We have a recent paper accepted showing that on the Raspberry Pi, we can checkpoint. Now it's a much smaller program since it's for an embedded system. This is important in this regime because the embedded device probably is monitoring something and it cannot stop [to spend] a long period of time checkpointing. So, by showing that it works in just 0.2 seconds, we can show that it's practical in this regime.
Do you believe that we can checkpoint everything?
Cooperman: If the people developing alternative technologies want to work closely with us, then absolutely we can checkpoint everything. But that's the opposite of the transparent checkpointing. Transparent checkpointing says we can sit on the outside and just make it happen no matter how they change the internals. So, I suspect that what will happen is we'll have to look for some compromise between those two extremes in the future. There will have to be some cooperation between the developers of these new hardware technologies, for example, and the work that we do. But hopefully, we can propose to them very small changes in their hardware that cost them very little which then make it easy for us to do the checkpointing. So, if we can get that cooperation, then yes.
Transparent checkpointing joins the Stampede
In an effort to see how effective transparent checkpointing can be in a supercomputer environment, a team from Northeastern University in Boston was given permission to run on the Stampede supercomputer. Team members demonstrated how effectively checkpointing techniques can be implemented in large-scale computing environments without sacrificing speed or performance.
"We were given permission to use literally one-third of the entire supercomputer just for ourselves. And we were able to save 38 terabytes in 11 minutes," said Gene Cooperman, a professor in the College of Computer and Information Science at Northeastern. "We could do it partly because they had this cluster file system which … uses disks in parallel to speed things up. Instead of writing to one disk at a time, you write to many disks at a time."
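A quick back-of-the-envelope calculation in Python shows what those figures imply. It assumes decimal units (1 terabyte = 10^12 bytes), which the article does not specify, and the single-disk rate passed in at the end is a purely hypothetical number used only to illustrate why writing to many disks in parallel matters.

```python
# Figures quoted in the article.
CHECKPOINT_BYTES = 38e12      # 38 terabytes, assuming 1 TB = 10**12 bytes
ELAPSED_SECONDS = 11 * 60     # 11 minutes

# Aggregate write rate implied by the quote.
bytes_per_second = CHECKPOINT_BYTES / ELAPSED_SECONDS
gb_per_second = bytes_per_second / 1e9


def hours_on_one_disk(per_disk_mb_per_second):
    """Time to write the same checkpoint to a single disk at the given rate."""
    seconds = CHECKPOINT_BYTES / (per_disk_mb_per_second * 1e6)
    return seconds / 3600


print(f"Aggregate write rate: about {gb_per_second:.1f} GB/s")
# 150 MB/s is a hypothetical single-disk rate, not a Stampede measurement.
print(f"One disk at 150 MB/s: about {hours_on_one_disk(150):.0f} hours")
```

Under these assumptions, the aggregate rate works out to roughly 58 gigabytes per second, far more than a single disk could sustain.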
But while this setup provided an enormous opportunity, it also presented a number of unique challenges. One of these issues was that Stampede used a newer type of network: InfiniBand UD. Cooperman's team had traditionally tested on InfiniBand RC, or reliable connection. With UD, or unreliable datagram, the network does not guarantee delivery, so the communicating computers must themselves re-send a message if one is missed. This meant that Cooperman had to extend the methods by which conversations between computers were checkpointed to allow for UD-type communications as well as older methods.
"Because we wanted to work transparently, we could not assume anything special about how the two computers might be handling the conversation," Cooperman said. "In one program, they might send information in a certain order. But we cannot assume any of that because we're doing it transparently, and, therefore, it should work for absolutely any program of this type, no matter what internal protocol or what internal algorithm they're using."
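The re-sending behavior that makes UD-style communication harder to checkpoint can be pictured with a small Python simulation. This is not InfiniBand code and uses no real networking API. The function name and the rule for which messages get lost are invented for illustration.

```python
def send_over_unreliable_link(messages, drop_first_attempt):
    """Deliver messages over a simulated link that silently loses some sends.

    drop_first_attempt is a set of sequence numbers whose first transmission
    is lost (a made-up loss rule). The sender re-sends until the message
    gets through, which stands in for waiting on an acknowledgment.
    Returns the received payloads and the total number of transmissions.
    """
    delivered = {}      # receiver side: sequence number -> payload
    transmissions = 0
    for seq, payload in enumerate(messages):
        attempt = 0
        acked = False
        while not acked:
            attempt += 1
            transmissions += 1
            lost = attempt == 1 and seq in drop_first_attempt
            if not lost:
                # The receiver keeps the first copy and ignores duplicates.
                delivered.setdefault(seq, payload)
                acked = True
    received = [delivered[seq] for seq in sorted(delivered)]
    return received, transmissions


received, count = send_over_unreliable_link(["a", "b", "c"], {1})
print(received, count)
```

In the simulation, the sender's retry state and the receiver's record of what has arrived both live inside the programs themselves, not in the network. A transparent checkpoint has to capture that state without knowing how any particular program organizes it.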