Crash-only software: More than meets the eye (2006)

https://lwn.net/Articles/191059/

Next time your Linux laptop crashes, pull out your watch (or your cell phone) and time how long it takes to boot up. More than likely, you're running a journaling file system, and not only did your system boot up quickly, but it didn't lose any data that you cared about. (Maybe you lost the last few bytes of your DHCP client's log file, darn.) Now, keep your timekeeping device of choice handy and execute a normal shutdown and reboot. More than likely, you will find that it took longer to reboot "normally" than it did to crash your system and recover it - and for no perceivable benefit.

George Candea and Armando Fox noticed that, counter-intuitively, many software systems can crash and recover more quickly than they can be shut down and restarted. They reported the following measurements in their paper, Crash-only Software (published in Hot Topics in Operating Systems IX in 2003):

System	Clean reboot	Crash reboot	Speedup
RedHat 8 (ext3)	104 sec	75 sec	1.4x
JBoss 3.0 app server	47 sec	39 sec	1.2x
Windows XP	61 sec	48 sec	1.3x
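
The Speedup column is simply the clean reboot time divided by the crash reboot time. As a quick check of that arithmetic, here is a small sketch that uses only the figures from the table; the names measurements and speedup are just labels chosen for this illustration.

```python
# Reboot times in seconds, copied from the table above:
# (clean reboot, crash reboot)
measurements = {
    "RedHat 8 (ext3)": (104, 75),
    "JBoss 3.0 app server": (47, 39),
    "Windows XP": (61, 48),
}


def speedup(clean_sec, crash_sec):
    """Ratio of clean reboot time to crash reboot time, to one decimal."""
    return round(clean_sec / crash_sec, 1)


speedups = {name: speedup(*times) for name, times in measurements.items()}

for name, value in speedups.items():
    print(f"{name}: {value}x")
```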

In their experiments, no important data was lost. This is not surprising as, after all, good software is designed to safely handle crashes. Software that loses or ruins your data when it crashes isn't very popular in today's computing environment - remember how frustrating it was to use word processors without an auto-save feature? What is surprising is that most systems have two methods of shutting down - cleanly or by crashing - and two methods of starting up - normal start up or recovery - and that frequently the crash/recover method is, by all objective measures, a better choice. Given this, why support the extra code (and associated bugs) to do a clean start up and shutdown? In other words, why should I ever type "halt" instead of hitting the power button?

The main reason to support explicit shutdown and start-up is simple: performance. Often, designers must trade off higher steady state performance (when the application is running normally) against performance during a restart - and against acceptable data loss. File systems are a good example of this trade-off: ext2 runs very quickly while in use but takes a long time to recover and makes no guarantees about when data hits disk, while ext3 has somewhat lower performance while in use but is very quick to recover and makes explicit guarantees about when data hits disk. When overall system availability and acceptable data loss in the event of a crash are factored into the performance equation, ext3 or any other journaling file system is the winner for many systems, including, more than likely, the laptop you are using to read this article.

Crash-only software is software that crashes safely and recovers quickly. The only way to stop it is to crash it, and the only way to start it is to recover. A crash-only system is composed of crash-only components which communicate with retryable requests; faults are handled by crashing and restarting the faulty component and retrying any requests which have timed out. The resulting system is often more robust and reliable because crash recovery is a first-class citizen in the development process, rather than an afterthought, and you no longer need the extra code (and associated interfaces and bugs) for explicit shutdown. All software ought to be able to crash safely and recover quickly, but crash-only software must have these qualities, or their lack becomes quickly evident.
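
To make the "retryable requests" part concrete, here is a minimal in-process sketch. Everything in it (Counter, Supervisor, the request IDs, the dictionary standing in for durable storage) is hypothetical and invented for this illustration; it is not taken from the Candea and Fox paper. The component has no shutdown method and only one way to start, which is to recover from durable state. The caller handles every fault the same way: throw the component away, start a fresh one, and retry.

```python
class ComponentCrashed(Exception):
    """Stands in for a request that timed out because the component died."""


class Counter:
    """Hypothetical component: its only start-up path is recovery."""

    def __init__(self, durable):
        self.durable = durable                  # survives crashes (stand-in for disk)
        self.applied = set(durable["applied"])  # volatile, rebuilt on every start
        self.faulty = False                     # flipped by the demo to simulate a bug

    def handle(self, request_id, amount):
        if self.faulty:
            raise ComponentCrashed(request_id)
        if request_id not in self.applied:      # a retried request is applied only once
            # A real store would commit these two updates atomically.
            self.durable["total"] += amount
            self.durable["applied"].append(request_id)
            self.applied.add(request_id)
        return self.durable["total"]


class Supervisor:
    """Crashes and restarts the component, then retries the request."""

    def __init__(self):
        self.durable = {"total": 0, "applied": []}
        self.component = Counter(self.durable)
        self.restarts = 0

    def request(self, request_id, amount, retries=3):
        for _ in range(retries):
            try:
                return self.component.handle(request_id, amount)
            except ComponentCrashed:
                # No diagnosis, no clean shutdown: discard and recover.
                self.component = Counter(self.durable)
                self.restarts += 1
        raise RuntimeError("component did not recover")


sup = Supervisor()
first = sup.request("r1", 5)
sup.component.faulty = True        # the component starts misbehaving
second = sup.request("r2", 7)      # crash, restart, retry
repeat = sup.request("r2", 7)      # a duplicate retry changes nothing
print(first, second, repeat, sup.restarts)
```

The request IDs are what make retrying safe: without them, a request that was applied just before the crash would be applied a second time after the restart.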

The concept of crash-only software has received quite a lot of attention since its publication. Besides several well-received research papers demonstrating useful implementations of crash-only software, crash-only software has been covered in several popular articles in publications as diverse as Scientific American, Salon.com, and CIO Today. It was cited as one of the reasons Armando Fox was named to Scientific American's list of top 50 scientists for 2003 and George Candea was named one of MIT Technology Review's Top 35 Young Innovators for 2005. Crash-only software has made its mark outside the press room as well; for example, Google's distributed file system, GoogleFS, is implemented as crash-only software, all the way through to the metadata server. The term "crash-only" is now regularly bandied about in design discussions for production software. I myself wrote a blog entry on crash-only software back in 2004. Why bother writing about it again? Quite simply, the crash-only software meme became so popular that, inevitably, mutations arose and flourished, sometimes to the detriment of allegedly crash-only software systems. In this article, we will review some of the more common misunderstandings about designing and implementing crash-only software.

Misconceptions about crash-only software

The first major misunderstanding is that crash-only software is a form of free lunch: you can be lazy and not write shutdown code, not handle errors (just crash it! whee!), or not save state. Just pull up your favorite application in an editor, delete the code for normal start up and shutdown, and voila! instant crash-only software. In fact, crash-only software involves greater discipline and more careful design, because if your checkpointing and recovery code doesn't work, you will find out right away. Crash-only design helps you produce more robust, reliable software; it doesn't exempt you from writing robust, reliable software in the first place.

Another mistake is overuse of the crash/restart "hammer." One of the ideas in crash-only software is that if a component is behaving strangely or suffering some bug, you can just crash it and restart it, and more than likely it will start functioning again. This will often be faster than diagnosing and fixing the problem by hand, and so is a good technique for high-availability services. Some programmers overuse the technique by deliberately writing code to crash the program whenever something goes wrong, when the correct solution is to handle all the errors you can think of correctly, and then rely on crash/restart for unforeseen error conditions. Another overuse of crash/restart is the belief that when things go wrong, you should crash and restart the whole system. One tenet of crash-only system design is the idea that crash/restart is cheap - because you are only crashing and recovering small, self-contained parts of the system (see the paper on microreboots). Try telling your users that your whole web browser crashes and restarts every 2 minutes because it is crash-only software and see how well that goes over. If instead the browser quietly crashes and recovers only the thread that is misbehaving, you will have much happier users.

On the face of it, the simplest part of crash-only software would be implementing the "crash" part. How hard is it to hit the power button? There is a subtle implementation point that is easy to miss, though: the crash mechanism has to be entirely outside and independent of the crash-only system - hardware power switch, kill -9, shutting down the virtual machine. If it is implemented through internal code, it takes away a valuable part of crash-only software: that you have an all-powerful, reliable method to take any misbehaving component of the system and crash/restart it into a known state.
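
As a rough sketch of what "outside and independent" means in practice, the following assumes a POSIX system and uses Python's standard subprocess module. The child program is a hypothetical stand-in for a misbehaving component: it ignores polite termination requests. The supervisor does not ask the component to do anything; it sends SIGKILL, which the component can neither catch nor ignore.

```python
import signal
import subprocess
import sys

# Hypothetical misbehaving component: ignores SIGTERM, then hangs around.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def start():
    """Start the component and wait until it has installed its handler."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()
    return proc


def crash(proc):
    """Crash from outside: on POSIX, Popen.kill() sends SIGKILL."""
    proc.kill()
    return proc.wait(timeout=10)


proc = start()

# A request the component must cooperate with can simply be ignored...
proc.terminate()
try:
    proc.wait(timeout=0.5)
    ignored_sigterm = False
except subprocess.TimeoutExpired:
    ignored_sigterm = True

# ...whereas the external crash mechanism needs no cooperation at all.
exit_status = crash(proc)
proc.stdout.close()
print(ignored_sigterm, exit_status == -signal.SIGKILL)
```

On POSIX, a negative return code from Popen means the child was terminated by that signal number, so the supervisor can also confirm that the crash really happened.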

I heard of one "crash-only" system in which the shutdown code was replaced with a call to abort() as part of a "crash-only" design. There were two problems with this approach. One, it relied on the system to not have any bugs in the code path leading to the abort() call or any deadlocks which would prevent it being executed. Two, shutting down the system in this manner only exercised a subset of the total possible crash space, since it was only testing what happened when the system successfully received and handled a request to shut down. For example, a single-threaded program that handled requests in an event loop would never be crashed in the middle of handling another request, and so the recovery code would not be tested for this case. One more example of a badly implemented "crash" is a database that, when it ran out of disk space for its event logging, could not be safely shut down because it wanted to write a log entry before shutting down, but it was out of disk space, so...
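
One way to exercise more of the crash space is to inject a crash at every possible point in an operation and run recovery after each one. The sketch below is a toy, with assumptions stated up front: the Disk class, the four-step journaled transfer, and the invariant (the two balances always sum to 100) are all invented for this illustration, and "crash" is simulated by raising an exception in place of a write.

```python
class Crash(Exception):
    """Simulated power loss: raised instead of performing a write."""


class Disk:
    """Toy durable store that can lose power before any given write."""

    def __init__(self, crash_before_write=None):
        self.data = {"a": 100, "b": 0, "journal": None}
        self.writes = 0
        self.crash_before_write = crash_before_write

    def write(self, key, value):
        if self.writes == self.crash_before_write:
            raise Crash()
        self.data[key] = value
        self.writes += 1


def transfer(disk, amount):
    new = {"a": disk.data["a"] - amount, "b": disk.data["b"] + amount}
    disk.write("journal", new)    # 1. record the intended end state
    disk.write("a", new["a"])     # 2. apply first half
    disk.write("b", new["b"])     # 3. apply second half
    disk.write("journal", None)   # 4. retire the journal entry


def recover(disk):
    """The only start-up path: replay a pending journal entry, if any."""
    pending = disk.data["journal"]
    if pending is not None:
        disk.data.update(pending)
        disk.data["journal"] = None


def crash_everywhere(amount=30, total_writes=4):
    """Crash before each write in turn (plus one run with no crash)."""
    outcomes = []
    for point in range(total_writes + 1):
        disk = Disk(crash_before_write=point)
        try:
            transfer(disk, amount)
        except Crash:
            pass
        recover(disk)
        # After recovery the transfer either fully happened or did not.
        assert disk.data["a"] + disk.data["b"] == 100
        outcomes.append((disk.data["a"], disk.data["b"]))
    return outcomes


outcomes = crash_everywhere()
print(outcomes)
```

An abort() placed at the end of the request loop would only ever test the last of these cases; the crashes between steps 1 and 4 are exactly the ones that the recovery code exists for.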

Another common pattern is to ignore the trade-offs of performance vs. recovery time vs. reliability and take an absolutist approach to optimizing for one quality while maintaining superficial allegiance to crash-only design. The major trade-off is that checkpointing your application's state improves recovery time and reliability but reduces steady state performance. The two extremes are checkpointing or saving state far too often and checkpointing not at all; like Goldilocks, you need to find the checkpoint frequency that is Just Right for your application.

What frequency of checkpointing will give you acceptable recovery time, acceptable performance, and acceptable data loss? I once used a web browser which only saved preferences and browsing history on a clean shutdown of the browser. Saving the history every millisecond is clearly overkill, but saving changed items every minute would be quite reasonable. The chosen strategy, "save only on shutdown," turned out to be equivalent to "save never" - how often do people close their browsers, compared to how often they crash? After the third or fourth time I lost my settings, I ended up solving this problem by explicitly starting up the browser for the sole purpose of changing the settings and immediately closing it again. (This is a good example of how all software should be written to crash safely but does not.) Most implementations of bash I have used take the same approach to saving the command history; as a result I now explicitly "exit" out of running shells (all 13 or so of them) whenever I shut down my computer so I don't lose my command history.
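
A sketch of the "save changed items every minute" strategy might look like the following. The History class and its parameters are hypothetical; the write-to-temporary-file-then-rename pattern is a common technique, not something the browser in question is known to have used. The current time is passed in explicitly so that the example is deterministic; real code would more likely read a clock such as time.monotonic().

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state so that a crash leaves either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # push the data to disk before renaming
        os.replace(tmp, path)         # atomic replacement on POSIX
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # A stricter version would also fsync the containing directory.


def load_checkpoint(path, default):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


class History:
    """Hypothetical browser history, checkpointed when dirty and due."""

    def __init__(self, path, interval_sec=60.0):
        self.path = path
        self.interval_sec = interval_sec
        self.items = load_checkpoint(path, [])   # start-up is recovery
        self.dirty = False
        self.last_save = 0.0

    def visit(self, url, now):
        self.items.append(url)
        self.dirty = True
        self.maybe_checkpoint(now)

    def maybe_checkpoint(self, now):
        if self.dirty and now - self.last_save >= self.interval_sec:
            save_checkpoint(self.path, self.items)
            self.dirty = False
            self.last_save = now


path = os.path.join(tempfile.mkdtemp(), "history.json")
history = History(path)
history.visit("https://lwn.net/", now=100.0)                   # checkpointed
history.visit("https://lwn.net/Articles/191059/", now=130.0)   # not yet due

# Simulate a crash: no shutdown code runs, we simply start again.
recovered = History(path).items
print(recovered)
```

The data lost in the crash is bounded by one checkpoint interval, which is the knob being tuned: a shorter interval loses less and costs more steady state performance.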

Shutdown code should be viewed as, fundamentally, only of use to optimize the next start up sequence and should not be used to do anything required for correctness. One way to approach shutdown code is to add a big comment at the top of the code saying "WISHFUL THINKING: This code may never be executed. But it sure would be nice."
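
One possible shape for "shutdown as optimization only" is sketched below, under invented assumptions: an append-only log is the source of truth, and the shutdown code writes a hint file containing a prebuilt index. Start-up trusts the hint only if it matches the log (here, crudely, by comparing the log's size); otherwise it takes the slow path and rebuilds the index. All function and file names are hypothetical.

```python
import json
import os
import tempfile


def append(log_path, key, value):
    """Durable state lives in the log, never in the hint."""
    with open(log_path, "a") as f:
        f.write(json.dumps([key, value]) + "\n")


def rebuild_index(log_path):
    """Slow but always-correct path: scan the whole log."""
    index = {}
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                key, value = json.loads(line)
                index[key] = value
    return index


def startup(log_path, hint_path):
    """Use the shutdown hint only if it matches the log; else recover."""
    log_size = os.path.getsize(log_path) if os.path.exists(log_path) else 0
    try:
        with open(hint_path) as f:
            hint = json.load(f)
        if hint["log_size"] == log_size:
            return hint["index"], "fast"
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        pass                      # missing, torn, or malformed hint
    return rebuild_index(log_path), "recovered"


def shutdown(log_path, hint_path, index):
    # WISHFUL THINKING: This code may never be executed.
    with open(hint_path, "w") as f:
        json.dump({"log_size": os.path.getsize(log_path), "index": index}, f)


d = tempfile.mkdtemp()
log, hint = os.path.join(d, "log"), os.path.join(d, "hint")

append(log, "a", 1)
index, mode1 = startup(log, hint)     # no hint yet
shutdown(log, hint, index)            # a clean shutdown, for once
_, mode2 = startup(log, hint)         # the hint pays off
append(log, "b", 2)                   # then a crash: no shutdown code runs
index3, mode3 = startup(log, hint)    # stale hint is ignored
print(mode1, mode2, mode3, index3)
```

Deleting the shutdown function entirely changes nothing about correctness; it only makes some start-ups slower.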

Another class of misunderstanding is about what kind of systems are suitable for crash-only design. Some people think crash-only software must be stateless, since any part of the system might crash and restart, and lose any uncommitted state in the process. While this means you must carefully distinguish between volatile and non-volatile state, it certainly doesn't mean your system must be stateless! Crash-only software only says that any non-volatile state your system needs must itself be stored in a crash-only system, such as a database or session state store. Usually, it is far easier to use a special purpose system to store state, rather than rolling your own. Writing a crash-safe, quick-recovery state store is an extremely difficult task and should be left to the experts (and will make your system easier to implement).
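
As an illustration of the volatile versus non-volatile split, the sketch below keeps the state that must survive in SQLite (via Python's standard sqlite3 module), which is one example of a special purpose store that commits transactions atomically, and keeps a cache in ordinary memory. The Cart class and its schema are hypothetical.

```python
import os
import sqlite3
import tempfile


class Cart:
    """Hypothetical stateful component with no shutdown method."""

    def __init__(self, path):
        # Non-volatile state: delegated to a store built to survive crashes.
        self.conn = sqlite3.connect(path)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS cart "
                "(item TEXT PRIMARY KEY, qty INTEGER)"
            )
        # Volatile state: cheap to lose, rebuilt on demand after a restart.
        self.render_cache = {}

    def add(self, item, qty):
        with self.conn:   # the transaction commits, or rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO cart (item, qty) VALUES (?, ?)",
                (item, qty),
            )
        self.render_cache.clear()

    def items(self):
        return dict(self.conn.execute("SELECT item, qty FROM cart"))


path = os.path.join(tempfile.mkdtemp(), "state.db")

cart = Cart(path)
cart.add("book", 2)
cart.render_cache["html"] = "<ul>...</ul>"

# Simulate a crash and restart: the old object is simply abandoned and a
# new one recovers from the store. Nothing was "shut down".
restarted = Cart(path)
print(restarted.items(), restarted.render_cache)
```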

Crash-only software makes explicit the trade-off between optimizing for steady-state performance and optimizing for recovery. Sometimes this is taken to mean that you can't use crash-only design for high performance systems. As usual, it depends on your system, but many systems suffer bugs and crashes often enough that crash-only design is a win when you consider overall up time and performance, rather than performance only when the system is up and running. Perhaps your system is robust enough that you can optimize for steady state performance and disregard recovery time... but it's unlikely.
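
A back-of-the-envelope calculation shows how recovery time can outweigh raw speed. It uses the standard availability formula, MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to recover. Every number below is invented purely for illustration; none of them are measurements from the paper or from any real system.

```python
def availability(mttf_sec, mttr_sec):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_sec / (mttf_sec + mttr_sec)


def effective_throughput(steady_rate, mttf_sec, mttr_sec):
    """Steady state rate, discounted by the time spent recovering."""
    return steady_rate * availability(mttf_sec, mttr_sec)


# Invented figures: both designs fail once an hour on average.
MTTF_SEC = 3600

# Design A: faster while running, slow to recover.
tuned_for_speed = effective_throughput(1000, MTTF_SEC, mttr_sec=600)

# Design B: 10% slower while running, recovers in seconds.
tuned_for_recovery = effective_throughput(900, MTTF_SEC, mttr_sec=10)

print(round(tuned_for_speed), round(tuned_for_recovery))
```

With these made-up inputs the slower design delivers more work overall. With a much longer time between failures the conclusion flips, which is the "it depends on your system" part.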

Because it must be possible to crash and restart components, some people think that a multi-threaded system using locks can't be crash-only - after all, what happens if you crash while holding a lock? The answer is that locks can be used inside a crash-only component, but all interfaces between components need to allow for the unexpected crash of components. Interfaces between components need to strongly enforce fault boundaries, put timeouts on all requests, and carefully formulate requests so that they don't rely on uncommitted state that could be lost. As an example, consider how the recently-merged robust futex facility makes crash recovery explicit.
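
The idea of making crash recovery explicit at a lock boundary can be modeled in a few lines. To be clear about what this is: a toy model written for this illustration, not the kernel's robust futex interface and not any real locking API. The point it tries to capture is that when a lock holder crashes, the next acquirer is told so, and must repair the shared state before carrying on.

```python
class RobustLock:
    """Toy model of a lock whose holder may crash (not a real API)."""

    def __init__(self):
        self.owner = None
        self.needs_recovery = False

    def acquire(self, who):
        if self.owner is not None:
            return "busy"
        self.owner = who
        if self.needs_recovery:
            # The caller now holds the lock, but the previous holder died
            # while holding it: shared state may be half-updated.
            return "owner-died"
        return "ok"

    def mark_consistent(self, who):
        assert self.owner == who
        self.needs_recovery = False

    def release(self, who):
        assert self.owner == who
        self.owner = None

    def owner_crashed(self):
        """Called by whatever detects the crash, never by the owner."""
        self.owner = None
        self.needs_recovery = True


shared = {"x": 0, "y": 0}     # invariant: x == y whenever the lock is free
lock = RobustLock()

lock.acquire("worker-1")
shared["x"] = 1               # worker-1 crashes halfway through its update
lock.owner_crashed()

status = lock.acquire("worker-2")
if status == "owner-died":
    shared["y"] = shared["x"]         # repair the invariant first
    lock.mark_consistent("worker-2")
lock.release("worker-2")

next_status = lock.acquire("worker-3")
print(status, shared, next_status)
```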

Some people end up with the impression that crash-only software is less reliable and unsuitable for important "mission-critical" applications because the design explicitly admits that crashes are inevitable. Crash-only software is actually more reliable because it takes into account from the beginning an unavoidable fact of computing - unexpected crashes.

A criticism often leveled at systems designed to improve reliability by handling errors in some way other than complete system crash is that they will hide or encourage software bugs by masking their effects. First, crash-only software in many ways exposes previously hidden bugs, by explicitly testing recovery code in normal use. Second, explicitly crashing and restarting components as a workaround for bugs does not preclude taking a crash dump or otherwise recording data that can be used to solve the bug.

How can we apply crash-only design to operating systems? One example is file systems, and the design of chunkfs (discussed in last week's LWN article on the 2006 Linux file systems workshop, http://lwn.net/Articles/190222/, and in more detail at http://www.fenrus.org/chunkfs.txt). We are trying to improve reliability and data availability by separating the on-disk data into individually checkable components with strong fault isolation. Each chunk must be able to be individually "crashed" - unmounted - and recovered - fsck'd - without bringing down the other chunks. The code itself must be designed to allow the failure of individual chunks without holding locks or other resources indefinitely, which could cause system-wide deadlocks and unavailability. Updates within each chunk must be crash-safe and quickly recoverable. Splitting the file system up into smaller, restartable, crash-only components creates a more reliable, easier to repair crash-only system.

The conclusion

Properly implemented, crash-only software produces higher quality, more reliable code; poorly understood, it results in lazy programming. Probably the most common misconception about crash-only software is that it allows you to take shortcuts when writing and designing your code. Wake up, Sleeping Beauty, there ain't no such thing as a free lunch. But you can get a more reliable, easier to debug system if you rigorously apply the principles of crash-only design.

[Thanks to Brian Warner for inspiring this article, George Candea and Armando Fox for comments and for codifying crash-only design in general, and the implementer(s) of the Emacs auto-save feature, which has saved my work too many times to count.]

Index entries for this article
GuestArticles	Aurora (Henson), Valerie


