Hacking on PostgreSQL is hard

http://rhaas.blogspot.com/2024/05/hacking-on-postgresql-is-really-hard.html

Hacking on PostgreSQL is really hard. I think a lot of people would agree with this statement, not all for the same reasons. Some might point to the character of discourse on the mailing list, others to the shortage of patch reviewers, and others still to the difficulty of getting the attention of a committer, or of feeling like a hostage to some committer's whimsy. All of these are problems, but today I want to focus on the purely technical aspect of the problem: the extreme difficulty of writing reasonably correct patches.

There are tons of examples that I could use to demonstrate this, but it would be unfair to embarrass anyone else, so I'll take an example drawn from my own very recent experience: incremental backup. I consider this a moderately complex feature. It was an ambitious project, but there are certainly much more ambitious things that somebody might want to do, and that some people have done. This project has actually been brewing for several years; an earlier incarnation of the project morphed into the effort to create pg_verifybackup. In 2023, I worked back around to the main event again. I spent several months hacking on incremental backup in the first half of the year, and then several more months on it in the last quarter, ending with a commit of the feature on December 20, 2023.

On December 21, there were four commits, two by me and two by Tom Lane, fixing defects in that commit. By January 15, there were sixteen more followup commits, of which only two were planned. Two of those were by Tom Lane, one by Michael Paquier, and the rest by me. Pretty much all of these were, at least as I see it, fixing real issues. It wasn't like the mailing list was torturing me to fix stupid things that didn't matter; they were finding a whole bunch of dumb oversights in my work. After the first few weeks, the pace did slow down quite a bit, but the worst was yet to come.

On March 4, I committed a fix for incremental backup in the face of concurrent CREATE DATABASE whatever STRATEGY file_copy operations. This was formally a data-corrupting bug, although it is unlikely that many people would have hit it in practice. On April 5, I committed a fix for a data-corrupting bug that would have hit practically anyone who used incremental backup on relations 1GB or larger in size. On April 19, I committed a fix for an issue that would have made it impossible to restore incremental backups of PostgreSQL instances that made use of user-defined tablespaces. On April 25, I committed code and documentation improvements in response to observations that, if the checksum status of the cluster was changed, you could see checksum failures after restoring. These failures wouldn't be real - in reality your data was fine - but they would look frightening.
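For readers who haven't tried the feature, here is a rough sketch of the workflow these fixes were stabilizing, based on the PostgreSQL 17 tooling as I understand it; treat the exact options as approximate and check the documentation rather than relying on this:

    # Prerequisite: WAL summarization must be enabled on the server
    # (summarize_wal = on in postgresql.conf).

    # 1. Take a full base backup; it includes a backup_manifest by default.
    pg_basebackup -D /backups/full

    # 2. Later, take an incremental backup relative to the prior manifest.
    pg_basebackup -D /backups/incr1 --incremental=/backups/full/backup_manifest

    # 3. To restore, reconstruct a synthetic full backup from the chain.
    pg_combinebackup /backups/full /backups/incr1 -o /restore/data

    # 4. Optionally verify the result against its manifest.
    pg_verifybackup /restore/data

Roughly speaking, the problems described above lived in what step 2 writes out and in how step 3 reassembles it.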

This is not an exhaustive enumeration of everything I've done to try to stabilize incremental backup. For example, along the way, I broke the buildfarm several times trying to add more tests, since obviously I didn't have sufficiently good tests. If you go through the commit log, you can see my frantic efforts to stabilize the buildfarm before the howling mob of angry hackers descended on me. But the oversights mentioned in the previous paragraph, particularly the middle two, were serious. They didn't indicate any design-level insufficiency, so the fixes were very simple, but the reasonable reader might wonder how such mistakes survived testing. It's not as if I didn't test -- or at least, I didn't think my testing had been inadequate. I put significant time and energy into both manual testing and the writing of automated test cases. Jakub Wartak also put an enormous amount of time and energy into testing, for which I remain profoundly grateful, and he somehow didn't find those problems, either. How is that even possible?

One possible theory is that I'm not really very good at this whole hacking on PostgreSQL thing, and certainly there are people who are better at it than I am, but I don't think that can be the whole explanation. If it were, you would expect the troubles that I had here to be unusual, and they very much aren't. In fact, some people have had much worse experiences than I have had with this feature, resulting in patches on which someone has spent a lot of time having to be thrown out entirely, or in committed patches representing large amounts of work being reverted, or in serious bugs making it all the way through to release, necessitating after-the-fact stabilization. I remember a case where a serious data-corrupting bug that I introduced wasn't found for something like two years, and that kind of thing isn't uncommon. As far as I can tell, everyone who works on PostgreSQL struggles to write code well enough to live up to the project standards every time they sit down to write a patch, and even the very best hackers still fail at it from time to time, in small ways or sometimes in large ones.

I believe that this is part of what's behind many of the problems that I mentioned in the opening paragraph. For example, suppose you're lucky enough to be a committer. Every time you commit one of your own patches, you're at serious risk of having to drop everything and put a ton of work into fixing everything you did wrong, either as soon as you do the commit, or when the problems are found later, or both. Every time you commit one of somebody else's patches, you're at risk of having to do the same thing, which means you're probably going to be reluctant to commit anything unless you're pretty sure it's pretty good. That means that committing other people's patches is not primarily about the time it takes to type git commit and git push, but about all of the review you do beforehand, and the potential unfunded liability of having to be responsible for it afterward. I haven't talked to other committers about the extent to which this weighs on their decision-making process, but I'd be astonished if it didn't. There's one particular patch I remember committing - I won't mention which one - where I spent weeks and weeks of time reviewing the patch before committing it, and after committing it, I lost most of the next six to nine months fixing things I hadn't caught during review. That is the sort of experience that you can't afford to repeat very often; there just aren't enough months in the year, or years in your working life. I think it was totally worth the pain, in that particular case, but it's definitely not worth that amount of pain for a random patch in which I'm not particularly personally invested.

And that obviously has the effect of limiting the number of people who can get things committed to PostgreSQL. To become a committer, you have to convince people that you're one of the people who can be trusted to give the final sign-off to other people's patches. That requires both technical and diplomatic skill, but the technical skill alone takes thousands of hours to develop. And then, if you want to keep being able to commit significant patches, whether your own or someone else's, you have to continue spending at least hundreds and probably over a thousand hours on it, every year, in order to maintain the necessary skill level. Not everyone is able or willing to do that, which means that the pool of active committers doesn't grow a whole lot: people are added, but people also move on. And that in turn means that the number of promising new contributors who can get enough committer attention to become committers themselves is also quite limited. Existing committers tend to focus their attention on the most promising patches from the most promising developers; other people, to some greater or lesser extent, get frozen out. Even committers can get frozen out, to a degree: if you commit something that turns out to have major problems, you're going to get a fair amount of blowback from other committers who want to spend their time either on their own patches or on the patches of non-committers, not cleaning up after you, and that blowback is likely to make you more reluctant to commit major patches in the future. That's as it should be, but it still has the effect of further restricting the rate at which stuff gets done.

And of course, all of this also affects the tone of the community discourse. Non-committers get frustrated if they can't get the attention of committers. Reviewers get frustrated at people who submit low-quality patches, especially if repeated rounds of review don't result in much improvement. Committers get frustrated at the amount of time they spend cleaning up after other people's mistakes, or worse still, their own. I genuinely believe that almost everyone has the intention to be kind and well-mannered and to help others out whenever possible, but the sheer difficulty of the task in which we are engaged puts pressure on everyone. In my case, and I'm probably not alone in this, that pressure extends well beyond working hours. I can't count the number of times that I've been rude to someone in my family because I turned the buildfarm red and had to spend the afternoon, or the evening, fixing it, or often enough, just reverting my ill-considered changes. I'm not sure how other people experience it, but for me, the worst part of it is the realization that I've been dumb. Had I only done X or tested Y, I could have avoided messing it up, and I didn't do that, or at least not correctly, and now here we are.

Since PostgreSQL is the only open source project in which I've ever been involved, I don't really know to what degree other projects have encountered these problems, or how they've solved them. I would like to see the developer base grow, and the amount that we get done in a release scale, in a way that it currently doesn't. But I have also seen that just committing more stuff with less caution tends to backfire really hard. After 15 years as a PostgreSQL developer, most of it full time, and after 30 years of programming experience, I still can't commit a test case change without a serious risk of having to spend the next several hours, or days, cleaning up the damage. Either programming is intrinsically difficult, and that's just to be expected, or we're doing things that make it harder for ourselves. I suspect it's at least partially the latter, but I don't know.

Your thoughts welcome.
