Bug 575193

Summary: Some Gerrit changes report 404s, some internal server error (500) after yesterday's outage
Product: Community Reporter: Thomas Wolf <twolf>
Component: Gerrit Assignee: Eclipse Webmaster <webmaster>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: P3 CC: denis.roy, loskutov, matthias.sohn, mikael.barbero, sw
Version: unspecified   
Target Milestone: ---   
Hardware: All   
OS: All   
See Also: https://bugs.eclipse.org/bugs/show_bug.cgi?id=575252
Whiteboard:
Bug Depends on:    
Bug Blocks: 575261    

Description Thomas Wolf CLA 2021-08-02 10:21:51 EDT
Not sure if Gerrit is still recovering from yesterday's outage. I see people pushing changes, so I presume it should be operating nominally now.

However, some changes appear to be corrupted.

The following ones give 404 responses for me:

* https://git.eclipse.org/r/c/egit/egit/+/183574
* https://git.eclipse.org/r/c/statet/org.eclipse.statet-commons/+/183575
* https://git.eclipse.org/r/c/statet/org.eclipse.statet-commons/+/183571
* https://git.eclipse.org/r/c/statet/org.eclipse.statet-commons/+/183572

Pushing a new patchset onto https://git.eclipse.org/r/c/egit/egit/+/183574 simply failed with "remote rejected".

The following ones give 

  Error 500 (Server Error): Internal server error

  Endpoint: /changes/*~*/revisions/*/related

* https://git.eclipse.org/r/c/statet/org.eclipse.statet-commons/+/183566
* https://git.eclipse.org/r/c/statet/org.eclipse.statet-commons/+/183567
* https://git.eclipse.org/r/c/egit/egit/+/183552
* https://git.eclipse.org/r/c/statet/org.eclipse.statet-commons/+/183565
Comment 1 Thomas Wolf CLA 2021-08-03 02:30:29 EDT
The 404 error message is

  You might have not enough privileges.

  Error 404 (Not Found): Not found: egit/egit~183574

  Endpoint: /changes/*~*/edit/

The 404/500 responses seem to affect changes pushed between Jul 30 15:11 and the crash. Earlier changes and changes pushed after Gerrit came up again seem fine.

@Stephan: I presume you adding yourself as CC means you also see this?
Comment 2 Stephan Wahlbrink CLA 2021-08-03 14:20:13 EDT
Yes, as listed in your post, changes for StatET were affected too.

I solved it for me:
I could commit some changes despite the error message.
Where this was not possible, I simply sent the changes to Gerrit again with a new change id.
Comment 3 Mikaël Barbero CLA 2021-08-04 08:39:24 EDT
The last weekend's outage was on our primary backend storage (where repos hosted at git.eclipse.org are stored). The outage started at 9:20pm/9:30pm EDT on July 31st.

Since then, live replicas have been rsync'd back. What you see is apparent data loss. I don't have any information on whether this is due to how the storage was restored, and I cannot tell you whether we have more up-to-date data that could be restored for your git repo.

Hopefully, the date/time of the outage given above should help you gather the required clues to bring back a working git repo.
Comment 4 Thomas Wolf CLA 2021-08-04 17:16:09 EDT
Also a 404 response for /changes/*~*/robotcomments :

https://git.eclipse.org/r/c/jdt/eclipse.jdt.core/+/183573

@Mikaël: possibly some of that data loss can be recovered if people have the changes still locally and push them with new Gerrit change IDs. But unless these broken Gerrit changes get removed somehow, they'll stay around forever :-( It also doesn't look like plain "data loss"; it looks like the Gerrit database is corrupted. It does have _some_ info on these changes, but other parts appear to be missing. Since Gerrit stores most (or even all?) data in the git repos themselves, this is rather disconcerting. Things in its notedb or in other refs/meta/* git refs appear to be corrupted.
Comment 5 Thomas Wolf CLA 2021-08-05 03:23:54 EDT
I re-pushed one of the corrupted changes: https://git.eclipse.org/r/c/egit/egit/+/183574 is broken, superseded now by https://git.eclipse.org/r/c/egit/egit/+/183717. However, that newly pushed change gives a 500 for /changes/*~*/revisions/*/related .

I cannot do anything with https://git.eclipse.org/r/c/egit/egit/+/183574 , not even abandoning it.
Comment 6 Matthias Sohn CLA 2021-08-10 17:48:01 EDT
Just came back from 2 weeks vacation.

Could you provide me a copy of the egit repository from the gerrit site ? Then I can check what's broken and try to repair it.

Gerrit (via JGit) relies on atomic file renames to implement transactions on the filesystem. It seems the secondary filesystem used to restore gerrit from is not a consistent copy of the primary one which was destroyed during the outage.

In order to avoid such issues we should consider
- replicate all git repositories hosted by Gerrit to another storage using Gerrit's replication plugin [1] which replicates via git transport to ensure consistency (this can't be guaranteed e.g. by rsync)
- implement periodic backups of the Gerrit site [2]
- migrate from a single-host Gerrit setup to a multi-site [3] setup deployed across multiple availability zones

[1] https://gerrit.googlesource.com/plugins/replication/+/refs/heads/master/src/main/resources/Documentation 
[2] https://gerrit-review.googlesource.com/Documentation/backup.html
[3] https://gerrit.googlesource.com/plugins/multi-site/+/refs/heads/master
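
For illustration, the "write to a temporary file, then rename atomically" pattern referred to in this comment can be sketched as follows. This is a minimal Python sketch of the general technique, not JGit's actual code; the function name `atomic_write` is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new content to a temporary file in the same directory,
    # so that the final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A file-level copy taken while such updates are in flight only stays consistent if it preserves this all-or-nothing behaviour and the order of the updates.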
Comment 7 Thomas Wolf CLA 2021-08-11 05:30:16 EDT
(In reply to Matthias Sohn from comment #6)
> Could you provide me a copy of the egit repository from the gerrit site ?
> Then I can check what's broken and try to repair it.
> 
> Gerrit (via JGit) relies on atomic file renames to implement transactions on
> the filesystem. It seems the secondary filesystem used to restore gerrit
> from is not a consistent copy of the primary one which was destroyed during
> the outage.

When you analyze the repo, please also look if there's something that could be done in Gerrit to avoid such inconsistent states. Atomic renames are fine, but if several files are changed, what in Gerrit guarantees overall consistency? The commit for patch set 183574,1 is apparently there, otherwise it shouldn't even show up at all. But if the notedb entry/entries is/are missing, then Gerrit should at least be able to deal with or even recover from it. And likewise for the revisions/*/related or robotcomments problems.
Comment 8 Matthias Sohn CLA 2021-08-11 08:41:58 EDT
(In reply to Thomas Wolf from comment #7)
> (In reply to Matthias Sohn from comment #6)
> > Could you provide me a copy of the egit repository from the gerrit site ?
> > Then I can check what's broken and try to repair it.
> > 
> > Gerrit (via JGit) relies on atomic file renames to implement transactions on
> > the filesystem. It seems the secondary filesystem used to restore gerrit
> > from is not a consistent copy of the primary one which was destroyed during
> > the outage.
> 
> When you analyze the repo, please also look if there's something that could
> be done in Gerrit to avoid such inconsistent states. Atomic renames are
> fine, but if several files are changed, what in Gerrit guarantees overall
> consistency? The commit for patch set 183574,1 is apparently there,
> otherwise it shouldn't even show up at all. But if the notedb entry/entries
> is/are missing, then Gerrit should at least be able to deal with or even
> recover from it. And likewise for the revisions/*/related or robotcomments
> problems.

Most objects in git are immutable. When new objects are stored, either by receiving a pack through git transport or by creating them locally, the immutable blob (file content or note content), tree, and commit data are persisted first, and only then are refs updated. To update refs, Gerrit can use JGit's PackedBatchRefUpdate, which packs all refs and then updates packed-refs by writing the new content, including all ref updates of the current transaction, to a temporary file and atomically renaming it into place. Alternatively, RefTable can be used to update multiple refs consistently. RefTable has been used in production for a long time by those using JGit DFS storage; the traditional file-storage-based implementation of RefTable is still considered experimental and is not yet widely used.

I agree that Gerrit should provide mechanisms to detect and fix corruption like the one caused by this storage failure. 

It seems that the periodic batched update of the secondary NFS server mentioned in the post mortem does not guarantee that file updates happening on the primary server are replicated in the same order to the secondary NFS server. This may lead to a corrupt state if, e.g., a ref update was replicated but the new git objects this ref update depends on were not yet replicated. This should not happen if replication is done via git transport, which basically replays all ref updates on the secondary storage in the same order as on the primary storage, or if the file system is backed up periodically via consistent filesystem snapshots.
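
The ordering problem described here can be shown with a toy model. The sketch below is a simplified Python simulation, not a model of the actual NFS setup; the object ID `obj-1`, the ref name `refs/changes/example`, and all function names are made up for illustration. It applies every prefix of a write log to an empty replica and reports the prefixes that leave a ref pointing at a missing object.

```python
from typing import Dict, List, Tuple

# A write is (kind, key, value), where kind is "object" or "ref".
Write = Tuple[str, str, str]


def apply(writes: List[Write]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Apply writes to an empty replica and return (objects, refs)."""
    objects: Dict[str, str] = {}
    refs: Dict[str, str] = {}
    for kind, key, value in writes:
        if kind == "object":
            objects[key] = value
        else:
            refs[key] = value
    return objects, refs


def dangling_refs(objects: Dict[str, str], refs: Dict[str, str]) -> List[str]:
    """Return the names of refs whose target object is not present."""
    return sorted(name for name, oid in refs.items() if oid not in objects)


def prefixes_with_dangling_refs(log: List[Write]) -> List[int]:
    """Return every prefix length at which an interrupted copy is inconsistent."""
    bad = []
    for n in range(len(log) + 1):
        objects, refs = apply(log[:n])
        if dangling_refs(objects, refs):
            bad.append(n)
    return bad


# On the primary, the object is persisted first and the ref is updated last.
primary_log: List[Write] = [
    ("object", "obj-1", "commit data"),
    ("ref", "refs/changes/example", "obj-1"),
]

# Replaying in the original order is consistent at every interruption point.
in_order = prefixes_with_dangling_refs(primary_log)

# A copy that transfers the ref before the object is not.
reordered = prefixes_with_dangling_refs(list(reversed(primary_log)))
```

Here `in_order` is empty, while `reordered` contains the interruption point at which only the ref has arrived.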
Comment 9 Eclipse Webmaster CLA 2021-08-11 08:54:05 EDT
(In reply to Matthias Sohn from comment #6)
 
> Could you provide me a copy of the egit repository from the gerrit site ?
> Then I can check what's broken and try to repair it.

Ok I've put a tgz'd copy of the repo in your homedir, which you should be able to access via sftp to projects-storage.eclipse.org .  If that doesn't work let me know and I'll toss it on archive.eclipse.org.

-M.
Comment 10 Denis Roy CLA 2021-08-11 10:20:09 EDT
(In reply to Thomas Wolf from comment #4)
> @Mikaël: possibly some of that data loss can be recovered if people have the
> changes still locally and push them with new Gerrit change IDs. But unless
> these broken Gerrit changes get removed somehow, they'll stay around forever
> :-( It also doesn't look like plain "data loss"; it looks like the Gerrit
> database is corrupted.

I didn't think Gerrit had a database?


> It does have _some_ info on these changes, but other
> parts appear to be missing. Since Gerrit stores most (or even all?) data in
> the git repos themselves, this is rather disconcerting. Things in its notedb
> or in other refs/meta/* git refs appear to be corrupted.

Corrupted how?



(In reply to Matthias Sohn from comment #6)
> Gerrit (via JGit) relies on atomic file renames to implement transactions on
> the filesystem. It seems the secondary filesystem used to restore gerrit
> from is not a consistent copy of the primary one which was destroyed during
> the outage.

 
> In order to avoid such issues we should consider
> - replicate all git repositories hosted by Gerrit to another storage using
> Gerrit's replication plugin [1] which replicates via git transport to ensure
> consistency (this can't be guaranteed e.g. by rsync)

I will look into this. You must be aware that most backup utilities do not speak "native git" -- and nor should they.
Comment 11 Thomas Wolf CLA 2021-08-11 10:38:32 EDT
(In reply to Denis Roy from comment #10)
> (In reply to Thomas Wolf from comment #4)
> > :-( It also doesn't look like plain "data loss"; it looks like the Gerrit
> > database is corrupted.
> 
> I didn't think Gerrit had a database?

Of course it does. It's just not a _relational_ database. Gerrit's database is called git. It uses the git repo as a database to store things. They even call that the "notedb". "db" standing for "database".
Comment 12 Thomas Wolf CLA 2021-08-11 13:59:58 EDT
(In reply to Denis Roy from comment #10)
> Corrupted how?

How should I know what exactly is corrupted? Once Matthias has analyzed the tgz dump of the repo, we might know.

As I've been told several times now to "just push again", let's be very clear about this: this is not a case of a commit gone missing. The corruption is in the part of the git repo that Gerrit manages itself as its database; if objects are missing in that part, there is no way I could re-push them since I never had them and wouldn't know what they should contain and finally because Gerrit for good reasons doesn't even give access to this part of the git repo.
Comment 13 Thomas Wolf CLA 2021-08-11 14:10:04 EDT
(In reply to Thomas Wolf from comment #12) 
> ...that Gerrit manages itself as its database...

Just as additional information in case it wasn't known: one can put arbitrary blobs into a git repository. It doesn't have to be commits.

For instance, the JGit repository has a tag "spearce-gpg-pub". This tag does *not* point to a commit, but to a blob, which contains an old GPG public key, apparently of the GPG key pair Shawn used at some time to sign early JGit distributions.

In a similar way (but more complex) Gerrit uses git repos as a database to store review comments and other stuff. But all that is hidden; when you just clone the repo from Gerrit, you don't get that part.
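
To make the point about arbitrary blobs concrete: a git blob is just a sequence of bytes stored under an ID derived from its content, independent of any commit. The following Python sketch computes that ID the way git does for repositories using the SHA-1 object format; the function name `git_blob_id` is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID git assigns to a blob with this content (SHA-1 format)."""
    # git hashes a small header ("blob <size>" followed by a NUL byte) plus the content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


empty_blob = git_blob_id(b"")
hello_blob = git_blob_id(b"hello\n")
```

The results should match what `git hash-object` prints for the same content. Any ref, such as a tag, can point at an ID like this.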
Comment 14 Denis Roy CLA 2021-08-11 14:16:31 EDT
(In reply to Thomas Wolf from comment #12)
> (In reply to Denis Roy from comment #10)
> > Corrupted how?
> 
> How should I know what exactly is corrupted? Once Matthias has analyzed the
> tgz dump of the repo, we might know.

Thank you for explaining the "how", thank you.

Unrelated, I will submit that using an SCM for a "database" is a poor choice without appropriate journaling, rollback capabilities and, last I checked, an absence of admin tools (ie, non-git engineer). 

For sure, we can (and will) improve our replication and backup processes -- but Gerrit's backend has become (IMHO) fragile, overly complex and admin-hostile.
Comment 15 Thomas Wolf CLA 2021-08-11 15:04:02 EDT
(In reply to Denis Roy from comment #14)
> Unrelated, I will submit that using an SCM for a "database" is a poor choice
> without appropriate journaling, rollback capabilities and, last I checked,
> an absence of admin tools (ie, non-git engineer). 

I can sympathise with this.

It does make Gerrit replication simpler since it all is one mechanism, and if a git repo is replicated, the other Gerrit server automatically also has the database and knows it is consistent with the "normal git" part of the git repo. So there's no need to replicate an external relational database separately, which I could imagine comes with its own set of consistency problems. (The other server has got the git repo, but a database replica lags, or similar things.)

I think another take-away from the EF outage for the Gerrit developers is that making such a git database equally resilient against external failures as a proven professional database product is hard, and that Gerrit needs more internal safeguards to deal appropriately with inconsistencies.
Comment 16 Matthias Sohn CLA 2021-08-12 11:06:09 EDT
(In reply to Eclipse Webmaster from comment #9)
> (In reply to Matthias Sohn from comment #6)
>  
> > Could you provide me a copy of the egit repository from the gerrit site ?
> > Then I can check what's broken and try to repair it.
> 
> Ok I've put a tgz'd copy of the repo in your homedir, which you should be
> able to access via sftp to projects-storage.eclipse.org .  If that doesn't
> work let me know and I'll toss it on archive.eclipse.org.
> 
> -M.

I downloaded the copy of the egit repository and injected it into a local gerrit test instance.
I also injected the All-Users repository to enable seeing usernames in my test instance.

After rebuilding the secondary indexes using the reindex command [1] the repository looks consistent. Running "git fsck" didn't report any inconsistencies on git level.
I can open all recent changes from the Gerrit UI.
If you want me to also check other repositories I need a copy of them.

I cannot reproduce the 500 errors Thomas observed on git.eclipse.org.
I think they may have been caused by stale index entries.

The changes 183571,183572, 183575 are lost. They are not present in the git repository.

The error dialog shown for non-existing changes is misleading, it's not signalling that
the respective change is corrupt but that the change does not exist.
I think this error dialog should be improved, I filed [2] to track this.

I suggest you rebuild all secondary indexes of the gerrit site using [1] in order to fix stale index entries which may have been caused by the outage. Note that this requires a downtime.
When running this command ensure you use an appropriate JVM heap size (similar to what you use to run gerrit).

After the summer holiday season I will discuss with the other gerrit maintainers how we can improve tooling to check for problems after such outages and correct them.

[1] https://git.eclipse.org/r/Documentation/pgm-reindex.html
[2] https://bugs.chromium.org/p/gerrit/issues/detail?id=14905
Comment 17 Denis Roy CLA 2021-08-12 13:38:55 EDT
Matthias, many thanks for looking at this.

> I suggest you rebuild all secondary indexes of the gerrit site using [1] in
> order to fix stale index entries which may have been caused by the outage.
> Note that this requires a downtime.

I'm reindexing git.eclipse.org right now using the online reindex. Is that not sufficient?
Comment 18 Matthias Sohn CLA 2021-08-12 16:23:13 EDT
(In reply to Denis Roy from comment #17)
> Matthias, many thanks for looking at this.
> 
> > I suggest you rebuild all secondary indexes of the gerrit site using [1] in
> > order to fix stale index entries which may have been caused by the outage.
> > Note that this requires a downtime.
> 
> I'm reindexing git.eclipse.org right now using the online reindex. Is that
> not sufficient?

yes, online reindexing all indexes using
https://git.eclipse.org/r/Documentation/cmd-index-start.html
with the --force option should work as well
Comment 19 Matthias Sohn CLA 2021-08-13 05:45:34 EDT
(In reply to Matthias Sohn from comment #16)
> (In reply to Eclipse Webmaster from comment #9)
> > (In reply to Matthias Sohn from comment #6)
...
> The changes 183571,183572, 183575 are lost. They are not present in the git
> repository.

I didn't read the URLs carefully enough, actually these are not egit changes but from another repository.

The status of the egit changes mentioned by Thomas in my test environment:

183574: lost, doesn't exist anymore
183552: looks consistent and I can open it in my test instance
183717: looks consistent and I can open it in my test instance

on git.eclipse.org/r after rebuilding all indexes:

183574: lost, doesn't exist anymore
183552: opening https://git.eclipse.org/r/c/egit/egit/+/183552 still raises an internal server error, can you check the error_log and provide the corresponding stack trace ?
183717: opening https://git.eclipse.org/r/c/egit/egit/+/183717 still raises an internal server error, can you check the error_log and provide the corresponding stack trace ?
Comment 20 Denis Roy CLA 2021-08-13 09:53:13 EDT
(In reply to Matthias Sohn from comment #19)
> (In reply to Matthias Sohn from comment #16)
> > (In reply to Eclipse Webmaster from comment #9)
> > > (In reply to Matthias Sohn from comment #6)
> ...
> > The changes 183571,183572, 183575 are lost. They are not present in the git
> > repository.
> 
> I didn't read the URLs carefully enough, actually these are not egit changes
> but from another repository.
> 
> The status of the egit changes mentioned by Thomas in my test environment:
> 
> 183574: lost, doesn't exist anymore
> 183552: looks consistent and I can open it in my test instance
> 183717: looks consistent and I can open it in my test instance
> 
> on git.eclipse.org/r after rebuilding all indexes:
> 
> 183574: lost, doesn't exist anymore


> 183552: opening https://git.eclipse.org/r/c/egit/egit/+/183552 still raises
> an internal server error, can you check the error_log and provide the
> corresponding stack trace ?

[2021-08-13T09:51:12.580-0400] [HTTP GET /r/changes/egit%2Fegit~183552/revisions/3/related (droy from [snip])] ERROR com.google.gerrit.httpd.restapi.RestApiServlet : Error in GET /r/changes/egit%2Fegit~183552/revisions/3/related: missing_object [CONTEXT project="egit/egit" ]
org.eclipse.jgit.errors.MissingObjectException: Missing unknown f78893c885328e7a7a9a7282ac5ed09e5578ef9f



> 183717: opening https://git.eclipse.org/r/c/egit/egit/+/183717 still raises
> an internal server error, can you check the error_log and provide the
> corresponding stack trace ?

[2021-08-13T09:52:02.088-0400] [HTTP GET /r/changes/?O=a&q=status%3Aopen%20conflicts%3A183717 (droy from [snip])] WARN  com.google.gerrit.server.query.change.ConflictsPredicate : (Re-logging with stack trace) Merge failure checking conflicts of change 183574 in egit/egit (f78893c885328e7a7a9a7282ac5ed09e5578ef9f): Missing unknown f78893c885328e7a7a9a7282ac5ed09e5578ef9f [CONTEXT ratelimit_period="1 MINUTES [skipped: 1]" ]
org.eclipse.jgit.errors.MissingObjectException: Missing unknown f78893c885328e7a7a9a7282ac5ed09e5578ef9f
        at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:136)


Seems they are both missing the same object?
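
A quick way to confirm this from the two log excerpts above is to extract the object IDs mechanically. The Python sketch below is only an illustration: the regular expression is an assumption based on the message format seen in these two excerpts, and the variable and function names are made up.

```python
import re

# Matches the JGit exception line as it appears in the excerpts above.
MISSING_RE = re.compile(r"MissingObjectException: Missing unknown ([0-9a-f]{40})")


def missing_object_ids(log_text: str) -> set:
    """Return the set of object IDs reported as missing in a log excerpt."""
    return set(MISSING_RE.findall(log_text))


# Exception lines copied from the two excerpts above.
log_183552 = (
    "org.eclipse.jgit.errors.MissingObjectException: "
    "Missing unknown f78893c885328e7a7a9a7282ac5ed09e5578ef9f"
)
log_183717 = (
    "org.eclipse.jgit.errors.MissingObjectException: "
    "Missing unknown f78893c885328e7a7a9a7282ac5ed09e5578ef9f"
)

same_object = missing_object_ids(log_183552) == missing_object_ids(log_183717)
```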
Comment 21 Matthias Sohn CLA 2021-08-13 11:09:51 EDT
(In reply to Denis Roy from comment #20)
> (In reply to Matthias Sohn from comment #19)
> > (In reply to Matthias Sohn from comment #16)
> > > (In reply to Eclipse Webmaster from comment #9)
> > > > (In reply to Matthias Sohn from comment #6)
> > ...
> > > The changes 183571,183572, 183575 are lost. They are not present in the git
> > > repository.
> > 
> > I didn't read the URLs carefully enough, actually these are not egit changes
> > but from another repository.
> > 
> > The status of the egit changes mentioned by Thomas in my test environment:
> > 
> > 183574: lost, doesn't exist anymore
> > 183552: looks consistent and I can open it in my test instance
> > 183717: looks consistent and I can open it in my test instance
> > 
> > on git.eclipse.org/r after rebuilding all indexes:
> > 
> > 183574: lost, doesn't exist anymore
> 
> 
> > 183552: opening https://git.eclipse.org/r/c/egit/egit/+/183552 still raises
> > an internal server error, can you check the error_log and provide the
> > corresponding stack trace ?
> 
> [2021-08-13T09:51:12.580-0400] [HTTP GET
> /r/changes/egit%2Fegit~183552/revisions/3/related (droy from [snip])] ERROR
> com.google.gerrit.httpd.restapi.RestApiServlet : Error in GET
> /r/changes/egit%2Fegit~183552/revisions/3/related: missing_object [CONTEXT
> project="egit/egit" ]
> org.eclipse.jgit.errors.MissingObjectException: Missing unknown
> f78893c885328e7a7a9a7282ac5ed09e5578ef9f
> 
> 
> 
> > 183717: opening https://git.eclipse.org/r/c/egit/egit/+/183717 still raises
> > an internal server error, can you check the error_log and provide the
> > corresponding stack trace ?
> 
> [2021-08-13T09:52:02.088-0400] [HTTP GET
> /r/changes/?O=a&q=status%3Aopen%20conflicts%3A183717 (droy from [snip])]
> WARN  com.google.gerrit.server.query.change.ConflictsPredicate : (Re-logging
> with stack trace) Merge failure checking conflicts of change 183574 in
> egit/egit (f78893c885328e7a7a9a7282ac5ed09e5578ef9f): Missing unknown
> f78893c885328e7a7a9a7282ac5ed09e5578ef9f [CONTEXT ratelimit_period="1
> MINUTES [skipped: 1]" ]
> org.eclipse.jgit.errors.MissingObjectException: Missing unknown
> f78893c885328e7a7a9a7282ac5ed09e5578ef9f
>         at
> org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:
> 136)
> 
> 
> Seems they are both missing the same object?

yes, these changes were in a series of changes depending on each other and one of them was lost due to the outage.

If an open change depends on other open changes Gerrit shows them as a list of related changes. The REST request trying to load these related changes fails for the change which was lost.

I guess Thomas still has the lost change locally and can rewrite changes 183552 and 183717 locally using rebase --interactive in order to change their Change-Id and push the series again. This will create new changes replacing changes 183552 and 183717. Then the old changes 183552 and 183717 can be abandoned.
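
As an illustration of what "changing the Change-Id" means for a single commit message, the Python sketch below replaces the Change-Id footer with a fresh one. It is a hypothetical helper, not part of Gerrit or JGit. It assumes the usual footer format of `I` followed by 40 hex digits and uses a random value, whereas Gerrit's commit-msg hook derives the ID differently. Removing the footer during the rebase and letting the hook regenerate it would be another option.

```python
import re
import secrets

# A Change-Id footer line: "Change-Id: I" followed by 40 hex digits.
CHANGE_ID_RE = re.compile(r"^Change-Id: I[0-9a-f]{40}$", re.MULTILINE)


def with_new_change_id(message: str) -> str:
    """Return the commit message with its Change-Id footer replaced by a new random one.

    Messages without a Change-Id footer are returned unchanged.
    """
    new_line = "Change-Id: I" + secrets.token_hex(20)  # 20 random bytes = 40 hex digits
    return CHANGE_ID_RE.sub(new_line, message, count=1)
```

Because the pushed commits then carry Change-Ids unknown to Gerrit, Gerrit creates new changes rather than new patch sets on the broken ones.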
Comment 22 Denis Roy CLA 2021-08-13 14:52:55 EDT
> I guess Thomas still has the lost change locally and can rewrite changes
> 183552 and 183717 locally using rebase --interactive in order to change
> their Change-Id and push the series again. This will create new changes
> replacing changes 183552 and 183717. Then the old changes 183552 and 183717
> can be abandoned.

Thomas, please let us know if you need anything.
Comment 23 Thomas Wolf CLA 2021-08-15 15:34:37 EDT
(In reply to Matthias Sohn from comment #21)
> I guess Thomas still has the lost change locally and can rewrite changes
> 183552 and 183717 locally using rebase --interactive in order to change
> their Change-Id and push the series again. This will create new changes
> replacing changes 183552 and 183717. Then the old changes 183552 and 183717
> can be abandoned.

Did so, and apparently abandoning the two ancestors of 183574 also finally got rid of that broken listing for 183574 itself.

(In reply to Denis Roy from comment #22)
> Thomas, please let us know if you need anything.

No, looks OK now. Thanks everybody.
Comment 24 Thomas Wolf CLA 2021-08-15 15:35:19 EDT
*** Bug 575252 has been marked as a duplicate of this bug. ***