Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 266243 - [transport] Need ability to restart failed download of repo index files
Summary: [transport] Need ability to restart failed download of repo index files
Status: RESOLVED FIXED
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: p2 (show other bugs)
Version: 3.5   Edit
Hardware: PC Mac OS X - Carbon (unsup.)
: P3 major (vote)
Target Milestone: 3.5 M7   Edit
Assignee: Henrik Lindberg CLA
QA Contact:
URL:
Whiteboard:
Keywords:
: 266272 (view as bug list)
Depends on: 216278 270667
Blocks:
  Show dependency tree
 
Reported: 2009-02-25 20:36 EST by Pascal Rapicault CLA
Modified: 2016-08-17 15:58 EDT (History)
6 users (show)

See Also:


Attachments
Implementation of resumed download of content.jar/content.xml (38.29 KB, patch)
2009-04-03 16:25 EDT, Henrik Lindberg CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Pascal Rapicault CLA 2009-02-25 20:36:30 EST
Eclipse 3.4.2 is available today and this represents a great opportunity to try out p2 when the servers are faced with incredible loads resembling the final release.
The first thing that this makes me realize is that we definitely need to have the ability to restart failed download of the repo index files (content.jar / artifacts.jar) so that people can despite time outs and other problems get the main index files required for p2 to show content to install to the user.
Comment 1 Scott Lewis CLA 2009-02-25 21:32:28 EST
This seems more like an enhancement to me than a major bug...as it seems very late in the game to add on new features that until now haven't been on our docket.

Also...this bug isn't just for transport, as since we don't have any storage of partial results, the storage of such results (either in a single session or across sessions) would have to be done at the download manager level.   We do, however, have immediate support for downloading partial files.




Comment 2 Pascal Rapicault CLA 2009-02-25 22:03:36 EST
This is not a big deal for the release of 3.5 itself, but it will be a big deal if we want 3.5 users to be able to successfully update to 3.5.1 and 3.5.2 and maybe even to 3.6.

The content.jar and artifacts.jar are not downloaded through the download manager but are obtained as regular files. Therefore we have much more flexibility with what we can do.
Comment 3 Pascal Rapicault CLA 2009-02-26 10:28:12 EST
*** Bug 266272 has been marked as a duplicate of this bug. ***
Comment 4 Henrik Lindberg CLA 2009-03-01 17:45:51 EST
I looked into this, and I need a bit of help to make sure I am in the right place...

It looks like SimpleMetadataRepositoryFactory is asked for a local file (getLocalFile) which is done by obtaining a CacheManager, that in turn calls ECFMetaDataTransport to download the file if it is not present.

It looks like there is opportunity to make the CacheManager preserve the state of a partial download (it now deletes a partial download at the end).

My idea is to move the partial to a "partial" directory (to prevent someone from mistaking the partial file for the real thing) - and then modify the cace manager to check if there is a partial download on a subsequent request. As it uses ECFMetaDataTransport to perform the download, a new method is needed to perform a partial download. 

Does this seam like a reasonable approach? 



Comment 5 Pascal Rapicault CLA 2009-03-03 22:59:44 EST
It looks like it is the good place. Can we rely on the number bytes downloaded to restart?
Could there be a case where the restarted download could result in an invalid file? If this is the case then we need a way to flush the cache as well when the repo failed loading.
Comment 6 John Arthorne CLA 2009-03-03 23:15:38 EST
The cache flushing on failure was added recently... in the past if there was a failure during download and we had a partial result with a new timestamp, we never went back to the server for the new file. We certainly have to make sure we discard/move the file on failure.
Comment 7 Henrik Lindberg CLA 2009-03-04 07:33:39 EST
(In reply to comment #5)
> It looks like it is the good place. Can we rely on the number bytes downloaded
> to restart?
Where you thinking of some other way of doing it? 
Are checksums available somewhere? If available up-front, then I thought that the partial, and the expected checksum should be saved. When resuming, a check is made if the checksum is unchanged (this to make sure the server is not delivering different content with same timestamp).

> Could there be a case where the restarted download could result in an invalid
> file? If this is the case then we need a way to flush the cache as well when
> the repo failed loading.
> 
Yes, certainly - the outcome of the resume is naturally that a new resume may occur, that it fails completely (missing) or that the downloaded bits have the wrong size or checksum, and if so must be discarded and download start over again.

The worst case is probably when a resume encounters a "false positive" error due to high server load or malfunction, or being on a poort network - exactly the case where the resume is supposed to help. What should be done in this case? I can imagine saving a bit more information about the partial download - timestamp of last attempt - number of bytes obtained per attempt - status of attempts etc. Then have some heuristics that determines when an error is really an error (3 attempts and gone more than 24 hours?).

My idea was to start with simplest possible.
Comment 8 Pascal Rapicault CLA 2009-03-04 22:20:59 EST
>> It looks like it is the good place. Can we rely on the number bytes downloaded to restart?
>Where you thinking of some other way of doing it? 
  No

>Are checksums available somewhere? 
  No, but this looks like a good idea, maybe past 3.5 unless absolutely critical for this.

I think Henrich had encountered servers upon restart the server was sending too many / too few bytes. If we can't detect it then this may jeopardize what we can do

Again to me it would be a worst story than the one we have today, if we could have a file that appears to be complete and up-to-date but that can not parse.
Comment 9 Henrich Kraemer CLA 2009-03-12 15:36:15 EDT
(In reply to comment #8)
> I think Henrich had encountered servers upon restart the server was sending too
> many / too few bytes. If we can't detect it then this may jeopardize what we
> can do

We have encountered cases where reading over the socket (from an HTTP server) would delivered fewer bytes than known to be on the server without an exception thrown (dubbed premature end-of-file). Our assumption is that this is a server side or network connectivity issue. 
However I do not know whether this is in any way related to partial gets (restart).
Comment 10 Henrik Lindberg CLA 2009-03-12 16:42:18 EDT
(In reply to comment #9)
> We have encountered cases where reading over the socket (from an HTTP server)
> would delivered fewer bytes than known to be on the server without an exception
> thrown (dubbed premature end-of-file). Our assumption is that this is a server
> side or network connectivity issue. 
> However I do not know whether this is in any way related to partial gets
> (restart).
> 
Not really, if it happens on a regular transfer, it can also happen on a partial transfer.
In the end there is only two pieces of information that can help in determining if the transfer was good: authorative information on size of file, and checksum.

So, how is the premature-end-of-file detected? By a reported size (Content-Length) that is different than what is actually read? 
Comment 11 Henrich Kraemer CLA 2009-03-12 17:24:58 EDT
(In reply to comment #10)
<stuff deleted>
> In the end there is only two pieces of information that can help in determining
> if the transfer was good: authorative information on size of file, and
> checksum.
> 
> So, how is the premature-end-of-file detected? By a reported size
> (Content-Length) that is different than what is actually read? 
> 

Yes I agree that this requires authoritative validation info; in our app these are expected size, and checksum (MD5).

Case 1: expected size and checksum is known: If size does not agree with content-length (or for partial gets a content-range) download fails before transferring payload bytes. Otherwise file is downloaded. In case a premature EOF is encountered based on the expected size, then download is repeated with partial get for the remainder of the file.
When entire file was downloaded with the exact number of bytes expected the checksum is validated.


Case 2: No authoritative info is known. Then content-length (or content-range) is leveraged as the expected size. 

Comment 12 Henrik Lindberg CLA 2009-03-31 20:46:03 EDT
I have almost finished the implementation of this for p2 meta data repository. It would be great if the patch in 270667 could be committed, as I am adding more messages in the implementation of "resume download".
Comment 13 Henrik Lindberg CLA 2009-04-03 16:25:52 EDT
Created attachment 130884 [details]
Implementation of resumed download of content.jar/content.xml

This patch adds resume of download of content.jar, and content.xml to the CacheManager.
Comment 14 Henrik Lindberg CLA 2009-04-03 16:27:44 EDT
assigning back to inbox so someone can review and commit.

Comment 15 Scott Lewis CLA 2009-04-03 16:58:38 EDT
Henrik, I wanted to make sure that you saw the small API change in IFileTransferInfo as described in bug #270516...as this might be germane to your patch for this bug.
Comment 16 Henrik Lindberg CLA 2009-04-03 19:48:09 EDT
(In reply to comment #15)
> Henrik, I wanted to make sure that you saw the small API change in
> IFileTransferInfo as described in bug #270516...as this might be germane to
> your patch for this bug.
> 
Thanks Scott. I was aware of that, and it does not affect. Checks are already made with <= 0.
Comment 17 Henrik Lindberg CLA 2009-04-14 09:14:20 EDT
As I now have p2 committer status - should I go ahead and commit this change to HEAD?
Pascal mentioned in the p2 call, that it would be a good idea to be able to turn off the resume feature via a property - I can add a separate issue for that.
Comment 18 Pascal Rapicault CLA 2009-04-14 09:21:22 EDT
Release it but with the support for the property. Also please make sure to create a test case for this. Thx
John please take a look at the code once it is released (john is away today Tuesday)
Comment 19 Henrik Lindberg CLA 2009-04-14 11:05:37 EDT
The property: org.eclipse.equinox.p2.metadata.repository.resumable can be set to "false" to turn off the resume download feature.

Added tests for both cases (resume and blocked resume).
Comment 20 Henrik Lindberg CLA 2009-04-14 11:11:45 EDT
Fixed in HEAD.
Comment 21 John Arthorne CLA 2009-04-15 00:05:57 EDT
I released some cleanup:

 - I removed several new message keys that were only for logged messages. Messages that only go to the log file are not translated in Eclipse. The end user's language is not necessarily the language of the support/development staff that will be receiving the log. 
 - The createCache method had become quite long and hard to read (>130 lines). I factored it into two separate methods (createCache and updateCache).
 - Refactored computation of cache directory into a helper method (getCacheDirectory)
 - Deleted some lines of code that had been commented out in getCache method
Comment 22 Eclipse Genie CLA 2016-05-28 05:23:25 EDT
New Gerrit change created: https://git.eclipse.org/r/73862