Summary: [prov] Add multi-threaded download and mirror selection into p2's artifact retrieval
Product: [Eclipse Project] Equinox
Component: p2
Status: RESOLVED FIXED
Severity: normal
Priority: P3
Version: 3.4
Target Milestone: 3.4 M6
Hardware: All
OS: All
Reporter: Timothy Webb <tim-at-eclipse>
Assignee: John Arthorne <john.arthorne>
QA Contact:
CC: dennis.vaughn, eclipse-bugzilla, jeffmcaffer, pascal, slewis, Stefan.Liebig
Whiteboard:
We will be looking into integrating this as soon as we are done with M3.

*** Bug 194011 has been marked as a duplicate of this bug. ***

The patch still applies nicely with the exception of the build.properties and manifest.mf, but this is not a big deal.

(In reply to comment #3)
> The patch still applies nicely with the exception of the build.properties and
> manifest.mf, but this is not a big deal.

I applied this patch to a newly downloaded copy of p2 (Nov 20) and get a number of rather strange compile errors. Is there some trick to applying this patch at this point?

Created attachment 83572 [details]
Patch updated to HEAD
This patch should be able to apply nicely on HEAD. There are still 2 compile errors around Multistatus that I did not fix, but this is not really important.
Let me know if you have any problems.
Any thoughts so far on how to provide the hints from the artifact repositories?

As I mentioned in the post, there are a few different approaches I could take. The easiest would be to have a custom interface called something like IArtifactRepositoryForThreadedStrategy that would be used by the ThreadedDownloadStrategy. We would then switch the SimpleArtifactRepository to also support that interface. One of the calls on the interface would still need to indicate whether the threaded download strategy could be used. Also, as mentioned in the first message on this bug report, there are additional APIs that we will need for the threaded support. Please advise.

Another possibility is to create a richer set of IArtifactRequest subtypes, such as ChunkMirrorRequest for mirroring a chunk of a file. Then we would just need a generic method like IArtifactRepository.canPerform(IArtifactRequest) to introspect a repository to see if it supports that form of request.

Part of the issue is in the reporting, in that the download strategy uses ongoing information from the download to make subsequent determinations. This involves having information exposed from ECF and then providing control back to the download strategy. Using different ArtifactDescriptor types could certainly help, but I'm not really sure how to relate that to the ongoing progress that's needed in the strategy.

With the ProcessingSteps, the download strategy may also need to address:

- memory requirements/availability (the jbdiff based processing steps are memory intensive)
- processor speed
- an artifact descriptor (having a delta based processing step in it) may refuse to run because a prerequisite (a special previous version of the artifact) is not available

The time and space requirements must be given within the artifact descriptor (processing steps) so that a download strategy can take them into account. The prerequisite of a previous version is currently part of the processing step data property and is interpreted by the concrete processing step. This requires the processing steps to be instantiated and initialized. The initialization phase detects whether the required version is available.

I would like to clarify where we are with this enhancement...for me and for others that don't have the other Equinox communication channels (e.g. f2f). Has the patch attachment 83572 [details] been committed to HEAD? Is it going in before M4 (i.e. this week)?

Tim, in comment #6 you imply that perhaps more/other file transfer meta-data is needed from ECF for feedback to the download strategy. I would like to know what other meta-data about the actual file transfer (or partial transfer) is desired/required, so that I can get access to those data in the ECF API. Presently available is the obvious: bytes downloaded, expected total size (and therefore percentage), the original range specification (if not the entire file), whether the transfer is pausable (i.e. if the protocol/impl supports it), and the original URL. Are there others that we/you are specifically interested in? (e.g. about time...but that could also be calculated within the download strategy, I expect). As John says in comment #7, it would seem to make sense to have some introspection API on the artifact repository. Perhaps that is already present...if not, will it be added?
(Sorry...I've not been able to have all the source in my workspace over the past few weeks.)

As per Stefan's comment #9, it seems that we will need some means to convey both download hints and processing hints...my question is, do we have a way of providing such hints in the artifact repository/artifact descriptor? My apologies if these questions seem sort of 'out of it'. Due to other commitments, I haven't been able to keep up with the day-to-day developments of the p2 APIs recently, but I expect that to be changing.

There are properties (a map) within the ArtifactDescriptor that could hold such information. These properties are already used for the download size and the result size. The properties (time/space requirements) I mentioned in comment 9 are not yet stored. The optimizers are responsible for storing these properties.

Circling back a bit, here are a couple of the assumptions that were made when we decided to have a download strategy:

- artifact repositories should be simple
- there is one artifact repository per traditional update site mirror
- different artifact repositories need different kinds of download strategies

The sticking point in this implementation is that different artifact repositories and underlying transports have very different needs in the strategy. The implementation of a smart strategy is really tied to the underlying transports, such as needing particular status information to be provided back from ECF for ECF-based repositories.

I would like to propose that the cleaner way to address this issue is to move the complexity of the download strategy behind the artifact repository implementation, where it can reside closer to the underlying transports. In addition, I propose that, for artifact repositories that can retrieve the same artifact from multiple remote replicas (traditional eclipse update site mirrors), the list of mirrors and the selection of a mirror be isolated behind the artifact repository.

Restructuring the brains behind the download this way has the following benefits:

- different types of artifact repositories can be optimized for the particular style of artifact binary retrieval used by the supported transports (CDRom, HTTP, BitTorrent, ...)
- constructs like multi-threading are closer to the transports, such that if something like ECF already starts jobs, you would need to know that to not burn duplicate threads as you do with today's APIs

In regards to some of the other comments: This change makes comment #9 more straightforward in some aspects, but the list of processing steps would still need to be understood. We may want to have some base cost-analysis logic that can be shared across repository implementations. Regarding comment #10 and comment #11, we would still need a way to encapsulate some information, but not necessarily standardized across all systems.

One final question I have is (assuming you buy into this logic) whether the knowledge of such activities as automatic mirror selection should occur in the transport or in the artifact repository.

If you don't agree with this thought process, we can continue on using the artifact descriptor's ability to store extra data as the trigger for how we handle processing in the download strategy, as suggested in comment #11, though we will still need to determine what sort of API model re comment #8 we are going to implement that allows different sorts of progress information based on the type of strategies that need to monitor information such as ongoing download rates, etc.
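To make the IArtifactRepository.canPerform(IArtifactRequest) introspection idea raised earlier in this discussion concrete, here is a rough Java sketch of what "richer IArtifactRequest subtypes plus a canPerform check" might look like. None of this is the actual p2 API; the names ChunkMirrorRequest and canPerform come from the comment above, and everything else (method shapes, the chunk accessors) is assumed for illustration.

    import org.eclipse.core.runtime.IProgressMonitor;

    // Hypothetical sketch only -- not the real p2 interfaces.
    public interface IArtifactRequest {
        void perform(IProgressMonitor monitor);
    }

    // A richer request subtype: mirror a single byte range ("chunk") of an artifact.
    public interface ChunkMirrorRequest extends IArtifactRequest {
        long getStartOffset();
        long getLength();
    }

    public interface IArtifactRepository {
        // Introspection hook: a download strategy asks whether the repository
        // can honour a given request type before scheduling it.
        boolean canPerform(IArtifactRequest request);

        void getArtifacts(IArtifactRequest[] requests, IProgressMonitor monitor);
    }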
Hi Tim,

(In reply to comment #12)
> Circling back a bit, here are a couple of the assumptions that were made when
> we decided to have a download strategy:
>
> - artifact repositories should be simple
> - there is one artifact repository per traditional update site mirror
> - different artifact repositories need different kinds of download strategies
>
> The sticking point in this implementation is that different artifact
> repositories and underlying transports have very different needs in the
> strategy. The implementation of a smart strategy is really tied to the
> underlying transports, such as needing particular status information to be
> provided back from ECF for ECF-based repositories.

ECF's IFileID is based upon/created by a URL. So I think that using a URL to 'reason' about a download strategy (e.g. url.getProtocol()) would make the most sense...unless I'm missing something. Of course there are limits to the ability to reason based upon just a protocol spec (e.g. "http" or "scp")...for example, whether a file transfer is 'pausable' or not is only known after an initial request to the target URL/fileID has been issued, since http 1.1 servers support pausing and http 1.0 servers do not. With ECF, once a request has been issued, the IIncomingFileTransfer instance can be consulted for whether or not it is pausable (by whether or not it returns a non-null instance from iift.getAdapter(IFileTransferPauseable.class)). So in any event, I think there are things in the download strategy that can be deduced via URL.getProtocol(), but there are other things that need a real transfer instance to measure/record (e.g. whether pausable...throughput...etc.).

> I would like to propose that the cleaner way to address this issue is to move
> the complexity of the download strategy behind the artifact repository
> implementation, where it can reside closer to the underlying transports.

So wouldn't this make it harder for artifact repositories to use/reuse download strategies? For example, let's say we produce some download strategies for http/https...assuming the ability to do (1.1) partial downloads and pause/resume. Wouldn't it be better to enable repository builders to reuse these strategies? (That's probably something repository builders will want to do; assuming we do a good job on some reasonable http/https strategies, people won't want to reimplement such things.) Would this be possible with the download strategies 'behind' the artifact repository API (i.e. part of the artifact repo impl)?

> In addition, I propose that, for artifact repositories that can retrieve the
> same artifact from multiple remote replicas (traditional eclipse update site
> mirrors), the list of mirrors and the selection of a mirror be isolated behind
> the artifact repository.
>
> Restructuring the brains behind the download this way has the following
> benefits:
>
> - different types of artifact repositories can be optimized for the particular
> style of artifact binary retrieval used by the supported transports (CDRom,
> HTTP, BitTorrent, ...)
> - constructs like multi-threading are closer to the transports, such that if
> something like ECF already starts jobs, you would need to know that to not
> burn duplicate threads as you do with today's APIs

Although I see the optimization advantages, it seems like it would make re-use much harder. Or maybe I'm not understanding.
> In regards to some of the other comments: This change makes comment #9 more
> straightforward in some aspects, but the list of processing steps would still
> need to be understood. We may want to have some base cost-analysis logic that
> can be shared across repository implementations.

Yes...I think cost analysis (recording measurements of throughput, etc.), as well as policies specific to protocols (e.g. break http transfers of more than 5 megabytes into X parts to download in parallel...unless there are other http transfers going on at the same time, etc.), would want to be shared across repository implementations as well.

> Regarding comment #10 and comment #11, we would still need a way to
> encapsulate some information, but not necessarily standardized across all
> systems.
>
> One final question I have is (assuming you buy into this logic) whether the
> knowledge of such activities as automatic mirror selection should occur in the
> transport or in the artifact repository.

I'm not a big fan of having the mirror selection logic be in the transport (i.e. in the ECF provider)...for reasons of separation of concerns, but also because it adds to the complexity of a filetransfer API like ECF...which I'm willing to do if people require it...but it seems more appropriate for something like a download strategy.

> If you don't agree with this thought process, we can continue on using the
> artifact descriptor's ability to store extra data as the trigger for how we
> handle processing in the download strategy, as suggested in comment #11,
> though we will still need to determine what sort of API model re comment #8 we
> are going to implement that allows different sorts of progress information
> based on the type of strategies that need to monitor information such as
> ongoing download rates, etc.

So I should take this opportunity to ask...what sort of API were you expecting to use to get runtime info, e.g. throughput, etc.? Currently, the IIncomingFileTransfer class has:

    long getBytesReceived();
    double getPercentComplete();

But this interface (IIncomingFileTransfer) implements IAdaptable, so we can easily add new interfaces (which only some providers may implement) that will allow queries for other information. But another way would be for the download strategy to keep track of local system time and occasionally ask the IIncomingFileTransfer instance for getBytesReceived info...and then compute the average download rate as desired. Then this info could be aggregated by URL (or by protocol, host, etc.) and/or by URL and range specification (if the download was broken into multiple parts). But if a specific API is desired for ECF (e.g. IIncomingFileTransferDetails), let me know what you think are the right method calls and we can/will add it.

Scott, regarding comment #13, with last things first... The equivalent of the algorithm we were doing in Maynstall can definitely be built using the system time and then doing polling. We had done the algorithm by having a listener get notified from the download thread, which could then notify the download to complete. In the ECF model, we can request a cancel of the download and then start up the transfer from another location -- so I don't believe we will need an IIncomingFileTransferDetails interface via adaptable. The only benefit to doing so is that it could be a bit cleaner for operations such as accessing the download rate; however, if not all implementations will support that, we might as well just handle that within the strategy.
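The getBytesReceived()/system-time suggestion and the polling approach described in the reply amount to the same bookkeeping: sample the transfer periodically and derive a rate inside the strategy. A minimal sketch, assuming ECF's IIncomingFileTransfer (only the getBytesReceived() method listed above is used, and the org.eclipse.ecf.filetransfer package name is assumed); the DownloadRateTracker class itself is hypothetical, not part of ECF or p2.

    import org.eclipse.ecf.filetransfer.IIncomingFileTransfer;

    // Hypothetical helper -- polls an in-progress transfer and derives an
    // average download rate, as suggested in the discussion above.
    public class DownloadRateTracker {
        private final IIncomingFileTransfer transfer;
        private long lastBytes;
        private long lastTime;

        public DownloadRateTracker(IIncomingFileTransfer transfer) {
            this.transfer = transfer;
            this.lastBytes = transfer.getBytesReceived();
            this.lastTime = System.currentTimeMillis();
        }

        // Returns bytes/second observed since the previous call, or -1 if no
        // measurable time has elapsed yet.
        public synchronized long sampleBytesPerSecond() {
            long nowBytes = transfer.getBytesReceived();
            long nowTime = System.currentTimeMillis();
            long elapsed = nowTime - lastTime;
            if (elapsed <= 0)
                return -1;
            long rate = ((nowBytes - lastBytes) * 1000L) / elapsed;
            lastBytes = nowBytes;
            lastTime = nowTime;
            return rate;
        }
    }

Samples like these could then be aggregated by URL, protocol, or host, and used by the strategy to decide when to cancel and restart a transfer from another mirror.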
So your main concern seems to be around the re-use of the download strategy by different users. The question would be what is the best structure to provide a rich capability without requiring each repository implementor to implement an intelligent strategy. One option might be to provide an abstract implementation that implementors could extend, or provide an extra interface and then delegate operations to the download strategy. If we can generally buy into the artifact repository doing the intelligence as opposed to something sitting on top, probably the easiest is to try to put together a simple implementation of an intelligent abstract repository and then abstract out either a utility download strategy class or an abstract base class.

The main disadvantage that I see with having the download strategy inside the artifact repository is that a consumer of p2 wouldn't have the option of dropping in a different download strategy. Scott, do you see it making sense to expose information such as download transfer status and bytes processed in the artifact repository API such that the download strategy can sit outside? If so, we would need all of the operations such as cancel, etc. to also be exposed at this level. Alternatively, we could make the DownloadStrategyService have APIs that are called from the ArtifactRepository instead of from the current download manager, as the initial implementation approach was taking. Truthfully, once we can find the right way to connect these boxes, I have a feeling the implementation will go very quickly...and the most direct solution I see, given the need to have knowledge of the transports, is to work this code into the artifact repository...

OK, based on our discussions a couple weeks ago, we decided to work the notion of multi-threaded download into the SimpleArtifactRepository instead of having a separate download strategy. To that end, I've created a new patch that initially adds minimal multi-threading support to SimpleArtifactRepository. The second phase of the implementation will be leveraging a mirror service that provides alternate download URLs for the main URL registered as part of the artifact repository. The final phase, if time permits, will be intelligently switching between mirrors based on the performance of each mirror dynamically while a transfer is in progress.

Created attachment 88433 [details]
Patch to SimpleArtifactRepository (phase 1)
Created attachment 88919 [details]
Threaded downloads and mirror lookups
Updated patch that supplies a property on the artifact repository to control the number of threads that should be used when downloading multiple artifacts. Also checks to see if the artifact repository is local, and if so doesn't use threaded downloads.
Patch also exposes IArtifactTransport in the API allowing for plugging in of alternate transport types. To support mirroring, we would need to use a different implementation of ECFTransport (or one that wraps on top). Ultimately I believe we should have the transport factory exposed as an optional service allowing for different implementations of the download portion.
The transport factory allows registration of IDownloadMirrorsDelegates that can determine a mirror to use for a particular download. Upon completion of a download (success or failure), the download mirror delegate gets the status reported back, which could be used to aid in determining which mirror to use on subsequent requests.
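Since the patch itself is only referenced here, the following is a guess at the shape of the pluggable pieces described above (the transport and the mirror delegate); the method names and signatures are assumptions for illustration, not the contents of attachment 88919.

    import java.io.OutputStream;
    import java.net.URL;
    import org.eclipse.core.runtime.IProgressMonitor;
    import org.eclipse.core.runtime.IStatus;

    // Guessed shape of a pluggable transport that the repository calls
    // instead of talking to ECF directly.
    public interface IArtifactTransport {
        IStatus download(URL toDownload, OutputStream target, IProgressMonitor monitor);
    }

    // Guessed shape of a delegate consulted by the transport factory to pick a
    // mirror for a download, and informed of the result so it can adjust
    // future selections.
    public interface IDownloadMirrorsDelegate {
        URL selectMirror(URL original);
        void downloadCompleted(URL mirrorUsed, IStatus result);
    }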
I'll take this to review/release once HEAD is reopened for business after M5.

I have applied the phase 1 patch with some minor changes. I reviewed the second patch with mirror support, but I'm unsure of the direction. It seems overly complicated to introduce a transport interface and factories, a new mirror transport, etc., just to implement mirrors. Mirroring just involves replacing the "location" where the download comes from - it doesn't require a new transport implementation. I'm picturing this as a simple change in SimpleArtifactRepository (location = mirrorService.findMirror(location)).

The reason I split out the Transport was that to implement better logic around retries, mirror selection, etc. you need to have some of that logic within the transport. As opposed to having one single transport that has all of that logic built into it, I was proposing splitting out the transport to allow different implementations. We can start with the base transport just selecting a mirror from the list at random, but the benefit of having the transport be pluggable is that we can develop alternate implementations, and when they are stable (or depending on a given deployment model), those transports can then be dropped in. Separately, in the original p2 design weren't transports intended to be pluggable?

(In reply to comment #20)
<stuff deleted>
> Separately, in the original p2 design weren't transports intended to be
> pluggable?

Yes, but I think this was using a different notion of 'transport'. WRT file transfer protocols, ECF delivers pluggable transports. Your use of 'transports' also includes retries and mirror selection. I don't have any objection to having a pluggable strategy for those things as well (or even calling it 'transports'), I'm just pointing out that it's a different usage of the term. At least as I understood the requirement for pluggable transports for p2, it was for pluggable protocols (e.g. http, bittorrent, etc)...but in my view pluggability is a pretty generic good...so I suppose it should be applied here too. Maybe it would help if it was called something more specific, like 'RetrievalStrategy'.

> At least as I understood the requirement for pluggable transports for p2, it
> was for pluggable protocols (e.g. http, bittorrent, etc)...
This is what we had always described as pluggable transport.
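For reference, a minimal sketch of the "simple change in SimpleArtifactRepository" suggested above, i.e. swapping the download location for a mirror before handing off to the transport. Only the mirrorService.findMirror(location) idea comes from the comment; the IMirrorService type and the fallback behaviour are assumptions.

    import java.net.URL;

    // Hypothetical mirror lookup service implied by "mirrorService.findMirror(location)".
    interface IMirrorService {
        URL findMirror(URL original);
    }

    // Sketch of the call-site substitution -- not the actual SimpleArtifactRepository code.
    class MirrorSubstitution {
        static URL pickLocation(URL location, IMirrorService mirrorService) {
            if (mirrorService == null)
                return location;                        // no mirror support configured
            URL mirror = mirrorService.findMirror(location);
            return mirror != null ? mirror : location;  // fall back to the base location
        }
    }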
Indeed. I've not looked at the patch in detail, but I recall the direction as having a repo object have (at least logically) multiple locations (one for each mirror). Then whoever is doing the download optimization strategy picks from these. That selection could be implemented somewhere in/behind the repo if needed, or outside it, but logically it is distinct from both the repo and the transport, to keep the concerns separate and indeed support high pluggability.

*** Bug 215930 has been marked as a duplicate of this bug. ***

As discussed in the p2 call this week, in the spirit of getting the simple stuff working I have released some simple mirror support modeled on what we had in update manager:

- In the Generator, the mirrorsURL property from site.xml is added as a property in p2 repositories
- Just before calling the transport to perform a download in SimpleArtifactRepository, check for mirrors, and pick the best mirror if available
- If the mirror fails, revert automatically to the base repository
- Bit rate and failure count are recorded for each mirror, and the mirror list is re-sorted after each download

I need to do some more testing with real repositories that have mirrors, so for now you have to enable it by setting the system property "eclipse.p2.mirrors=true". There are some .options values in org.eclipse.p2.core for tracing mirror selection and mirror sorting. Review and input on the mirror selection algorithm would be appreciated.

I'll throw out one problem I was facing to see if anyone has ideas. Currently each download selects the "best" mirror from the list - the mirror with the least failures and highest Bps rate. I have a feeling this isn't optimal in the face of concurrent downloads. I.e., at the start, every download thread selects the same mirror. My theory is it's better to spread the load of concurrent downloads across multiple mirrors. I was thinking of adding some randomness to the mirror selection - some function with a probability distribution that heavily favours "better" mirrors, but has some chance of selecting other mirrors as well. This wouldn't be optimal in my corporate environment, where I have a full mirror on the LAN that is overwhelmingly faster than any others, but it may achieve better overall throughput in an environment where there are several mirrors available of similar speed.

After further testing, I found the ECF transfer can hang indefinitely if we hit a mirror that is not responsive (bug 219368). Until we have a fix for this I will leave mirroring disabled by default (it requires setting the system property as described above).

The ECF problem has been fixed, and mirroring is now enabled by default.

Removing the contributed keyword since this patch was not released.
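One way to realize the "probability distribution that heavily favours better mirrors" idea above is roulette-wheel selection over per-mirror weights built from the recorded bit rate and failure count. A sketch, assuming a non-empty mirror list; the MirrorInfo fields and the weight formula are illustrative choices, not the algorithm that was released.

    import java.util.List;
    import java.util.Random;

    // Illustration only -- not the released p2 mirror-selection code.
    public class WeightedMirrorSelector {
        public static class MirrorInfo {
            String url;
            long bytesPerSecond;   // observed bit rate for this mirror
            int failureCount;      // failures recorded so far
        }

        private final Random random = new Random();

        // Each mirror gets a weight proportional to its observed rate, reduced
        // by past failures; selection is random but heavily favours fast,
        // reliable mirrors while still occasionally spreading load elsewhere.
        public MirrorInfo select(List<MirrorInfo> mirrors) {
            double total = 0;
            double[] weights = new double[mirrors.size()];
            for (int i = 0; i < mirrors.size(); i++) {
                MirrorInfo m = mirrors.get(i);
                double weight = Math.max(1, m.bytesPerSecond) / (1.0 + m.failureCount);
                weights[i] = weight;
                total += weight;
            }
            double pick = random.nextDouble() * total;
            for (int i = 0; i < weights.length; i++) {
                pick -= weights[i];
                if (pick <= 0)
                    return mirrors.get(i);
            }
            return mirrors.get(mirrors.size() - 1);  // guard against rounding
        }
    }

With weights like these, a LAN mirror that is overwhelmingly faster would still dominate the distribution, while mirrors of similar speed would share the load of concurrent downloads.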
Created attachment 81472 [details]
Patch for IDownloadStrategy

Create the concept of download strategies to be supported by the p2 download manager. The idea is that different strategies will understand how to optimize access to software from different artifact repositories -- for instance, in Maynstall there is an algorithm that optimizes access by multi-threading requests to HTTP repositories. The download strategies should be able to determine which one is better in situations where there are multiple download strategies for a given artifact repository.

Attached is my first attempt to implement the idea of IDownloadStrategy, though I'm stuck a bit on getting the artifact repositories to have an API that will work. In essence, the SimpleDownloadStrategy should be able to work with the current SimpleArtifactRepository. The ThreadedDownloadStrategy will need additional APIs from a given ArtifactRepository, such as the ability to start a download in the middle or to monitor the current download rate of the transfer, to be able to switch to another mirror if the current one is slow. I'm thinking of alternate interfaces like IArtifactRepositoryWith... that would provide additional APIs -- though since for a given get operation you need to potentially use WithDownloadRateReporting along with WithRangeSupport, the API doesn't look very clean. Any suggestions for ways to make the IArtifactRepository API extensible in such a way that more advanced strategies would be able to get the information they need? (Note that the intention is not to have a 1-to-1 pairing of different ArtifactRepository implementations and IDownloadStrategies, though that is certainly one, albeit ugly, route I could take.)
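For readers without attachment 81472, here is a rough sketch of what an IDownloadStrategy contract along the lines described above might look like; the method names and the priority-based selection are guesses at the intent, not the contents of the patch.

    import org.eclipse.core.runtime.IProgressMonitor;
    import org.eclipse.core.runtime.IStatus;

    // Hypothetical shape only -- the real interface is in attachment 81472.
    public interface IDownloadStrategy {
        // Whether this strategy can work against the given repository (e.g. the
        // threaded strategy needs range support and download-rate reporting).
        boolean canHandle(IArtifactRepository repository);

        // Relative ranking so the download manager can pick the best applicable
        // strategy when several strategies can handle the same repository.
        int getPriority(IArtifactRepository repository);

        // Perform the requested downloads against the repository.
        IStatus download(IArtifactRepository repository, IArtifactRequest[] requests,
                IProgressMonitor monitor);
    }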