| Summary: | [api] optimize query synchronization | ||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | z_Archived | Reporter: | Robert Elves <robert.elves> | ||||||||||||||||||||||||
| Component: | Mylyn | Assignee: | Steffen Pingel <steffen.pingel> | ||||||||||||||||||||||||
| Status: | RESOLVED FIXED | QA Contact: | |||||||||||||||||||||||||
| Severity: | enhancement | ||||||||||||||||||||||||||
| Priority: | P2 | CC: | ekuleshov, mik.kersten | ||||||||||||||||||||||||
| Version: | unspecified | ||||||||||||||||||||||||||
| Target Milestone: | 2.0 | ||||||||||||||||||||||||||
| Hardware: | PC | ||||||||||||||||||||||||||
| OS: | Windows XP | ||||||||||||||||||||||||||
| Whiteboard: | |||||||||||||||||||||||||||
| Attachments: |
|
||||||||||||||||||||||||||
|
Description
Robert Elves
Description is completely misleading. JIRA does allow such query. However conversation on bug 161734 was about further optimization in order to avoid downloading the same data multiple times, which is NOT possible with the current getChangedSinceLastSync() contract. As far as I understand the current synchronization process, the following is happening: * One or more queries are performed (either user initiated or automatically) * getChangedSinceLastSync() is invoked for all tasks * The returned tasks are synchronized individually to retrieve full task data That means even if only a single query is synchronized this can cause any task to get updated. Rob, why is that not limited to the tasks returned by the query and only done for all when a full sync is done? Eugene, JIRA currently makes two server round trips for each retrieved query hit making query synchronization unusable. How are we going to fix that? (In reply to comment #2) > As far as I understand the current synchronization process, the following is happening: > > * One or more queries are performed (either user initiated or automatically) > * getChangedSinceLastSync() is invoked for all tasks Only for tasks from affected repositories. > * The returned tasks are synchronized individually to retrieve full task data > > That means even if only a single query is synchronized this can cause any task > to get updated. Rob, why is that not limited to the tasks returned by the query > and only done for all when a full sync is done? I've been suggesting to reverse that completely. So, you can skip query execution, or at least skip synchronization of individual query hits, effectively only updating added/removed tasks, because task data will be picked up and synchronized by "get since last sync..." request. > Eugene, JIRA currently makes two server round trips for each retrieved query hit > making query synchronization unusable. How are we going to fix that? Which ones you you are referring to? Anyways, we can (and should) improve current situation between query sync and "get since last sync..." calls; and maybe narrow "get since last sync..." call (i.e. scan trough tasks passed to "get since last sync" and if they have task tada, request only changes in affected projects). The former would also help Bugzilla sync, which is taking 5..10 minutes for my Mylar query. (In reply to comment #3) > (In reply to comment #2) > > As far as I understand the current synchronization process, the following is > happening: > > > > * One or more queries are performed (either user initiated or automatically) > > * getChangedSinceLastSync() is invoked for all tasks > > Only for tasks from affected repositories. Correct, for example if you've only selected queries over bugs.eclipse.org, after the queries complete they're synchronization, synchronizeChanged is called on bugs.eclipse.org. This job gives the connector the opportunity to synch up those (dev.eclipse.org) tasks that have changed, even those not currently captured by a query. One way to improve this could be ensure only tasks not in a query are checked and doing this check less frequently. Determining which tasks are not in a query is expensive though. I've disabled the synchChanged for manual (user initiated) synch, so this should only happen on background synch now. > I've been suggesting to reverse that completely. So, you can skip query > execution, or at least skip synchronization of individual query hits, > effectively only updating added/removed tasks, because task data will be picked > up and synchronized by "get since last sync..." request. Currently synchChanged doesn't return new hits so this won't successfully update a query. And my understanding is that for most connectors (Bugzilla the exception) the act of performing a query results in all the task data being sent anyway, hence the recent changes to the performQuery/collector mechanism to actually do the update of task data on the spot. (In reply to comment #4) > > I've been suggesting to reverse that completely. So, you can skip query > > execution, or at least skip synchronization of individual query hits, > > effectively only updating added/removed tasks, because task data will be > > picked up and synchronized by "get since last sync..." request. > > Currently synchChanged doesn't return new hits so this won't successfully update > a query. Ok. But it still worth to synchronize all the updated tasks inside synchChanged and then query can skip synchronization for everything but new hits. Right now if I synchronize my Mylar query it synchronizing about 500 issues every time. I don't think there is a need for that. > And my understanding is that for most connectors (Bugzilla the > exception) the act of performing a query results in all the task data being sent > anyway, hence the recent changes to the performQuery/collector mechanism to > actually do the update of task data on the spot. Not quite, because getChanged don't work with task data, so those hits are fetched again individually; and then again when query is executed in performQuery() that is called after getChanged. > > > I've been suggesting to reverse that completely. So, you can skip query > > > execution, or at least skip synchronization of individual query hits, > > > effectively only updating added/removed tasks, because task data will be > > > picked up and synchronized by "get since last sync..." request. > > > > Currently synchChanged doesn't return new hits so this won't successfully > update > > a query. > Ok. But it still worth to synchronize all the updated tasks inside synchChanged > and then query can skip synchronization for everything but new hits. We need a per task data timestamp to determine when it was synchronized last. The current approach of having a per repository is not sufficient anymore since now task data can be synchronized during query execution. > > And my understanding is that for most connectors (Bugzilla the > > exception) the act of performing a query results in all the task data being > sent > > anyway, hence the recent changes to the performQuery/collector mechanism to > > actually do the update of task data on the spot. After looking at the implementation I have to revise my statements for Trac in that regard. Trac only receives a list of ids when performing a query so it is more similar to Bugzilla than JIRA. > Not quite, because getChanged don't work with task data, so those hits are > fetched again individually; JIRA could update the task data as part of the query and always return an empty list. (In reply to comment #3) > I've been suggesting to reverse that completely. So, you can skip query > execution, or at least skip synchronization of individual query hits, > effectively only updating added/removed tasks, because task data will be picked > up and synchronized by "get since last sync..." request. We can't avoid getting most of the task data for JIRA in the first place so we need to make sure we handle the retrieved information as efficiently as possible. Since some information (i.e. editable attributes and available operations) is missing I suggest that JIRA manually migrates that data for each hit: 1. Migrate information from existing task data for that issue 2. If no task data is available, retrieve information from cache 3. If not cached, get information from repository and add to cache > > Eugene, JIRA currently makes two server round trips for each retrieved query > hit > > making query synchronization unusable. How are we going to fix that? > Which ones you you are referring to? It invoked server.getEditableAttributes() and server.getAvailableOperations(). > Anyways, we can (and should) improve current situation between query sync and > "get since last sync..." calls; and maybe narrow "get since last sync..." call > (i.e. scan trough tasks passed to "get since last sync" and if they have task > tada, request only changes in affected projects). The former would also help > Bugzilla sync, which is taking 5..10 minutes for my Mylar query. (In reply to comment #7) > We can't avoid getting most of the task data for JIRA in the first place so we > need to make sure we handle the retrieved information as efficiently as > possible. Since some information (i.e. editable attributes and available > operations) is missing I suggest that JIRA manually migrates that data for each > hit: > > 1. Migrate information from existing task data for that issue > 2. If no task data is available, retrieve information from cache > 3. If not cached, get information from repository and add to cache Migrate where? And what cache? > > > Eugene, JIRA currently makes two server round trips for each retrieved query > > > hit making query synchronization unusable. How are we going to fix that? > > Which ones you you are referring to? > It invoked server.getEditableAttributes() and server.getAvailableOperations(). See my comment quoted below. Like I said, those calls are pretty much unavoidable, unless we'll make some assumptions and cache their results. > > Anyways, we can (and should) improve current situation between query sync and > > "get since last sync..." calls; and maybe narrow "get since last sync..." call > > (i.e. scan trough tasks passed to "get since last sync" and if they have task > > tada, request only changes in affected projects). The former would also help > > Bugzilla sync, which is taking 5..10 minutes for my Mylar query. > > We can't avoid getting most of the task data for JIRA in the first place so we > > need to make sure we handle the retrieved information as efficiently as > > possible. Since some information (i.e. editable attributes and available > > operations) is missing I suggest that JIRA manually migrates that data for > each > > hit: > > > > 1. Migrate information from existing task data for that issue > > 2. If no task data is available, retrieve information from cache > > 3. If not cached, get information from repository and add to cache > > Migrate where? And what cache? If we had previously received task data from a query or manual synchronization retrieve that from TaskDataManager and use it to complete the query hit. The cache for editable attributes and available operations does not yet exist and needs to be implemented. (In reply to comment #9) > If we had previously received task data from a query or manual synchronization > retrieve that from TaskDataManager and use it to complete the query hit. Task data should be populated completely (in getChanged) and there should be no need to resync or migrate it again. > The cache for editable attributes and available operations does not yet exist > and needs to be implemented. You previously convinced me that caching that data is bad idea. Have you changed your mind? I have feeling that we don't understand each other and this discussion is not getting us to the solution. We need to raise priority and somehow resolve this as soon as possible. I am not sure how we can resolve that, but here is my observations of the current api and implementation that is related to query sync: -- QueryHitCollector.accept(RepositoryTaskData taskData) don't store received task data. I think it should do that -- SynchronizeQueryJob.run(IProgressMonitor monitor) only runs synchronizeChanged() is there are any changes in queries. Which is wrong -- SynchronizeQueryJob.run(..) should run synchronizeChanged() before individual queries, so it can skip queries if there is no changed tasks in the repository More over, when synchronizing those queries, we only need to fetch task data for newly added hits, because everything else should be already updated during synchronizeChanged() -- AbstractRepositoryConnector.getChangedSinceLastSync() should receive QueryHitCollector as a parameter, so it would be able to pass task data if connector is getting it during query sync, then RepositorySynchronizationManager.SynchronizeChangedTasksJob.run() would only synchronize tasks that don't have task data. -- Full synchronization of the query task data is only needed when AbstractRepositoryConnector.getChangedSinceLastSync() can't figure out changed hits or query wasn't previously synchronized. Created attachment 70795 [details]
mylar/context/zip
> > If we had previously received task data from a query or manual synchronization > > retrieve that from TaskDataManager and use it to complete the query hit. > Task data should be populated completely (in getChanged) and there should be no > need to resync or migrate it again. I agree. > > The cache for editable attributes and available operations does not yet exist > > and needs to be implemented. > You previously convinced me that caching that data is bad idea. Have you changed > your mind? I never said that caching is bad and I haven't changed my mind about it. I want to use cached information for query hits only and refresh the cache when during manual synchronization. (In reply to comment #13) > I never said that caching is bad and I haven't changed my mind about it. I want > to use cached information for query hits only and refresh the cache when during > manual synchronization. I don't think it is going to work. For jira, there is really no query hits and manual synchronization is rather more time critical, since user is most likely waiting for its results (I am surely do). Thank you for summarizing your thoughts. > -- QueryHitCollector.accept(RepositoryTaskData taskData) don't store received > task data. I think it should do that This is done by TaskFactory which is invoked by QueryHitCollector. > -- SynchronizeQueryJob.run(IProgressMonitor monitor) only runs > synchronizeChanged() is there are any changes in queries. Which is wrong It's called on each synchronization from my understanding. > -- SynchronizeQueryJob.run(..) should run synchronizeChanged() before individual > queries, so it can skip queries if there is no changed tasks in the repository If synchronizeChanged() detects a new or changed issue you need to run the query otherwise you don't know if the query results displayed by Mylar still match the query, right? > -- AbstractRepositoryConnector.getChangedSinceLastSync() should receive > QueryHitCollector as a parameter, so it would be able to pass task data if > connector is getting it during query sync, then > RepositorySynchronizationManager.SynchronizeChangedTasksJob.run() would only > synchronize tasks that don't have task data. The connector can already invoke performQuery() internally with the current API and use a TaskFactory and QueryHitCollector to do that. > -- Full synchronization of the query task data is only needed when > AbstractRepositoryConnector.getChangedSinceLastSync() can't figure out changed > hits or query wasn't previously synchronized. This seems like a good idea and it should work for Trac which also returns new tickets when queried for changed tickets. This should minimize the synchronization overhead for repositories that don't have many changes. I am fine with these changes as long as they are not applied to manual synchronization of queries. If I force a synchronization I am foremost interested in accuracy and not speed. (In reply to comment #15) > > -- SynchronizeQueryJob.run(IProgressMonitor monitor) only runs > > synchronizeChanged() is there are any changes in queries. Which is wrong > It's called on each synchronization from my understanding. synchronizeChanged() is only called if any query from same repository had changes > > -- SynchronizeQueryJob.run(..) should run synchronizeChanged() before > individual > > queries, so it can skip queries if there is no changed tasks in the repository > > If synchronizeChanged() detects a new or changed issue you need to run the query > otherwise you don't know if the query results displayed by Mylar still match the > query, right? Not unless getChanged() detected no changes at all regardless of the passed tasks (not possible in current api) > > -- AbstractRepositoryConnector.getChangedSinceLastSync() should receive > > QueryHitCollector as a parameter, so it would be able to pass task data if > > connector is getting it during query sync, then > > RepositorySynchronizationManager.SynchronizeChangedTasksJob.run() would only > > synchronize tasks that don't have task data. > > The connector can already invoke performQuery() internally with the current API > and use a TaskFactory and QueryHitCollector to do that. It can't because it need to filter by list of tasks and performQuery() don't take it. > I am fine with these changes as long as they are not applied to manual > synchronization of queries. If I force a synchronization I am foremost > interested in accuracy and not speed. I don't think forced and manual should be the same thing. If user kicked synchronization he is waiting for results, so speed is more important comparing to background sync. > I don't think forced and manual should be the same thing. Not sure what you mean. In terms of the Mylyn API "forced" means manual synchronization. > If user kicked > synchronization he is waiting for results, so speed is more important comparing > to background sync. Other opinions? Rob, Mik? (In reply to comment #17) > > I don't think forced and manual should be the same thing. > Not sure what you mean. In terms of the Mylyn API "forced" means manual > synchronization. I am talking from the user point of view (In reply to comment #15) > > -- Full synchronization of the query task data is only needed when > > AbstractRepositoryConnector.getChangedSinceLastSync() can't figure out changed > > hits or query wasn't previously synchronized. > > This seems like a good idea and it should work for Trac which also returns new > tickets when queried for changed tickets. This should minimize the > synchronization overhead for repositories that don't have many changes. I like the idea too of switching these two up. (In reply to comment #17) > > If user kicked > > synchronization he is waiting for results, so speed is more important > comparing > > to background sync. When the user manually synchs a query we need to ensure accuracy. One use case for this is that if you are about to board a plane, forcing a manual synch on a query should get you up to date so you can be sure you have the latest incoming/data to work with. I am also concerned that the synchronization timestamp could get skewed for whatever reason (e.g. server time is incorrect). I want an easy way to manually trigger a synchronization that ensures I have accurate results independent of synchronization state stored by Mylyn. (In reply to comment #19) > > If user kicked synchronization he is waiting for results, so speed is more important > > comparing to background sync. > When the user manually synchs a query we need to ensure accuracy. One use case > for this is that if you are about to board a plane, forcing a manual synch on a > query should get you up to date so you can be sure you have the latest > incoming/data to work with. I don't see why more incremental builds would reduce accuracy. Also, from my experience, there is usually no time to wait 30 minutes while "from scratch" sync will be completed when you have to board your plane... I could not figure out how to correctly implement query synchronization for Trac. Eugene, if you have a patch by tomorrow evening that simplifies the current API and optimizes query synchronization for JIRA I'd be happy to review it and look into any necessary modifications for Trac and Bugzilla. Created attachment 71134 [details]
proposed api changes
Here is incomplete and untested patch (run out of time), but I hope you could get an idea. I think it will get stale real quick because of the refactorings around local tasks.
Please pay most of the attention to changes in bugzilla, jira and track connectors, in RepositorySynchronizationManager and the guys that call him.
It seems like JIRA connector would benefot the most from it. Apparently trac don't have task data handy, and bugzilla don't use timestamp-based query and instead explicitly checks passed tasks by ids.
One thing I didn't cover is that performQuery() call need to receive list of tasks it should NOT synchronize (which is returned by getChangedSinceLastSync() call). Again, that would be most beneficial for JIRA, because it could skip downloading task data for those tasks.
Created attachment 71135 [details]
mylar/context/zip
Thanks Eugene. Unfortunately I get merge conflicts for every file. Could you possibly recreate the patch? I think the general idea of the patch is good but we can do without adding a query collector to getChangedSinceLastSync(). I'll try to finish post a proposal based on the patch by late tonight. Created attachment 71190 [details]
merged with recent changes
I needed query result collector to save the task data. It might be useful to somehow differenciate accept(ITask) and accept(RepositoryTaskData) calls om query result collector, in order to retrieve list of tasks that still need to be synchronized. Then we may not need to return collection from getChanged() call.
Created attachment 71272 [details]
next patch iteration
This patch introduces major changes to the synchronization process. Most notably
getChangedSinceLastSync() is replaced by updateNeedsSynchronization() which flags all tasks that have remote changes that haven't been synchronized. A new boolean attribute "needsSynchronization" that is also externalized was added to AbstractTask for that.
Query synchronization logic has been changed as following (implementation is in SynchronizeQueryJob):
1. Flag changed tasks
2. Run queries if changes were found
3. Synchronize flaged tasks
4. Update synchronization time stamp
Other changes:
* QueryHitCollector now implements ITaskCollector and has been separated from SearchHitCollector
* Part of the query logic implementation for Bugzilla was reverted, since downloading of the full task data is back in the framework
* JIRA now stores all downloaded task data and uses cached information from "old" task data to avoid additional server round-trips
Remaining work:
* Add detection of new tasks for Bugzilla
* Review TaskFactory
* Test synchronization of web tasks
* Test search
* Add documentation
* Find a better name for AbstractRepositoryConnector.updateNeedsSynchronization()
Generally the patch should not make things worse for any connector:
* Results for Bugzilla should be the same as before with some potential for optimization similar to Trac
* For Trac traffic utilization is a lot less for repositories that have few changes since queries are only run when the repository has changed
* For JIRA query synchronization is faster by a magnitude and task data is not downloaded more often than needed
* The web connector needs to be tested but it is probably not affected since it does not implement most of the API
It would be very helpful to get feedback on this patch as soon as possible to stabilize the API and implementation. Please note that this patch is not ready for merging, yet, but I'll apply it to my bootstrapped workspace to gain more test results.
Created attachment 71273 [details]
mylar/context/zip
Created attachment 71276 [details]
fourth iteration
Awesome. Rob: please review early tomorrow. As Steffen points out let's not forget to come up with a good name for that method. Steffen: I'm worried that this will get stale given that I'm still doing considerable refactoring. OK for me to apply with a brief review and for Rob to do more detailed review tomorrow? I'd rather wait for some feedback from Rob and Eugene before applying this since the patch changes some of the semantics of the synchronization API significantly. Don't worry about the merging, I did it multiple times in the last hours and got only very few easy to resolve conflicts. A quick review would be great anyhow :). Created attachment 71325 [details]
fifth iteration
* simplified ITaskFactory
* preserve lastSyncDateStamp in QueryHitCollector
This is looking great. What if we changed the notion of 'needs synchronizing' to 'stale'? That way the terminology works potentially for both repository and local tasks and is shorter: connector.updateNeedsSynchronization -> markStale , AbstractTask.isStale() AbstractTask.setStale(), or something along those lines? Great idea. How about connector.markStaleTasks()? I'll do the renaming and post an updated patch. I am currently looking into partially resolving bug 154258 as well. Perfect! Created attachment 71344 [details]
sixth iteration
Changes in this patch:
* Improved feedback in progress view
* Run synchronization of changed tasks in same job as query synchronization
There is still one job per repository but we can address that later if needed.
Created attachment 71355 [details]
seventh iteration
Created attachment 71364 [details]
eighth iteration
Great. Committed with minor alterations. This combined with the recent changes to task change notification lead to some poor interactions (editor refreshing multiple times, query italics sticking etc).We're going to have to test this thoroughly. Assigning to Steffen since he is iterating on this. (In reply to comment #27) > Remaining work: > > * Add detection of new tasks for Bugzilla For now Bugzilla runs queries on each synchronization cycle. This could be optimized in the future to only run if there are modified tasks or new tasks. > * Review TaskFactory I am not happy about the boolean flags in that class. We should review that API for 3.0. > * Test synchronization of web tasks Tracked in bug 192309 . > * Test search Seems to work. This could be optimized for JIRA to make it non task data based. > * Add documentation Tracked as part of bug 187720 . > * Find a better name for > AbstractRepositoryConnector.updateNeedsSynchronization() The name was changed to markStaleTasks(). Looks like we are done here. Great work guys! Talk about task-based collaboration :) What I don't like about new "stale" flag is that it is transient and not quite belong to the task data. (In reply to comment #44) > What I don't like about new "stale" flag is that it is transient and not > quite belong to the task data. The flag is persisted in the task list. You're right that it is not related to task data and is also useful for connectors that do not support offline storage. SynchronizeTaskJob will fallback to invoking AbstractRepositoryConnector.updateTaskFromRepository() for all flaged tasks when no ITaskDataHandler is available for a repository. (In reply to comment #45) > The flag is persisted in the task list. That is another bad thing about it. I am not sure if there is any need in persisting such information, because it can be calculated based on times. Also, persisting few thousands tasks just to resync them is sounds kinda weird to me. (In reply to comment #46) > That is another bad thing about it. I am not sure if there is any need in > persisting such information, because it can be calculated based on times. The reason is to persist it is that it is not possible to recover information about stale tasks once the synchronization timestamp has been updated. I agree that for the current connectors we could get around without storing the flag but I don't why it is bad to store it. > Also, > persisting few thousands tasks just to resync them is sounds kinda weird to > me. Not sure what you mean by that? Please make constructive suggestions if you have ideas how to improve the current API. If you think this API is acceptable for the 2.0 release we should open a new bug and discuss improvements for the next release cycle there. The reason behind persisting the stale flag for tasks is that a connector could determine the changed tasks and then update the repository synchronization time stamp before the tasks are synchronized. At that point the information about stale tasks in the task list can not be recovered since the original timestamp is not known anymore. This does not apply to the reference implementations. Trac and Bugzilla update the timestamp after all changed tasks have been synchronized. JIRA synchronizes the tasks while querying for changed tasks and then updates the timestamp. (In reply to comment #48) > The reason behind persisting the stale flag for tasks is that a connector > could determine the changed tasks and then update the repository > synchronization time stamp before the tasks are synchronized. I might be missing something, but I thought that connector could determine tasks changed since last sync for any given repository without having timestamp on each task... Anyways, there is no time left for tweaking this API at this point. |