| Summary: | do not check for stale tasks if synchronizing query on Bugzilla 3.4 or later | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | z_Archived | Reporter: | Thomas Ehrnhoefer <thomas.ehrnhoefer> | ||||||||
| Component: | Mylyn | Assignee: | Project Inbox <mylyn-triaged> | ||||||||
| Status: | CLOSED MOVED | QA Contact: | |||||||||
| Severity: | enhancement | ||||||||||
| Priority: | P3 | CC: | b.muskalla, eclipse, robert.elves, steffen.pingel | ||||||||
| Version: | unspecified | ||||||||||
| Target Milestone: | --- | ||||||||||
| Hardware: | PC | ||||||||||
| OS: | Windows 7 | ||||||||||
| Whiteboard: | |||||||||||
| Bug Depends on: | |||||||||||
| Bug Blocks: | 281893, 359210 | ||||||||||
| Attachments: |
|
||||||||||
I just understood how the preSync is different - it uses a different URL (adding a chFieldTo=Now). However I still see two issues here: # the staleTasks in the SynchronizationSession is not part of the "session.needsPerformQueries", thus the bugzilla connector needs to manually set the needsPerformQueries to false if the changed call retrieved no results # the changed call can retrieve results even though the task list already has the newest task data ** reproducable by changing a task of a query in the webUI (change summary ** sync query once, see incoming, clear incoming ** sync query again: note that the changed call will still return this changed task ** change the task with mylyn task editor ** sync query again: note that the changed call will NOT return this task anymore Alright, so far my analysis. I am happy to provide a patch to fix those two issues if that analysis is actually correct and came to the right conclusions. Please let me know Sorry, comment one mentioned 2 issues, but the (*"// XXX: HACK in case of ip change bugzilla can return 0 hits"*) mentioned in the description is a third issue I think Created attachment 182578 [details]
patch
this is a patch that improves the performance quite dramatically for the queries I wanted to speed up, and is based on my comments above. Sure it's not that easy, but I will just try it out for a bit and see how it goes
The Bugzilla queries are executed in several steps: 1. Check for changed tasks 2. Run queries 3. Refresh changed tasks The query in (1) runs for all tasks whether part of a query or not. The queries in (2) need to always execute since (1) does not detect new tasks. Did you synchronize all tasks or just the query to produce the results in comment 1? If the latter is the case the described behavior is expected since the time stamp is tracked on a per repository basis but not on a per query or task basis. Frank, Rob, is that work around when a query does not return results still required? Thanks for the explanation Steffen. Sounds like the only potential improvements (at least theoretically) is to fix the login/logout hack, as well as improve the check for changed tasks to not resync already known tasks. (In reply to comment #4) > Did you synchronize all tasks or just the query to produce the results in > comment 1? If the latter is the case the described behavior is expected since > the time stamp is tracked on a per repository basis but not on a per query or > task basis. I synced a query. Did I understand this right: as long as just one query get's synced, it will always find this changed task, but as soon as the Mylyn automatic sync syncs the whole repo, this changed task will not be returned in later query syncs? If this is the case, I think we need to try to improve that, as one of our new features coming soon relies on a (hopefully fast) single query to populate an editor. (In reply to comment #4) > The Bugzilla queries are executed in several steps: > > 1. Check for changed tasks > 2. Run queries > 3. Refresh changed tasks > > The query in (1) runs for all tasks whether part of a query or not. The queries > in (2) need to always execute since (1) does not detect new tasks. Again, checking if I understand this right: The SyncSession object carries all relevant tasks for the running querySync. It might contain more than just the query would return, thus (1) is necessary. If it would only care about the tasks that the query returns, we could just run (2) and be fine? If that is the case, would it be possible to add some sort of flag to a query (actually a SyncSession) to NOT care about any other tasks, but only the ones in the query? In this feature of ours that requires a fast running query, we know (for this one query sync) that we do not care about anything else than the query results, so it seems we could easily skip (1). Also, shouldn't we be able to add the chFrom/chTo parameters to (2) and thus reduce the results and hopefully increase the speed a bit (especially for queries with lots of tasks in it)? Created attachment 182628 [details]
patch v2
This is (without cleanup) what I had in mind for adding the changed timestamp to (2).
As for adding some flag to prevent the individual tasks sync, that requires API change from what I can tell, so I am waiting until I hear if that's even a possible solution.
Created attachment 182629 [details]
mylyn/context/zip
(In reply to comment #5) > Also, shouldn't we be able to add the chFrom/chTo parameters to (2) and thus > reduce the results and hopefully increase the speed a bit (especially for queries > with lots of tasks in it)? (In reply to comment #6) > Created an attachment (id=182628) > patch v2 > > This is (without cleanup) what I had in mind for adding the changed timestamp to > (2). > As for adding some flag to prevent the individual tasks sync, that requires API > change from what I can tell, so I am waiting until I hear if that's even a > possible solution. Scratch that. We would need to add all existing tasks to the result, plus there is no way to not get tasks which due to a change do not show up in this query anymore. So this little optimization is not gonna work. > (In reply to comment #4) > > Did you synchronize all tasks or just the query to produce the results in > > comment 1? If the latter is the case the described behavior is expected since > > the time stamp is tracked on a per repository basis but not on a per query or > > task basis. > > I synced a query. > Did I understand this right: as long as just one query get's synced, it will > always find this changed task, but as soon as the Mylyn automatic sync syncs the > whole repo, this changed task will not be returned in later query syncs? > If this is the case, I think we need to try to improve that, as one of our new > features coming soon relies on a (hopefully fast) single query to populate an > editor. Yes, that is the behavior implemented in the Bugzilla connector. It's up to each connector implementation to manage this according to the specifics of a repository. > (In reply to comment #4) > > The Bugzilla queries are executed in several steps: > > > > 1. Check for changed tasks > > 2. Run queries > > 3. Refresh changed tasks > > > > The query in (1) runs for all tasks whether part of a query or not. The > queries > > in (2) need to always execute since (1) does not detect new tasks. > > Again, checking if I understand this right: The SyncSession object carries all > relevant tasks for the running querySync. It might contain more than just the > query would return, thus (1) is necessary. If it would only care about the tasks > that the query returns, we could just run (2) and be fine? No, not all Bugzilla versions return a modification time stamp as part of the query results hence it is not possible to detect changed tasks without running (1) first. > If that is the case, would it be possible to add some sort of flag to a query > (actually a SyncSession) to NOT care about any other tasks, but only the ones in > the query? In this feature of ours that requires a fast running query, we know > (for this one query sync) that we do not care about anything else than the query > results, so it seems we could easily skip (1). Again, Bugzilla needs to check for changed tasks for the reasons stated above but connectors may skip (1) if query results provide sufficient information or if a repository API does not provide functionality to check for changed tasks. The Trac connector does check that in preSynchronization(): if (!session.isFullSynchronization()) { return; } > Also, shouldn't we be able to add the chFrom/chTo parameters to (2) and thus > reduce the results and hopefully increase the speed a bit (especially for > queries with lots of tasks in it)? I can't see how this would work. The purpose of running a query is to get all results that match and to determine which tasks do not match any longer. The purpose is not to detect changed tasks. Thanks again :) I guess an easy improvement is not possible than. But it sounds like different behaviour based on Bugzilla versions could work: if version supports modificationDate, don't do (1) and do an extended (2) with setting stale tasks? (In reply to comment #10) > But it sounds like different behaviour based on Bugzilla versions could work: if > version supports modificationDate, don't do (1) and do an extended (2) with > setting stale tasks? Yes, that optimization would make sense to me. The feature was added in Bugzilla 3.4: 341542: Include timestamp of query execution in programmatic search output (RDF, etc.) https://bugzilla.mozilla.org/show_bug.cgi?id=341542 Great. I probably wont provide a patch soon as I am rather busy. I created another task for the "hack" thing: 329697: reevaluate logout "hack" https://bugs.eclipse.org/bugs/show_bug.cgi?id=329697 Mylyn has been restructured, and our issue tracking has moved to GitHub [1]. We are closing ~14K Bugzilla issues to give the new team a fresh start. If you feel that this issue is still relevant, please create a new one on GitHub. [1] https://github.com/orgs/eclipse-mylyn |
I was looking into some bugzilla query performance on one of our instances, and started to debug the performQuery. Just for the record, here is the (I think) intended workflow of Bugzilla's performQuery: # getSearchHits # if no hits ## logout ## getSearchHits Problem is that in my case, every performQuery will have no hits for the first getSearchHits try, and thus do logout and try again. Of the query tested, a successful getSearchHits, taskes about 2.5sec, but the unsuccessful one plus the logout add additional 2-3sec to that. The whole peformQuery thus takes about twice as long as is should I think. Not why does it always fail on the first try? Well, if a query has not changed tasks, the getSearchHits has no hits. So the logout and retry will be useless in a whole lot of calls. It seems this retry was added due to IP changes ("*// XXX: HACK in case of ip change bugzilla can return 0 hits*"), but it seems like a problematic hack (well, are there non-problematic ones...) as it slows down every user on probably the majority of calls. Besides that, I still found the query slow, so I looked further. I found that there are still always *two performQuery* calls. This is due to the fact that Bugzilla does a performQuery in the preSynchronization method, as well as one for the actual synchronization. From my point of view (without knowing too much of the internals) there are three issues here: # the preSync, no matter if there are staleTasks found or not, will always lead to a follow up sync # the preSync sometimes finds a stale task but the task list does not find incoming changes # from my POV the performQuery is doing the same on each call I think the latter is the most important, it makes me wonder why even do the presync, why not just always do the sync? I am almost certain there is a difference I don't see here, so please help me out :) Thanks for listening, hopefully something can be done about the performance here