Bug 196056 - [performance] reduce frequency of repository configuration download
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: Mylyn
Version: unspecified
Hardware: PC All
Importance: P1 normal
Target Milestone: 2.1
Assignee: maarten meijer CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 205416 205708
Reported: 2007-07-10 15:09 EDT by Mik Kersten CLA
Modified: 2007-10-08 08:14 EDT

See Also:


Attachments
mylyn/context/zip (18.00 KB, application/octet-stream)
2007-07-13 13:45 EDT, Robert Elves CLA
Adapted config.cgi script that will cut about 30% of the cached config file (941 bytes, text/plain)
2007-10-03 11:32 EDT, maarten meijer CLA
Bugzilla client accepting gzipped configuration (3.51 KB, patch)
2007-10-04 06:06 EDT, maarten meijer CLA
Code should handle redirect to gzipped file as well (3.01 KB, patch)
2007-10-05 10:05 EDT, maarten meijer CLA
mylyn/context/zip (19.30 KB, application/octet-stream)
2007-10-05 10:05 EDT, maarten meijer CLA

Description Mik Kersten CLA 2007-07-10 15:09:43 EDT
We currently download the configuration data on every 10th synchronization, which with default settings will happen every 10 x 20 minutes when connected and using repository tasks/queries.  On the bugs.eclipse.org repository the current configuration size is 900K, so this adds up to considerable network traffic (see http://eclipsewebmaster.blogspot.com/2007/07/mylyn-popularity-busy-bugzilla.html ).  We should explore ways to optimize network usage, e.g.:

* [cheaper] Reduce the interval to be every 100th download.  This would need to be done in conjunction with an "Update Attributes" button in the Task Editor since the attributes would be much more likely to go stale.

* [better] Figure out a mechanism to determine whether the attributes have changed.  Since we don't yet have a way of determining this we should file a bug against bugzilla.mozilla.org and explore options (e.g. trying to get the size of the file to see if it changed before downloading all contents).
Comment 1 Denis Roy CLA 2007-07-10 15:21:21 EDT
Mozilla could use some Expires and caching headers so that the document can at least be cached (save bandwidth).

With the above in place, server maintainers (like me) could also add cache mechanisms (squid proxies, etc) so that not all the requests are served from dynamic content each time (saving precious CPU and database cycles).
Comment 2 Willian Mitsuda CLA 2007-07-10 18:57:23 EDT
(In reply to comment #0)
> * [cheaper] Reduce the interval to be every 100th download.  This would need to
> be done in conjunction with an "Update Attributes" button in the Task Editor
> since the attributes would be much more likely to go stale.

See also bug#162428.
Comment 3 Denis Roy CLA 2007-07-10 19:18:34 EDT
Another option: could Mylyn look for a config.cgi.xml *before* (and instead of) loading config.cgi?  If the xml file doesn't exist, config.cgi is then loaded. That way the admin of a large, busy site could simply auto-generate a static (and cacheable) xml file as needed.
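
The lookup order proposed here can be sketched as follows. This is only a minimal Python illustration, not Mylyn's actual (Java) code: the function name `fetch_configuration` and the `fetch` callable are hypothetical, and `fetch` is assumed to return `None` when the server answers 404.

```python
from typing import Callable, Optional, Tuple

# File names follow comment 3; everything else here is a hypothetical stand-in.
STATIC_NAME = "config.cgi.xml"
DYNAMIC_NAME = "config.cgi"


def fetch_configuration(base_url: str,
                        fetch: Callable[[str], Optional[bytes]]) -> Tuple[str, bytes]:
    """Try the static, cacheable file first; fall back to the dynamic script."""
    for name in (STATIC_NAME, DYNAMIC_NAME):
        url = f"{base_url.rstrip('/')}/{name}"
        body = fetch(url)  # assumed to return None on a 404
        if body is not None:
            return url, body
    raise LookupError(f"no repository configuration found under {base_url}")
```

One trade-off of this order is that a stock Bugzilla without the static file costs one extra (cheap) 404 request before the dynamic script is loaded.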
Comment 4 Robert Elves CLA 2007-07-10 19:52:17 EDT
Denis, both options sound reasonable to me.  Another option could be to write a short Perl script to dump a last modified timestamp to a config.timestamp file. We could then (if available) use this timestamp as the basis for 'synchronization' of the repository configuration.
Comment 5 Gunnar Wagenknecht CLA 2007-07-11 01:07:14 EDT
(In reply to comment #1)
> Mozilla could use some Expires and caching headers so that the document can at
> least be cached (save bandwidth).

I think that's the right way to go. HTTP provides some good caching options. Those should be leveraged instead of forcing the Bugzilla maintainers to add some extra scripts and processing to their servers just to support Mylyn. It should be easy to remember the last modified time somewhere. Maybe Mylyn could help the Bugzilla developers with a patch.

Comment 6 Eugene Kuleshov CLA 2007-07-11 01:46:18 EDT
(In reply to comment #5)
> I think that's the right way to go. HTTP provides some good caching options.

Gunnar, that could be a dangerous option. A timestamp on the cached document does help to save traffic, but it would also make it impossible to see if the server configuration had changed and may lead to abuse of the no-cache option. There were several reports that users don't immediately see the updated configuration because it is not being retrieved instantly. Such an issue would probably need some special support on the server: either a separate timestamp document like Rob suggested, or explicitly cleaning the cached document and its timestamp upon any configuration change.
Comment 7 Gunnar Wagenknecht CLA 2007-07-11 02:25:42 EDT
(In reply to comment #6)
> Gunnar, that could be a dangerous option. Timestamp on the cached document does
> help to save the traffic, but it would also make impossible to see if server
> configuration had changed and may lead to abuse of the no-cache option.

That would only happen if it would be poorly implemented. However, from the discussion in the bug I got the impression that this won't happen.

> There were several reports that users don't immediately see updated
> configuration because it is not being retrieved instantly. Such issue 
> probably would need some special support on the server.

Mhm. I don't see why there should be a need for this. Each browser has a refresh button and a possibility to clear the cache and reload the content from the server. Even Mylyn could have a refresh button that a user explicitly invokes. 

> Either a separate timestamp document like
> Rob suggested, or explicitly clean the cached document and its timestamp upon
> any configuration change.

That's weird because nobody other than Mylyn would benefit from this solution. Sorry but I don't see any Bugzilla committer accepting such a patch. Moreover, writing the last-modified date into a separate file is nothing more than using validation based caching (see [1]) by avoiding the HTTP capabilities and reinventing the wheel.

[1] http://billhiggins.us/weblog/2007/05/11/http-caching-options/

Comment 8 Eugene Kuleshov CLA 2007-07-11 03:00:47 EDT
(In reply to comment #7)
> > Either a separate timestamp document like
> > Rob suggested, or explicitly clean the cached document and its timestamp upon
> > any configuration change.
> That's weird because nobody other than Mylyn would benefit from this solution.
> Sorry but I don't see any Bugzilla committer accepting such a patch. Moreover,
> writing the last-modified date into a separate file is nothing more than using
> validation based caching (see [1]) by avoiding the HTTP capabilities and
> reinventing the wheel.

I am not an expert with all the caching options and got confused by Denis's comment about an external squid proxy. So, I was just trying to explain the use case: a Bugzilla client like Mylyn would really benefit from the ability to detect a configuration change without explicit refresh or force actions. It is not an issue in the web UI, because the web UI shows these changes immediately, and users somehow expect similar behavior in the rich UI.
Comment 9 Gunnar Wagenknecht CLA 2007-07-11 03:06:08 EDT
(In reply to comment #8)
> I am not an expert with all the caching options and got confused by the
> Denis's comment about external squid proxy.

I think that Squid will know how to deal with the caching options of HTTP. However, if you have a caching server in between that really enforces caching, then you'll lose anyway because the separate file containing your time stamp would also be cached.
Comment 10 maarten meijer CLA 2007-07-11 04:13:36 EDT
(In reply to comment #0)
> * [cheaper] Reduce the interval to be every 100th download.  This would need to
> be done in conjunction with an "Update Attributes" button in the Task Editor
> since the attributes would be much more likely to go stale.
I think you should also consider typical usage of Mylyn. I use it to schedule and structure (love those auto CVS comments) my own web work on my own bugzilla/CVS server.

I also use Mylyn to follow bugs I found and reported in tools I use, either Eclipse, some tool vendor or on bugs.mysql.com. But this is less important than my own repository. So we need a 'learning' Mylyn that can see whether it is my main bugzilla or just like a news reader. Can you query Mylyn users how many active repositories they have? I have 4.

My main suggestion: make the refresh interval(s) user-configurable per repository instance:
every 200/4000 minutes, at startup only (for lower interest sites), every week.
Similar to but also much better than auto-update ;-)

The second is to not refresh when it is not in the current Working Set, to get some sort of demand/usage driven reduction.

Comment 11 Robert Elves CLA 2007-07-11 17:34:43 EDT
 (In reply to comment #6)
> Either a separate timestamp document like Rob
> suggested, or explicitly clean the cached document and its timestamp upon any
> configuration change.

Summarizing Eugene's, Maarten's and others' suggestions (and Mik's guidelines on bug#196021), here's how I think we should proceed:

1) Improve Mylyn's config refresh policy
   - For 2.1 we will add an explicit attributes refresh button (bug#162428) and set auto refresh to happen once daily if Auto-synchronization is enabled
2) Ask Bugzilla to provide a timestamp file that is updated upon each config change 
3) Ask server administrators to cache the config output and invalidate frequently
3a) Consider extending Bugzilla to invalidate the cached copy when the config changes

(1) should reduce traffic significantly without sacrificing user experience
(2) is required in order to maintain the level of transparency that we're after
(3) can be done as needed by server admins

Denis, does this sound like a reasonable course of action from here?
Comment 12 maarten meijer CLA 2007-07-12 03:14:13 EDT
Using a timestamp file still requires a full config download. Would it be more economical to request a config delta from a certain date?
The return message will then be all config changes since the requested date or an empty delta.
Thus bugzilla will not resend information already present in Mylyn.
  
> 2) Ask Bugzilla to provide a timestamp file that is updated upon each config change
Comment 13 Denis Roy CLA 2007-07-12 10:09:40 EDT
There are essentially three problems here:

1. config.cgi wastes lots of cpu power, because the file is generated dynamically each time (cpu on the webserver (noticeable) + cpu on the database (negligible))

2. config.cgi wastes bandwidth, as our repository is fairly large (almost 1MB)

3. I don't really know when the repository changes, so I wouldn't know what to put in the timestamp file

This morning I solved (1) -- and perhaps (2) by simply moving config.cgi to config-stock.cgi, and replacing config.cgi with a script that simply sends the raw RDF data from a static file (config.xml). The static file is regenerated after x minutes.  In doing so, I send the appropriate HTTP/1.1 cache headers and content-length, so Mylyn could simply HEAD the file first to see if the size has changed since the last fetch.

Since doing this, I no longer see any config.cgi processes as using lots of cpu power, and config.cgi loads in milliseconds vs. seconds (the bottleneck being your internet connection).  Our web server's CPU load has decreased noticeably.


Here is my revised config.cgi script:

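#!/usr/bin/perl
use strict;
use warnings;

# Static copy served to clients, plus a scratch file for the download.
my $filename  = "/path/to/config.xml";
my $filename2 = "/path/to/config.xml2";

# -M returns the file age in days.
my $fileage = -M $filename;

$fileage = sprintf("%d", $fileage * 1440);   # in minutes plz

# Regenerate the static copy from the stock script when it is older than 15 minutes.
# The URL is quoted so the shell does not treat '?' as a glob character, and the
# copy only happens if the download succeeded.
if ($fileage > 15) {
        system("/path/to/wget -O $filename2 'https://bugs.eclipse.org:8443/bugs/config-stock.cgi?ctype=rdf' && cp $filename2 $filename");
}

my $filesize = -s $filename;

print "Content-type: application/rdf+xml\n";
print "Cache-Control: max-age=3600, must-revalidate\n";
print "Content-Length: " . $filesize . "\n";
print "\n";
open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
print <$fh>;
close $fh;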
Comment 14 Denis Roy CLA 2007-07-12 10:46:48 EDT
Because no one probably put any serious thought into the "every 10th synchronization" algorithm, I thought it would be fun to see what kind of impact this has on us.

I tailed the logfiles for GET /bugs/config.cgi requests for 5 minutes.  Note that only Mylyn fetches this config.cgi file.

Total requests: 142
28.4 req/minute, one request every 2 seconds

Total transfer size for 5 minutes: 121 megabytes, or 413 KB/sec (3.22 megabits/sec)

This one bug is now consuming about 5% of Eclipse.org's permanent bandwidth...
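
The figures above can be rechecked with a few lines of arithmetic. This is a sketch that assumes binary units (1 MB = 1024 KB, 1 megabit = 1024 kilobits), which is what reproduces the reported 413 KB/sec; the variable names are illustrative.

```python
# Figures reported in comment 14 (a 5-minute sample of the access log).
requests = 142
window_minutes = 5
transferred_mb = 121

# Assumption: binary units (1 MB = 1024 KB, 1 megabit = 1024 kilobits).
requests_per_minute = requests / window_minutes
seconds_between_requests = 60 / requests_per_minute
kb_per_second = transferred_mb * 1024 / (window_minutes * 60)
megabits_per_second = kb_per_second * 8 / 1024
mb_per_request = transferred_mb / requests

print(f"{requests_per_minute:.1f} req/minute, one every {seconds_between_requests:.1f} s")
print(f"{kb_per_second:.0f} KB/sec, {megabits_per_second:.2f} megabits/sec")
print(f"{mb_per_request:.2f} MB per request on average")
```

The average of roughly 0.85 MB per request is consistent with the configuration size of about 900K given in the description, which suggests that nearly every request was a full transfer.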
Comment 15 Steffen Pingel CLA 2007-07-12 14:22:57 EDT
This synchronization policy needs to be reviewed. Rob, the Eclipse repository is added by default and the configuration update will take place regardless of any queries present in the task list, right?

Adding up Denis's numbers, this means there are 5680 configuration updates within 200 minutes (the default update interval), which would be roughly equal to the number of Eclipse users having Mylyn running at this time with the default configuration?
Comment 16 Robert Elves CLA 2007-07-12 16:10:04 EDT
Great! So based on the work Denis has done, we can now add a check and only update the configuration when it has changed based on Content-Length field in http header.
If this field isn't available we'll simply assume that an update is required as per usual.

 (In reply to comment #15)
> This synchronization policy needs to be reviewed. Rob, the Eclipse repository is
> added by default and the configuration update will take place regardless of any
> queries present in the task list, right?
Correct. We can improve this as well.

Comment 17 Karl Matthias CLA 2007-07-12 16:54:34 EDT
(In reply to comment #16)
> Great! So based on the work Denis has done, we can now add a check and only
> update the configuration when it has changed based on Content-Length field in
> http header.
> If this field isn't available we'll simply assume that an update is required as
> per usual.

We could fairly easily add a Last-Modified header as well if you'd rather have that to go on than Content-length.


> 
>  (In reply to comment #15)
> > This synchronization policy need to be reviewed. Rob, the Eclipse repository is
> > added by default and the configuration update will take place regardless of any
> > queries present in the task list, right?
> Correct. We can improved this as well.

Yeah, removing this default needs to happen pretty fast.
Comment 18 Eugene Kuleshov CLA 2007-07-12 17:10:46 EDT
(In reply to comment #16)
> Great! So based on the work Denis has done, we can now add a check and only
> update the configuration when it has changed based on Content-Length field in
> http header.

Rob, I wonder how it is going to work for unchanged bugzilla installs? When config document is dynamically generated you may hit it twice...
Comment 19 Robert Elves CLA 2007-07-12 17:45:13 EDT
 (In reply to comment #17)
> (In reply to comment #16)
> > Great! So based on the work Denis has done, we can now add a check and only
> > update the configuration when it has changed based on Content-Length field in
> > http header.
> > If this field isn't available we'll simply assume that an update is required
> as
> > per usual.
> 
> We could fairly easily add a Last-Modified header as well if you'd rather have
> that to go on than Content-length.

This would be better if possible since this would more closely mirror how the rest of our synchronization works. Denis could you update your script to include this?

 (In reply to comment #18)
> Rob, I wonder how it is going to work for unchanged bugzilla installs? When
> config document is dynamically generated you may hit it twice...

The first call shouldn't be very expensive though since we will just request the header on the first call:
http://jakarta.apache.org/commons/httpclient/methods/head.html


Comment 20 Eugene Kuleshov CLA 2007-07-12 18:22:09 EDT
(In reply to comment #19)
> The first call shouldn't be very expensive though since we will just request the
> header on the first call:
> http://jakarta.apache.org/commons/httpclient/methods/head.html

I meant that it won't be expensive to the client, but for unmodified bugzilla the HEAD may trigger same database activity as the GET request, unless of course bugzilla explicitly handles the HEAD request already.
Comment 21 Robert Elves CLA 2007-07-12 18:47:23 EDT
 (In reply to comment #20)
> I meant that it won't be expensive to the client, but for unmodified bugzilla
> the HEAD may trigger same database activity as the GET request, unless of course
> bugzilla explicitly handles the HEAD request already.

Sorry yes, you raise a good point.  As is, the current design should address the problem of network load but may result in a double hit to the database depending on how Bugzilla handles the initial HEAD request.
Comment 22 Karl Matthias CLA 2007-07-12 18:55:52 EDT
As far as I can tell Bugzilla does not explicitly handle the HEAD request.
Comment 23 Eugene Kuleshov CLA 2007-07-12 19:17:10 EDT
(In reply to comment #21)
> Sorry yes, you raise a good point.  As is the current design should address the
> problem of network load but may result in a double hit to the database depending
> on how Bugzilla handles the initial HEAD request.

Right. Maybe it should be some kind of flag, in the repository settings or simply hardcoded for eclipse.org for the time being.
Comment 24 Denis Roy CLA 2007-07-12 21:16:55 EDT
Indeed, HEAD and GET are identical to Bugzilla.

+1 for comment 23 - do a HEAD for the first access, then if the Last-Modified header is received, then you know it's a "smart" bugzilla.  Otherwise, you assume a stock bugzilla and store that state.

I'd suggest this approach over hard-coding our site, as a proxy/caching server can also yield the same cacheable config.cgi without needing to modify bugzilla.
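
The probing approach described here can be sketched as a small decision function. This is a Python illustration under assumptions, not the Mylyn implementation: `RepositoryState` and `needs_download` are hypothetical names, and the headers are assumed to come from a HEAD request made by the caller.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RepositoryState:
    """Hypothetical per-repository state kept between synchronizations."""
    supports_last_modified: Optional[bool] = None  # None: not probed yet
    last_modified: Optional[str] = None            # value seen at the last check


def needs_download(state: RepositoryState, head_headers: Mapping[str, str]) -> bool:
    """Decide from a HEAD response whether the full configuration must be fetched."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    headers = {name.lower(): value for name, value in head_headers.items()}
    stamp = headers.get("last-modified")

    if stamp is None:
        # "Stock" Bugzilla: remember that, so later syncs can skip the HEAD,
        # which costs the server as much as a GET.
        state.supports_last_modified = False
        return True

    # "Smart" Bugzilla (or a caching proxy in front of it).
    state.supports_last_modified = True
    changed = stamp != state.last_modified
    state.last_modified = stamp
    return changed
```

For brevity the sketch records the new timestamp before the download has succeeded; a real client would probably store it only after the configuration was retrieved and parsed.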
Comment 25 Robert Elves CLA 2007-07-13 13:45:26 EDT
An attempt is now made to use the Last-Modified header field; if it is not available, subsequent attempts will default to retrieving the full configuration. If at a later date an admin updates the repository to use this Last-Modified field, the user can re-enable its use by selecting Task Repository Settings > Additional Settings > Cached Configuration. At some point we could consider automatically checking for this at some random interval so that the user is not responsible for this setting.

Repository configuration is now only updated upon first synch (i.e. shortly after a workbench startup) and then subsequently once daily.

Denis, if you could update bugs.eclipse.org to return the Last-Modified field, aside from some manual testing, I think we're done here.
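
The refresh schedule described in this comment (first synchronization after startup, then once daily) could look roughly like the sketch below. It is written in Python for illustration only; `ConfigRefreshPolicy` is a hypothetical name, and the configurable interval is an assumption borrowed from the per-repository suggestion in comment 10.

```python
from datetime import datetime, timedelta
from typing import Optional


class ConfigRefreshPolicy:
    """Refresh on the first sync after startup, then at most once per interval."""

    def __init__(self, interval: timedelta = timedelta(days=1)) -> None:
        self.interval = interval
        # Kept in memory only, so it is empty again after every workbench start.
        self.last_refresh: Optional[datetime] = None

    def should_refresh(self, now: datetime) -> bool:
        if self.last_refresh is None:
            return True  # first synchronization since startup
        return now - self.last_refresh >= self.interval

    def mark_refreshed(self, now: datetime) -> None:
        self.last_refresh = now
```

With the default 20-minute synchronization, such a policy would skip the configuration check on all but one synchronization per day.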
Comment 26 Robert Elves CLA 2007-07-13 13:45:34 EDT
Created attachment 73757 [details]
mylyn/context/zip
Comment 27 Mik Kersten CLA 2007-07-13 21:55:18 EDT
Fix is available from the latest dev build: http://eclipse.org/mylyn/downloads/builds.php
Comment 28 Karl Matthias CLA 2007-07-17 14:18:28 EDT
Denis is out on vacation this week so I updated the config.cgi he wrote to add Last-Modified.  It's specified in the correct W3 standard format like:

Last-Modified: Tue, 17 Jul 2007 14:08:22 EDT

Hope that does the trick!
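
One caveat on the format: HTTP/1.1 requires date headers to be expressed in GMT, so a zone abbreviation such as EDT may not be understood by every client. A minimal Python sketch of producing the GMT form for the same instant follows; it assumes EDT means UTC-4 and uses only the standard library.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

# The timestamp from comment 28, read as US Eastern Daylight Time (UTC-4).
EDT = timezone(timedelta(hours=-4))
local_stamp = datetime(2007, 7, 17, 14, 8, 22, tzinfo=EDT)

# format_datetime(..., usegmt=True) requires an aware datetime in UTC.
http_date = format_datetime(local_stamp.astimezone(timezone.utc), usegmt=True)
print("Last-Modified: " + http_date)

# Round trip: the header value parses back to the same instant.
assert parsedate_to_datetime(http_date) == local_stamp
```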
Comment 29 Mik Kersten CLA 2007-07-17 15:04:23 EDT
Thanks Karl!

Rob: look over this article to see if ETags could provide a good reusable mechanism for having servers store the date/cache state: http://www.infoq.com/articles/etags
Comment 30 Robert Elves CLA 2007-07-19 13:14:48 EDT
 (In reply to comment #28)
> Denis is out on vacation this week so I updated the config.cgi he wrote to add
> Last-Modified.  It's specified in the correct W3 standard format like:
> 
> Last-Modified: Tue, 17 Jul 2007 14:08:22 EDT
> 
> Hope that does the trick!

Great! If you are on the latest Mylyn dev build, open your Eclipse.org Task Repository settings and enable Cached Configuration under Additional Settings. Subsequent configuration synchronization will be more efficient. I'll make this known via Mylyn mailing lists and faq.


 (In reply to comment #29)
> Rob: look over this article to see if ETags could provide a good reusable
> mechanism for having servers store the date/cache state:
> http://www.infoq.com/articles/etags
Interesting, I'll have a look.
Comment 31 maarten meijer CLA 2007-07-22 09:08:04 EDT
I'm not really sure whether this belongs here, but in addition to server load, there is the problem of log file expansion.
When running /bugzilla/ locally, one can prevent Mylyn from filling up the log files by adding this to httpd.conf:

<DirectoryMatch /bugzilla/ >
	BrowserMatchNoCase Mylyn mylin=1
</DirectoryMatch>
CustomLog "/private/var/log/httpd/access_log" combined env=!mylin

Or should this go into the WiKi somewhere...
Comment 32 Robert Elves CLA 2007-07-23 01:18:29 EDT
 (In reply to comment #31) 
> Or should this go into the WiKi somewhere...
Yes, why don't you add this to http://wiki.eclipse.org/Mylyn_Bugzilla_Connector. Denis or Karl, you could also add the server configuration optimization there if you want (I've added a placeholder).
Comment 33 Mik Kersten CLA 2007-07-24 00:47:59 EDT
 (In reply to comment #31)
> BrowserMatchNoCase Mylyn mylin=1

Maarten: yes, would be great if you could add this.  When doing so please spell the project name correctly ("mylin" -> "mylyn") in order to avoid confusion.
Comment 34 Denis Roy CLA 2007-10-02 10:18:15 EDT
Reopening.

I noticed that for September, config.cgi requests totaled just over 2 terabytes (!!) of transfer.  This is about 10% of eclipse.org's total bandwidth, including downloads.

After some investigation, it became clear that the Mylyn client is not using the caching and expiry headers we set in comment 28, as not many requests receive a 304 "Not Modified" response.  Each request translated to a complete transfer of our repository, which is 950KB in size.

I used wget to make sure that caching was actually possible, and it is:
wget --header "If-Modified-Since: Tue, 02 Oct 2007 13:18:20 GMT" --delete-after -S --no-check-certificate https://bugs.eclipse.org/bugs/config.cgi
[snip]
HTTP request sent, awaiting response...
  HTTP/1.1 304 Not Modified
  Date: Tue, 02 Oct 2007 13:25:18 GMT
  Server: Apache
  Connection: Keep-Alive
  Keep-Alive: timeout=3, max=100
  Expires: Wed, 3 Oct 2007 13:13:46 GMT
  Cache-Control: max-age=86400, must-revalidate
09:25:18 ERROR 304: Not Modified.

From here, I can think of a couple of solutions:

1. MUST: Mylyn needs to send the correct If-Modified-Since header and expect a 304 in return if no changes have been made

2. SHOULD: Mylyn should store the size of the last repo transfer, then send a HEAD request and compare the current size.

3. SHOULD: Mylyn should not do more than 1 complete repo transfer per day. The repo doesn't change very often, so performing a sync more often is just wasteful

4. WOULD BE NICE: Mylyn could have a preference, default to 2 days, which would allow users to tweak how often it checks for changes.  This would allow users with low bandwidth to optimize Mylyn's usage according to their needs

In the interim, our current config.cgi will alternate between periods of 404 and 200 to spare our bandwidth.
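
The conditional request asked for in item 1 can be sketched end to end in Python. This is only an illustration: the small local server stands in for a cache-aware config.cgi (it is not the eclipse.org script and, for brevity, compares the header by string equality instead of parsing dates), and `fetch_if_modified` is a hypothetical helper, not Mylyn code.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<rdf>configuration</rdf>"
STAMP = "Tue, 02 Oct 2007 13:18:20 GMT"  # fixed Last-Modified for the demo server

# Bypass any proxy settings from the environment for the local demo.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class ConfigHandler(BaseHTTPRequestHandler):
    """Stand-in for a cache-aware config.cgi."""

    def do_GET(self):
        if self.headers.get("If-Modified-Since") == STAMP:
            self.send_response(304)  # client copy is current: no body is sent
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/rdf+xml")
        self.send_header("Last-Modified", STAMP)
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, *args):  # keep the demo quiet
        pass


def fetch_if_modified(url, last_modified=None):
    """Return (body, last_modified); body is None when the server answers 304."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    try:
        with _OPENER.open(request) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as error:
        if error.code == 304:  # urllib reports 304 as an HTTPError
            return None, last_modified
        raise


def demo():
    server = HTTPServer(("127.0.0.1", 0), ConfigHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/config.cgi"
    try:
        first = fetch_if_modified(url)             # full transfer
        second = fetch_if_modified(url, first[1])  # revalidation only
        return first, second
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` performs one full download followed by one revalidation; the second call transfers headers only, which is the saving the 304 response is meant to provide.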
Comment 35 Ketan Padegaonkar CLA 2007-10-02 13:16:55 EDT
Added bug 205213 to better understand things.
Comment 36 Mik Kersten CLA 2007-10-02 14:50:25 EDT
Denis: whenever we are in a period of 404 the Mylyn Bugzilla tool will be broken for any user who has not already created their queries (e.g. is a new user, creating a new query, refreshing attributes for a query).  In other words Mylyn will currently appear broken to a big portion of the users trying to use it on bugs.eclipse.org and we are in a very bad state.

Please note that the fix for this bug did not get released to the masses until last Friday's Europa Fall Update, to which people are just updating now.  In other words, all those Mylyn clients were not using the cached configuration in September (only the small subset of the early adopters of our more frequent builds would have been using those).

Since we're currently in a time of impending doom, could you please reactivate the config.cgi script while we figure out a solution that won't cause people accessing bugs.eclipse.org to break?  We fully appreciate the need to minimize bandwidth usage and will do anything that's needed to assist in this.  Hopefully we will see a huge drop once people update to the Europa Fall Update.  

In the meantime Rob and I will investigate further.
Comment 37 maarten meijer CLA 2007-10-02 16:41:23 EDT
 (In reply to comment #35)
> Added bug 205213 to better understand things.
Created patch to do just this in that bug
Comment 38 Mik Kersten CLA 2007-10-02 18:56:22 EDT
We are still in this very broken state, so I created bug 205249 against Community/Website.  I will inform users on the newsgroup right now and await a reply from the foundation either here or on that bug.  If we don't get a reply by early tomorrow we will need to announce this outage more widely since there have already been complaints.
Comment 39 Robert Elves CLA 2007-10-02 19:06:20 EDT
 (In reply to comment #34)
> 1. MUST: Mylyn needs to send the correct if-modified-since header and expect a
> 304 in return if no changes have been made

Currently the Bugzilla Connector sends a HEAD request to read the Last-Modified header, and if it differs from the last one seen, a full request is made.
 
> 3. SHOULD: Mylyn should not do more than 1 complete repo transfer per day. The
> repo doesn't change very often, so performing a sync more often is just wasteful

Unless manually requested by user, the configuration currently is only retrieved once a day (assuming Cached Configuration is enabled which it is by default). 

Comment 40 Gunnar Wagenknecht CLA 2007-10-03 02:36:49 EDT
(In reply to comment #39)
> > 3. SHOULD: Mylyn should not do more than 1 complete repo transfer per day. The
> > repo doesn't change very often, so performing a sync more often is just wasteful
> 
> Unless manually requested by user, the configuration currently is only
> retrieved once a day (assuming Cached Configuration is enabled which it is by
> default). 

From my experience with such large distributed systems that's still way too often. I suggest at maximum once per week. Even once per month should be well sufficient in terms of bugs.eclipse.org.
Comment 41 maarten meijer CLA 2007-10-03 03:40:27 EDT
 (In reply to comment #39)
> Unless manually requested by user, the configuration currently is only retrieved
> once a day (assuming Cached Configuration is enabled which it is by default).
I upgraded from an earlier Mylyn version and, checking just now, noticed that cached configuration is NOT on.
So the setting is not updated if you have eclipse.org defined from an earlier version of Mylyn, as, I suspect, the majority of users will have.

I think 'somebody' should hardcode a check for this and put up an annoyance dialog on startup if eclipse.org is NOT cached, with an option to enable it from the dialog.
I cannot think of a reason not to cache when eclipse.org supports it.
This of course until it is resolved in another way...
Comment 42 maarten meijer CLA 2007-10-03 04:07:39 EDT
 (In reply to comment #34)
> "not modified" header.  Each request translated to a complete transfer of our
> repository, which is 950KB in size.
Server side caching does indeed resolve the load on the CPU but not the bandwidth.
Client side caching reduces unnecessary requests, but we have to look also at size of each legitimate request.
  
config.cgi is very nicely formatted xml+rdf.
As a thought experiment I removed all extraneous spaces and newlines using BBEdit's format/compact.
Size count as now delivered: 996546 chars, 20419 lines
Size count with spaces and newlines stripped: 667460 chars, 2 lines

Whoops! Size reduced by 33%!!!!

We have to find out if this condensed form is properly parsed by Mylyn; if so, it provides a
quite easy way to instantly reduce the bandwidth requirement by 33% as follows:

Make your caching code (see comment #13 above) also strip these extra spaces and newlines...
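
A rough idea of what such stripping could look like is sketched below in Python. This is not the script attached to this bug; it assumes that whitespace between tags carries no meaning in the configuration file, which would not hold for XML with mixed content or preformatted text. `compact_xml` is a hypothetical name.

```python
import re

# Whitespace-only runs between a closing '>' and the next '<' are treated as
# layout; text inside elements is left untouched.
_BETWEEN_TAGS = re.compile(rb">\s+<")


def compact_xml(data: bytes) -> bytes:
    """Drop indentation and newlines between tags."""
    return _BETWEEN_TAGS.sub(b"><", data).strip()


sample = b"<rdf>\n  <products>\n    <li>Mylyn Tasks</li>\n  </products>\n</rdf>\n"
print(len(sample), "->", len(compact_xml(sample)), "bytes")
```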
Comment 43 maarten meijer CLA 2007-10-03 11:32:14 EDT
Created attachment 79649 [details]
Adapted config.cgi script that will cut about 30% of the cached config file

I've tried this on my own bugzilla install and Mylyn parses without problems.
It is also 30% smaller.
Comment 44 Denis Roy CLA 2007-10-03 11:53:51 EDT
(In reply to comment #42)
> Whoops! Size reduced by 33%!!!!

This is beautiful stuff -- I'll apply this today and re-post my config.cgi script (I added some basic locking too, as sometimes a couple of processes would regen the file simultaneously).
Comment 45 maarten meijer CLA 2007-10-03 12:39:52 EDT
 (In reply to comment #44)
> (In reply to comment #42)
> > Whoops! Size reduced by 33%!!!!
As there is so much redundancy in both the XML markup (tags) and the content (the lead part of the URL is the same for every item!!), it's also a prime candidate for on-the-fly compression, for example gzip.
My local stock config: 88854 bytes, trimmed using perl script: 62087 bytes, then gzipped: 4297 bytes.
Your config.cgi is 996546, trimmed it is 667460 and then gzipped would be 28501!!

That is just for the config.cgi. But every bug served to Mylyn is served as XML, how are those formatted (also much whitespace), can they be compressed easily?

Mik,
My suggestion would be: make the mylyn.web.core accept gzip as an encoding as well, and then add mod_gzip to eclipse.org to reduce the size of ALL Mylyn traffic. The mylyn.web.core is used for all web traffic after all, so every repository in the world can potentially benefit if they can send gzipped XML. That would really make Mylyn well behaved in terms of bandwidth, but not server CPU.

Denis,
As you're already caching the config.xml, it's no big deal to gzip this file once it's regenerated. That would leave the option open to use mod_gzip everywhere or just gzip the cached big config files.
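
The effect of this kind of redundancy on gzip can be illustrated with synthetic data. The Python sketch below builds a made-up document (the element name and URL are hypothetical, not taken from the real config.cgi), compresses it once, and checks that a client can restore it.

```python
import gzip

# Hypothetical stand-in for the real file: many entries sharing the same markup
# and URL prefix, which is the redundancy described above.
entries = [
    f'<li resource="https://bugs.example.invalid/bugs/component.cgi?id={i}"/>'
    for i in range(2000)
]
config_xml = ("<rdf>" + "".join(entries) + "</rdf>").encode("utf-8")

# Compress once when the cached file is regenerated, then serve it many times.
compressed = gzip.compress(config_xml)
ratio = len(compressed) / len(config_xml)
print(f"{len(config_xml)} bytes -> {len(compressed)} bytes ({ratio:.1%} of original)")

# A client that sends "Accept-Encoding: gzip" has to undo the encoding when the
# response carries "Content-Encoding: gzip".
assert gzip.decompress(compressed) == config_xml
```

The exact ratio depends on the data; for the real file the figures quoted above (996546 to 28501 bytes) correspond to about 3% of the original size.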
Comment 46 Denis Roy CLA 2007-10-03 12:58:25 EDT
(In reply to comment #45)
> candidate for on the fly compression, for example gzip.

On the fly compression comes at the expense of CPU power, although for something fairly static like config.cgi, I agree with you that gzip'ing it once then serving it thousands of times makes sense.

However, IMHO, if the config.cgi file is properly cached, and is stripped of extraneous characters, I think we've trimmed a substantial amount of fat within the realm of being practical.

Our bugzilla serves about a gazillion pages a day (more when we allow google/msn spiders to index us), so I can imagine mod_gzip having a noticeable impact on our CPUs. It's a constant struggle to walk the fine line between one bottleneck and another.
Comment 47 maarten meijer CLA 2007-10-03 13:10:03 EDT
 (In reply to comment #46)
> serving it thousands of times makes sense.
So Mylyn should accept gzip, as going from 996546 to 28501 is a reduction of over 90%!
Mik, maybe this should be a P1?
Comment 48 Mik Kersten CLA 2007-10-03 14:46:44 EDT
(In reply to comment #47)
> So Mylyn should accept gzip, as going from 996546 to 28501 is a reduction of over 90%!

We are happy to consume the file in the format that best suits eclipse.org's needs and (by having a customized Mylyn Bugzilla connector with additional handling and heuristics for the bugs.eclipse.org repository).  Our JIRA Connector already handles mod_gzip so that would be a straightforward change.  Alternatively we could consider retrieving a cached config.xml.zip type file that's updated on a schedule.  Either change would need to be backwards compatible with existing Mylyn clients by continuing to serve the current config.cgi unzipped.  Denis, if you propose a solution we can iterate and make sure our client can handle it gracefully and have a fallback if anything goes wrong.

Maarten: the most likely reason for your Cached Configuration setting getting turned off is, ironically, that your once-per-day background update of attributes happened just after config.cgi started 404'ing, which made Mylyn assume that the Cached Configuration was not available on bugs.eclipse.org and fall back to the standard Bugzilla config.cgi.  We could have Mylyn try to eagerly fall back to the cached configuration in this scenario as a special rule for bugs.eclipse.org.
Comment 49 maarten meijer CLA 2007-10-03 14:56:35 EDT
 (In reply to comment #48)
> Maarten: the most likely reason for your Cached Configuration setting getting
> turned off is, ironically, that your once-per-day background updated of
> attributes happened just after config.cgi started 404'ing, which made Mylyn
> assume that the Cached Configuration was not available on bugs.eclipse.org and
> fall back to the standard Bugzilla config.cgi.  We could have Mylyn try to
> eagerly fall back to the cached configuration in this scenario as a special rule
> for the bugs.eclipse.org.
To further this search for spots to optimize: where is the cached configuration stored?
It feels like it is stored per workspace, since as I alternate between my Mylyn and work workspaces during the day, each one seems to get updated on every switch.
Repository access information and cached config data should be stored once per machine, I feel.
Comment 50 Denis Roy CLA 2007-10-03 15:09:21 EDT
I've committed the eclipse.org config.cgi script to the Phoenix "Infra Scripts" repository.  We can use it as a reference for other sites, and we'll be able to accept code patches. Maarten, thanks for your contribution.

:pserver:anonymous@dev.eclipse.org:/cvsroot/technology/org.eclipse.phoenix/infra-scripts/bugzilla/

or

http://dev.eclipse.org/viewcvs/index.cgi/org.eclipse.phoenix/infra-scripts/bugzilla/?root=Technology_Project

Let's have config.cgi specific discussions via new bugs against Phoenix (cc'ing the Mylyn team) so that this bug can focus on the frequency/caching of Mylyn (which, from what I read, can perhaps be WORKSFORME).
Comment 51 maarten meijer CLA 2007-10-04 06:06:08 EDT
Created attachment 79705 [details]
Bugzilla client accepting gzipped configuration

As the config.cgi is already caching and stripping superfluous spaces for a 30% size reduction, why not go the whole way and gzip the resulting file as well?
The Bugzilla server can then decide to send the stripped version to older clients and the much smaller (-90%) gzip-encoded version to newer, patched clients.
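
For readers without the attachment, the client side of this can be sketched roughly as below. It is not the attached patch: the class `GzipAwareConfigReader` and its methods are hypothetical, and the sketch only assumes that a patched client sends `Accept-Encoding: gzip` and that the server marks a compressed body with `Content-Encoding: gzip`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipAwareConfigReader {
    /** Value a patched client would send in its Accept-Encoding request header. */
    static final String ACCEPT_ENCODING = "gzip";

    /** Reads the response body, inflating it only if Content-Encoding says gzip. */
    static String readConfig(InputStream body, String contentEncoding) {
        boolean gzipped = contentEncoding != null
                && contentEncoding.trim().equalsIgnoreCase("gzip");
        try (InputStream in = gzipped ? new GZIPInputStream(body) : body) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper standing in for the server side: gzip a string in memory. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Older servers that ignore the request header simply return the plain body with no `Content-Encoding`, and the same method reads it unchanged, which keeps the change backwards compatible.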
Comment 52 maarten meijer CLA 2007-10-04 06:49:39 EDT
Created bug 205416: In order to reduce traffic, the bugzilla should not only cache its config but also gzip it
https://bugs.eclipse.org/bugs/show_bug.cgi?id=205416
to make the cached config also available gzipped.
If that is done, and the patch is accepted into Mylyn, we've cut 90% of the traffic generated by the config retrieval, and the need for reduced frequency may go away.
Comment 53 Robert Elves CLA 2007-10-04 14:14:10 EDT
 (In reply to comment #52)
> created bug 205416: In order to reduce traffic, the bugzilla should not only
> cache its config but also gzip it
> https://bugs.eclipse.org/bugs/show_bug.cgi?id=205416
> To make cached config also available gzipped.
> If that is done, and the patch accepted into Mylyn we've cut 90% of the traffic
> generated by the config retrieval and the need for reduced frequency may go
> away.

Great. Patch applied. 

I'll update the connector so that it attempts to use the 'cached config' on each workbench start. That way, if it fails for one session due to unavailability etc., it has a chance to re-enable upon restart. This will also eliminate the need for the preference on the settings page.
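
A minimal sketch of that retry-per-session behaviour, assuming nothing about the actual connector code; `CachedConfigSwitch` and its method names are made up for illustration.

```java
/** Hypothetical per-session switch for the 'cached config' lookup. */
final class CachedConfigSwitch {
    private boolean disabledForSession = false;

    /** Call once at workbench start: every new session gets a fresh attempt. */
    void onWorkbenchStart() {
        disabledForSession = false;
    }

    /** True while no failure has been seen in the current session. */
    boolean shouldTryCachedConfig() {
        return !disabledForSession;
    }

    /** Records a response status; errors such as 404 disable the lookup until restart. */
    void onResponse(int httpStatus) {
        if (httpStatus < 200 || httpStatus >= 400) {
            disabledForSession = true;
        }
    }
}
```

Because the flag lives only in memory, a temporary 404 can no longer switch the feature off permanently, so no user-visible preference is needed to switch it back on.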
Comment 54 Robert Elves CLA 2007-10-04 18:56:37 EDT
Future Mylyn builds will now re-attempt to use the Last-Modified header upon subsequent workbench cycles or changes to the repository.
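
As a rough illustration of a Last-Modified check (not the connector's actual code), the decision could look like the sketch below. The class `ConfigFreshness` is hypothetical; the only library behaviour relied on is that `java.net.URLConnection.getLastModified()` returns the header as epoch milliseconds, or 0 when it is not known.

```java
/** Hypothetical helper deciding whether the repository configuration must be re-downloaded. */
final class ConfigFreshness {
    /**
     * @param serverLastModified value as returned by URLConnection.getLastModified(),
     *                           i.e. epoch milliseconds, or 0 if the header is absent
     * @param cachedLastModified value stored when the local copy was downloaded
     */
    static boolean needsDownload(long serverLastModified, long cachedLastModified) {
        if (serverLastModified == 0L) {
            return true; // no header: the local copy cannot be shown to be current
        }
        return serverLastModified != cachedLastModified;
    }
}
```

With a check like this, only the response headers need to be fetched on most synchronizations, and the full configuration is transferred only after it has actually changed on the server.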
Comment 55 Mik Kersten CLA 2007-10-04 20:36:49 EDT
Thanks for the excellent contribution Maarten!

Denis: if you could check into our bandwidth usage in a couple of weeks, once Mylyn 2.1 has had a chance to get out there, it would be great to know how we're doing.
Comment 56 maarten meijer CLA 2007-10-05 10:01:44 EDT
The gzip code does not handle a 302 redirect to a previously gzipped file, as would be the case when mod_rewrite is used to lower traffic.
 
Comment 57 maarten meijer CLA 2007-10-05 10:05:43 EDT
Created attachment 79806 [details]
Code should handle redirect to gzipped file as well

When using a modified config.cgi, the Accept-Encoding: gzip and Content-Encoding: gzip headers can be used, but when load is distributed using mod_rewrite (in combination with BrowserMatch, bug 205213) the redirect ends up on a plain gzipped file, which will be sent with a Content-Type: application/x-gzip header.

This patch handles both cases, so that webmasters of all Bugzilla installations have the freedom to choose their setup (perl, mod_rewrite, mod_gzip, etc.).
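
The two cases described above can be told apart from the response headers alone. The following is a sketch of such a check, not the code in the attachment; `GzipResponseDetector` is a hypothetical name, and accepting `application/gzip` in addition to `application/x-gzip` is an extra assumption of this sketch.

```java
import java.util.Locale;

/** Hypothetical check covering both ways a gzipped config can arrive. */
final class GzipResponseDetector {
    static boolean isGzipped(String contentEncoding, String contentType) {
        // Case 1: negotiated compression, announced with Content-Encoding: gzip.
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return true;
        }
        // Case 2: redirect to a static .gz file, announced only through its Content-Type.
        if (contentType == null) {
            return false;
        }
        // Ignore parameters such as "; charset=..." after the media type.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/x-gzip")
                || mediaType.equals("application/gzip"); // assumption: also accept the registered type
    }
}
```

When the check returns true, the client would wrap the response stream in a `java.util.zip.GZIPInputStream` before parsing; otherwise it reads the body as plain XML.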
Comment 58 maarten meijer CLA 2007-10-05 10:05:48 EDT
Created attachment 79807 [details]
mylyn/context/zip
Comment 59 Robert Elves CLA 2007-10-05 10:42:44 EDT
Patch applied, ip log updated. 
Comment 60 Robert Elves CLA 2007-10-05 10:46:28 EDT
Marking fixed. Thanks for all your input and contributions Maarten!
Comment 61 maarten meijer CLA 2007-10-08 08:14:15 EDT
To generalize the lower bandwidth use, I created bug 205708: Make Mylyn accept gzip encoded data on all requests
https://bugs.eclipse.org/bugs/show_bug.cgi?id=205708