Bug 323609

Summary: RollingFile appender does not roll but copies files if logging volume becomes large
Product: [RT] Virgo
Component: unknown
Reporter: Ric Klaren <ric.klaren>
Assignee: Glyn Normington <glyn.normington>
Status: CLOSED FIXED
QA Contact:
Severity: normal
Priority: P3
CC: ceki, dvooster, glyn.normington, ric.klaren, zteve.powell
Version: 2.1.0.M03-incubation
Target Milestone: 2.1.0.M04-incubation
Hardware: PC
OS: Windows XP
Whiteboard:
Attachments:
- Logback debug output
- Stack traces of two RollingFileAppender.starts during Virgo startup

Description Ric Klaren 2010-08-25 09:23:49 EDT
Build Identifier: virgo-web-server-2.1.0.M03-incubation

Originally observed on DM Server 2. Reproduced on Virgo M3.

If the load on the logging (or the log level) gets high enough, the RollingFileAppender no longer rolls the files but leaves a set of files with identical beginnings (only some differences in size and trailing data). Once the rolling breaks, it does not seem to recover even if the load drops.

If the load or logging volume is not high, the rolling functions as expected.

It does not occur on Linux with the same configuration.

We observe that on Windows two file handles are open on the log file from the same VM; we suspect this is the root cause of the issue. This second file handle is not present if we attempt to recreate the issue in a separate test case with plain logback. So we suspect that the issue is with how DM server/Virgo uses logback.

Adapting the logging configuration to one that rolls and allows the prudent setting fixes the issue (but leaves us with a non-ideal logging configuration).


Reproducible: Always

Steps to Reproduce:
1. Take a vanilla Virgo M3.
2. Set the root logger to debug level.
3. Start.
4. Pick up the pieces.
Comment 1 Steve Powell 2010-08-25 11:33:41 EDT
Thank you for the bug report.

I gather from your description that you are executing on Windows -- please can you be explicit about the platform you see this problem on; we don't see it on the Linux or Mac systems we use in testing (and we haven't recreated it on Windows, but that is because we haven't tried).

There are serious performance issues for high levels of logging on Windows which caused us recently to change the logging level of entry and exit tracing. We also know of a problem with the previous version of logback which meant that rolling can sometimes start writing to all historical log files simultaneously (which would give the symptoms you recount). (See http://jira.qos.ch/browse/LBCLASSIC-203 for details of the logback problem.)

Therefore we are very interested to see whether you are using the current version of Virgo and whether it (in turn) is using logback 0.9.21 or later and slf4j 1.6.0 or later. The problem should have been fixed in these versions. If not, we would have to push back to logback for the solution.

Your suspicion,

| we suspect that the issue is with how DM
| server/Virgo uses logback.

though understandable, is extremely unlikely: Virgo doesn't use logback other than in medic, and then via the slf4j API. This usage in Virgo is platform agnostic, though the logback implementation is not -- just as the problem symptoms appear to be platform-specific. As far as we are aware, the logback problem is unlikely to be detected in an application with a small number of logging threads. Virgo has quite a few threads logging and so logs rather hard and fast (especially when level=DEBUG).

I suppose you could say it is 'the way we use it' though the issue is probably elsewhere.
Comment 2 Ric Klaren 2010-08-25 13:22:58 EDT
(In reply to comment #1)
> I gather from your description that you are executing on Windows -- please can
> you be explicit about the platform you see this problem on; we don't see it on
> the Linux or Mac systems we use in testing (and we haven't recreated it on
> Windows, but that is because we haven't tried).

See the selected platform in the report: Windows XP, though we also see it on Windows Server 2003.

> There are serious performance issues for high levels of logging on Windows
> which caused us recently to change the logging level of entry and exit tracing.
> We also know of a problem with the previous version of logback which meant that
> rolling can sometimes start writing to all historical log files simultaneously
> (which would give the symptoms you recount). (See
> http://jira.qos.ch/browse/LBCLASSIC-203 for details of the logback problem.)
> 
> Therefore we are very interested to see whether you are using the current
> version of Virgo and whether it (in turn) is using logback 0.9.21 or later and
> slf4j 1.6.0 or later. The problem should have been fixed in these versions.
> If not, we would have to push back to logback for the solution.

I had filled in Virgo Milestone 3 in what seemed an appropriate field (the Build Identifier; see the original description). (In the version pulldown, I just now notice, M3 isn't listed, by the way.) But indeed this is with Milestone 3, including logback 0.9.24 inside the medic bundle and the SLF4J API 1.6.1.

We were aware of the logback issue and explicitly moved our DM2 application to Virgo M3 to see if the issue we saw on DM2 would be resolved by upgrading (as was to be expected given the work done by the Virgo team previously). We were not that enthusiastic about unpacking the medic bundle and meddling with the logback inside it.

> Your suspicion,
> 
> | we suspect that the issue is with how DM
> | server/Virgo uses logback.
> 
> though understandable, is extremely unlikely: Virgo doesn't use logback other
> than in medic, and then via the slf4j API. This usage in Virgo is platform
> agnostic, though the logback implementation is not -- just as the problem
> symptoms appear to be platform-specific. As far as we are aware, the logback
> problem is unlikely to be detected in an application with a small number of
> logging threads. Virgo has quite a few threads logging and so logs rather
> hard and fast (especially when level=DEBUG).
> 
> I suppose you could say it is 'the way we use it' though the issue is probably
> elsewhere.

First we noticed the issue when load testing our application with DM2. Next we upgraded to Virgo M3 and still saw the rolling bug, which we had expected to go away.

As an extra check I grabbed the vanilla M3 zip and reproduced it the blunt way by setting the log level to debug, thus eliminating our application's interactions with logback from the equation. Of course it is a pretty convoluted way to reproduce the issue, but the fact remains that under load the rolling fails. And we have no clue how hard you have to load it to trigger the problem.

So unless I've missed something, things more or less point to DM2/Virgo. As circumstantial evidence I can mention that there are two logback MBeans registered in the DM server: one under the medic MBean, which does not work properly for setting log levels (or maybe it belongs to the kernel region), and one under the normal logback name, which can be used to change logging at runtime (but is partially broken for reloads, on DM2).

Note that when the rolling starts to break, application performance is severely impacted (over and above the original load put on it).
Comment 3 Steve Powell 2010-08-26 06:16:18 EDT
> See the selected platform in the report: Windows XP, though we also see it on
> Windows Server 2003.
My mistake -- thanks.

> (In the version pulldown, I just now notice, M3 isn't listed, by the way.)
We know about this, but can't fix it until appropriate authorities return from vac.  :-)

> We were not that enthusiastic about unpacking the medic bundle and meddling with the
> logback inside it.
I think that is wise; although it is not really that hard, it requires full ripple builds afterwards to update the dependencies all the way up.

What you might have missed is the assumption that logback actually did fix this problem in the version we now use. I'm inclined to think it is a logback issue.

However, this is a valid bug against Virgo, so this bugzilla remains open until more investigation can be undertaken. We would have to reproduce this problem in a smaller 'app' than Virgo before we could pass it on to logback.

It is great that you have been able to run quite a few experiments, and thank you for reporting your results here. We are unlikely to be able to resolve this issue before Virgo's proper release, as it might very well mean updating to a yet later logback version -- and dependency version changes are unlikely before then.

Can you review the importance field of this bugzilla and indicate here how seriously this affects you?
Comment 4 Ric Klaren 2010-08-27 03:17:15 EDT
I guess I would leave it at the current setting.

We're already seeing the rolling problem with some non-mission-critical logging in test environments at customers. We have some room to trim down the logging, but it would be annoying to run into an issue and then find that the logs contain the same outdated content n times over (as such I'd call it major).

We tried a test case with the same logback version and multi-threaded writing, but were unable to reproduce the double file handle issue.

We'll start investigating some workarounds.

Is it possible to run without the medic bundle and use vanilla logback?
Comment 5 Daniël van 't Ooster 2010-08-31 08:41:37 EDT
Created attachment 177829 [details]
Logback debug output

This file contains the console output after setting the debug="true" attribute on the configuration element in serviceability.xml. As you can see, the output from line 6 to 39 is 'repeated' in line 41 to 75. When the server is started, you can see two handles on the logfile. When adding the <jmxConfigurator/> element, two MBeans are registered, each with a different name: 'default' and 'org.eclipse.virgo.medic'.

Summarized: it seems logback is initialized twice, preventing the RollingFileAppender from doing its job properly.
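
With <jmxConfigurator/> enabled, the duplicate initialization can also be confirmed from the running VM by listing the registered logback MBeans. A rough standalone sketch (the class name is mine; it needs to run inside the server VM):

    import java.lang.management.ManagementFactory;
    import java.util.Set;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class ListLogbackMBeans {
        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // JMXConfigurator registers its MBeans under the ch.qos.logback.classic domain
            Set<ObjectName> names =
                    server.queryNames(new ObjectName("ch.qos.logback.classic:*"), null);
            for (ObjectName name : names) {
                System.out.println(name); // one entry per logger context
            }
        }
    }

Two entries here correspond to the two contexts ('default' and 'org.eclipse.virgo.medic').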

BTW, an easy way to reproduce the problem: set the root logger level to debug or trace and just start the server. On my machine this gives enough load to reproduce the problem.
Comment 6 Ric Klaren 2010-08-31 09:50:27 EDT
Created attachment 177836 [details]
Stack traces of two RollingFileAppender.starts during Virgo startup

Stack traces taken from Virgo startup, after trimming serviceability.xml down to a single RollingFileAppender and placing a breakpoint in RollingFileAppender.start() (where the file handle is opened).
Comment 7 Glyn Normington 2010-08-31 12:27:45 EDT
Thanks for those stack traces. We'll dig into this soon.
Comment 8 Steve Powell 2010-09-01 04:59:23 EDT
Starting analysis
Comment 9 Ric Klaren 2010-09-01 10:37:53 EDT
We spent some more time testing. The root cause of all the rolling problems is two file handles open on the same file; combined with the lack of error handling in FixedWindowRollingPolicy, this leads logback to fall back to copying files where a rename operation was intended, creating N logfiles with near-identical content and a lot of I/O. These copy operations kill the performance of the server.
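
The underlying Windows behaviour is easy to demonstrate outside Virgo. A minimal sketch (file names made up) of why the rename fails while a second handle is open:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class RenameWhileOpen {
        public static void main(String[] args) throws IOException {
            File log = new File("app.log");
            FileOutputStream first = new FileOutputStream(log);        // handle #1
            FileOutputStream second = new FileOutputStream(log, true); // handle #2, as in the bug
            first.write("some log line\n".getBytes());
            first.flush();
            // On Windows, renaming a file fails while any handle on it is still open;
            // FixedWindowRollingPolicy then silently falls back to copying the file.
            boolean renamed = log.renameTo(new File("app.log.1"));
            System.out.println("renamed? " + renamed); // typically false on Windows
            second.close();
            first.close();
        }
    }

On Windows this prints "renamed? false" while the handles are open; on Linux the rename typically succeeds, which matches the platform split we see.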

Switching to TimeBasedRollingPolicy prevents the copy problems (though not all rolling will work, because files are sometimes locked at the time of rolling). The prudent setting is not necessary in this setup (if you don't mind a logfile occasionally not getting cleaned up during cycling). It seems somewhat random which file gets stuck.

At first we suspected that the prudent setting would be necessary (for which the logback documentation advertises logging about three times slower). Testing showed that prudent did not have much impact on our application's logging (almost none).
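
For reference, expressed programmatically rather than in serviceability.xml, our workaround amounts to roughly the following sketch (paths and patterns are made up):

    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
    import ch.qos.logback.classic.spi.ILoggingEvent;
    import ch.qos.logback.core.rolling.RollingFileAppender;
    import ch.qos.logback.core.rolling.TimeBasedRollingPolicy;
    import org.slf4j.LoggerFactory;

    public class TimeBasedWorkaround {
        public static void main(String[] args) {
            LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

            RollingFileAppender<ILoggingEvent> appender = new RollingFileAppender<ILoggingEvent>();
            appender.setContext(context);
            appender.setFile("serviceability/logs/log.log"); // hypothetical path

            // Time-based rolling renames once per period instead of cascading
            // renames through a fixed window, so one stuck handle no longer
            // degenerates into N file copies.
            TimeBasedRollingPolicy<ILoggingEvent> policy = new TimeBasedRollingPolicy<ILoggingEvent>();
            policy.setContext(context);
            policy.setParent(appender);
            policy.setFileNamePattern("serviceability/logs/log.%d{yyyy-MM-dd_HH}.log");
            policy.setMaxHistory(4);
            policy.start();

            PatternLayoutEncoder encoder = new PatternLayoutEncoder();
            encoder.setContext(context);
            encoder.setPattern("%d %-5level %logger - %msg%n");
            encoder.start();

            appender.setRollingPolicy(policy);
            appender.setEncoder(encoder);
            // appender.setPrudent(true); // optional; prudent mode requires leaving
            //                            // the 'file' property unset
            appender.start();

            Logger root = context.getLogger("ROOT");
            root.addAppender(appender);
        }
    }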

So currently we have a workaround.
Comment 10 Steve Powell 2010-09-01 12:24:34 EDT
Hey Ric!

Thanks for the extra testing -- we have been doing some looking ourselves; we can reproduce this problem on Windows with just the kernel (reduce the rolling log file threshold to about 100KB to see it) and we have narrowed the problem down to the file handles, as you reported.

What seems to be happening is that the default logger context (which logback insists on having) is set up (unnecessarily for us) early on -- and this sets its appenders to be identical to our usual ones. Hence two handles pointing to the same file(s).
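
In outline, the effect is as if we configured two independent contexts from the same file. A rough sketch (illustrative file name, not our actual initialisation code):

    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.joran.JoranConfigurator;
    import ch.qos.logback.core.joran.spi.JoranException;

    public class DuplicateContexts {
        public static void main(String[] args) throws JoranException {
            LoggerContext defaultContext = new LoggerContext();
            LoggerContext medicContext = new LoggerContext();
            // Each context configured from the same file opens its own
            // handle on the same log file(s).
            for (LoggerContext context : new LoggerContext[] { defaultContext, medicContext }) {
                JoranConfigurator configurator = new JoranConfigurator();
                configurator.setContext(context);
                configurator.doConfigure("serviceability.xml"); // same appenders, twice
            }
        }
    }

Each doConfigure call starts its own RollingFileAppender -- hence the two RollingFileAppender.start() stack traces Ric attached.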

We are experimenting here with an alternative default logger context, so that early in initialisation we can have one, but arrange that it doesn't collide with the configured one.

More news later, when we have figured out what the default context should be, and whether we need one at all!

Steve
Comment 11 Glyn Normington 2010-09-02 12:18:15 EDT
We have changed the Virgo medic component to avoid the duplicate logging contexts and file handles, and this fixes the bug.

Milestone 4, which will include this fix, should be available soon.
Comment 12 Ric Klaren 2010-09-03 03:19:02 EDT
Thanks!

Looking forward to giving M4 a try.

Cheers,

Ric
Comment 13 Glyn Normington 2010-09-03 03:49:14 EDT
Fixed in 2.1.0.M04-incubation. Please verify in your environment and mark this bug as closed if the fix works for you.
Comment 14 Ric Klaren 2010-09-03 07:29:03 EDT
* Confirmed that only one file handle is open on the logfiles with M4.
* Also tested that it rolls properly with a logging configuration that previously gave problems.

Thanks!

Ric
Comment 15 Steve Powell 2010-09-03 12:25:56 EDT
Now accepted as fixing the bug in M04-incubation.
Comment 16 Missing name 2010-10-13 16:10:16 EDT
Here are a few quick remarks which could help.

1) It is always a good idea to register a status listener when debugging logback-related issues, so that status messages generated during and after logback configuration are visible. See http://logback.qos.ch/manual/configuration.html#statusListener (a rough sketch combining these three suggestions follows after point 3).

I suspect that if an OnConsoleStatusListener had been registered, you would have seen messages describing the contention problem on the console, easing identification of the problem.

2) Regarding the default logger context: it is always possible to reset it so that its prior configuration is discarded.

3) Given that multiple logger contexts are in use, you can direct your context selector to use different configuration files for different contexts. See for example the conventionalConfigFileName method [1] in ContextJNDISelector, which defines "logback-" + contextName + ".xml" as a convention for configuration file names.
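
A rough standalone sketch of the three points above (the context name is just an example):

    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.core.status.OnConsoleStatusListener;
    import org.slf4j.LoggerFactory;

    public class LogbackHints {
        public static void main(String[] args) {
            LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

            // 1) Print status messages, including file-contention warnings,
            //    to the console as they are generated.
            context.getStatusManager().add(new OnConsoleStatusListener());

            // 2) Drop whatever configuration the default context has
            //    accumulated so far.
            context.reset();

            // 3) One configuration file per logger context, by naming convention.
            String contextName = "org.eclipse.virgo.medic"; // example context name
            String configFileName = "logback-" + contextName + ".xml";
            System.out.println("would configure from: " + configFileName);
        }
    }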

HTH,

[1] http://logback.qos.ch/xref/ch/qos/logback/classic/selector/ContextJNDISelector.html#118
Comment 17 Glyn Normington 2010-10-14 03:46:50 EDT
(In reply to comment #16)
> Here are a few quick remarks which could help.

Thanks, Ceki. I raised bug 327739 to capture the status listener enhancement. As we are now down to a single logging context, I think we are in good shape for the upcoming release, and we can add the status listener in the next release.

Thanks again for Logback and especially for its exemplary documentation - it's obvious how much care you took!

Regards,
Glyn