Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 315221 - [DB] Connection-Keep-Alive-DBStoreAccessor threads never end
Summary: [DB] Connection-Keep-Alive-DBStoreAccessor threads never end
Status: CLOSED FIXED
Alias: None
Product: EMF
Classification: Modeling
Component: cdo.db (show other bugs)
Version: 3.0   Edit
Hardware: PC All
: P3 blocker (vote)
Target Milestone: ---   Edit
Assignee: Eike Stepper CLA
QA Contact: Eike Stepper CLA
URL:
Whiteboard: offline-01
Keywords:
Depends on:
Blocks:
 
Reported: 2010-06-01 10:08 EDT by Erwin Betschart CLA
Modified: 2010-06-29 04:35 EDT (History)
8 users (show)

See Also:
sven.efftinge: pmc_approved+
Ed.Merks: pmc_approved+
Kenn.Hussey: pmc_approved+
martin.fluegge: review+
mtaal: review+
stefan: review+


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Erwin Betschart CLA 2010-06-01 10:08:37 EDT
Build Identifier: 3.0

For every commit on a offline clone a new Connection-Keep-Alive-DBStoreAccessor is created and never shutdown.

Stacktrace where the db accessor is created:
OfflineClone$WriteThroughCommitContext(TransactionCommitContext).preWrite() line: 236	
CommitTransactionIndication.indicatingCommit(CDODataInput, OMMonitor) line: 185	
CommitTransactionIndication.indicating(CDODataInput, OMMonitor) line: 163	
CommitTransactionIndication.indicating(ExtendedDataInputStream, OMMonitor) line: 118	CommitTransactionIndication(IndicationWithMonitoring).indicating(ExtendedDataInputStream) line: 84	CommitTransactionIndication(IndicationWithResponse).doExtendedInput(ExtendedDataInputStream) line: 90	CommitTransactionIndication(Signal).doInput(BufferInputStream) line: 315	
CommitTransactionIndication(IndicationWithResponse).execute(BufferInputStream, BufferOutputStream) line: 63
CommitTransactionIndication(IndicationWithMonitoring).execute(BufferInputStream, BufferOutputStream) line: 63	
CommitTransactionIndication(Signal).runSync() line: 240	
CommitTransactionIndication(Signal).run() line: 146	
ThreadPoolExecutor$Worker.runTask(Runnable) line: 886	
ThreadPoolExecutor$Worker.run() line: 908	
Thread.run() line: 619	




Reproducible: Always
Comment 1 Eike Stepper CLA 2010-06-03 07:28:07 EDT
Stefan, as a side note, was there a particular reason that we create a dedicated java.util.Timer per accessor instance/activation? I think the whole purpose of Timers is sharing them as much as possible...
Comment 2 Eike Stepper CLA 2010-06-03 08:13:43 EDT
I'm still not 100% sure but I think it's "just" a matter of accessor *pooling*. I.e. the postCommit() method of the TransactionCommitContext is in fact called and the accessor release to its pool. The connectionKeepAliveTimer is properly surving the pooled phase. I'd like to further investigate if there's a problem with pooling in a way that pooled accessors stay too long or indefinitely in the pool because they're not reused later. In that case we could make the whole accessor pooling configurable. Currently you can only disable it by overriding:

org.eclipse.emf.cdo.server.internal.db.DBStore.getReaderPool(ISession, boolean)
org.eclipse.emf.cdo.server.internal.db.DBStore.getWriterPool(IView, boolean)

I will also clarify if we should better only create a single timer (thread) per store and not one per each store accessor...
Comment 3 Stefan Winkler CLA 2010-06-04 03:07:05 EDT
Eike,

generally, the task of the Timer-Thread is to issue an SQL to the database every x hours because MySQL and maybe other DBMS close the connection at their side if there is too much idle time. Because the timing was connection-based, creating a timer for each connection was the most intuituve implementation.

Of course, this could be refactored to a single timer thread which is associated to the store and which goes through all of the accessors in the pool every x hours and refreshes the connection once and for all. 

Basically the requirement is: Send an SQL in each open DB connection every x hours.
Comment 4 Eike Stepper CLA 2010-06-04 03:14:32 EDT
In WriteThroughCommitContext this one was missing:

    @Override
    public void preWrite()
    {
      // Do nothing
    }
Comment 5 Eike Stepper CLA 2010-06-04 03:16:44 EDT
(In reply to comment #3)
> Of course, this could be refactored to a single timer thread which is associated
> to the store and which goes through all of the accessors in the pool every x
> hours and refreshes the connection once and for all.

Yes, I just changed that ;-)

connectionKeepAliveTimer is no located in DBStore. Only the tasks stay in the accessor class and are scheduled from there once per accessor instance (but periodically, every 4 hours)
Comment 6 Eike Stepper CLA 2010-06-04 03:18:50 EDT
Committed to HEAD
Comment 7 Eike Stepper CLA 2010-06-11 07:32:46 EDT
Two issues remain:

1) Don't propagate unchecked exceptions from TimerTask.run()
2) cancel() the TimerTask before connection.close()
Comment 8 Eike Stepper CLA 2010-06-11 12:01:42 EDT
Committed to HEAD
Comment 9 Eike Stepper CLA 2010-06-13 01:01:25 EDT
This bug is critical because it definitely leads to the death of a repository instance after exactly 1 hour. The trivial fix is absolutely needed in 3.0 GA because every user would hit this problem after 1 hour of running a repository on a JDBC database.

As this is a post RC4 fix we need:

- Two review+ flags from Committers (Martin F., Stefan and Martin T.?) 
- Two additional pmc_approved+ flags from PMC members (Ed, Kenn and Sven?).
Comment 10 Eike Stepper CLA 2010-06-13 01:05:28 EDT
Please note that the important part of this bug to review for late contribution to Helios is described in comment #7 and has been committed as logged in comment #8.
Comment 11 David Williams CLA 2010-06-14 09:23:21 EDT
I notice the bug is marked "P3, normal". that's probably not accurate for 
a bug so important it requires a fix after RC4?  I suggest you make those accurate (though, I'll admit, I don't know the rules your project normally follows for marking bugs ... maybe those are accurate for your project?). 

Also, normally, a bug to be "blocking" (requiring a respin after RC4) some
discussion of work-arounds or alternatives would be appropriate. For example, is the work-around in this case that users have to restart Eclipse? (once every hour? :(

And, I assume, this would be a regression from previous release? 

Also, typically, some discussion of why this was not discovered until so late would be in order.  Do testing procedures need to be improved? Or, just a rare path that is only hit in special cases? etc. 

Some of this is probably obvious to you in the field ... but, since asking for a respin of helios common repo, it should be described for the casual reader ... like me :) 

Thanks,
Comment 12 Eike Stepper CLA 2010-06-14 12:35:03 EDT
(In reply to comment #11)
> I notice the bug is marked "P3, normal". that's probably not accurate for 
> a bug so important it requires a fix after RC4?  I suggest you make those
> accurate (though, I'll admit, I don't know the rules your project normally
> follows for marking bugs ... maybe those are accurate for your project?). 

The rules are: The reporter gives the severity and we mark the priority. The latter was not necessary because the fix was just two lines. Also we chould have filed a separate bug instead of reopening after my first fix to the original problem. That's why I chose to mark it critical in comment #9.

> Also, normally, a bug to be "blocking" (requiring a respin after RC4) some
> discussion of work-arounds or alternatives would be appropriate. For example,
> is the work-around in this case that users have to restart Eclipse?
> (once every hour? :(

There is no workaround. CDO involves running a model repository server with a database connection. Mysql closes inactive connections after 4 hours. So we added a Timer that simply touches the connection every hour. With this mechanism the 2 issues mentioned briefly in comment #7 were introduced and these definitely lead to a crash in the model repository server after exactly 1 hour (for all database types).

> And, I assume, this would be a regression from previous release? 

No.

> Also, typically, some discussion of why this was not discovered until so late
> would be in order.  Do testing procedures need to be improved? Or, just a rare
> path that is only hit in special cases? etc. 

Please see my explanation above.

> Some of this is probably obvious to you in the field ... but, since asking for
> a respin of helios common repo, it should be described for the casual reader
> ... like me :) 

Yes, obvious to all the other involved parties but I understand that in this late case we should have been more precise on these explanations. Do you think it is clear now?
Comment 13 David Williams CLA 2010-06-14 12:47:50 EDT
yes, very clear. Thank you. Now ... for when? Earlier the better. Will you be ready at 2 PM eastern today for a respin? (And, by ready, I mean fix in, someone on stand-by in case the build breaks :)
Comment 14 Eike Stepper CLA 2010-06-14 12:52:06 EDT
(In reply to comment #13)
> yes, very clear. Thank you. Now ... for when? Earlier the better. Will you be
> ready at 2 PM eastern today for a respin?

That seems to be in 1:10 hours from now. I might need a couple of additional minutes. Just for the case that signing takes longer. But generally yes ;-)


> (And, by ready, I mean fix in,
> someone on stand-by in case the build breaks :)

I'll stay online or check my iphone for mails until I see a green aggregation!
Comment 15 Eike Stepper CLA 2010-06-29 04:35:39 EDT
Available in 3.0 GA:
http://download.eclipse.org/modeling/emf/cdo/updates/3.0-releases/