Build Identifier: 3.0

For every commit on an offline clone a new Connection-Keep-Alive-DBStoreAccessor is created and never shut down.

Stack trace where the DB accessor is created:

OfflineClone$WriteThroughCommitContext(TransactionCommitContext).preWrite() line: 236
CommitTransactionIndication.indicatingCommit(CDODataInput, OMMonitor) line: 185
CommitTransactionIndication.indicating(CDODataInput, OMMonitor) line: 163
CommitTransactionIndication.indicating(ExtendedDataInputStream, OMMonitor) line: 118
CommitTransactionIndication(IndicationWithMonitoring).indicating(ExtendedDataInputStream) line: 84
CommitTransactionIndication(IndicationWithResponse).doExtendedInput(ExtendedDataInputStream) line: 90
CommitTransactionIndication(Signal).doInput(BufferInputStream) line: 315
CommitTransactionIndication(IndicationWithResponse).execute(BufferInputStream, BufferOutputStream) line: 63
CommitTransactionIndication(IndicationWithMonitoring).execute(BufferInputStream, BufferOutputStream) line: 63
CommitTransactionIndication(Signal).runSync() line: 240
CommitTransactionIndication(Signal).run() line: 146
ThreadPoolExecutor$Worker.runTask(Runnable) line: 886
ThreadPoolExecutor$Worker.run() line: 908
Thread.run() line: 619

Reproducible: Always
Stefan, as a side note: was there a particular reason that we create a dedicated java.util.Timer per accessor instance/activation? I think the whole purpose of Timers is that they are shared as much as possible...
I'm still not 100% sure, but I think it's "just" a matter of accessor *pooling*. I.e., the postCommit() method of the TransactionCommitContext is in fact called and the accessor is released to its pool. The connectionKeepAliveTimer properly survives the pooled phase. I'd like to investigate further whether there's a problem with pooling in the sense that pooled accessors stay too long, or indefinitely, in the pool because they're never reused later. In that case we could make the whole accessor pooling configurable. Currently you can only disable it by overriding:

org.eclipse.emf.cdo.server.internal.db.DBStore.getReaderPool(ISession, boolean)
org.eclipse.emf.cdo.server.internal.db.DBStore.getWriterPool(IView, boolean)

I will also clarify whether we should rather create a single timer (thread) per store instead of one per store accessor...
Eike, generally, the task of the Timer thread is to issue an SQL statement to the database every x hours, because MySQL (and maybe other DBMSs) closes the connection on its side if it is idle for too long. Because the timing was connection-based, creating a timer for each connection was the most intuitive implementation. Of course, this could be refactored into a single timer thread associated with the store, which goes through all of the accessors in the pool every x hours and refreshes the connections once and for all. Basically the requirement is: send an SQL statement on each open DB connection every x hours.
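The single-timer-per-store variant described above could be sketched roughly like this. All names (StoreKeepAlive, PooledConnection, touch) are illustrative placeholders, not actual CDO API; a real implementation would issue something like "SELECT 1" on a JDBC connection:

```java
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: one Timer (thread) per store that periodically
// touches every open connection in the accessor pool, instead of one
// dedicated Timer per accessor instance.
class StoreKeepAlive
{
  // Stand-in for a pooled JDBC connection; touch() would execute a
  // trivial statement such as "SELECT 1" to reset the idle timeout.
  interface PooledConnection
  {
    void touch();
  }

  private final List<PooledConnection> pool = new CopyOnWriteArrayList<>();

  // A single daemon thread for the whole store, shared by all accessors.
  private final Timer timer = new Timer("connection-keep-alive", true);

  void register(PooledConnection connection)
  {
    pool.add(connection);
  }

  // Refresh all pooled connections once; called periodically by the timer.
  void touchAll()
  {
    for (PooledConnection connection : pool)
    {
      connection.touch();
    }
  }

  void start(long periodMillis)
  {
    timer.schedule(new TimerTask()
    {
      @Override
      public void run()
      {
        touchAll();
      }
    }, periodMillis, periodMillis);
  }

  void shutdown()
  {
    timer.cancel(); // one cancel() stops the single keep-alive thread
  }
}
```

The design trade-off is that a shared timer touches connections on one schedule for the whole pool, whereas per-accessor timers track each connection's individual idle time.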
In WriteThroughCommitContext this one was missing:

@Override
public void preWrite()
{
  // Do nothing
}
(In reply to comment #3)

> Of course, this could be refactored to a single timer thread which is associated
> to the store and which goes through all of the accessors in the pool every x
> hours and refreshes the connection once and for all.

Yes, I just changed that ;-) The connectionKeepAliveTimer is now located in DBStore. Only the tasks stay in the accessor class and are scheduled from there once per accessor instance (but periodically, every 4 hours).
Committed to HEAD
Two issues remain:

1) Don't propagate unchecked exceptions from TimerTask.run()
2) cancel() the TimerTask before connection.close()
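A minimal sketch of what these two fixes could look like (hypothetical names, not the actual CDO code): an unchecked exception escaping TimerTask.run() kills the Timer thread and silently cancels all of its other tasks, and closing the connection without cancelling the task first leaves a scheduled task that may touch a dead connection:

```java
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical keep-alive task illustrating both fixes. The "touch"
// Runnable stands in for issuing a trivial SQL statement on the connection.
class KeepAliveTask extends TimerTask
{
  private final Runnable touch; // e.g. () -> statement.execute("SELECT 1")

  final AtomicInteger failures = new AtomicInteger();

  KeepAliveTask(Runnable touch)
  {
    this.touch = touch;
  }

  @Override
  public void run()
  {
    try
    {
      touch.run(); // keep the connection alive
    }
    catch (RuntimeException ex)
    {
      // Fix 1: never let an unchecked exception escape run();
      // it would terminate the shared Timer thread. Count/log instead.
      failures.incrementAndGet();
    }
  }

  void closeAccessor(AutoCloseable connection) throws Exception
  {
    cancel();           // Fix 2: cancel the scheduled task first...
    connection.close(); // ...so it can never touch a closed connection
  }
}
```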
This bug is critical because it definitely leads to the death of a repository instance after exactly 1 hour. The trivial fix is absolutely needed in 3.0 GA because every user would hit this problem after 1 hour of running a repository on a JDBC database. As this is a post-RC4 fix we need:

- Two review+ flags from committers (Martin F., Stefan and Martin T.?)
- Two additional pmc_approved+ flags from PMC members (Ed, Kenn and Sven?).
Please note that the important part of this bug to review for late contribution to Helios is described in comment #7 and has been committed as logged in comment #8.
I notice the bug is marked "P3, normal". That's probably not accurate for a bug so important that it requires a fix after RC4? I suggest you make those accurate (though, I'll admit, I don't know the rules your project normally follows for marking bugs ... maybe those are accurate for your project?).

Also, normally, for a bug to be "blocking" (requiring a respin after RC4), some discussion of work-arounds or alternatives would be appropriate. For example, is the work-around in this case that users have to restart Eclipse? (Once every hour? :( And, I assume, this would be a regression from the previous release?

Also, typically, some discussion of why this was not discovered until so late would be in order. Do testing procedures need to be improved? Or is this just a rare path that is only hit in special cases? Etc.

Some of this is probably obvious to you in the field ... but, since you're asking for a respin of the Helios common repo, it should be described for the casual reader ... like me :) Thanks,
(In reply to comment #11)

> I notice the bug is marked "P3, normal". that's probably not accurate for
> a bug so important it requires a fix after RC4? I suggest you make those
> accurate (though, I'll admit, I don't know the rules your project normally
> follows for marking bugs ... maybe those are accurate for your project?).

The rules are: the reporter gives the severity and we mark the priority. The latter was not necessary because the fix was just two lines. Also, we could have filed a separate bug instead of reopening this one after my first fix to the original problem. That's why I chose to mark it critical in comment #9.

> Also, normally, a bug to be "blocking" (requiring a respin after RC4) some
> discussion of work-arounds or alternatives would be appropriate. For example,
> is the work-around in this case that users have to restart Eclipse?
> (once every hour? :(

There is no workaround. CDO involves running a model repository server with a database connection. MySQL closes inactive connections after 4 hours, so we added a Timer that simply touches the connection every hour. With this mechanism the two issues mentioned briefly in comment #7 were introduced, and these definitely lead to a crash of the model repository server after exactly 1 hour (for all database types).

> And, I assume, this would be a regression from previous release?

No.

> Also, typically, some discussion of why this was not discovered until so late
> would be in order. Do testing procedures need to be improved? Or, just a rare
> path that is only hit in special cases? etc.

Please see my explanation above.

> Some of this is probably obvious to you in the field ... but, since asking for
> a respin of helios common repo, it should be described for the casual reader
> ... like me :)

Yes, this is obvious to all the other involved parties, but I understand that in such a late case we should have been more precise with these explanations. Do you think it is clear now?
Yes, very clear. Thank you. Now ... for when? The earlier the better. Will you be ready at 2 PM Eastern today for a respin? (And, by ready, I mean fix in, someone on stand-by in case the build breaks :)
(In reply to comment #13)

> yes, very clear. Thank you. Now ... for when? Earlier the better. Will you be
> ready at 2 PM eastern today for a respin?

That seems to be 1:10 hours from now. I might need a couple of additional minutes, just in case signing takes longer. But generally yes ;-)

> (And, by ready, I mean fix in,
> someone on stand-by in case the build breaks :)

I'll stay online or check my iPhone for mails until I see a green aggregation!
Available in 3.0 GA: http://download.eclipse.org/modeling/emf/cdo/updates/3.0-releases/