Bug 328137 - Possible thread leak in remoteservices
Summary: Possible thread leak in remoteservices
Status: RESOLVED FIXED
Alias: None
Product: ECF
Classification: RT
Component: ecf.remoteservices
Version: 3.3.0
Hardware: PC Mac OS X - Carbon (unsup.)
Importance: P3 normal
Target Milestone: ---
Assignee: ecf.core-inbox
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-10-19 11:10 EDT by Bryan Hunt
Modified: 2013-01-29 15:57 EST

See Also:


Attachments
console output (72.44 KB, text/plain)
2010-10-19 11:11 EDT, Bryan Hunt

Description Bryan Hunt 2010-10-19 11:10:33 EDT
I'm seeing extra threads created by ECF to manage remote services when a service is "rediscovered", and the old threads are not getting cleaned up.  I captured a trace using YourKit with a system as follows:

My system contained one client and three servers all running on the same host but in separate JVMs.  Each of the three servers publishes the exact same remote service and is configured identically, except for the ECF container port number, which is unique.  I'm using zookeeper for service discovery.  Once the system is up and running, it is never shut down unless I need to deploy new code.  I would expect the three remote services to be discovered by the client once and never undiscovered.

I'm having a problem with the zookeeper service discovery that could be contributing to the threads leaking.  At random times, zookeeper appears to be undiscovering and then re-discovering the remote services.  I can force the problem to happen by connecting to the client process with YourKit from a remote machine such as when running YourKit at home and running the client at work.  During the initial connection from YourKit,  I will see messages from zookeeper that services are undiscovered and re-discovered.  After YourKit is connected, I will occasionally see the same messages while the process runs.  I have also seen the problem even when YourKit is not connected.  In one case after an overnight run, my thread count went from around 40 to around 120.

I suspect that something in the network is contributing to the problem.  I've been having intermittent problems with the network on my workstation and this box is where I've been testing the system described above.  I have seen cases where a ping from my home machine to my workstation fails for the first few pings and then starts working again.  IT is looking into the issue, and I suspect that some of these problems will go away when they fix the network problem.  Even so, ECF and zookeeper should be tolerant of network issues.

In the YourKit trace, there were three sets of ECF threads for the remote services up to time 3:39:57.  At this time, ECF started two additional IRemoteServiceContainerAdapter:run threads and two additional RSRegistry Dispatcher threads and did not stop any of the others.  Also at this time, ECF stopped the ping, rcvr and sndr threads for all three remote containers on ports 30000, 30001, and 30002.  ECF then started 5 sets of ping, rcvr, and sndr threads.  It started one set for port 30000, one set for port 30001, and three sets for port 30002.
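
For anyone who wants to check for a similar leak without a profiler, a rough census of live threads by name can be taken from inside the client JVM.  The following is only a minimal sketch and not ECF code: the class name ThreadCensus is made up, and the name fragments passed in (for example "ping", "rcvr", "sndr") are assumptions based on the thread names mentioned above; the actual ECF thread names may be formatted differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: counts live threads whose name contains each given fragment.
final class ThreadCensus {

    static Map<String, Integer> countMatching(String... fragments) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String fragment : fragments) {
            counts.put(fragment, 0);
        }
        // getAllStackTraces() returns one entry per live thread in this JVM.
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            for (String fragment : fragments) {
                if (thread.getName().contains(fragment)) {
                    counts.merge(fragment, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Logging the result of ThreadCensus.countMatching("ping", "rcvr", "sndr") each time a service is discovered or undiscovered should show whether the counts keep growing after a rediscovery.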

I captured the following console data during this time:

ZooDiscovery> Service Discovered: Oct 15, 2010 3:34:21 PM. ServiceInfo[uri=9.41.188.106;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=9.41.188.106;full=_osgiservices._tcp.default._iana@9.41.188.106];priority=0;weight=0;props=ServiceProperties[{com.ibm.hdwb.jobs.common.monitor.port=9020, ecf.sp.ect=ecf.generic.server, com.ibm.hdwb.jobs.common.monitor.command=/afs/awd/projects/cte/tools/hdwb/prod/llmonitor/monitor.ksh, component.id=18, com.ibm.hdwb.jobs.common.monitor.host=tritium.austin.ibm.com, org.eclipse.ecf.internal.discovery.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@8800880, component.name=com.ibm.hdwb.ll.server.job_queue_service, ll_submit_command=, ecf.rsvc.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@7650765, ecf.sp.cid=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@7e627e62, com.ibm.hdwb.jobs.common.pool.uuid=db8f4161-bb74-4a1a-a4f4-807b365cadf8, com.ibm.hdwb.jobs.common.monitor.submitter=bhunt, service.factoryPid=com.ibm.hdwb.ll.server.job_queue_service, service.pid=com.ibm.hdwb.ll.server.job_queue_service-1287095024917-0, com.ibm.hdwb.jobs.common.monitor.restlet.port=8080, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, osgi.remote.service.interfaces=com.ibm.hdwb.jobs.common.IJobQueueService, ecf.rsvc.ns=ecf.namespace.generic.remoteservice, com.ibm.hdwb.jobs.common.monitor.restlet.host=tritium.austin.ibm.com}]]

bindJobService()
Monitored job service found job service for submitter: bhunt with id: f8657d3d-b478-49ef-b59d-ac9ed9629a65

ZooDiscovery> Service Undiscovered: Oct 15, 2010 3:37:40 PM. ServiceInfo[uri=9.41.188.106;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=9.41.188.106;full=_osgiservices._tcp.default._iana@9.41.188.106];priority=0;weight=0;props=ServiceProperties[{com.ibm.hdwb.jobs.common.monitor.port=9020, ecf.sp.ect=ecf.generic.server, com.ibm.hdwb.jobs.common.monitor.command=/afs/awd/projects/cte/tools/hdwb/prod/llmonitor/monitor.ksh, component.id=18, com.ibm.hdwb.jobs.common.monitor.host=tritium.austin.ibm.com, org.eclipse.ecf.internal.discovery.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@8800880, component.name=com.ibm.hdwb.ll.server.job_queue_service, ll_submit_command=, ecf.rsvc.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@7650765, ecf.sp.cid=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@7e627e62, com.ibm.hdwb.jobs.common.pool.uuid=db8f4161-bb74-4a1a-a4f4-807b365cadf8, com.ibm.hdwb.jobs.common.monitor.submitter=bhunt, service.factoryPid=com.ibm.hdwb.ll.server.job_queue_service, service.pid=com.ibm.hdwb.ll.server.job_queue_service-1287095024917-0, com.ibm.hdwb.jobs.common.monitor.restlet.port=8080, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, osgi.remote.service.interfaces=com.ibm.hdwb.jobs.common.IJobQueueService, ecf.rsvc.ns=ecf.namespace.generic.remoteservice, com.ibm.hdwb.jobs.common.monitor.restlet.host=tritium.austin.ibm.com}]]

unbindJobService()
  submitter: bhunt

ZooDiscovery> Service Discovered: Oct 15, 2010 3:37:42 PM. ServiceInfo[uri=9.41.188.106;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=9.41.188.106;full=_osgiservices._tcp.default._iana@9.41.188.106];priority=0;weight=0;props=ServiceProperties[{com.ibm.hdwb.jobs.common.monitor.port=9020, ecf.sp.ect=ecf.generic.server, com.ibm.hdwb.jobs.common.monitor.command=/afs/awd/projects/cte/tools/hdwb/prod/llmonitor/monitor.ksh, component.id=18, com.ibm.hdwb.jobs.common.monitor.host=tritium.austin.ibm.com, org.eclipse.ecf.internal.discovery.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@517a517a, component.name=com.ibm.hdwb.ll.server.job_queue_service, ll_submit_command=, ecf.rsvc.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@505f505f, ecf.sp.cid=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@4f754f75, com.ibm.hdwb.jobs.common.pool.uuid=db8f4161-bb74-4a1a-a4f4-807b365cadf8, com.ibm.hdwb.jobs.common.monitor.submitter=bhunt, service.factoryPid=com.ibm.hdwb.ll.server.job_queue_service, service.pid=com.ibm.hdwb.ll.server.job_queue_service-1287152632766-0, com.ibm.hdwb.jobs.common.monitor.restlet.port=8080, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, osgi.remote.service.interfaces=com.ibm.hdwb.jobs.common.IJobQueueService, ecf.rsvc.ns=ecf.namespace.generic.remoteservice, com.ibm.hdwb.jobs.common.monitor.restlet.host=tritium.austin.ibm.com}]]

ZooDiscovery> Service Discovered: Oct 15, 2010 3:37:42 PM. ServiceInfo[uri=9.41.188.106;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=9.41.188.106;full=_osgiservices._tcp.default._iana@9.41.188.106];priority=0;weight=0;props=ServiceProperties[{com.ibm.hdwb.jobs.common.monitor.port=9020, ecf.sp.ect=ecf.generic.server, com.ibm.hdwb.jobs.common.monitor.command=/afs/awd/projects/cte/tools/hdwb/prod/llmonitor/monitor.ksh, component.id=18, com.ibm.hdwb.jobs.common.monitor.host=tritium.austin.ibm.com, org.eclipse.ecf.internal.discovery.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@37633763, component.name=com.ibm.hdwb.ll.server.job_queue_service, ll_submit_command=, ecf.rsvc.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@36483648, ecf.sp.cid=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@355e355e, com.ibm.hdwb.jobs.common.pool.uuid=db8f4161-bb74-4a1a-a4f4-807b365cadf8, com.ibm.hdwb.jobs.common.monitor.submitter=bhunt, service.factoryPid=com.ibm.hdwb.ll.server.job_queue_service, service.pid=com.ibm.hdwb.ll.server.job_queue_service-1287095024917-0, com.ibm.hdwb.jobs.common.monitor.restlet.port=8080, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, osgi.remote.service.interfaces=com.ibm.hdwb.jobs.common.IJobQueueService, ecf.rsvc.ns=ecf.namespace.generic.remoteservice, com.ibm.hdwb.jobs.common.monitor.restlet.host=tritium.austin.ibm.com}]]

bindJobService()
Monitored job service found job service for submitter: bhunt with id: f8657d3d-b478-49ef-b59d-ac9ed9629a65

ZooDiscovery> Service Discovered: Oct 15, 2010 3:39:57 PM. ServiceInfo[uri=9.41.188.106;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=9.41.188.106;full=_osgiservices._tcp.default._iana@9.41.188.106];priority=0;weight=0;props=ServiceProperties[{com.ibm.hdwb.jobs.common.monitor.port=9020, ecf.sp.ect=ecf.generic.server, com.ibm.hdwb.jobs.common.monitor.command=/afs/awd/projects/cte/tools/hdwb/prod/llmonitor/monitor.ksh, component.id=19, com.ibm.hdwb.jobs.common.monitor.host=tritium.austin.ibm.com, org.eclipse.ecf.internal.discovery.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@23672367, component.name=com.ibm.hdwb.ll.server.job_queue_service, ll_submit_command=, ecf.rsvc.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@1ac41ac4, ecf.sp.cid=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@16fe16fe, com.ibm.hdwb.jobs.common.pool.uuid=db8f4161-bb74-4a1a-a4f4-807b365cadf8, com.ibm.hdwb.jobs.common.monitor.submitter=bhunt, service.factoryPid=com.ibm.hdwb.ll.server.job_queue_service, service.pid=com.ibm.hdwb.ll.server.job_queue_service-1287175162582-0, com.ibm.hdwb.jobs.common.monitor.restlet.port=8080, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, osgi.remote.service.interfaces=com.ibm.hdwb.jobs.common.IJobQueueService, ecf.rsvc.ns=ecf.namespace.generic.remoteservice, com.ibm.hdwb.jobs.common.monitor.restlet.host=tritium.austin.ibm.com}]]

bindJobService()
Monitored job service found job service for submitter: bhunt with id: 6c50eb3b-bb76-49aa-b002-ec44ef66a9e8
unbindJobService()
  submitter: bhunt
unbindJobService()
  submitter: bhunt
unbindJobService()
  submitter: bhunt

ZooDiscovery> Service Undiscovered: Oct 15, 2010 3:41:44 PM. ServiceInfo[uri=9.41.188.106;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=9.41.188.106;full=_osgiservices._tcp.default._iana@9.41.188.106];priority=0;weight=0;props=ServiceProperties[{com.ibm.hdwb.jobs.common.monitor.port=9020, ecf.sp.ect=ecf.generic.server, com.ibm.hdwb.jobs.common.monitor.command=/afs/awd/projects/cte/tools/hdwb/prod/llmonitor/monitor.ksh, component.id=19, com.ibm.hdwb.jobs.common.monitor.host=tritium.austin.ibm.com, org.eclipse.ecf.internal.discovery.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@23672367, component.name=com.ibm.hdwb.ll.server.job_queue_service, ll_submit_command=, ecf.rsvc.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@1ac41ac4, ecf.sp.cid=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@16fe16fe, com.ibm.hdwb.jobs.common.pool.uuid=db8f4161-bb74-4a1a-a4f4-807b365cadf8, com.ibm.hdwb.jobs.common.monitor.submitter=bhunt, service.factoryPid=com.ibm.hdwb.ll.server.job_queue_service, service.pid=com.ibm.hdwb.ll.server.job_queue_service-1287175162582-0, com.ibm.hdwb.jobs.common.monitor.restlet.port=8080, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, osgi.remote.service.interfaces=com.ibm.hdwb.jobs.common.IJobQueueService, ecf.rsvc.ns=ecf.namespace.generic.remoteservice, com.ibm.hdwb.jobs.common.monitor.restlet.host=tritium.austin.ibm.com}]]

ZooDiscovery> Service Discovered: Oct 15, 2010 3:41:45 PM. ServiceInfo[uri=9.41.188.106;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=9.41.188.106;full=_osgiservices._tcp.default._iana@9.41.188.106];priority=0;weight=0;props=ServiceProperties[{com.ibm.hdwb.jobs.common.monitor.port=9020, ecf.sp.ect=ecf.generic.server, com.ibm.hdwb.jobs.common.monitor.command=/afs/awd/projects/cte/tools/hdwb/prod/llmonitor/monitor.ksh, component.id=19, com.ibm.hdwb.jobs.common.monitor.host=tritium.austin.ibm.com, org.eclipse.ecf.internal.discovery.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@dde0dde, component.name=com.ibm.hdwb.ll.server.job_queue_service, ll_submit_command=, ecf.rsvc.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@cb30cb3, ecf.sp.cid=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@bc90bc9, com.ibm.hdwb.jobs.common.pool.uuid=db8f4161-bb74-4a1a-a4f4-807b365cadf8, com.ibm.hdwb.jobs.common.monitor.submitter=bhunt, service.factoryPid=com.ibm.hdwb.ll.server.job_queue_service, service.pid=com.ibm.hdwb.ll.server.job_queue_service-1287175162582-0, com.ibm.hdwb.jobs.common.monitor.restlet.port=8080, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, osgi.remote.service.interfaces=com.ibm.hdwb.jobs.common.IJobQueueService, ecf.rsvc.ns=ecf.namespace.generic.remoteservice, com.ibm.hdwb.jobs.common.monitor.restlet.host=tritium.austin.ibm.com}]]

ZooDiscovery> Service Discovered: Oct 15, 2010 3:41:45 PM. ServiceInfo[uri=9.41.188.106;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=9.41.188.106;full=_osgiservices._tcp.default._iana@9.41.188.106];priority=0;weight=0;props=ServiceProperties[{com.ibm.hdwb.jobs.common.monitor.port=9020, ecf.sp.ect=ecf.generic.server, com.ibm.hdwb.jobs.common.monitor.command=/afs/awd/projects/cte/tools/hdwb/prod/llmonitor/monitor.ksh, component.id=18, com.ibm.hdwb.jobs.common.monitor.host=tritium.austin.ibm.com, org.eclipse.ecf.internal.discovery.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@6d0f6d0f, component.name=com.ibm.hdwb.ll.server.job_queue_service, ll_submit_command=, ecf.rsvc.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@6bf46bf4, ecf.sp.cid=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@6b0a6b0a, com.ibm.hdwb.jobs.common.pool.uuid=db8f4161-bb74-4a1a-a4f4-807b365cadf8, com.ibm.hdwb.jobs.common.monitor.submitter=bhunt, service.factoryPid=com.ibm.hdwb.ll.server.job_queue_service, service.pid=com.ibm.hdwb.ll.server.job_queue_service-1287152632766-0, com.ibm.hdwb.jobs.common.monitor.restlet.port=8080, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, osgi.remote.service.interfaces=com.ibm.hdwb.jobs.common.IJobQueueService, ecf.rsvc.ns=ecf.namespace.generic.remoteservice, com.ibm.hdwb.jobs.common.monitor.restlet.host=tritium.austin.ibm.com}]]

ZooDiscovery> Service Discovered: Oct 15, 2010 3:41:45 PM. ServiceInfo[uri=9.41.188.106;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=9.41.188.106;full=_osgiservices._tcp.default._iana@9.41.188.106];priority=0;weight=0;props=ServiceProperties[{com.ibm.hdwb.jobs.common.monitor.port=9020, ecf.sp.ect=ecf.generic.server, com.ibm.hdwb.jobs.common.monitor.command=/afs/awd/projects/cte/tools/hdwb/prod/llmonitor/monitor.ksh, component.id=18, com.ibm.hdwb.jobs.common.monitor.host=tritium.austin.ibm.com, org.eclipse.ecf.internal.discovery.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@42034203, component.name=com.ibm.hdwb.ll.server.job_queue_service, ll_submit_command=, ecf.rsvc.id=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@40e840e8, ecf.sp.cid=org.eclipse.ecf.discovery.ServiceProperties$ByteArrayWrapper@3ffe3ffe, com.ibm.hdwb.jobs.common.pool.uuid=db8f4161-bb74-4a1a-a4f4-807b365cadf8, com.ibm.hdwb.jobs.common.monitor.submitter=bhunt, service.factoryPid=com.ibm.hdwb.ll.server.job_queue_service, service.pid=com.ibm.hdwb.ll.server.job_queue_service-1287095024917-0, com.ibm.hdwb.jobs.common.monitor.restlet.port=8080, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, osgi.remote.service.interfaces=com.ibm.hdwb.jobs.common.IJobQueueService, ecf.rsvc.ns=ecf.namespace.generic.remoteservice, com.ibm.hdwb.jobs.common.monitor.restlet.host=tritium.austin.ibm.com}]]

bindJobService()
Monitored job service found job service for submitter: bhunt with id: 6c50eb3b-bb76-49aa-b002-ec44ef66a9e8
bindJobService()
Monitored job service found job service for submitter: bhunt with id: 6c50eb3b-bb76-49aa-b002-ec44ef66a9e8
Monitored job service found duplicate job service for submitter: bhunt with id: 6c50eb3b-bb76-49aa-b002-ec44ef66a9e8
bindJobService()
Monitored job service found job service for submitter: bhunt with id: 6c50eb3b-bb76-49aa-b002-ec44ef66a9e8
Monitored job service found duplicate job service for submitter: bhunt with id: 6c50eb3b-bb76-49aa-b002-ec44ef66a9e8
bindJobService()
Monitored job service found job service for submitter: bhunt with id: 93689605-5598-4549-a9a2-8096ca7b9da0
bindJobService()
Monitored job service found job service for submitter: bhunt with id: f8657d3d-b478-49ef-b59d-ac9ed9629a65

The bindJobService() and unbindJobService() lines are printed when DS calls those functions to bind and unbind the remote service.  Each remote service is given a UUID, and I check that UUID against the services that are already bound so as not to create manager threads for duplicate services.  As you can see from the console output, bind was called twice with duplicate services.
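
The duplicate check described above could look roughly like the sketch below.  This is not the actual application code: JobQueueService, getPoolUuid() and JobServiceTracker are hypothetical names, and counting bindings per UUID (rather than keeping a plain set) is an assumption made here so that an unbind of a duplicate does not stop the manager while another binding is still present.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for the remote service proxy; only the UUID matters here.
interface JobQueueService {
    UUID getPoolUuid();
}

// Hypothetical tracker that tells the caller when to start or stop a manager thread.
final class JobServiceTracker {

    // UUID -> number of currently bound service instances reporting that UUID.
    private final Map<UUID, Integer> bindCounts = new HashMap<>();

    /** Returns true only for the first binding of a UUID (start a manager). */
    synchronized boolean bind(JobQueueService service) {
        int count = bindCounts.merge(service.getPoolUuid(), 1, Integer::sum);
        return count == 1;
    }

    /** Returns true only when the last binding of a UUID goes away (stop the manager). */
    synchronized boolean unbind(JobQueueService service) {
        UUID id = service.getPoolUuid();
        Integer count = bindCounts.get(id);
        if (count == null) {
            return false; // unbind without a matching bind; nothing to stop
        }
        if (count > 1) {
            bindCounts.put(id, count - 1);
            return false; // a duplicate is still bound
        }
        bindCounts.remove(id);
        return true;
    }
}
```

With this shape, the bind and unbind callbacks would only create or tear down manager threads when the corresponding method returns true.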
Comment 1 Bryan Hunt 2010-10-19 11:11:49 EDT
Created attachment 181189 [details]
console output
Comment 2 Bryan Hunt 2010-10-19 11:12:52 EDT
The YourKit snapshot is too large to attach.  Ping me on skype if you want it.
Comment 3 Markus Kuppe 2010-10-19 11:23:06 EDT
(In reply to comment #0)
> I suspect that something in the network is contributing to the problem.  I've
> been having intermittent problems with the network on my workstation and this
> box is where I've been testing the system described above.  I have seen cases
> where a ping from my home machine to my workstation fails for the first few
> pings and then starts working again.  IT is looking into the issue, and I
> suspect that some of these problems will go away when they fix the network
> problem.  Even so, ECF and zookeeper should be tolerant of network issues.

If the network does indeed have intermittent problems, I'm inclined to argue that Zookeeper is behaving correctly. From its point of view the service went down with the network and came back again afterwards. It is just that remoteservices does not dispose of the lingering connection.
Comment 4 Scott Lewis 2010-10-19 12:47:27 EDT
Thanks for the report.

(In reply to comment #3)
> (In reply to comment #0)
> > I suspect that something in the network is contributing to the problem.  I've
> > been having intermittent problems with the network on my workstation and this
> > box is where I've been testing the system described above.  I have seen cases
> > where a ping from my home machine to my workstation fails for the first few
> > pings and then starts working again.  IT is looking into the issue, and I
> > suspect that some of these problems will go away when they fix the network
> > problem.  Even so, ECF and zookeeper should be tolerant of network issues.
> 
> If the network does indeed have intermittent problems, I'm inclined to argue
> that Zookeeper is behaving correctly. From its point of view the service went
> down with the network and came back again afterwards. It is just that
> remoteservices does not dispose of the lingering connection.

It may be true that remoteservices is not disposing of the lingering connection...under all necessary failure conditions...but there could still be problems with zookeeper's handling of undiscovery...i.e. via network failure/dropped connections.

I think this bug highlights our ongoing need to put together some sort of distributed test harness...so that network failure situations (for both discovery and remote services) can be explicitly tested.  The hard part about diagnosing and fixing these sorts of situations will be simply reproducing both the general network environment and the specific failure conditions that Bryan is seeing...and then finding and fixing any problems in remote services and/or discovery.

I intend to investigate this as much as possible...but Bryan in the short term it may require some close interaction with you...i.e. to try to reproduce the situations that are leading to these problems in your network and application.  Hopefully we will be able to arrange such interaction.

In any case, thanks.
Comment 5 Markus Kuppe 2010-10-20 12:01:12 EDT
(In reply to comment #4)
> I think this bug highlights our ongoing need to put together some sort of
> distributed test harness...so that network failure situations (for both
> discovery and remote services) can be explicitly tested.  The hard part about
> diagnosing and fixing these sorts of situations will be simply reproducing both
> the general network environment and the specific failure conditions that Bryan
> is seeing...and then finding and fixing any problems in remote services and/or
> discovery.

I totally agree that testing should be extended to the network layer. The problem is that, last time I checked, no (Java) framework existed that allows simulating the network layer (to write tests that exercise network outages, packet loss, ...).

OTOH we need to think about the state space explosion. At some point formal verification might just be the only way.

Anyway, with the new hardware in place, we will hopefully find the resources (time and host-wise) to start with real remote testing across more than one host.
Comment 6 Bryan Hunt 2010-10-21 15:35:44 EDT
I've had the system described (with one additional remote service - 4 total) running on our production for about 24 hours.  This system is in a production lab and I'm not aware of any network issues.  At 2:53 AM zookeeper undiscovered and then immediately re-discovered one service.  At 12:50 PM zookeeper undiscovered and then immediately re-discovered two services.  I dumped javacore and the system has not created any additional ECF threads.
Comment 7 Scott Lewis 2010-10-21 16:02:51 EDT
(In reply to comment #6)
> I've had the system described (with one additional remote service - 4 total)
> running on our production for about 24 hours.  This system is in a production
> lab and I'm not aware of any network issues.  At 2:53 AM zookeeper undiscovered
> and then immediately re-discovered one service.  At 12:50 PM zookeeper
> undiscovered and then immediately re-discovered two services.  I dumped
> javacore and the system has not created any additional ECF threads.

Thanks for the report...but I'm not quite clear...did the above sequence result in any obviously extraneous threads?  Also...what version of zookeeper and ECF generic provider are you using?

One thing I've been thinking about...if zookeeper were to *not* undiscover a service (i.e. not report undiscover because of a zookeeper-server-to-client communication failure)...but the distribution provider didn't detect the failure...e.g. because the failure was transient and within the ECF generic keepalive timeout...and then send a new discovery notification, it could easily result in new client container creation and connection...and of course new threads on both client and server.   
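
To make that scenario concrete, here is a rough sketch of one way a consumer could defend against a discovery notification that arrives without a preceding undiscovery.  It is not how ECF's distribution provider is implemented: Disposable, EndpointConnections and the string endpoint id are all hypothetical, and disposing the stale connection before creating a new one is just one possible policy (reusing the existing connection would be another).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical handle for whatever owns the threads of one client connection.
interface Disposable {
    void dispose();
}

// Hypothetical registry that keeps at most one live connection per endpoint id.
final class EndpointConnections {

    private final Map<String, Disposable> byEndpoint = new HashMap<>();
    private final Function<String, Disposable> connector;

    EndpointConnections(Function<String, Disposable> connector) {
        this.connector = connector;
    }

    /** Called on "Service Discovered"; replaces any connection left over for the same endpoint. */
    synchronized void discovered(String endpointId) {
        Disposable stale = byEndpoint.remove(endpointId);
        if (stale != null) {
            stale.dispose(); // no undiscovery was seen, so clean up here
        }
        byEndpoint.put(endpointId, connector.apply(endpointId));
    }

    /** Called on "Service Undiscovered"; disposes the connection if one exists. */
    synchronized void undiscovered(String endpointId) {
        Disposable old = byEndpoint.remove(endpointId);
        if (old != null) {
            old.dispose();
        }
    }

    synchronized int liveConnections() {
        return byEndpoint.size();
    }
}
```

Under this policy, three consecutive discoveries of the same endpoint leave exactly one live connection instead of three.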

In any event...please keep collecting data as you can, and we'll do all we can to figure out what's going on specifically in your environment.
Comment 8 Bryan Hunt 2010-10-21 16:18:45 EDT
(In reply to comment #7)
> Thanks for the report...but I'm not quite clear...did the above sequence result
> in any obviously extraneous threads?  Also...what version of zookeeper and ECF
> generic provider are you using?

No extra threads were created.  I was simply providing the observation that ECF appears to randomly undiscover and re-discover services under "normal conditions".  I'm using the ECF 3.3 release with the following bundles checked out of head from CVS:

org.eclipse.ecf.osgi.services.discovery
org.eclipse.ecf.osgi.services.distribution
org.eclipse.ecf.remoteservice

with patches from bug fixes applied to org.eclipse.ecf.osgi.services.distribution
Comment 9 Scott Lewis 2013-01-29 15:57:10 EST
I believe this bug was due to a problem in zookeeper which has since been solved by a move to a more recent version of apache zookeeper, and updates to the ECF zookeeper provider code.

If discovery-based threads are being created and not cleaned up with ECF 3.5+, please reopen.