When not being *extra* gentle (even then), some requests generate InterruptedExceptions and the application seems to hang; a restart is usually required. This can easily be reproduced, even on small traces, by zooming repeatedly or by switching between experiments before the previous operation has terminated.

Some background: part of the application is built around a producer/consumer approach where the generated data is queued for its consumers. Both the producers and the consumers perform timed accesses to the shared queue in order to determine whether the peer is still alive. On a timeout, an InterruptedException is produced and the request is cancelled. When a producer times out, it is no big deal: just kill the request (there is no consumer left). When a consumer times out, it should retry the operation (clean up, issue a new request, etc.), which is problematic if the producer died.

What we observe is that the producer does not correctly terminate its request and keeps timing out while trying to feed the shared queue. This seems to be caused by the consumer no longer reading the queue, either because it believes the request is terminated or, more likely, because it is stuck in a deadlock.
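For reference, the timed hand-off described above looks roughly like the following sketch. This is not the actual TMF/LTTng code; the class and member names are illustrative, assuming a plain java.util.concurrent.BlockingQueue whose timed offer()/poll() failures are surfaced as InterruptedExceptions, as described.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of the timed producer/consumer hand-off.
    public class TimedHandOffSketch {

        private static final int TIMEOUT_SEC = 10;
        private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(100);

        // Producer side: a timed offer() detects a consumer that stopped reading.
        void produce(Object data) throws InterruptedException {
            if (!queue.offer(data, TIMEOUT_SEC, TimeUnit.SECONDS)) {
                // Timed out: the consumer may be gone (or just busy). The observed
                // bug is that the producer keeps retrying here instead of
                // terminating its request.
                throw new InterruptedException("consumer not reading the queue");
            }
        }

        // Consumer side: a timed poll() detects a producer that stopped feeding.
        Object consume() throws InterruptedException {
            Object data = queue.poll(TIMEOUT_SEC, TimeUnit.SECONDS);
            if (data == null) {
                // Timed out: should clean up and re-issue the request, which is
                // problematic if the producer is actually dead.
                throw new InterruptedException("producer not feeding the queue");
            }
            return data;
        }
    }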
Created attachment 172473 [details] InterruptException (timeout) fix

This is observed when the LTTngSyntheticEventProvider is in the loop (i.e. servicing requests from CFV, RV, or SV) and it is related to the usage of the data queue as a communication mechanism. As suspected, the problem is that a consumer thread is waiting on a data queue while the producer thread is busy doing something else (like servicing another, higher priority, request from another consumer). The call to getNext() times out and an InterruptedException is generated. It can also happen when running on a single-core processor (e.g. in a VM): it turns out that the overhead of switching between threads to percolate a single event through the software layers is excessive. Note that this is much less of a problem when [1] the threads can be dispatched on different cores, and [2] blocks of data are processed instead of single events.

The data queue was introduced to allow asynchronous data exchange between the different components. The timed getNext() (and queueRequest()) on the data queue was introduced to detect whether the peer is still alive and to take recovery action if it died unexpectedly. One solution would be to replace the timed calls with some kind of heartbeat, but this complicates things, and there is no easy solution with the data queues for the single-core case.

So, the proposed solution is to bypass the data queues in the case of the LTTngSyntheticEventProvider and have the sub-requests call the data handling method of the main request directly. Although there is still a request execution thread per data-providing component (we still want to take advantage of multiple cores if available), the data flow is now closer to a series of function calls and timeouts are simply out of the equation. Performance is not significantly better on a "real" machine (i.e. multi-core with a decent clock rate), but it is much better in VMs.

Note: for good measure :-), getData() and queueRequest() were tampered with and, as a result, 5 JUnit tests are failing. The patch won't be committed until the issues are addressed.
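A minimal sketch of the proposed bypass, assuming hypothetical MainRequest/SubRequest classes (the real logic lives in LTTngSyntheticEventProvider and its request classes):

    // Hypothetical sketch: instead of queueing events for the synthetic
    // provider's consumer thread, the sub-request forwards each event straight
    // to the main request's handler. Names do not match the actual code.
    class MainRequest {
        public void handleData(Object event) {
            // process the synthetic event on behalf of CFV/RV/SV
        }
    }

    class SubRequest {
        private final MainRequest parent;

        SubRequest(MainRequest parent) {
            this.parent = parent;
        }

        // Before: events were offered to a shared data queue with a timeout.
        // After: the data flow is a direct method call, so there is no queue,
        // no timed wait, and no InterruptedException between the peers.
        public void handleData(Object event) {
            parent.handleData(event);
        }
    }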
Created attachment 172509 [details] Updated patch Added quick fixes for the failing JUnit tests.
Patch committed
Patch also committed to the Helios maintenance branch.
Delivered with 0.7