| Summary: | [TMF] InterruptedExceptions when zooming repeatedly or switching trace |
|---|---|
| Product: | z_Archived |
| Component: | LinuxTools |
| Reporter: | Francois Chouinard <fchouinard> |
| Assignee: | Francois Chouinard <fchouinard> |
| Status: | CLOSED FIXED |
| QA Contact: | Francois Chouinard <fchouinard> |
| Severity: | critical |
| Priority: | P3 |
| Version: | unspecified |
| Target Milestone: | --- |
| Hardware: | PC |
| OS: | Linux |
| Whiteboard: | |
| Attachments: | InterruptException (timeout) fix (attachment 172473); Updated patch (attachment 172509) |
Description
Francois Chouinard
Created attachment 172473 [details]
InterruptException (timeout) fix
This is observed when the LTTngSyntheticEventProvider is in the loop (i.e. servicing requests from CFV, RV, or SV) and it is related to the usage of the data queue as a communication mechanism.
As suspected, the problem is that a consumer thread is waiting on a data queue while the producer thread is busy doing something else (like servicing another, higher priority, request from another consumer). The call to getNext() times out and an InterruptedException is generated.
Another way it can happen is when running on a single-core processor (e.g. when running in a VM): it turns out that the overhead of switching between threads to percolate a single event through the software layers is excessive. Note that this is much less of a problem when [1] you can dispatch the threads on different cores, and [2] you process blocks of data instead of single events.
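For context, here is a minimal sketch of that failure mode, assuming the data queue behaves like a standard Java BlockingQueue with a timed poll (the class and constant names below are illustrative, not the actual TMF API):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustration of the failure mode: the consumer blocks on a timed poll
// while the producer is busy servicing another request.
public class TimedQueueConsumer {

    private static final long TIMEOUT_SEC = 5; // hypothetical timeout value

    private final BlockingQueue<Object> dataQueue = new ArrayBlockingQueue<>(1000);

    // Stands in for the timed getNext() described above.
    public Object getNext() throws InterruptedException {
        Object event = dataQueue.poll(TIMEOUT_SEC, TimeUnit.SECONDS);
        if (event == null) {
            // The producer did not deliver in time (e.g. it is busy with a
            // higher-priority request); the caller sees this as a dead peer.
            throw new InterruptedException("Timed out waiting for producer");
        }
        return event;
    }
}
```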
The data queue was introduced to allow asynchronous data exchange between the different components. The timed getNext() (and queueRequest()) calls on the data queue were introduced to detect whether the peer is still alive and to take recovery action if it died unexpectedly. A sketch of the producer side appears below.
One solution would be to replace the timed function calls with some kind of heartbeat, but this complicates things. And there is no easy solution with the data queues for the single-core case.
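Under the same assumption, the producer side would use a timed offer for the liveness check; this is only a sketch with illustrative names, not the real queueRequest() signature:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Producer-side counterpart: a timed offer is used to detect a consumer
// that has stopped draining the queue.
public class TimedQueueProducer {

    private static final long TIMEOUT_SEC = 5; // hypothetical timeout value

    public void queueRequest(BlockingQueue<Object> queue, Object event)
            throws InterruptedException {
        boolean accepted = queue.offer(event, TIMEOUT_SEC, TimeUnit.SECONDS);
        if (!accepted) {
            // Peer appears dead or stuck: take recovery action instead of
            // blocking forever, e.g. cancel the outstanding request.
            throw new InterruptedException("Timed out waiting for consumer");
        }
    }
}
```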
So, the proposed solution is to bypass the data queues in the case of the LTTngSyntheticEventProvider and have the sub-requests call the data handling method of the main request directly. Although there is still a request execution thread per data-providing component (we still want to take advantage of multiple cores when available), the data flow is now closer to a series of function calls and timeouts are simply out of the equation.
Performance is not significantly better on a "real" machine (i.e. multi-core with a decent clock rate), but it is much better in VMs.
Note: For good measure :-), the getData() and queueRequest() methods were tampered with and, as a result, 5 JUnit tests are failing. The patch won't be committed until these issues are addressed.
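A sketch of that direct-call pattern, with hypothetical names (EventHandler, MainRequest, SubRequest are illustrative, not the actual TMF/LTTng classes):

```java
import java.util.Collections;
import java.util.List;

// Direct-call pattern: each sub-request forwards events straight to the main
// request's data handling method instead of going through a data queue.
interface EventHandler {
    void handleData(Object event);
}

class MainRequest implements EventHandler {
    @Override
    public void handleData(Object event) {
        // Process the synthetic event directly; no intermediate queue,
        // hence no timed poll and no timeout to trip over.
    }
}

class SubRequest implements Runnable {

    private final EventHandler parent;

    SubRequest(EventHandler parent) {
        this.parent = parent;
    }

    @Override
    public void run() {
        // Still executed on its own thread per data-providing component,
        // but the data flow is a plain method call chain.
        for (Object event : fetchEventsFromProvider()) {
            parent.handleData(event);
        }
    }

    // Placeholder for whatever the data provider returns.
    private List<Object> fetchEventsFromProvider() {
        return Collections.emptyList();
    }
}
```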
Created attachment 172509 [details]
Updated patch
Added quick fixes for the JUnits.
Patch committed. Patch also committed to the Helios maintenance branch. Delivered with 0.7.