Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 508009

Summary: Deadlock on shutdown with forte 1.8.2
Product: [IoT] 4DIAC Reporter: NOJA Power <nojaapps>
Component: FORTEAssignee: Alois Zoitl <alois.zoitl>
Status: CLOSED WORKSFORME QA Contact:
Severity: normal    
Priority: P3 CC: alois.zoitl, cabralcochi, nojaapps
Version: 1.8.2   
Target Milestone: 1.8.3   
Hardware: PC   
OS: Linux   
Whiteboard:

Description NOJA Power CLA 2016-11-22 20:22:12 EST
Hi,

I found forte gets into a deadlock state on shutdown in the destruction of EMB_RES. The reason is as follows:

The ECET destructor is called first thus CEventChainExecutionThread class and all virtual functions (including end()) are destroyed. The CEventChainExecutionThread::end() can now no longer be called.

Next the destructor for CPosixThread class is called and proceeds to call end() but instead of calling CEventChainExecutionThread::end() it calls CThreadBase::end() and because of this the resume from self suspend is never called.

The quick workaround was to add end() to CEventChainExecutionThread class destructor. I'm not sure if this is the best solution?

CEventChainExecutionThread::~CEventChainExecutionThread(){
    end();
}
Comment 1 Jose Maria Jesus Cabral Lassalle CLA 2016-11-24 03:38:43 EST
Hi,

I'll put here the link to a previous conversation about this same issue in bug 488085, comment 2
Comment 2 NOJA Power CLA 2016-11-24 16:46:04 EST
Hi,

The suggestion from the linked ticket bug 488085, comment 2
If you have a different way to shutdown your device I would recommend that you invoke the changeFBExecutionState(cg_nMGM_CMD_Kill); in the desctructor of your device or even better before you delete your device.

Mmmm ... right you are ... looks like someone has changed (in main.cpp) this:
  poDev->changeFBExecutionState(cg_nMGM_CMD_Kill);

to this:
  poDev->MGR.getResourceEventExecution()->changeExecutionState(cg_nMGM_CMD_Kill);

The comment for the change says:
  Process hangs after deleting FBOOT file

This may have been the case for 1.7? Not sure if it is needed for 1.8. Please give me a couple of days to repeat the test. I'll post here what I find.
Comment 3 NOJA Power CLA 2016-11-24 22:25:34 EST
Hi,

I did some more testing and there is an issue. The issue is either in our implementation or in forte, not sure at the moment. I'll describe the scenario that requires poDev->changeFBExecutionState(cg_nMGM_CMD_Kill) to be changed to poDev->MGR.getResourceEventExecution()->changeExecutionState(cg_nMGM_CMD_Kill) as follows.

•Place FBOOT file in required location
•Run forte
•Send STOP Event
  This is done by a button which results in execution of some code in forte
  The code iterates through each resource, locates each E_RESTART and sends STOP with fb->changeFBExecutionState(cg_nMGM_CMD_Stop);
•CTRL-C forte. In our implementation this goes through the shutdown of createDev() but does not exit the running forte process, just results in calling createDev() again. Unfortunately createDev() is stuck and never completes unless poDev->changeFBExecutionState(cg_nMGM_CMD_Kill) is changed to poDev->MGR.getResourceEventExecution()->changeExecutionState(cg_nMGM_CMD_Kill)


Changing poDev->changeFBExecutionState(cg_nMGM_CMD_Kill) to poDev->MGR.getResourceEventExecution()->changeExecutionState(cg_nMGM_CMD_Kill) has the unfortunate side effect of the issue described in this ticket and ~CEventChainExecutionThread() needs to have end()

CEventChainExecutionThread::~CEventChainExecutionThread(){
    end();
}

After some more testing I found that if I iterate through each resource, locate each E_RESTART and send a START before calling poDev->changeFBExecutionState(cg_nMGM_CMD_Kill) then everything works fine and no changes need to be made to the forte code. Unfortunately in our case, the STOP is like an emergency stop button and the process must stop and must never be started again.

Any thoughts on what the best solution might be?
Comment 4 Alois Zoitl CLA 2016-11-27 04:57:43 EST
Thanks for the detailed explanation. You bring up here a very interesting issue. After giving it some thought I think there are two main points that need to be considered here.

 1. It should be possible to correctly shutdown and delete a device which is in the stopped state.
 2. Not sure if invoking stop is the best solution for your emergency stop button

Before going in to detail with both points I would like to recap the statemachine defined in IEC 61499 in Clause 6.4 for managed objects. There the intended behaviour and possible actions resulting from commands like stop and kill are described. Both STOP and KILL stop execution of Applications. And are prerequisites to delete elements. The main difference between STOP and KILL is that after STOP it should be possible to re-enable execution. While in KILL you can only restart execution after a reset commands, which is reinitilizing everything. 

WE deciced for 4diac that Killing a device is equivalent with shutting it down. 

In your use-case after stopping the device all elements except the management resources are put in the stop state. When sending now a kill command to the device nothing will happen as in the statemachine no transition from STOPPED to KILLED is defined. But as you are allowed to START from the STOPPED state and continue from the state where stop has been invoked we keep all threads alive waiting for new events. Which results in the blocking situation that you are experiencing.

So for addressing my point 1 from above what is definitely needed that the code which is shutting down a device will need to check if the device is in stopped state and perform the correct measures. This would also mean that the destructor of the CEventChainExecutionThread will need to correctly end the execution of the threads. I think the best solution is here to do as you suggested by adding the end call to the destructor of CEventChainExecutionThread.

However also addressing my point 2 from above: If your device must not be started again after the emergency stop button has been hit wouldn't the kill command be more appropriate?
Comment 5 NOJA Power CLA 2016-11-27 16:55:26 EST
Hi,

Sorry, I should have been a little clearer. After STOP the operator has the following choices:
 * START - if it is safe to do so, this results in a WARM start (since START has previously been issued) for each resource via E_RESTART and applications can continue (via wired button)
    * The thinking is that States/Counters/etc can be preserved between STOP/START
 * If it is no longer safe to proceed, the operator can decide on one of the following:
    * Clean Resource (from IDE)
    * Clean Device (from IDE)
    * Kill Device (from IDE or a wired button)

I'm tentatively going with the recommendation for the following. If there are any issues, I'll post them here. So far, testing has shown the changes are viable.

Change poDev->changeFBExecutionState(cg_nMGM_CMD_Kill) to poDev->MGR.getResourceEventExecution()->changeExecutionState(cg_nMGM_CMD_Kill) which will allow ECETs to terminate directly from the STOP state

Add the end() to ~CEventChainExecutionThread() which caters for the previously mentioned changed.
CEventChainExecutionThread::~CEventChainExecutionThread(){
    end();
}
Comment 6 Alois Zoitl CLA 2016-11-28 15:20:27 EST
Thanks for the clarification. I asumend that the situation is a little bit more complex. In that sense your changes are correct and I we should apply them also into the official version of FORTE.

There is one thing left I would like to point out. This may not be of interest for your applicaiton but for the general correctness of FORTE I think the endFORTE code should be then like that:

void endForte(int pa_nSig){
  (void) pa_nSig;
  if(0 != poDev){
    if( e_RUNNING == poDev->getState()){
      poDev->changeFBExecutionState(cg_nMGM_CMD_Kill);      
    } else {
      poDev->MGR.getResourceEventExecution()->changeExecutionState(cg_nMGM_CMD_Kill);
    }    
  }
}

Ohterwise FORTE will not correctly shut down if the device is in the running state. If this is ok for you I'll perform the according chagnes.
Comment 7 NOJA Power CLA 2016-11-28 19:18:36 EST
Hi,

Looks pretty good.

I just did a quick scan through and I don't see anything in the thread initialisation to block threads on standard signals. I'm not sure if forte architecture needs consideration for blocking threads to ensure no state transition between test and shutdown calls or if that would be a good thing in a signal handler?
Comment 8 Alois Zoitl CLA 2016-12-16 05:57:41 EST
I'm sorry but I could not follow your last comment could you please give more explanation.
Comment 9 NOJA Power CLA 2017-04-11 17:51:46 EDT
(In reply to Alois Zoitl from comment #8)
> I'm sorry but I could not follow your last comment could you please give
> more explanation.

Hi,

According to the man pages for our platform there is no guarantee which thread will service the signal so if a signal occurs mid critical section in a thread there is a chance of deadlock.

To cater for this we modified the signal handler so that it looks like this.

void endForte(int pa_nSig){
  (void) pa_nSig;
  running = 0;
}

After making this change the deadlock issue went away for us.
Comment 10 Alois Zoitl CLA 2018-01-17 17:19:33 EST
Sorry for the late reply. How should we accomodate this in the offical forte?
Comment 11 Alois Zoitl CLA 2019-01-18 03:44:56 EST
As I currently don't see how we can handle that in the default 4diac FORTE I'll close this issue for now.

If the situation changes or new issues arise feel free to reopen or file new bugs.