Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 357420

Summary: XSD performance
Product: [Modeling] EMF Reporter: S Kalia <kalia>
Component: XSDAssignee: Ed Merks <Ed.Merks>
Status: NEW --- QA Contact:
Severity: enhancement    
Priority: P3 CC: ahunter.eclipse, kimbert
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   
Whiteboard:

Description S Kalia CLA 2011-09-12 17:05:53 EDT
Build Identifier: Eclipse 3.6.2  XSD plugin:  org.eclipse.xsd_2.6.0

Performance analysis of the loading and validating moderately sized XSD  has highlighted some interesting observations. Approx. 2/3 of the CPU instruction can be attributed to calling
 org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA.minimize() and its descendants

   (FYI, the numbers in the columns below are : total number of calls, %CPU instructions in method, %CPU instructions in method and descendents, total instructions in method, total instructions in method and descendents)
   
  
    Parent 4055420  13.28  42.40   737351552  2353297920    J:org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA.minimize()V
 
    Self 4055420     13.28  42.40   737351552  2353297920    J:org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA.isEquivalent(Lorg/eclipse/xsd/XSDParticle$DFA$State;Lorg/eclipse/xsd/XSDParticle$DFA$State;)Z
 
   Child 4020344     1.39    19.90    77410440  1104273792    J:org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA$TransitionList.contains(Ljava/lang/Object;)Z
   Child 16101508   2.53     2.53   140500256   140500256    J:org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA$StateImpl.getTransitions()Ljava/util/List;
   Child 12081164   1.81     1.81   100319160   100319160    J:org/eclipse/emf/common/util/BasicEList.size()I
   Child 4020404     0.90    1.30    49935564    71983096    J:org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA$TransitionList.toArray([Ljava/lang/Object;)[Ljava/lang/Object;
   Child 8110840    1.25     1.25    69584000    69584000    J:org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA$StateImpl.isAccepting()Z
   Child 4020344    0.62     0.62    34149672    34149672    J:org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA$TransitionImpl.getState()Lorg/eclipse/xsd/XSDParticle$DFA$State;
   Child 4020344    0.61     0.61    34121992    34121992    J:org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA$TransitionImpl.getParticle()Lorg/eclipse/xsd/XSDParticle;
   Child 4020344    0.57      0.57    31661830    31661830    J:org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA$TransitionImpl.setParticle(Lorg/eclipse/xsd/XSDParticle;)V
   Child 4020344    0.53     0.53    29352402    29352402    J:org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA$TransitionImpl.setState(Lorg/eclipse/xsd/XSDParticle$DFA$State;)V
 
   
Any redesign and/or improvements to org/eclipse/xsd/impl/XSDParticleImpl$XSDNFA.minimize() and its descendants will go a long way improving overall performance of loading and validating XSD.

Reproducible: Always
Comment 1 Ed Merks CLA 2011-09-18 10:23:45 EDT
The only real way to address this problem (quadratic or worse performance of state minimization) is to eliminate the step entirely; the basic parts of the algorithm have already been performance tuned to the greatest extent possible and the algorithm itself has basic limitations.  In other words, validation and content recognition could use an NFA instead of a DFA. But that won't happen without several man weeks of effort, which in turn won't happen without associated funding for such work.  And even then, things like <all> also cause a problem in that they result in an exponentially large NFA, so even that would need specialized support beyond what's possible with an NFA.
Comment 2 Tim Kimber CLA 2011-11-29 09:31:55 EST
Ed: Thanks for the clarification. 

Just wondering whether the DFA minimization is an optional step that could be switched off if the client app knows that the performance costs will outweigh the benefits. Can you explain the purpose of the DFA minimization, and what exact benefits it gives?
Comment 3 Ed Merks CLA 2011-11-29 09:45:34 EST
The DFA is required to do subset testing, i.e., to test that a content model is a valid restriction of some base content model.  It's also used to test the unique particle attribution rule.  In theory those could be done with a DFA with epsilon transitions. 

Another possibility is to make all the validations that require the DFA be optional.
Comment 4 Tim Kimber CLA 2012-11-29 11:44:08 EST
Ed, two questions:

1. Can you give a sizing for making the call to XSDNFA.minimize() optional ( I am assuming that the validation code would still work if the XSDNFA.minimize() was not called ).
2. Can you give an off-the-cuff sizing for making the entire validation step optional?
Comment 5 Ed Merks CLA 2012-11-29 12:34:43 EST
Making validations themselves optional is probably 2 days of work.

Making validation work with an NFA is *minimally* 10 days of work. Handling validation for <all> without an exponentially large NFA requires something other than an NFA, i.e., something that involves pruning transitions/states so they can be used at most once. That's a bit more of an open-ended design problem. I imagine a complete design is likely to involve 15 days of work, but until I spend at least 2 days on a preliminary detailed analysis and prototyping, this is just an off-the-cuff sense of the complexity of the problem.

I imagine using an NFA will require keeping track of the set of states for the automaton and to deal with <all>, will require keeping track of disallowed states/transition in addition to that.  It's important to realize that this will move some of the very expensive up front cost of minimization to produce a DFA to the (hopefully much less expensive) validation processing using the NFA instead of a DFA.