Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 185878 - TPTP Datapool has various performance problems
Summary: TPTP Datapool has various performance problems
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: TPTP (show other bugs)
Version: unspecified   Edit
Hardware: All All
: P2 major (vote)
Target Milestone: ---   Edit
Assignee: Paul Slauenwhite CLA
QA Contact:
URL:
Whiteboard: closed462
Keywords:
Depends on: 213329
Blocks:
  Show dependency tree
 
Reported: 2007-05-07 20:56 EDT by Joe Toomey CLA
Modified: 2016-05-05 10:29 EDT (History)
6 users (show)

See Also:
jerome.bozier: review+
paulslau: review?


Attachments
Patch for poor Datapool CSV import performance (1.49 KB, patch)
2007-05-07 21:11 EDT, Joe Toomey CLA
no flags Details | Diff
Patch for poor Datapool CSV import performance (1.91 KB, patch)
2007-05-07 21:15 EDT, Joe Toomey CLA
no flags Details | Diff
Updated patch with additional optimization -- requires validation! (4.80 KB, patch)
2007-05-23 15:28 EDT, Joe Toomey CLA
no flags Details | Diff
Patch. (24.89 KB, patch)
2008-11-14 15:37 EST, Paul Slauenwhite CLA
no flags Details | Diff
Patch Errata - requires the first patch. (17.76 KB, patch)
2008-11-19 19:06 EST, Paul Slauenwhite CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Joe Toomey CLA 2007-05-07 20:56:26 EDT
This defect is to track the various performance problems that have been reported indirectly by consuming products.  I have a small amount of detail on the specific issues that have been encountered (importing from CSV), and I have one simple fix, and some additional observations about datapool performance.  I will comment on these in an additional note in this defect.

If you have experienced problems with TPTP datapool performance, please add some details to this defect.  Particularly, please specify what operations are performing poorly, and if possible, supply sample files with an example of the poor performance and an expectation of what performance you would expect to see.
Comment 1 Joe Toomey CLA 2007-05-07 21:09:37 EDT
I have investigated the poor performance of datapool import, and I have a simple fix in hand.

The root issue here is that we have a substantial body of code and underlying extensibility mechanism that was designed into TPTP datapools, but has not really been leveraged or validated by consuming products.  In the case of datapool import performance, over 90% of the time spent importing data was in inefficient code that was handling the absence of a registered handler for the imported datatype.  (This is noteworthy because we have no registered handlers at all in TPTP, so all data goes through this code path.)  When the setCellValue() call found no registered handlers, it threw a DatapoolException, which was caught by the catch block on the next line of code.  Of course, exceptions are a very inefficient way to manage flow control.  In the attached patch file, I have replaced the throw with the equivalent code to process a cell with no registered handler.  I also corrected an inefficient string operation.  With these changes, the performance of import from CSV is substantially improved.  In sample testing, a CSV file that previously took ~14 seconds to import now takes ~1 second.
Comment 2 Joe Toomey CLA 2007-05-07 21:11:03 EDT
Created attachment 66231 [details]
Patch for poor Datapool CSV import performance
Comment 3 Joe Toomey CLA 2007-05-07 21:15:16 EDT
Created attachment 66232 [details]
Patch for poor Datapool CSV import performance
Comment 4 Joe Toomey CLA 2007-05-07 21:23:35 EDT
This defect should also track the fact that import from CSV is a long running operation performed in the UI thread without a progress monitor or cancel mechanism.  This appears to be an artifact of the choice of "New Wizard" vs. "Import Wizard" as the user gesture used to create a new datapool by importing the contents of a CSV file.  The TPTP import facilities have progress monitors as part of their interfaces, while New wizards, which are typically not long running, do not.  There is also an Import datapool option, but it only allows importing of additional data from a CSV file into a datapool that already exists in the workspace.

My personal opinion is that this is counterintuitive, as all the Eclipse import wizards I am familiar with have the effect of creating resources in the workspace that were not there before the import.  Feel free to post your opinions on that issue.
Comment 5 Joe Toomey CLA 2007-05-07 21:52:41 EDT
Beyond import, I have also heard complaints about the time taken to open a datapool, both from the test runtime, and also from the workbench.  These complaints have largely been secondhand and without accompanying data.  I have done some light benchmarking, and I would appreciate feedback from those who have first hand knowledge of the issues.

Opening a datapool in the workbench involves first opening the EMF model in memory and then creating the UI to display that model in the workbench.  Opening a datapool from the test runtime involves only opening the EMF model in memory.  In some light benchmarking of these two operations, I have drawn a few conclusions:

1) The majority of the time spent opening the EMF model in memory (which is the entire cost of opening at runtime, and contributes to the cost of opening in the workbench) is spent in XML parsing.  In one example, a 650K datapool (which is actually 56MB of compressed data, imported from a 48MB CSV file) takes ~7 seconds to open in memory.

2) The majority of time spent opening a datapool in the workbench is not spent opening the model in memory, but rather is spent constructing the UI to display the model in the workbench.  For the same example file above, after taking ~7 seconds to open the model in memory, it took ~43 seconds to construct the complete SWT table.  This process is inefficient because each row is represented in SWT by a real widget in the underlying operating system.  In windows, for example, as each row is added to the table, the windows message pump is used to process the creation of the underlying OS row and each cell within it.

The first observation I'd make is that 7 seconds doesn't seem like a terribly long time to read a 50MB into memory and parse its XML contents.  Opening the source CSV file in Excel takes ~5 seconds.  Of course we could do much better if we did not read the entire file into memory, and somewhat better still if we did not use XML for persistence.  But on the whole, I'm not yet convinced that the performance of datapool open (just reading the model into memory) is that bad.  I am interested to hear if people disagree, and I'd especially appreciate sample datapools that exhibit the poor behavior, and an expectation of what performance would be acceptable.

The second observation I'd make is that we could substantially improve the time taken to open a datapool in the workbench by using the SWT Virtual Table support.  Virtual table support is much simpler to implement than Virtual Tree support (which we have some recent experience with.)  Although there would be a small lag in performance when scrolling through datapools using the virtual table, the performance could be improved quite substantially -- I would suggest that the 43 seconds could probably be reduced to less than 5 seconds (for the UI portion of the open operation.)
Comment 6 Paul Slauenwhite CLA 2007-05-10 09:45:32 EDT
Joe, please provide a sizing.
Comment 7 Joe Toomey CLA 2007-05-10 10:32:53 EDT
Added sizing
Comment 8 Joe Toomey CLA 2007-05-23 15:28:43 EDT
Created attachment 68430 [details]
Updated patch with additional optimization -- requires validation!

This patch contains the above patch, plus an additional optimzation to remove an O(n*m) algorithm in the DatapoolTable class.  This second code change needs to be validated by running the complete datapool editor test suite to ensure that the logic is sound.  If it is not, we can still optimize this code path by keeping a map of variable names to column indexes.
Comment 9 Paul Slauenwhite CLA 2008-04-10 06:43:15 EDT
Note, there have been numerous performance improvements to the datapool import operation introduced under defect https://bugs.eclipse.org/bugs/show_bug.cgi?id=213329.
Comment 10 Paul Slauenwhite CLA 2008-11-11 15:52:25 EST
(In reply to comment #1)
> I have investigated the poor performance of datapool import, and I have a
> simple fix in hand.
> 
> The root issue here is that we have a substantial body of code and underlying
> extensibility mechanism that was designed into TPTP datapools, but has not
> really been leveraged or validated by consuming products.  In the case of
> datapool import performance, over 90% of the time spent importing data was in
> inefficient code that was handling the absence of a registered handler for the
> imported datatype.  (This is noteworthy because we have no registered handlers
> at all in TPTP, so all data goes through this code path.)  When the
> setCellValue() call found no registered handlers, it threw a DatapoolException,
> which was caught by the catch block on the next line of code.  Of course,
> exceptions are a very inefficient way to manage flow control.  In the attached
> patch file, I have replaced the throw with the equivalent code to process a
> cell with no registered handler.  I also corrected an inefficient string
> operation.  With these changes, the performance of import from CSV is
> substantially improved.  In sample testing, a CSV file that previously took ~14
> seconds to import now takes ~1 second.
> 

Covered under defect #249588.
Comment 11 Paul Slauenwhite CLA 2008-11-14 15:37:59 EST
Created attachment 117941 [details]
Patch.
Comment 12 Paul Slauenwhite CLA 2008-11-14 16:02:58 EST
A couple of points:

1)The issues with performance of the datapool import operation have been addressed under defect #249588.

1)There are no apparent performance improvements to be made to the datapool load (and save) operations (see https://bugs.eclipse.org/bugs/show_bug.cgi?id=249588#c5).

3) After reading up on virtual tables, they are only recommended for:

-A table displaying static, read-only content.
-A table displaying mutable content in the following manner:
  -deletes may occur anywhere.
  -updates should occur only in the current view.
  -adds happen ONLY at the end.

Given the above, the /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/control/DatapoolTable.java class is quite complex (~3500 LOC), the performance improvements provided by the attached patch, and the current resources of the project, we will not explore prototyping this solution at this time.

4) This defect is covered by the 'Test with Sample CSV files (scalability and performance)' test case in /test/org.eclipse.hyades.test.ui.datapool.tests/manual/datapool/Test.UI.DatapoolEditor_importExport.testsuite.

5) The attached patch improves opening a datapool in the Datapool Editor by ~65%, on average.  Discounting the fixed cost to load the datapool, The attached patch improves opening a datapool in the Datapool Editor by ~94%, on average.  For example, using a 12 MB CSV file(255 columns/5249 rows), which imported is ~3.65 MB compressed and ~135 MB uncompressed, as a datapool.  Here are the details:

Before:
Total time to import the datapool (includes parsing/saving/loading/opening): 216.75 s
Time to load the datapool: 64.5 s
Time to open the datapool (includes loafing/opening): 195 s
Time to open the datapool (includes only opening): 130.5 s

After:
Total time to import the datapool (includes parsing/saving/loading/opening): 88.5 s
Time to load the datapool: 59.75 s
Time to open the datapool (includes loafing/opening): 67.5 s
Time to open the datapool (includes only opening): 7.75 s


Jerome, please review the attached patch.
Comment 13 Bozier jerome CLA 2008-11-17 05:00:39 EST
patch reviewed and approved
Comment 14 Paul Slauenwhite CLA 2008-11-18 11:30:56 EST
Patch checked-in to CVS (HEAD).
Comment 15 Paul Slauenwhite CLA 2008-11-19 19:06:12 EST
Created attachment 118328 [details]
Patch Errata - requires the first patch.

After testing the first patch, it was found that it removed the equivalence class and equivalence class index data elements from the TableItems in the datapool.  However, the edit/move row dialog referenced the equivalence class data element, albeit using a different key constant.  As a result, a NPE is thrown when opening the edit/move row dialog in the DatapoolTable.  This patch resolves this NPE and isolates the key constants for all of the data elements to DatapoolTableUtil.
Comment 16 Paul Slauenwhite CLA 2008-11-19 19:06:56 EST
Jerome, can you please review the second patch (errata)?
Comment 17 Bozier jerome CLA 2008-11-20 04:01:03 EST
patch reviewed and approved
Comment 18 Paul Slauenwhite CLA 2008-11-20 06:36:34 EST
(In reply to comment #17)
> patch reviewed and approved
> 

Patch checked in to CVS (HEAD).
Comment 19 Paul Slauenwhite CLA 2009-01-26 10:30:25 EST
Originator:  Please verify this defect BEFORE Wednesday, February 3, 2009, to allow time to resolve any outstanding issues before the end of TPTP 4.5.2 I3 Test Pass 1 (see http://www.eclipse.org/tptp/home/project_info/releaseinfo/4.5.2/schedule.html). 
Comment 20 Paul Slauenwhite CLA 2010-03-10 08:35:47 EST
As of TPTP 4.6.0, TPTP is in maintenance mode and focusing on improving quality by resolving relevant enhancements/defects and increasing test coverage through test creation, automation, Build Verification Tests (BVTs), and expanded run-time execution. As part of the TPTP Bugzilla housecleaning process (see http://wiki.eclipse.org/Bugzilla_Housecleaning_Processes), this enhancement/defect is verified/closed by the Project Lead since this enhancement/defect has been resolved and unverified for more than 1 year and considered to be fixed. If this enhancement/defect is still unresolved and reproducible in the latest TPTP release (http://www.eclipse.org/tptp/home/downloads/), please re-open.