Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 249588

Summary: Datapool (CSV) import/export wizard/operation is slow when importing large CSV files.
Product: z_Archived Reporter: Paul Slauenwhite <paulslau>
Component: TPTPAssignee: Paul Slauenwhite <paulslau>
Status: CLOSED FIXED QA Contact:
Severity: normal    
Priority: P2 CC: jerome.bozier, paulslau
Version: unspecifiedFlags: jerome.bozier: review+
Target Milestone: ---   
Hardware: All   
OS: All   
Whiteboard:
Bug Depends on: 249587    
Bug Blocks:    
Attachments:
Description Flags
Patch (Part 1).
none
Patch.
none
Profile project.
none
Sample Datapool. none

Description Paul Slauenwhite CLA 2008-10-03 07:45:16 EDT
Datapool (CSV) import/export wizard/operation is slow when importing large CSV files.

When CSV files are imported into the workspace as datapools and datapools are exported to the file system as CSV files, the /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/util/CSVImportExportUtil.java class is called from the Datapool (CSV) import/export wizard.  There are several potential performance improvements for this class, specifically:

org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.importCSV(IDatapool, String, boolean, boolean, String, int, int)

org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.exportCSV(IDatapool, String, boolean, boolean, boolean, String)

For example, the org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.createRecord(IDatapool, String, boolean) method (called from importCSV(IDatapool, String, boolean, boolean, String, int, int)), validates each line of the CSV file before creating a record with the line.  This validation should be done once per datapool using the org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.validateCSVFile(String, boolean, boolean, String) method.

In addition, pre-compiled regular expressions should be used to parse each line of the CSV file instead of using /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/util/CSVTokenizer.java.

In addition, the call to EMFUtil.save(Resource) in org.eclipse.hyades.test.ui.internal.wizard.DatapoolImportWizard.performFinish().new WorkspaceModifyOperation() {...}.execute(IProgressMonitor) is costly and the TPTP/EMF code needs to be profiled to find any potential gains in the performance.

Finally, there are several less than optimal loops, array/list copy operations, and string parsing in /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/util/CSVImportExportUtil.java.
Comment 1 Paul Slauenwhite CLA 2008-10-03 07:48:36 EDT
This defect requires a new test case.
Comment 2 Paul Slauenwhite CLA 2008-10-31 12:50:19 EDT
After some refactoring (e.g. password validation moved to the wizard page) and the patch from defect #249587, the main costs for the datapool import operation is:

1) Converting a CSV file to a datapool:  When importing a datapool, this operation accounts for ~48.5% of the total time, on average.  For example, on average for a 12 MB CSV file this operation is 24.8s of the total 51.1s.

2) Saving the datapool:  When importing a datapool, this operation accounts for ~51.3% of the total time, on average.  For example, on average for a 12 MB CSV file this operation is 26.2s of the total 51.1s.
Comment 3 Paul Slauenwhite CLA 2008-10-31 13:33:39 EDT
(In reply to comment #2)
> After some refactoring (e.g. password validation moved to the wizard page) and
> the patch from defect #249587, the main costs for the datapool import operation
> is:
> 
> 1) Converting a CSV file to a datapool:  When importing a datapool, this
> operation accounts for ~48.5% of the total time, on average.  For example, on
> average for a 12 MB CSV file this operation is 24.8s of the total 51.1s.
> 
> 2) Saving the datapool:  When importing a datapool, this operation accounts for
> ~51.3% of the total time, on average.  For example, on average for a 12 MB CSV
> file this operation is 26.2s of the total 51.1s.
> 

In stand-alone tests, on average (10 runs + warm up) for a 12 MB CSV file calling the org.eclipse.hyades.test.core.util.EMFUtil.save(Resource) method takes 24s.
Comment 4 Paul Slauenwhite CLA 2008-11-04 09:53:23 EST
Created attachment 116938 [details]
Patch (Part 1).

(In reply to comment #2)
> After some refactoring (e.g. password validation moved to the wizard page)...

Patch for these changes (based on changed delivered under defect #249587).
Comment 5 Paul Slauenwhite CLA 2008-11-11 16:07:39 EST
(In reply to comment #3)
> (In reply to comment #2)
> > After some refactoring (e.g. password validation moved to the wizard page) and
> > the patch from defect #249587, the main costs for the datapool import operation
> > is:
> > 
> > 1) Converting a CSV file to a datapool:  When importing a datapool, this
> > operation accounts for ~48.5% of the total time, on average.  For example, on
> > average for a 12 MB CSV file this operation is 24.8s of the total 51.1s.
> > 
> > 2) Saving the datapool:  When importing a datapool, this operation accounts for
> > ~51.3% of the total time, on average.  For example, on average for a 12 MB CSV
> > file this operation is 26.2s of the total 51.1s.
> > 
> 
> In stand-alone tests, on average (10 runs + warm up) for a 12 MB CSV file
> calling the org.eclipse.hyades.test.core.util.EMFUtil.save(Resource) method
> takes 24s.
> 

After profiling both loading (org.eclipse.hyades.test.core.util.EMFUtil.load(ResourceSet, String)) and saving (org.eclipse.hyades.test.core.util.EMFUtil.save(Resource)) datapools, no performance improvements were found in the TPTP datapool model code.  Unfortunately, the time to load/save a datapool cannot be improved due to the characteristics of EMF.
Comment 6 Paul Slauenwhite CLA 2008-11-11 16:11:08 EST
(In reply to comment #0)

> For example, the
> org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.createRecord(IDatapool,
> String, boolean) method (called from importCSV(IDatapool, String, boolean,
> boolean, String, int, int)), validates each line of the CSV file before
> creating a record with the line.  This validation should be done once per
> datapool using the
> org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.validateCSVFile(String,
> boolean, boolean, String) method.

Fixed.
 
> In addition, the call to EMFUtil.save(Resource) in
> org.eclipse.hyades.test.ui.internal.wizard.DatapoolImportWizard.performFinish().new
> WorkspaceModifyOperation() {...}.execute(IProgressMonitor) is costly and the
> TPTP/EMF code needs to be profiled to find any potential gains in the
> performance.

See comment #5

> Finally, there are several less than optimal loops, array/list copy operations,
> and string parsing in
> /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/util/CSVImportExportUtil.java.

Fixed.

Comment 7 Paul Slauenwhite CLA 2008-11-11 18:27:23 EST
Created attachment 117601 [details]
Patch.
Comment 8 Paul Slauenwhite CLA 2008-11-11 18:28:44 EST
Jerome, please review the attached patch.
Comment 9 Paul Slauenwhite CLA 2008-11-11 18:48:51 EST
(In reply to comment #0)
> In addition, pre-compiled regular expressions should be used to parse each line
> of the CSV file instead of using
> /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/util/CSVTokenizer.java.

Given the performance improvements provided with this patch, the stability/maturity of CSVTokenizer.java, and the complexity of this parsing algorithm, we will not refactor this class.

(In reply to comment #1)
> This defect requires a new test case.
> 

Covered by /test/org.eclipse.hyades.test.ui.datapool.tests/manual/datapool/Test.UI.DatapoolEditor_importExport.testsuite (Test with Sample CSV files (scalability and performance)).

(In reply to comment #2)
> After some refactoring (e.g. password validation moved to the wizard page) and
> the patch from defect #249587, the main costs for the datapool import operation
> is:
> 
> 1) Converting a CSV file to a datapool:  When importing a datapool, this
> operation accounts for ~48.5% of the total time, on average.  For example, on
> average for a 12 MB CSV file this operation is 24.8s of the total 51.1s.

With this patch, this operation now accounts for ~21.7% of the total time, on average.  For example, on average for a 12 MB CSV file this operation now is 5s of the total 23s.

(In reply to comment #5)
> After profiling both loading
> (org.eclipse.hyades.test.core.util.EMFUtil.load(ResourceSet, String)) and
> saving (org.eclipse.hyades.test.core.util.EMFUtil.save(Resource)) datapools, no
> performance improvements were found in the TPTP datapool model code. 
> Unfortunately, the time to load/save a datapool cannot be improved due to the
> characteristics of EMF.

Interestingly, with this patch, this operation now accounts for ~78.3% of the total time, on average.  For example, on average for a 12 MB CSV file this operation now is 18s of the total 23s.
 

Comment 10 Bozier jerome CLA 2008-11-17 05:10:23 EST
patch reviewed and approved
sorry for delay, big patch that needed some time for review
Comment 11 Paul Slauenwhite CLA 2008-11-18 11:28:07 EST
Patch checked-in to CVS (HEAD).
Comment 12 Paul Slauenwhite CLA 2008-11-18 12:39:02 EST
(In reply to comment #5)

> After profiling both loading
> (org.eclipse.hyades.test.core.util.EMFUtil.load(ResourceSet, String)) and
> saving (org.eclipse.hyades.test.core.util.EMFUtil.save(Resource)) datapools, no
> performance improvements were found in the TPTP datapool model code. 
> Unfortunately, the time to load/save a datapool cannot be improved due to the
> characteristics of EMF.

Some further observations on the load operation from the enclosed trace from profiling the enclosed datapool (12 MB CSV file with 255 columns/5249 rows, which imported is ~3.65 MB compressed and ~135 MB uncompressed, as a datapool):

-A bulk of the time (~95%) is spent in EMF code:
    -~65% is parsing the XMI.
    -~30% is reading the model from the file system, decompressing the contents, and populating the model.

-The TPTP model code accounts for ~5% of the time.  The main culprets include:
    -DPLCellImpl:  Extended EMF methods (e.g. eStaticClass(), eSet(), etc.) and constructpor/setters (value/varaible) are consuming the time.  There is nothing that can be improved here.
    -Comon_DatapoolFactoryImpl: Creating model objects is consuming the time.  There is nothing that can be improved here.
    -DPLRecordImpl:  Extended EMF methods (e.g. eStaticClass(), eSet(), eIsSet(), etc.) and constructpor/getter (cell) are consuming the time.  There is nothing that can be improved here.
    -FacadeResourceImpl.attached()/DatapoolFacadeResourceImpl.attached(): Attempted to refactor the code in these methods to a functionally equivient but more effecitent implemetnation but with little overall effect on the total time.
    -FacadeResourceImpl.getEObjectByID()/DatapoolFacadeResourceImpl.getEObjectByID(): Map lookup time that cannnot be imporved.

-Since uniqueness is not required for most (if not all) of the lists in the datapool model, I tried setting the isUnique property to false (for example, org.eclipse.hyades.models.common.testprofile.impl.TPFVerdictListImpl.getVerdictEvents().new EObjectResolvingEList() {...}.isUnique()), with little overall effect on the total time.


Comment 13 Paul Slauenwhite CLA 2008-11-18 12:39:37 EST
Created attachment 118161 [details]
Profile project.
Comment 14 Paul Slauenwhite CLA 2008-11-18 12:40:00 EST
Created attachment 118162 [details]
Sample Datapool.
Comment 15 Paul Slauenwhite CLA 2008-12-22 10:42:08 EST
Verified in TPTP-4.5.2-200812010100.

Closing.