| Summary: | Datapool (CSV) import/export wizard/operation is slow when importing large CSV files. | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | z_Archived | Reporter: | Paul Slauenwhite <paulslau> | ||||||||||
| Component: | TPTP | Assignee: | Paul Slauenwhite <paulslau> | ||||||||||
| Status: | CLOSED FIXED | QA Contact: | |||||||||||
| Severity: | normal | ||||||||||||
| Priority: | P2 | CC: | jerome.bozier, paulslau | ||||||||||
| Version: | unspecified | Flags: | jerome.bozier:
review+
|
||||||||||
| Target Milestone: | --- | ||||||||||||
| Hardware: | All | ||||||||||||
| OS: | All | ||||||||||||
| Whiteboard: | |||||||||||||
| Bug Depends on: | 249587 | ||||||||||||
| Bug Blocks: | |||||||||||||
| Attachments: |
|
||||||||||||
This defect requires a new test case. After some refactoring (e.g. password validation moved to the wizard page) and the patch from defect #249587, the main costs for the datapool import operation is: 1) Converting a CSV file to a datapool: When importing a datapool, this operation accounts for ~48.5% of the total time, on average. For example, on average for a 12 MB CSV file this operation is 24.8s of the total 51.1s. 2) Saving the datapool: When importing a datapool, this operation accounts for ~51.3% of the total time, on average. For example, on average for a 12 MB CSV file this operation is 26.2s of the total 51.1s. (In reply to comment #2) > After some refactoring (e.g. password validation moved to the wizard page) and > the patch from defect #249587, the main costs for the datapool import operation > is: > > 1) Converting a CSV file to a datapool: When importing a datapool, this > operation accounts for ~48.5% of the total time, on average. For example, on > average for a 12 MB CSV file this operation is 24.8s of the total 51.1s. > > 2) Saving the datapool: When importing a datapool, this operation accounts for > ~51.3% of the total time, on average. For example, on average for a 12 MB CSV > file this operation is 26.2s of the total 51.1s. > In stand-alone tests, on average (10 runs + warm up) for a 12 MB CSV file calling the org.eclipse.hyades.test.core.util.EMFUtil.save(Resource) method takes 24s. Created attachment 116938 [details] Patch (Part 1). (In reply to comment #2) > After some refactoring (e.g. password validation moved to the wizard page)... Patch for these changes (based on changed delivered under defect #249587). (In reply to comment #3) > (In reply to comment #2) > > After some refactoring (e.g. password validation moved to the wizard page) and > > the patch from defect #249587, the main costs for the datapool import operation > > is: > > > > 1) Converting a CSV file to a datapool: When importing a datapool, this > > operation accounts for ~48.5% of the total time, on average. For example, on > > average for a 12 MB CSV file this operation is 24.8s of the total 51.1s. > > > > 2) Saving the datapool: When importing a datapool, this operation accounts for > > ~51.3% of the total time, on average. For example, on average for a 12 MB CSV > > file this operation is 26.2s of the total 51.1s. > > > > In stand-alone tests, on average (10 runs + warm up) for a 12 MB CSV file > calling the org.eclipse.hyades.test.core.util.EMFUtil.save(Resource) method > takes 24s. > After profiling both loading (org.eclipse.hyades.test.core.util.EMFUtil.load(ResourceSet, String)) and saving (org.eclipse.hyades.test.core.util.EMFUtil.save(Resource)) datapools, no performance improvements were found in the TPTP datapool model code. Unfortunately, the time to load/save a datapool cannot be improved due to the characteristics of EMF. (In reply to comment #0) > For example, the > org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.createRecord(IDatapool, > String, boolean) method (called from importCSV(IDatapool, String, boolean, > boolean, String, int, int)), validates each line of the CSV file before > creating a record with the line. This validation should be done once per > datapool using the > org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.validateCSVFile(String, > boolean, boolean, String) method. Fixed. > In addition, the call to EMFUtil.save(Resource) in > org.eclipse.hyades.test.ui.internal.wizard.DatapoolImportWizard.performFinish().new > WorkspaceModifyOperation() {...}.execute(IProgressMonitor) is costly and the > TPTP/EMF code needs to be profiled to find any potential gains in the > performance. See comment #5 > Finally, there are several less than optimal loops, array/list copy operations, > and string parsing in > /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/util/CSVImportExportUtil.java. Fixed. Created attachment 117601 [details]
Patch.
Jerome, please review the attached patch. (In reply to comment #0) > In addition, pre-compiled regular expressions should be used to parse each line > of the CSV file instead of using > /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/util/CSVTokenizer.java. Given the performance improvements provided with this patch, the stability/maturity of CSVTokenizer.java, and the complexity of this parsing algorithm, we will not refactor this class. (In reply to comment #1) > This defect requires a new test case. > Covered by /test/org.eclipse.hyades.test.ui.datapool.tests/manual/datapool/Test.UI.DatapoolEditor_importExport.testsuite (Test with Sample CSV files (scalability and performance)). (In reply to comment #2) > After some refactoring (e.g. password validation moved to the wizard page) and > the patch from defect #249587, the main costs for the datapool import operation > is: > > 1) Converting a CSV file to a datapool: When importing a datapool, this > operation accounts for ~48.5% of the total time, on average. For example, on > average for a 12 MB CSV file this operation is 24.8s of the total 51.1s. With this patch, this operation now accounts for ~21.7% of the total time, on average. For example, on average for a 12 MB CSV file this operation now is 5s of the total 23s. (In reply to comment #5) > After profiling both loading > (org.eclipse.hyades.test.core.util.EMFUtil.load(ResourceSet, String)) and > saving (org.eclipse.hyades.test.core.util.EMFUtil.save(Resource)) datapools, no > performance improvements were found in the TPTP datapool model code. > Unfortunately, the time to load/save a datapool cannot be improved due to the > characteristics of EMF. Interestingly, with this patch, this operation now accounts for ~78.3% of the total time, on average. For example, on average for a 12 MB CSV file this operation now is 18s of the total 23s. patch reviewed and approved sorry for delay, big patch that needed some time for review Patch checked-in to CVS (HEAD). (In reply to comment #5) > After profiling both loading > (org.eclipse.hyades.test.core.util.EMFUtil.load(ResourceSet, String)) and > saving (org.eclipse.hyades.test.core.util.EMFUtil.save(Resource)) datapools, no > performance improvements were found in the TPTP datapool model code. > Unfortunately, the time to load/save a datapool cannot be improved due to the > characteristics of EMF. Some further observations on the load operation from the enclosed trace from profiling the enclosed datapool (12 MB CSV file with 255 columns/5249 rows, which imported is ~3.65 MB compressed and ~135 MB uncompressed, as a datapool): -A bulk of the time (~95%) is spent in EMF code: -~65% is parsing the XMI. -~30% is reading the model from the file system, decompressing the contents, and populating the model. -The TPTP model code accounts for ~5% of the time. The main culprets include: -DPLCellImpl: Extended EMF methods (e.g. eStaticClass(), eSet(), etc.) and constructpor/setters (value/varaible) are consuming the time. There is nothing that can be improved here. -Comon_DatapoolFactoryImpl: Creating model objects is consuming the time. There is nothing that can be improved here. -DPLRecordImpl: Extended EMF methods (e.g. eStaticClass(), eSet(), eIsSet(), etc.) and constructpor/getter (cell) are consuming the time. There is nothing that can be improved here. -FacadeResourceImpl.attached()/DatapoolFacadeResourceImpl.attached(): Attempted to refactor the code in these methods to a functionally equivient but more effecitent implemetnation but with little overall effect on the total time. -FacadeResourceImpl.getEObjectByID()/DatapoolFacadeResourceImpl.getEObjectByID(): Map lookup time that cannnot be imporved. -Since uniqueness is not required for most (if not all) of the lists in the datapool model, I tried setting the isUnique property to false (for example, org.eclipse.hyades.models.common.testprofile.impl.TPFVerdictListImpl.getVerdictEvents().new EObjectResolvingEList() {...}.isUnique()), with little overall effect on the total time. Created attachment 118161 [details]
Profile project.
Created attachment 118162 [details]
Sample Datapool.
Verified in TPTP-4.5.2-200812010100. Closing. |
Datapool (CSV) import/export wizard/operation is slow when importing large CSV files. When CSV files are imported into the workspace as datapools and datapools are exported to the file system as CSV files, the /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/util/CSVImportExportUtil.java class is called from the Datapool (CSV) import/export wizard. There are several potential performance improvements for this class, specifically: org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.importCSV(IDatapool, String, boolean, boolean, String, int, int) org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.exportCSV(IDatapool, String, boolean, boolean, boolean, String) For example, the org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.createRecord(IDatapool, String, boolean) method (called from importCSV(IDatapool, String, boolean, boolean, String, int, int)), validates each line of the CSV file before creating a record with the line. This validation should be done once per datapool using the org.eclipse.hyades.test.ui.datapool.internal.util.CSVImportExportUtil.validateCSVFile(String, boolean, boolean, String) method. In addition, pre-compiled regular expressions should be used to parse each line of the CSV file instead of using /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/util/CSVTokenizer.java. In addition, the call to EMFUtil.save(Resource) in org.eclipse.hyades.test.ui.internal.wizard.DatapoolImportWizard.performFinish().new WorkspaceModifyOperation() {...}.execute(IProgressMonitor) is costly and the TPTP/EMF code needs to be profiled to find any potential gains in the performance. Finally, there are several less than optimal loops, array/list copy operations, and string parsing in /org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/datapool/internal/util/CSVImportExportUtil.java.