| Summary: | Umlauts in zip files lead to errors while crawling and the files will not be processed. | | |
|---|---|---|---|
| Product: | z_Archived | Reporter: | Andreas Schank <andreas.schank> |
| Component: | Smila | Assignee: | Andreas Schank <andreas.schank> |
| Status: | CLOSED FIXED | QA Contact: | |
| Severity: | enhancement | | |
| Priority: | P5 | | |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows 7 | | |
| Whiteboard: | | | |
Description
Andreas Schank
Andreas, could you please give an estimate of the effort for using the commons-compress library?

Tested it again. To describe this error more precisely: the zipped entries in question are simply not indexed but reported as an error as shown below. The error is listed in the JConsole Crawler's error buffer and the files are counted as exceptions (although not as critical exceptions). Other files within the same zip are indexed, though. The crawler continues working.

--- 2011-10-11 14:18:26.700 ---
org.eclipse.smila.connectivity.framework.CrawlerException: Error reading content of ZipEntry 'pfad im zip/noch'n pfad/_?.tx' of record id file:<Path=c:\data\doppelzip.zip>
    at org.eclipse.smila.connectivity.framework.compound.zip.ZipCompoundCrawler.readZipEntryContent(ZipCompoundCrawler.java:394)
    at org.eclipse.smila.connectivity.framework.compound.zip.ZipCompoundCrawler.readAttribute(ZipCompoundCrawler.java:432)
    at org.eclipse.smila.connectivity.framework.compound.zip.ZipCompoundCrawler.readAttachment(ZipCompoundCrawler.java:466)
    at org.eclipse.smila.connectivity.framework.compound.zip.ZipCompoundCrawler.getAttachment(ZipCompoundCrawler.java:328)
    at org.eclipse.smila.connectivity.framework.util.internal.DataReferenceImpl.getRecord(DataReferenceImpl.java:131)
    at org.eclipse.smila.connectivity.framework.impl.CrawlThread.updateDataReference(CrawlThread.java:389)
    at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReference(CrawlThread.java:342)
    at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReferences(CrawlThread.java:308)
    at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processCompounds(CrawlThread.java:506)
    at org.eclipse.smila.connectivity.framework.impl.CrawlThread.updateDataReference(CrawlThread.java:401)
    at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReference(CrawlThread.java:342)
    at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReferences(CrawlThread.java:308)
    at org.eclipse.smila.connectivity.framework.impl.CrawlThread.run(CrawlThread.java:194)
Caused by: java.lang.NullPointerException
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1025)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
    at org.eclipse.smila.connectivity.framework.compound.zip.ZipCompoundCrawler.readZipEntryContent(ZipCompoundCrawler.java:376)
    ... 12 more

To be more precise about the suggested solutions: use Java 7 (but since the workaround must use API calls that are not present in Java 6, this solution would prevent SMILA from running in a Java 6 environment) or use commons-compress (which is not in Orbit, though). Both solutions would have to guess which charset to use (in my tests, "CP850" always did the job, but I do not know how one can be sure about that, so the solution would be to guess and extract until an IllegalArgumentException shows that the wrong charset was used). Maybe there is another way, but I do not know one :-(

Unzip also assumes that ZIP files not compressed on Unix systems use CP850 (http://www.linuxfromscratch.org/blfs/view/cvs/general/unzip.html), whereas the ZIP format specification (http://www.pkware.com/documents/casestudies/APPNOTE.TXT) only allows IBM Code Page 437 or UTF-8. Code page 850 (http://en.wikipedia.org/wiki/Codepage_850) is a modification of code page 437, so maybe both code pages could be tried before finally giving up, or maybe trying CP437 is sufficient. This would have to be tested. See also: http://blogs.oracle.com/xuemingshen/entry/non_utf_8_encoding_in.

Thank you, Andreas, for the information. Since other files in the ZIP archive are being indexed, I do not see a big problem or an important missing feature here. Therefore this issue will remain an enhancement with very low priority.

There are two seemingly undocumented properties that the Java 6 zip implementation uses: sun.zip.altEncoding and sun.zip.encoding. So using -Dsun.zip.altEncoding=CP850 as a VM argument when starting SMILA should work fine with (DOS-generated) zip files containing non-UTF-8 umlauts.

This does not work in SMILA, though, since the parameters do not affect ZipFile, only ZipInputStream instances, and SMILA's crawler uses ZipFile (at least at the moment).

This has been resolved since the switch to Java 7. The encoding to use has to be configured (configuration parameter zip.encoding for bundle org.eclipse.smila.importing.compounds.simple). Also see http://wiki.eclipse.org/SMILA/Documentation/Importing/CompoundExtractorService

Closing this.
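For illustration, the "guess and extract until an exception shows the wrong charset was used" idea from the discussion could look roughly like the sketch below, which uses the Java 7 `ZipFile(File, Charset)` constructor. This is not SMILA's actual implementation: the class name `ZipNameCharsetGuess`, its methods, and the candidate list (UTF-8 first, then IBM850) are assumptions made for this example. It also assumes that the JDK in use ships the IBM850 charset, and that a wrong charset surfaces either as an `IllegalArgumentException` while listing entries or as a `ZipException` while opening the archive, depending on the Java version.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of SMILA.
class ZipNameCharsetGuess {

    // Assumed candidate order. UTF-8 has to come first: a single-byte code
    // page such as CP850 can decode almost any byte sequence, so it would
    // "succeed" even for names that are really UTF-8.
    static final List<Charset> CANDIDATES =
            Arrays.asList(StandardCharsets.UTF_8, Charset.forName("IBM850"));

    // Lists entry names, retrying with the next charset when decoding fails.
    static List<String> listEntryNames(File zip) throws IOException {
        Exception last = null;
        for (Charset cs : CANDIDATES) {
            try (ZipFile zf = new ZipFile(zip, cs)) {
                List<String> names = new ArrayList<>();
                Enumeration<? extends ZipEntry> entries = zf.entries();
                while (entries.hasMoreElements()) {
                    names.add(entries.nextElement().getName());
                }
                return names;
            } catch (IllegalArgumentException | ZipException e) {
                // Entry names could not be decoded with this charset
                // (or the archive is damaged); try the next candidate.
                last = e;
            }
        }
        throw new IOException("No candidate charset worked for " + zip, last);
    }

    // Writes a one-entry archive whose entry name is encoded with 'cs'.
    static File writeSample(Charset cs, String entryName) throws IOException {
        File f = File.createTempFile("umlaut", ".zip");
        f.deleteOnExit();
        try (ZipOutputStream out =
                new ZipOutputStream(Files.newOutputStream(f.toPath()), cs)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write("hello".getBytes(StandardCharsets.US_ASCII));
            out.closeEntry();
        }
        return f;
    }

    // Round trip: write an entry named "ä.txt" in CP850, then list it.
    static List<String> demo() {
        try {
            File zip = writeSample(Charset.forName("IBM850"), "\u00e4.txt");
            return listEntryNames(zip);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

As noted in the discussion, such a fallback remains a guess: CP437 and CP850 differ in some code points, so a name that decodes without an exception is not necessarily decoded correctly. A configurable encoding, as with the zip.encoding parameter mentioned above, avoids the guessing.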