Bug 338905 - Umlauts in zip files lead to errors while crawling and the files will not be processed.
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: Smila
Version: unspecified
Hardware: PC Windows 7
Importance: P5 enhancement
Target Milestone: ---
Assignee: Andreas Schank CLA
Reported: 2011-03-04 04:08 EST by Andreas Schank CLA
Modified: 2022-07-07 11:31 EDT
Description Andreas Schank CLA 2011-03-04 04:08:20 EST
Build Identifier: 0.7

A zip file containing files with umlauts in the file name will lead to errors during crawling.

The files will not be processed.

The problem is inherent in Java's zip handling, which fails when the file names are not encoded in UTF-8.
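
To illustrate why such names cannot be read as UTF-8, here is a minimal sketch that works on the raw name bytes only. It assumes the zip tool stored the name in CP850 (Java charset name "IBM850"); the class and method names are hypothetical and not part of SMILA.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ZipNameDecodingDemo {

    /** Returns true if the raw bytes form a valid UTF-8 sequence. */
    static boolean isValidUtf8(byte[] raw) {
        try {
            // A fresh decoder reports malformed input instead of replacing it.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "\u00fc.txt"; // "ü.txt", written with an escape to stay encoding-neutral
        // Assumption: the archive was written by a tool that uses CP850 for entry names.
        byte[] dosBytes = name.getBytes(Charset.forName("IBM850"));
        byte[] utf8Bytes = name.getBytes(StandardCharsets.UTF_8);
        System.out.println("CP850 bytes valid as UTF-8: " + isValidUtf8(dosBytes));
        System.out.println("UTF-8 bytes valid as UTF-8: " + isValidUtf8(utf8Bytes));
    }
}
```

In CP850 the umlaut is a single byte above 0x7F, which is not a valid UTF-8 sequence on its own, so a strict UTF-8 decoder rejects the name.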

A possible solution could be to use the commons-compress library, which is not limited to UTF-8 file names.

Reproducible: Always

Steps to Reproduce:
1. create a zip (e.g. with 7zip or the built-in zip capabilities of Windows Explorer) in the crawler directory (e.g. c:\data)
2. put a file with an umlaut in the file name into it
3. start crawler
Comment 1 Igor Novakovic CLA 2011-09-22 10:18:55 EDT
Andreas, could you please give an estimate of the effort for using the commons-compress library?
Comment 2 Andreas Schank CLA 2011-10-11 08:38:50 EDT
Tested it again.
To describe this error more precisely:
The zipped entries in question are simply not indexed but are reported as errors like the following. The error is listed in the crawler's error buffer in JConsole, and the files are counted as exceptions (although not critical ones).

Other files within the same zip are indexed, though. The crawler continues working.

--- 2011-10-11 14:18:26.700 ---
org.eclipse.smila.connectivity.framework.CrawlerException: Error reading content of ZipEntry 'pfad im zip/noch'n pfad/_?.tx' of record id file:<Path=c:\data\doppelzip.zip>
	at org.eclipse.smila.connectivity.framework.compound.zip.ZipCompoundCrawler.readZipEntryContent(ZipCompoundCrawler.java:394)
	at org.eclipse.smila.connectivity.framework.compound.zip.ZipCompoundCrawler.readAttribute(ZipCompoundCrawler.java:432)
	at org.eclipse.smila.connectivity.framework.compound.zip.ZipCompoundCrawler.readAttachment(ZipCompoundCrawler.java:466)
	at org.eclipse.smila.connectivity.framework.compound.zip.ZipCompoundCrawler.getAttachment(ZipCompoundCrawler.java:328)
	at org.eclipse.smila.connectivity.framework.util.internal.DataReferenceImpl.getRecord(DataReferenceImpl.java:131)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.updateDataReference(CrawlThread.java:389)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReference(CrawlThread.java:342)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReferences(CrawlThread.java:308)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processCompounds(CrawlThread.java:506)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.updateDataReference(CrawlThread.java:401)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReference(CrawlThread.java:342)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReferences(CrawlThread.java:308)
	at org.eclipse.smila.connectivity.framework.impl.CrawlThread.run(CrawlThread.java:194)
Caused by: java.lang.NullPointerException
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1025)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
	at org.eclipse.smila.connectivity.framework.compound.zip.ZipCompoundCrawler.readZipEntryContent(ZipCompoundCrawler.java:376)
	... 12 more
Comment 3 Andreas Schank CLA 2011-10-11 08:58:50 EDT
To be more precise about the suggested solutions:

Either use Java 7 (but since the workaround must use API calls which are not present in Java 6, this solution would prevent SMILA from being run in a Java 6 environment) or use commons-compress (which, however, is not in Orbit).

Both solutions would have to guess which charset to use. In my tests, "CP850" always did the job, but I do not know how one can be sure about that, so the solution would be to guess and extract until an IllegalArgumentException shows that the wrong charset was used. Maybe there is another way, but I do not know one :-(
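
The guess-and-retry idea could look roughly like the following sketch. It assumes the Java 7 constructor ZipFile(File, Charset); the class name ZipCharsetGuesser and its method are hypothetical, and this is not how SMILA implements it.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipCharsetGuesser {

    /**
     * Returns the first candidate charset with which all entry names can be
     * listed, or null if every candidate fails.
     */
    static Charset guessCharset(File zip, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try (ZipFile zf = new ZipFile(zip, candidate)) {
                Enumeration<? extends ZipEntry> entries = zf.entries();
                while (entries.hasMoreElements()) {
                    entries.nextElement().getName(); // forces decoding of the name
                }
                return candidate;
            } catch (IllegalArgumentException | IOException e) {
                // Names could not be decoded (or the archive could not be read)
                // with this charset, so try the next candidate.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Try strict UTF-8 first, then fall back to the DOS code page.
        List<Charset> candidates =
            Arrays.asList(StandardCharsets.UTF_8, Charset.forName("IBM850"));
        System.out.println(guessCharset(new File(args[0]), candidates));
    }
}
```

The order of candidates matters: UTF-8 rejects malformed names, whereas a single-byte code page such as CP850 accepts any byte sequence, so a successful read with CP850 does not prove that the guess was right.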

Unzip also assumes that ZIP files not created on Unix systems use CP850 (http://www.linuxfromscratch.org/blfs/view/cvs/general/unzip.html). The ZIP format specification (http://www.pkware.com/documents/casestudies/APPNOTE.TXT), however, only allows IBM Code Page 437 or UTF-8. Code page 850 (http://en.wikipedia.org/wiki/Codepage_850) is a modification of code page 437, so maybe both code pages could be tried before finally giving up, or maybe trying CP437 is sufficient. This would have to be tested.
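
Whether CP437 alone would be sufficient for German umlauts can be checked with a small comparison of the two code pages. This is only a sketch, the class name is hypothetical, and it relies on the charsets "IBM437" and "IBM850" being available in the running JVM.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class CodePageComparison {

    /** Returns true if the text encodes to identical bytes in both charsets. */
    static boolean sameEncoding(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("IBM437");
        Charset cp850 = Charset.forName("IBM850");

        String umlauts = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc"; // ä ö ü Ä Ö Ü
        System.out.println("Umlauts encoded identically: "
            + sameEncoding(umlauts, cp437, cp850));

        String other = "\u00d8"; // Ø is present in CP850 but has no mapping in CP437
        System.out.println("U+00D8 encoded identically: "
            + sameEncoding(other, cp437, cp850));
    }
}
```

The six umlaut letters share the same byte values in both code pages, so either charset decodes them the same way. Other characters differ, which is where the choice between CP437 and CP850 would start to matter.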

See also: http://blogs.oracle.com/xuemingshen/entry/non_utf_8_encoding_in.
Comment 4 Igor Novakovic CLA 2011-10-11 10:06:04 EDT
Thank you Andreas for the information.
Since other files in the ZIP archive are being indexed, I do not see a big problem or important missing feature here. Therefore this issue will remain an enhancement with very low priority.
Comment 5 Andreas Schank CLA 2012-02-13 10:56:42 EST
There are two seemingly undocumented properties that the Java 6 zip implementation uses:
sun.zip.altEncoding and sun.zip.encoding.

So using -Dsun.zip.altEncoding=CP850 as a VM argument when starting SMILA should work fine with (DOS-generated) zip files containing non-UTF-8 umlauts.
Comment 6 Andreas Schank CLA 2012-02-13 11:25:19 EST
This does not work in SMILA, though, since the parameters don't affect ZipFile instances, only ZipInputStream instances, and SMILA's crawler uses ZipFile (at least at the moment).
Comment 7 Andreas Schank CLA 2012-10-17 06:30:51 EDT
Has been resolved since the switch to Java 7.

The encoding to use has to be configured (configuration parameter zip.encoding for bundle org.eclipse.smila.importing.compounds.simple).
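
For readers unfamiliar with the Java 7 API, the following sketch shows how a configured encoding name can be passed to ZipFile. It is not the SMILA implementation; the class ConfiguredZipReader and its method are hypothetical, and the encoding name is assumed to come from a setting such as zip.encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ConfiguredZipReader {

    /** Reads all file entries, decoding entry names with the given encoding. */
    static Map<String, byte[]> readEntries(File zip, String encodingName) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        // ZipFile(File, Charset) exists since Java 7.
        try (ZipFile zf = new ZipFile(zip, Charset.forName(encodingName))) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                try (InputStream in = zf.getInputStream(entry)) {
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = in.read(chunk)) != -1) {
                        buffer.write(chunk, 0, n);
                    }
                    result.put(entry.getName(), buffer.toByteArray());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ConfiguredZipReader <zip file> <encoding, e.g. CP850>
        for (String name : readEntries(new File(args[0]), args[1]).keySet()) {
            System.out.println(name);
        }
    }
}
```

The charset is used for entries that are not flagged as UTF-8 in the archive, so one configured value such as CP850 can cover DOS-generated archives without breaking archives whose names are already stored as UTF-8.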

Also see http://wiki.eclipse.org/SMILA/Documentation/Importing/CompoundExtractorService
Comment 8 Andreas Weber CLA 2013-04-15 11:52:02 EDT
Closing this