This Bugzilla instance is deprecated, and most Eclipse projects now use GitHub or Eclipse GitLab. Please see the deprecation plan for details.
Bug 189353 - More file patterns should transfer in text mode by default
Summary: More file patterns should transfer in text mode by default
Status: RESOLVED FIXED
Alias: None
Product: Target Management
Classification: Tools
Component: RSE (show other bugs)
Version: 2.0   Edit
Hardware: PC Windows XP
: P4 enhancement (vote)
Target Milestone: 3.3 M7   Edit
Assignee: David McKnight CLA
QA Contact: Martin Oberhuber CLA
URL:
Whiteboard:
Keywords:
Depends on: 279014
Blocks:
  Show dependency tree
 
Reported: 2007-05-28 01:13 EDT by Martin Oberhuber CLA
Modified: 2011-03-31 15:59 EDT (History)
1 user (show)

See Also:
mober.at+eclipse: review+


Attachments
patch with additional remote file types (3.81 KB, patch)
2009-05-14 14:15 EDT, David McKnight CLA
no flags Details | Diff
updated patch with more complete xml handling (5.95 KB, patch)
2009-06-01 12:12 EDT, David McKnight CLA
no flags Details | Diff
remote file type extensions with xml (1.43 KB, patch)
2011-03-08 09:54 EST, David McKnight CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Oberhuber CLA 2007-05-28 01:13:07 EDT
The file transfer mode registry should also include files like 
*.sh, *.properties, *.txt

Note that we need to weigh the benefit of fixing CRLF endings against potential issues with encodings.

-----------Enter bugs above this line-----------
TM 2.0RC1 Testing
installation : eclipse-platform-3.3M6 (I20070323-1616), cdt-4.0M5, emf-2.3M5
RSE install  : download RSE 2.0RC1 runtime-all + terminal
java.runtime : Sun 1.6.0-b105
os.name:     : Windows XP 5.1, Service Pack 2
------------------------------------------------
systemtype   : Linux-ssh (dstore-processes)
targetos     : SUSE Linux Enterprise Server 10 (ppc)
targetuname  : Linux build 2.6.16.27-0.9-ppc64 #1 SMP ppc64 GNU/Linux
targetvm     : ibm-java2-sdk-5.0-0-4.0-linux-ppc
------------------------------------------------
Comment 1 Martin Oberhuber CLA 2008-09-16 15:47:42 EDT
Text mode transfer is more interesting for IBM... Dave would you like to take that one on?
Comment 2 David McKnight CLA 2009-05-14 14:15:58 EDT
Created attachment 135832 [details]
patch with additional remote file types

I've added more remoteFileType extensions to expand the file transfer defaults including several of those used by the Team->File Content preferences.  Is the list in this patch okay, or are there more you can think of that need adding?
Comment 3 Martin Oberhuber CLA 2009-05-22 13:53:37 EDT
Hm... can you explain the following in the patch:

  <remoteFileTypes extension="history" type="text"/>
  <remoteFileTypes extension="index" type="text"/>
  <remoteFileTypes extension="mxsd" type="text"/>

I don't know what exactly these are used for, but there could be CDT .index file which are not text but binary (would need to check).

I also don't understand why all the XML types (xmi, xsl) are considered binary... is it because they specify their own encoding in the header? And how does that relate to CRLF conversion?
Comment 4 David McKnight CLA 2009-05-22 14:10:04 EDT
(In reply to comment #3)
> Hm... can you explain the following in the patch:
> 
>   <remoteFileTypes extension="history" type="text"/>
>   <remoteFileTypes extension="index" type="text"/>
>   <remoteFileTypes extension="mxsd" type="text"/>
> 
> I don't know what exactly these are used for, but there could be CDT .index
> file which are not text but binary (would need to check).

These types were listed as text under the preferences for Team->File Content so I figured we ought to do the same for RSE.

> 
> I also don't understand why all the XML types (xmi, xsl) are considered
> binary... is it because they specify their own encoding in the header? And how
> does that relate to CRLF conversion?
> 

The XML types are binary because, as you guess, their encodings are in their headers.

Comment 5 Martin Oberhuber CLA 2009-05-28 10:15:36 EDT
OK. Aligning with the defaults from Platform/Team is a good idea.
Go for it - will be RC3 since RC2 was completed yesterday.
Comment 6 David McKnight CLA 2009-05-28 10:37:01 EDT
I've committed the change to cvs.
Comment 7 Martin Oberhuber CLA 2009-05-29 15:46:59 EDT
On 2nd review, I notice the following problems:

(1) You committed 2 copies of <?eclipse version="3.0"?>
(2) Many of the file types you add as "text" are actually XML files, so I
    think these should be binary? Note that RSE's concept of "text mode" is
    different than Platform/Team. Where Platform/Team only considers CRLF
    conversion, RSE also considers encoding conversion on upload/download,
    and this must not happen for XML.
    But perhaps I'm wrong and XML files are always exempt from encoding 
    conversion? Then they may be text in order to receive CRLF conversion.
    I remember there was some special code for XML in some manager...

Following file types are definitely XML (I checked):
.classpath
.ecore
.exsd
.jardesc
.launch
.product
.project

For follwing I'm not sure:
.emof
.history
.index
.mxsd

Please review again, thanks.
Comment 8 David McKnight CLA 2009-06-01 11:55:41 EDT
(In reply to comment #7)
> On 2nd review, I notice the following problems:
> 
> (1) You committed 2 copies of <?eclipse version="3.0"?>
> (2) Many of the file types you add as "text" are actually XML files, so I
>     think these should be binary? Note that RSE's concept of "text mode" is
>     different than Platform/Team. Where Platform/Team only considers CRLF
>     conversion, RSE also considers encoding conversion on upload/download,
>     and this must not happen for XML.
>     But perhaps I'm wrong and XML files are always exempt from encoding 
>     conversion? Then they may be text in order to receive CRLF conversion.
>     I remember there was some special code for XML in some manager...
> 
> Following file types are definitely XML (I checked):
> .classpath
> .ecore
> .exsd
> .jardesc
> .launch
> .product
> .project

RSE's SystemEncodingUtility currently only recognizes .xml and .xmi files as being xml files.  What this means, is that these are the only files that we actually look at the file to determine encoding from.  I guess we ought to do add the other XML file types you've listed here to the list used by SystemEncodingUtility and list them as binary in the file type extensions.

> 
> For follwing I'm not sure:
> .emof
> .history
> .index
> .mxsd
> 
> Please review again, thanks.
> 

Comment 9 David McKnight CLA 2009-06-01 12:12:21 EDT
Created attachment 137867 [details]
updated patch with more complete xml handling
Comment 10 David McKnight CLA 2009-06-03 12:45:11 EDT
I've committed the newer plugin.xml file.  I'm still waiting on feedback on what to do about the SystemEncodingUtil change.
Comment 11 Martin Oberhuber CLA 2009-06-03 14:25:10 EDT
As mentioned on the phone call, it is a potential risk to change files which used to be binary to transfer in text mode, because encoding conversion can destroy file contents.

The default setting in RSE for files not listed is to transfer "binary", so adding files as "text" is potentially dangerous, and a change in behavior compared to RSE 3.0:

*.cvsignore
*.emof
*.history
*.index
*.jpage
*.MF
*.mxsd
*.options
*.prefs
*.properties
*.sh

Dave I'm awaiting your investigations with respect to 
  (a) what happens when a remote UTF-8 file that contains foreign language
      characters is edited on a windows system with default cp-1252 encoding 
      and transferred in text mode, and
  (b) what the XML utility really does to files depending on whether the
      <?xml> tag is in the file or not.

Moving forward, I'd also like to see some user documentation that explains the impact spectifying "text" transfer mode for a file with respect to encoding conversion, and how users can change the local default encoding in order to avoid loss of data (some users may want to set RemoteSystemsTempFiles project to UTF-8 default encoding in order to avoid loss of data issues, for instance). Or perhaps we want to make UTF-8 the default encoding in RemoteSystemsTempFiles and force products / users to change that default on demand if needed?
Comment 12 David McKnight CLA 2009-06-03 16:17:51 EDT
(In reply to comment #11)
> As mentioned on the phone call, it is a potential risk to change files which
> used to be binary to transfer in text mode, because encoding conversion can
> destroy file contents.
> 
> The default setting in RSE for files not listed is to transfer "binary", so
> adding files as "text" is potentially dangerous, and a change in behavior
> compared to RSE 3.0:
> 
> *.cvsignore
> *.emof
> *.history
> *.index
> *.jpage
> *.MF
> *.mxsd
> *.options
> *.prefs
> *.properties
> *.sh
> 

Would you like me to remove these extensions from the list?

> Dave I'm awaiting your investigations with respect to 
>   (a) what happens when a remote UTF-8 file that contains foreign language
>       characters is edited on a windows system with default cp-1252 encoding 
>       and transferred in text mode, and

If a utf8 file contains characters outside of the cp1252 range, the workspace encoding is cp1252 and a file is downloaded in text mode using dstore, then we hit a corruption problem - this is described by bug 279014.

>   (b) what the XML utility really does to files depending on whether the
>       <?xml> tag is in the file or not.

When an XML file doesn't have an encoding specified, we use the following code (with lots of comments):

	// From section 4.3.3 of the XML specification:
	// In the absence of information provided by an external transport protocol
	// (e.g. HTTP or MIME), it is an error for an entity including an encoding declaration
	// to be presented to the XML processor in an encoding other than that named in the
	// declaration, or for an entity which begins with neither a Byte Order Mark nor an
	// encoding declaration to use an encoding other than UTF-8. Note that since ASCII is
	// a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

	// We'll assume that this is UTF-8 or ASCII encoding without an encoding declaration.
	// Of course, it could also be another encoding that doesn't have an encoding declaration
	// in which case it has violated the XML specification (any encoding beside UTF-8 or UTF-16
	// must specify the character encoding). From section 4.3.3 of the XML specification:
	// In the absence of external character encoding information (such as MIME headers),
	// parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin
	// with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration.
	encodingGuess = SystemEncodingUtil.ENCODING_UTF_8;



> 
> Moving forward, I'd also like to see some user documentation that explains the
> impact spectifying "text" transfer mode for a file with respect to encoding
> conversion, and how users can change the local default encoding in order to
> avoid loss of data (some users may want to set RemoteSystemsTempFiles project
> to UTF-8 default encoding in order to avoid loss of data issues, for instance).
> Or perhaps we want to make UTF-8 the default encoding in RemoteSystemsTempFiles
> and force products / users to change that default on demand if needed?
> 

Perhaps we should but is this something we can do right now or for a future release?
Comment 13 Martin Oberhuber CLA 2009-06-03 17:22:32 EDT
(In reply to comment #12)
> Would you like me to remove these extensions from the list?

I guess I'd like to better understand bug 279014 and whether we can do anything about it, before I decide this. But I tend to think that we should only change file types from "binary" to "text" mode if we absolutely think it is necessary. 
We should unvoid unnecessary late breaking change like this. Are there any file types that you think you must have, or could you live with removing them from the list of types to be "text" mode?

> >   (b) what the XML utility really does to files depending on whether the
> >       <?xml> tag is in the file or not.
> 
> When an XML file doesn't have an encoding specified, we use the following code
>         encodingGuess = SystemEncodingUtil.ENCODING_UTF_8;

Hm, so the util will guess that the file is UTF-8 but will transfer the file in binary mode? What impact does the guessed encoding have? Does it even make a difference for XML files registered in the SystemEncodingUtil whether their file type is marked as "text" or "binary"?

At the point we are currently at, we should avoid code changes wherever we can and resort to documentation instead wherever possible.

Let's focus on bug 279014 first and get that one resolved. Then we can decide what to do to this one. But it's likely not going to go into RC3 and given that we've been able to live with the current extension list for a while we need to be very sure it's the right thing to do before we put it into RC4.
Comment 14 David McKnight CLA 2009-06-04 09:28:30 EDT
(In reply to comment #13)
> (In reply to comment #12)
> > Would you like me to remove these extensions from the list?
> 
> I guess I'd like to better understand bug 279014 and whether we can do anything
> about it, before I decide this. But I tend to think that we should only change
> file types from "binary" to "text" mode if we absolutely think it is necessary. 
> We should unvoid unnecessary late breaking change like this. Are there any file
> types that you think you must have, or could you live with removing them from
> the list of types to be "text" mode?
> 

I would like to at least leave those files that were listed as "text" in previous releases there.  I don't think anything else is a "must-have".


> > >   (b) what the XML utility really does to files depending on whether the
> > >       <?xml> tag is in the file or not.
> > 
> > When an XML file doesn't have an encoding specified, we use the following code
> >         encodingGuess = SystemEncodingUtil.ENCODING_UTF_8;
> 
> Hm, so the util will guess that the file is UTF-8 but will transfer the file in
> binary mode? 

The download in binary has already happened before we try to guess the encoding.

> What impact does the guessed encoding have? Does it even make a
> difference for XML files registered in the SystemEncodingUtil whether their
> file type is marked as "text" or "binary"?

No, because any file that we consider to be xml will be downloaded as binary.


Comment 15 Martin Oberhuber CLA 2009-06-04 10:39:26 EDT
Ok, so it looks like the best thing to do is just keep the old list of file types from 3.0 at least until we know if bug 279014 can be addressed: getting user feedback for when the data loss happens should be the highest priority.

Adding more "binary" file types for xml like stuff won't hurt since the default transfer mode is binary anyways, on the other hand it may be surprising for people to see lots of xml files configured as binary since people perceive them to be text files. In other words, I don't see the value of adding these types as binary now, and I'm thus in favor of postponing this bug.

If you agree, can you revert CVS HEAD? Re-applying the patch later should not be a problem if we want.

The 2nd priority should be adding or enhancing the userdoc context help page for that preference page, explaining the meaning of "text" vs "binary".
Comment 16 David McKnight CLA 2009-06-04 11:57:45 EDT
(In reply to comment #15)

> If you agree, can you revert CVS HEAD? Re-applying the patch later should not
> be a problem if we want.

Okay, I've reverted the file back.


Comment 17 Mahi CLA 2009-06-05 17:16:59 EDT
When I transfer a local WSDL and XSD files to z/OS Unix file system using Remote Systems explorer (in Binary mode), the remote WSDL and XSD files are being set with the host encoding (example IBM-047). In order for me to view the contents in an editor I had to manually set the file encoding to UTF-8 by going to properties. But in he case of XML files, RSE transfers as binary and then uses the "encoding" attribute to determine which encoding to use in the editor. I think WSDL and XSD should be treated the same way as XML.
Comment 18 David McKnight CLA 2011-03-08 09:54:25 EST
Created attachment 190660 [details]
remote file type extensions with xml

Revisiting this bug after some time, the following bugs/enhancements have been resolved since then:

bug 283033 (xml file type)
bug 279014 (text file corruption on down)

At this point, it might be worth while updating the list to include the xml file types listed in this patch.
Comment 19 David McKnight CLA 2011-03-08 09:55:57 EST
Any objections to the patch?  Other than that, I'm not sure what else (if anything) we want to do with this enhancement.  Martin?
Comment 20 David McKnight CLA 2011-03-31 15:59:49 EDT
I've committed the change to cvs.