This Bugzilla instance is deprecated, and most Eclipse projects now use GitHub or Eclipse GitLab. Please see the deprecation plan for details.
Bug 407803 - Documents without DTD contain text on root level after reading
Summary: Documents without DTD contain text on root level after reading
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: Mylyn (show other bugs)
Version: 1.1.0   Edit
Hardware: All All
: P3 major (vote)
Target Milestone: 1.1.0 M2   Edit
Assignee: Florian Thienel CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-05-11 15:40 EDT by Carsten Hiesserich CLA
Modified: 2013-10-02 12:41 EDT (History)
0 users

See Also:


Attachments
A unit test to catch the described behaviour (2.36 KB, patch)
2013-05-11 15:41 EDT, Carsten Hiesserich CLA
no flags Details | Diff
A possible patch with basic validation in DocumentBuilder (4.58 KB, patch)
2013-05-11 15:42 EDT, Carsten Hiesserich CLA
no flags Details | Diff
Revised patch (1.05 KB, patch)
2013-05-12 11:28 EDT, Carsten Hiesserich CLA
no flags Details | Diff
Revised patch (4.45 KB, patch)
2013-05-12 11:31 EDT, Carsten Hiesserich CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Carsten Hiesserich CLA 2013-05-11 15:40:32 EDT
Documents with a Schema instead of a DTD are not correctly read by the SAX-Parser.

1. Create the following document
<?xml version='1.0' encoding='UTF-8'?>
<article xmlns="http://docbook.org/ns/docbook">
	<title>Title</title>
	<para>Para1</para> 
	<!-- Comment 1 -->
	<!-- Comment 2 -->
</article>

2. Open the document with VEX

Result: VES inserts TEXT elements to the DOM

This is caused by the SAX-Parser that reports all whitespace. When used with a DTD, the parser reports ignorable whitespace instead.

The attached patch uses a very basic validation already in the document builder.
Comment 1 Carsten Hiesserich CLA 2013-05-11 15:41:26 EDT
Created attachment 230819 [details]
A unit test to catch the described behaviour
Comment 2 Carsten Hiesserich CLA 2013-05-11 15:42:01 EDT
Created attachment 230820 [details]
A possible patch with basic validation in DocumentBuilder
Comment 3 Florian Thienel CLA 2013-05-12 05:21:23 EDT
Carsten, thank you for this contribution.

As far as I can see in the test, the problem are the text nodes containing only whitespace, right? I refactored the test to use the new IAxis-API in order to make it more expressive:

	@Test
	public void shouldNotAddWhitespaceTextNodesInDocumentWithoutDTD() throws Exception {
		final DocumentReader reader = new DocumentReader();
		final IDocument document = reader.read(TestResources.get("documentWithoutDTD.xml"));
		final IElement rootElement = document.getRootElement();

		final IFilter<INode> recursiveWhitespaceTextNodeFilter = new IFilter<INode>() {
			public boolean matches(final INode node) {
				return node.accept(new BaseNodeVisitorWithResult<Boolean>(false) {
					@Override
					public Boolean visit(final IElement element) {
						return containsEmptyTextNodes(element);
					}

					@Override
					public Boolean visit(final IText text) {
						return text.getText().trim().length() == 0;
					}
				});
			}

			private boolean containsEmptyTextNodes(final IElement element) {
				return !element.children().matching(this).isEmpty();
			}
		};

		final IAxis<? extends INode> nodesWithEmptyTextNodes = rootElement.children().matching(recursiveWhitespaceTextNodeFilter);

		assertEquals("whitespace text nodes", 0, nodesWithEmptyTextNodes.count());
	}

What do you think about this?
Comment 4 Carsten Hiesserich CLA 2013-05-12 09:42:23 EDT
This looks good to me. The result is a bit clearer.

I just realized the provided patch has a little side effect, as it removes all invalid text nodes from the source. Originally, for elements with DTD an exception is thrown, docs without dtd are displayed but uneditable.
I think it would be ok to remove those texts silently, or would it be better to show an error to the user (Something like 'Invalid document: Content not allowed in xxx')?
Comment 5 Florian Thienel CLA 2013-05-12 10:06:29 EDT
(In reply to comment #4)
Currently we don't have any concept implemented for handling invalid content. There is the idea of having some kind of markers (e.g. for missing elements, wrong attribute values etc.) and providing quickfixes, but nothing realized yet in this direction. 

At the moment we can react in two ways:
1. superflous whitespace can be removed silently (this is conform with the XML spec)
2. documents containing invalid content should maybe opened read-only and a hint should be shown

I think this bug involves only (1). With (2) I'm not really shure, but this should be only a temporary solution anyway because a user should always be able to make an invalid document valid in Vex (hopefully in the future :-) )

The solutions for this bug should concentrate on (1) (tiny little steps) and we should have a discussion about a short term concept for invalid content in a separate context.
Comment 6 Carsten Hiesserich CLA 2013-05-12 11:28:46 EDT
Created attachment 230830 [details]
Revised patch
Comment 7 Carsten Hiesserich CLA 2013-05-12 11:29:40 EDT
(In reply to comment #5)
Ok, i agree. I changed the patch to ignore only whitespace.
It still ignores some characters it should not ignore according to XML standard. Thats caused by the use of Character#isWhitespace, that returns true for unicode chars that should not be ignored.
Comment 8 Carsten Hiesserich CLA 2013-05-12 11:31:39 EDT
Created attachment 230831 [details]
Revised patch
Comment 9 Florian Thienel CLA 2013-05-14 15:22:43 EDT
The new patch looks good to me. I did some clean-up and removed the dependency from DocumentReader to WTPVexValidator. Your contribution was commited as f726a9d02dc8a1422402e8a681153456d5139c75.
Comment 10 Carsten Hiesserich CLA 2013-06-01 03:32:23 EDT
I assert that I:
1. authored 100% the content they are contributing
2. have the rights to donate the content to Eclipse
3. contribute the content under the EPL
Comment 11 Florian Thienel CLA 2013-10-02 12:41:28 EDT
M2 released