Key: CMP-842
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Critical
Assignee: Shay Banon
Reporter: Kenny MacLeod
Votes: 0
Watchers: 1
Compass

AbstractXmlWriterXmlContentConverter mangles XML

Created: 26/Feb/09 07:36 AM   Updated: 26/Feb/09 11:26 AM
Component/s: Compass::Core
Affects Version/s: 2.1.0 GA
Fix Version/s: 2.2.0 RC1

File Attachments: 1. test.zip (2 kB)



Description
When using XSEM with either the STAX or SAX converters, dom4j is mangling certain XML documents as they are written into the index.

Specifically, both converters extend AbstractXmlWriterXmlContentConverter, which in its toXml method uses dom4j's OutputFormat.createCompactFormat(). This output format is lossy - it trims whitespace from the document, so what you store into the index may not be what you get out again. This operation should be lossless.

I've attached a test case illustrating the problem.

The workaround is to create a custom subclass of the converter and override toXml() to use a lossless formatter.



Comments
Shay Banon added a comment - 26/Feb/09 07:52 AM
How do you configure it to use the lossless format? In your test (I have not run it), what do you get back with the current implementation?

Kenny MacLeod added a comment - 26/Feb/09 07:55 AM
You get back

<test id="1">x y</test>

The leading and trailing spaces are removed, and the double space between the x and y is reduced to a single space.

It's possible that the double space thing is actually a bug in dom4j. It's hard to tell, since the dom4j code isn't very well written.

I think you can just call OutputFormat's default constructor to get a formatter which is lossless.
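
For readers without the attached test case, the loss described above can be illustrated with a small stand-alone sketch. It only mimics the reported effect (trimming plus collapsing runs of whitespace) and is not dom4j's actual code; the helper name `compactText` and the sample input are assumptions made for illustration.

```java
public class CompactionSketch {

    // Hypothetical helper: approximates the effect reported in this issue
    // (leading/trailing whitespace trimmed, inner runs collapsed to one space).
    // It is NOT dom4j's implementation.
    static String compactText(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Assumed input; the real test document lives in the attached test.zip.
        String original = "  x  y  ";
        String stored = compactText(original);

        // Brackets make the surrounding whitespace visible.
        System.out.println("original: [" + original + "]");
        System.out.println("stored:   [" + stored + "]");
        System.out.println("round trip lossless? " + original.equals(stored));
    }
}
```

Because the stored text no longer equals the original, anything that depends on exact whitespace cannot be recovered from the index.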


Kenny MacLeod added a comment - 26/Feb/09 08:15 AM
Actually, org.dom4j.io.XMLWriter can be constructed without an OutputFormat, in which case it falls back to a default format that is lossless.

Kenny MacLeod added a comment - 26/Feb/09 08:20 AM
Also, a similar problem exists with the JDOM converter (which should use getRawFormat rather than getCompactFormat). The org.w3c.dom implementation is fine, though.
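
As a point of comparison, here is a minimal sketch of a whitespace-preserving round trip using only the JDK's org.w3c.dom and JAXP classes. It is not Compass's converter code; the class and method names are hypothetical, and the sample document is an assumption modelled on the output quoted earlier in this thread.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomRoundTrip {

    // Parses an XML string into an org.w3c.dom Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Serializes with an identity transform; no indentation or compaction is requested.
    static String serialize(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    // Text content of the root element, used to compare before and after.
    static String rootText(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        String original = "<test id=\"1\">  x  y  </test>"; // assumed sample document
        String stored = serialize(parse(original));
        System.out.println("[" + rootText(original) + "]");
        System.out.println("[" + rootText(stored) + "]");
    }
}
```

Comparing the root element's text before and after the round trip is the same kind of check a lossless converter has to pass.";
String stored = DomRoundTrip.serialize(DomRoundTrip.parse(original));
assert DomRoundTrip.rootText(original).equals("  x  y  ");
assert DomRoundTrip.rootText(stored).equals("  x  y  ");
</test>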

Shay Banon added a comment - 26/Feb/09 09:15 AM
Since you have a solution for now, I am going to aim at fixing it for 2.2 (I will make it configurable). I also want to simplify the content converter configuration a bit. Will report back once things are committed.

Kenny MacLeod added a comment - 26/Feb/09 09:19 AM
OK. Why the need to make it configurable, though? What value is there in the current behaviour?

Shay Banon added a comment - 26/Feb/09 10:03 AM
I thought that users who don't mind the compact version problems, and prefer the benefit of having less data to store thanks to the compact form might want this feature. It will be turned "off" by default.

Shay Banon added a comment - 26/Feb/09 10:39 AM
The simplification for XSEM configuration issue (which I wanted to do for ages) can be found here now: CMP-843.

Shay Banon added a comment - 26/Feb/09 11:20 AM
I assume that lossless for dom4j means that I create the XMLWriter without an OutputFormat. Is that how you tested it?

Kenny MacLeod added a comment - 26/Feb/09 11:25 AM
That's correct, you just pass the StringWriter, no OutputFormat. The test passes OK with that.

Shay Banon added a comment - 26/Feb/09 11:26 AM
OK, in 2.2 the default will be without any compaction.

For dom4j, the 'compass.xsem.contentConverter.dom4j.outputFormat' setting can be set to 'compact' to turn compaction back on.
For jdom, the 'compass.xsem.contentConverter.jdom.outputFormat' setting can be set to 'compact' to do the same.
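
A minimal sketch of the two settings as plain key/value pairs, for anyone who wants the compact output back. Only the setting names and the 'compact' value come from this issue; the class and method names are hypothetical, and how the properties are handed to Compass depends on your own configuration setup.

```java
import java.util.Properties;

public class XsemOutputFormatSettings {

    // Builds the two settings named in this issue. Leaving them unset keeps
    // the new 2.2 default (no compaction).
    static Properties compactSettings() {
        Properties settings = new Properties();
        settings.setProperty("compass.xsem.contentConverter.dom4j.outputFormat", "compact");
        settings.setProperty("compass.xsem.contentConverter.jdom.outputFormat", "compact");
        return settings;
    }

    public static void main(String[] args) {
        // Print the settings so they can be copied into a properties file.
        Properties settings = compactSettings();
        for (String name : settings.stringPropertyNames()) {
            System.out.println(name + "=" + settings.getProperty(name));
        }
    }
}
```

Only the setting matching the converter in use (dom4j or JDOM) should be needed.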