Using XML Parser for Java

Character Sets

Encoding iso-8859-1 in xmlparser

Question

I have some XML documents with encoding="iso-8859-1". I am trying to parse these with the xmlparser SAX API. In characters(char[], int, int), I would like to output the content in iso-8859-1 (Latin1) too.

With System.out.println() it doesn't work correctly. German umlauts result in '?' in the output stream. Internally, ä, ö, ü, Ä, Ö, Ü, ß are stored as 65508, 65526, 65532, 65476, 65494, 65500, 65503 respectively. What do I have to do to get the output in Latin1? The host system here is SPARC Solaris 2.6.

Answer

You cannot use System.out.println(). You need to use an output stream which is encoding aware, for example, OutputStreamWriter.

You can construct an OutputStreamWriter and use its write(char[], int, int) method to print. For example:

OutputStreamWriter out = new OutputStreamWriter(System.out, "8859_1");
/* Java encoding string for ISO8859-1 */
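A slightly fuller sketch of the same idea follows. It assumes a current JDK, where "ISO-8859-1" is an accepted charset name alongside the older "8859_1" alias. The class name Latin1Output and the helper writeLatin1 are hypothetical, chosen only for illustration; the helper takes the same (char[], int, int) arguments that the SAX characters() callback receives.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

class Latin1Output {
    // Writes a slice of a char buffer to the target stream as ISO-8859-1 bytes.
    // The parameters mirror the SAX characters(char[], int, int) callback.
    static void writeLatin1(OutputStream target, char[] ch, int start, int length)
            throws IOException {
        Writer out = new OutputStreamWriter(target, "ISO-8859-1");
        out.write(ch, start, length);
        out.flush(); // push the encoded bytes through, but leave the stream open
    }

    public static void main(String[] args) throws IOException {
        // "Grüße", written with Unicode escapes so the source file stays ASCII
        char[] text = "Gr\u00FC\u00DFe".toCharArray();
        writeLatin1(System.out, text, 0, text.length);
        System.out.println();
    }
}
```

Whether the umlauts then display correctly still depends on the terminal interpreting the bytes as Latin1.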

Parsing XML Stored in NCLOB With UTF-8 Encoding

Question

I'm having trouble with parsing XML stored in NCLOB column using UTF-8 encoding. Here is what I'm running:

Windows NT 4.0 Server
Oracle 8i (8.1.5) EE
JDeveloper 3.0
JDK 1.1.8
Oracle XML Parser v2 (2.0.2.5?)

The following XML sample that I loaded into the database contains two UTF-8 multi-byte characters:

<?xml version="1.0" encoding="UTF-8"?>
<G>
<A>GÂ,otingen, BrÃ¼ck_W</A>
</G>

G(0xc2, 0x82)otingen, Br(0xc3, 0xbc)ck_W

If I'm not mistaken, both multibyte characters are valid UTF-8 encodings and they are defined in ISO-8859-1 as:

0xC2  LATIN CAPITAL LETTER A WITH CIRCUMFLEX
0xFC  LATIN SMALL LETTER U WITH DIAERESIS

I wrote a Java stored function that uses the default connection object to connect to the database, runs a Select query, gets the OracleResultSet, calls the getCLOB method and calls the getAsciiStream() method on the CLOB object. Then it executes the following piece of code to get the XML into a DOM object:

DOMParser parser = new DOMParser();
parser.setPreserveWhitespace(true);
parser.parse(istr); // istr is the stream returned by getAsciiStream()
XMLDocument xmldoc = parser.getDocument();

Before the stored function can do other tasks, this code throws an exception complaining that the above XML contains "Invalid UTF8 encoding".

When I remove the first multibyte character (0xc2, 0x82) from the XML, it parses fine.
When I do not remove this character, but connect through the JDBC Oracle thin driver (note that I am then no longer running inside the RDBMS as a stored function), the XML is parsed with no problem and I can do whatever I want with the XMLDocument.

I loaded the sample XML into the database using the thin JDBC driver. I tried two database configurations with WE8ISO8859P1/WE8ISO8859P1 and WE8ISO8859P1/UTF8 and both showed the same problem.

Answer

Yes, the character (0xc2, 0x82) is valid UTF-8. We suspect that the character is distorted when getAsciiStream() is called. Try to use getUnicodeStream() and getBinaryStream() instead of getAsciiStream().

If this does not work, try printing out the characters to make sure that they are not distorted before they are sent to the parser in the step parser.parse(istr).
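One way to carry out that check is to buffer the stream, dump its bytes in hex, and test whether they still decode as strict UTF-8 before handing them to the parser. The following is a minimal sketch using only the standard library. The class StreamInspector and its methods are hypothetical names, and the byte array in main is a stand-in for whatever stream the CLOB accessor returns.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StreamInspector {
    // Reads the whole stream into memory so it can be inspected and then re-read.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // Formats bytes as space-separated two-digit hex values.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Returns true only if the bytes decode as UTF-8 with no malformed sequences.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the CLOB stream: "G", the bytes 0xC2 0x82, then "ot".
        byte[] sample = {'G', (byte) 0xC2, (byte) 0x82, 'o', 't'};
        InputStream istr = new ByteArrayInputStream(sample);

        byte[] raw = readAll(istr);
        System.out.println(toHex(raw));
        System.out.println("valid UTF-8: " + isValidUtf8(raw));

        // The buffered bytes can be wrapped again and passed on to the parser.
        InputStream forParser = new ByteArrayInputStream(raw);
        System.out.println("bytes available for parser: " + forParser.available());
    }
}
```

If the hex dump no longer shows c2 82 at the expected position, the bytes were altered before they reached the parser.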

NLS support within XML

Question

I've got Japanese data stored in an nvarchar2 field in the database. I have a dynamic SQL procedure that utilizes the PL/SQL web toolkit that allows me to access data via OAS and a browser. This procedure uses the XML Parser to correctly format the result set in XML before returning it to the browser.

My problem is that the Japanese data is returned and displayed on the browser as upside down question marks. Is there anything I can do so that this data is correctly returned and displayed as Kanji?

Answer

Unfortunately, the default character set for Java and XML is UTF-8, while I have not heard of any UTF-8 operating system, nor of people using UTF-8 in their database or writing their web pages in UTF-8. All this means that you have a character code conversion problem. The answer to your last question is yes: we do have both the PL/SQL and Java XML parsers working in Japanese. Unfortunately, we cannot provide a simple solution that will fit in this space.
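As a small illustration of what a character code conversion problem looks like in Java, the sketch below encodes two Kanji characters once as UTF-8 and once as ISO-8859-1. It is not a diagnosis of the setup described in the question; the class and method names are hypothetical. It only shows that a target character set without Kanji silently substitutes a replacement character, while UTF-8 round-trips the text.

```java
import java.nio.charset.StandardCharsets;

class ConversionDemo {
    // Encodes text as UTF-8, which can represent any Unicode character.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Encodes text as ISO-8859-1; characters outside Latin1 are replaced by '?'.
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // the two characters of the word "kanji"

        byte[] utf8 = encodeUtf8(kanji);
        byte[] latin1 = encodeLatin1(kanji);

        System.out.println("UTF-8 byte count:  " + utf8.length);
        System.out.println("Latin1 byte count: " + latin1.length);

        // Decoding the UTF-8 bytes as UTF-8 recovers the original text.
        String roundTrip = new String(utf8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip intact: " + roundTrip.equals(kanji));

        // Decoding the Latin1 bytes cannot recover it; the data is already lost.
        String lossy = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println("Latin1 round trip intact: " + lossy.equals(kanji));
    }
}
```

The same kind of loss can occur at any hop between database, parser, web server, and browser where the two sides disagree about the character set.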

UTF-16 Encoding with XML Parser for Java V2

Question

This is my XML Document:

Documento de Prueba de gestión de contenidos. Roberto Pérez Lita

This is the way in which I parse the document:

DOMParser parser=new DOMParser(); 
parser.setPreserveWhitespace(true); 
parser.setErrorStream(System.err); 
parser.setValidationMode(false); 
parser.showWarnings(true);
parser.parse(new FileInputStream(new File("PruebaA3Ingles.xml")));

I get the following error:

XML-0231 : (Error) Encoding 'UTF-16' is not currently supported

I am using the XML Parser for Java V2_0_2_5 and I am confused because the documentation says that the UTF-16 encoding is supported in this version of the parser. Does anybody know how I can parse documents containing Spanish accents?

Answer

Oracle has just uploaded a new release of the V2 Parser. It should support UTF-16. However, other utilities still have some problems with UTF-16 encoding.
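Where a parser cannot decode UTF-16 from a byte stream by itself, one general technique is to decode the bytes outside the parser and hand it a character stream instead. The sketch below shows this with the standard JAXP API that ships with the JDK, not with the Oracle DOMParser used in the question. The class Utf16ParseDemo and the method parseUtf16 are hypothetical names, and the in-memory document stands in for the file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class Utf16ParseDemo {
    // Decodes the byte stream as UTF-16 up front, then parses the characters.
    static Document parseUtf16(InputStream in) throws Exception {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16);
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a UTF-16 file; accented letters are written as escapes.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>"
                + "<doc>gesti\u00F3n de contenidos, P\u00E9rez</doc>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_16); // includes a byte order mark

        Document doc = parseUtf16(new ByteArrayInputStream(bytes));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Because the decoding happens in InputStreamReader, the parser only ever sees characters and does not need its own UTF-16 decoder.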