XML parsing in Unicon

Introduction

This page describes the XML parser in the library.  The parser takes its source input as a string, and produces as output a hierarchical tree structure representing the document.  There are also formatter classes that take the document structure and produce a string representation of the document as output.

Strings are used as input and output, rather than files, for two reasons.  Firstly, it is more flexible - a file can be turned into a string easily, but not vice-versa (without using a temporary file).  Secondly, it allows the parser to use the built-in string scanning functions, which increase parse speed.  The downside of using strings is that the whole document has to be read into memory prior to parsing, so the parser may not be suitable for very large documents.
 

The XML parser

Create the XML parser using the following code :-
   import xml
   ...
   p := XmlParser()
To parse a document in string form, just invoke the parse() method, as follows :-
 
   s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?><simple></simple>"
   d := p.parse(s) | stop("Couldn't parse")
Note that if the input string s is not well-formed, the parse method will fail.  Otherwise, it succeeds and returns an XmlDocument object, which can then be inspected as needed.

If you have a file you need to parse, just load it into a string with the following code first :-

      f := open(file_name) | stop("couldn't open")
      s := ""
      while s ||:= reads(f, 1000)
      close(f)
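
The loop above can be wrapped in a small procedure so that the rest of the program deals only with strings.  The following is a minimal sketch: read_file is a hypothetical helper name, not part of the xml package, and the program assumes the file name is passed as the first command-line argument.

```unicon
import xml

# Hypothetical helper (not part of the xml package): return the contents
# of the named file as a string, or fail if the file cannot be opened.
procedure read_file(file_name)
   local f, s
   f := open(file_name) | fail
   s := ""
   # reads() fails at end of file, which ends the loop.
   while s ||:= reads(f, 1000)
   close(f)
   return s
end

procedure main(args)
   local p, s, d
   # Fails (and so stops) if no argument was given or the file can't be opened.
   s := read_file(args[1]) | stop("couldn't open")
   p := XmlParser()
   d := p.parse(s) | stop("Couldn't parse")
   write("Parsed ", *s, " characters")
end
```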

The XmlDocument and XmlElement classes

As noted above, the parser returns an XmlDocument instance.  XmlDocument has methods to inspect the result of the parsing.  Most notably, the method get_root_element() will return the XmlElement which is the root of the element structure :-
    e := d.get_root_element()
The XmlElement class has methods which enable you to search the children of the element, and to inspect the element attributes.  For example, say that the element e represented the following structure :-
 
    <top a1="val1" a2="val2">
      <inner a3="val3">
          Some text
      </inner>
    </top>
Then
     e.get_name() returns "top"
     e.get_attribute("a1") returns "val1"
     e.get_attribute("a2") returns "val2"
     e.get_attribute("absent") returns &null
     f := e.search_children("inner") sets f to another XmlElement, representing inner.  If there
              were several inner elements, it would suspend them in sequence.
     e.search_children("absent") fails
     f.get_string_content() returns "        Some text        "
     f.get_trimmed_string_content() returns "Some text"
Please see the API documentation for more details.
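
The calls listed above can be put together into a complete program.  This is only a sketch using the methods named on this page; the document is the example structure written out as a single string, and no particular output is promised here.

```unicon
import xml

procedure main()
   local p, s, d, e, f

   # The example document from above, written as one string.
   s := "<top a1=\"val1\" a2=\"val2\">" ||
        "<inner a3=\"val3\"> Some text </inner>" ||
        "</top>"

   p := XmlParser()
   d := p.parse(s) | stop("Couldn't parse")
   e := d.get_root_element()

   write("root:  ", e.get_name())
   write("a1:    ", e.get_attribute("a1"))

   # search_children() is a generator, so "every" visits each match in turn.
   every f := e.search_children("inner") do
      write("inner: ", f.get_trimmed_string_content())
end
```

Using "every" rather than a plain assignment means the same code works whether there is one inner element or several.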

Validation

If the document is not well-formed then the parse() method will fail.  If the document is well-formed but invalid then parse() succeeds, but the parser will have done two things.  Firstly, it will have noted a count of the validity errors in the document, which can be obtained with d.get_validity_errors().  Secondly, during parsing the parser will have made callbacks to the parser's ErrorHandler.  An error handler just provides some methods to invoke when errors (fatal and validity) or warnings occur.  The default error handler is DefaultErrorHandler, and it just prints out the error messages to a given file (by default &output).
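
As an illustration of the difference between the two kinds of failure, the sketch below parses a document that is well-formed but should be invalid against its own internal DTD (the element a is declared EMPTY but contains a child).  The test document is made up for this example; with the default error handler, the parser would be expected to print its own messages as well.

```unicon
import xml

procedure main()
   local p, s, d, n

   # Well-formed, but <a> is declared EMPTY and yet contains <b/>.
   s := "<!DOCTYPE a [ <!ELEMENT a EMPTY> ]><a><b/></a>"

   p := XmlParser()
   # parse() fails only when the document is not well-formed.
   d := p.parse(s) | stop("Not well-formed")

   # A non-zero count means the document parsed but is invalid.
   n := d.get_validity_errors()
   if n > 0 then
      write("Well-formed but invalid: ", n, " validity error(s)")
   else
      write("Valid")
end
```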

The parser's error handler can be changed by using something like the following :-

import xml
class MyErrorHandler : ErrorHandler()
   method fatal_error(msg, stack)
      # Handle...
   end

   method validity_error(msg, stack)
      # Handle...
   end

   method warning(msg, stack)
      # Handle...
   end
end
...
p.set_error_handler(MyErrorHandler())
However, most of the time, the DefaultErrorHandler will be fine.  It has methods to configure the output file and the level for which output should be generated, so it is relatively flexible.
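
Where the messages need to be processed by the program rather than printed, a custom handler can collect them instead.  The following is a sketch only: CollectingErrorHandler and its messages field are hypothetical names, and it assumes that the msg argument passed to each callback is a string.

```unicon
import xml

# Hypothetical handler that stores messages in a list instead of printing them.
class CollectingErrorHandler : ErrorHandler(messages)
   method fatal_error(msg, stack)
      put(messages, "fatal: " || msg)      # assumes msg is a string
   end

   method validity_error(msg, stack)
      put(messages, "invalid: " || msg)
   end

   method warning(msg, stack)
      put(messages, "warning: " || msg)
   end

initially
   messages := []
end

procedure main()
   local p, h
   h := CollectingErrorHandler()
   p := XmlParser()
   p.set_error_handler(h)

   # Made-up test document: well-formed, but <a> is declared EMPTY.
   p.parse("<!DOCTYPE a [ <!ELEMENT a EMPTY> ]><a><b/></a>") |
      write("parse failed")

   # Print whatever the handler collected during parsing.
   every write(!h.messages)
end
```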

The parser's validation process can be turned off if desired, using the clear_validate() method.  This will increase parser speed, but may affect the result in terms of whitespace (see below).


External entity resolution

During parsing, the parser sometimes needs to resolve external entities.  Typically this is when an external DTD needs to be loaded, as in the doctype declaration
<!DOCTYPE web-app
  PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN"
  "http://java.sun.com/j2ee/dtds/web-app_2_2.dtd">
In order to resolve this, and obtain the external data, the parser uses a Resolver class.  A custom resolver class can be used as follows :-
import xml
class MyResolver : Resolver()
   method resolve(external_id)
      local s, t
      s := external_id.get_public_id() # eg -//Sun Microsystems, Inc.//DTD Web Application 2.2//EN
      t := external_id.get_system_id() # eg http://java.sun.com/j2ee/dtds/web-app_2_2.dtd
      # Do something with t and s, to return a string representing the external entity
   end
end
...
p.set_resolver(MyResolver())
The parser uses a default resolver, DefaultResolver, which should be sufficient for normal purposes.  It resolves system ids beginning with "file://" and "http://" locally and over the network respectively.  If the system id doesn't begin with either of those strings, then it is treated as a local file path.
 

Formatted output

Output of document structures is done using a set of formatter classes.  For XML documents, use the XmlFormatter class :-
f := XmlFormatter()
s := f.format(d)
write(s)
Note how the formatter returns a string from a document, so it is really the reverse process of a parser.  Various options can be set on the formatter; for example
f.set_text_trim()
f.set_indent(3)
will output all the text content with whitespace trimmed, and all elements formatted with an indent of 3 spaces.
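
Putting the parser and formatter together gives a simple round trip from string to document and back to string.  This is a minimal sketch using only the calls shown on this page; the input document is made up for the example.

```unicon
import xml

procedure main()
   local p, d, f

   p := XmlParser()
   d := p.parse("<top a1=\"val1\"><inner>  Some text  </inner></top>") |
      stop("Couldn't parse")

   # Format the document with trimmed text and a 3-space indent.
   f := XmlFormatter()
   f.set_text_trim()
   f.set_indent(3)
   write(f.format(d))
end
```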

Namespaces

Namespaces are fully supported as a post-processing step to normal parsing.  This is on by default, but can be turned off by using p.clear_do_namespaces().  Assuming namespaces are being processed, then the XmlElement class has extra methods which can be used to find child elements and attributes based on the global name, which is a pair of a URI and a local name.  A global name is represented by a GlobalName instance, which can be created with something like :-
gn := GlobalName("Local", "http://schemas.xmlsoap.org/soap/envelope/")
This global name can then be used to select the element "Prefix:Local" in the following example :-
<parent xmlns:Prefix="http://schemas.xmlsoap.org/soap/envelope/">
   <Prefix:Local attr="123"/>
</parent>
The selection methods for elements and attributes using global names can be found in the XmlElement class.  Please see the API docs for full details.

Test suite and limitations

To test the parser, there is a script "dotests.sh" in the distribution directory which runs about 1700 test documents through the parser.  The test documents come from various sources, and each falls into one of three categories.

There are three or four instances (all from one test suite) where I can't agree with their definition of what is and isn't well-formed/invalid.  The XML spec can be maddeningly vague in some respects, so I am probably just not interpreting it right.

There are also a very small number of cases where a well-formed but invalid document is not reported as invalid.  There are no cases where a valid document is reported as invalid, or where a well-formed document will not parse.

Finally, and more importantly, because Icon's characters and strings are based on 8-bit extended ASCII, any document using Unicode characters will not be handled correctly.

All these tests which cause the parser problems are commented out with an appropriate commentary in the dotests.sh file.

Testxml

This is a small program which can be used to see what the parser does with a particular document.  Given an input filename, testxml will parse the document and output various sections showing the formatted version of the document, a complete display of the document's structure, and the document's constraints read from the DTD.

Just run testxml with no arguments for a list of options.

API documentation

The API documentation for the XML parser library can be found in the xml package section of the main API documentation.