Strings, rather than files, are used as input and output for two
reasons. Firstly, it is more flexible - a file can be turned
into a string easily, but not vice-versa (without using a temporary file).
Secondly, it allows the parser to use the built-in string scanning functions,
which increase parse speed. The downside of using strings is that
the document has to be read into memory prior to parsing, so the parser
may not be suitable for very large documents.
import xml
...
p := XmlParser()
To parse a document in string form, just invoke the parse() method, as follows :-
s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?><simple></simple>"
d := p.parse(s) | stop("Couldn't parse")
Note that if the input string s is not well-formed, the parse method will
fail. Otherwise, it succeeds and returns an XmlDocument object, which
can then be inspected as needed.
If you have a file you need to parse, just load it into a string with the following code first :-
f := open(file_name) | stop("couldn't open")
s := ""
while s ||:= reads(f, 1000)
close(f)
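Putting the fragments above together, a minimal sketch of a complete program might look like the following. Taking the file name from the first command-line argument, the usage message and the final progress message are illustrative assumptions; only open(), reads(), close(), XmlParser() and parse() come from the text above.

```unicon
import xml

# Read an entire file into a string, then hand the string to the parser.
procedure main(args)
   local file_name, f, s, p, d

   # args[1] fails on an empty argument list, so stop() reports the usage.
   file_name := args[1] | stop("usage: parsefile file.xml")

   # Load the whole file into s, 1000 characters at a time.
   f := open(file_name) | stop("couldn't open ", file_name)
   s := ""
   while s ||:= reads(f, 1000)
   close(f)

   # parse() fails if s is not well-formed.
   p := XmlParser()
   d := p.parse(s) | stop("Couldn't parse ", file_name)
   write("Parsed ", file_name, " (", *s, " characters)")
end
```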
e := d.get_root_element()
The XmlElement class has methods which enable you to search the children of the element, and to inspect the element's attributes. For example, suppose the element e represents the following structure :-
<top a1="val1" a2="val2">
<inner a3="val3">
Some text
</inner>
</top>
Then
e.get_name() == "top"
e.get_attribute("a1") == "val1"
e.get_attribute("a2") == "val2"
e.get_attribute("absent") == &null
f := e.search_children("inner") sets f to another XmlElement, representing inner. If there
were several inner elements, it would suspend them in sequence.
e.search_children("absent") fails
f.get_string_content() returns " Some text "
f.get_trimmed_string_content() returns "Some text"
Please see the API documentation for more details.
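As a rough sketch of how these methods might be combined, the following program parses a small document and iterates over every inner child with "every", relying on search_children() suspending each match in turn as described above. The sample document (with a second inner element added) and the output layout are made up for illustration.

```unicon
import xml

procedure main()
   local p, d, e, f, s

   # Illustrative document: the structure shown above, plus a second <inner>.
   s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ||
        "<top a1=\"val1\" a2=\"val2\">" ||
        "<inner a3=\"val3\"> Some text </inner>" ||
        "<inner a3=\"val4\"> More text </inner>" ||
        "</top>"

   p := XmlParser()
   d := p.parse(s) | stop("Couldn't parse")
   e := d.get_root_element()

   # Name and one attribute of the root element.
   write(e.get_name(), " a1=", e.get_attribute("a1"))

   # search_children() is a generator, so "every" visits each inner element.
   every f := e.search_children("inner") do
      write(f.get_attribute("a3"), ": ", f.get_trimmed_string_content())
end
```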
The parser's error handler can be changed by using something like the following :-
import xml
class MyErrorHandler : ErrorHandler()
   method fatal_error(msg, stack)
      # Handle...
   end
   method validity_error(msg, stack)
      # Handle...
   end
   method warning(msg, stack)
      # Handle...
   end
end
...
p.set_error_handler(MyErrorHandler())
However, most of the time, the DefaultErrorHandler will be fine. It has methods to configure the output file and the level for which output should be generated, so it is relatively flexible.
The parser's validation process can be turned off if desired, using the clear_validate() method. This will increase parser speed, but may affect the result in terms of whitespace (see below).
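A minimal sketch of a non-validating parse might look like this. It assumes that clear_validate() is called on the parser object before parse(), so that the setting is in effect when the document is processed; the sample document is illustrative.

```unicon
import xml

procedure main()
   local p, d

   p := XmlParser()
   # Assumption: validation is switched off on the parser before parsing.
   p.clear_validate()

   d := p.parse("<?xml version=\"1.0\"?><simple></simple>") |
      stop("Couldn't parse")
   write(d.get_root_element().get_name())
end
```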
<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN" "http://java.sun.com/j2ee/dtds/web-app_2_2.dtd">
In order to resolve this, and obtain the external data, the parser uses a Resolver class. A custom resolver class can be used as follows :-
import xml
class MyResolver : Resolver()
   method resolve(external_id)
      local s, t
      s := external_id.get_public_id() # eg -//Sun Microsystems, Inc.//DTD Web Application 2.2//EN
      t := external_id.get_system_id() # eg http://java.sun.com/j2ee/dtds/web-app_2_2.dtd
      # Do something with t and s, to return a string representing the external entity
   end
end
...
p.set_resolver(MyResolver())
The parser uses a default resolver, DefaultResolver, which should be sufficient for normal purposes. It resolves system ids beginning with "file://" locally and those beginning with "http://" over the network. If the system id doesn't begin with either of those strings, then it is treated as a local file path.
f := XmlFormatter()
s := f.format(d)
write(s)
Note how the formatter returns a string from a document, so it is really the reverse process of a parser. Various options can be set on the formatter; for example
f.set_text_trim()
f.set_indent(3)
will output all the text content with whitespace trimmed, and all elements formatted with an indent of 3 spaces.
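To illustrate the round trip from string to document and back, a minimal sketch follows. It uses only the parser and formatter calls shown above; the sample document is illustrative, and the exact layout of the formatted output is not shown here because it depends on the formatter.

```unicon
import xml

procedure main()
   local p, d, f, s

   # Illustrative input with untrimmed text content and no indentation.
   s := "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ||
        "<top a1=\"val1\"><inner> Some text </inner></top>"

   p := XmlParser()
   d := p.parse(s) | stop("Couldn't parse")

   # Format the document back into a string, trimmed and indented.
   f := XmlFormatter()
   f.set_text_trim()
   f.set_indent(3)
   write(f.format(d))
end
```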
gn := GlobalName("Local", "http://schemas.xmlsoap.org/soap/envelope/")
This global name can then be used to select the element "Prefix:Local"
in the following example :-
<parent xmlns:Prefix="http://schemas.xmlsoap.org/soap/envelope/">
<Prefix:Local attr="123"/>
</parent>
The selection methods for elements and attributes using global names can be found in the XmlElement class. Please see the API docs for full details.
There are also a very small number of cases where a well-formed but invalid document is not reported as invalid. There are no cases where a valid document is reported as invalid, or where a well-formed document fails to parse.
Finally, and more importantly, because Icon's characters and strings are based on 8-bit extended ASCII, any document using Unicode characters will not be handled correctly.
All the tests which cause the parser problems are commented out, with an explanatory comment, in the dotests.sh file.
Just run testxml with no arguments for a list of options.