HTML parsing in Unicon
Introduction
The HTML parser uses the same kind of document structures for input and output
as the XML parser. However, it is much simpler
than the XML parser.
The parser
Create the HTML parser using the following code :-
import xml
...
p := HtmlParser()
To parse a document in string form, just invoke the parse() method, as follows :-
s := "<html>here is some html<p>here is some more</html>"
d := p.parse(s)
The parse method will return an HtmlDocument object, which can then be inspected
as needed.
In contrast to the XML parser, the parse method will never fail. In
other words, even if something that isn't remotely like HTML is given as
input, it will still try to make sense of it. This is in recognition
of the fact that much HTML out on the Web is malformed! A fussy parser
would be of little use.
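As a minimal sketch of this lenient behaviour, the program below feeds the parser a deliberately malformed string and prints whatever structure it recovers. It uses only the classes named in this document (HtmlParser, and HtmlFormatter from the xml package); the input string is made up for illustration, and the exact output depends on how the parser chooses to repair the input.

```unicon
import xml

procedure main()
   local p, d

   p := HtmlParser()

   # Deliberately malformed: <p> and <b> are never closed and there is
   # no enclosing <html> element.
   d := p.parse("<p>an unclosed paragraph<b>and unclosed bold text")

   # Print whatever structure the parser recovered from the input.
   write(HtmlFormatter().format(d))
end
```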
The HtmlDocument and HtmlElement classes
As noted above, the parser returns an HtmlDocument instance. This
works in a very similar way to the XmlDocument class used for XML parsing;
in fact the two classes share a common base class. The most important
method is get_root_element(), which will return the HtmlElement which is
the root of the element structure :-
e := d.get_root_element()
HtmlElement is also related to XmlElement by way of a common base class,
and they have the same methods for inspecting attributes and child elements.
So, for example, say that the element e represented the following
structure :-
<html lang="en">
Some html text
<p>
Some more
</html>
Then
e.get_name() == "html"
e.get_attribute("lang") == "en"
f := e.search_children("p") sets f to another HtmlElement, representing the inner <p> element. This has one
child, namely the text content between the <p> and the </html>
e.search_children("absent") fails
f.get_string_content() returns " Some more "
f.get_trimmed_string_content() returns "Some more"
The search_tree method of the Element class is very useful if you want to
get at an Element deep within the document. Please see the API documentation
for more details.
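The calls described above can be put together in a short program. This is only a sketch: it assumes the parser builds the structure described in this section for the sample input, and it guards the search_children() calls because that method fails when no matching child exists. The search_tree method is left out, since its arguments are described only in the API documentation.

```unicon
import xml

procedure main()
   local p, d, e, f

   p := HtmlParser()
   d := p.parse("<html lang=\"en\">Some html text<p>Some more</html>")
   e := d.get_root_element()

   write("name: ", e.get_name())
   write("lang: ", e.get_attribute("lang"))

   # search_children() fails if there is no such child, so test the result.
   if f := e.search_children("p") then
      write("p text: ", f.get_trimmed_string_content())
   else
      write("no <p> child found")

   if not e.search_children("absent") then
      write("no <absent> child found")
end
```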
Formatted output
Output of document structures is done using a set of formatter classes.
For HTML documents, use the HtmlFormatter class :-
f := HtmlFormatter()
s := f.format(d)
write(s)
Note how the formatter returns a string from a document, so it is really
the reverse process of a parser.
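A complete round trip, from a file on disk to formatted output, might look like the sketch below. The read_file procedure is a hypothetical helper written for this example and is not part of the xml package; it simply collects the lines of the file into one string. Only parse() and format() come from the package itself.

```unicon
import xml

# Hypothetical helper (not part of the xml package): read a whole file
# into a single string, or fail if the file cannot be opened.
procedure read_file(name)
   local f, s

   f := open(name, "r") | fail
   s := ""
   while s ||:= read(f) || "\n"
   close(f)
   return s
end

procedure main(args)
   local s, d

   # Stop with a message if no filename was given or the file is unreadable.
   s := read_file(args[1]) | stop("usage: give the name of a readable HTML file")

   # Parse the text into an HtmlDocument, then turn it back into a string.
   d := HtmlParser().parse(s)
   write(HtmlFormatter().format(d))
end
```

Because the parser repairs malformed input, the text written out is not necessarily identical to the text read in.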
Testhtml
This is a small program which can be used to see what the parser does with
a particular document. Given an input filename, testhtml will parse
the document and output the formatted equivalent, and a complete display
of the document's structure.
Just run testhtml with no arguments for a list of options.
API documentation
The API documentation for the HTML parser can be found in the xml package
section of the main API
documentation.