HTML parsing in Unicon


The HTML parser uses document structures for input and output similar to those of the XML parser.  However, it is much simpler than the XML parser.

The parser

Create the HTML parser using the following code :-
   import xml
   p := HtmlParser()
To parse a document in string form, just invoke the parse() method, as follows :-
   s := "<html>here is some html<p>here is some more</html>"
   d := p.parse(s)
The parse method will return an HtmlDocument object, which can then be inspected as needed.

In contrast to the XML parser, the parse method will never fail.  In other words, even if something that isn't remotely like HTML is given as input, it will still try to make sense of it.  This is in recognition of the fact that much HTML out on the Web is malformed!  A fussy parser would be of little use.
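As a rough illustration of this tolerance, the sketch below feeds a few deliberately sloppy strings to the parser and reports whether parse() succeeded.  It uses only the calls shown above; the sample inputs are made up for the example, and what the parser actually builds from each of them is not shown here.

```unicon
import xml

procedure main()
   local p, s, d
   p := HtmlParser()
   # Made-up inputs: an unclosed tag, badly nested tags, and plain text.
   every s := !["<html><p>unclosed paragraph",
                "<b>bold <i>both</b> italic?",
                "no markup at all"] do {
      if d := p.parse(s) then
         write("parsed: ", s)
      else
         write("failed: ", s)   # not expected, given the behaviour described above
   }
end
```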

The HtmlDocument and HtmlElement classes

As noted above, the parser returns an HtmlDocument instance.  This works in a very similar way to the XmlDocument class used for XML parsing; in fact the two classes share a common base class.  The most important method is get_root_element(), which will return the HtmlElement which is the root of the element structure :-
    e := d.get_root_element()
HtmlElement is also related to XmlElement by way of a common base class, and they have the same methods for inspecting attributes and child elements.  So, for example, say that the element e represented the following structure :-
    <html lang="en">
    Some html text
    <p>
        Some more
    </html>
Then :-
     e.get_name() == "html"
     e.get_attribute("lang") == "en"
     f := e.search_children("p") sets f to another HtmlElement, representing the <p> element.  This has one child, namely the text content between the <p> and the </html>
     e.search_children("absent") fails
     f.get_string_content() returns "        Some more        "
     f.get_trimmed_string_content() returns "Some more"
The search_tree method of the Element class is very useful if you want to get at an Element deep within the document.  Please see the API documentation for more details.
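The inspection calls listed above can be put together into a small program along the following lines.  This is only a sketch: the input string is an assumption chosen to resemble the structure in the example, and the exact whitespace in the returned content depends on the input.  Since search_children() fails when there is no matching child, it is used directly as the condition of an if expression.

```unicon
import xml

procedure main()
   local p, d, e, f
   p := HtmlParser()
   # Sample input (an assumption); note that the <p> is never closed.
   d := p.parse("<html lang=\"en\">Some html text<p>   Some more   </html>")
   e := d.get_root_element()

   write("root name: ", e.get_name())
   write("lang:      ", e.get_attribute("lang"))

   # search_children() fails if no child has the given name
   if f := e.search_children("p") then {
      write("raw:       [", f.get_string_content(), "]")
      write("trimmed:   [", f.get_trimmed_string_content(), "]")
   }
   else
      write("no <p> child found")

   if not e.search_children("absent") then
      write("no <absent> child")
end
```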

Formatted output

Output of document structures is done using a set of formatter classes.  For HTML documents, use the HtmlFormatter class :-
f := HtmlFormatter()
s := f.format(d)
Note how the formatter returns a string from a document, so it is really the reverse process of a parser.
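To see the parser and formatter working as a pair, a short round trip can be sketched as follows.  It relies only on HtmlParser, parse(), HtmlFormatter and format() as described in this document; the variable names are arbitrary, and no particular output layout is assumed.

```unicon
import xml

procedure main()
   local p, fmt, d, s
   p := HtmlParser()
   fmt := HtmlFormatter()

   # Parse a string into an HtmlDocument ...
   d := p.parse("<html>here is some html<p>here is some more</html>")

   # ... and turn the document back into a string.
   s := fmt.format(d)
   write(s)

   # The formatted string is itself HTML, so it can be parsed and formatted again.
   write(fmt.format(p.parse(s)))
end
```

Comparing the two outputs is a quick way to check how the parser has interpreted the original markup.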


The testhtml program

testhtml is a small program which can be used to see what the parser does with a particular document.  Given an input filename, it will parse the document and output the formatted equivalent, together with a complete display of the document's structure.

Just run testhtml with no arguments for a list of options.

API documentation

The API documentation for the HTML parser can be found in the xml package section of the main API documentation.