Parser-Example: HTML-Browser

Introduction

This particular HTML browser project has been discontinued. Major development took place in 1996; the last minor bugfix version, dated 2002-03-18, is available for download. A new, ongoing project has taken its place: Synx.

A simple HTML Browser realized as a Java Applet

Scanning and Parsing

To evaluate a complex language, the technique of scanners and parsers can be used. Before a text written in a (formal or informal) language can be interpreted, it must be broken into separate tokens. This lexical process is done by the scanner, which must know what tokens to recognize. The resulting TokenSequence is then checked and the tokens are grouped into meaningful units. This grammatical process is done by the parser, which must know the rules of the language's grammar. Once all tokens have been scanned and grouped according to the grammatical rules, the interpretation of the statements (or sentences) can start, depending on the specific task.

The Tokens

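Each token consists of a type, the text or pattern to match, and a symbol. Tokens, if declared literally in a file, are specified one per line with these fields separated by `|` (see the token declaration file below). Two kinds of tokens exist: definite tokens, which are declared by their exact text, and regular tokens, which are specified as a Regular Expression.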

The Scanner

The scanner is a base class that consecutively scans in tokens from an input Reader or scanner. The token Declarations must be provided, either directly as an array or indirectly with the token declaration syntax in a (File-)Reader.
This implementation scans definite tokens, which are explicitly declared in the lexical token declaration file (like Assurance), and regular tokens, which are specified as a Regular Expression (like [A-Z][0-9]*).
Reading from an input Reader, the scanner first tries to match the longest definite token possible. If no token matches or no alternativeToken() can be found for the current symbolPart, all Regular Expression specifications are run through a nondeterministic automaton, which matches one of them if possible.
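
The matching order described above can be sketched in a few lines of Java. This is only an illustration under assumptions, not the original implementation: the class SketchScanner and its methods are hypothetical names, the input is a String rather than a Reader, alternativeToken() is left out, and java.util.regex stands in for the project's own automaton.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: longest definite token first, regular tokens as fallback. */
class SketchScanner {
    /** A scanned token: its type and the symbol it stands for. */
    static final class Token {
        final String type;
        final String symbol;
        Token(String type, String symbol) { this.type = type; this.symbol = symbol; }
        @Override public String toString() { return type + ":" + symbol; }
    }

    /** A regular token declaration: a type and a compiled pattern. */
    private static final class RegularRule {
        final String type;
        final Pattern pattern;
        RegularRule(String type, Pattern pattern) { this.type = type; this.pattern = pattern; }
    }

    private final Map<String, Token> definite = new LinkedHashMap<>();
    private final List<RegularRule> regular = new ArrayList<>();

    void declareDefinite(String type, String text, String symbol) {
        definite.put(text, new Token(type, symbol));
    }

    void declareRegular(String type, String regex) {
        regular.add(new RegularRule(type, Pattern.compile(regex)));
    }

    List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // Step 1: look for the longest definite token starting at pos.
            String best = null;
            for (String text : definite.keySet()) {
                if (input.startsWith(text, pos) && (best == null || text.length() > best.length())) {
                    best = text;
                }
            }
            if (best != null) {
                tokens.add(definite.get(best));
                pos += best.length();
                continue;
            }
            // Step 2: fall back to the regular tokens, in declaration order.
            boolean matched = false;
            for (RegularRule rule : regular) {
                Matcher m = rule.pattern.matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > pos) {   // ignore empty matches
                    tokens.add(new Token(rule.type, m.group()));
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("No token matches at position " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        SketchScanner scanner = new SketchScanner();
        scanner.declareDefinite("TAG", "<B>", "B");
        scanner.declareDefinite("ETAG", "</B>", "B");
        scanner.declareRegular("WORD", "[a-zA-Z]+");
        scanner.declareRegular("SKIP", "[ ]+");
        System.out.println(scanner.scan("<B>Hi there</B>"));
    }
}
```

The sketch throws an exception when nothing matches; how the real scanner reacts in that case is not stated here.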

The Parser

A parser is capable of parsing a TokenSequence and returning the Symbols that the tokens result in. For HTML the parser is rather simple: it just concatenates the tokens of type WORD and CIRCUM. Additionally, it recursively starts another HTML parser when a token of type TAG is found, and finishes when the matching ETAG (if one is necessary) is found.
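
As a rough illustration of this recursion, the following Java sketch builds a small tree from a list of tokens. Everything in it is an assumption made for the example: the class SketchParser, the Node type, the artificial "ROOT" tag, and the representation of a token as a two-element String array of type and symbol. Token types other than WORD, CIRCUM, TAG and ETAG are simply ignored here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a recursive parser over {type, symbol} token pairs. */
class SketchParser {
    /** Either a tag node with children or a plain text node. */
    static final class Node {
        final String tag;    // tag name, or null for a text node
        final String text;   // text content, or null for a tag node
        final List<Node> children = new ArrayList<>();
        Node(String tag, String text) { this.tag = tag; this.text = text; }
        @Override public String toString() {
            return tag == null ? "'" + text + "'" : tag + children;
        }
    }

    private final List<String[]> tokens;
    private int pos = 0;

    SketchParser(List<String[]> tokens) { this.tokens = tokens; }

    /** Parses tokens until the ETAG matching the given tag, or the end of input. */
    Node parse(String tag) {
        Node node = new Node(tag, null);
        StringBuilder word = new StringBuilder();
        while (pos < tokens.size()) {
            String type = tokens.get(pos)[0];
            String symbol = tokens.get(pos)[1];
            pos++;
            if (type.equals("WORD") || type.equals("CIRCUM")) {
                word.append(symbol);              // concatenate word parts
                continue;
            }
            flush(node, word);                    // any other token ends the word
            if (type.equals("TAG")) {
                node.children.add(parse(symbol)); // nested parser for the tag body
            } else if (type.equals("ETAG") && symbol.equals(tag)) {
                return node;                      // matching end tag found
            }
        }
        flush(node, word);
        return node;
    }

    private static void flush(Node node, StringBuilder word) {
        if (word.length() > 0) {
            node.children.add(new Node(null, word.toString()));
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String[]> tokens = Arrays.asList(
            new String[] {"TAG", "B"},
            new String[] {"WORD", "Gr"},
            new String[] {"CIRCUM", "\u00fc"},
            new String[] {"WORD", "n"},
            new String[] {"ETAG", "B"});
        System.out.println(new SketchParser(tokens).parse("ROOT"));
    }
}
```

Concatenating WORD and CIRCUM is what turns the three tokens for "Gr", "&uuml;" and "n" back into one word.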

The Interpreter - Browser

For this simple HTML Browser, the Interpreter only prints out the set of symbol words parsed and reacts to the enclosing Tags. For every Tag known by this Browser, a different graphical style is used before the text is displayed. I know that this is not all a Browser does when displaying a page, but for a simple demonstration of a parser's possibilities it's enough, I think.

Of course, this HTML Browser respects line breaks and automatically wraps the text onto the next line if necessary. The Browser still does not follow links (which, by the way, is actually the most important facility a Browser should offer).
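
Fitting text onto the next line can be sketched as a greedy word wrap. The following Java example is a simplification under assumptions: the class SketchLineBreaker is a hypothetical name, and the width is counted in characters, whereas a graphical browser would presumably measure the rendered width of each word in the current font.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of greedy line wrapping, measured in characters. */
class SketchLineBreaker {
    /** Packs words into lines no wider than maxWidth, where a word fits at all. */
    static List<String> wrap(List<String> words, int maxWidth) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : words) {
            // Start a new line if the word (plus a separating space) would overflow.
            if (line.length() > 0 && line.length() + 1 + word.length() > maxWidth) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "simple", "HTML", "browser");
        for (String line : wrap(words, 11)) {
            System.out.println(line);
        }
    }
}
```

A word longer than the maximum width is placed on a line of its own and not split.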

The token declaration file

For the HTML Browser, the lexical token declaration file is:
TAG|<HTML>|HTML|
ETAG|</HTML>|HTML|
TAG|<HEAD>|HEAD|
ETAG|</HEAD>|HEAD|
TAG|<BODY>|BODY|
ETAG|</BODY>|BODY|
TAG|<H1>|H1|
ETAG|</H1>|H1|
TAG|<H2>|H2|
ETAG|</H2>|H2|
TAG|<H3>|H3|
ETAG|</H3>|H3|
TAG|<B>|B|
ETAG|</B>|B|
TAG|<I>|I|
ETAG|</I>|I|
STAG|<BR>|BR|
STAG|<P>|P|
STAG|</P>|/P|
WORD|([a-zA-Z]+)|=|
SCHAR|([_.,;:!])|=|
CIRCUM|&auml;|ä|
CIRCUM|&ouml;|ö|
CIRCUM|&uuml;|ü|
CIRCUM|&szlig;|ß|
CIRCUM|&eacute;|é|
SKIP|([ ]+)|=|
SKIP|
| |
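
Reading one line of the listing above could look like the following Java sketch. It rests on guesses drawn from the listing only: that the three fields are a type, a pattern and a symbol, that a pattern wrapped in parentheses is a Regular Expression, and that patterns never contain a `|` themselves. The class SketchDeclaration and its field names are hypothetical.

```java
/** Hypothetical sketch: one parsed line of the token declaration file. */
class SketchDeclaration {
    final String type;
    final String pattern;
    final String symbol;
    final boolean regular;   // assumption: parenthesised patterns are regular expressions

    SketchDeclaration(String type, String pattern, String symbol, boolean regular) {
        this.type = type;
        this.pattern = pattern;
        this.symbol = symbol;
        this.regular = regular;
    }

    /** Splits a line such as TAG|<HTML>|HTML| into its three fields. */
    static SketchDeclaration parseLine(String line) {
        String[] fields = line.split("\\|");   // trailing empty field is dropped
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected type|pattern|symbol|: " + line);
        }
        boolean regular = fields[1].startsWith("(") && fields[1].endsWith(")");
        return new SketchDeclaration(fields[0], fields[1], fields[2], regular);
    }

    public static void main(String[] args) {
        SketchDeclaration d = parseLine("WORD|([a-zA-Z]+)|=|");
        System.out.println(d.type + " " + d.pattern + " " + d.symbol + " " + d.regular);
    }
}
```

The sketch does not handle a declaration that spans two lines, such as the last SKIP entry.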
For this HTML Browser example, the types are TAG, ETAG, STAG, WORD, SCHAR, CIRCUM and SKIP.

Java HTML Browser


Requires the Java 2 Platform or the Java Collections Framework. The application will display the parsed HTML page in a separate window.

Download

If you want to test the simple HTML browser or scanner & parser:

The Future

See introduction.