This particular HTML browser project has been discontinued. Most of the development took place in 1996; the last minor bugfix version, dated 2002-03-18, is still available for download. It has been succeeded by a new, ongoing project: Synx.
A simple HTML Browser realized as a Java Applet
Scanning and Parsing
A complex language can be evaluated with the technique of scanners and parsers. Before a text written in a (formal or informal) language can be interpreted, it must first be broken into separate tokens. This lexical process is done by the scanner, which must know which tokens to recognize. The resulting TokenSequence is then checked and the tokens are grouped. This grammatical process is done by the parser, which must know the rules of the language's grammar. Once all tokens have been scanned and grouped according to the grammatical rules, the interpretation of the statements (or sentences) can start; what it does depends on the specific task.
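The grouping step can be illustrated with a minimal sketch in modern Java. It assumes the scanner has already produced the tokens and only checks that beginning and ending tags nest properly. The class NestingParserSketch, its Token type and the method isWellNested are hypothetical names chosen for this illustration; they are not taken from the browser's sources or from the Orbital library.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: checks that TAG/ETAG tokens in a scanned sequence nest properly. */
class NestingParserSketch {

    /** A scanned token: its grammatical type plus the token string it raises. */
    static final class Token {
        final String type;   // e.g. "TAG", "ETAG", "STAG", "WORD"
        final String token;  // e.g. "BODY", or the matched word itself

        Token(String type, String token) {
            this.type = type;
            this.token = token;
        }
    }

    /** Returns true if every TAG is closed by a matching ETAG in the right order. */
    static boolean isWellNested(List<Token> tokens) {
        Deque<String> open = new ArrayDeque<>();
        for (Token t : tokens) {
            if (t.type.equals("TAG")) {
                open.push(t.token);
            } else if (t.type.equals("ETAG")) {
                if (open.isEmpty() || !open.pop().equals(t.token)) {
                    return false;
                }
            }
            // STAG, WORD and all other types need no grouping in this sketch.
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        List<Token> page = Arrays.asList(
                new Token("TAG", "BODY"),
                new Token("TAG", "B"),
                new Token("WORD", "Hello"),
                new Token("ETAG", "B"),
                new Token("STAG", "BR"),
                new Token("ETAG", "BODY"));
        System.out.println(isWellNested(page)); // prints "true"
    }
}
```

A real parser would build a structure from the grouped tokens instead of returning a boolean; the stack is only the smallest mechanism that shows what grouping by grammatical rules means for tags.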
The Tokens
Tokens consist of
- a grammatical type which they belong to, like TAG
- a symbol that they match, like Assurance
- a token string which they raise when they occur, like BODY
Tokens are either
- definite tokens, which explicitly name the String that they match, like Assurance
- indefinite or regular tokens, which match all Strings that fulfil a certain Regular Expression, like [A-Z][0-9]*
This implementation scans definite tokens explicitly declared in the lexical token declaration file (like Assurance) and regular tokens specified as a Regular Expression (like [A-Z][0-9]*).
Reading from an input Reader, the scanner first tries to match the longest definite token possible. If no definite token matches, or no alternativeToken() can be found for the current symbolPart, the scanner falls back to the Regular Expression specifications, which are matched by a nondeterministic automaton.
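The matching order described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the original scanner: it works on a String instead of a Reader, it uses java.util.regex in place of the nondeterministic automaton, and the names LongestMatchScannerSketch, addDefinite, addRegular and scan are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: longest definite token first, regular expressions as fallback. */
class LongestMatchScannerSketch {

    /** A regular token rule: compiled pattern plus grammatical type. */
    private static final class RegularRule {
        final Pattern pattern;
        final String type;

        RegularRule(Pattern pattern, String type) {
            this.pattern = pattern;
            this.type = type;
        }
    }

    private final Map<String, String> definite = new LinkedHashMap<>(); // symbol -> type
    private final List<RegularRule> regular = new ArrayList<>();

    void addDefinite(String type, String symbol) {
        definite.put(symbol, type);
    }

    void addRegular(String type, String regex) {
        regular.add(new RegularRule(Pattern.compile(regex), type));
    }

    /** Returns "TYPE:text" entries; tokens of type SKIP are dropped. */
    List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String type = null;
            int length = 0;
            // Step 1: find the longest definite token starting at pos.
            for (Map.Entry<String, String> e : definite.entrySet()) {
                String symbol = e.getKey();
                if (symbol.length() > length && input.startsWith(symbol, pos)) {
                    type = e.getValue();
                    length = symbol.length();
                }
            }
            // Step 2: only if no definite token matched, try the regular tokens.
            if (type == null) {
                for (RegularRule rule : regular) {
                    Matcher m = rule.pattern.matcher(input);
                    m.region(pos, input.length());
                    if (m.lookingAt() && m.end() - pos > length) {
                        type = rule.type;
                        length = m.end() - pos;
                    }
                }
            }
            if (type == null) {
                throw new IllegalArgumentException("No token matches at position " + pos);
            }
            if (!type.equals("SKIP")) {
                out.add(type + ":" + input.substring(pos, pos + length));
            }
            pos += length;
        }
        return out;
    }

    public static void main(String[] args) {
        LongestMatchScannerSketch scanner = new LongestMatchScannerSketch();
        scanner.addDefinite("TAG", "<B>");
        scanner.addDefinite("ETAG", "</B>");
        scanner.addRegular("WORD", "[a-zA-Z]+");
        scanner.addRegular("SKIP", "[ ]+");
        System.out.println(scanner.scan("<B>Hi</B> you"));
        // prints "[TAG:<B>, WORD:Hi, ETAG:</B>, WORD:you]"
    }
}
```

Empty regular-expression matches are rejected because a match must be longer than zero characters, so the loop always advances or fails with an exception.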
The Interpreter - Browser
For this simple HTML Browser, the Interpreter only prints out the symbol words that were parsed and reacts to the enclosing Tags. For every Tag known by this Browser, a different graphical style is applied before the text is displayed. I know that this is not everything a Browser does when displaying a page, but I think it is enough for a simple demonstration of a parser's possibilities.
Of course, this HTML Browser respects line breaks and automatically wraps the text onto the next line if necessary. The Browser does not yet follow links (which, incidentally, is the most important facility a Browser should offer).
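How an interpreter of this kind might react to tags and wrap text can be sketched as follows. The sketch renders to plain text lines rather than to a window, and it stands in for a graphical style by printing words inside a B tag in upper case; both choices, along with the names WrappingInterpreterSketch and render, are assumptions made for this illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: applies a per-tag "style" and wraps words at a fixed width. */
class WrappingInterpreterSketch {

    /** Each token is a {type, token} pair, e.g. {"TAG", "B"} or {"WORD", "text"}. */
    static List<String> render(String[][] tokens, int width) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        Deque<String> openTags = new ArrayDeque<>();
        for (String[] t : tokens) {
            String type = t[0];
            String token = t[1];
            if (type.equals("TAG")) {
                openTags.push(token);
            } else if (type.equals("ETAG")) {
                if (!openTags.isEmpty()) {
                    openTags.pop();
                }
            } else if (type.equals("STAG") && token.equals("BR")) {
                // An explicit line break always ends the current line.
                lines.add(line.toString());
                line.setLength(0);
            } else if (type.equals("WORD")) {
                // Upper case is only a stand-in for a bold graphical style.
                String word = openTags.contains("B") ? token.toUpperCase(Locale.ROOT) : token;
                int needed = line.length() == 0
                        ? word.length()
                        : line.length() + 1 + word.length();
                if (needed > width && line.length() > 0) {
                    // The word does not fit: continue on the next line.
                    lines.add(line.toString());
                    line.setLength(0);
                }
                if (line.length() > 0) {
                    line.append(' ');
                }
                line.append(word);
            }
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String[][] tokens = {
            {"WORD", "a"}, {"WORD", "simple"},
            {"TAG", "B"}, {"WORD", "browser"}, {"ETAG", "B"},
            {"STAG", "BR"},
            {"WORD", "wraps"}, {"WORD", "text"}
        };
        for (String l : render(tokens, 12)) {
            System.out.println(l);
        }
    }
}
```

With a width of 12, the example produces the three lines "a simple", "BROWSER" and "wraps text": the second line starts because the word no longer fits, the third because of the BR tag.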
The token declaration file
For the HTML Browser, the lexical token declaration file is:
    TAG|<HTML>|HTML|
    ETAG|</HTML>|HTML|
    TAG|<HEAD>|HEAD|
    ETAG|</HEAD>|HEAD|
    TAG|<BODY>|BODY|
    ETAG|</BODY>|BODY|
    TAG|<H1>|H1|
    ETAG|</H1>|H1|
    TAG|<H2>|H2|
    ETAG|</H2>|H2|
    TAG|<H3>|H3|
    ETAG|</H3>|H3|
    TAG|<B>|B|
    ETAG|</B>|B|
    TAG|<I>|I|
    ETAG|</I>|I|
    STAG|<BR>|BR|
    STAG|<P>|P|
    STAG|</P>|/P|
    WORD|([a-zA-Z]+)|=|
    SCHAR|([_.,;:!])|=|
    CIRCUM|&auml;|ä|
    CIRCUM|&ouml;|ö|
    CIRCUM|&uuml;|ü|
    CIRCUM|&szlig;|ß|
    CIRCUM|&eacute;|é|
    SKIP|([ ]+)|=|
    SKIP| | |
For this HTML Browser example, the types are
- TAG - represents a beginning tag like <A>.
- ETAG - represents an ending tag like </A>.
- STAG - represents single tags that don't need a matching end tag like <BR>.
- WORD - a regular expression that represents any natural language word made up of the plain letters a-z and A-Z.
- SCHAR - represents a special character such as a full stop or an exclamation mark.
- CIRCUM - represents the HTML circumscriptions (character entities) for Unicode characters, like &eacute; for é.
- SKIP - a type internal to the scanner; all tokens of this type are regarded as insignificant and are skipped.
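Reading such declarations could look roughly like the sketch below. It assumes each declaration has the form type|symbol|token| and that a symbol enclosed in parentheses denotes a Regular Expression, which is inferred from the WORD, SCHAR and SKIP entries above rather than documented here. The names TokenDeclarationSketch, Declaration and parseLine are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: parses "TYPE|symbol|token|" lines of a token declaration file. */
class TokenDeclarationSketch {

    /** One declaration: grammatical type, symbol to match, token string to raise. */
    static final class Declaration {
        final String type;
        final String symbol;
        final String token;
        final boolean regular; // assumption: "(...)" marks a regular expression

        Declaration(String type, String symbol, String token, boolean regular) {
            this.type = type;
            this.symbol = symbol;
            this.token = token;
            this.regular = regular;
        }
    }

    static Declaration parseLine(String line) {
        // String.split drops trailing empty strings, so the final '|' adds no field.
        String[] parts = line.split("\\|");
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected type|symbol|token| but got: " + line);
        }
        String symbol = parts[1];
        boolean regular = symbol.length() >= 2
                && symbol.startsWith("(") && symbol.endsWith(")");
        return new Declaration(parts[0].trim(), symbol, parts[2], regular);
    }

    /** Parses all non-blank lines in order. */
    static List<Declaration> parse(List<String> lines) {
        List<Declaration> result = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                result.add(parseLine(line));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Declaration> decls = parse(Arrays.asList(
                "TAG|<BODY>|BODY|",
                "ETAG|</BODY>|BODY|",
                "WORD|([a-zA-Z]+)|=|"));
        for (Declaration d : decls) {
            System.out.println(d.type + " " + d.symbol + " -> " + d.token
                    + (d.regular ? " (regular)" : " (definite)"));
        }
    }
}
```

The sketch splits on every | character, so it would not handle a Regular Expression that itself contains |; none of the declarations above does.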
Java HTML Browser
Requires the Java 2 Platform or the Java Collections Framework. The application displays the parsed HTML page in a separate window.
If you want to test the simple HTML browser or scanner & parser:
- download the sources which are part of the examples in the Orbital library 1.0 documentation
- Note that our simple browser project is discontinued in favor of Synx, so the simple browser is no longer contained in Orbital library release 1.1, but only in 1.0.