Tokenizer

The Tokenizer is a helper class that supports syntax highlighting (colorizing) of source code loaded into DOM elements for rendering: <pre>, <code>, <textarea>, <plaintext>.

It is intended to be used together with the Selection.applyMark() function to mark text runs.

Constants

N/A

Properties

tokenStart
– bookmark, start of the token.
tokenEnd
– bookmark, position where the token ends.
tag
– string, markup tokenizer only – tag name, valid at #TAG-START and #TAG-END tokens.
attr
– string, markup tokenizer only – attribute name, valid at #TAG-ATTR token.
value
– string, token text content, or the attribute value at #TAG-ATTR token.
type
– symbol, either #source or #markup – type of the current tokenizer model.
element
– DOM element where the parsed token was found.

Methods

this
( element: Element, tokenizerType: symbol [, subType: symbol] ) : Tokenizer

Constructs a Tokenizer instance. Parameters:

  • element – DOM element that renders the source code text.
  • tokenizerType – defines the type of tokenizer used, either #markup or #source.
  • subType – symbol, at the moment either #style or anything else. When #style is provided, the tokenizer changes how #NAME tokens are parsed: a dash ( - ) is allowed as part of a name token, in accordance with CSS syntax.
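
The effect of the #style subType on #NAME tokens can be illustrated with a small model. This is only a sketch in Python, not the Tokenizer's actual implementation: the two regular expressions and the name_tokens helper are assumptions made up for illustration, since the exact character classes used by the tokenizer are not documented here.

```python
import re

# Hypothetical patterns: assumed approximations of a name token,
# not the character classes the real tokenizer uses.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")          # default: no dash
STYLE_NAME = re.compile(r"-?[A-Za-z_][A-Za-z0-9_-]*")  # "style": dash allowed

def name_tokens(text, sub_type=None):
    """Return the name-like runs found in text for the given sub type."""
    pattern = STYLE_NAME if sub_type == "style" else NAME
    return pattern.findall(text)

print(name_tokens("background-color"))           # dash splits the name
print(name_tokens("background-color", "style"))  # dash is kept in the name
```

Under these assumptions a CSS property such as background-color is reported as one name only when the style sub type is selected; otherwise the dash separates two names.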
push
( tokenizerType: symbol, until: string [, subType: symbol] ) : this

Pushes a sub-tokenizer for an "island" that has a different syntax. As a rule, it is used with a base #markup tokenizer to parse the content of <style> and <script> elements.

  • tokenizerType – defines the type of tokenizer used, either #markup or #source.
  • until – string, defines the end of the "island", for example "</script>" or "</style>". When until is encountered, the tokenizer returns the special #END-OF-ISLAND token.
  • subType – see above.
pop
( ) : this

Pops the last pushed tokenizer from the internal stack. As a rule, it is used in response to an #END-OF-ISLAND token.
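
The push/pop protocol can be modeled as a stack of (tokenizerType, until) pairs. The Python sketch below is an illustration under stated assumptions, not the Tokenizer's implementation: split_islands and the openers mapping are hypothetical, and island openers are detected by plain string matching, whereas real code would react to #TAG-START and #TAG-HEAD-END tokens.

```python
def split_islands(text, openers):
    """Split text into (tokenizer_type, chunk) runs.

    openers is a hypothetical mapping from an opening tag such as
    "<style>" to the (tokenizer_type, until) pair that would be pushed.
    """
    stack = [("markup", None)]  # base tokenizer has no terminator
    runs, pos, start = [], 0, 0
    while pos < len(text):
        kind, until = stack[-1]
        if until is not None and text.startswith(until, pos):
            runs.append((kind, text[start:pos]))  # island content ends here
            stack.pop()                           # the END-OF-ISLAND response
            start = pos
            pos += len(until)
            continue
        if until is None:
            opener = next((o for o in openers if text.startswith(o, pos)), None)
            if opener:
                pos += len(opener)
                runs.append((kind, text[start:pos]))
                stack.append(openers[opener])     # push the sub-tokenizer
                start = pos
                continue
        pos += 1
    if start < len(text):
        runs.append((stack[-1][0], text[start:]))  # trailing run
    return runs

runs = split_islands("<p><style>a{b:1}</style></p>",
                     {"<style>": ("source", "</style>")})
for kind, chunk in runs:
    print(kind, repr(chunk))
```

In this model the terminator string itself is handed back to the base tokenizer after the pop, so the closing tag is treated as ordinary markup.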

token
( ) : symbol

Parses the input and returns the type of the parsed token:

  • Tokens of #markup tokenizer:
    • #TAG-START – start of tag, e.g. when <div is parsed;
    • #TAG-HEAD-END – end of tag "head", e.g. when <div id="some"> is parsed;
    • #TAG-EMPTY-END – empty tag ended, e.g. when <div id="some" /> is parsed;
    • #TAG-END – end of tag, e.g. when </div> is parsed;
    • #TAG-ATTR – when attribute name/value parsed in tag head;
    • #TEXT – when text is parsed, tokenizer.value is the text;
    • #COMMENT – when comment is parsed, tokenizer.value is the comment;
    • #CDATA – when an SGML CDATA section is parsed;
    • #PI – when SGML processing instruction is parsed;
    • #DOCTYPE – when doctype declaration is parsed;
    • #ENTITY – after parse of HTML/XML entity, tokenizer.value is the entity text;
    • #ERROR – on basic HTML/XML syntax errors.
  • Tokens of #source tokenizer:
    • #NUMBER – number literal is parsed: 123, 456.09, 0xFF, etc.
    • #NUMBER-UNIT – number with a unit designator is parsed: 100px, 400ms, etc.
    • #STRING – string literal, only "string" or 'string' at the moment.
    • #NAME – name token is parsed, a sequence of characters that looks like a keyword or variable name.
    • #COMMENT – comment is parsed, either a single-line // comment or a multi-line /* ... */ comment.
    • #OPERATOR – sequence of "operator" characters: :!%+-/*;, etc.
    • #O-PAREN and #C-PAREN – ( or ) is encountered.
    • #ERROR – basic syntax errors, for example an unclosed string literal.
    • #END-OF-ISLAND – when a pushed tokenizer gets the sequence of characters defined by the until parameter.
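
To make the #source token kinds concrete, here is a minimal Python model that classifies text into the categories listed above. It is a sketch only: the regular expressions, the operator character set, and the tokenize function are assumptions chosen to match the examples in the list, not the rules the Tokenizer actually applies. Token kinds are written without the leading #.

```python
import re

# Assumed patterns, tried in order at each position; the first match wins.
TOKEN_SPEC = [(kind, re.compile(pattern)) for kind, pattern in [
    ("COMMENT",     r"//[^\n]*|/\*[\s\S]*?\*/"),
    ("STRING",      r'"[^"\n]*"|\'[^\'\n]*\''),
    ("ERROR",       r'["\'][^\n]*'),            # string literal never closed
    ("NUMBER",      r"0[xX][0-9A-Fa-f]+"),      # hex before NUMBER-UNIT
    ("NUMBER-UNIT", r"\d+(?:\.\d+)?[A-Za-z]+"),
    ("NUMBER",      r"\d+(?:\.\d+)?"),
    ("NAME",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("O-PAREN",     r"\("),
    ("C-PAREN",     r"\)"),
    ("OPERATOR",    r"[:!%+\-/*;,=<>&|.]+"),    # assumed operator set
    ("ERROR",       r"\S"),                     # anything else
]]

def tokenize(text):
    """Return a list of (kind, value) pairs, skipping whitespace."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for kind, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            if m:
                out.append((kind, m.group()))
                pos = m.end()
                break
    return out

for kind, value in tokenize("width: calc(100px + 0xFF); // note"):
    print(kind, value)
```

The ordering matters in this model: comments are tried before operators so that // is not read as two slashes, and hex literals are tried before numbers with units so that 0xFF is not read as 0 followed by a unit.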