Class TagTreeScanner
In: TagTreeScanner.rb
Parent: Object


The TagTreeScanner class provides a generic framework for creating a nested hierarchy of tags and text (like XML or HTML) by parsing text. An example use (and the reason it was written) is to convert a wiki markup syntax into HTML.

Example Usage

  require 'TagTreeScanner'

  class SimpleMarkup < TagTreeScanner
     @root_factory.allows_text = false

     @tag_genres[ :root ] = [ ]

     @tag_genres[ :root ] << TagFactory.new( :paragraph,
        # A line that doesn't have whitespace at the start
        :open_match => /(?=\S)/, :open_requires_bol => true,

        # Close when you see a double return
        :close_match => /\n[ \t]*\n/,
        :allows_text => true,
        :allowed_genre => :inline
     )

     @tag_genres[ :root ] << TagFactory.new( :preformatted,
        # Grab all lines that are indented up until a line that isn't
        :open_match => /((\s+).+?)\n+(?=\S)/m, :open_requires_bol => true,
        :setup => lambda{ |tag, scanner, tagtree|
           # Throw the contents I found into the tag
           # but remove leading whitespace
           tag << scanner[1].gsub( /^#{scanner[2]}/, '' )
        },
        :autoclose => true
     )

     @tag_genres[ :inline ] = [ ]

     @tag_genres[ :inline ] << TagFactory.new( :bold,
        # An asterisk followed by a letter or number
        :open_match => /\*(?=[a-z0-9])/i,

        # Close when I see an asterisk OR a newline coming up
        :close_match => /\*|(?=\n)/,
        :allows_text => true,
        :allowed_genre => :inline
     )

     @tag_genres[ :inline ] << TagFactory.new( :italic,
        # An underscore followed by a letter or number
        :open_match => /_(?=[a-z0-9])/i,

        # Close when I see an underscore OR a newline coming up
        :close_match => /_|(?=\n)/,
        :allows_text => true,
        :allowed_genre => :inline
     )
  end

  raw_text = <<ENDINPUT
  Hello World! You're _soaking in_ my test.
  This is a *subset* of markup that I allow.

  Hi paragraph two. Yo! A code sample:

    def foo
      puts "Whee!"
    end

  _That, as they say, is that._

  ENDINPUT

  markup = SimpleMarkup.new( raw_text ).to_xml
  puts markup

  #=> <paragraph>Hello World! You're <italic>soaking in</italic> my test.
  #=> This is a <bold>subset</bold> of markup that I allow.</paragraph>
  #=> <paragraph>Hi paragraph two. Yo! A code sample:</paragraph>
  #=> <preformatted>def foo
  #=>   puts "Whee!"
  #=> end</preformatted>
  #=> <paragraph><italic>That, as they say, is that.</italic></paragraph>


TagFactories at 10,000 feet

Each possible output tag is described by a TagFactory, which specifies some or all of the following:

  • The name of the tags it creates (required)
  • The regular expression to look for to start the tag
  • The regular expression to look for to close the tag, or
  • Whether the tag is automatically closed after creation
  • What genre of tags are allowed within the tag
  • Whether the tag supports raw text inside it
  • Code to run when creating a tag

See the TagFactory class for more information on specifying factories.

Genres as a State Machine

As a new tag is opened, the scanner uses the Tag#allowed_genre property of that tag (set by the allowed_genre property on the TagFactory) to determine which tags to look for. A genre is specified by adding an array to the @tag_genres hash, keyed by the genre name. For example:

  @tag_genres[ :inline ] = [ ]

adds a new genre named ‘inline’, with no tags in it. TagFactory instances should be pushed onto this array in the order that they should be looked for. For example:

  @tag_genres[ :inline ] << TagFactory.new( :italic,
    # see TagFactory#initialize for options
  )

Note that the close_match regular expression of the current tag is always checked before looking to open/create any new tags.
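The order of checks described above can be sketched with Ruby's standard StringScanner. This is a simplified illustration, not the library's implementation: MiniFactory, GENRES and scan_events are hypothetical names invented for the sketch, and it records a flat list of events instead of building a tag tree.

```ruby
require 'strscan'

# Hypothetical stand-in for a tag factory; NOT the library's TagFactory class.
MiniFactory = Struct.new(:name, :open_match, :close_match, :allowed_genre)

# Genre name => factories, in the order they should be tried.
GENRES = {
  :inline => [
    MiniFactory.new(:bold,   /\*(?=[a-z0-9])/i, /\*|(?=\n)/, :inline),
    MiniFactory.new(:italic, /_(?=[a-z0-9])/i,  /_|(?=\n)/,  :inline)
  ]
}

# Returns a flat list of [event, value] pairs so the order of checks is visible.
def scan_events(text, root_genre)
  scanner = StringScanner.new(text)
  stack   = []   # currently open factories, innermost last
  events  = []
  until scanner.eos?
    current = stack.last
    genre   = current ? current.allowed_genre : root_genre
    if current && scanner.scan(current.close_match)
      # 1. The close_match of the current tag is always tried first.
      events << [:close, current.name]
      stack.pop
    elsif (factory = GENRES[genre].find { |f| scanner.scan(f.open_match) })
      # 2. Otherwise try the factories of the active genre, in array order.
      events << [:open, factory.name]
      stack << factory
    else
      # 3. Otherwise consume one character of raw text.
      events << [:text, scanner.scan(/./m)]
    end
  end
  events
end

p scan_events("*b*c", :inline)
# prints [[:open, :bold], [:text, "b"], [:close, :bold], [:text, "c"]]
```

In "*b*c" the second asterisk is followed by a letter, so it would also satisfy the bold open_match. Because the close check runs first, it closes the open tag instead of starting a nested one.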

Consuming Text

As the text is being parsed, there will (probably) be many cases where you have raw text that doesn’t close or open any new tags. Whenever the scanner reaches this state, it runs the @text_match regexp against the text to move the pointer ahead. If the current tag has Tag#allows_text? set to true (through TagFactory#allows_text), then this text is added as contents of the tag. If not, the text is thrown away.

The safest regular expression consumes only one character at a time:

  @text_match = /./m

It is vital that your regexp match newlines (the ‘m’ flag) unless every single one of your tags is set to close upon seeing a newline.
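The effect of the ‘m’ flag can be seen with Ruby's standard StringScanner. This sketch assumes the scanner advances by matching at the current pointer position, as StringScanner#scan does; how the library reacts to a failed text match is not stated here.

```ruby
require 'strscan'

scanner = StringScanner.new("a\nb")
scanner.scan(/./)               # consumes "a"

stalled  = scanner.scan(/./)    # nil: without the m flag, "." does not match "\n"
advanced = scanner.scan(/./m)   # "\n": with the m flag the pointer moves past the newline
```

Without the flag, the match at the newline fails and the pointer does not move.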

Unfortunately, the safest regular expression is also the slowest. If speed is an issue, your regexp should strive to eat as many characters as possible at once, while ensuring that it doesn’t eat characters that would signify the start of a new tag.

For example, setting a regexp like:

  @text_match = /\w+|./m

allows the scanner to match a whole word at a time. However, if you have a tag factory set to look for "Hvv2vvO" to indicate a subscripted ‘2’, the entire string would be eaten as text and the subscript tag would never start.
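The trade-off can be checked with a small, self-contained sketch. SUB_OPEN and opener_seen? are hypothetical names; the subscript opener pattern is an assumption made up for this illustration, and the loop only mimics the "try to open a tag, else consume text" step.

```ruby
require 'strscan'

# Hypothetical opener for a subscript tag written as vv...vv (an assumption).
SUB_OPEN = /vv(?=\d)/

# Returns true if the opener is ever reached while consuming text with text_match.
def opener_seen?(text, text_match)
  scanner = StringScanner.new(text)
  until scanner.eos?
    return true if scanner.scan(SUB_OPEN)
    scanner.scan(text_match) or break   # stop if nothing can be consumed
  end
  false
end

p opener_seen?("Hvv2vvO", /./m)         # prints true
p opener_seen?("Hvv2vvO", /\w+|./m)     # prints false
p opener_seen?("Hvv2vvO", /[^v]+|./m)   # prints true
```

The third pattern is one possible middle ground: it eats runs of characters that cannot begin the opener and falls back to a single character otherwise. Which characters to exclude depends on your own tag factories.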

Using the Scanner

As shown in the example above, consumers of your class initialize it by passing in the string to be parsed, and then calling to_xml or to_html on it.

(This two-step process allows the consumer to run other code after the tag parsing, before final conversion. Examples might include replacing special command tags with other input, or performing database lookups on special wiki-page-link tags and replacing with HTML anchors.)


Methods

new   tags   tags_by_name   to_html   to_xml

Classes and Modules

Class TagTreeScanner::Tag
Class TagTreeScanner::TagFactory
Class TagTreeScanner::TextNode

Public Class methods

new( string_to_parse )

Scans through string_to_parse and builds a tree of tags based on the regular expressions and rules set by the TagFactory instances present in @tag_genres.

After parsing the tree, call to_xml or to_html to retrieve a string representation.

Public Instance methods

tags

Returns an array of all root-level tags found.

tags_by_name( tag_name )

Returns an array of all tags in the tree whose Tag#tag_name matches the supplied tag_name.

to_html

Returns an HTML representation of the tag tree.

This is the same as the to_xml method except that empty tags use an explicit close tag, e.g. <div></div> versus <div />.

to_xml

Returns an XML representation of the tag tree.

This is the same as the to_html method except that empty tags do not use an explicit close tag, e.g. <div /> versus <div></div>.
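The only difference between the two output methods is how an empty tag is written. A minimal sketch of that difference follows. MiniTag and render are hypothetical names, not the library's Tag class or its serializer, and the sketch does no escaping of text.

```ruby
# Hypothetical stand-in for a tag node; the real Tag class may differ.
MiniTag = Struct.new(:name, :children)

# Renders a node as :xml or :html; the styles differ only for empty tags.
def render(node, style)
  return node if node.is_a?(String)   # raw text, returned as-is in this sketch
  if node.children.empty?
    style == :html ? "<#{node.name}></#{node.name}>" : "<#{node.name} />"
  else
    inner = node.children.map { |child| render(child, style) }.join
    "<#{node.name}>#{inner}</#{node.name}>"
  end
end

tree = MiniTag.new(:div, [MiniTag.new(:br, []), "hi"])
puts render(tree, :xml)    # prints <div><br />hi</div>
puts render(tree, :html)   # prints <div><br></br>hi</div>
```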