Class: TagTreeScanner
In: TagTreeScanner.rb
Parent: Object
The TagTreeScanner class provides a generic framework for creating a nested hierarchy of tags and text (like XML or HTML) by parsing text. An example use (and the reason it was written) is to convert a wiki markup syntax into HTML.
require 'TagTreeScanner'

class SimpleMarkup < TagTreeScanner
  @root_factory.allows_text = false

  @tag_genres[ :root ] = [ ]

  @tag_genres[ :root ] << TagFactory.new( :paragraph,
    # A line that doesn't have whitespace at the start
    :open_match => /(?=\S)/, :open_requires_bol => true,
    # Close when you see a double return
    :close_match => /\n[ \t]*\n/,
    :allows_text => true,
    :allowed_genre => :inline
  )

  @tag_genres[ :root ] << TagFactory.new( :preformatted,
    # Grab all lines that are indented up until a line that isn't
    :open_match => /((\s+).+?)\n+(?=\S)/m, :open_requires_bol => true,
    :setup => lambda{ |tag, scanner, tagtree|
      # Throw the contents I found into the tag
      # but remove leading whitespace
      tag << scanner[1].gsub( /^#{scanner[2]}/, '' )
    },
    :autoclose => true
  )

  @tag_genres[ :inline ] = [ ]

  @tag_genres[ :inline ] << TagFactory.new( :bold,
    # An asterisk followed by a letter or number
    :open_match => /\*(?=[a-z0-9])/i,
    # Close when I see an asterisk OR a newline coming up
    :close_match => /\*|(?=\n)/,
    :allows_text => true,
    :allowed_genre => :inline
  )

  @tag_genres[ :inline ] << TagFactory.new( :italic,
    # An underscore followed by a letter or number
    :open_match => /_(?=[a-z0-9])/i,
    # Close when I see an underscore OR a newline coming up
    :close_match => /_|(?=\n)/,
    :allows_text => true,
    :allowed_genre => :inline
  )
end

raw_text = <<ENDINPUT
Hello World! You're _soaking in_ my test.
This is a *subset* of markup that I allow.

Hi paragraph two. Yo! A code sample:

  def foo
    puts "Whee!"
  end

_That, as they say, is that._

ENDINPUT

markup = SimpleMarkup.new( raw_text ).to_xml
puts markup
#=> <paragraph>Hello World! You're <italic>soaking in</italic> my test.
#=> This is a <bold>subset</bold> of markup that I allow.</paragraph>
#=> <paragraph>Hi paragraph two. Yo! A code sample:</paragraph>
#=> <preformatted>def foo
#=>   puts "Whee!"
#=> end</preformatted>
#=> <paragraph><italic>That, as they say, is that.</italic></paragraph>
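Each possible output tag is described by a TagFactory, which specifies the tag's name plus some or all of the options used in the example above: open_match, open_requires_bol, close_match, autoclose, allows_text, allowed_genre, and setup.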
See the TagFactory class for more information on specifying factories.
As a new tag is opened, the scanner uses the Tag#allowed_genre property of that tag (set by the allowed_genre property on the TagFactory) to determine which tags to look for. A genre is specified by adding an array to the @tag_genres hash, keyed by the genre name. For example:
@tag_genres[ :inline ] = [ ]
adds a new genre named ‘inline’, with no tags in it. TagFactory instances should be pushed onto this array in the order that they should be looked for. For example:
@tag_genres[ :inline ] << TagFactory.new( :italic,
  # see the TagFactory#initialize for options
)
Note that the close_match regular expression of the current tag is always checked before looking to open/create any new tags.
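The lookup order described above can be sketched with Ruby's standard StringScanner. This is only an illustration under stated assumptions, not TagTreeScanner's actual implementation: the Factory struct, the GENRES constant, and the next_action helper are hypothetical names invented for this sketch.

```ruby
require 'strscan'

# Hypothetical stand-in for a TagFactory: just a name and an opening regexp.
Factory = Struct.new(:name, :open_match)

# Genres map a genre name to an ordered list of factories,
# mirroring the @tag_genres hash described above.
GENRES = {
  :inline => [
    Factory.new(:bold,   /\*(?=[a-z0-9])/i),
    Factory.new(:italic, /_(?=[a-z0-9])/i)
  ]
}

# Decide what happens at the scanner's current position:
# the current tag's close_match is tried first, then each factory
# in the allowed genre, in the order it was pushed onto the array.
def next_action(scanner, close_match, genre)
  return [:close, nil] if close_match && scanner.scan(close_match)
  GENRES.fetch(genre, []).each do |factory|
    return [:open, factory.name] if scanner.scan(factory.open_match)
  end
  [:text, nil]
end

# Inside an open bold tag, "*" closes that tag...
inside_bold = next_action(StringScanner.new("*x"), /\*|(?=\n)/, :inline)
# ...whereas with no close_match to try, the same text opens a new bold tag.
at_top = next_action(StringScanner.new("*x"), nil, :inline)
```

Because factories are tried in array order, placing a more specific opening pattern before a more general one is what keeps the general one from shadowing it.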
As the text is being parsed, there will (probably) be many cases where you have raw text that doesn’t close or open any new tags. Whenever the scanner reaches this state, it runs the @text_match regexp against the text to move the pointer ahead. If the current tag has Tag#allows_text? set to true (through TagFactory#allows_text), then this text is added as contents of the tag. If not, the text is thrown away.
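As a rough sketch of that rule (again using StringScanner, and not the library's own code), the hypothetical SketchTag struct and consume_text helper below show the pointer advancing in both cases while only a text-allowing tag keeps what was consumed.

```ruby
require 'strscan'

# Hypothetical minimal tag: a name, an allows_text flag, and collected text.
SketchTag = Struct.new(:name, :allows_text, :contents)

# Advance the scanner with text_match; keep the consumed text only
# when the current tag allows text, otherwise discard it.
def consume_text(scanner, tag, text_match)
  text = scanner.scan(text_match)
  tag.contents << text if text && tag.allows_text
  text
end

paragraph = SketchTag.new(:paragraph, true,  String.new)
root      = SketchTag.new(:root,      false, String.new)

kept_scanner    = StringScanner.new("Hi")
dropped_scanner = StringScanner.new("Hi")

consume_text(kept_scanner,    paragraph, /./m)  # "H" is stored in paragraph
consume_text(dropped_scanner, root,      /./m)  # "H" is consumed but discarded
```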
The safest regular expression consumes only one character at a time:
@text_match = /./m
It is vital that your regexp match newlines (the ‘m’ flag) unless every single one of your tags is set to close upon seeing a newline.
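The effect of the ‘m’ flag can be checked directly with Ruby's standard StringScanner (used here only for demonstration; the variable names are illustrative):

```ruby
require 'strscan'

# Without the 'm' flag, '.' does not match a newline, so the scanner
# cannot advance past one.
stuck = StringScanner.new("\nnext line")
without_m = stuck.scan(/./)    # nil: nothing is consumed

# With the 'm' flag, '.' matches the newline and the pointer moves on.
moving = StringScanner.new("\nnext line")
with_m = moving.scan(/./m)     # "\n"
```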
Unfortunately, the safest regular expression is also the slowest. If speed is an issue, your regexp should strive to eat as many characters as possible at once, while ensuring that it doesn’t eat characters that would signify the start of a new tag.
For example, setting a regexp like:
@text_match = /\w+|./m
allows the scanner to match a whole word at a time. However, if you have a tag factory set to look for "Hvv2vvO" to indicate a subscripted ‘2’, the entire string would be eaten as text and the subscript tag would never start.
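A small trace makes the difference concrete. This is a sketch using StringScanner rather than the library itself; SUBSCRIPT_MARK and trace are hypothetical names, and the sketch only records where the "vv" marker is seen, without tracking whether it opens or closes a tag.

```ruby
require 'strscan'

# Hypothetical marker for a subscript tag.
SUBSCRIPT_MARK = /vv/

# Walk the input, trying the tag marker first and falling back to
# text_match, and record what each step consumed.
def trace(input, text_match)
  scanner = StringScanner.new(input)
  steps = []
  until scanner.eos?
    if scanner.scan(SUBSCRIPT_MARK)
      steps << [:marker, scanner.matched]
    else
      steps << [:text, scanner.scan(text_match)]
    end
  end
  steps
end

greedy  = trace("Hvv2vvO", /\w+|./m)  # one step: the whole string as text
careful = trace("Hvv2vvO", /./m)      # both "vv" markers are seen
```

With the greedy pattern, the scanner is never positioned at the start of "vv", so the marker is never tried there. A faster pattern therefore has to stop short of every character that can begin a tag.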
As shown in the example above, consumers of your class initialize it by passing in the string to be parsed, and then call to_xml or to_html on it.
(This two-step process allows the consumer to run other code after the tag parsing, before final conversion. Examples might include replacing special command tags with other input, or performing database lookups on special wiki-page-link tags and replacing with HTML anchors.)
TagTreeScanner.new( string_to_parse ) scans through string_to_parse and builds a tree of tags based on the regular expressions and rules set by the TagFactory instances present in @tag_genres.
After the tree has been built, call to_xml or to_html to retrieve a string representation.
to_html returns an HTML representation of the tag tree.
This is the same as the to_xml method except that empty tags use an explicit close tag, e.g. <div></div> versus <div />
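The empty-tag difference can be illustrated with a hypothetical serialize helper. This is not the library's code; it handles a single tag with plain string contents and is only meant to show the two output styles.

```ruby
# Hypothetical helper: serialize one tag with string contents,
# choosing the empty-tag style by output format.
def serialize(name, contents, format)
  if contents.empty?
    # HTML gets an explicit close tag; XML gets a self-closing tag.
    format == :html ? "<#{name}></#{name}>" : "<#{name} />"
  else
    "<#{name}>#{contents}</#{name}>"
  end
end

xml_empty  = serialize(:div, "", :xml)    # "<div />"
html_empty = serialize(:div, "", :html)   # "<div></div>"
```

Non-empty tags come out the same in both formats in this sketch; only empty tags differ.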