Parsing HTML

last modified: December 28, 2012

[under construction]

Parsing HTML can be sticky because one is often looking for something specific rather than doing a general parse, and because a lot of the HTML in the field is crappy, with bad nesting, missing end tags, etc. I propose that parsing results be put into a table set or similar structure, like this:

table: tags
------------
tagSequence
tagName   // Tag name. End tags include the forward-slash
docRef    // Foreign key to document
enderType // "pair" or "self". "self" means the tag itself ended with a forward-slash, as in <br/>
pairRef   // Reference to the matching start or end tag (if found)
parentRef // Reference to next outer tag. For example, a TD should have a TR parent.
content   // Content "between" start and end pairs (or at least up to the next tag)
warnings  // Any problems found by the parser

table: attributes
----------------
attribSequence
tagRef   // Foreign key to "tags" table
attribName
attribValue
valueCount  // 0 if attribute lacks a value (legitimate for some HTML), 1 if it has one
quoting  // Quote character used to wrap the value, if any
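
As a rough illustration, the two tables above could be set up in SQLite roughly as follows. This is only a sketch under my own assumptions: the column types, the CHECK constraints, and the choice to make tagSequence unique across all documents are not part of the proposal, and docRef is left as a bare integer because the document table is not described here.

```python
import sqlite3

# Column names follow the proposed tables; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE tags (
    tagSequence INTEGER PRIMARY KEY,
    tagName     TEXT NOT NULL,      -- end tags include the slash, e.g. '/b'
    docRef      INTEGER,            -- key to a document table (not shown)
    enderType   TEXT CHECK (enderType IN ('pair', 'self')),
    pairRef     INTEGER,            -- NULL when no matching tag was found
    parentRef   INTEGER,            -- NULL when no outer tag was found
    content     TEXT,
    warnings    TEXT
);
CREATE TABLE attributes (
    attribSequence INTEGER PRIMARY KEY,
    tagRef      INTEGER NOT NULL REFERENCES tags(tagSequence),
    attribName  TEXT NOT NULL,
    attribValue TEXT,
    valueCount  INTEGER CHECK (valueCount IN (0, 1)),
    quoting     TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Rows for: <b>hi</b><td class="x" nowrap>cell   (the td is never closed)
conn.executemany(
    "INSERT INTO tags (tagSequence, tagName, docRef, enderType, content) "
    "VALUES (?, ?, ?, ?, ?)",
    [(1, "b", 1, "pair", "hi"), (2, "/b", 1, "pair", ""), (3, "td", 1, "pair", "cell")],
)
# Relationship info is added only where a pair was actually found.
conn.execute("UPDATE tags SET pairRef = 2 WHERE tagSequence = 1")
conn.execute("UPDATE tags SET pairRef = 1 WHERE tagSequence = 2")
conn.executemany(
    "INSERT INTO attributes (tagRef, attribName, attribValue, valueCount, quoting) "
    "VALUES (?, ?, ?, ?, ?)",
    [(3, "class", "x", 1, '"'), (3, "nowrap", None, 0, None)],
)

# Tags whose partner was never found are still present and queryable.
unpaired = [r[0] for r in conn.execute(
    "SELECT tagName FROM tags WHERE pairRef IS NULL ORDER BY tagSequence")]
valueless = [r[0] for r in conn.execute(
    "SELECT attribName FROM attributes WHERE valueCount = 0")]
```

With the results in tables, "looking for something specific" becomes an ordinary query, and a tag with a missing partner simply shows up with an empty pairRef.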

If end tags, parent tags, etc. cannot be found, then the references are simply left empty. Other parsing tools tend to try to stuff everything into a tree, but if the nesting is messed up or the parser gets off track, then tree-ness will be damaged. The above instead includes any tree-ness that is found, but still maintains info about individual tags if big-picture parsing goes wrong. The above is more limp-friendly (LimpVersusDie) because it starts with a kind of "flat-ness" and adds tag relationship info only if it is properly found.
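
To make the "flat first, relationships if found" idea concrete, here is a minimal sketch built on Python's standard html.parser module. The names TagRow, FlatTagParser, and parse_flat are hypothetical, the matching rule (pair an end tag with the nearest open start tag of the same name) is my own simplification, and attributes and docRef are left out for brevity. Tags written without a closing slash, such as a bare <br>, are treated as ordinary unclosed start tags here.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class TagRow:
    """One row of the proposed "tags" table (docRef omitted)."""
    tagSequence: int
    tagName: str
    enderType: str = "pair"            # "pair" or "self"
    pairRef: Optional[int] = None      # stays None if no partner is found
    parentRef: Optional[int] = None    # stays None if no outer tag is open
    content: str = ""
    warnings: List[str] = field(default_factory=list)


class FlatTagParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows: List[TagRow] = []
        self.open_stack: List[int] = []    # tagSequence of still-open start tags

    def _add(self, name, ender="pair"):
        parent = self.open_stack[-1] if self.open_stack else None
        row = TagRow(len(self.rows) + 1, name, ender, parentRef=parent)
        self.rows.append(row)
        return row

    def handle_starttag(self, tag, attrs):
        self.open_stack.append(self._add(tag).tagSequence)

    def handle_startendtag(self, tag, attrs):
        self._add(tag, "self")             # e.g. <br/>

    def handle_endtag(self, tag):
        # Look for the nearest open start tag with the same name.
        match = None
        for i in range(len(self.open_stack) - 1, -1, -1):
            if self.rows[self.open_stack[i] - 1].tagName == tag:
                match = i
                break
        if match is None:
            # Record the stray end tag anyway; just leave pairRef empty.
            self._add("/" + tag).warnings.append("no matching start tag")
            return
        for seq in self.open_stack[match + 1:]:
            self.rows[seq - 1].warnings.append("no end tag found")
        start_seq = self.open_stack[match]
        del self.open_stack[match:]
        row = self._add("/" + tag)
        row.pairRef = start_seq
        self.rows[start_seq - 1].pairRef = row.tagSequence

    def handle_data(self, data):
        if self.rows:                      # text up to the next tag
            self.rows[-1].content += data


def parse_flat(html: str) -> List[TagRow]:
    parser = FlatTagParser()
    parser.feed(html)
    parser.close()
    for seq in parser.open_stack:          # start tags never closed
        parser.rows[seq - 1].warnings.append("no end tag found")
    return parser.rows


rows = parse_flat("<table><tr><td>A<td>B</tr></table><b>x</i>")
```

Every tag in the sloppy input still gets its own row. The two TD tags and the B tag end up with an empty pairRef plus a warning, while TABLE and TR are paired normally, so the damage stays local instead of derailing the whole parse.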
