wiiki - ParsingHtmlWithRegex

AntiPattern Name: Parsing HTML with regular expressions

Type: Development

Problem: Extracting information from HTML text, such as elements or groups of elements with a specific property.

Context: Programmer understands how to use regular expressions and sees that HTML looks fairly regular and easy to parse using this tool.

Causes:

Using a formated page as a data source instead of seeking a proper data source.
Not having taken a formal class in parser construction.
Not reading the formal HTML specification.

Supposed Solution: Write a half-baked regular expression that works for specific test cases.

Resulting Context: Regular expression may work for most cases, but fails on oddly structured but valid HTML or broken HTML, both of which are all-too-common on the Wild Wild Web.

Common Resulting problems:

The parser breaks if the HTML code is changed at all.
HTML code remaining in the parsed result.
Inability to cope with tricky nestling structures.
Failure to correctly decode entities.

RefactoredSolution: Find a data source that was designed to be used as a data source. Alternatively, use an existing parser of the correct complexity.

See also: ParsingHtml, http://stackoverflow.com/a/1732454/699224 for another take on this anti-pattern

CategoryDevelopmentAntiPattern