Enumerating Regular Languages

last modified: April 30, 2006

A problem proposed by Jayadev Misra is allegedly a good test for qualities of a language, methodology, individual technique, etc. We can look at it as a good LanguageTestCase. Of course, being heavily algorithmic in nature most OO languages do not fare well with regards to EconomyOfExpression, and to my chagrin it's not even clear how SchemeLanguage can compete with HaskellLanguage and MlLanguage.

The problem is stated as follows: you have a regular language described by a given RegularExpression. Given a natural number N, enumerate the first N strings in that language in a modified lexicographical order: the strings are compared first on their length and two strings of the same length are considered in lexicographical order, with some ordering imposed on the characters of the alphabet (assume ASCII ordering, to keep the problem simpler).

The simplest idea is to generate a (optimized if possible) NondeterministicFiniteAutomaton that recognizes the language, and emulate its functioning for an unknown input source (do it in reverse, the automaton unrolls the imaginary input source) while when reaching final states outputting the string that led to that final state. A nice Haskell solution can be picked up from DougMcIlroy page. It's an outstanding demonstration of the EconomyOfExpression qualities of HaskellLanguage.

Another brute force solution would be to enumerate all strings in lexicographical order (at least limit the alphabet to the characters present in the RE) while testing for matching and counting up to N. Of course that won't give you much elegance points (and don't try this at home for large expression or big N, you may be disappointed).

Sounds like a job involving lots of CoRoutines...


I've translated the version in the paper published at http://www.cs.utexas.edu/users/misra/Notes.dir/RegExp.pdf from HaskellLanguage to RubyLanguage. I've used the lazy lists library for Ruby to handle the lists (http://lazylist.rubyforge.org/). The code base length is just a little bit longer than the original and it's almost copied from there. -- AurelianoCalvo.

Usage is like this:

((("A".atom | "B".atom).repeat ) + "C!".atom).enum.take( 10 ) #enumerates the 10 first elements of the RegularExpression language (a|b)*C!.

The result is:

["C!", "AC!", "BC!", "AAC!", "ABC!", "BAC!", "BBC!", "AAAC!", "AABC!", "ABAC!"]

Below is the implementation:

require 'lazylist'

class LazyList
        def + (other)
                self.merge( other ) do 
                        |a,b| 
                        a.length != b.length ? a.length < b.length : a < b
                end
        end
        def * (other)
                return list if self.empty? or other.empty?
                return list( self.head + other.head ) {
                        (self.tail * LazyList[[other.head]]) + (self * other.tail) },
        end
        def closure
                return LazyList[[ "" ]] if self.empty?
                return closure( self.tail ) if self.head == ""
                return list( "" ) {
                        self * closure
                },
        end
end

class String
        def atom
                EnumRegLang::Atom.new(self)
        end
end

module EnumRegLang

        class RegLang
                def | (other)
                        Option.new( self, other )
                end
                def + (other)
                        Concat.new( self, other )
                end
                def repeat
                        Repeat.new( self )
                end
        end

        VOID = RegLang.new

        def VOID.enum
                list
        end

        class Atom < RegLang
                def initialize( text )
                        @text = text
                end

                def enum
                        LazyList.from_enum [ @text ]
                end
        end

        class Option < RegLang
                def initialize( lang1, lang2 )
                        @lang1 = lang1
                        @lang2 = lang2
                end
                def enum
                        @lang1.enum + @lang2.enum
                end
        end

        class Concat
                def initialize( lang1, lang2 )
                        @lang1 = lang1
                        @lang2 = lang2
                end
                def enum
                        @lang1.enum * @lang2.enum
                end
        end

        class Repeat < RegLang
                def initialize( lang )
                        @lang = lang
                end
                def enum
                        @lang.enum.closure
                end
        end
end
other.head ""

Loading...