Building sets from set() and ANY

When defining a regular expression there is a well known duality between a set of characters specified by enclosing characters or character ranges within square brackets […] and defining the pattern of the complement set [^…] by preceding the caret symbol ^ .

[0-9] — matches all digits

[^0-9] — matches all non-digits

The nature of the complement set mentioned here is less precisely defined and is implementation dependent in practice. If a regexp matcher knows about ASCII only then [^0-9] is the set of all ASCII characters which are not digits.

Suppose now we want to build a character matching set from sets in the sense of a set data-type as inhabited in Python as a builtin. This is easy because we start from the empty set() and add just more elements. Each finite set can be constructed in this way and we can establish a set ~ pattern correspondence:

set(range(9)) ~ [0-9]

But how do we find a corresponding set for [^0-9]?

This cannot be finitely constructed from the bottom up using set() as a start value but it has to be finitely de-constructed from the set of all elements. This looks like we are running into unlimited comprehension and the Russel paradox but actually the set theory we are using is quite weak and element membership can be decided just by doing hashtable lookups.

Instead of constructing a set from set() we want to (de-)construct it from ANY. The name ANY is not chosen by chance because we want to express the correspondence with . pattern:

ANY ~ .

( Actually ANY is a little stronger because . doesn’t match newline characters unless explicitly specified by the DOTALL flag in the re.compile function. )

Using ANY we can easily state the above correspondence:

ANY – set(range(9)) ~ [^0-9]

Implementation of ANY

We define ANY as an instance of univset. The univset has a set-like interface and implements set operations consistently on sets of the type ANY-A, ANY-B,… where A, B… are finite sets constructed from set(). For example

ANY-A ∩ ANY-B = ANY – ( A ∪ B )

ANY-A ∪ B = ANY – ( A – B )

class univset ( object ) : def __init__ ( self ) : self ._diff = set ( ) def __sub__ ( self , other ) : S = univset ( ) if type ( other ) == set : S._diff = self ._diff | other return S else : S._diff = self ._diff | other._diff return S def __rsub__ ( self , other ) : return other & ; self ._diff def __contains__ ( self , obj ) : return not obj in self ._diff def __and__ ( self , other ) : return other - self ._diff def __rand__ ( self , other ) : return other - self ._diff def __repr__ ( self ) : if self ._diff == set ( ) : return "ANY" else : return "ANY - %s" % self ._diff def __or__ ( self , other ) : S = univset ( ) S._diff = self ._diff - other return S def __xor__ ( self , other ) : return ( self - other ) | ( other - self ) def add ( self , elem ) : if elem in self ._diff: self ._diff. remove ( elem ) def update ( self , elem ) : self ._diff = self ._diff - other def __ror__ ( self , other ) : return self . __or__ ( other ) def union ( self , other ) : return self . __or__ ( other ) def difference ( self , other ) : return self . __sub__ ( other ) def intersection ( self , other ) : return self . __and__ ( other ) def symmetric_difference ( self , other ) : return self . __xor__ ( other ) def issubset ( self , other ) : if type ( other ) == set : return False if issubset ( other._diff , self ._diff ) : return True return False def issuperset ( self , other ) : if self ._diff & ; other: return False return True def __lt__ ( self , other ) : return self . issubset ( other ) def __eq__ ( self , other ) : if type ( other ) == set : return False try : return self ._diff == other._diff except AttributeError : return False def __ne__ ( self , other ) : return not self . __eq__ ( other ) def __le__ ( self , other ) : return self . __lt__ ( other ) or self . __eq__ ( other ) def __gt__ ( self , other ) : return self . issuperset ( other ) def __gt__ ( self , other ) : return self . issuperset ( other ) or self == other

Examples

ANY = univset ( ) NON_DIGITS = ANY - set ( range ( 9 ) ) assert 8 not in NON_DIGITS assert 'a' in NON_DIGITS assert 0 in NON_DIGITS | set ( [ 0 ] )

Tags: Add new tag