8 Regular Expressions (E Next - In)

This example also illustrates the use of the function __str__, which is internally invoked
to convert a user-defined object into a string. When two instances of a Box are added, the
function __add__ is invoked. In this case, the add function is creating a new Box that
holds the sum of the argument values.
Other operators are implemented in a similar fashion. The following table lists a few of
the more common operators, and their functional equivalents:
a + b        __add__ (a, b)
a - b        __sub__ (a, b)
a * b        __mul__ (a, b)
a / b        __div__ (a, b)
a % b        __mod__ (a, b)
a & b        __and__ (a, b)
-a           __neg__ (a)
len(a)       __len__ (a)
a[b]         __getitem__ (a, b)
a[b] = c     __setitem__ (a, b, c)
str(a)       __str__ (a)
if a         __nonzero__ (a)
int(a)       __int__ (a)
a(b, c, d)   __call__ (a, b, c, d)
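The Box class itself is not reproduced in this part of the chapter, so the following is only a minimal sketch of how a few of these operator functions might be defined. The attribute name value and the string format returned by __str__ are assumptions made for illustration.

```python
class Box(object):
    """Hypothetical stand-in for the Box class discussed in the text."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # a + b is evaluated as a.__add__(b); the result is a new Box
        return Box(self.value + other.value)

    def __sub__(self, other):
        # a - b is evaluated as a.__sub__(b)
        return Box(self.value - other.value)

    def __neg__(self):
        # -a is evaluated as a.__neg__()
        return Box(-self.value)

    def __str__(self):
        # str(a) is evaluated as a.__str__()
        return "Box(" + str(self.value) + ")"


a = Box(3)
b = Box(4)
total = a + b
print(str(total))    # Box(7)
print(str(a - b))    # Box(-1)
print(str(-a))       # Box(-3)
```

Each operator returns a new Box rather than modifying its arguments, which mirrors the behavior described for the add function.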
Regular Expressions and String Processing
The method find is used to find a fixed string embedded within a larger string. For
example, to locate the text “fis” within the string “existasfisamatic”, you
can execute the following:
>>> s = 'existasfisamatic'
>>> s.find('fis')
7
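As a small sketch of how the result of find is typically used, the following extracts the matched text and shows what happens when the search string is absent. The variable names target, start, end and matched are chosen here for illustration.

```python
s = 'existasfisamatic'
target = 'fis'

start = s.find(target)           # index of the first occurrence, or -1
if start != -1:
    end = start + len(target)    # a fixed string has a known length
    matched = s[start:end]
else:
    matched = None

print(start)            # 7
print(matched)          # fis
print(s.find('xyz'))    # -1, because 'xyz' does not occur in s
```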
The length of the search string (in this case, 3) gives you the size of the matched text. But
what if you were searching for a floating point constant? A number has not only an
indefinite number of digits, but it may or may not have a fractional part, and it may or
may not have an exponent. Even if you can locate the start of the string (by, for example,
searching for a digit character), how do you determine the length of the matched text?
The solution is to use a technique termed regular expressions. Regular expression
notations were being used by mathematicians and computer scientists even before
computers were common. The particular notation used by the Python library derives from
conventions originating with the Unix operating system. In Python the regular expression
package is found in the module named re.
Exploring Python – Chapter 11: Advanced Features 4
https://E-next.in
The regular expression notation will at first seem cryptic, but it has the advantage of
being short and, with practice, easy to understand and remember. The most common
regular expression notations are shown in the table below. The symbols ^ and $ are
used to represent the start and end of a string. Parentheses can be used for grouping, and
the * and + signs are used to represent the idea of zero-or-more and one-or-more,
respectively. Square brackets denote character classes: a single character from a given
set of values. Dashes help simplify the description of a range of characters; for example,
a-f represents the set abcdef, and A-Z can be used to match any capital letter.
Text         Matches literal
^            Start of string
$            End of string
(…)*         Zero or more occurrences
(…)+         One or more occurrences
(…)?         Optional (zero or one)
[chars]      One character from range
[^chars]     One character not from range
pat | pat    Alternative (one or the other)
(…)          Group
.            Any char except newline
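The notations can be tried out one at a time. The sketch below uses the function search from the re module mentioned above; the sample patterns and strings are made up for illustration, and each case records the text that the pattern is expected to match.

```python
import re

# (pattern, text, expected match); all sample data is illustrative
cases = [
    (r"cat",       "concatenate", "cat"),     # literal text
    (r"^con",      "concatenate", "con"),     # ^ anchors at the start
    (r"ate$",      "concatenate", "ate"),     # $ anchors at the end
    (r"(ab)*c",    "ababc",       "ababc"),   # zero or more occurrences
    (r"(ab)+",     "xxababxx",    "abab"),    # one or more occurrences
    (r"colo(u)?r", "color",       "color"),   # optional part
    (r"[a-f]+",    "xyzfacexyz",  "face"),    # one character from a range
    (r"[^0-9]+",   "123abc456",   "abc"),     # one character not from a range
    (r"cat|dog",   "hotdog",      "dog"),     # alternative
    (r"a.c",       "xxabcxx",     "abc"),     # . matches any one character
]

results = []
for pattern, text, expected in cases:
    m = re.search(pattern, text)          # None when nothing matches
    found = m.group(0) if m else None     # group(0) is the matched text
    results.append(found)
    print(pattern + " on " + text + " -> " + str(found))
```

The patterns are written as raw strings (with an r prefix) so that backslashes, when they appear, reach the regular expression library unchanged.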
Let us see how to define a regular expression for floating point literals. The basic unit is
the digit. This is a single character from a range of possibilities (sometimes termed a
character class). The square brackets are used to surround the list of possible characters. So
this could be written as [0123456789]. However, when the characters are part of a
sequential group of ASCII characters the regular expression library allows you to simply
list the starting and ending points, as in [0-9]. (Other common sequences of characters are
the lower case letters, a-z, and the upper case letters, A-Z.) Since our floating point
number begins with one or more digits, we need to surround the pattern with
parentheses and follow it with a plus sign, as in ([0-9])+.
Next, we need a pattern to match a decimal point followed by zero or more digits. Literal
characters generally represent themselves, but the period has a special meaning, and so it
must be escaped. So this is \.([0-9])*. If we want to make it optional, we surround it with
parentheses and follow it with a question mark, as in (\.([0-9])*)?.
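The two pieces built so far can already be tested together. This is a minimal sketch, assuming the function search from the re module; the sample strings are illustrative.

```python
import re

# One or more digits, then an optional fractional part
mantissa = r"([0-9])+(\.([0-9])*)?"

matched = []
for text in ["42", "42.", "42.75", "x = 3.5 units"]:
    m = re.search(mantissa, text)
    matched.append(m.group(0))           # group(0) is the matched text
    print(text + " -> " + m.group(0))
```

Note that "42." is matched in full, because the digits after the decimal point are allowed to occur zero times.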
Finally, we have the optional exponent part, which consists of the letter e or E followed
by an optional sign and a number consisting of one or more digits. This is
([eE]([+-])?([0-9])+)?. Putting everything together gives us our final pattern:
([0-9])+(\.[0-9]*)?([eE]([+-])?([0-9])+)?
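One way to check the final pattern is to anchor it with ^ and $, so that it must account for an entire string, and then try it on a handful of candidates. The helper name is_float_literal and the sample strings are assumptions made for this sketch.

```python
import re

FLOAT = r"([0-9])+(\.[0-9]*)?([eE]([+-])?([0-9])+)?"

# Anchoring with ^ and $ makes the pattern cover the whole string
WHOLE = "^" + FLOAT + "$"


def is_float_literal(text):
    """Return True if text, taken as a whole, fits the pattern."""
    return re.search(WHOLE, text) is not None


samples = ["7", "3.14", "2.45e-35", "6E+23", "1.", ".5", "1e", "abc"]
verdicts = []
for s in samples:
    verdicts.append(is_float_literal(s))
    print(s + " -> " + str(is_float_literal(s)))
```

The string ".5" is rejected because the pattern requires at least one digit before the decimal point, and "1e" is rejected because an exponent, when present, must contain at least one digit.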
Having defined the regular expression pattern, we must then compile it into a regular
expression object. The regular expression object is an internal form used for pattern
matching. The following illustrates this process:
>>> import re
>>> pat = re.compile(r"([0-9])+(\.[0-9]*)?([eE]([+-])?([0-9])+)?")
Make sure you qualify the name compile with the prefix re. There is another function
named compile in the standard library, which does a totally different task. The pattern
then supports a number of different search operations. The simplest of these is named
search. This operation takes as argument a text string, and returns a match object. Again,
make sure you qualify the name. A match object supports a variety of operations.
One is to test whether or not the match was successful:
>>> text = "the value 2.45e-35 is close to correct"
>>> mtcobj = pat.search(text)
>>> if mtcobj: print "found it"
...
found it
However, the match object can also tell you the start and ending positions of the matched
text:
>>> print mtcobj.start(), mtcobj.end()
10 18
>>> text[mtcobj.start():mtcobj.end()]
'2.45e-35'
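The same steps can be collected into a small function that also handles the case where nothing matches, since search returns None in that situation. The function name first_number is hypothetical; start, end and group are methods of the match object.

```python
import re

pat = re.compile(r"([0-9])+(\.[0-9]*)?([eE]([+-])?([0-9])+)?")


def first_number(text):
    """Return (start, end, matched text) for the first match, or None."""
    m = pat.search(text)
    if m is None:              # no match anywhere in text
        return None
    return (m.start(), m.end(), m.group(0))


print(str(first_number("the value 2.45e-35 is close to correct")))
# (10, 18, '2.45e-35')
print(str(first_number("no digits here")))
# None
```

Here group(0) returns the same text as the slice text[m.start():m.end()].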
A table in Appendix A lists the most common operations found in the regular expression
module.
Iterators and Generators
The for loop has the general form:
for ele in collection:
In earlier chapters we have seen how various different types of collection can be used
with a for loop. If the collection is a string, the loop iterates over the characters in the
string. If it is a list, the elements in the list are generated. If the collection is a dictionary,
the elements refer to the keys for the dictionary. Finally, if the collection is a file, the
elements produced are the individual lines from the file.
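The four cases can be seen side by side in the following sketch. The sample data is made up, and a temporary file is created only so that the example is self-contained.

```python
import os
import tempfile

# String: the loop produces the characters
chars = []
for ch in "abc":
    chars.append(ch)

# List: the loop produces the elements
squares = []
for n in [1, 2, 3]:
    squares.append(n * n)

# Dictionary: the loop produces the keys
ages = {"ann": 31, "bob": 27}
names = []
for name in ages:
    names.append(name)
names.sort()    # sorted so the printed order does not depend on the dictionary

# File: the loop produces the lines, including the newline characters
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("first\nsecond\n")
lines = []
with open(path) as f:
    for line in f:
        lines.append(line)
os.remove(path)

print(str(chars))      # ['a', 'b', 'c']
print(str(squares))    # [1, 4, 9]
print(str(names))      # ['ann', 'bob']
print(str(lines))      # ['first\n', 'second\n']
```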
It is also possible to create a user-defined object that interacts with the for statement. This
happens in two steps. In the first step, the for statement passes the message __iter__ to
the collection. This function should return a value that understands the iterator protocol.
In the second step, the for statement repeatedly asks that value for the next element.
The iterator protocol consists of a single method, named next, that is expected to produce
values in turn until the collection is exhausted. Once exhausted, the function must raise a
StopIteration exception.
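The two steps can be carried out by hand to see roughly what the for statement does. This sketch uses the built-in functions iter and next, which send the corresponding messages to their argument; in Python 3 the method behind next is spelled __next__ rather than next.

```python
values = [10, 20, 30]

it = iter(values)         # step one: ask the collection for an iterator
seen = []
while True:
    try:
        ele = next(it)    # step two: ask the iterator for the next value
    except StopIteration:
        break             # the collection is exhausted
    seen.append(ele)

print(str(seen))    # [10, 20, 30]
```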
The following two classes illustrate this behavior. The first class maintains a collection of
values stored in a list. When an iterator is requested, it creates an instance of another class
named SquareIterator. The SquareIterator class cycles through the values in the list,
returning the square of every element.
class SquareCollection (object):
    def __init__ (self):
        self.values = [ ]
    def add (self, v):
        self.values.append(v)
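The listing breaks off here, before the __iter__ method and the SquareIterator class are shown. The following is a sketch of one way the two classes could fit together, based on the description above; it is not the original listing, and the index attribute and the body of SquareIterator are assumptions.

```python
class SquareIterator(object):
    """Hypothetical iterator returning the square of each stored value."""

    def __init__(self, values):
        self.values = values
        self.index = 0          # position of the next value to return

    def __iter__(self):
        return self             # an iterator is its own iterator

    def __next__(self):
        if self.index >= len(self.values):
            raise StopIteration   # signals that the values are exhausted
        v = self.values[self.index]
        self.index += 1
        return v * v

    next = __next__             # the name used by the text (Python 2)


class SquareCollection(object):
    def __init__(self):
        self.values = []

    def add(self, v):
        self.values.append(v)

    def __iter__(self):
        # each for loop receives a fresh iterator
        return SquareIterator(self.values)


c = SquareCollection()
for v in (1, 2, 3):
    c.add(v)

result = []
for ele in c:
    result.append(ele)
print(str(result))    # [1, 4, 9]
```

Because __iter__ creates a new SquareIterator on every call, the same collection can be looped over more than once.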