You are on page 1of 16

Sets and Frozensets

Introduction

The philosopher Wittgenstein didn't like the set theory and complained mathematics
is "ridden through and through with the pernicious idioms of set theory," He dismissed
the set theory as "utter nonsense", as being "laughable" and "wrong". His criticism
appeared years after the death of the German mathematician Georg Cantor, the founder of
the set theory.
Cantor defined a set at the beginning of his "Beiträge zur Begründung der transfiniten
Mengenlehre":
"A set is a gathering together into a whole of definite, distinct objects of our perception
and of our thought - which are called elements of the set." Nowadays, we can say in
"plain" English: A set is a well defined collection of objects.
The elements or members of a set can be anything: numbers, characters, words, names,
letters of the alphabet, even other sets, and so on. Sets are usually denoted with capital
letters. This is not the exact mathematical definition, but it is good enough for the
following.

Sets in Python
The data type "set", which is a collection type, has been part of Python since
version 2.4. A set contains an unordered collection of unique and immutable objects. The
set data type is, as the name implies, a Python implementation of the sets as they are
known from mathematics. This explains, why sets unlike lists or tuples can't have
multiple occurrences of the same element.

Creating Sets
If we want to create a set, we can call the built-in set function with a sequence or
another iterable object.
In the following example, a string is singularized into its characters to build the resulting
set x:

>>> x = set("A Python Tutorial")


>>> x
set(['A', ' ', 'i', 'h', 'l', 'o', 'n', 'P', 'r', 'u', 't', 'a', 'y', 'T'])
>>> type(x)
<type 'set'>
>>>

We can pass a list to the built-in set function, as we can see in the following:

>>> x = set(["Perl", "Python", "Java"])


>>> x
set(['Python', 'Java', 'Perl'])
>>>
We want to show now, what happens, if we pass a tuple with reappearing elements to the
set function - in our example the city "Paris":

>>> cities = set(("Paris", "Lyon", "London","Berlin","Paris","Birmingham"))


>>> cities
set(['Paris', 'Birmingham', 'Lyon', 'London', 'Berlin'])
>>>

As we have expected, no doublets are in the resulting set of cities.

Immutable Sets
Sets are implemented in a way, which doesn't allow mutable objects. The
following example demonstrates that we cannot include, for example, lists as elements:

>>> cities = set((("Python","Perl"), ("Paris", "Berlin", "London")))


>>> cities = set((["Python","Perl"], ["Paris", "Berlin", "London"]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>>

Frozensets
Though sets can't contain mutable objects, sets are mutable:

>>> cities = set(["Frankfurt", "Basel","Freiburg"])


>>> cities.add("Strasbourg")
>>> cities
set(['Freiburg', 'Basel', 'Frankfurt', 'Strasbourg'])
>>>

Frozensets are like sets except that they cannot be changed, i.e. they are immutable:

>>> cities = frozenset(["Frankfurt", "Basel","Freiburg"])


>>> cities.add("Strasbourg")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'frozenset' object has no attribute 'add'
>>>

Simplified Notation
We can define sets (since Python2.6) without using the built-in set function. We
can use curly braces instead:

>>> adjectives = {"cheap","expensive","inexpensive","economical"}


>>> adjectives
set(['inexpensive', 'cheap', 'expensive', 'economical'])
>>>

Set Operations
add(element)
A method which adds an element, which has to be immutable, to a set.

>>> colours = {"red","green"}


>>> colours.add("yellow")
>>> colours
set(['green', 'yellow', 'red'])
>>> colours.add(["black","white"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>>
Of course, an element will only be added, if it is not already contained in the set. If it is
already contained, the method call has no effect.

clear()
All elements will removed from a set.
>>> cities = {"Stuttgart", "Konstanz", "Freiburg"}
>>> cities.clear()
>>> cities
set([])
>>>
Copy()
Creates a shallow copy, which is returned.
>>> more_cities = {"Winterthur","Schaffhausen","St. Gallen"}
>>> cities_backup = more_cities.copy()
>>> more_cities.clear()
>>> cities_backup
set(['St. Gallen', 'Winterthur', 'Schaffhausen'])
>>>

Just in case, you might think, an assignment might be enough:


>>> more_cities = {"Winterthur","Schaffhausen","St. Gallen"}
>>> cities_backup = more_cities
>>> more_cities.clear()
>>> cities_backup
set([])
>>>
The assignment "cities_backup = more_cities" just creates a pointer, i.e. another name, to
the same data structure.
difference()
This method returns the difference of two or more sets as a new set.
>>> x = {"a","b","c","d","e"}
>>> y = {"b","c"}
>>> z = {"c","d"}
>>> x.difference(y)
set(['a', 'e', 'd'])
>>> x.difference(y).difference(z)
set(['a', 'e'])
>>>
Instead of using the method difference, we can use the operator "-":
>>> x - y
set(['a', 'e', 'd'])
>>> x - y - z
set(['a', 'e'])
>>>

difference_update()
The method difference_update removes all elements of another set from this set.
x.difference_update(y) is the same as "x = x - y"
>>> x = {"a","b","c","d","e"}
>>> y = {"b","c"}
>>> x.difference_update(y)
>>>
>>> x = {"a","b","c","d","e"}
>>> y = {"b","c"}
>>> x = x - y
>>> x
set(['a', 'e', 'd'])
>>>

discard(el)
An element el will be removed from the set, if it is contained in the set. If el is not
a member of the set, nothing will be done.
>>> x = {"a","b","c","d","e"}
>>> x.discard("a")
>>> x
set(['c', 'b', 'e', 'd'])
>>> x.discard("z")
>>> x
set(['c', 'b', 'e', 'd'])
>>>
remove(el)
works like discard(), but if el is not a member of the set, a KeyError will be
raised.
>>> x = {"a","b","c","d","e"}
>>> x.remove("a")
>>> x
set(['c', 'b', 'e', 'd'])
>>> x.remove("z")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'z'
>>>

intersection(s)
Returns the intersection of the instance set and the set s as a new set. In other
words: A set with all the elements which are contained in both sets is returned.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x.intersection(y)
set(['c', 'e', 'd'])
>>>
This can be abbreviated with the ampersand operator "&":
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x.intersection(y)
set(['c', 'e', 'd'])
>>>
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x & y
set(['c', 'e', 'd'])
>>>

isdisjoint()
This method returns True if two sets have a null intersection.
issubset()
x.issubset(y) returns True, if x is a subset of y. "<=" is an abbreviation for "Subset
of" and ">=" for "superset of"
"<" is used to check if a set is a proper subset of a set.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d"}
>>> x.issubset(y)
False
>>> y.issubset(x)
True
>>> x < y
False
>>> y < x # y is a proper subset of x
True
>>> x < x # a set can never be a proper subset of oneself.
False
>>> x <= x
True
>>>

issuperset()
x.issuperset(y) returns True, if x is a superset of y. ">=" is an abbreviation for
"issuperset of"
">" is used to check if a set is a proper superset of a set.

>>> x = {"a","b","c","d","e"}
>>> y = {"c","d"}
>>> x.issuperset(y)
True
>>> x > y
True
>>> x >= y
True
>>> x >= x
True
>>> x > x
False
>>> x.issuperset(x)
True
>>>

pop()
pop() removes and returns an arbitrary set element. The method raises a KeyError
if the set is empty
>>> x = {"a","b","c","d","e"}
>>> x.pop()
'a'
>>> x.pop()
'c'
Python 3
Sets and Frozensets

Introduction
In this chapter of our tutorial, we are dealing with Python's implementation of sets.
Though sets are nowadays an integral part of modern mathematics, this has not always
been like this. The set theory had been rejected by many, even great thinkers. One of
them was the philosopher Wittgenstein. He didn't like the set theory and complained
mathematics is "ridden through and through with the pernicious idioms of set theory...".
He dismissed the set theory as "utter nonsense", as being "laughable" and "wrong". His
criticism appeared years after the death of the German mathematician Georg Cantor, the
founder of the set theory. David Hilbert defended it from its critics by famously
declaring: "No one shall expel us from the Paradise that Cantor has created.

Cantor defined a set at the beginning of his "Beiträge zur Begründung der transfiniten
Mengenlehre":
"A set is a gathering together into a whole of definite, distinct objects of our perception
and of our thought - which are called elements of the set." Nowadays, we can say in
"plain" English: A set is a well defined collection of objects.
The elements or members of a set can be anything: numbers, characters, words, names,
letters of the alphabet, even other sets, and so on. Sets are usually denoted with capital
letters. This is not the exact mathematical definition, but it is good enough for the
following.
The data type "set", which is a collection type, has been part of Python since version 2.4.
A set contains an unordered collection of unique and immutable objects. The set data type
is, as the name implies, a Python implementation of the sets as they are known from
mathematics. This explains, why sets unlike lists or tuples can't have multiple
occurrences of the same element.

Sets
If we want to create a set, we can call the built-in set function with a sequence or another
iterable object:
In the following example, a string is singularized into its characters to build the resulting
set x:
>>> x = set("A Python Tutorial")
>>> x
{'A', ' ', 'i', 'h', 'l', 'o', 'n', 'P', 'r', 'u', 't', 'a', 'y', 'T'}
>>> type(x)
<class 'set'>
>>>
We can pass a list to the built-in set function, as we can see in the following:
>>> x = set(["Perl", "Python", "Java"])
>>> x
{'Python', 'Java', 'Perl'}
>>>
Now, we want to show what happens, if we pass a tuple with reappearing elements to the
set function - in our example the city "Paris":

>>> cities = set(("Paris", "Lyon", "London","Berlin","Paris","Birmingham"))


>>> cities
{'Paris', 'Birmingham', 'Lyon', 'London', 'Berlin'}
>>>

As we have expected, no doublets are in the resulting set of cities.


Immutable Sets
Sets are implemented in a way, which doesn't allow mutable objects. The following
example demonstrates that we cannot include, for example, lists as elements:
>>> cities = set((("Python","Perl"), ("Paris", "Berlin", "London")))
>>> cities = set((["Python","Perl"], ["Paris", "Berlin", "London"]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>>

Frozensets
Though sets can't contain mutable objects, sets are mutable:

>>> cities = set(["Frankfurt", "Basel","Freiburg"])


>>> cities.add("Strasbourg")
>>> cities
{'Freiburg', 'Basel', 'Frankfurt', 'Strasbourg'}
>>>

Frozensets are like sets except that they cannot be changed, i.e. they are immutable:
>>> cities = frozenset(["Frankfurt", "Basel","Freiburg"])
>>> cities.add("Strasbourg")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'frozenset' object has no attribute 'add'
>>>

Improved notation
We can define sets (since Python2.6) without using the built-in set function. We can use
curly braces instead:

>>> adjectives = {"cheap","expensive","inexpensive","economical"}


>>> adjectives
{'inexpensive', 'cheap', 'expensive', 'economical'}
>>>
Set Operations
add(element)
A method which adds an element, which has to be immutable, to a set.
>>> colours = {"red","green"}
>>> colours.add("yellow")
>>> colours
{'green', 'yellow', 'red'}
>>> colours.add(["black","white"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>>
Of course, an element will only be added, if it is not already contained in the set. If it is
already contained, the method call has no effect.
clear()
All elements will be removed from a set.
>>> cities = {"Stuttgart", "Konstanz", "Freiburg"}
>>> cities.clear()
>>> cities
set()
>>>
copy
Creates a shallow copy, which is returned.
>>> more_cities = {"Winterthur","Schaffhausen","St. Gallen"}
>>> cities_backup = more_cities.copy()
>>> more_cities.clear()
>>> cities_backup
{'St. Gallen', 'Winterthur', 'Schaffhausen'}
>>>
Just in case, you might think, an assignment might be enough:
>>> more_cities = {"Winterthur","Schaffhausen","St. Gallen"}
>>> cities_backup = more_cities
>>> more_cities.clear()
>>> cities_backup
set()
>>>
The assignment "cities_backup = more_cities" just creates a pointer, i.e. another name, to
the same data structure.
difference()
This method returns the difference of two or more sets as a new set.
>>> x = {"a","b","c","d","e"}
>>> y = {"b","c"}
>>> z = {"c","d"}
>>> x.difference(y)
{'a', 'e', 'd'}
>>> x.difference(y).difference(z)
{'a', 'e'}
>>>
Instead of using the method difference, we can use the operator "-":
>>> x - y
{'a', 'e', 'd'}
>>> x - y - z
{'a', 'e'}
>>>

difference_update()
The method difference_update removes all elements of another set from this set.
x.difference_update(y) is the same as "x = x - y"
>>> x = {"a","b","c","d","e"}
>>> y = {"b","c"}
>>> x.difference_update(y)
>>>
>>> x = {"a","b","c","d","e"}
>>> y = {"b","c"}
>>> x = x - y
>>> x
{'a', 'e', 'd'}
>>>

discard(el)
An element el will be removed from the set, if it is contained in the set. If el is not a
member of the set, nothing will be done.
>>> x = {"a","b","c","d","e"}
>>> x.discard("a")
>>> x
{'c', 'b', 'e', 'd'}
>>> x.discard("z")
>>> x
{'c', 'b', 'e', 'd'}
>>>

remove(el)
works like discard(), but if el is not a member of the set, a KeyError will be raised.
>>> x = {"a","b","c","d","e"}
>>> x.remove("a")
>>> x
{'c', 'b', 'e', 'd'}
>>> x.remove("z")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'z'
>>>
union(s)
This method returns the union of two sets as a new set, i.e. all elements that are in either
set.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x.union(y)
{'d', 'a', 'g', 'c', 'f', 'b', 'e'}
This can be abbreviated with the pipe operator "|":
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x | y
{'d', 'a', 'g', 'c', 'f', 'b', 'e'}
>>>
intersection(s)
Returns the intersection of the instance set and the set s as a new set. In other words: A set
with all the elements which are contained in both sets is returned.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x.intersection(y)
{'c', 'e', 'd'}
>>>
This can be abbreviated with the ampersand operator "&":
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x & y
{'c', 'e', 'd'}
>>>
isdisjoint()
This method returns True if two sets have a null intersection.
>>> x = {"a","b","c"}
>>> y = {"c","d","e"}
>>> x.isdisjoint(y)
False
>>>
>>> x = {"a","b","c"}
>>> y = {"d","e","f"}
>>> x.isdisjoint(y)
True
>>>
issubset()
x.issubset(y) returns True, if x is a subset of y. "<=" is an abbreviation for "Subset of" and
">=" for "superset of"
"<" is used to check if a set is a proper subset of a set.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d"}
>>> x.issubset(y)
False
>>> y.issubset(x)
True
>>> x < y
False
>>> y < x # y is a proper subset of x
True
>>> x < x # a set can never be a proper subset of oneself.
False
>>> x <= x
True
>>>

issuperset()
x.issuperset(y) returns True, if x is a superset of y. ">=" is an abbreviation for "issuperset
of"
">" is used to check if a set is a proper superset of a set.

>>> x = {"a","b","c","d","e"}
>>> y = {"c","d"}
>>> x.issuperset(y)
True
>>> x > y
True
>>> x >= y
True
>>> x >= x
True
>>> x > x
False
>>> x.issuperset(x)
True
>>>
pop()
pop() removes and returns an arbitrary set element. The method raises a KeyError
if the set is empty
>>> x = {"a","b","c","d","e"}
>>> x.pop()
'a'
>>> x.pop()
'c'
Python and the Best Novel

This chapter deals with natural languages and literature. It will be also an extensive
example and use case for Python sets. Novices in Python often think that sets are
just a toy for mathematicians and that there is no real use case in programming. The
contrary is true. There are multiple use cases for sets. They are used for example to
get rid of doublets - multiple occurrences of elements - in a list, i.e. to make a list
unique.
In the following example we will use sets to determine the different words occurring
in a novel. Our use case is build around a novel which has been praised by many as
the best novel in the English language and also as the hardest to read. We are
talking about the novel "Ulysses" by James Joyce. We will not talk about or
examine the beauty of the language or the language style. We will study the novel by
having a close look at the words used in the novel. Our approach will be purely
statitically. The claim is that James Joyce used in his novel more words than any
other author. Actually his vocabulary is above and beyond all other authors, maybe
even Shakespeare.
Besides Ulysses we will use the novels "Sons and Lovers" by D.H. Lawrence, "The
Way of All Flesh" by Samuel Butler, "Robinson Crusoe" by Daniel Defoe, "To the
Lighthouse" by Virginia Woolf, "Moby Dick" by Herman Melville and the Short
Story"Metamorphosis" by Franz Kafka.
Before you continue with this chapter of our tutorial it might be a good idea to read
the chapter Sets and Frozen Sets and the two chapter on regular
expressions and advanced regular expressions.
Different Words of a Text
To cut out all the words of the novel "Ulysses" we can use the function findall from
the module "re":
import re

# we don't care about case sensitivity and therefore use lower:


ulysses_txt = open("books/james_joyce_ulysses.txt").read().lower()
words = re.findall(r"\b[\w-]+\b", ulysses_txt)
print("The novel ulysses contains " + str(len(words)))
The novel ulysses contains 272452
This number is the sum of all the words and many words occur multiple time:
for word in ["the", "while", "good", "bad", "ireland", "irish"]:
print("The word '" + word + "' occurs " + \
str(words.count(word)) + " times in the novel!" )
The word 'the' occurs 15112 times in the novel!
The word 'while' occurs 123 times in the novel!
The word 'good' occurs 321 times in the novel!
The word 'bad' occurs 90 times in the novel!
The word 'ireland' occurs 90 times in the novel!
The word 'irish' occurs 117 times in the novel!
272452 surely is a huge number of words for a novel, but on the other hand there are
lots of novels with even more words. More interesting and saying more about the
quality of a novel is the number of different words. This is the moment where we
will finally need "set". We will turn the list of words "words" into a set. Applying
"len" to this set will give us the number of different words:
diff_words = set(words)
print("'Ulysses' contains " + str(len(diff_words)) + " different words!")
'Ulysses' contains 29422 different words!
This is indeed an impressive number. You can see this, if you look at the other novels
in our folder books:
novels = ['sons_and_lovers_lawrence.txt',
'metamorphosis_kafka.txt',
'the_way_of_all_flash_butler.txt',
'robinson_crusoe_defoe.txt',
'to_the_lighthouse_woolf.txt',
'james_joyce_ulysses.txt',
'moby_dick_melville.txt']
for novel in novels:
txt = open("books/" + novel).read().lower()
words = re.findall(r"\b[\w-]+\b", txt)
diff_words = set(words)
n = len(diff_words)
print("{name:38s}: {n:5d}".format(name=novel[:-4], n=n))
sons_and_lovers_lawrence : 10822
metamorphosis_kafka : 3027
the_way_of_all_flash_butler : 11434
robinson_crusoe_defoe : 6595
to_the_lighthouse_woolf : 11415
james_joyce_ulysses : 29422
moby_dick_melville : 18922
Special Words in Ulysses
We will subtract all the words occurring in the other novels from "Ulysses" in the
following little Python program. It is amazing how many words are used by James
Joyce and by none of the other authors:
words_in_novel = {}
for novel in novels:
txt = open("books/" + novel).read().lower()
words = re.findall(r"\b[\w-]+\b", txt)
words_in_novel[novel] = words

words_only_in_ulysses = set(words_in_novel['james_joyce_ulysses.txt'])
novels.remove('james_joyce_ulysses.txt')
for novel in novels:
words_only_in_ulysses -= set(words_in_novel[novel])

with open("books/words_only_in_ulysses.txt", "w") as fh:


txt = " ".join(words_only_in_ulysses)
fh.write(txt)
print(len(words_only_in_ulysses))

15314
By the way, Dr. Seuss wrote a book with only 50 different words: Green Eggs and
Ham
The file with the words only occurring in Ulysses contains strange or seldom used
words like:
huntingcrop tramtrack pappin kithogue pennyweight undergarments scission
nagyaságos wheedling begad dogwhip hawthornden turnbull calumet covey
repudiated pendennis waistcoatpocket nostrum
Common Words
It is also possible to find the words which occur in every book. To accomplish this,
we need the set intersection:
# we start with the words in ulysses
common_words = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
common_words &= set(words_in_novel[novel])

print(len(common_words))
1745
Doing it Right
We made a slight mistake in the previous calculations. If you look at the texts, you
will notice that they have a header and footer part added by Project Gutenberg,
which doesn't belong to the texts. The texts are positioned between the lines:
***START OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL
FLESH***
and
***END OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL
FLESH***
or
*** START OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***
and
*** END OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***
The function read_text takes care of this:
def read_text(fname):
beg_e = re.compile(r"\*\*\* ?start of (this|the) project gutenberg
ebook[^*]*\*\*\*")
end_e = re.compile(r"\*\*\* ?end of (this|the) project gutenberg
ebook[^*]*\*\*\*")
txt = open("books/" + fname).read().lower()
beg = beg_e.search(txt).end()
end = end_e.search(txt).start()
return txt[beg:end]
words_in_novel = {}
for novel in novels + ['james_joyce_ulysses.txt']:
txt = read_text(novel)
words = re.findall(r"\b[\w-]+\b", txt)
words_in_novel[novel] = words
words_in_ulysses = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
words_in_ulysses -= set(words_in_novel[novel])

with open("books/words_in_ulysses.txt", "w") as fh:


txt = " ".join(words_in_ulysses)
fh.write(txt)

print(len(words_in_ulysses))
15341
# we start with the words in ulysses
common_words = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
common_words &= set(words_in_novel[novel])

print(len(common_words))
1279
The words of the set "common_words" are words belong to the most frequently
used words of the English language. Let's have a look at 30 arbitrary words of this
set:
counter = 0
for word in common_words:
print(word, end=", ")
counter += 1
if counter == 30:
break
ancient, broke, breathing, laugh, divided, forced, wealth, ring, outside, throw,
person, spend, better, errand, school, sought, knock, tell, inner, run, packed,
another, since, touched, bearing, repeated, bitter, experienced, often, one,

You might also like