Introduction
The philosopher Wittgenstein disliked set theory and complained that mathematics
is "ridden through and through with the pernicious idioms of set theory." He dismissed
set theory as "utter nonsense", as being "laughable" and "wrong". His criticism
appeared years after the death of the German mathematician Georg Cantor, the founder of
set theory.
Cantor defined a set at the beginning of his "Beiträge zur Begründung der transfiniten
Mengenlehre":
"A set is a gathering together into a whole of definite, distinct objects of our perception
and of our thought - which are called elements of the set." Nowadays, we can say in
"plain" English: A set is a well-defined collection of objects.
The elements or members of a set can be anything: numbers, characters, words, names,
letters of the alphabet, even other sets, and so on. Sets are usually denoted with capital
letters. This is not the exact mathematical definition, but it is good enough for the
following.
Sets in Python
The data type "set", which is a collection type, has been part of Python since
version 2.4. A set contains an unordered collection of unique and immutable objects. The
set data type is, as the name implies, a Python implementation of sets as they are
known from mathematics. This explains why sets, unlike lists or tuples, can't have
multiple occurrences of the same element.
Creating Sets
If we want to create a set, we can call the built-in set function with a sequence or
another iterable object.
In the following example, a string is split into its characters to build the resulting
set x:
We can pass a list to the built-in set function, as we can see in the following:
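Both cases can be sketched as follows. This is a minimal illustration written in Python 3 syntax; the sample string and the list values are assumptions chosen for the example, and the printed element order may differ from run to run because sets are unordered.

```python
# Build a set from a string: every distinct character becomes an element.
letters = set("Mississippi")
print(letters)

# Build a set from a list; the duplicate "Python" is stored only once.
languages = set(["Perl", "Python", "Java", "Python"])
print(languages)
print(len(languages))  # 3
```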
Immutable Sets
Sets are implemented in a way that doesn't allow mutable objects as elements. The
following example demonstrates that we cannot include, for example, lists as elements:
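A minimal sketch of this restriction, in Python 3 syntax with made-up sample values: tuples are accepted as elements, while lists are rejected with a TypeError. The exact wording of the error message depends on the Python version, so the code only reports it.

```python
# Tuples are immutable (and hashable), so they can be elements of a set.
cities = {("Python", "Perl"), ("Paris", "Berlin", "London")}
print(len(cities))  # 2

# Lists are mutable and therefore unhashable; Python refuses to store them.
error_message = None
try:
    cities = {["Python", "Perl"], ["Paris", "Berlin", "London"]}
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```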
Frozensets
Though sets can't contain mutable objects, sets are mutable:
Frozensets are like sets except that they cannot be changed, i.e. they are immutable:
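The contrast between the two types can be sketched like this (Python 3 syntax, city names chosen for illustration). The last part relies on the fact that a frozenset is hashable, so, unlike a regular set, it can itself be an element of another set.

```python
# A regular set can be changed after it has been created.
cities = {"Frankfurt", "Basel", "Freiburg"}
cities.add("Strasbourg")
print(len(cities))  # 4

# A frozenset offers no add() method, so the attempt fails.
frozen_cities = frozenset(["Frankfurt", "Basel", "Freiburg"])
try:
    frozen_cities.add("Strasbourg")
except AttributeError as err:
    print("AttributeError:", err)
print(len(frozen_cities))  # 3

# Because a frozenset is immutable, it may be stored inside another set.
regions = {frozen_cities, frozenset(["Zurich", "Bern"])}
print(len(regions))  # 2
```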
Simplified Notation
We can define sets (since Python 2.7) without using the built-in set function. We
can use curly braces instead:
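A short sketch of the brace notation, in Python 3 syntax with illustrative values, showing that it produces the same kind of object as a call to set:

```python
# Set literal written with curly braces
adjectives = {"cheap", "expensive", "inexpensive", "economical"}
print(type(adjectives))  # <class 'set'>

# The same set built with the set function
same_adjectives = set(["cheap", "expensive", "inexpensive", "economical"])
print(adjectives == same_adjectives)  # True
```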
Set Operations
add(element)
A method which adds an element, which has to be immutable, to a set.
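A minimal sketch of add in Python 3 syntax (the colour names are illustrative). Adding an element that is already present leaves the set unchanged, and a mutable argument such as a list is rejected:

```python
colours = {"red", "green"}
colours.add("yellow")
colours.add("red")       # already contained, so nothing changes
print(len(colours))      # 3

# An immutable tuple can be added as a single element ...
colours.add(("light", "blue"))
print(len(colours))      # 4

# ... whereas a mutable list raises a TypeError.
try:
    colours.add(["black", "white"])
except TypeError:
    print("lists cannot be added to a set")
```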
clear()
All elements will be removed from a set.
>>> cities = {"Stuttgart", "Konstanz", "Freiburg"}
>>> cities.clear()
>>> cities
set([])
>>>
copy()
Creates a shallow copy, which is returned.
>>> more_cities = {"Winterthur","Schaffhausen","St. Gallen"}
>>> cities_backup = more_cities.copy()
>>> more_cities.clear()
>>> cities_backup
set(['St. Gallen', 'Winterthur', 'Schaffhausen'])
>>>
difference_update()
The method difference_update removes all elements of another set from this set.
x.difference_update(y) yields the same result as "x = x - y", except that it changes
x in place instead of creating a new set.
>>> x = {"a","b","c","d","e"}
>>> y = {"b","c"}
>>> x.difference_update(y)
>>>
>>> x = {"a","b","c","d","e"}
>>> y = {"b","c"}
>>> x = x - y
>>> x
set(['a', 'e', 'd'])
>>>
discard(el)
An element el will be removed from the set, if it is contained in the set. If el is not
a member of the set, nothing will be done.
>>> x = {"a","b","c","d","e"}
>>> x.discard("a")
>>> x
set(['c', 'b', 'e', 'd'])
>>> x.discard("z")
>>> x
set(['c', 'b', 'e', 'd'])
>>>
remove(el)
Works like discard(), but if el is not a member of the set, a KeyError will be
raised.
>>> x = {"a","b","c","d","e"}
>>> x.remove("a")
>>> x
set(['c', 'b', 'e', 'd'])
>>> x.remove("z")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'z'
>>>
intersection(s)
Returns the intersection of the instance set and the set s as a new set. In other
words: A set with all the elements which are contained in both sets is returned.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x.intersection(y)
set(['c', 'e', 'd'])
>>>
This can be abbreviated with the ampersand operator "&":
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x & y
set(['c', 'e', 'd'])
>>>
isdisjoint()
This method returns True if two sets have an empty intersection, i.e. no element in common.
issubset()
x.issubset(y) returns True if x is a subset of y. "<=" is an abbreviation for "subset
of" and ">=" for "superset of".
"<" is used to check if a set is a proper subset of another set.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d"}
>>> x.issubset(y)
False
>>> y.issubset(x)
True
>>> x < y
False
>>> y < x # y is a proper subset of x
True
>>> x < x # a set can never be a proper subset of itself.
False
>>> x <= x
True
>>>
issuperset()
x.issuperset(y) returns True if x is a superset of y. ">=" is an abbreviation for
"superset of".
">" is used to check if a set is a proper superset of another set.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d"}
>>> x.issuperset(y)
True
>>> x > y
True
>>> x >= y
True
>>> x >= x
True
>>> x > x
False
>>> x.issuperset(x)
True
>>>
pop()
pop() removes and returns an arbitrary set element. The method raises a KeyError
if the set is empty. Which element is returned is not predictable, so the output of
the following session may differ.
>>> x = {"a","b","c","d","e"}
>>> x.pop()
'a'
>>> x.pop()
'c'
Python 3
Sets and Frozensets
Introduction
In this chapter of our tutorial, we are dealing with Python's implementation of sets.
Though sets are nowadays an integral part of modern mathematics, this has not always
been the case. Set theory was rejected by many, even by great thinkers. One of
them was the philosopher Wittgenstein. He disliked set theory and complained that
mathematics is "ridden through and through with the pernicious idioms of set theory...".
He dismissed set theory as "utter nonsense", as being "laughable" and "wrong". His
criticism appeared years after the death of the German mathematician Georg Cantor, the
founder of set theory. David Hilbert defended it from its critics by famously
declaring: "No one shall expel us from the Paradise that Cantor has created."
Cantor defined a set at the beginning of his "Beiträge zur Begründung der transfiniten
Mengenlehre":
"A set is a gathering together into a whole of definite, distinct objects of our perception
and of our thought - which are called elements of the set." Nowadays, we can say in
"plain" English: A set is a well-defined collection of objects.
The elements or members of a set can be anything: numbers, characters, words, names,
letters of the alphabet, even other sets, and so on. Sets are usually denoted with capital
letters. This is not the exact mathematical definition, but it is good enough for the
following.
The data type "set", which is a collection type, has been part of Python since version 2.4.
A set contains an unordered collection of unique and immutable objects. The set data type
is, as the name implies, a Python implementation of sets as they are known from
mathematics. This explains why sets, unlike lists or tuples, can't have multiple
occurrences of the same element.
Sets
If we want to create a set, we can call the built-in set function with a sequence or another
iterable object.
In the following example, a string is split into its characters to build the resulting
set x:
>>> x = set("A Python Tutorial")
>>> x
{'A', ' ', 'i', 'h', 'l', 'o', 'n', 'P', 'r', 'u', 't', 'a', 'y', 'T'}
>>> type(x)
<class 'set'>
>>>
We can pass a list to the built-in set function, as we can see in the following:
>>> x = set(["Perl", "Python", "Java"])
>>> x
{'Python', 'Java', 'Perl'}
>>>
Now we want to show what happens if we pass a tuple with repeated elements to the
set function - in our example the city "Paris":
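A minimal sketch of this case; apart from "Paris", the city names are assumptions made for the illustration. The repeated entry is stored only once, and the printed order of the elements may vary:

```python
# "Paris" appears twice in the tuple but only once in the resulting set.
cities = set(("Paris", "Lyon", "London", "Berlin", "Paris", "Birmingham"))
print(cities)
print(len(cities))  # 5
```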
Frozensets
Though sets can't contain mutable objects, sets are mutable:
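As a small illustration (with assumed city names), a set can be modified in place: elements are added and removed while the set remains the same object.

```python
cities = {"Frankfurt", "Basel", "Freiburg"}
identity_before = id(cities)

cities.add("Strasbourg")   # insert a new element
cities.discard("Basel")    # remove an existing element

print(sorted(cities))                 # ['Frankfurt', 'Freiburg', 'Strasbourg']
print(id(cities) == identity_before)  # True: still the same object
```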
Frozensets are like sets except that they cannot be changed, i.e. they are immutable:
>>> cities = frozenset(["Frankfurt", "Basel","Freiburg"])
>>> cities.add("Strasbourg")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'frozenset' object has no attribute 'add'
>>>
Improved notation
We can define sets (since Python 2.7) without using the built-in set function. We can use
curly braces instead:
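A brief sketch of the brace notation with illustrative values. One pitfall is worth keeping in mind: an empty pair of braces creates a dictionary, not a set, so an empty set still has to be created with set().

```python
vowels = {"a", "e", "i", "o", "u"}
print(type(vowels))        # <class 'set'>

# Empty braces denote an empty dictionary ...
empty_braces = {}
print(type(empty_braces))  # <class 'dict'>

# ... so the empty set is written with the set function.
empty_set = set()
print(type(empty_set))     # <class 'set'>
```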
difference_update()
The method difference_update removes all elements of another set from this set.
x.difference_update(y) yields the same result as "x = x - y", except that it changes x
in place instead of creating a new set.
>>> x = {"a","b","c","d","e"}
>>> y = {"b","c"}
>>> x.difference_update(y)
>>>
>>> x = {"a","b","c","d","e"}
>>> y = {"b","c"}
>>> x = x - y
>>> x
{'a', 'e', 'd'}
>>>
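The two variants give the same elements but differ in how they treat the original object. The following sketch makes this visible with a second name, here called alias (a name introduced only for this illustration), bound to the same set:

```python
y = {"b", "c"}

# difference_update changes the existing set object in place.
x = {"a", "b", "c", "d", "e"}
alias = x
x.difference_update(y)
print(alias == {"a", "d", "e"})  # True: alias sees the change

# "x = x - y" builds a new set and rebinds the name x to it.
x = {"a", "b", "c", "d", "e"}
alias = x
x = x - y
print(x == {"a", "d", "e"})                  # True
print(alias == {"a", "b", "c", "d", "e"})    # True: the original is untouched
```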
discard(el)
An element el will be removed from the set, if it is contained in the set. If el is not a
member of the set, nothing will be done.
>>> x = {"a","b","c","d","e"}
>>> x.discard("a")
>>> x
{'c', 'b', 'e', 'd'}
>>> x.discard("z")
>>> x
{'c', 'b', 'e', 'd'}
>>>
remove(el)
Works like discard(), but if el is not a member of the set, a KeyError will be raised.
>>> x = {"a","b","c","d","e"}
>>> x.remove("a")
>>> x
{'c', 'b', 'e', 'd'}
>>> x.remove("z")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'z'
>>>
union(s)
This method returns the union of two sets as a new set, i.e. all elements that are in either
set.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x.union(y)
{'d', 'a', 'g', 'c', 'f', 'b', 'e'}
This can be abbreviated with the pipe operator "|":
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x | y
{'d', 'a', 'g', 'c', 'f', 'b', 'e'}
>>>
intersection(s)
Returns the intersection of the instance set and the set s as a new set. In other words: A set
with all the elements which are contained in both sets is returned.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x.intersection(y)
{'c', 'e', 'd'}
>>>
This can be abbreviated with the ampersand operator "&":
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d","e","f","g"}
>>> x & y
{'c', 'e', 'd'}
>>>
isdisjoint()
This method returns True if two sets have an empty intersection, i.e. no element in common.
>>> x = {"a","b","c"}
>>> y = {"c","d","e"}
>>> x.isdisjoint(y)
False
>>>
>>> x = {"a","b","c"}
>>> y = {"d","e","f"}
>>> x.isdisjoint(y)
True
>>>
issubset()
x.issubset(y) returns True if x is a subset of y. "<=" is an abbreviation for "subset of" and
">=" for "superset of".
"<" is used to check if a set is a proper subset of another set.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d"}
>>> x.issubset(y)
False
>>> y.issubset(x)
True
>>> x < y
False
>>> y < x # y is a proper subset of x
True
>>> x < x # a set can never be a proper subset of itself.
False
>>> x <= x
True
>>>
issuperset()
x.issuperset(y) returns True if x is a superset of y. ">=" is an abbreviation for "superset
of".
">" is used to check if a set is a proper superset of another set.
>>> x = {"a","b","c","d","e"}
>>> y = {"c","d"}
>>> x.issuperset(y)
True
>>> x > y
True
>>> x >= y
True
>>> x >= x
True
>>> x > x
False
>>> x.issuperset(x)
True
>>>
pop()
pop() removes and returns an arbitrary set element. The method raises a KeyError
if the set is empty. Which element is returned is not predictable, so the output of
the following session may differ.
>>> x = {"a","b","c","d","e"}
>>> x.pop()
'a'
>>> x.pop()
'c'
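The behaviour on an empty set can be sketched as follows. The loop empties the set with repeated calls to pop(); since the order in which elements are returned is arbitrary, the result is sorted before printing:

```python
x = {"a", "b", "c"}
removed = []

# Keep popping until the set is empty.
while x:
    removed.append(x.pop())
print(sorted(removed))  # ['a', 'b', 'c']

# A further call on the now empty set raises a KeyError.
try:
    x.pop()
except KeyError:
    print("pop() on an empty set raises a KeyError")
```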
Python and the Best Novel
This chapter deals with natural languages and literature. It will also be an extensive
example and use case for Python sets. Novices in Python often think that sets are
just a toy for mathematicians and that there is no real use case in programming. The
contrary is true. There are multiple use cases for sets. They are used, for example, to
get rid of duplicates - multiple occurrences of elements - in a list, i.e. to make the
elements of a list unique.
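Removing duplicates from a list can be sketched like this (the names in the list are made up). Converting to a set and back discards the original order; if the order matters, dict.fromkeys is one common alternative, since dictionaries keep insertion order in Python 3.7 and later.

```python
names = ["Anna", "Ben", "Anna", "Carl", "Ben"]

# A set keeps each element once, but the original order is lost.
unique_names = list(set(names))
print(sorted(unique_names))  # ['Anna', 'Ben', 'Carl']

# Order-preserving variant: dictionary keys are unique and keep insertion order.
ordered_unique = list(dict.fromkeys(names))
print(ordered_unique)        # ['Anna', 'Ben', 'Carl']
```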
In the following example we will use sets to determine the different words occurring
in a novel. Our use case is built around a novel which has been praised by many as
the best novel in the English language and also as the hardest to read. We are
talking about the novel "Ulysses" by James Joyce. We will not talk about or
examine the beauty of the language or the language style. We will study the novel by
having a close look at the words used in the novel. Our approach will be purely
statistical. The claim is that James Joyce used more different words in his novel than any
other author. His vocabulary is said to go beyond that of all other authors, maybe
even Shakespeare.
Besides "Ulysses" we will use the novels "Sons and Lovers" by D.H. Lawrence, "The
Way of All Flesh" by Samuel Butler, "Robinson Crusoe" by Daniel Defoe, "To the
Lighthouse" by Virginia Woolf, "Moby Dick" by Herman Melville and the short
story "Metamorphosis" by Franz Kafka.
Before you continue with this chapter of our tutorial it might be a good idea to read
the chapter Sets and Frozen Sets and the two chapters on regular
expressions and advanced regular expressions.
Different Words of a Text
To cut out all the words of the novel "Ulysses" we can use the function findall from
the module "re":
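import re
# "novels" (a list of file names) and "words_in_novel" (a dictionary mapping each
# file name to the list of its words) are assumed to have been built beforehand
words_only_in_ulysses = set(words_in_novel['james_joyce_ulysses.txt'])
novels.remove('james_joyce_ulysses.txt')
for novel in novels:
    # remove every word which also occurs in one of the other novels
    words_only_in_ulysses -= set(words_in_novel[novel])
print(len(words_only_in_ulysses))
15314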
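The same technique can be tried out without the book files. The following sketch is only an illustration under stated assumptions: the three short texts and their names are invented, and the word pattern is the one used with findall later in this chapter.

```python
import re

# Hypothetical miniature "novels", used instead of the real book files
texts = {
    "first.txt": "The whale swam past the ship; the ship-owner watched.",
    "second.txt": "The ship sailed past the island.",
    "third.txt": "A whale is not a fish, the sailor said.",
}

# Lowercase each text and cut out its words (hyphenated words stay together).
words_in_text = {
    name: re.findall(r"\b[\w-]+\b", txt.lower()) for name, txt in texts.items()
}

# Start with the words of the first text and subtract those of all the others.
words_only_in_first = set(words_in_text["first.txt"])
for name in texts:
    if name != "first.txt":
        words_only_in_first -= set(words_in_text[name])

print(sorted(words_only_in_first))  # ['ship-owner', 'swam', 'watched']
```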
By the way, Dr. Seuss wrote a book with only 50 different words: "Green Eggs and
Ham".
The file with the words only occurring in Ulysses contains strange or seldom used
words like:
huntingcrop tramtrack pappin kithogue pennyweight undergarments scission
nagyaságos wheedling begad dogwhip hawthornden turnbull calumet covey
repudiated pendennis waistcoatpocket nostrum
Common Words
It is also possible to find the words which occur in every book. To accomplish this,
we need the set intersection:
# we start with the words in ulysses
common_words = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    common_words &= set(words_in_novel[novel])
print(len(common_words))
1745
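The repeated intersection can be illustrated on a few small word sets, which are invented for this sketch. It also shows that set.intersection accepts several sets at once, so the loop can be replaced by a single call:

```python
# Hypothetical word sets of three texts
word_sets = [
    {"the", "whale", "swam", "past", "ship"},
    {"the", "ship", "sailed", "past", "island"},
    {"the", "past", "is", "a", "ship"},
]

# Step by step, as in the loop above: keep only words found in every set.
common = set(word_sets[0])
for words in word_sets[1:]:
    common &= words
print(sorted(common))  # ['past', 'ship', 'the']

# The same result with one call that receives all sets as arguments.
common_one_call = set.intersection(*word_sets)
print(common == common_one_call)  # True
```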
Doing it Right
We made a slight mistake in the previous calculations. If you look at the texts, you
will notice that they have a header and footer part added by Project Gutenberg,
which don't belong to the texts. The texts are positioned between the lines:
***START OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL FLESH***
and
***END OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL FLESH***
or
*** START OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***
and
*** END OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***
The function read_text takes care of this:
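def read_text(fname):
    # the markers are written in lower case because the text is lowercased below
    beg_e = re.compile(r"\*\*\* ?start of (this|the) project gutenberg ebook[^*]*\*\*\*")
    end_e = re.compile(r"\*\*\* ?end of (this|the) project gutenberg ebook[^*]*\*\*\*")
    # read the whole file and make sure it is closed again
    with open("books/" + fname) as fh:
        txt = fh.read().lower()
    # keep only the part between the end of the start marker and the
    # beginning of the end marker
    beg = beg_e.search(txt).end()
    end = end_e.search(txt).start()
    return txt[beg:end]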
words_in_novel = {}
for novel in novels + ['james_joyce_ulysses.txt']:
    txt = read_text(novel)
    words = re.findall(r"\b[\w-]+\b", txt)
    words_in_novel[novel] = words
words_in_ulysses = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    words_in_ulysses -= set(words_in_novel[novel])
print(len(words_in_ulysses))
15341
# we start with the words in ulysses
common_words = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    common_words &= set(words_in_novel[novel])
print(len(common_words))
1279
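The effect of cutting off header and footer can be checked on a small in-memory string instead of a file. This is a sketch under assumptions: the function name strip_gutenberg_frame, the sample text and the explicit error for missing markers are additions made for this illustration, while the two patterns are the ones used in read_text.

```python
import re

START = re.compile(r"\*\*\* ?start of (this|the) project gutenberg ebook[^*]*\*\*\*")
END = re.compile(r"\*\*\* ?end of (this|the) project gutenberg ebook[^*]*\*\*\*")


def strip_gutenberg_frame(raw_text):
    """Return the lowercased text between the start and the end marker."""
    txt = raw_text.lower()
    beg = START.search(txt)
    end = END.search(txt)
    if beg is None or end is None:
        # hypothetical safeguard: report a missing marker explicitly
        raise ValueError("start or end marker not found")
    return txt[beg.end():end.start()]


# Invented sample with a header line, one line of text and a footer line
sample = (
    "Produced by volunteers.\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***\n"
    "Stately, plump Buck Mulligan came from the stairhead.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***\n"
    "License text follows.\n"
)

body = strip_gutenberg_frame(sample)
print(body.strip())  # stately, plump buck mulligan came from the stairhead.
```

Without the check for a missing marker, search would return None for a file lacking the markers and the subsequent method call would fail with an AttributeError.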
The words of the set "common_words" belong to the most frequently
used words of the English language. Let's have a look at 30 arbitrary words of this
set:
counter = 0
for word in common_words:
    print(word, end=", ")
    counter += 1
    if counter == 30:
        break
ancient, broke, breathing, laugh, divided, forced, wealth, ring, outside, throw,
person, spend, better, errand, school, sought, knock, tell, inner, run, packed,
another, since, touched, bearing, repeated, bitter, experienced, often, one,