
ST115 Managing and Visualising Data

Lecture 5: Introduction to databases

2022/23 Lent Term


Today's plan
Databases
Relational databases
Querying data using Structured Query Language (SQL)
Database normalisation
Note
In today's lecture we only provide an introduction to databases.
LSE has a separate course, ST207 Databases, dedicated to the topic.
So far...
Managing Data:
How different types of data are represented in computer
Numerical calculation via numpy
Data wrangling via pandas
Visualising Data:
EDA via matplotlib and seaborn
So far...
Small tabular datasets
Size: KB to MB
Current data storage: a file or a collection of files
text files
csv files
Possible issues of storing data in a csv / text file
Not reliable - may lose the data as a result of disk failure
Not designed for large amounts of data
Difficult to manage access rights (e.g. read only)
Cannot prevent wrong input in the data
Not designed for concurrent operations on data
Multiple reads and updates
In organisations, data are often stored in databases.
Database
A database is an organised collection of related data, typically stored
electronically in a computer system
Designed and built to hold certain kinds of data
Usually controlled by a database management system (DBMS)
Database-management system (DBMS)
A database management system (DBMS) is a software system designed to
store, manage and retrieve data in databases and to control access to them.

So a DBMS and a database are different things, but in practice the term
"database" is often used casually to refer to both a database and the DBMS
used to manipulate it.
Advantages of DBMS (over storing data in plain files)
Volume: Designed for storing and processing large amounts of data
Security: Allow you to authorise and control who can access and/or update
the data
Reliability: Backup and recovery
Integrity: For example, DBMS can enforce constraints on the data and
prevent data anomalies
Concurrency: Multiple users can access and modify the data at the same
time in a controlled way
Unfortunately, due to time constraints, we will only be able to demonstrate
integrity in this course.
Why are we learning about databases in this course?
Databases are an important part of data management and data analysis.
Knowing about databases allows you to:
Store data collected for future use
More secure and reliable
Safe concurrent access
Quick data retrieval and merging data from different data sources
Access large amounts of (internal) data
Large organisations often have their data stored in a database to
support daily operations
By knowing how to query a database, you can retrieve data
from it
In today's lecture you will learn how to do both on relational databases.
Relational databases
A relational database is a collection of data with pre-defined relationships
between them.
Data are organised as a set of tables with columns and rows
Each table is used for a different type of entity
Each table contains a fixed number of columns containing the
attributes of the entity
There are many other types of databases; we will give an overview of them in
the next lecture.
Table example: course
code name quota
ST101 Programming for Data Science 90
ST115 Managing and Visualising Data 60
MA214 Algorithms and Data Structures NULL
ST207 Databases NULL
ST310 Machine Learning 60
ST311 Artificial Intelligence 30
ST445 Managing and Visualising Data 60
Examples of relational DBMS (RDBMS)
There are many RDBMS available. Here are some examples:
Oracle Database
MySQL
PostgreSQL
SQLite
In this course, we will use SQLite as it is easy to set up. See the
DB-Engines Ranking for a ranking of DBMS by popularity.
Structured Query Language (SQL)
SQL is a computer language used to communicate with an RDBMS.
Pronounced as S-Q-L or "sequel"
It is used to store, manipulate and query data and to control access to
databases
For students with some programming background:
SQL is a special-purpose, declarative programming language
(Python is a general-purpose language which supports multiple
programming paradigms)
Note: SQL syntax and set of functionality implemented by different RDBMS can
be slightly different. We will use SQLite in this course.
RDBMS terminology
Each table is called a relation
Each row of a relation is called a record or tuple
Rows do not have names
Each column of a relation is called an attribute or field
So a relational database is a set of relations, with each relation consisting of
records and attributes.
Attributes
Attributes have:
Names (e.g. name , code , quota )
Data types (e.g. INTEGER, TEXT)
Attributes may also have constraints (e.g. must be non-negative)
This helps to avoid wrong input
Attributes may be marked as primary or foreign keys to identify each record
and show how data in different tables are linked
Data types
Data types available in SQLite:
Type Description
NULL The value is a NULL value
INTEGER Signed integer, stored in 0, 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value
REAL The value is a floating point value, stored in 8 bytes
TEXT The value is a text string, stored using the database encoding (UTF-8, UTF-16)
BLOB Binary Large Object, can store a large chunk of data, document types and even media files like video
Note other RDBMS may have different sets of data types.
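As a small illustration of the data types above, SQLite's built-in typeof() function reports the type of a value. The sketch below uses Python's standard sqlite3 module with a throwaway in-memory database; the literal values are arbitrary examples, not data from the course.

```python
import sqlite3

# Throwaway in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# typeof() returns the type of each value as a lowercase string.
row = conn.execute(
    "SELECT typeof(NULL), typeof(42), typeof(3.14), typeof('ST115'), typeof(x'00ff')"
).fetchone()
print(row)
conn.close()
```

The literal x'00ff' is SQLite's notation for a BLOB written in hexadecimal.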
Database schema
A database schema defines how data is organised within a relational database.
It is the definition for the entire database, for example:
Relations
Views
Indexes
In this course we only consider schema for relations.
Relation schema example
The following SQL code defines and creates the relation course :
CREATE TABLE course(
code TEXT NOT NULL CHECK(length(code) = 5),
name TEXT NOT NULL,
quota INTEGER CHECK(quota >= 0),
PRIMARY KEY(code)
);
Schema: enforce guarantees
The schema for course enforces:
Uniqueness:
No two records may have the same code
Correct type:
e.g. quota is declared as an integer (note that SQLite is lenient here: unless the table is declared STRICT, it tries to convert a value to the declared type but does not reject values it cannot convert)
Constraints:
e.g. quota must not be negative
e.g. Each record must have a value for both code and name
e.g. code must have length 5
For some RDBMS, we can also restrict the code to the form CCddd, where C is
a capital letter and d is a digit, using a regular expression (regex).
We will not talk about how to do it in this course, but we will learn about
regex in week 7. Stay tuned!
Primary key
A primary key is an attribute or a set of attributes whose values
uniquely identify each record in a table.
Primary key values must be unique
Used to cross-reference between tables
Example:
code in the course table is a primary key
name in the course table is not a primary key
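To make the distinction concrete, here is a minimal sketch using Python's standard sqlite3 module and an in-memory database. It reuses the course schema from the earlier slide; the row named 'Another course' is a made-up value for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE course(
        code TEXT NOT NULL CHECK(length(code) = 5),
        name TEXT NOT NULL,
        quota INTEGER CHECK(quota >= 0),
        PRIMARY KEY(code)
    )
""")

# Two different codes may share the same name: name is not a key.
conn.execute("INSERT INTO course VALUES ('ST115', 'Managing and Visualising Data', 60)")
conn.execute("INSERT INTO course VALUES ('ST445', 'Managing and Visualising Data', 60)")

# Re-using an existing code violates the primary key (hypothetical row).
try:
    conn.execute("INSERT INTO course VALUES ('ST115', 'Another course', 30)")
    duplicate_rejected = False
except sqlite3.IntegrityError as err:
    duplicate_rejected = True
    print("Rejected:", err)

n_rows = conn.execute("SELECT COUNT(*) FROM course").fetchone()[0]
conn.close()
```

Only the two valid rows end up in the table, which is why code can identify a course but name cannot.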
SQL syntax: create a table

CREATE TABLE <table name>(
<attribute name 1> <data type 1> [constraints],
<attribute name 2> <data type 2> [constraints],
...
<attribute name n> <data type n> [constraints],
PRIMARY KEY(<some attribute(s)>)
);
More examples of table

student table:
id name program department year
202022333 Harry BSc Data Science Statistics 2
202012345 Ron BSc Data Science Statistics 2
202054321 Hermione BSc Economics Economics 2
202101010 Ginny BSc Data Science Statistics 1
202155555 Dobby BSc Actuarial Science Statistics 1
202124680 Harry MSc Data Science Statistics 1
More examples of table
registration table:
courseCode studentId mark
ST207 202022333 72
MA214 202022333 NULL
ST207 202012345 66
MA214 202012345 NULL
EC220 202054321 NULL
ST101 202054321 93
ST115 202054321 NULL
ST101 202101010 70
ST115 202101010 NULL
More examples of database schema

CREATE TABLE student(
id TEXT NOT NULL CHECK (length(id) = 9),
name TEXT NOT NULL,
program TEXT,
department TEXT,
year INTEGER CHECK (year >= 1),
PRIMARY KEY(id)
);

CREATE TABLE registration(
courseCode TEXT NOT NULL CHECK(length(courseCode) = 5),
studentId TEXT NOT NULL CHECK(length(studentId) = 9),
mark INTEGER,
PRIMARY KEY(studentId, courseCode)
);

Question: Why do we need two attributes as the primary key for the
registration table?
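One way to explore this question is to try a few inserts. The sketch below (Python's standard sqlite3 module, in-memory database) reuses the registration schema; the final row with mark 80 is a made-up value used only to trigger the error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE registration(
        courseCode TEXT NOT NULL CHECK(length(courseCode) = 5),
        studentId TEXT NOT NULL CHECK(length(studentId) = 9),
        mark INTEGER,
        PRIMARY KEY(studentId, courseCode)
    )
""")

# Same student, different courses: allowed, so studentId alone is not unique.
conn.execute("INSERT INTO registration VALUES ('ST207', '202022333', 72)")
conn.execute("INSERT INTO registration VALUES ('MA214', '202022333', NULL)")
# Same course, different student: allowed, so courseCode alone is not unique.
conn.execute("INSERT INTO registration VALUES ('ST207', '202012345', 66)")

# The same (studentId, courseCode) pair a second time is rejected.
try:
    conn.execute("INSERT INTO registration VALUES ('ST207', '202022333', 80)")
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True

n_rows = conn.execute("SELECT COUNT(*) FROM registration").fetchone()[0]
conn.close()
```

Neither attribute identifies a registration on its own, but the pair does.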
Foreign key
A foreign key is an attribute or a set of attributes in a table that refers to the
primary key of another table.
The foreign key shows how tables are linked
Example: studentId in the registration table is a foreign key referring to id in the student table
You can also explicitly specify foreign key relationships in the schema, but we
will not cover it in this course.
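For curious readers only, and outside the scope of this course, the following is a rough sketch of what an explicit foreign key can look like in SQLite, using Python's standard sqlite3 module. The tables are simplified versions of student and registration, and the student id '999999999' is a made-up value that does not exist in student. SQLite only enforces foreign keys when the foreign_keys pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Foreign key enforcement is off by default in SQLite; enable it per connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE student(id TEXT NOT NULL, name TEXT NOT NULL, PRIMARY KEY(id))")
conn.execute("""
    CREATE TABLE registration(
        courseCode TEXT NOT NULL,
        studentId TEXT NOT NULL,
        mark INTEGER,
        PRIMARY KEY(studentId, courseCode),
        FOREIGN KEY(studentId) REFERENCES student(id)
    )
""")

conn.execute("INSERT INTO student VALUES ('202022333', 'Harry')")
# This registration refers to an existing student, so it is accepted.
conn.execute("INSERT INTO registration VALUES ('ST207', '202022333', 72)")

# This registration refers to a student id that does not exist (hypothetical value).
try:
    conn.execute("INSERT INTO registration VALUES ('ST207', '999999999', 50)")
    orphan_rejected = False
except sqlite3.IntegrityError as err:
    orphan_rejected = True
    print("Rejected:", err)

n_registrations = conn.execute("SELECT COUNT(*) FROM registration").fetchone()[0]
conn.close()
```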
SQL insert
We can add a row by:
INSERT INTO <table> VALUES (<value 1>, <value 2>, ..., <value n>);

Example:
INSERT INTO course VALUES ('ST101', 'Programming for Data Science', 90);
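The same kind of INSERT can also be issued from plain Python. The sketch below assumes Python's standard sqlite3 module and an in-memory database rather than the notebook setup used in this lecture. It passes the values through ? placeholders and names the columns explicitly, so the statement does not depend on the column order in the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE course(
        code TEXT NOT NULL CHECK(length(code) = 5),
        name TEXT NOT NULL,
        quota INTEGER CHECK(quota >= 0),
        PRIMARY KEY(code)
    )
""")

rows = [
    ("ST101", "Programming for Data Science", 90),
    ("ST207", "Databases", None),  # Python None is stored as SQL NULL
]
# One parameterised statement is executed once per tuple in rows.
conn.executemany("INSERT INTO course(code, name, quota) VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT code, name, quota FROM course ORDER BY code").fetchall()
print(stored)
conn.close()
```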
SQL in Jupyter notebook
In this lecture, we will run SQL scripts in Jupyter Notebook using the %sql (or
%%sql ) magic.
%sql for one line and %%sql for multiple lines
To do so, please install ipython-sql using the following
command:
conda install -c conda-forge ipython-sql

and run the following line:


In [1]:
%load_ext sql

Now you can run SQL by putting %sql or %%sql at the beginning of
a code cell.
Connect to / create a database
The following command connects to school.db in the folder where your
Jupyter notebook is (and creates it if it does not exist):

In [2]:
%sql sqlite:///school.db

Out[2]:

'Connected: @school.db'

Alternatives:
sqlite:// : temporary in-memory database
sqlite:////Users/yyy/xxx.db : absolute path to the database
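For comparison, here is a minimal sketch of the same two kinds of connection using Python's standard sqlite3 module instead of the %sql magic. A temporary folder stands in for the notebook folder, and the simplified course table is an assumption made for this sketch.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "school.db")
    existed_before = os.path.exists(path)

    # Connecting to a file path creates the database file if it is not there yet.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS course(code TEXT, name TEXT, quota INTEGER)")
    conn.commit()
    conn.close()
    exists_after = os.path.exists(path)

# ":memory:" gives a temporary database that starts empty and
# disappears when the connection is closed.
memory_conn = sqlite3.connect(":memory:")
tables_in_memory = memory_conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
memory_conn.close()
```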
Create a table
In [3]:
%%sql

DROP TABLE IF EXISTS course; -- remove the table if already exists


CREATE TABLE course(
code TEXT NOT NULL CHECK(length(code) = 5),
name TEXT NOT NULL,
quota INTEGER CHECK(quota >= 0),
PRIMARY KEY(code)
);

* sqlite:///school.db
Done.
Done.

Out[3]:

[]
Insert rows into the table
In [4]:
%%sql

INSERT INTO course VALUES ('ST101', 'Programming for Data Science', 90);
INSERT INTO course VALUES ('ST115', 'Managing and Visualising Data', 60);
INSERT INTO course VALUES ('MA214', 'Algorithms and Data Structures', NULL);
INSERT INTO course VALUES ('ST207', 'Databases', NULL);
INSERT INTO course VALUES ('ST310', 'Machine Learning', 60);
INSERT INTO course VALUES ('ST311', 'Artificial Intelligence', 30);
INSERT INTO course VALUES ('ST445', 'Managing and Visualising Data', 60);

* sqlite:///school.db
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.

Out[4]:

[]
View the table
Check if we have created the table and inserted the rows properly:
In [5]:
%sql SELECT * FROM course;

* sqlite:///school.db
Done.

Out[5]:
code name quota
ST101 Programming for Data Science 90
ST115 Managing and Visualising Data 60
MA214 Algorithms and Data Structures None
ST207 Databases None
ST310 Machine Learning 60
ST311 Artificial Intelligence 30
ST445 Managing and Visualising Data 60
We will talk about the syntax in the next section.
Test the constraints
Try to insert ST101 again:
In [6]:
%sql INSERT INTO course VALUES ('ST101', 'Programming for Data Science', 90);

* sqlite:///school.db

---------------------------------------------------------------------------
IntegrityError                            Traceback (most recent call last)
...
IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: course.code
[SQL: INSERT INTO course VALUES ('ST101', 'Programming for Data Science', 90);]
(Background on this error at: https://sqlalche.me/e/14/gkpj)

Try to insert ST2189 (wrong code length):


In [7]:
%sql INSERT INTO course VALUES ('ST2189', 'Introduction to Data Science', NULL);

* sqlite:///school.db

---------------------------------------------------------------------------
IntegrityError                            Traceback (most recent call last)
...
IntegrityError: (sqlite3.IntegrityError) CHECK constraint failed: length(code) = 5
[SQL: INSERT INTO course VALUES ('ST2189', 'Introduction to Data Science', NULL);]
(Background on this error at: https://sqlalche.me/e/14/gkpj)

Try to insert a course without a name:


In [8]:
%sql INSERT INTO course VALUES ('ST102', NULL, NULL);

* sqlite:///school.db

---------------------------------------------------------------------------
IntegrityError                            Traceback (most recent call last)
...
IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: course.name
[SQL: INSERT INTO course VALUES ('ST102', NULL, NULL);]
(Background on this error at: https://sqlalche.me/e/14/gkpj)
Create the student table and insert rows
In [9]:
%%sql

DROP TABLE IF EXISTS student;


CREATE TABLE student(
id TEXT NOT NULL CHECK (length(id) = 9),
name TEXT NOT NULL,
program TEXT,
department TEXT,
year INTEGER CHECK (year >= 1),
PRIMARY KEY(id)
);

INSERT INTO student VALUES ('202022333', 'Harry', 'BSc Data Science', 'Statistics', 2);
INSERT INTO student VALUES ('202012345', 'Ron', 'BSc Data Science', 'Statistics', 2);
INSERT INTO student VALUES ('202054321', 'Hermione', 'BSc Economics', 'Economics', 2);
INSERT INTO student VALUES ('202101010', 'Ginny', 'BSc Data Science', 'Statistics', 1);
INSERT INTO student VALUES ('202155555', 'Dobby', 'BSc Actuarial Science', 'Statistics', 1);
INSERT INTO student VALUES ('202124680', 'Harry', 'MSc Data Science', 'Statistics', 1);

* sqlite:///school.db
Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.

Out[9]:
[]
View the student table
Check if we have created the table and inserted the rows properly:
In [10]:
%sql SELECT * FROM student;

* sqlite:///school.db
Done.

Out[10]:
id name program department year
202022333 Harry BSc Data Science Statistics 2
202012345 Ron BSc Data Science Statistics 2
202054321 Hermione BSc Economics Economics 2
202101010 Ginny BSc Data Science Statistics 1
202155555 Dobby BSc Actuarial Science Statistics 1
202124680 Harry MSc Data Science Statistics 1
Create the registration table and insert rows
In [11]:
%%sql

DROP TABLE IF EXISTS registration;


CREATE TABLE registration(
courseCode TEXT NOT NULL CHECK(length(courseCode) = 5),
studentId TEXT NOT NULL CHECK(length(studentId) = 9),
mark INTEGER,
PRIMARY KEY(studentId, courseCode)
);

INSERT INTO registration VALUES ('ST207', '202022333', 72);


INSERT INTO registration VALUES ('MA214', '202022333', NULL);
INSERT INTO registration VALUES ('ST207', '202012345', 66);
INSERT INTO registration VALUES ('MA214', '202012345', NULL);
INSERT INTO registration VALUES ('EC220', '202054321', NULL);
INSERT INTO registration VALUES ('ST101', '202054321', 93);
INSERT INTO registration VALUES ('ST115', '202054321', NULL);
INSERT INTO registration VALUES ('ST101', '202101010', 70);
INSERT INTO registration VALUES ('ST115', '202101010', NULL);

* sqlite:///school.db
Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
Out[11]:

[]
View the registration table
Check if we have created the table and inserted the rows properly:
In [12]:
%sql SELECT * FROM registration;

* sqlite:///school.db
Done.

Out[12]:
courseCode studentId mark
ST207 202022333 72
MA214 202022333 None
ST207 202012345 66
MA214 202012345 None
EC220 202054321 None
ST101 202054321 93
ST115 202054321 None
ST101 202101010 70
ST115 202101010 None
SQL query
More on SQL
SQL is a computer language used to communicate with an RDBMS. It allows
us to store, update and query data and to control access to a relational database. It
consists of many types of statements:
Data query
Data manipulation (insert, update and delete)
Data definition (schema creation and modification)
Data access control
We have previously seen how we can use SQL to create tables and insert data.
Now we will learn how to query data from a database using SQL.
In this course we will not cover how to use SQL to control access or modify
the data
SQL query syntax

SELECT [DISTINCT] <column expression list>
FROM <list of tables>
[WHERE <predicate>]
[GROUP BY <column list>]
[HAVING <predicate>]
[ORDER BY <column list>]
[LIMIT <number of rows>];
SQL simple query demo 1: use * and LIMIT
Example: See the first 3 rows from the course table
In [13]:
%%sql
SELECT *
FROM course
LIMIT 3;

* sqlite:///school.db
Done.

Out[13]:
code name quota
ST101 Programming for Data Science 90
ST115 Managing and Visualising Data 60
MA214 Algorithms and Data Structures None
Use SELECT * to select all attributes
Use LIMIT to specify the number of rows to get
SQL simple query demo 2: use WHERE and ORDER BY
Example (filtering): Select the names of the students on BSc Data
Science, sorted by name in descending order
In [14]:
%%sql
SELECT name
FROM student
WHERE program = 'BSc Data Science'
ORDER BY name DESC;

* sqlite:///school.db
Done.

Out[14]:
name
Ron
Harry
Ginny
Use WHERE to filter data based on some condition. Note that a single = is
used for equality!
Use ORDER BY to order the data based on some attribute(s)
Use DESC if you want the data to be ordered in descending order
(default is ascending order)
SQL simple query demo 3: use DISTINCT
Example (Unique values): Get the list of unique departments from
student .

In [15]:
%%sql
SELECT DISTINCT department
FROM student;

* sqlite:///school.db
Done.

Out[15]:
department
Statistics
Economics
Use DISTINCT to return unique results
SQL simple query demo 4: check NULL
Example (Find NULL values): Find the registrations with a missing mark.
In [16]:
%%sql
SELECT *
FROM registration
WHERE mark IS NULL;

* sqlite:///school.db
Done.

Out[16]:
courseCode studentId mark
MA214 202022333 None
MA214 202012345 None
EC220 202054321 None
ST115 202054321 None
ST115 202101010 None
Note you must use IS NULL to check for NULL . A comparison such as
mark = NULL evaluates to NULL (unknown) rather than true, so the following
returns no rows:
In [17]:
%sql SELECT * FROM registration WHERE mark = NULL;

* sqlite:///school.db
Done.

Out[17]:
courseCode studentId mark
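The behaviour above can be reproduced outside the notebook. The following is a minimal sketch using Python's built-in sqlite3 module, assuming an in-memory database and three sample rows copied from the registration table (the variable names are illustrative, not part of the lecture code):

```python
import sqlite3

# Assumption: a throwaway in-memory database with a few rows from the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE registration (courseCode TEXT, studentId INTEGER, mark INTEGER)")
conn.executemany(
    "INSERT INTO registration VALUES (?, ?, ?)",
    [("ST207", 202022333, 72), ("MA214", 202022333, None), ("ST101", 202054321, 93)],
)

# Comparing with = yields NULL (None in Python), never true.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
# IS NULL yields 1 (true) when the operand is NULL.
null_is_null = conn.execute("SELECT NULL IS NULL").fetchone()[0]

with_equals = conn.execute("SELECT * FROM registration WHERE mark = NULL").fetchall()
with_is_null = conn.execute("SELECT * FROM registration WHERE mark IS NULL").fetchall()

print(null_equals_null, null_is_null)
print(with_equals)
print(with_is_null)
```

Because WHERE keeps only rows whose condition is true, the = NULL version filters out every row, including the one whose mark really is missing.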
SQL simple query demo 5: use AND / OR to chain up conditions
Example: Get the rows from the registration table where the course is
ST101 and the mark is ≥ 90
In [18]:
%%sql
SELECT *
FROM registration
WHERE courseCode = 'ST101' AND mark >= 90;

* sqlite:///school.db
Done.

Out[18]:
courseCode studentId mark
ST101 202054321 93
Example: Get the courses whose code is ST101 or ST115 from the
course table

In [19]:
%%sql
SELECT *
FROM course
WHERE code = 'ST101' OR code = 'ST115';

* sqlite:///school.db
Done.

Out[19]:
code name quota
ST101 Programming for Data Science 90
ST115 Managing and Visualising Data 60
SQL simple query demo 6: built-in predicates
The last example can also be written using some built-in predicates.
Example: Use the LIKE keyword to test whether a string matches a pattern
with wildcards:
% : zero, one, or multiple characters
_ : one, single character

In [23]:
%%sql
SELECT *
FROM course
WHERE code LIKE 'ST1__';

* sqlite:///school.db
Done.

Out[23]:
code name quota
ST101 Programming for Data Science 90
ST115 Managing and Visualising Data 60
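To see how the two wildcards differ, here is a small sketch using Python's sqlite3 module. It assumes an in-memory course table with four sample courses (ST207 Databases is taken from later in the lecture), and the helper codes_like is a hypothetical name introduced only for this illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (code TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [
        ("ST101", "Programming for Data Science"),
        ("ST115", "Managing and Visualising Data"),
        ("ST207", "Databases"),
        ("MA214", "Algorithms and Data Structures"),
    ],
)


def codes_like(pattern):
    """Return the course codes matching a LIKE pattern, sorted by code."""
    rows = conn.execute(
        "SELECT code FROM course WHERE code LIKE ? ORDER BY code", (pattern,)
    )
    return [code for (code,) in rows]


st1_courses = codes_like("ST1__")  # 'ST1' followed by exactly two characters
all_st_courses = codes_like("ST%")  # 'ST' followed by any number of characters
ends_with_4 = codes_like("%4")      # anything ending in '4'
too_short = codes_like("ST1_")      # only four characters, so no code matches

print(st1_courses, all_st_courses, ends_with_4, too_short)
```

Each underscore must match exactly one character, which is why the four-character pattern matches none of the five-character codes.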
SQL simple query demo 6: built-in predicates (continued)
Example: Use the IN keyword to specify multiple values, which is a
shorthand for multiple OR conditions
In [19]:
%%sql
SELECT *
FROM course
WHERE code IN ('ST101', 'ST115');

* sqlite:///school.db
Done.

Out[19]:
code name quota
ST101 Programming for Data Science 90
ST115 Managing and Visualising Data 60
Note on SQL syntax case
SQL keywords and table and column names in SQLite are case insensitive. For
example, the following code will give you the same result as the previous slide:
In [20]:
%%sql
select *
from COURSE
where CODE in ('ST101', 'ST115');

* sqlite:///school.db
Done.

Out[20]:
code name quota
ST101 Programming for Data Science 90
ST115 Managing and Visualising Data 60
However, it is the convention to use CAPITAL letters for the SQL keywords and
functions, and camelCase for the table and attribute names.
Note on SQL syntax case (continued)
Note that text comparison with = or IN is case sensitive:
In [21]:
%%sql
select *
from COURSE
where CODE in ('st101', 'ST115');

* sqlite:///school.db
Done.

Out[21]:
code name quota
ST115 Managing and Visualising Data 60
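The following sketch contrasts the case rules using Python's sqlite3 module. It assumes an in-memory course table with three sample rows and SQLite's default settings. COLLATE NOCASE and the default case-insensitive LIKE (for ASCII letters) are standard SQLite features not covered in the slides, and the helper codes is a hypothetical name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (code TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [
        ("ST101", "Programming for Data Science"),
        ("ST115", "Managing and Visualising Data"),
        ("MA214", "Algorithms and Data Structures"),
    ],
)


def codes(query):
    """Run a query that selects a single column and return it as a list."""
    return [code for (code,) in conn.execute(query)]


# Keywords and identifiers: case does not matter.
lower_keywords = codes("select CODE from COURSE where CODE = 'ST101'")
# Text values compared with = : case matters.
exact_match = codes("SELECT code FROM course WHERE code = 'st101'")
# Asking for a case-insensitive comparison explicitly.
nocase_match = codes("SELECT code FROM course WHERE code = 'st101' COLLATE NOCASE")
# LIKE ignores case for ASCII letters under SQLite's default settings.
like_match = codes("SELECT code FROM course WHERE code LIKE 'st1__' ORDER BY code")

print(lower_keywords, exact_match, nocase_match, like_match)
```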
A Grammar of data manipulation
Similar functionality is provided by Pandas:
functionality SQL Pandas
Select/filter data SELECT , WHERE loc[] , iloc[]
Sort ORDER BY sort_values()
Join JOIN merge()
Aggregate AVG() , COUNT() , GROUP BY mean() , count() , groupby()
Aggregation
SQL provides aggregate functions to return a single value from a set of values
Examples:
COUNT()
SUM()
AVG()
MIN() , MAX()
See here for a list of aggregate functions provided by SQLite.
SQL aggregation demo: COUNT()
Example: count the number of students registered for ST101
In [24]:
%%sql
SELECT COUNT(*)
FROM registration
WHERE courseCode = 'ST101';

* sqlite:///school.db
Done.

Out[24]:
COUNT(*)
2
Example: count the number of unique departments from student
In [25]:
%%sql
SELECT COUNT(DISTINCT department)
FROM student;
* sqlite:///school.db
Done.
Out[25]:
COUNT(DISTINCT department)
2
SQL aggregation demo: MAX()
Example: find the maximum mark for ST207 in registration :
In [26]:
%%sql
SELECT MAX(mark) AS maxMark
FROM registration
WHERE courseCode = 'ST207';

* sqlite:///school.db
Done.

Out[26]:
maxMark
72
The AS keyword is used to rename a column or table with an alias.
SQL aggregation demo: AVG() (with the use of GROUP BY )
Example: find the average mark for each course:
In [25]:
%%sql
SELECT courseCode, AVG(mark) AS avgMark
FROM registration
GROUP BY courseCode;

* sqlite:///school.db
Done.

Out[25]:
courseCode avgMark
EC220 None
MA214 None
ST101 81.5
ST115 None
ST207 69.0
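The None averages above arise because aggregate functions such as AVG() and COUNT(mark) skip NULL values, while COUNT(*) counts every row. A minimal sketch with Python's sqlite3 module, assuming an in-memory copy of the registration table from the slides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE registration (courseCode TEXT, studentId INTEGER, mark INTEGER)")
conn.executemany(
    "INSERT INTO registration VALUES (?, ?, ?)",
    [
        ("ST207", 202022333, 72), ("MA214", 202022333, None),
        ("ST207", 202012345, 66), ("MA214", 202012345, None),
        ("EC220", 202054321, None), ("ST101", 202054321, 93),
        ("ST115", 202054321, None), ("ST101", 202101010, 70),
        ("ST115", 202101010, None),
    ],
)

summary = conn.execute(
    """
    SELECT courseCode,
           COUNT(*)    AS nRows,   -- every row in the group
           COUNT(mark) AS nMarks,  -- only rows with a non-NULL mark
           AVG(mark)   AS avgMark  -- average of the non-NULL marks
    FROM registration
    GROUP BY courseCode
    ORDER BY courseCode;
    """
).fetchall()

for row in summary:
    print(row)
```

A group whose marks are all NULL has nothing to average, so AVG() returns NULL for it rather than 0.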
SQL aggregation demo: HAVING
To filter the groups based on some specified conditions, use the HAVING
clause.
Example: find the courses such that the average mark is ≥ 70:
In [26]:
%%sql
SELECT courseCode, AVG(mark) AS avgMark
FROM registration
GROUP BY courseCode
HAVING AVG(mark) >= 70;

* sqlite:///school.db
Done.

Out[26]:
courseCode avgMark
ST101 81.5
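WHERE and HAVING filter at different stages: WHERE removes individual rows before grouping, whereas HAVING removes whole groups after aggregation. The sketch below, which assumes an in-memory copy of the registration table, runs both variants with Python's sqlite3 module to show that they answer different questions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE registration (courseCode TEXT, studentId INTEGER, mark INTEGER)")
conn.executemany(
    "INSERT INTO registration VALUES (?, ?, ?)",
    [
        ("ST207", 202022333, 72), ("MA214", 202022333, None),
        ("ST207", 202012345, 66), ("MA214", 202012345, None),
        ("EC220", 202054321, None), ("ST101", 202054321, 93),
        ("ST115", 202054321, None), ("ST101", 202101010, 70),
        ("ST115", 202101010, None),
    ],
)

# HAVING: keep the courses whose average mark is at least 70.
having_rows = conn.execute(
    """
    SELECT courseCode, AVG(mark)
    FROM registration
    GROUP BY courseCode
    HAVING AVG(mark) >= 70
    ORDER BY courseCode;
    """
).fetchall()

# WHERE: first drop every mark below 70, then average what is left.
where_rows = conn.execute(
    """
    SELECT courseCode, AVG(mark)
    FROM registration
    WHERE mark >= 70
    GROUP BY courseCode
    ORDER BY courseCode;
    """
).fetchall()

print(having_rows)
print(where_rows)
```

In the WHERE variant ST207 appears with an average of 72.0, because the mark of 66 was discarded before the average was computed.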
SQL joins
SQL joins allow you to combine and retrieve data from two or more tables.
Inner join demo
Example: combine the student and registration table using the inner
join.
In [27]:
%%sql
SELECT *
FROM student
INNER JOIN registration
ON student.id = registration.studentId;

* sqlite:///school.db
Done.

Out[27]:
id name program department year courseCode studentId mark
202022333 Harry BSc Data Science Statistics 2 ST207 202022333 72
202022333 Harry BSc Data Science Statistics 2 MA214 202022333 None
202012345 Ron BSc Data Science Statistics 2 ST207 202012345 66
202012345 Ron BSc Data Science Statistics 2 MA214 202012345 None
202054321 Hermione BSc Economics Economics 2 EC220 202054321 None
202054321 Hermione BSc Economics Economics 2 ST101 202054321 93
202054321 Hermione BSc Economics Economics 2 ST115 202054321 None
202101010 Ginny BSc Data Science Statistics 1 ST101 202101010 70
202101010 Ginny BSc Data Science Statistics 1 ST115 202101010 None
The primary key id of student and the foreign key studentId of
registration are used to combine the data.
Inner join demo (continued)
Example: get the list of names and programs for students registered in
Statistics courses.
In [28]:
%%sql

SELECT DISTINCT studentId, name, program
FROM student
INNER JOIN registration
ON student.id = registration.studentId
WHERE courseCode LIKE 'ST___';

* sqlite:///school.db
Done.

Out[28]:
studentId name program
202012345 Ron BSc Data Science
202022333 Harry BSc Data Science
202054321 Hermione BSc Economics
202101010 Ginny BSc Data Science
Inner join demo (continued)
Example: For each student who has registered for at least one Statistics
course, get the number of Statistics courses the student is in.
In [29]:
%%sql

SELECT studentId, name, COUNT(*)
FROM student
INNER JOIN registration
ON student.id = registration.studentId
WHERE courseCode LIKE 'ST___'
GROUP BY studentId;

* sqlite:///school.db
Done.

Out[29]:
studentId name COUNT(*)
202012345 Ron 1
202022333 Harry 1
202054321 Hermione 2
202101010 Ginny 2
Joining the tables allows us to aggregate the information and answer
some more complicated questions.
Left join
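Example: Same as the previous inner join example, but with the left join. A
left join keeps every row of the left table ( student ); students without a
matching registration get NULL (shown as None ) in the registration columns: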
In [30]:
%%sql
SELECT *
FROM student
LEFT JOIN registration
ON id = studentId
ORDER BY courseCode

* sqlite:///school.db
Done.

Out[30]:
id name program department year courseCode studentId mark
202155555 Dobby BSc Actuarial Science Statistics 1 None None None
202124680 Harry MSc Data Science Statistics 1 None None None
202054321 Hermione BSc Economics Economics 2 EC220 202054321 None
202022333 Harry BSc Data Science Statistics 2 MA214 202022333 None
202012345 Ron BSc Data Science Statistics 2 MA214 202012345 None
202054321 Hermione BSc Economics Economics 2 ST101 202054321 93
202101010 Ginny BSc Data Science Statistics 1 ST101 202101010 70
202054321 Hermione BSc Economics Economics 2 ST115 202054321 None
202101010 Ginny BSc Data Science Statistics 1 ST115 202101010 None
202022333 Harry BSc Data Science Statistics 2 ST207 202022333 72
202012345 Ron BSc Data Science Statistics 2 ST207 202012345 66
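One common use of a left join is to find the rows of the left table that have no match at all. The sketch below uses Python's sqlite3 module and assumes in-memory copies of the student table (without the department and year columns, for brevity) and the registration table. It compares the row counts of the two join types and lists the students with no registration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, program TEXT)")
conn.execute("CREATE TABLE registration (courseCode TEXT, studentId INTEGER, mark INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [
        (202022333, "Harry", "BSc Data Science"),
        (202012345, "Ron", "BSc Data Science"),
        (202054321, "Hermione", "BSc Economics"),
        (202101010, "Ginny", "BSc Data Science"),
        (202155555, "Dobby", "BSc Actuarial Science"),
        (202124680, "Harry", "MSc Data Science"),
    ],
)
conn.executemany(
    "INSERT INTO registration VALUES (?, ?, ?)",
    [
        ("ST207", 202022333, 72), ("MA214", 202022333, None),
        ("ST207", 202012345, 66), ("MA214", 202012345, None),
        ("EC220", 202054321, None), ("ST101", 202054321, 93),
        ("ST115", 202054321, None), ("ST101", 202101010, 70),
        ("ST115", 202101010, None),
    ],
)

join_sql = "SELECT COUNT(*) FROM student {} JOIN registration ON student.id = registration.studentId"
inner_count = conn.execute(join_sql.format("INNER")).fetchone()[0]
left_count = conn.execute(join_sql.format("LEFT")).fetchone()[0]

# Unmatched students have NULL in every registration column.
unregistered = conn.execute(
    """
    SELECT id, name
    FROM student
    LEFT JOIN registration ON student.id = registration.studentId
    WHERE registration.studentId IS NULL
    ORDER BY id;
    """
).fetchall()

print(inner_count, left_count, unregistered)
```

The left join returns two more rows than the inner join, one for each student who has not registered for any course.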
Right join
SQLite versions before 3.39.0 do not support RIGHT JOIN , but we can get the
same result by swapping the tables in a left join:
In [31]:
%%sql

SELECT *
FROM registration
LEFT JOIN student
ON id = studentId
ORDER BY studentId

* sqlite:///school.db
Done.

Out[31]:
courseCode studentId mark id name program department year
MA214 202012345 None 202012345 Ron BSc Data Science Statistics 2
ST207 202012345 66 202012345 Ron BSc Data Science Statistics 2
MA214 202022333 None 202022333 Harry BSc Data Science Statistics 2
ST207 202022333 72 202022333 Harry BSc Data Science Statistics 2
EC220 202054321 None 202054321 Hermione BSc Economics Economics 2
ST101 202054321 93 202054321 Hermione BSc Economics Economics 2
ST115 202054321 None 202054321 Hermione BSc Economics Economics 2
ST101 202101010 70 202101010 Ginny BSc Data Science Statistics 1
ST115 202101010 None 202101010 Ginny BSc Data Science Statistics 1
Cross join
A cross join returns the Cartesian product of rows from the tables, i.e., it
combines each row from the left table with each row from the right table.
In [32]:
%%sql

SELECT *
FROM registration, student
LIMIT 15; -- only show the first 15 rows

* sqlite:///school.db
Done.

Out[32]:
courseCode studentId mark id name program department year
ST207 202022333 72 202022333 Harry BSc Data Science Statistics 2
ST207 202022333 72 202012345 Ron BSc Data Science Statistics 2
ST207 202022333 72 202054321 Hermione BSc Economics Economics 2
ST207 202022333 72 202101010 Ginny BSc Data Science Statistics 1
ST207 202022333 72 202155555 Dobby BSc Actuarial Science Statistics 1
ST207 202022333 72 202124680 Harry MSc Data Science Statistics 1
MA214 202022333 None 202022333 Harry BSc Data Science Statistics 2
MA214 202022333 None 202012345 Ron BSc Data Science Statistics 2
MA214 202022333 None 202054321 Hermione BSc Economics Economics 2
MA214 202022333 None 202101010 Ginny BSc Data Science Statistics 1
MA214 202022333 None 202155555 Dobby BSc Actuarial Science Statistics 1
MA214 202022333 None 202124680 Harry MSc Data Science Statistics 1
ST207 202012345 66 202022333 Harry BSc Data Science Statistics 2
ST207 202012345 66 202012345 Ron BSc Data Science Statistics 2
ST207 202012345 66 202054321 Hermione BSc Economics Economics 2
Relations between cross join and inner join
Note the following cross join with WHERE clause provides the same result as
the first inner join that we have seen:
In [33]:
%%sql
SELECT *
FROM registration, student
WHERE id = studentId;

* sqlite:///school.db
Done.

Out[33]:
courseCode studentId mark id name program department year
ST207 202022333 72 202022333 Harry BSc Data Science Statistics 2
MA214 202022333 None 202022333 Harry BSc Data Science Statistics 2
ST207 202012345 66 202012345 Ron BSc Data Science Statistics 2
MA214 202012345 None 202012345 Ron BSc Data Science Statistics 2
EC220 202054321 None 202054321 Hermione BSc Economics Economics 2
ST101 202054321 93 202054321 Hermione BSc Economics Economics 2
ST115 202054321 None 202054321 Hermione BSc Economics Economics 2
ST101 202101010 70 202101010 Ginny BSc Data Science Statistics 1
ST115 202101010 None 202101010 Ginny BSc Data Science Statistics 1
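The relationship between the two joins can be checked programmatically. This sketch, using Python's sqlite3 module on assumed in-memory copies of the two tables (student is shortened to three columns), confirms that the cross join has one row per pair of rows, and that filtering it with WHERE gives the same set of rows as an inner join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, program TEXT)")
conn.execute("CREATE TABLE registration (courseCode TEXT, studentId INTEGER, mark INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [
        (202022333, "Harry", "BSc Data Science"),
        (202012345, "Ron", "BSc Data Science"),
        (202054321, "Hermione", "BSc Economics"),
        (202101010, "Ginny", "BSc Data Science"),
        (202155555, "Dobby", "BSc Actuarial Science"),
        (202124680, "Harry", "MSc Data Science"),
    ],
)
conn.executemany(
    "INSERT INTO registration VALUES (?, ?, ?)",
    [
        ("ST207", 202022333, 72), ("MA214", 202022333, None),
        ("ST207", 202012345, 66), ("MA214", 202012345, None),
        ("EC220", 202054321, None), ("ST101", 202054321, 93),
        ("ST115", 202054321, None), ("ST101", 202101010, 70),
        ("ST115", 202101010, None),
    ],
)

n_registrations = conn.execute("SELECT COUNT(*) FROM registration").fetchone()[0]
n_students = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
n_cross = conn.execute("SELECT COUNT(*) FROM registration, student").fetchone()[0]

# Compare the two queries as sets of rows, since row order is not guaranteed.
cross_filtered = set(
    conn.execute("SELECT * FROM registration, student WHERE id = studentId")
)
inner_joined = set(
    conn.execute("SELECT * FROM registration INNER JOIN student ON id = studentId")
)

print(n_cross, n_registrations * n_students, cross_filtered == inner_joined)
```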
Mini Summary
We have learned how to use SQL to retrieve data, including:
Filter data
Aggregate data
Combine two tables
We will talk about the following in the workshop:
Join more than two tables
Subquery: query inside another query
Stay tuned!
Reading resources for SQL query
SQLite documentation
Visual Representation of SQL Joins
SQL parts from the pre-sessional course
Introduction to SQL, Summary Statistics, Group Summary
Statistics, Joining Data in SQL
Normalisation
Database normalisation is a process of structuring a database in order to
reduce data redundancy and improve data integrity.
There are a few rules for database normalisation. Each rule is called a
normal form:
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
...
Sixth normal form (6NF)
In this lecture, we will cover only 1NF to 3NF.
Motivating example
In this section we will consider the following data, which is similar to (but
slightly different from) the previous example:
code | name | quota | studentId | studentName | program | department
ST101 | Programming for Data Science | 90 | 202054321, 202101010, 202100000 | Hermione, Ginny, Dobby | BSc Economics, BSc Data Science, BSc Actuarial Science | Economics, Statistics, Statistics
ST115 | Managing and Visualising Data | 60 | 202054321, 202101010 | Hermione, Ginny | BSc Economics, BSc Data Science | Economics, Statistics
MA214 | Algorithms and Data Structures | 30 | 202124680, 202012345 | Harry, Ron | BSc Data Science, BSc Data Science | Statistics, Statistics
ST207 | Databases | 30 | 202124680, 202012345 | Harry, Ron | BSc Data Science, BSc Data Science | Statistics, Statistics
ST310 | Machine Learning | 60 | NULL | NULL | NULL | NULL
ST311 | Artificial Intelligence | 60 | NULL | NULL | NULL | NULL
ST445 | Managing and Visualising Data | 90 | NULL | NULL | NULL | NULL
Motivating example: observations
Duplicate data in the table
e.g. the information about students' program is duplicated
Issues:
May not be the best use of memory
Undesirable update dependencies:
If a student has changed the program, such
information needs to be updated in multiple places
and updates may result in inconsistencies
Courses that have no students must put "NULL" into the columns
studentId , studentName , program and department
Undesirable insertion dependency
May not be the best use of memory
More than one value in each cell
More difficult to search
More difficult to add/delete/modify
Example of normalised tables
The tables below contain the same information as the previous table:
course :
code name quota
ST101 Programming for Data Science 90
ST115 Managing and Visualising Data 60
MA214 Algorithms and Data Structures 30
ST207 Databases 30
ST310 Machine Learning 60
ST311 Artificial Intelligence 60
ST445 Managing and Visualising Data 90
student :
id name program
202054321 Hermione BSc Economics
202101010 Ginny BSc Data Science
202100000 Dobby BSc Actuarial Science
202124680 Harry BSc Data Science
202012345 Ron BSc Data Science
Example of normalised tables (continued)
program :
program department
BSc Economics Economics
BSc Data Science Statistics
BSc Actuarial Science Statistics
registration :
code studentId
ST101 202054321
ST101 202101010
ST101 202100000
ST115 202054321
ST115 202101010
MA214 202124680
MA214 202012345
ST207 202124680
ST207 202012345
First Normal Form (1NF)
A table is in 1NF if it satisfies:
Every row-and-column intersection contains exactly one value
Unique column names
Primary key and no duplicate rows
Order does not matter (in terms of both rows and columns)
No null values
Continue with the example table
The example table that we have is not in 1NF as:
Some row-and-column intersections contain more than one value
There are null values
Let us first make sure every row-and-column intersection contains exactly
one value by separating rows with multiple values in studentId ,
studentName , program and department into different rows:
code name quota studentId studentName program department
ST101 Programming for Data Science 90 202054321 Hermione BSc Economics Economics
ST101 Programming for Data Science 90 202101010 Ginny BSc Data Science Statistics
ST101 Programming for Data Science 90 202100000 Dobby BSc Actuarial Science Statistics
ST115 Managing and Visualising Data 60 202054321 Hermione BSc Economics Economics
ST115 Managing and Visualising Data 60 202101010 Ginny BSc Data Science Statistics
MA214 Algorithms and Data Structures 30 202124680 Harry BSc Data Science Statistics
MA214 Algorithms and Data Structures 30 202012345 Ron BSc Data Science Statistics
ST207 Databases 30 202124680 Harry BSc Data Science Statistics
ST207 Databases 30 202012345 Ron BSc Data Science Statistics
ST310 Machine Learning 60 NULL NULL NULL NULL
ST311 Artificial Intelligence 60 NULL NULL NULL NULL
ST445 Managing and Visualising Data 90 NULL NULL NULL NULL
Now every row-and-column intersection of the table contains exactly one
value.
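The row-splitting step can be sketched in plain Python. The example below assumes the multi-valued cells are held as Python lists and uses only three of the courses and two of the student columns for brevity. The names wide_rows, split_row and flat_rows are hypothetical and introduced only for this illustration:

```python
# Each course row holds lists in the student columns, so it is not in 1NF.
wide_rows = [
    ("ST101", "Programming for Data Science", 90,
     [202054321, 202101010, 202100000], ["Hermione", "Ginny", "Dobby"]),
    ("ST207", "Databases", 30, [202124680, 202012345], ["Harry", "Ron"]),
    ("ST310", "Machine Learning", 60, [], []),
]


def split_row(row):
    """Return one row per student, or a single row with NULLs if there are none."""
    code, name, quota, student_ids, student_names = row
    if not student_ids:
        return [(code, name, quota, None, None)]
    return [
        (code, name, quota, student_id, student_name)
        for student_id, student_name in zip(student_ids, student_names)
    ]


flat_rows = [flat for row in wide_rows for flat in split_row(row)]

for flat in flat_rows:
    print(flat)
```

Every cell now holds a single value, but the course details are repeated once per student and the course without students still carries NULLs, which is what the next steps address.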
Normalise the example table to 1NF
The table in the previous slide still has null values
To avoid NULL values, separate the information between course
and registration into two tables:
course :
code name quota
ST101 Programming for Data Science 90
ST101 Programming for Data Science 90
ST101 Programming for Data Science 90
ST115 Managing and Visualising Data 60
ST115 Managing and Visualising Data 60
MA214 Algorithms and Data Structures 30
MA214 Algorithms and Data Structures 30
ST207 Databases 30
ST207 Databases 30
ST310 Machine Learning 60
ST311 Artificial Intelligence 60
ST445 Managing and Visualising Data 90
Normalise the example table to 1NF
registration :
courseCode studentId name program department
ST101 202054321 Hermione BSc Economics Economics
ST101 202101010 Ginny BSc Data Science Statistics
ST101 202100000 Dobby BSc Actuarial Science Statistics
ST115 202054321 Hermione BSc Economics Economics
ST115 202101010 Ginny BSc Data Science Statistics
MA214 202124680 Harry BSc Data Science Statistics
MA214 202012345 Ron BSc Data Science Statistics
ST207 202124680 Harry BSc Data Science Statistics
ST207 202012345 Ron BSc Data Science Statistics
Now there are no more NULL values.
Normalise the example table to 1NF
The course table in the previous slide now has some duplicate rows
Remove duplication to normalise the table into 1NF:
code name quota
ST101 Programming for Data Science 90
ST115 Managing and Visualising Data 60
MA214 Algorithms and Data Structures 30
ST207 Databases 30
ST310 Machine Learning 60
ST311 Artificial Intelligence 60
ST445 Managing and Visualising Data 90
Normalised example table (1NF)
For 1NF we have the following tables:
course(code, name, quota), primary key: code
registration(courseCode, studentId, name, program, department), primary key: (courseCode, studentId)
Now it is easier to insert new data:
If another student registers for a course: add a new row to
registration
If there is a new course (even if there is no student registered to the
course yet): add a new row to course
Second normal form (2NF)
A table is in 2NF if it fulfils the following two requirements:
It is in 1NF
There is no partial dependency
A partial dependency happens if one or more non-key attributes
depend only on part of a candidate key
Candidate key: any column or a combination of columns
that can qualify as unique key in database. There can be
multiple candidate keys in one table. Each candidate key
can qualify as primary key
In other words, any non-key attribute should be dependent on "the whole
key". 2NF helps to eliminate redundant data.
Normalise the example table to 2NF
course : Already in 2NF as all other attributes depend on code
registration : Not in 2NF as the primary key is (courseCode,
studentId) but name , program and department only depend
on studentId (not courseCode )
Convert registration to 2NF by separating the information into
two separate tables:
registration :
courseCode studentId
ST101 202054321
ST101 202101010
ST101 202100000
ST115 202054321
ST115 202101010
MA214 202124680
MA214 202012345
ST207 202124680
ST207 202012345
Normalise the example table to 2NF (continued)
student :
id name program department
202054321 Hermione BSc Economics Economics
202101010 Ginny BSc Data Science Statistics
202100000 Dobby BSc Actuarial Science Statistics
202124680 Harry BSc Data Science Statistics
202012345 Ron BSc Data Science Statistics
Now we do not have multiple locations with the information about the program
that students belong to.
Normalised example table (2NF)
For 2NF we have the following tables:
course(code, name, quota), primary key: code
registration(courseCode, studentId), primary key: (courseCode, studentId)
student(id, name, program, department), primary key: id
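A partial dependency can be detected in sample data by checking whether part of the key already determines a non-key attribute. The following sketch in plain Python assumes the 1NF registration rows from this example. The function determines is a hypothetical helper, and a dependency that holds in sample data is only evidence for the rule, not a proof that it holds in general:

```python
columns = ("courseCode", "studentId", "name", "program", "department")
data = [
    ("ST101", 202054321, "Hermione", "BSc Economics", "Economics"),
    ("ST101", 202101010, "Ginny", "BSc Data Science", "Statistics"),
    ("ST101", 202100000, "Dobby", "BSc Actuarial Science", "Statistics"),
    ("ST115", 202054321, "Hermione", "BSc Economics", "Economics"),
    ("ST115", 202101010, "Ginny", "BSc Data Science", "Statistics"),
    ("MA214", 202124680, "Harry", "BSc Data Science", "Statistics"),
    ("MA214", 202012345, "Ron", "BSc Data Science", "Statistics"),
    ("ST207", 202124680, "Harry", "BSc Data Science", "Statistics"),
    ("ST207", 202012345, "Ron", "BSc Data Science", "Statistics"),
]
rows = [dict(zip(columns, values)) for values in data]


def determines(rows, lhs, rhs):
    """Return True if equal values in the lhs columns always give equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in lhs)
        value = tuple(row[column] for column in rhs)
        # Remember the first rhs value seen for this key; any later mismatch breaks it.
        if seen.setdefault(key, value) != value:
            return False
    return True


student_attributes = ["name", "program", "department"]
whole_key = determines(rows, ["courseCode", "studentId"], student_attributes)
part_of_key = determines(rows, ["studentId"], student_attributes)
other_part = determines(rows, ["courseCode"], student_attributes)

print(whole_key, part_of_key, other_part)
```

Since studentId alone already determines name, program and department, those attributes depend on only part of the key, which is the partial dependency that 2NF removes.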
Third normal form (3NF)
A table is in 3NF if and only if both of the following conditions hold:
It is in 2NF
Every non-key attribute of the table is non-transitively dependent on
every key of the table
A transitive dependency is a functional dependency X → Z that holds
only indirectly, because
X → Y (X determines Y)
Y → Z
(where it is not the case that Y → X)
3NF helps to eliminate data not dependent on key.
Normalise the example table to 3NF
In student , id → program and program → department, but program ↛ id, so department depends on the key id only transitively.
We can remove the transitive dependency by separating the data into two
separate tables:
student :
id name program
202054321 Hermione BSc Economics
202101010 Ginny BSc Data Science
202100000 Dobby BSc Actuarial Science
202124680 Harry BSc Data Science
202012345 Ron BSc Data Science
program :
name department
BSc Economics Economics
BSc Data Science Statistics
BSc Actuarial Science Statistics
Normalised example table (3NF)
For 3NF we have the following tables:
course(code, name, quota), primary key: code
registration(courseCode, studentId), primary key: (courseCode, studentId)
student(id, name, program), primary key: id
program(name, department), primary key: name
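As a rough sketch of how the 3NF design could be declared, the code below builds the four tables in an in-memory SQLite database using Python's sqlite3 module. The column types and the foreign key constraints are assumptions made for this illustration, since the slides give only the table outlines. It then joins the tables back together and shows that a registration for an unknown student is rejected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(
    """
    CREATE TABLE course (code TEXT PRIMARY KEY, name TEXT, quota INTEGER);
    CREATE TABLE program (name TEXT PRIMARY KEY, department TEXT);
    CREATE TABLE student (
        id INTEGER PRIMARY KEY,
        name TEXT,
        program TEXT REFERENCES program(name)
    );
    CREATE TABLE registration (
        courseCode TEXT REFERENCES course(code),
        studentId INTEGER REFERENCES student(id),
        PRIMARY KEY (courseCode, studentId)
    );
    """
)
conn.executemany("INSERT INTO course VALUES (?, ?, ?)", [
    ("ST101", "Programming for Data Science", 90),
    ("ST115", "Managing and Visualising Data", 60),
    ("MA214", "Algorithms and Data Structures", 30),
    ("ST207", "Databases", 30),
    ("ST310", "Machine Learning", 60),
    ("ST311", "Artificial Intelligence", 60),
    ("ST445", "Managing and Visualising Data", 90),
])
conn.executemany("INSERT INTO program VALUES (?, ?)", [
    ("BSc Economics", "Economics"),
    ("BSc Data Science", "Statistics"),
    ("BSc Actuarial Science", "Statistics"),
])
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [
    (202054321, "Hermione", "BSc Economics"),
    (202101010, "Ginny", "BSc Data Science"),
    (202100000, "Dobby", "BSc Actuarial Science"),
    (202124680, "Harry", "BSc Data Science"),
    (202012345, "Ron", "BSc Data Science"),
])
conn.executemany("INSERT INTO registration VALUES (?, ?)", [
    ("ST101", 202054321), ("ST101", 202101010), ("ST101", 202100000),
    ("ST115", 202054321), ("ST115", 202101010),
    ("MA214", 202124680), ("MA214", 202012345),
    ("ST207", 202124680), ("ST207", 202012345),
])

# Joining the normalised tables recovers the wide 1NF registration rows.
rebuilt = conn.execute(
    """
    SELECT r.courseCode, s.id, s.name, s.program, p.department
    FROM registration AS r
    INNER JOIN student AS s ON s.id = r.studentId
    INNER JOIN program AS p ON p.name = s.program
    ORDER BY r.courseCode, s.id;
    """
).fetchall()

# A registration that refers to a non-existent student violates the foreign key.
try:
    conn.execute("INSERT INTO registration VALUES ('ST101', 999999999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(len(rebuilt), rebuilt[0], rejected)
```

With this design each program's department is stored in exactly one row of program, so changing it requires a single update.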
Reasons not to normalise the data
Normalisation requires additional tables
It also requires the effort and knowledge to do so
Each time data are added to the database, they need to be "disassembled"
into different tables
Queries may require a lot of joins, which makes them more complicated
and possibly slower
Again, it also requires more effort and skills to do so
Normalisation may not be as important if the database is used for reporting
and data analysis rather than supporting daily operations. A dimensional
approach may be preferred
Summary
Introduction to databases.
How to normalise data
How to use SQL to
create tables
insert data
query required data (and answer questions)
NoSQL
Preview of workshop
Tips on the first summative problem set
Query with sqlite3
Subquery
Reading resources
Lake, Peter. Concise Guide to Databases: A Practical Introduction.
Springer, 2013. Chapters 4-5, Relational Databases and NoSQL databases
Nield, Thomas. Getting Started with SQL: A hands-on approach for
beginners. O’Reilly, 2016
SQLite documentation
