
COMP101
Lecture 1 - Information Theory I
Information systems embrace information and communication technology (ICT)
and involve collecting, processing, storing, and acting on information.
Components of information systems include software, hardware,
telecommunications, people, procedures, and databases.

Information is, as defined by the OED, "knowledge communicated concerning some particular fact, subject, or event; that of which one is apprised or told; intelligence, news." It forms the input for decision making. As information provides insight into decision making, the importance/size of information can be measured in terms of 'surprise'. A rare state is more surprising, and information surprise is related to the inverse frequency of occurrence of its states.

A common state (high p) is less informative, so surprise is low. A deterministic state (p = 1) is completely uninformative, so surprise is 0. An impossible state (p = 0) is undefined. As information must have variation, the lowest number of states you can have is 2. This smallest unit was given a name: the binary digit (bit).

A bit can represent two states (0,1). To represent more than two states, multiple
bits can be chained together. A bit is the basic unit of information, and the ‘size’ of
information is the number of bits required to represent all of its states.
Lecture 2 - Information Theory II
Size is often measured as expected surprise. For a source of information with N states, the surprise of each state i can be measured as:

$$S_i = \log_2 \frac{1}{p_i} = -\log_2 p_i$$

The expected/average surprise that we get from this source is:

$$S = \sum_{i=1}^{N} p_i \times S_i = -\sum_{i=1}^{N} p_i \log_2 p_i$$

The value S is called the Shannon entropy and is measured in bits. This means that a two-state system where p = 0.5 will require 1 bit. The more random the states are, the more bits will be required. If states are equally likely ($p_i = \frac{1}{N}$) then Shannon entropy is maximised and S simplifies to $S = \log_2 N$.

Calculating entropy is useful as it enables further calculation of storage requirements, computational effort, and decision rules in machine learning, and acts as a basis for file compression.
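A minimal sketch of the entropy calculation in Python; the probability distributions below are illustrative values, not examples from the lecture:

import math

def entropy(probabilities):
    # Shannon entropy in bits: S = -sum(p_i * log2(p_i)), skipping p = 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin needs 1 bit; a biased coin carries less information.
print(entropy([0.5, 0.5]))             # 1.0
print(entropy([0.9, 0.1]))             # ~0.469
# Entropy is a lower bound; whole bits are needed in practice.
print(math.ceil(entropy([0.25] * 4)))  # 4 equally likely states -> 2 bits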

To calculate how many bits are required to store a value in memory, one must remember that entropy defines the lower bound on the number of required bits, yet fractions of a bit cannot exist - they must be rounded up to the next whole number. As entropy defines an exponential scale of work (each additional bit doubles the number of states in the system), entropy can be used to measure the difficulty or required work for a problem.

(Figure: decision-rule example - if balance is <943, or balance is between 1612.5 and 1652.5, then we assume default is "no".)

Entropy is used in decision making by partitioning data into smaller sets and using the average outcome to inform the decision. Splitting data this way results in a loss of entropy (and therefore a gain in information). If there are multiple ways to split the data, split it in the way that results in the largest difference in bits; this method forms the basis of decision tree machine learning, as sketched below.
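A small Python sketch of that idea; the balances, default labels, and candidate thresholds are made-up illustrative values, not data from the lecture:

import math

def entropy(labels):
    # Entropy (in bits) of a list of class labels.
    if not labels:
        return 0.0
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(values, labels, threshold):
    # Bits of entropy lost by splitting the data at `threshold`.
    left = [lab for val, lab in zip(values, labels) if val < threshold]
    right = [lab for val, lab in zip(values, labels) if val >= threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

balances = [500, 900, 1200, 1700, 2000, 2500]
defaults = ["no", "no", "yes", "no", "yes", "yes"]
# Pick the split with the largest entropy drop (information gain).
best = max([800, 1000, 1500, 1800], key=lambda t: information_gain(balances, defaults, t))
print(best)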
The efficiency of an encoding can be measured as $S / \#\text{bits}$. Compression decreases the storage required: because entropy and probability are closely related, it is possible to give each state a potentially different number of bits to encode, determined by the frequency of the state. The encoding is determined by a Huffman tree, built as follows (a code sketch follows the steps):

1. Add every state to empty set T.

2. While there are multiple "trees" in set T:

a. Remove the lowest probability tree from set T and call this tree A.

b. Remove the lowest probability tree from set T and call this tree B.

c. Make new tree C by joining A & B and equate its probability to p(A)+p(B).

d. Add the new tree C to set T.

3. Return the only tree in set T as the Huffman Tree.
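A minimal Python sketch of the procedure above; the sample string, the use of symbol counts in place of probabilities, and the 0/1 labelling of left and right branches are illustrative choices:

import heapq
from collections import Counter

def huffman_codes(message):
    # Set T: one single-node "tree" per state, weighted by its frequency.
    counts = Counter(message)
    heap = [(freq, i, {"symbol": sym}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)  # tie-breaker so the heap never compares dicts

    # While there are multiple trees in T, join the two least probable.
    while len(heap) > 1:
        freq_a, _, a = heapq.heappop(heap)
        freq_b, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (freq_a + freq_b, next_id, {"left": a, "right": b}))
        next_id += 1

    # Walk the single remaining tree: left edges emit 0, right edges emit 1.
    codes = {}
    def walk(node, prefix):
        if "symbol" in node:
            codes[node["symbol"]] = prefix or "0"  # single-symbol edge case
            return
        walk(node["left"], prefix + "0")
        walk(node["right"], prefix + "1")
    walk(heap[0][2], "")
    return codes

print(huffman_codes("abracadabra"))  # frequent symbols get shorter codes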

Huffman codes are rarely used in practice, but variants are used in zip compression, JPEG files, etc.
Lecture 3 - Data Modelling I
A database is a collection of persistent data that is used by the application system
of a given enterprise. These applications typically work by users sending queries
to the database and receiving results - the only direct human to database
interaction is usually done by a database administrator. Database administrators
maintain uniformity of data to provide consistency and reduce duplication. It
allows data to exist independently of an application and thus persist if the application is closed. Database management systems maintain security and integrity.

Storing tabular data in a flat file has limitations: it increases the potential for error
when entering & removing data, and can lead to significant duplication. This can
be fixed by creating a separate table and relating the two tables through some
variable.
Because data is all connected, databases must also be able to model these
connections and relationships. All data in a database must first be modelled by
representing entities of interest, identifying the attributes of each entity, and by showing the relationships between various entities. Attributes of an entity must hold a single value of a well-defined type, and hold a property of the entity.
Lecture 4 - Data Modelling II
Attributes are a single property of an entity. In an Entity Relationship Diagram (ERD, a common type of data model), they are represented in compartments beneath the entity's name. Attributes can be a "unique ID", optional or required, and must hold a single value.
ERDs can be thought of as a design for the database.

❗ Think of an entity like a class object, and data items as instances of that object.

Entities usually have a unique identifier - a combination of one or more attributes that uniquely define an instance. Notated in an ERD, they are often shown in a separate compartment.

Relationships between entities are drawn as lines. A relationship cardinality describes how many instances of one entity can be related to instances of another entity; they can either be one to one (1:1), one to many (1:M), or many to many (M:N). The N is deliberately different from the M to indicate that they do not need to equal each other. Participation symbols are also added, either a 1 or a 0, to indicate whether an entity requires another entity to be related to it or not. Details can be added to an M:N relationship by splitting it in separate directions and inserting an associative entity.

Data models are for conceptual design and communication and are usually some form of diagram. A data management system is configured using a schema - a definition of the database's structure in a language the DBMS understands (usually SQL).

In summary, ERDs provide a high level notation for modelling the structure and
relationships between different types of data. They capture database
requirements graphically and are only design artefacts. To create a database,
they need to be translated into a database schema.
Lecture 5 - Structured Query Language I
The relational model of data is the predominant data model for databases. It corresponds closely with ERDs, though the two are distinct concepts. In the relational data model, all information is represented by relations: values within attributes, within tuples, within relations.

Everything is identified by name or value.

All collections are sets of unordered, distinct values.

Unnecessary duplication of data is prevented.

Attributes are “value containers”. Each attribute contains a single value from a
known domain, describing aspects of an object. Tuples are unordered sets of
attributes that group together to describe an object. Relations are unordered sets of tuples that describe a collection of similar objects.
Tuples are uniquely defined by a special set of attributes called keys. A relation
has one primary key - tuples in one relation are connected through foreign keys
that reference the primary key of another relation.

Because of the uniqueness of the relational data model, manipulation and
selection of data can be rigorously defined.

SQL (Structured Query Language) is designed for creating (DDL - Data Definition
Language), managing (DCL - Data Control Language), manipulating and querying
(DML - Data Manipulation Language) databases. The four main data operations
can be remembered with C.R.U.D: create, read, update, and delete.

SQL has various types of schema “objects” (different from objects in Java):

Logical (Schema)   Administrative   Physical
TABLE              USER             INDEX
VIEW               ROLE             CLUSTER
DOMAIN             SCHEMA           TABLESPACE

In general, each has its own CREATE, ALTER, and DROP statement (the
essential DDL statements). The SQL script for creating a table is as follows;
things inside angled brackets should be replaced appropriately. Statements inside
square brackets are optional.

CREATE TABLE <name> (
  <column-name> <data-type>
    [DEFAULT <expression>]
    [<inline-constraint>],
  <column-name> <data-type>
    [DEFAULT <expression>]
    [<inline-constraint>],
  ...,
  [<out-of-line-constraint>, ...]
)

SQL syntax is not case sensitive, however string comparison is case sensitive.
Strings are written using ‘single quotes’ as opposed to “double quotes”.

Whitespace is insignificant however there are still generic conventions that should
be followed.
-- is used for line comments, as opposed to //.

List syntax (a, b, c) is the convention.

It is fairly simple to map an ERD to an SQL table.

ERDs                Relational Model   SQL
Entity type         Relation           Table
Entity instance     Tuple              Row
Attribute           Attribute          Column
Unique identifier   Primary key        Primary key
Relationship        Foreign key        Foreign key

Lecture 6 - Structured Query Language II


There are many SQL conventions developed over the past 50 years, including
use of clear, singular nouns (Employee instead of Employees); use of proper
nouns (capitalised!); replacement of spaces between nouns with underscores;
writing SQL keywords in all caps.

There are many data types that the SQL standard defines. Most database management systems (DBMSs) don't implement all of the standard types, and most DBMSs add their own proprietary non-standard types.

CHAR (<n>)

String with a fixed length of <n>, padded with blanks.

VARCHAR (<n>)

String with a maximum length <n>.

TEXT

Reference type, Character Large Object (CLOB).

SMALLINT < INTEGER < BIGINT

Numerical data types should only be used if calculations are required.

NUMERIC (<p>,<s>), DECIMAL (<p>,<s>)

<p> = precision, the number of significant digits.

<s> = scale, the number of digits allowed after the decimal point.

REAL, DOUBLE PRECISION

Floating point variables are always an approximation and shouldn't be used for accurate calculations.

BOOLEAN

Either TRUE or FALSE, can also be represented by 'yes'/'no', 't'/'f', 'y'/'n', '1'/'0', and 'on'/'off'.

DATE, TIME, TIMESTAMP

DATE has the precision of one day, and ranges from 1/1/4713 BCE →
31/12/294276 CE.

CURRENT_DATE returns the current date.

Database integrity refers to the accuracy and consistency of data. It is important to maintain integrity within the database, as users cannot be relied on to do the right thing - they might type too fast to notice errors, might be inexperienced, may try to bend the rules, etc.

It is laborious to implement integrity checks in the programs that access the database: the checks might be duplicated across multiple programs (redundancy), might be difficult to implement in some languages, and may introduce inconsistency if done per-program. The solution is SQL constraints: statements that impose requirements on data entries and provide a consistent definition.
There are two types of constraints: key constraints (primary or foreign) and check constraints (checking that the supplied value matches the specified criteria). Primary keys ensure each row is retrievable and uniquely defined, and can never contain nulls. Foreign keys provide referential integrity; they reference existing data. Unless overridden by another constraint, foreign keys may contain nulls.

Defined in either CREATE TABLE or ALTER TABLE, their general syntax is below,
followed by an example syntax for creating a primary & foreign key:

CONSTRAINT <name> <details>

CREATE TABLE <name> (


...

CONSTRAINT <name>
PRIMARY KEY (<variable_name>),
CONSTRAINT <name>
FOREIGN KEY (<variable_name>) REFERENCES <table_name> (<variable_name>)
);

Where possible, they should be defined in an empty table.

Some basic column integrity includes making variables not nullable, providing a
default value, ensuring values are unique, or adding check constraints.

Some general check constraint examples include:

-- Relational comparison
CHECK (A > 0)
CHECK ((A IS NOT NULL AND B IS NULL)
    OR (A IS NULL AND B IS NOT NULL))
-- Value within inclusive range
CHECK (A BETWEEN -10 AND 10)
-- Values appear in a specified set
CHECK (A IN ('Larry', 'Moe', 'Curly'))

Lecture 7 - Structured Query Language III


Querying in SQL is based on underlying operators in the relational model. This is
relational algebra - relational “arithmetic” using some set of operators that
produce relations. Some core relational operators are PROJECT (extract a subset
of attributes), RESTRICT (extract a subset of tuples that match specified criteria),
and JOIN (combine related tuples of two relations). There are many other types of
relational operators that aren’t covered in this paper.
The basic SELECT statement in SQL is:

SELECT <columns or expressions>


FROM <table(s)>
[WHERE <condition>];

-- To project all columns:
SELECT *
FROM <table(s)>;

The WHERE statement that restricts data retrieval is irrelevant for INSERT, but
usually essential for DELETE, UPDATE, and SELECT.

Joining tables in SQL produces a result that relies on multiple rows from multiple
tables. There are different types of JOIN operators, such as an INNER JOIN
(which combines the overlapping areas of different tables).

SELECT *
FROM <table1>
INNER JOIN <table2>
USING (<column>);

SELECT *
FROM <table1>
INNER JOIN <table2>
ON (<table1>.<column> = <table2>.<column>);

Any rows in Table2 that contain attributes that lack a match in Table1 will be excluded from the final table, i.e. INNER JOIN shows only matching rows from both tables.

LEFT OUTER JOIN will let the resulting table have matching rows from both
tables plus non-matching rows from the left table, while RIGHT OUTER JOIN will
have matching rows from both tables plus non-matching rows from the right table.
FULL OUTER JOIN will show all rows, and would typically result in a number of
NULL values.
Lecture 8 - Structured Query Language IV
When selecting outputs, there are a number of additional operations that can
make retrieved data easier to manage or process, such as sorting results in
ascending or descending order, returning only unique values, or creating
subqueries.

SELECT DISTINCT *
FROM <table1>;

SELECT *
FROM <table1>
ORDER BY <value1> DESC, <value2>;

SELECT * FROM <table1> WHERE <value1> NOT IN (
    SELECT <value1> FROM <table2>
);

There are a number of aggregate functions that SQL provides, as we often want
to compute some aggregate value of data. Some examples of common aggregate
functions include:

SELECT COUNT (<column1>) FROM <table1>; --counts the number of rows in the column
SELECT COUNT (DISTINCT <column1>) FROM <table1>; --counts total unique values
SELECT SUM (<numeric_column1>) FROM <table1>; --returns the sum of the values
SELECT MIN (<numeric_column1>) FROM <table1>; --returns minimum value
SELECT MAX (<numeric_column1>) FROM <table1>; --returns maximum value
SELECT AVG (<numeric_column1>) FROM <table1>; --returns average value
--The output of an aggregate query can be grouped by column:
SELECT COUNT (*)
FROM <table1>
GROUP BY <column1>;
--Aggregate results can also be filtered with HAVING, like a WHERE clause applied after aggregation
SELECT COUNT (*)
FROM <table1>
HAVING COUNT (*) > 20;

--In relation to the water quality example from our labs:
SELECT Region, count (Catchment_Area)
FROM Site
GROUP BY Region

SELECT Scientist_Num, Scientist.First_Name, Scientist.Last_Name
FROM Sample RIGHT OUTER JOIN Scientist USING (Scientist_Num)
WHERE Scientist_Num NOT IN (
    SELECT Scientist_Num
    FROM Sample LEFT OUTER JOIN Scientist USING (Scientist_Num)
    GROUP BY Scientist_Num
    HAVING COUNT(Scientist_Num) > 0
);

SQL has an order of operations:

JOIN/UNION (create data set) → WHERE (filter unwanted data) → GROUP BY (aggregate data by dimensions) → SELECT… FROM… (select data with possible calcs) → HAVING (filter unwanted data) → ORDER BY (sort results).

When joining two tables, it is good practice to rename different columns of the same name using AS. This lets you assign names to on-the-fly columns.

Complex subqueries can be named using WITH clauses, similar to creating methods in Java. They are useful for referencing a temporary table multiple times in a single query, performing multi-level aggregations (such as finding the average of minimums), performing identical calculations multiple times within the context of a larger query, and as an alternative to creating a view in the database.

--Example of a WITH
WITH Region_Sampled AS (
    SELECT Site_ID, Region, Scientist_Num
    FROM Site INNER JOIN Sample USING (Site_ID)
)
SELECT Scientist_Num, Region, COUNT(*) AS Number_Sampled
FROM Scientist INNER JOIN Region_Sampled USING (Scientist_Num)
GROUP BY Scientist_Num, Region;

If a query is to be reused, a VIEW can be created. This is a stored, named query. The body of a view is a select expression; only the expression is stored, not physical data. VIEWs can be queried the same way "normal" base tables can.

--Example of a VIEW
CREATE OR REPLACE VIEW Dunedin_Samples AS
SELECT *
FROM Sample INNER JOIN Site USING (Site_ID)
WHERE Region = 'Dunedin';

--Querying the VIEW
SELECT *
FROM Dunedin_Samples
WHERE Recorded_On >= DATE '2022-01-01';

Lecture 9 - Database Application Architecture


Clients of databases rarely interact with them directly in raw code like SQL; they will access them with application software that is more user friendly and reduces the potential for data corruption. The link between the user interface and data-processing code and the persistent data stored in databases is provided by middleware. An example of middleware is the code library JDBI (Java Database Interface), which turns what would be large processes for application software into trivial tasks. Middleware also needs to be capable of managing multiple clients at once, so that concurrent access doesn't corrupt data.

e.g. if multiple clients are accessing the same bank details stored in a database, it is the job of the middleware to ensure that any changes made by one client are immediately updated for the other client, or perhaps only allowing one client to access that account at a time.

The Front End

The front end is driven by user experience, such as the user interface (UI), data entry and validation, and presentation of information. If it does any data processing it is usually relatively lightweight. It also encrypts and authenticates information for security. The front end is typically built in one of three ways: a web application (like with JavaScript, allowing it to be used across multiple platforms), a graphical development tool (like Microsoft Access - often platform limited and not ideal for complex applications), or a "Native" application (compiled code, better performance but less portable).

The Back End

The back end is responsible for data management: retrieval, storage, manipulation, and validation. The processing of data is typically heavyweight, and security-wise it is responsible for authentication, authorisation, and auditing. There are many different ways to build the back end: monolithic vs. component architecture, on-premises vs. cloud deployment, the type of server used (web service, workflow management system, custom server application), and the language it's written in (e.g. Java, Python, PHP, Ruby, SQL, JavaScript). Some back end technologies that relate to deployment include Jakarta EE and APEX (Oracle Application Express), while cloud deployment technologies include Amazon Web Services (AWS), Google Cloud, Microsoft Azure, Oracle Cloud, etc.

"Middleware offers programming abstractions that hide some of the complexities of building a distributed application. Instead of the programmer having to deal with every aspect of a distributed application, it is the middleware that takes care of them. Through these programming abstractions, the developer has access to functionality that otherwise would have to be implemented from scratch." - G. Alonso et al., Web Services, 2004

Programming abstraction is a model of some aspect of computing that hides low-level details. This allows programmers to think and program in terms of the high-level model, while the middleware handles tedious details.

SQL middleware works by:

1. Opening a connection to the database

2. Sending SQL statements as strings to the DBMS

3. Passing results back to the application by either:

   a. returning a "result set" that we iterate through

   b. converting the results into data structures that the program can more easily handle

4. Closing the connection to the database.

Impedance mismatches are when applications receive data structures that they aren't able to process directly, such as a Python program receiving tuples of values from SQL software. If the Python programmer wanted to store the data as objects, they would use some form of SQL middleware which can do the translation for the programmer.
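A minimal sketch of those four steps using Python's built-in sqlite3 module as a stand-in for middleware such as JDBI; the Site table and its columns simply echo the lab examples used elsewhere in these notes:

import sqlite3

# 1. Open a connection to the database (an in-memory one for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Site (Site_ID INTEGER PRIMARY KEY, Region TEXT)")
conn.execute("INSERT INTO Site VALUES (1, 'Dunedin')")

# 2. Send an SQL statement as a string to the DBMS.
cursor = conn.execute("SELECT Site_ID, Region FROM Site")

# 3a. Iterate through the returned result set (rows arrive as tuples) ...
for site_id, region in cursor:
    print(site_id, region)

# 3b. ... or convert rows into structures the program handles more easily,
#     which is the middleware's answer to the impedance mismatch.
conn.row_factory = sqlite3.Row
sites = [dict(row) for row in conn.execute("SELECT * FROM Site")]
print(sites)

# 4. Close the connection to the database.
conn.close()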
Lecture 10 - NoSQL Databases & “Big Data”

"Big data should be defined at any point in time as 'data whose size forces us to look beyond the tried-and-true methods that are prevalent at the time.'"

Big data introduces processing challenges, as it is sometimes too slow to process sequentially (using one CPU) and too large to process in one go (it can't fit in memory). Therefore new technologies and database models must be developed to catch up with modern data requirements.
The 4 Dimensions of Big Data:
The 4 Dimensions of Big Data:

Volume

Volume is the scale of the data, often considered big when the scale of data is beyond what can be handled by conventional means; in the early '80s, 100GB required a 'tape monkey' to swap thousands of tapes in and out. Nowadays extremely large datasets are distributed across multiple computers (or even data centres).

Velocity

Velocity is the speed of the data: the requirements for faster processing times (the speed of business) and for data that arrives faster than traditional processing methods can handle. Currently, data is processed and contextualised as it is generated, without storing it all (stream processing).

Variety

Variety is the diversity of data, its multiple forms and its multiple formats. The increased ability to store larger data sets has increased variety.

Veracity

Veracity is the uncertainty of data: whether it contains inconsistencies (conflicting sources), incompleteness (data loss, overload), and noise (erroneous sensor readings). It can be caused by approximations - causing inaccurate results and misinterpretation - and by deliberate tampering and deception.

Another consideration related to big data is the value of the information; e.g.
analysing streaming patient data leads to a 20% decrease in patient mortality.

With this taken into account, relational database management systems (RDBMSs) have several disadvantages for use with big data.

1. Cost: The cost of computers and enormous disc drives able to handle RDBMSs and very large data sets is significantly higher than the cost of buying several 'commodity' servers instead of a single large system.

2. Structure: RDBMSs have a strong assumption that stored data will be tabular and joined, which limits variety, i.e. RDBMSs are not 'natural' for text documents, or video or audio formats. Another problem with their structure is that relational models say the order of data is unimportant, whereas many data sets have an important temporal order and need fast sequential access - velocity.

3. Emphasis on Consistency: RDBMSs strongly emphasise data integrity and
transaction based interaction, and have a strong desire to ensure data
remains consistent.

4. Scalability: There is a strong assumption of sequential access, and hence performance of RDBMS servers tends to drop as the number of concurrent users grows. It is also strongly assumed that the RDBMS is centralised, as distributed RDBMSs are complex to coordinate.

This introduces the dilemma of consistency vs. availability: it is impossible to have the best of both worlds.

Consistency - that reads are always up to date, and any client making a request to the database will get the same read of the data regardless of which data centre they contact.

A reason you might value consistency is if you want multiple clients to have the same view of data. It is highly valued when dealing with financial and personal information.

Availability - that valid requests receive a response when they are made.

Availability would be preferred if data accumulation is the priority of the database - if you want to capture as much information as possible but it isn't critical that the database is constantly up to date, such as with the growing demand for offline applications.

While inconsistency should be avoided, many businesses prefer to compromise consistency in favour of availability (revenue), such as ATM networks. It is also impossible to avoid network outages. These systems opt instead for eventual consistency - all copies will eventually be updated.

NoSQL (not only SQL) databases started in the early '90s, when RDBMSs were unnecessarily complex for many needs, such as social network data where scalability is more important than data integrity. They symbolise a shift away from logically-designed, uniformly queried databases to application-specific data stores.

Key-value store
Key-value stores typically have operations such as put(key, value), get(key) .
Values can be structured records - data is stored as a collection of key-value
pairs where the key is a unique identifier. This simplistic approach allows keys
and values to be anything.
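A toy in-memory sketch of the put/get interface in Python (illustrative only, not any particular NoSQL product; the example key and record are invented):

class KeyValueStore:
    # Minimal in-memory key-value store with put/get operations.
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Keys are unique identifiers; values can be anything, even records.
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

store = KeyValueStore()
store.put("user:42", {"name": "Ada", "region": "Dunedin"})
print(store.get("user:42"))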

Document store
Document stores are souped-up key-value stores with separate 'collections' of data and more flexible ways to query.

Column store

Column stores are more flexible versions of tables. Tables can have an
unlimited number of columns and each row may have values for only some
columns. The data is stored on disk by column or column group rather than
by row, so sequential retrieval is very fast.

Graph databases
Graph databases store graph structures such as social networks directly, and
support graph-orientated queries and algorithms.

The advantages of NoSQL databases are that the storage model that best fits an application can be used, they provide schema flexibility as there is often no pre-defined schema, direct programming interface queries eliminate the impedance mismatch, and they have massive scalability from distribution and parallel processing.
The disadvantages are that because queries are done via direct programming
interface there is usually no general query language, and the programmer must
do a lot of work previously done by the DBMS. There is also often lower data
consistency, because of schema flexibility (no integrity rules) and replication and
distribution (inconsistent copies).
Lecture 11 - The Internet
Many separate applications may need
to read/write data concurrently across
different platforms and users, and over
many different networks. Business
processes also often involve
coordinated processing and passing of
information between applications. This
creates need for computer networks -
two or more computers that ‘talk’ to
each other to share data and
processing. They are typically
represented as centralised,
decentralised, or distributed.

How to build a network:

1. Assemble several computers and networking hardware (network interface cards, switches, routers)

2. Agree on networking protocols (in practice, choose existing ones; the internet uses the TCP/IP protocol stack)

3. Ensure all devices have firmware/software to implement networking protocols

4. Connect hardware with cables/WiFi/Bluetooth

The desirable properties of a network are that it is heterogeneous (computing involves many different operating systems, types of hardware, languages, and applications), decoupled (not tied to a single machine type, software platform, or application vendor), future proofed (open to applications that may be run in the future), and has good performance - high bandwidth & low latency (data transfer rate & communication delays).

The OSI (Open Systems Interconnection) model describes the order that activities are performed in order to transfer data from one terminal to another in a network. Layered models such as this one simplify application development and network design, increase the utility and reusability of a network, and isolate required changes.

Application-level protocols (e.g. HTTP) can communicate "directly" with a remote computer - the dashed arrow is a fiction that allows application programmers to not worry about any other levels.

The physical layer is concerned with electrically or optically transmitting raw &
unstructured data bits across the network; fibre optics and copper wires can be
assisted by network hubs, repeaters, network adapters, and modems.

The data link layer uses directly connected nodes to perform node-to-node data
transfer of data packaged into frames.
The network layer receives frames from the data link layer, and delivers them to their intended destinations based on the headers that were assigned to them, which include logical addresses such as IP (Internet Protocol).
The transport layer regulates delivery and error checking of data packets, managing the size, sequencing, and essentially the transport of data between systems and hosts. A common example is TCP (Transmission Control Protocol).

The session layer controls “conversations” between two computers. This is where
a session or connection is set up, managed, and terminated. It also serves for
authentication and reconnection.
The presentation layer formats and translates data based on the syntax that the
application accepts. It can also handle encryption and decryption.
The application layer, and the end user, interact directly with the software
application. It identifies communication patterns, resource availability, and
synchronizes communication.

The internet is a decentralised and distributed network built upon the ideas of packets of data, routing (addressing), and transmission of data through packet switching. It is independent of the physical media (copper wire, fibre optics, pigeons, etc.) used for transmission. It is built upon the TCP/IP model, which is essentially a rationalisation of the OSI model.

TCP/IP Model        Protocols and Services       OSI Model
Application         HTTP, FTP, Telnet, NTP,      Application, Presentation, Session
                    DHCP, PING
Transport           TCP, UDP                     Transport
Network             IP, ARP, ICMP, IGMP          Network
Network Interface   Ethernet                     Data Link, Physical

Packets are smaller chunks of messages sent between internet nodes - their aim is to help the internet adapt to faults and changing loads on each link. Packets are sent separately and may take different routes between sender and recipient. Packets all have headers (assigned in the transport & network layers) which detail the source and destination, sequence number, and checksum info. The destination host then reassembles the message from packets in the opposite order that they were made (network interface → application). Received packets are acknowledged; if they go unacknowledged then they are resent.

(Figure: packet switching of three packets that all take the most desirable route at the time they're sent, determined by routing rules, network load, status, etc. A direct connection between the source and destination would be circuit switching.)

TCP (Transmission Control Protocol) deals with breaking down data into packets and providing instructions to reassemble the original data, as well as ensuring that all data is received. IP (Internet Protocol) uses a "best effort" model to deal with the addressing of machines and the routing of packets.

An example application-level protocol is DNS (Domain Name System). Every node on the internet has a numeric address (IP address), which is all that is needed for the internet to work. Domain names are human-friendly computer and network names that people can remember, e.g. otago.ac.nz. Domain names are hierarchical and read backwards (i.e. read .nz before .ac before otago).

The internet is open and robust. It is open because TCP (and UDP - less reliable but faster transfer time) provides a standard method of data transmission and nobody can stop new application-level protocols being developed. It is robust as IP defines rules for routing, and as long as a destination exists, there is almost certainly a route to get there. Because pathways between nodes are typically small and incredibly numerous, it is protected against random node failures.
Lecture 12 - Human-Computer Interfaces

Effective implementation of Human-Computer Interaction (HCI) requires extensive
interdisciplinary co-operation; computer and information scientists, psychologists,
designers, technical writers, ergonomics experts, domain experts, etc. This is all
done in the name of usability:

"The extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context."

Usability is evaluated by assessing the extent of the system's functionality, the effect of the interface on the user, and by identifying specific problems with the system. The evaluation method should be chosen with respect to factors such as the current stage of the system (design/implementation), the level of subjectivity, the type of measurement (qualitative/quantitative), the immediacy of the response, and the resources required.
One common method used is heuristic evaluation. This is formed from the argument that it usually only takes 5 evaluators to find 75% of the overall usability problems. Some heuristics that can be used to guide evaluators include:

1. Visibility of system status

2. Match between system and real world

3. User control & freedom

4. Consistency

5. Error prevention

6. Recognition > recall

7. Flexibility & efficiency of use

8. Aesthetic and minimalist design

9. Help & documentation

Some design guidelines include putting priority items where people will look at them first, or putting text in prime reading positions - 'above the fold' on a newspaper.
Some general principles outlined in 2000 by Wickens and Holland for getting the user's attention include:
Two levels of intensity with a limited use of high intensity

Markings underlined, boxed, or indicated with arrows

Up to 4 different sizes (with the largest being the most attractive)

Up to 3 different fonts

Blinking used with great care and between 2-4 Hz

Up to 4 standard colours
Lecture 13 - Security I
Security and Information Assurance covers multiple facets of information security:

Integrity - correctness and consistency of information

Availability - accessibility of information

Authenticity - trusted identification (only available to those who should have access)

Non-repudiation - trusted provenance

Confidentiality - protection of sensitive information

Information assurance is more than protecting against ‘attacks’. Data can also be
compromised by natural disasters, physical access, policies, and human error. It
is easily overlooked and misrepresented by the media with an overemphasis on
“hacking”. Information assurance provides a framework that focuses on the whole
process of information security (physical, policy & education, software
engineering, backups and redundancy, logging and monitoring).

Many tools in an information assurance framework are underpinned by encryption and hashing. Hashing is the conversion of a message into a fixed-length string that should in theory be irreversible; encryption converts a message into a form that can only be decoded by someone holding the corresponding key.
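A quick illustration of one-way hashing using Python's hashlib; the messages are invented for the example:

import hashlib

# Hashing is one-way: the same input always gives the same digest,
# but the digest cannot feasibly be turned back into the input.
message = "transfer $100 to account 123"
print(hashlib.sha256(message.encode("utf-8")).hexdigest())

# Any change to the message produces a completely different digest,
# which is what makes hashes useful for integrity checks.
print(hashlib.sha256(b"transfer $900 to account 123").hexdigest())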

Symmetric (shared key) encryption allows all users to use the same key to
encrypt and decrypt a message, the key being securely transmitted to all users
before it can be used. The problem with this is that since the sender and receiver
are required to have the same key, sharing the key becomes a central
vulnerability. The solution is to generate new keys through a known method,
allowing secure exchange over an unsecured medium.
Asymmetric (public key) encryption gives all users two keys: a public key used to encrypt messages and a private key needed to decrypt - only the public key needs to be transmitted. A message can also be "signed" with the private key to prove that it was sent by that user.

The drawbacks of encryption are that it is computationally expensive, adds a layer of conceptual complexity, is mathematically complex to understand & develop from scratch (hence the use of libraries), and instills users with a false sense of confidence.
Lecture 14 - Security II
Email privacy is somewhat of a myth - it's not just you that has access to your email, but also: your internet service provider, the receiver, the receiver's internet service provider, backbone providers, your employer/agency, the NSA (and any other well-funded government/intelligence organisation), Microsoft/Google etc., and third party companies served by Microsoft and Google.

Denial of service (DoS) attacks saturate the target machine in an attempt to consume available bandwidth, which can deny its ability to serve legitimate requests. Once identified, a DoS attack can be easily blocked by filtering offending requests from the source.
A distributed denial of service (DDoS) attack often uses computers infected with malware to form a "botnet" and causes thousands of coordinated attacks, making it very hard to filter requests without compromising legitimate users.
SQL injection attacks introduce malicious SQL commands, allowed by improper coding or input checking. They can be avoided with the use of prepared statements, as sketched below.
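A minimal sketch of the difference using Python's sqlite3 module; the table, row, and injection payload are invented for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Scientist (Scientist_Num INTEGER, Last_Name TEXT)")
conn.execute("INSERT INTO Scientist VALUES (1, 'Curie')")

user_input = "Curie' OR '1'='1"  # a classic injection payload

# Vulnerable: the input is pasted straight into the SQL string.
unsafe = "SELECT * FROM Scientist WHERE Last_Name = '" + user_input + "'"
print(conn.execute(unsafe).fetchall())   # returns every row

# Safe: a prepared/parameterised statement treats the input as a value only.
safe = "SELECT * FROM Scientist WHERE Last_Name = ?"
print(conn.execute(safe, (user_input,)).fetchall())  # returns nothing
conn.close()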

Many threats to data integrity are 'human'. A simple fix to many problems is that of authentication - proving that people are who they say they are. Common examples include passwords, tokens, and biometrics. Biometric authentication is fast and convenient, but shouldn't be used in isolation (you can't change biometrics very easily). Passwords are the most common form of authentication, however they have a big problem: people! Human-generated passwords are easier to crack, while long computer-generated passwords are hard for humans to remember.

Password cracking works by obtaining a password list from a server (passwords are typically stored in hashed form); although hashes are one-way, they can be attacked by brute force (on commodity hardware at an alarming rate!). The simple solution to this is to increase the entropy of passwords, by doing things such as using a large alphabet, random passwords, password salting, and a computationally-intensive hash.
Another common and effective method of securing information is the use of multi-factor authentication, as the chance of all factors being compromised simultaneously is very low.

"All computing involves a trade-off between effectiveness and acceptable risk."
Lecture 15 - Data Science I
Tacit knowledge is the instinctual decision making that comes from experience and competence; because it is difficult to codify/describe, it is difficult to replicate. Explicit knowledge aims to extract patterns from data and use them to inform decisions.

Data science relies on the integration of appropriate tools (statistics, machine learning and computational techniques) into information systems to improve decision making.

Artificial intelligence is understood as a collection of interrelated technologies used to solve problems that would otherwise require human cognition.

Narrow AI: the ability for a machine to behave intelligently on a single task, without flexibility to adapt to new tasks. Evaluated through performance on specific tasks using standard data sets.

General AI: the ability for a machine to learn from previous experience, adapt to new domains, and react intelligently in previously unseen environments. Evaluated typically through the Turing test, or by evaluating its ability to perform different, unrelated tasks.

Rule-based AI encodes human knowledge into rules, and computers then use these rules to make decisions. This was the only form of AI from the 50s to the 80s.

The next approach is evolutionary computation, where principles of biological evolution are used to search for the best solutions to a problem.

Statistical and machine learning gets labelled training data, which trains a neural network by adjusting network "weights" to reduce error, and is then able to classify new data.

Pros of AI:

Improved efficiency of business, industry, healthcare, etc.

Better chances of handling crucial challenges to society

Human/AI partnerships

Cons of AI:

Widespread job loss

Biased output due to biased training data

Loss of privacy

Increased automation of decision-making in social services.
Lecture 16 - Data Science II


Learning from data is an inductive process; in inductive reasoning, one makes a
series of observations and infers a new claim based on them. Machine learning
today is mostly about the process of moving from specific observations to general
predictive models - probabilistic in nature. Data is the lead determining factor in
machine learning as (in a perfect world) data doesn’t lie. It ideally removes human
bias from decision making and allows more complex modelling than by our
senses and limited reasoning capabilities alone, potentially uncovering valuable
insights about a problem that aren’t immediately apparent to us. There are 4
primary modes of machine learning that take place:

Supervised learning: data examples are provided with the desired output (labels), which trains a model to predict the outcome for unseen instances. Supervised learning is either regression (predicting a value) or classification (predicting a label).

Unsupervised learning: data examples are presented without labels, which trains a model to group related items in a data set into clusters of similar items.

Self-supervised learning: data examples are provided with existing labels that are implicit or explicit in the data; this requires training the model on a different task to the real application.

Reinforcement learning: the AI system acts in the world and earns "rewards", and learns which actions lead to the most rewards.

For learning to happen, computers need a set of data items, a predictive model that can be adjusted to fit the data, a measure of how well a specific model predicts outcomes, and an algorithm that adjusts the model until the error is minimised.

Issues presented by machine learning can include the data quality (a model is
only as good as the data used to train it) and quantity (lots of data is required),
and for supervised learning they also require high quality labels.

Model generalisation is important to control the model complexity: if it is too simple, the model is a poor fit, while if it is too complex, the model is too closely fitted to the training data and will perform poorly with new data.

Lecture 17 - Data Representation I
Binary representation is denoted in subscript: $1011_{10}$ is a decimal number and $1011_2$ is a binary number. When working with data types, operations are limited to fixed-length binary strings, so smaller results are 'padded', e.g. $15_{10}$ in 8-bit binary is $00001111_2$. Careless use of fixed-length binary representation can still cause problems, however; the abstractions that mean we don't interact directly with binary are there for convenience, not ignorance.
In order to also store the states of negative numbers, we use two's complement coding: the initial value of a fixed-length binary representation indicates if the number is positive or negative. The left-most value is hence called a sign bit.

2's Complement Encoding:

1. Obtain the corresponding positive encoding

2. Left pad the result with 0's until n bits long

3. Flip the bits in the encoding

4. Add 1 to the result of the flipped bits

E.g. to encode $-100_{10}$ in 8 bits:

1. $100_{10} = 1100100_2$

2. Left pad result $= 01100100_2$

3. Flipped $= 10011011_2$

4. Add 1 $= 10011100_2$

2's Complement Decoding:

1. Flip the bits in the encoding

2. Add 1 to the result of the flipped bits

3. Convert to decimal form

4. Add negative sign

E.g. to decode $101_2$:

1. Flip bits $= 010_2$

2. Add one $= 011_2$

3. Decimal $= 3_{10}$

4. Negative sign $= -3_{10}$

COMP101 27
Provided there are enough bits, it is a straightforward process. It is important to
ensure that enough space is left over for the sign bit.

Subtraction in 2's complement is simply the addition of a negative quantity (A − B = A + (−B)); therefore for subtraction, compute the 2's complement of the number to be subtracted and then perform addition.
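A small Python sketch of the encode/decode steps above, plus subtraction as addition of a negative (assuming 8-bit values; the numbers are illustrative):

def encode_twos_complement(value, n):
    # Encode a (possibly negative) integer as an n-bit two's complement string.
    if value >= 0:
        return format(value, "0{}b".format(n))            # left-pad with 0s
    positive = format(-value, "0{}b".format(n))           # steps 1-2: positive encoding, padded
    flipped = "".join("1" if b == "0" else "0" for b in positive)  # step 3: flip bits
    return format(int(flipped, 2) + 1, "0{}b".format(n))  # step 4: add 1

def decode_twos_complement(bits):
    # Decode an n-bit two's complement string back to a signed integer.
    n = len(bits)
    value = int(bits, 2)
    return value - (1 << n) if bits[0] == "1" else value

print(encode_twos_complement(-100, 8))     # 10011100
print(decode_twos_complement("10011100"))  # -100

# Subtraction is addition of a negative: A - B == A + (-B), modulo 2^n.
a, b = encode_twos_complement(15, 8), encode_twos_complement(-9, 8)
total = (int(a, 2) + int(b, 2)) % (1 << 8)
print(decode_twos_complement(format(total, "08b")))  # 6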

Overflow is when the sign bit is flipped and its interpretation is changed upon being decoded, caused by an operation producing a result that is outside the range of values defined by the representation. Overflow is a common source of computer bugs. Overflow can typically be managed by using more bits to store numbers or by preventing it programmatically.

The range of positive and negative numbers that can be represented using 2's complement is, given $n$ bits, positive: $0$ to $2^{n-1} - 1$ and negative: $-1$ to $-2^{n-1}$.

Lecture 18 - Data Representation II


One method of representing real numbers is fixed-point representation. This uses
a predetermined number of symbols for fraction and integer parts, though has
significant drawbacks: it has a limited scale, the radix point doesn’t “float” (leading
to an inefficient use of bits) and it can’t represent all fractions (though no
representation can).
Floating point numbers are where the radix point floats. By representing the
scientific notation of numbers, the scale can be shifted from very large to very
small using the same fixed-length representation. It is the normalised form for
storage, though lacks precision for arithmetic purposes.

Standardised floating-point encoding is called IEEE 754, and it works through encoding three components of a number in scientific notation: the sign (s), the exponent (e), and the coefficient (c), from the equation $(-1)^s \times 1.c \times 2^{e}$, where the exponent field stores $K + e$.

The sign of the mantissa (s) is simply a 1 to indicate a negative value or 0 to indicate a positive one.

The biased exponent is created to allow for both positive and negative exponents: the bias (K), which depends on the precision of the encoding, is added to the exponent (e).

E.g. an exponent of 8 with single precision will be the binary version of $135_{10}$ (because K for single precision floating points is 127, and 127 + 8 is 135).

An exponent of -3 with half precision will be the binary version of $12_{10}$ (because K for half precision floating points is 15, and 15 - 3 is 12).

The normalised mantissa (c) consists of the number's significant digits after normalising so that only a single 1 sits to the left of the radix point.
Floating-point encoding:

1. Make note of the sign of the number to determine s

2. Convert decimal to binary

3. Normalise the binary, taking note of the exponent for c and e

4. Add K to e, converting into binary

5. Combine into result, following required precision

E.g. to encode $263.3_{10}$ in IEEE 754 single precision:

1. Number is positive so s = 0

2. Number in binary = 100000111.010011001100…

3. Scientific notation = 1.0000 0111 0100 1100 1100… $\times 2^8$

   a. e = 8, c = 0000 0111 0100 1100 1100

4. $8_{10} + 127_{10} = 135_{10} = 10000111_2$

5. s + e + c = 0100 0011 1000 0011 1010 0110 0110 0110
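A quick way to check that working is to ask Python's struct module for the IEEE 754 single-precision bits of 263.3 and split them into sign, exponent, and mantissa:

import struct

# Pack 263.3 as an IEEE 754 single-precision float, then pull the 32 bits
# back out as an unsigned integer so they can be printed.
packed = struct.pack(">f", 263.3)
bits = format(struct.unpack(">I", packed)[0], "032b")

sign, exponent, mantissa = bits[0], bits[1:9], bits[9:]
print(sign)                        # 0 (positive)
print(exponent, int(exponent, 2))  # 10000111 = 135 = 127 + 8
print(mantissa)                    # normalised coefficient (23 bits)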

As the scale increases with floating-point notation, the precision decreases. Single increments in value depend upon the scale of the number (which is determined by the exponent).

Some encoded values are reserved for indicating exceptions:

Infinity: sign = 0, exponent is all 1's, coefficient = 0

-Infinity: sign = 1, exponent is all 1's, coefficient = 0

NaN: exponent is all 1's, coefficient isn't 0


Because IEEE 754 is a fixed-precision representation, it needs rules for rounding in case of excess bits; the default rounding is "to nearest, ties to even". Rounding applied during arithmetic can yield confusing results (e.g. in double precision, $0.1_{10} + 0.2_{10} \neq 0.3_{10}$).

The primary problems with floating-point representation are that rounding errors can propagate with arithmetic and that it has both a positive and a negative 0.
Lecture 19 - Data Representation III
In order to make binary structures more comprehensible to humans, they were historically rewritten in octal form, but more common nowadays is hexadecimal. This is because it gives fewer symbols to manipulate or interpret - 2 hexadecimal digits are able to represent 1 byte.

Non-numeric information (text, time, sound, colour, images) is encoded by identifying unique information states and representing each state with a number. The way this representation is ordered is often optimised as much as possible to understand the information better and speed up processing times.

ASCII (American Standard Code for Information Interchange) maps 128 characters into a 7-bit encoding, with a particular arrangement to make certain operations quick and efficient - e.g. replacing a lowercase letter with its corresponding capital only requires one digit to be changed.

Text is encoded with ASCII through sequences of characters called strings, often concatenated together to form an array. Drawbacks of ASCII are that it is tied to the English language and is only a 7-bit encoding (8 with extensions for other Latin-script languages). This is solved with Unicode.

Unicode has 1,112,064 assignable code points, often represented in hexadecimal as 4 or more digits and prefixed with U+.

UTF-8 encoding is used by 98% of the web and is backwards compatible with ASCII. It encodes each code point in a variable-length format (between one and four bytes), embedding the length of the representation in the data through special sequences. The left-most digits of the first byte indicate the length of the encoding in bytes, while the remaining and ensuing free digits are packed with the left-most digits of the code point.
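A short Python illustration of the variable-length encoding (the sample characters are arbitrary and cover the one- to four-byte cases):

# UTF-8 uses one to four bytes per code point and is ASCII-compatible.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, "U+{:04X}".format(ord(ch)), len(encoded), "byte(s):",
          " ".join(format(b, "08b") for b in encoded))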

One solution for encoding dates is as a single integer, representing the number of microseconds away from a reference point (called the epoch), 01-01-1970 00:00:00.
Another solution is the use of floating point numbers, with the fractional part representing time and the integer part representing the days since the epoch.

Issues with temporal representations include the lack of a universal way to encode dates, the problem of integer overflow (Y2K, Year 2038 Problem), timezone and calendar coordination, and value drift (e.g. leap years, leap seconds).

Audio is based on pulse code modulation (PCM), which samples sound at regular intervals with a quantisation. Each sample is represented by an n-bit signed integer representing $2^n$ quantisation levels.

Colours are encoded with 32-bit RGB-α colour encoding. It has 256 levels per colour channel, which leads to ~16.8 million discrete colours. The α channel represents the colour's transparency.
Lecture 20 - Algorithms I
Algorithms in computing are a precise set of instructions to perform a calculation
or logical decision. They are fundamental to all computing tasks as they form a
bridge between human creativity and logical processes.
As defined by Donald Knuth, there are five properties of algorithms:

Finiteness

An algorithm must always terminate after a finite number of steps. This can
be difficult to prove.

Definiteness

Each step of an algorithm must be precisely defined; the actions to be carried
out should be rigorously and unambiguously specified for each case.

Input
Inputs are quantities given initially before the algorithm begins or dynamically
as it runs.

Output

Output quantities have a specific relation to the inputs.

Effectiveness
All of the operations performed in an algorithm should be sufficiently basic so
that they can in principle be done exactly and in a finite amount of time by a
person using paper and pencil.

Other key properties include "correctness", efficiency, and intuitiveness. Correctness is an algorithm's ability to solve a problem in all of its cases. Efficiency is an algorithm's complexity in relation to space and time, and the trade-off between those two. Intuitiveness is how easy an algorithm is to understand - in the choice between two algorithms for completing the same work, one should choose the simpler approach.

Algorithms are specified for humans in the form of diagrams, natural language, and pseudocode. They are specified for machines in precise notation as programming languages.

Pen and paper testing uses a table to record changes to internal data states over
subsequent iterations, starting at time zero.
Lecture 21 - Algorithms II
Common algorithmic activities are based on three types of control structures:
performing tasks (sequencing), selecting tasks (branching), and repeating
sequences (iteration). The structured program theorem states that all algorithms
can be implemented through a combination of branching and iterating over
sequences.

When comparing algorithm performance, it is important to take into account factors that complicate comparison: the machine being used (CPU speed), the programming language, the platform version, the programmer's experience, and any implementation 'tweaks'.

Algorithm performance shouldn't be measured empirically (with a stopwatch) as this doesn't account for variables in implementation and hardware. It is usually measured by analogy - the number of units of work performed and the number of objects manipulated.

Sorting algorithms are used to solve the problem of requiring information in a specific order. Along with bubble sort, insertion sort is a common example: it iterates, consuming one input element each repetition and growing a sorted output list (see the sketch below).
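A minimal Python sketch of insertion sort as described above (the sample list is arbitrary):

def insertion_sort(items):
    # Sort a list in place by growing a sorted prefix one element at a time.
    for i in range(1, len(items)):
        current = items[i]           # consume one input element per iteration
        j = i - 1
        while j >= 0 and items[j] > current:
            items[j + 1] = items[j]  # shift larger elements right
            j -= 1
        items[j + 1] = current       # insert into the sorted prefix
    return items

print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]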
Lecture 22 - Algorithms III
As long as an algorithm is effective, it should be able to be exchanged for another. For multiple algorithms that serve the same purpose, the most effective one should be selected. Each instruction in an algorithm performs some work on some things. The most effective algorithm accomplishes the same task by doing less work on fewer things (taking less time while using less memory).

A standard framework for the comparison of algorithm efficiency is Big O notation. It answers the question "How do the space and time requirements grow as the input size (N) grows?".

The Big O for common sorting algorithms are:

Bubble sort: $O(N^2)$

Insertion sort: $O(N^2)$

Merge sort: $O(N \log_2 N)$

These sorting algorithms will have a best, worst, and average case complexity. Big O notation typically describes the worst-case complexity, while oftentimes we want to describe an algorithm's average or best-case complexity.

Algorithms should ideally be reduced to binary decisions, where 1-of-N decisions can be reduced to $N - 1$ binary decisions.
