Data Analytics
1. Data: facts, observations, perceptions. Lowest level of abstraction from which information and then
knowledge are derived.
2. Information: a subset of data that possesses context, relevance and purpose. Middle level of abstraction.
5. Analysis of data: inspecting, cleaning, transforming and modelling data, with the goal of highlighting useful information, suggesting conclusions and supporting decision making
6. Data mining: process of extracting information from large databases and using it to make decisions.
KDD – Knowledge Discovery in Databases. Methods:
a. Data Gathering
b. Data preparation and cleansing: detect, correct errors and inconsistencies to ensure data
quality.
d. Visualisation
10. Clustering: summarise data for a better overview by forming groups of similar cases
11. Association: find correlations or associations that describe the interdependence of attributes
12. Visualisation: the use of computer-supported, interactive, visual representation of abstract data to
amplify cognition. Depends on type of data being viewed and questions we want to ask
e. Uses: explore and gain understanding of dataset, finding patterns and oddities, generating
further questions.
a. Project understanding
b. Data understanding
c. Data preparation
d. Modelling
e. Evaluation
f. Deployment
c. nD (3+ dimensions): scatterplot matrix, parallel coordinates, radar plot, star plot, MDS
d. Planning: maps
16. Histogram number of bins: Sturges’ rule k = ⌈log₂(n) + 1⌉, where n = sample size. Histograms show the
overall shape of a variable’s distribution, but outliers are hard to find. Patterns are visible in the heights of the bars.
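A minimal Python sketch of Sturges’ rule (the sample sizes here are made up):

```python
import math

def sturges_bins(n: int) -> int:
    """Number of histogram bins by Sturges' rule: k = ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)

print(sturges_bins(100))   # 8 bins for 100 observations
print(sturges_bins(1000))  # 11 bins for 1000 observations
```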
17. Boxplots show centre and range, and individual outliers are clearly shown
18. Data understanding: determine quality of data, outliers, missing values, dependencies and correlations,
compare statistics to expected behaviour
19. Data exploration: first stage of data analysis. Aims: understand data, find range and limitations, check
that data is consistent, understand relationships
25. Stevens’ Power Law: perceived sensation grows as a power of stimulus intensity, S = Iᵖ (the exponent p
depends on the visual channel, e.g. ~1 for length, <1 for area). Trueness = how close human perceptual
judgement is to some objective measurement of the stimulus
26. JND – Just Noticeable Difference: the smallest detectable difference between 2 stimuli. Determines our
ability to understand the data. Depends on size, colour, contrast, texture…
28. Gestalt principles: describe how visual features are grouped; grouping interacts with visual search, salience and context.
a. Simplicity
b. Symmetry
h. Familiarity
i. Proximity
29. Reification: imagined or illusory shapes. Perception is constructed from smaller components.
30. Preattentive visual features: the ability of the low-level human visual system to rapidly identify certain
basic visual properties, e.g. a unique visual property – something red pops out
32. Effectiveness principle: encode the most important information in the most effective way:
c. Separability: can the visual channel be judged independently of other visual channels?
d. Importance ordering: the importance of the attribute should match the salience of the visual
channel
33. Graph: set of vertices (nodes) and edges (links). Typical tasks: finding clusters based on connectivity,
groups based on attributes, and paths between two nodes.
a. Force directed
b. Constraint-based
d. Layered layouts
e. Non-standard layouts
36. Graph drawing issues: scale, computationally expensive, readability is hard – occlusion, running out of
screen space. Solutions: parallel processing on the GPU, larger screens, reduction techniques
(clustering), use of alternative representations (matrix)
37. Matrix: nodes arranged on the x and y axes, with connections marked where rows and columns intersect.
Scales well, no clutter, easy to calculate the layout; outperforms node-link diagrams on most tasks, but is
less intuitive, makes paths hard to follow and requires ordering to show clusters
38. Node-link diagrams: intuitive, good for path-finding tasks and for showing structure in sparse graphs, but
do not scale well
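A small sketch of the matrix representation from item 37, in plain Python (the edge list is a toy graph invented for illustration); the same graph could equally be drawn as a node-link diagram:

```python
# Toy undirected graph as an edge list.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

nodes = sorted({v for e in edges for v in e})
index = {v: i for i, v in enumerate(nodes)}

# Matrix view: rows and columns are nodes, a 1 marks a connection.
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1  # symmetric, since the graph is undirected

for name, row in zip(nodes, matrix):
    print(name, row)  # A [0, 1, 1, 0] ... the layout is trivial to compute
```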
39. Tree: understandable but not very space efficient. Types: node-link, adjacency or tree map
40. Animation to show time: showing images sequentially to convey change over time. Good for transitions,
simple comparisons of adjacent items, storytelling. Not so good for comparisons, exploring data or
when conveying complex systems.
a. Sequential: showing images sequentially to convey change over time. More screen space ->
clutter reduction. Can spot local changes between timescales. Tasks take longer and are less
accurate.
b. Juxtaposed: timelines overlaid, with opacity used to distinguish time and colour used to highlight
specific dates
41. Data quality: fitness for use – “garbage in, garbage out”. Reasons: cost savings, increased efficiency,
protection of the organisation’s efficiency. If data is not accurate, conclusions based on it will be
misleading.
42. Quality dimensions: there is an error in the dataset if any one of the dimensions is violated
44. Data cleaning: detect, correct errors and inconsistencies to ensure data quality. General approach:
b. Selection of algorithms
c. Selection of methods/approaches
d. Correction of errors
45. Clean data: all data in a column match the metadata description of that column. It is suggested to
have a header row to indicate the meaning of each column.
a. Outliers: a value that is far away from or very different from all or most of the other data. Can be
discovered using visualisation methods, clustering, or distance-based or projection-based techniques
(see the sketch after this list).
b. Duplicates: identify records that refer to the same real-world entity. Record linkage or record
matching. Field matching techniques:
c. Misspelling
e. Meaningless values
a. Statistical methods
b. Data transformation
d. Duplicate elimination
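As referenced under outliers above, a minimal sketch of one simple statistical check, the boxplot/IQR rule (the values are made up):

```python
import statistics

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the boxplot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

print(iqr_outliers([12, 13, 12, 14, 13, 12, 95]))  # [95]
```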
48. Selecting data: only select data that is relevant for the given problem, creating a subsample of the
data and using it for the analysis. Reasons:
a. Select the model class: general structure of the analysis result (e.g. linear or quadratic
function for a regression problem)
b. Select the score function: evaluate possible models using a score function
d. Validate the results: choose the best model among the candidate models
50. Model class: form or structure of the analysis result. Parameters are not defined; only the type is selected
(e.g. linear models, mean, rule-based models…)
a. Global models: provide a (not necessarily good) description for the whole data set (e.g. a regression
line)
b. Local models: provide a description for only a subset of the data set (association rules)
51. Fitting criteria and score function: find an objective function (e.g. mean squared error vs mean absolute
error) that evaluates the quality of a model, in order to identify the best model
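A tiny sketch contrasting the two score functions named above, on made-up numbers:

```python
def mse(y, pred):
    """Mean squared error: large deviations dominate the score."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

def mae(y, pred):
    """Mean absolute error: every unit of deviation counts equally."""
    return sum(abs(a - b) for a, b in zip(y, pred)) / len(y)

y, pred = [1.0, 2.0, 3.0, 10.0], [1.0, 2.0, 3.0, 4.0]
print(mse(y, pred), mae(y, pred))  # 9.0 vs 1.5: the single large error
                                   # dominates MSE but not MAE
```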
52. Error functions for classification: misclassification rate = wrongly classified / total classified. A low
misclassification rate alone says little about the quality of a classifier: when classes are unbalanced, the
rate is misleading (if 99% of the data is classed as good, a classifier always predicting good will have a
misclassification rate of only 1%)
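A minimal sketch of the unbalanced-class problem from item 52 (the data is invented to match the 99%/1% example):

```python
def misclassification_rate(y_true, y_pred):
    """Wrongly classified / total classified."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# 99 "good" cases and 1 "bad" case, as in the example above.
y_true = ["good"] * 99 + ["bad"]
y_pred = ["good"] * 100          # a classifier that always predicts "good"

print(misclassification_rate(y_true, y_pred))  # 0.01 - looks excellent,
# yet the classifier never detects a single "bad" case
```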
53. Scoring function / objective function: does not tell us how to find the best method. Provides a means for
comparing models.
c. For discrete problems with a finite search space, combinatorial optimisation strategies are
needed
d. In principle an exhaustive search of the finite domain is possible, but in practice it is not feasible,
since the search space is too large. That is why heuristic strategies are needed.
56. Error = experimental error + sample error + model error + algorithmic error. These are the 4 potential
error origins, which sum to the overall error measure
57. Experimental error/pure error/intrinsic error/Bayes error: inherent in the data due to noise, random
variations, imprecise measurements. Impossible to overcome this error by the choice of a suitable
model.
58. Confusion matrix: table where rows are true classes and columns the predicted classes.
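A small sketch of item 58’s convention, rows = true classes and columns = predicted classes (labels and predictions are made up):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows = true classes, columns = predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

labels = ["good", "bad"]
y_true = ["good", "good", "bad", "bad", "good"]
y_pred = ["good", "bad", "bad", "good", "good"]

for label, row in zip(labels, confusion_matrix(y_true, y_pred, labels)):
    print(label, row)  # good [2, 1] / bad [1, 1]
```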
59. Sample error: a small sample means a small probability of finding a perfect model
61. Algorithmic error: can often not be measured. Normally we assume that the algorithm is good enough.
62. Model validation: error for unseen data will most probably always be bigger than for the data used for
training.
63. Training and test data: split data, train model on training data, test model quality on test data. Or
training, test and validation: test the best models on the validation data set. Splitting strategies:
64. Cross validation: split data multiple times to validate the results to reduce the effect of the single
estimation.
65. K-fold Cross-Validation: split the data into k subsets; each subset is used once as the test data, with the remaining subsets as training data
66. Leave one out method: small data sets. Use everything for training except one point used for testing.
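A sketch of the splitting strategies from items 63 to 66, assuming scikit-learn is available (the ten-sample dataset is a placeholder):

```python
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut

X = list(range(10))  # placeholder dataset of 10 samples

# Item 63: holdout split - train on one part, test on the held-out rest.
train, test = train_test_split(X, test_size=0.3, random_state=0)

# Item 65: k-fold cross-validation - each fold is the test set exactly once.
for train_idx, test_idx in KFold(n_splits=5).split(X):
    print("k-fold test fold:", test_idx)

# Item 66: leave-one-out - one point for testing, the rest for training.
print("leave-one-out rounds:", sum(1 for _ in LeaveOneOut().split(X)))  # 10
```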
b. Inductive learning: detecting patterns and working out a rule from them automatically.
69. OneR (one-attribute rule): find the one attribute to use that makes the fewest prediction errors (a minimal sketch follows).
a. For each value of that attribute, find the most frequent class, make that the rule, and calculate
the rule's error rate. Pick the attribute with the lowest error rate.
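A minimal OneR sketch in plain Python; the toy weather rows are invented for illustration:

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """OneR: per attribute, predict the most frequent class for each value;
    keep the attribute whose rule makes the fewest errors."""
    best = (None, None, len(rows) + 1)
    for attr in (a for a in rows[0] if a != target):
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c[rule[v]] for v, c in by_value.items())
        if errors < best[2]:
            best = (attr, rule, errors)
    return best

rows = [  # toy weather data, invented
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
print(one_r(rows, "play"))  # ('windy', {'no': 'yes', 'yes': 'no'}, 0)
```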
a. Entropy: a measure of the degree of doubt, H = −Σ pᵢ log₂(pᵢ) over the class proportions pᵢ. The
higher the entropy, the more doubt about the possible conclusions. The attribute with the lowest
entropy is the most useful determiner (see the sketch after this list).
b. Algorithm: compute each attribute’s entropy. Select lowest entropy attribute. Divide data by
attribute classes. Build the tree.
c. Advantages: detailed decision tree, always works, easy to implement, easy to read, simple,
running time increases only linearly with the complexity of the problem
d. Limitations: takes no account of any meaning that the data it works on may have, considers
just one attribute at a time, cannot handle uncertain rules.
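As referenced above, a minimal sketch of the entropy measure (the class labels are made up; 9 yes / 5 no is a classic toy split):

```python
import math
from collections import Counter

def entropy(labels):
    """H = sum over classes of -p_i * log2(p_i), p_i = class proportion."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940 bits: plenty of doubt
print(entropy(["yes"] * 14))              # 0.0 bits: no doubt at all
```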
73. Propositional rules: atomic facts combined using only logical operators with no variables.
c. Support of an association rule: the number of transactions containing all of the rule's items,
divided by the total number of transactions. It is not possible to determine the support of all item
sets, because their number grows exponentially with the number of items
75. Association rules two steps: find frequent item sets with minimum support, form rules and select those
that have at least the minimum confidence
b. Prune to a certain depth, rules with too many items are difficult to interpret
c. Support-based pruning: no superset of an infrequent item set can be frequent, so no counters
for item sets having an infrequent subset are needed.
77. Eclat: depth first search
78. Apriori: breadth first search: traverse the tree for each transaction and find the item set it contains. Top-
down.
a. Algorithm: Set minimum coverage. Find one-attribute associations that reach it, then two,
until end. Set minimum confidence. Generate rules that reach it.
b. Advantages: rules are statistically supported and simple to understand. Easy to implement.
c. Disadvantages: high accuracy but low coverage, no numeric values, big search space, the
models generated might be too big.
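A brute-force sketch of the two-step procedure from item 75 (frequent item sets with minimum support, then rules with minimum confidence), restricted to pairs and without Apriori’s support-based pruning; the transactions, item names and thresholds are all invented:

```python
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
min_support, min_confidence = 0.5, 0.7

def support(itemset):
    """Fraction of transactions containing every item of the item set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: frequent item sets of size 1 and 2 with minimum support.
items = {i for t in transactions for i in t}
frequent = [frozenset(s) for k in (1, 2) for s in combinations(items, k)
            if support(frozenset(s)) >= min_support]

# Step 2: rules A -> B from frequent pairs, kept if confident enough.
for pair in (s for s in frequent if len(s) == 2):
    for a in pair:
        conf = support(pair) / support(frozenset({a}))
        if conf >= min_confidence:
            print({a}, "->", set(pair - {a}), f"(conf {conf:.2f})")
```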
80. Clustering:
a. Approaches:
i. Agglomeration: bottom-up; every leaf is a cluster, join the 2 most similar ones, stop
when clusters are too different to join or when a small enough number of clusters remains.
ii. Partitioning
iii. Density-based clustering: numeric data, best results, DBSCAN. Clusters form where data
density is high.
b. Measuring dissimilarity:
i. Centroid: distance between the centroids of the 2 clusters
c. Deciding when to stop merging:
i. Simplest approach: specify a minimum desired distance and stop when the 2 closest
clusters reach it
ii. Visual approach: run the agglomeration, draw the dendrogram and find a good cut level
(see the sketch after this list)
iii. More sophisticated approaches: analyse the distances and try to find a big step difference
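As referenced above, a sketch of the agglomeration-plus-cut-level idea from item 80, assuming SciPy is available (the 1-D points are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1.0], [1.2], [5.0], [5.1], [9.0]])  # toy 1-D points

# Bottom-up agglomeration: every point starts as its own cluster and
# the two closest clusters are merged at each step.
merges = linkage(data, method="average")

# "Cut level": stop merging once clusters are more than 2.0 apart.
labels = fcluster(merges, t=2.0, criterion="distance")
print(labels)  # e.g. [1 1 2 2 3]: three clusters
```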
81. K-means clustering: choose k cluster centres at random, assign each data point to the most similar
centre and compute the new cluster centres; repeat until the centres stabilise (a sketch follows below)
b. Aim: find the parameters of each normal distribution and how much each normal distribution
contributes to the data (a mixture of Gaussians)
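A minimal pure-Python k-means sketch for item 81, on invented 1-D points (real implementations iterate until the centres stop moving and usually restart from several random initialisations):

```python
import random

def k_means(points, k, rounds=10):
    centres = random.sample(points, k)            # k centres chosen at random
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assign each point to the
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)           # ...most similar centre
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]   # recompute the centres
    return sorted(centres)

random.seed(0)
print(k_means([1.0, 1.2, 5.0, 5.1, 9.0], k=3))
# e.g. [1.1, 5.05, 9.0] - a poor random start can give a worse local optimum
```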
83. Web mining: uncover patterns in web content, structure and usage. Application of data mining
techniques to discover patterns from the web. Web data → web knowledge
a. Challenges: too big, too complex (variety of data types), too dynamic, not specific to a domain,
has everything
84. Web content mining: mining, extraction and integration of useful data, information and knowledge
from web page content. Provides content-based access to web. Organise by content (clustering or
classification)
85. Web structure mining: using graph theory to analyse the node and connection structure of a website
86. Web usage mining: application of data mining techniques to discover interesting usage patterns from
web data, in order to understand and better serve the needs of web-based applications. Reflects the
behaviour of humans as they interact with the internet
b. Click stream analysis: aggregate sequence of page visits executed by a particular user
navigating through a website. Consists of logs, cookies, meta tags and other data
c. Variables:
i. Number of visit actions: count of web log entries for each session ID. Summary
statistics to find out whether website is healthy or not.
ii. Session duration: time per session spent on the website. Gives no information about how long
was spent on the last page; only sessions that contain more than one action are considered.
Underestimates total session duration, since time on the last page is not counted.
iii. Average time per page: session duration / (number of visit actions - 1)
iv. Duration for individual pages: first few pages are the most important ones. Tend to be
short.
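A sketch of variables i–iii from item 86 on an invented toy web log (session id, timestamp in seconds, page):

```python
from collections import defaultdict

log = [("s1", 0, "/home"), ("s1", 40, "/products"), ("s1", 100, "/cart"),
       ("s2", 10, "/home")]  # toy entries; real logs come from the server

sessions = defaultdict(list)
for sid, ts, _page in log:
    sessions[sid].append(ts)

for sid, times in sessions.items():
    actions = len(times)                     # i. number of visit actions
    if actions < 2:
        continue                             # single-action sessions skipped
    duration = max(times) - min(times)       # ii. underestimates: time on the
                                             # last page is unknown
    per_page = duration / (actions - 1)      # iii. average time per page
    print(sid, actions, duration, per_page)  # s1 3 100 50.0
```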
87. Exploratory data analysis for web usage mining: allows analyst to probe deeper into dataset, inspect
interrelationships and reveal interesting subsets