Data analytics PDF maria

Data Analytics (Edinburgh Napier University)

Downloaded by Shameer Babu Thonnan Thodi (ttshameerbabu@yahoo.com)

Data Analytics
1. Data: facts, observations, perceptions. Lowest level of abstraction from which information and then
knowledge are derived.

2. Information: a subset of data that possesses context, relevance and purpose. Middle level of abstraction.

3. Knowledge: understanding of information. Highest level of abstraction.

4. Decisions: made based on information and knowledge

5. Analysis of data: inspecting, cleaning, transforming and modelling data with the goal of highlighting
information, suggesting conclusions and supporting decision making

6. Data mining: process of extracting information from large databases and using it to make decisions.
KDD: Knowledge Discovery in Databases. Methods:

a. Predictive methods: classification, regression

b. Descriptive methods: clustering, association rule discovery

7. Data mining process:

a. Data Gathering

b. Data preparation and cleansing: detect, correct errors and inconsistencies to ensure data
quality.

c. Pattern extraction and discovery (mining)

d. Visualisation

e. Analysis and evaluation of results

8. Classification: predict nominal outcome with a finite number of possible results

9. Regression: prediction task too, but with numerical values

10. Clustering: summarise data for a better overview by forming groups of similar cases

11. Association: find correlations or associations that describe the interdependence of attributes

12. Visualisation: the use of computer-supported, interactive, visual representation of abstract data to
amplify cognition. Depends on type of data being viewed and questions we want to ask

a. Understandable to a wider audience

b. Aesthetics: more attractive than statistics

c. Convincing: helps persuade yourself and others of value of the data

d. Trust: more trusted than statistics

e. Uses: explore and gain understanding of dataset, finding patterns and oddities, generating
further questions.

13. Visual encoding: mapping a data attribute to a visual variable.

14. CRISP-DM (CRoss Industry Standard Process for Data Mining). Phases:

a. Project understanding

b. Data understanding

c. Data preparation

d. Modelling

e. Evaluation

f. Deployment

15. Types of visualisation:

a. 1D: histograms, bar charts, box plots

b. 2D: scatterplots + colour/shape, bar charts

c. ND (3+): scatterplot matrix, parallel coordinates, radar plot, star plot, MDS

d. Planning: maps

16. Histogram number of bins: Sturges’ rule k = ⌈log2(n) + 1⌉, where n = sample size. Histograms show the overall
shape of a variable’s distribution, but outliers are hard to find. Patterns are visible in the heights of the bars.
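
A minimal sketch of Sturges' rule, assuming the brackets in the formula mean rounding up to the next whole number. The function name sturges_bins is made up for this example.

```python
import math

def sturges_bins(n: int) -> int:
    """Number of histogram bins suggested by Sturges' rule for n observations."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # k = ceil(log2(n) + 1); the rounding direction is an assumption
    return math.ceil(math.log2(n) + 1)

bins_for_100 = sturges_bins(100)    # log2(100) is about 6.64, so 8 bins
bins_for_1024 = sturges_bins(1024)  # log2(1024) is exactly 10, so 11 bins
```

The bin count grows only logarithmically, so going from 100 to 1024 observations adds just three bins.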

17. Boxplots show centre and range and individual outliers are clearly shown

18. Data understanding: determine quality of data, outliers, missing values, dependencies and correlations,
compare statistics to expected behaviour

19. Data exploration: first stage of data analysis. Aims: understand data, find range and limitations, check
that data is consistent, understand relationships

20. Distribution: rate of occurrence of values within a variable

21. Juxtaposition: side by side to compare

22. Superposition: on top of each other

23. Accuracy: deviation from target

24. Precision: deviation of guesses from each other

25. Stevens’ Power Law: the perceived magnitude of a stimulus grows as a power function of its physical
intensity. Trueness = how close human perceptual judgement is to some objective measurement of the stimulus

26. JND – Just Noticeable Difference: smallest difference between 2 stimuli that can be perceived. Determines
our ability to understand the data. Depends on size, colour, contrast, texture…

27. Law of Parsimony: simplest explanation wins

28. Gestalt principles. Gestalt works by grouping features by visual search, salience and context.

a. Simplicity

b. Symmetry

c. Similarity (squares vs circles)

d. Closure: we perceptually complete objects that aren’t complete (e.g. a dashed line)

e. Connectedness: things that are physically connected are perceived as a unit

f. Continuity: points connected in a line – together

g. Common fate: moving in the same direction

h. Familiarity

i. Proximity

29. Reification: imagined or illusory shapes. Perception is constructed from smaller components.

30. Preattentive visual features: ability of low-level human visual system to rapidly identify certain basic
visual properties e.g. a unique visual property – something in red pops out

31. Salience: areas of the image that attract our attention

32. Effectiveness principle: encode the most important information in the most effective way:

a. Accuracy: how close is human perceptual judgement to some objective measurement of
the stimuli

b. Discriminability: what is the JND of the visual channel

c. Separability: can the visual channel be judged independently of other visual channels?

d. Importance ordering: the importance of the attribute should match the salience of the visual
channel

33. Graph: set of vertices (nodes) and edges (links). Clusters are based on connectivity; groups are based on
attributes. A path is a sequence of edges between two nodes.

34. Nodes represent entities; edges represent relationships.

35. Layout algorithms:

a. Force directed

b. Constraint-based

c. Multi scale approaches

d. Layered layouts

e. Non-standard layouts

36. Graph drawing issues: scale, computationally expensive, readability is hard – leads to occlusion, run out
of screen space. Solutions: parallel processing on the GPU, larger screens, reduction techniques
(clustering), use of alternative representation (matrix)

37. Matrix: nodes arranged on the x and y axes. Connections are marked where rows and columns meet. Matrices
scale well, have no clutter, have layouts that are easy to calculate, and outperform node-link diagrams on most
tasks, but they are less intuitive, make paths hard to follow, and require ordering to show clusters

38. Node-link diagrams: intuitive, good for path-finding tasks and for showing structure in sparse graphs, but do
not scale well

39. Tree: understandable but not very space efficient. Types: node-link, adjacency or tree map

40. Animation to show time: showing images sequentially to convey change over time. Good for transitions,
simple comparisons of adjacent items, storytelling. Not so good for comparisons, exploring data or
when conveying complex systems.

a. Sequential: showing images sequentially to convey change over time. More screen space ->
clutter reduction. Can spot local changes between timescales. Tasks take longer and are less
accurate.

b. Juxtaposed: timelines overlaid, opacity used to distinguish time, colour used to highlight
specific dates

41. Data quality: fitness for use (“garbage in, garbage out”). Reasons: cost savings, increased efficiency,
protection of organisation’s efficiency. If data is not accurate, relations based on this data will be
misleading.

42. Quality dimensions: there is an error in the dataset if any one of the dimensions is violated

a. Accuracy: recorded value vs actual value

b. Timeliness: not out of date

c. Completeness: all values are recorded

d. Consistency: data is uniform

e. Uniqueness: only one record of each unique entity

43. Data quality difficulties:

a. Lack of validation routines

b. Lack of referential integrity checks

c. Valid, but not correct

d. Mismatched syntax, formats, structures

e. Unexpected changes in source systems

f. Poor system design

g. Data conversion errors

h. Data integration errors

44. Data cleaning: detect, correct errors and inconsistencies to ensure data quality. General approach:

a. Identification of error types

b. Selection of algorithms

c. Selection of methods/approaches

d. Correction of errors

45. Clean data: all data in a column match the metadata description of that column. It is suggested to
have a header row to indicate the meaning of each column.

46. Dirty data types:

a. Outliers: a value that is far away or very different from all or most of other data. Can be
discovered by using visualisation methods, clustering, distance-based or projection-based.

b. Duplicates: identify records that refer to the same real-world entity. Record linkage or record
matching. Field matching techniques:

i. Character based similarity methods: smiht-> smith

ii. Token-based similarity methods: John Smith -> Smith, John

c. Misspelling

d. Missing values: ignore/delete the record, impute (replace the missing value with an estimate, the mode, the
mean, or a value determined by other attributes), or leave it as missing, although some methods can’t
deal with missing values

e. Meaningless values

f. Wrong data types
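
The two field matching techniques in 46b can be sketched as follows. The notes do not name specific measures, so this assumes Levenshtein edit distance as the character-based method and Jaccard overlap of word tokens as the token-based method. Both function names are made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def token_jaccard(a: str, b: str) -> float:
    """Share of word tokens the two strings have in common, ignoring order and commas."""
    tokens_a = set(a.lower().replace(",", " ").split())
    tokens_b = set(b.lower().replace(",", " ").split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

typo_distance = edit_distance("smiht", "smith")              # 2: two swapped letters
name_overlap = token_jaccard("John Smith", "Smith, John")    # 1.0: same tokens, other order
```

A small edit distance suggests a typing error, while a high token overlap catches reordered fields that a character-based measure would score as very different.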

47. Detection methods (time consuming)

a. Statistical methods

b. Data transformation

c. Integrity constraint enforcement

d. Duplicate elimination

e. Data mining techniques – association rules

48. Selecting data: only select data that is relevant for the given problem, creating a sub sample of the
data and use it for the analysis. Reasons:

a. Timeliness: use recent data

b. Representativeness: sometimes the population of interest is different from the population available in
the dataset

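c. Rare events: when the event of interest is rare, a model can score well by always predicting that the event
will not occur, so the sample may need to over-represent the rare cases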

49. Four steps of Modelling:

a. Select the model class: general structure of the analysis result (e.g. linear or quadratic
function for a regression problem)

b. Select the score function: evaluate possible models using a score function

c. Apply the algorithm: compare models through the score function

d. Validate the results: choose the best model among the candidates

50. Model class: form or structure of the analysis result. Parameters are not defined. Only the type is selected
(e.g. linear models, mean, rule based models…)

a. Global models: provide a (not necessarily good) description for the whole data set (regression
line)

b. Local models: provide a description for only a subset of the data set (association rules)

51. Fitting criteria and score function: find an objective function (mean squared error vs mean absolute
error) which evaluates the quality of your model in order to detect the best model
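
A small sketch of how the choice of score function can change which model looks best. The data and the two sets of predictions are made up for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences; large misses are penalised heavily."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average of absolute differences; every unit of error counts the same."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 7.0, 9.0]
model_a = [3.5, 5.5, 7.5, 9.5]   # off by 0.5 everywhere
model_b = [3.0, 5.0, 7.0, 11.0]  # exact except for one miss of 2.0

mse_a, mse_b = mean_squared_error(actual, model_a), mean_squared_error(actual, model_b)
mae_a, mae_b = mean_absolute_error(actual, model_a), mean_absolute_error(actual, model_b)
```

Here the mean absolute error rates both models equally (0.5 each), while the mean squared error prefers model_a (0.25 against 1.0) because it punishes the single large miss.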

52. Error functions for classification: misclassification rate = wrongly classified / total classified. A low
misclassification rate on its own says little about the quality of a classifier. When classes are
unbalanced the rate is misleading (if 99% of the data belongs to the class good, a classifier that always predicts
good will have a misclassification rate of 1%)
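
The 99% example can be checked directly. This is a minimal sketch with made-up labels; misclassification_rate is a name chosen for this example.

```python
def misclassification_rate(actual, predicted):
    """Wrongly classified cases divided by all classified cases."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# 99 good cases and 1 bad case, scored by a classifier that always says "good"
actual = ["good"] * 99 + ["bad"]
always_good = ["good"] * 100

rate = misclassification_rate(actual, always_good)  # 0.01
bad_found = sum(1 for a, p in zip(actual, always_good) if a == "bad" and p == "bad")  # 0
```

The rate is only 1%, yet the classifier never finds a single bad case, which is usually the class of interest.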

53. Scoring function / objective function: does not tell us how to find the best model. Provides a means for
comparing models.

54. Optimisation algorithms: find the best model needed

55. Algorithm problems:

a. will only find local optima

b. parameters must be adjusted

c. for discrete problems with a finite search space, combinatorial optimisation strategies are
needed

d. In principle exhaustive search of the finite domain is possible; in practice it is not feasible because the
search space is too large. That’s why heuristic strategies are needed.

56. Error = experimental error + sample error + model error + algorithmic error. These are the 4 parts of
potential error origins, which sum up to the overall error measure

57. Experimental error/pure error/intrinsic error/Bayes error: inherent in the data due to noise, random
variations, imprecise measurements. Impossible to overcome this error by the choice of a suitable
model.

58. Confusion matrix: table where rows are true classes and columns the predicted classes.
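
A minimal sketch of building such a table, with rows as true classes and columns as predicted classes. The labels are made up for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are true classes, columns are predicted classes, in the order of `classes`."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(true, pred)] for pred in classes] for true in classes]

actual    = ["yes", "yes", "no", "no",  "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "no", "yes"]

matrix = confusion_matrix(actual, predicted, ["yes", "no"])
# matrix == [[2, 1],
#            [1, 2]]
```

The diagonal holds the correctly classified cases; the off-diagonal cells show which classes are confused with each other, which a single misclassification rate hides.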

59. Sample error: small sample = small probability for a perfect model

60. Model error:

a. simpler model = bigger error

b. more complex model = overfitting and larger error on new data

c. type of model affects the fit to data

61. Algorithmic error: often cannot be measured. Normally we assume that the algorithm is good enough.

62. Model validation: error for unseen data will most probably always be bigger than for the data used for
training.

63. Training and test data: split data, train model on training data, test model quality on test data. Or
training, test and validation: test the best models on the validation data set. Splitting strategies:

a. Random: roughly the same distribution in both samples

b. Stratification: the distribution of the classes should be preserved in each sample

64. Cross validation: split data multiple times to validate the results to reduce the effect of the single
estimation.

65. K-fold Cross-Validation: k subsets, test data is always another subset, training data the rest

66. Leave one out method: small data sets. Use everything for training except one point used for testing.
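
The splitting behind k-fold cross-validation and leave-one-out can be sketched as below. This assumes the data has already been shuffled, since the folds are taken in index order; k_fold_splits is a name chosen for this example.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

three_fold = list(k_fold_splits(6, 3))     # first split: train [2, 3, 4, 5], test [0, 1]
leave_one_out = list(k_fold_splits(4, 4))  # k equal to the number of samples
```

Leave-one-out is the special case where k equals the number of data points, so every test set holds a single point.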

67. Types of learning:

a. Deductive learning: general rules are given and then applied

b. Inductive learning: detecting patterns and working out a rule for themselves.

i. Supervised: input and output are known

1. Classification: predict outcome with a nominal target class attribute

2. Regression: predict outcome with a numerical target attribute

ii. Unsupervised: only input is known

iii. Reinforcement learning: input, output and evaluation of output

68. Minimum coverage: minimum number of instances

69. OneR (one attribute rule): find the one attribute whose rule makes the fewest prediction errors.

a. For each attribute, map each of its values to the most frequent class for that value, then calculate the
error rate of the resulting rule. Pick the attribute with the lowest error rate.
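
A minimal sketch of OneR on a made-up table. The attribute names and rows are hypothetical, and ties between classes are broken arbitrarily.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Return (attribute, rule, errors) for the attribute whose rule makes the fewest errors."""
    best = None
    for attr in attributes:
        # Count the target classes seen for each value of this attribute
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        # Rule: each attribute value predicts its most frequent class
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        # Errors: every row that is not in the majority class of its value
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
attribute, rule, errors = one_r(rows, ["outlook", "windy"], "play")
```

On this table the rule on outlook makes 1 error out of 6 and the rule on windy makes 2, so outlook is chosen.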

70. ID3 Algorithm: decision tree. Rules = leaf nodes.

a. Entropy: a measure of the degree of doubt. The higher, the more doubt about the possible
conclusions. Attribute with lowest entropy is the most useful determiner.

b. Algorithm: compute each attribute’s entropy. Select lowest entropy attribute. Divide data by
attribute classes. Build the tree.

c. Advantages: detailed decision tree, always works, easy to implement, easy to read, simple,
running time increases only linearly with the complexity of the problem

d. Limitations: takes no account of any meaning that the data it works on may have, considers
just one attribute at a time, cannot handle uncertain rules.
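
The attribute selection step of ID3 can be sketched as below, assuming Shannon entropy in bits and taking an attribute's entropy to be the size-weighted entropy of the groups it splits the data into. The rows and function names are made up for this example, and the sketch covers only the choice of one split, not the full tree.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group, 1 for a 50/50 split of two classes."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def split_entropy(rows, attr, target):
    """Entropy left after dividing the rows by attr, weighted by group size."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

def best_attribute(rows, attributes, target):
    """The attribute with the lowest remaining entropy is the most useful determiner."""
    return min(attributes, key=lambda a: split_entropy(rows, a, target))

rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
chosen = best_attribute(rows, ["outlook", "windy"], "play")
```

Splitting on outlook leaves no doubt about the class (entropy 0), while splitting on windy leaves each group at 50/50 (entropy 1), so outlook becomes the root of the tree.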

71. Pruning: to simplify the tree, to avoid overfitting.

a. Idea: replace a subtree with a leaf, or with its biggest branch, if that is better

b. Approaches: reduced error pruning, pessimistic pruning, confidence level pruning

72. C4.5: an extension of ID3. Builds decision trees. Improvements:

a. Handle continuous and discrete data

b. Handling missing values

c. Pruning trees after creation

73. Propositional rules: atomic facts combined using only logical operators with no variables.

74. Association rules quality:

a. Support of an item set: fraction of transactions that contain the item set

b. Support of an association rule: number of transactions containing all items of the rule, divided by the
total number of transactions. It is not feasible to determine support for all item sets, because their
number grows exponentially with the number of items

c. Confidence of an association rule: support of the rule divided by the support of its “given that” part

75. Association rules two steps: find frequent item sets with minimum support, form rules and select those
that have at least the minimum confidence
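
A minimal sketch of support and confidence on made-up shopping baskets. The function names and the data are chosen for this example only.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the item set."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by the support of its 'given that' part."""
    rule_items = set(antecedent) | set(consequent)
    return support(rule_items, transactions) / support(antecedent, transactions)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support({"bread", "milk"}, transactions)          # 2 of 4 baskets = 0.5
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 0.5 / 0.75, about 0.67
```

The rule "bread -> milk" is kept only if both values reach the chosen minimum support and minimum confidence.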

76. Item set tree pruning

a. Structural pruning: only one counter for each item set

b. Prune to a certain depth, rules with too many items are difficult to interpret

c. Support based pruning: no superset of an infrequent item set can be frequent. No counters
for item sets having an infrequent subset are needed.

77. Eclat: depth first search

78. Apriori: breadth first search: traverse the tree for each transaction and find the item sets it contains.
Top-down.

a. Algorithm: Set minimum coverage. Find one-attribute associations that reach it, then two,
until end. Set minimum confidence. Generate rules that reach it.

b. Advantages: rules are statistically supported and simple to understand. Easy to implement.

c. Disadvantages: rules tend to have high accuracy but low coverage, no numeric values, big search space,
models generated might be too big.
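
The level-by-level search for frequent item sets can be sketched as below. This assumes minimum coverage is given as an absolute number of transactions, and it covers only the first step (finding frequent item sets), not rule generation. The function name and data are made up for this example.

```python
from itertools import combinations

def frequent_item_sets(transactions, min_count):
    """Map each frequent item set to the number of transactions that contain it."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # candidates of size 1
    frequent = {}
    size = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        size += 1
        # Join frequent sets of this level into candidates one item larger
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Support based pruning: drop candidates that have an infrequent subset
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
found = frequent_item_sets(transactions, min_count=2)
```

With a minimum count of 2, the pair butter and milk is infrequent, so the three-item set is pruned without ever being counted.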

79. Types of frequent item sets

a. Free item set: any item set with at least the minimum support

b. Closed item set: if no superset has the same support

c. Maximal item set: if no superset is frequent

80. Clustering:

a. Approaches:

i. Linkage based – hierarchical clustering

1. Agglomeration: bottom up, every leaf is a cluster, join 2 similar ones, stop
when too different to join or enough small clusters.

2. Divisive: one cluster into many, top-down

ii. Partitioning

iii. Density based clustering (e.g. DBSCAN): numeric data, best results. Clusters are formed where data
density is high.

b. Measuring dissimilarity:

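i. Centroid: dissimilarity between the centroids of the 2 clusters

ii. Average linkage: average dissimilarity over all pairs of points, one from each cluster

iii. Single linkage: dissimilarity between the 2 most similar objects

iv. Complete linkage: dissimilarity between the 2 most dissimilar objects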

c. Choosing the number of clusters:

i. Simplest approach: specify a minimum desired distance between clusters and stop when the 2 closest
ones reach it

ii. Visual approach: agglomeration, draw dendrogram and find good cut level

iii. More sophisticated approaches: analyse distances, try to find big step difference

d. Similarity measures: nominal, binary, numerical (Euclidean, cosine, Minkowski)
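
The four ways of measuring dissimilarity between two clusters (80b) can be compared on a small example. This sketch assumes Euclidean distance between points; the clusters are made up and the function names are chosen for illustration.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    """Distance between the two closest points, one from each cluster."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Distance between the two farthest points, one from each cluster."""
    return max(euclidean(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    """Mean distance over all pairs of points, one from each cluster."""
    return sum(euclidean(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_distance(c1, c2):
    """Distance between the mean points of the two clusters."""
    def centroid(cluster):
        return tuple(sum(x) / len(cluster) for x in zip(*cluster))
    return euclidean(centroid(c1), centroid(c2))

cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

results = (single_linkage(cluster_a, cluster_b),     # 2.0
           complete_linkage(cluster_a, cluster_b),   # 5.0
           average_linkage(cluster_a, cluster_b),    # 3.5
           centroid_distance(cluster_a, cluster_b))  # 3.5
```

Single linkage gives the smallest value and complete linkage the largest, so the choice of measure changes which clusters are merged first in agglomerative clustering.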

81. K-means clustering: choose the number of clusters k and select k initial cluster centres randomly, assign
each data point to the most similar cluster, compute the new cluster centres, and repeat until the centres stop
changing
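
A minimal sketch of k-means, assuming squared Euclidean distance as the similarity measure and initial centres drawn at random from the data points. The points, the fixed seed and the iteration cap are choices made for this example.

```python
import random

def k_means(points, k, iterations=100, seed=0):
    """Return (centres, clusters) after alternating assignment and centre updates."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # k distinct data points as starting centres
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centres[i])))
            clusters[nearest].append(p)
        # Update step: each centre moves to the mean of its cluster
        new_centres = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
        if new_centres == centres:  # nothing moved, so the result is stable
            break
        centres = new_centres
    return centres, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centres, clusters = k_means(points, k=2)
```

On these two well-separated groups the centres settle at the group means. In general the result depends on the random start, which is why k-means is often run several times.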

82. Gaussian mixture models – EM clustering:

a. Assumption: data was generated by sampling a set of normal distributions

b. Aim: find parameters for normal distribution and how much each normal distribution
contributes to the data

c. Algorithm: the parameters of the normal distributions and the likelihood of each data point being generated
by the corresponding normal distribution are estimated in alternation until they stop changing

83. Web mining: uncover patterns in web content, structure and usage. Application of data mining
techniques to discover patterns from the web. Web data → Web knowledge

a. Challenges: too big, too complex/variety/data types, too dynamic, not specific to a domain,
has everything

84. Web content mining: mining, extraction and integration of useful data, information and knowledge
from web page content. Provides content-based access to web. Organise by content (clustering or
classification)

85. Web structure mining: using graph theory to analyse node and connections structure of a website

a. Hyperlinks: using URLs

b. Mining the Document structure to describe HTML, XML

86. Web usage mining: application of data mining techniques to discover interesting usage patterns from
web data to understand and better serve the needs of web-based applications. Reflects behaviour of
humans as they interact with the internet

a. Benefit: provide insight leading to customisation and personalisation of a user’s web
experience

b. Click stream analysis: aggregate sequence of page visits executed by a particular user
navigating through a website. Consists of logs, cookies, meta tags and other data

c. Variables:

i. Number of visit actions: count of web log entries for each session ID. Summary
statistics to find out whether website is healthy or not.

ii. Session duration: time per session spent on website. There is no information about how long was
spent on the last page, so this underestimates the total session duration. Only sessions that contain
more than one action are considered.

iii. Average time per page: session duration / (number of visit actions - 1)

iv. Duration for individual pages: first few pages are the most important ones. Tend to be
short.
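
The session variables above can be computed from the timestamps of one session's web log entries. This is a minimal sketch with made-up timestamps; the function name and the dictionary keys are chosen for this example.

```python
from datetime import datetime

def session_metrics(timestamps):
    """Summarise one session from the timestamps of its visit actions."""
    times = sorted(timestamps)
    actions = len(times)
    if actions < 2:
        return None  # a single action gives no way to estimate duration
    # Time on the last page is unknown, so this underestimates the real duration
    duration = (times[-1] - times[0]).total_seconds()
    return {
        "actions": actions,
        "duration_s": duration,
        "avg_time_per_page_s": duration / (actions - 1),
    }

session = [
    datetime(2024, 1, 1, 10, 0, 0),
    datetime(2024, 1, 1, 10, 0, 30),
    datetime(2024, 1, 1, 10, 2, 0),
]
metrics = session_metrics(session)
```

Three actions spanning two minutes give a duration of 120 seconds and an average of 60 seconds per page, dividing by 2 rather than 3 because the last page has no measured time.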

87. Exploratory data analysis for web usage mining: allows analyst to probe deeper into dataset, inspect
interrelationships and reveal interesting subsets
