Daniel B. Wright
DSCI 5240
About Data
Acquiring Data
• Data acquisition may or may not be your concern within your organization
• In large organizations, there may be teams devoted to extracting relevant information from the data warehouse
• Big Data refers to situations where datasets are so large they cannot be stored or analyzed using traditional methods
• In instances where additional data would be helpful, it can often be acquired from operational systems or third parties
Data Structure
• Data is almost always organized in tabular/matrix format
• Rows
  • Tuples
  • Observations
• Columns
  • Variables
  • Dimensions
  • Features
  • Inputs/Targets
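As a minimal sketch of this layout, a table can be held in plain Python as a list of rows, with one value per column in each row. The column names and values below are hypothetical and chosen only for illustration.

```python
# Hypothetical dataset: each dict is one row (tuple/observation),
# and each key is one column (variable/dimension/feature).
rows = [
    {"age": 34, "income": 52000, "purchased": 1},
    {"age": 51, "income": 88000, "purchased": 0},
    {"age": 27, "income": 43000, "purchased": 1},
]

input_columns = ["age", "income"]  # columns used as inputs
target_column = "purchased"        # column used as the target

# Split every row into its input values and its target value.
X = [[row[col] for col in input_columns] for row in rows]
y = [row[target_column] for row in rows]

n_rows, n_columns = len(rows), len(rows[0])
print(n_rows, n_columns)
```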
data
types of movies
was spent
Ordinal
• All ordinal data is nominal data
Nominal
Variable Types
• Each column in your dataset represents a potential variable that may be included in your model
Data Preprocessing
The data contained in modern data warehouses often has significant data quality issues
• Accuracy – Do the data accurately represent what they are intended to represent?
We have a customer record but the income field reflects household, rather than individual income as expected
We have a customer record but the value of the income field was collected 20 years ago
We have a customer record but the value of the income field is $5B
We have a customer record but the value of the income field has been scaled several times and we are not really sure what it means
Data Quality
• Data quality has consistently been shown to be a critical factor in the successful use of BI within organizations
• Some preprocessing tasks related to improving data quality are completed before the analyst receives the data; others are completed after
• Data integration – ensure that incorporating data from multiple sources has not introduced inconsistencies into the data
• Data reduction – identify a smaller subset of the data that can produce the same (or similar) analytical results
Data Cleaning
• Missing data approaches
• Ignore the tuple – Skip it; can result in significant data loss in sparse data sets
• Global constant – Use a placeholder; can get confused with actual data
• Central tendency – Use the mean or median; can alter the variation in the data
• Class-based central tendency – Use the mean or median associated with the class to which this record belongs
• Binning – Sort the values and adjust each one based on its neighbors in the same bin (bin mean, median, or boundary)
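The first four approaches above can be sketched in plain Python. This is a minimal illustration, not a recommended pipeline: the records, the `segment` class label, and the placeholder value `-1` are all hypothetical, and binning is not covered.

```python
from statistics import mean

# Hypothetical records; None marks a missing income value.
records = [
    {"segment": "A", "income": 40000},
    {"segment": "A", "income": None},
    {"segment": "A", "income": 60000},
    {"segment": "B", "income": 90000},
    {"segment": "B", "income": None},
    {"segment": "B", "income": 110000},
]

def observed(recs):
    """Return the non-missing income values."""
    return [r["income"] for r in recs if r["income"] is not None]

def fill(recs, choose):
    """Replace each missing income with choose(record)."""
    return [r["income"] if r["income"] is not None else choose(r) for r in recs]

# 1. Ignore the tuple: drop rows with a missing value (loses two of six rows).
dropped = [r for r in records if r["income"] is not None]

# 2. Global constant: substitute a placeholder (-1 is an arbitrary choice).
constant_filled = fill(records, lambda r: -1)

# 3. Central tendency: substitute the overall mean of the observed values.
overall_mean = mean(observed(records))
mean_filled = fill(records, lambda r: overall_mean)

# 4. Class-based central tendency: substitute the mean of the record's class.
class_means = {
    seg: mean(observed([r for r in records if r["segment"] == seg]))
    for seg in {r["segment"] for r in records}
}
class_filled = fill(records, lambda r: class_means[r["segment"]])

print(mean_filled)
print(class_filled)
```

In this toy data the overall mean sits between the two segments, so the class-based fill stays closer to each segment's observed values.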
Data Integration
• Entity Identification
• How do we match records in one data source with those in another?
• Redundancy
• Can a given field be derived from others within the data set?
• Duplication
• Can result from data redundancy in underlying data sources
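The three integration questions above can be illustrated with a small sketch. It assumes two hypothetical sources (`crm` and `orders`) that share an email address as the matching key; real entity identification is usually much harder than normalizing one field.

```python
# Hypothetical records from two sources that describe the same customers.
crm = [
    {"email": "Ann.Lee@example.com ", "name": "Ann Lee"},
    {"email": "bob@example.com", "name": "Bob Ray"},
    {"email": "ann.lee@example.com", "name": "Ann Lee"},  # duplicate of row 1
]
orders = [
    {"customer_email": "ann.lee@example.com", "price": 10.0, "quantity": 3, "total": 30.0},
    {"customer_email": "BOB@example.com", "price": 4.0, "quantity": 2, "total": 8.0},
]

def normalize(email):
    """Build a comparable key: trim whitespace and ignore case."""
    return email.strip().lower()

# Duplication: keep the first record seen for each normalized key.
customers = {}
for rec in crm:
    customers.setdefault(normalize(rec["email"]), rec)

# Entity identification: match each order to a customer via the shared key.
matched = [
    (customers[normalize(o["customer_email"])]["name"], o["total"])
    for o in orders
    if normalize(o["customer_email"]) in customers
]

# Redundancy: check whether "total" can be derived from price * quantity.
total_is_redundant = all(
    abs(o["price"] * o["quantity"] - o["total"]) < 1e-9 for o in orders
)

print(len(customers), matched, total_is_redundant)
```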
Data Reduction
• The data we work with is often BIG and its size may inhibit our ability to work with it
• Acquisition
• Storage
• Modeling
• Rows
• Columns
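The slide lists rows and columns without further detail. The sketch below assumes these refer to the two directions in which a dataset can be reduced: fewer rows (here, a simple random sample) and fewer columns (here, a hand-picked subset of variables). The dataset, the 10% sampling rate, and the column names are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sample is reproducible

# Hypothetical dataset: 1,000 rows with four columns.
data = [
    {
        "id": i,
        "age": rng.randint(18, 80),
        "income": rng.randint(20000, 120000),
        "notes": "n/a",
    }
    for i in range(1000)
]

# Reduce rows: draw a simple random sample of 10% of the observations.
sample = rng.sample(data, k=100)

# Reduce columns: keep only the variables assumed to be useful for modeling.
keep = ["age", "income"]
reduced = [{col: row[col] for col in keep} for row in sample]

print(len(data), "rows ->", len(reduced), "rows")
print(len(data[0]), "columns ->", len(reduced[0]), "columns")
```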
Data Exploration
• Having a sound understanding of the data you employ in models is critical
• A lack of understanding on the part of the modeler will result in poor model performance and/or nonsensical model parameters
• Mean
• Sum a group of numbers and divide by the number of observations
• Represents central tendency but is not robust
Given $x = 5, 2, 7, 2, 8$:
$$\bar{x} = \frac{\sum x}{n} = \frac{5 + 2 + 7 + 2 + 8}{5} = 4.8$$
• Median
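As a quick check of the arithmetic, the mean and median of the five values in the example can be computed with Python's standard `statistics` module. The extra value `500` is a hypothetical outlier, added only to show why the mean is described as not robust.

```python
from statistics import mean, median

x = [5, 2, 7, 2, 8]

x_mean = mean(x)      # (5 + 2 + 7 + 2 + 8) / 5
x_median = median(x)  # middle value of the sorted list [2, 2, 5, 7, 8]
print(x_mean, x_median)

# Hypothetical outlier: one extreme value appended to the same data.
x_outlier = x + [500]

outlier_mean = mean(x_outlier)      # pulled far toward the outlier
outlier_median = median(x_outlier)  # average of the two middle values, 5 and 7
print(round(outlier_mean, 2), outlier_median)
```

The single outlier moves the mean from 4.8 to about 87.33, while the median only moves from 5 to 6.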
Important Visualizations
• Histogram
• Graphical representation of the distribution of numeric data
• Bins are constructed and the number of observations that fall within each bin is represented on a bar graph
• Box Plot
• Graphical representation of data through quartiles
• Bottom and top of the box represent the first and third quartiles; the middle bar (or sometimes a dot) represents the median; whisker definitions vary
• Scatter Plot
• Graphical representation of two or more variables in relation to one another
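The numbers behind a histogram and a box plot can be computed without any plotting library. The sketch below uses hypothetical values, four equal-width bins, and the `inclusive` quartile method from the standard library; plotting tools may use different binning and quartile conventions.

```python
from statistics import quantiles

values = [5, 2, 7, 2, 8, 3, 9, 4, 6, 1]  # hypothetical numeric data

# Histogram: build equal-width bins and count the observations in each bin.
n_bins = 4
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins
counts = [0] * n_bins
for v in values:
    # The maximum value would land one past the last bin, so clamp it.
    idx = min(int((v - lo) / width), n_bins - 1)
    counts[idx] += 1

for i, count in enumerate(counts):
    left = lo + i * width
    print(f"[{left:.1f}, {left + width:.1f}) {'#' * count}")

# Box plot: the box spans the first to third quartile, with the median inside.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
print("Q1:", q1, "median:", q2, "Q3:", q3)
```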
Histogram
Normal distribution
Box Plot
Scatter Plot