Part - 4 - Data Collection and Its Theory
• Because wrong data will result in wrong conclusions, data taken from
databases should be checked for reliability.
• The two main ways of summarizing data are by using tables and charts or
graphs.
• A table is the simplest way of summarizing a set of observations.
• A table has rows and columns containing data, which can be in the form of
absolute numbers or percentages, or both.
o Simplest way to summarize data
o Data are presented as absolute numbers or percentages
• Charts and graphs are visual representations of numerical data and, if well
designed, will convey the general patterns of the data.
o Visual representation of data
o Data are presented as absolute numbers or percentages
BDU: Bahir Dar Institute of Technology: Computing Faculty
Theory of Data
• Data (singular: datum) are individual units of information.
• Data as a general concept refers to the fact that some existing information
or knowledge is represented or coded in some form suitable for better usage
or processing.
• A data set lists values for each of its variables, such as the height and
weight of an object, for each member of the data set.
• In the open data discipline, the data set is the unit used to measure the
information released in a public open data repository.
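As a concrete illustration of the points above, a data set can be held as a list of records, one per member, with a value for each variable. The variables and values below are invented for illustration:

```python
# A toy data set: each record lists a value for each variable
# (height in cm, weight in kg) for one member of the data set.
data_set = [
    {"id": 1, "height_cm": 172, "weight_kg": 68},
    {"id": 2, "height_cm": 165, "weight_kg": 59},
    {"id": 3, "height_cm": 180, "weight_kg": 75},
]

# The variables are the keys shared by every record.
variables = sorted(data_set[0].keys())

# A value can be summarized across all members of the data set.
mean_height = sum(r["height_cm"] for r in data_set) / len(data_set)
print(variables)           # ['height_cm', 'id', 'weight_kg']
print(round(mean_height))  # 172
```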
Techniques of data collection
• Basic approaches are:

Technique: Using available information
Description: Using data that has already been collected by others
Tools: Checklist; data compilation forms

Criteria for evaluating available (secondary) data:
• Specifications & methodology: data collection method, response rate, quality
& analysis of data, sampling technique & size, questionnaire design,
fieldwork. Data should be reliable, valid, & generalizable to the problem.
• Error & accuracy: examine errors in approach, research design, sampling,
data collection & analysis, & reporting. Assess accuracy by comparing data
from different sources.
• Currency: time lag between collection & publication, frequency of updates.
Census data are updated by syndicated firms.
• Objective: why were the data collected? The objective determines the
relevance of the data.
• Nature: definition of key variables, units of measurement, categories used,
relationships examined. Reconfigure the data to increase their usefulness.
• Dependability: expertise, credibility, reputation, & trustworthiness of the
source. Data should be obtained from an original source.
Process of Data Collection
• Cleaning data
• Cleaning represents the set of final editing and imputation procedures
applied to enhance data quality and to prepare the data for analysis.
• Cleaning is an extremely delicate process that, if not conducted properly,
can seriously compromise the data collected and the whole analysis.
• Cleaning can prove very beneficial, since it can:
• improve, or at the very least retain, the quality of the data collected
• make the data more user-friendly for analysis
• increase the credibility of the data collected
• Yet elaborate cleaning can seriously compromise the information collected,
since it can:
• significantly change the data collected
• introduce errors in the final data
• destroy evidence of poor-quality data
Process of Data Collection
• The data cleaning approach is based on two fundamental principles:
• The integrity of the data collected is paramount.
• Extensive imputation is not allowed (a ceiling of 5% of values imputed).
• Practical steps in cleaning data
• Structural checks: check the structural composition of the data file(s) and
its correspondence to the questionnaire. These controls are conducted to
ensure that:
• The data file(s) contain all the sections of the questionnaire
• Each record has a unique ID.
• The IDs used correspond to the selected sample
• Each variable has a unique label.
• Each label in the data corresponds to the labels in the paper questionnaire
• All variables in the data set appear in the questionnaire and vice versa
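The structural checks above can be sketched in code. This is a minimal sketch, assuming each record is a dict keyed by variable label; the function and field names are hypothetical:

```python
# Sketch of structural checks, assuming each record is a dict keyed by
# variable label. All names here are illustrative.

def structural_checks(records, questionnaire_labels, selected_sample_ids):
    errors = []

    # Each record has a unique ID.
    ids = [r.get("id") for r in records]
    if len(ids) != len(set(ids)):
        errors.append("duplicate or missing record IDs")

    # The IDs used correspond to the selected sample.
    unknown = set(ids) - set(selected_sample_ids)
    if unknown:
        errors.append(f"IDs not in selected sample: {sorted(unknown)}")

    # All variables in the data appear in the questionnaire and vice versa.
    data_labels = set().union(*(r.keys() for r in records)) - {"id"}
    if data_labels != set(questionnaire_labels):
        errors.append("variable labels do not match the questionnaire")

    return errors

sample = [{"id": 1, "q1": "yes"}, {"id": 2, "q1": "no"}]
print(structural_checks(sample, ["q1"], [1, 2]))  # []
```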
Data presentation
• Tabular method of data presentation
• Tables can be more easily understood than the same facts stated in running
text,
• They help to facilitate statistical treatment of data,
• They help to avoid repetition,
• Type of tables
• Simple (one way) table: shows one characteristic,
• Two way table: shows two characteristics,
• Higher order table: shows three or more characteristics
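A two-way table can be built directly from raw observations. A minimal sketch, with invented data, cross-tabulating two characteristics (sex and drink preference):

```python
from collections import Counter

# Raw observations: (sex, preference) pairs -- invented for illustration.
observations = [
    ("F", "tea"), ("F", "coffee"), ("M", "tea"),
    ("M", "tea"), ("F", "tea"), ("M", "coffee"),
]

# A two-way table shows two characteristics at once.
counts = Counter(observations)
rows, cols = ["F", "M"], ["tea", "coffee"]

print("sex  " + "  ".join(f"{c:>6}" for c in cols))
for r in rows:
    print(f"{r:<4} " + "  ".join(f"{counts[(r, c)]:>6}" for c in cols))
```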
Feature selection
• In machine learning, data mining, and statistics, feature selection, also
known as variable selection, attribute selection or variable subset
selection, is the process of selecting a subset of relevant features
(variables, predictors) for use in model construction.
• Feature selection techniques are used for several reasons:
• Simplification of models to make them easier to interpret by
researchers/users [1]
• The data used to build the final model usually come from multiple datasets.
• Therefore, three data sets are commonly used in different stages of the
creation of the model.
• A training dataset is a dataset of examples used for learning, that is, to
fit the parameters (e.g., weights) of, for example, a classifier; the model
sees and learns from this data.
• This simple procedure is complicated in practice by the fact that the
validation dataset's error may fluctuate during training, producing multiple
local minima.
• If a model fit to the training dataset also fits the test dataset well,
minimal overfitting has taken place.
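As a minimal sketch of feature selection, the example below drops features whose variance falls below a threshold, since near-constant features carry little information for model construction. The data, names, and threshold are illustrative; real pipelines use richer criteria such as correlation with the target:

```python
# Minimal sketch of one feature-selection technique: keep only features
# whose variance exceeds a threshold. Data and threshold are illustrative.

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def select_features(rows, names, threshold=0.1):
    columns = list(zip(*rows))  # one tuple of values per feature
    return [name for name, col in zip(names, columns)
            if variance(col) > threshold]

rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.1, 0.0],
]
print(select_features(rows, ["x1", "x2", "x3"]))  # ['x1'] -- x2, x3 near-constant
```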
Training, Testing and Validation Data Sets
• Better fitting of the training dataset as opposed to the test dataset
usually points to overfitting.
• A test set is therefore a set of examples used only to assess the
performance (i.e., generalization) of a fully specified classifier.
• (Figure) A training set (left) and a test set (right) from the same
statistical population are shown as blue points. Two predictive models are
fit to the training data; both fitted models are plotted with both the
training and test sets.
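The roles of the three data sets can be sketched as a simple split routine. This is an illustrative sketch; the 60/20/20 ratios are assumptions, not a recommendation:

```python
import random

def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    # Shuffle a copy so the split is random but reproducible.
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]                    # used to fit parameters
    val = shuffled[n_train:n_train + n_val]       # used to tune hyperparameters
    test = shuffled[n_train + n_val:]             # used only for final assessment
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```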
The Data Split Ratio
• Now that we know what these datasets do, we might be looking for
recommendations on how to split the dataset into train, validation and test
sets.
• Some models need substantial data to train upon, so in this case we would
optimize for larger training sets.
• Models with very few hyperparameters will be easy to validate and tune, so
we can probably reduce the size of the validation set; but if the model has
many hyperparameters, we would want a large validation set as well (although
we should also consider cross-validation).
• Researchers must eventually come full circle to their starting point – why
they conducted a research study in the first place and what they hoped to
discover – and relate their results to their initial concerns and questions.
Interpreting the data
• No research study can be perfect, and its imperfections inevitably cast at least
a hint of doubt on its findings. Good researchers know – and they also report –
the weaknesses along with the strengths of their research
Sampling and sampling process
• Types of sampling methods:
• Probability samples: simple random, systematic, stratified, cluster
• Non-probability samples: chunk, quota, judgment
Sampling and sampling process
• Probability sampling
• Subjects of the sample are chosen based on known probabilities
• Simple random sampling
• Every individual or item from the frame has an equal chance of being selected
• Selection may be with replacement or without replacement
• Samples obtained from table of random numbers or computer random number
generators
• Systematic sampling
• Decide on sample size: n
• Divide frame of N individuals into groups of k individuals: k=N/n
• Randomly select one individual from the 1st group
• Select every k-th individual thereafter
Sampling and sampling process
• Systematic sampling example: N = 64 and n = 8, so k = N/n = 8; randomly
select one individual from the first group of 8, then every 8th individual
thereafter.
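The steps above can be sketched in code; the frame and seed are illustrative:

```python
import random

def systematic_sample(frame, n, seed=0):
    # k = N / n; pick one start at random from the first group of k,
    # then take every k-th individual thereafter.
    k = len(frame) // n
    start = random.Random(seed).randrange(k)
    return [frame[start + i * k] for i in range(n)]

frame = list(range(64))                 # N = 64
sample = systematic_sample(frame, n=8)  # k = 8
print(len(sample))  # 8
```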
• Stratified samples
• Population divided into two or more groups according to some common characteristic
(mutually exclusive )
• Simple random sample selected from each group
• The two or more samples are combined into one
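A minimal sketch of stratified sampling as described above, assuming the strata are given as mutually exclusive groups; the group names and sizes are invented:

```python
import random

def stratified_sample(strata, n_per_stratum, seed=0):
    # Simple random sample from each mutually exclusive group,
    # then combine the samples into one.
    rng = random.Random(seed)
    combined = []
    for group in strata.values():
        combined.extend(rng.sample(group, n_per_stratum))
    return combined

strata = {"urban": list(range(0, 50)), "rural": list(range(50, 80))}
sample = stratified_sample(strata, n_per_stratum=5)
print(len(sample))  # 10
```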
Sampling and sampling process
• Cluster sampling
• Population divided into several “clusters,” each representative of the population
• Simple random sample selected from each
• The samples are combined into one
(Figure) Population divided into 4 clusters.
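A minimal sketch of cluster sampling as the slide describes it, with each cluster assumed representative of the population; the data are invented:

```python
import random

def cluster_sample(clusters, n_per_cluster, seed=0):
    # Each cluster is assumed representative of the whole population;
    # take a simple random sample from each and combine them into one.
    rng = random.Random(seed)
    combined = []
    for cluster in clusters:
        combined.extend(rng.sample(cluster, n_per_cluster))
    return combined

# Population of 40 divided into 4 clusters of 10 (illustrative).
clusters = [list(range(i, i + 10)) for i in range(0, 40, 10)]
print(len(cluster_sample(clusters, n_per_cluster=3)))  # 12
```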