
Research Design and Methodology

Data theories and processing

Gebeyehu B. (Dr. of Eng.) Associate Professor


gebeyehu2009@gmail.com

BDU: Bahir Dar Institute of Technology: Computing Faculty


General concepts of data
• Data are the core component of research, and they need good presentation (research design),
• Wrong data will lead to wrong conclusions; therefore, if data are taken from databases, check them for reliability,
• The two main ways of summarizing data are by using tables and charts or graphs.
• A table is the simplest way of summarizing a set of observations,
• A table has rows and columns containing data, which can be in the form of absolute numbers or percentages, or both.
  o Simplest way to summarize data
  o Data are presented as absolute numbers or percentages
• Charts and graphs are visual representations of numerical data and, if well designed, will convey the general patterns of the data.
  o Visual representation of data
  o Data are presented as absolute numbers or percentages
Theory of Data
• Data (singular: datum) are individual units of information,
• A datum describes a single quality or quantity of some object or phenomenon,
• In analytical processes, data are represented by variables, whereas in programming languages there are two stages to creating a variable:
  • creating the container and sticking an identifying label on it; this is called initialization,
  • putting a value into the variable, called assignment; variables are like little boxes or containers to put different things in.
• We can create an imaginary box with anything we want to put in it, and define/label that imaginary box (or variable) by any name we choose, with its corresponding value or meaning.

Theory of Data
• A variable is a reserved memory location in the computer's storage used to store values,
• Creating a variable means reserving some space in memory and giving it a distinct name or label that identifies its location in memory,
• The label is also used as a reference when interacting with the computer to access, edit, save, retrieve or store data,
• The variable creation and management process is flexible and straightforward, especially in Python,
• A variable can be created or declared simply by using a syntactically appropriate name and assigning a value to it with the assignment operator (=); the assigned value tells the language what type of data the variable contains, as sketched below.
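A minimal Python sketch of the two stages described above (the variable names are illustrative):

```python
# Assignment creates the variable and binds a value to it in one step;
# Python infers the type from the assigned value.
height_cm = 172        # int
subject_id = "S-001"   # str
weight_kg = 65.5       # float

# The same label can later be rebound to a new value.
weight_kg = 66.0
print(subject_id, height_cm, weight_kg)
```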
Theory of Data
• Data are measured, collected, reported, and analyzed, whereupon they can be visualized using graphs, images or other analysis tools,
• Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing,
• Raw data ("unprocessed data") is a collection of numbers or characters before it has been "cleaned" and corrected by researchers,
• Raw data needs to be corrected to remove outliers or obvious instrument or data entry errors,
• Data processing commonly occurs in stages, and the "processed data" from one stage may be considered the "raw data" of the next stage.
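As a small illustration of staged processing, a pandas sketch (the column name and sentinel value are hypothetical) that removes obvious entry errors, then treats the result as the raw data of an outlier-removal stage:

```python
import pandas as pd

# Hypothetical raw measurements; -999 is a made-up entry-error sentinel.
raw = pd.DataFrame({"height_cm": [172, 168, -999, 180, 1750, 165]})

# Stage 1: drop the sentinel values (obvious data entry errors).
stage1 = raw[raw["height_cm"] != -999]

# Stage 2: the "processed data" of stage 1 is the "raw data" here;
# remove values outside a plausible range (instrument errors/outliers).
stage2 = stage1[stage1["height_cm"].between(100, 250)]
print(stage2)
```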
Theory of Data
• Data sets
  • A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables,
  • where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question,
  • The data set lists values for each of the variables, such as the height and weight of an object, for each member of the data set,
  • Each value is known as a datum. Data sets can also consist of a collection of documents or files.
• In the open data discipline, the data set is the unit used to measure the information released in a public open data repository.
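A tiny tabular data set in pandas matching this description (the values are made up); each column is a variable, each row a record, and each cell a datum:

```python
import pandas as pd

dataset = pd.DataFrame(
    {"height_cm": [172, 168, 180],      # variable 1
     "weight_kg": [65.5, 59.0, 77.2]},  # variable 2
    index=["member_1", "member_2", "member_3"],  # one row per record
)
print(dataset)
```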
Techniques of data collection
• Basic approaches are:
  • Using available information: using data that has already been collected by others (tools: checklist, data compilation forms).
  • Observing: systematically selecting, watching and recording behavior and characteristics of people, objects or events (tools: eyes and ears, data compilation forms).
  • Interviewing: oral questioning of respondents, either individually or as a group (tools: interview guide, data compilation forms).
  • Administering written questionnaires: collecting data based on answers provided by respondents in written form (tools: survey, questionnaire).
  • Conducting focus groups: facilitating free discussions on specific topics with a selected group of participants (tools: flip charts).
Techniques of data collection

• Importance of combining different data collection techniques

Qualitative techniques (flexible):
• Produce qualitative data that is often recorded in narrative form
• Useful in answering the "why", "what", and "how" questions
• Typically include:
  – loosely structured interviews using open-ended questions
  – focus group discussions
  – observations

Quantitative techniques (less flexible):
• Structured questionnaires designed to quantify pre- or post-categorized answers to questions
• Useful in answering the "how many", "how often", "how significant", etc. questions
• Answers to questions can be counted and expressed numerically

• A skillful use of a combination of qualitative and quantitative techniques will give a more comprehensive understanding of the topic.
Process of data collection

• Criteria for evaluating secondary data
  • Specifications & methodology. Issues: data collection method, response rate, quality and analysis of data, sampling technique and size, questionnaire design, fieldwork. Remark: data should be reliable, valid, and generalizable to the problem.
  • Error & accuracy. Issues: examine errors in approach, research design, sampling, data collection and analysis, and reporting. Remark: assess accuracy by comparing data from different sources.
  • Currency. Issues: time lag between collection and publication, frequency of updates. Remark: census data are updated by syndicated firms.
  • Objective. Issue: why were the data collected? Remark: the objective determines the relevance of the data.
  • Nature. Issues: definition of key variables, units of measurement, categories used, relationships examined. Remark: reconfigure the data to increase their usefulness.
  • Dependability. Issues: expertise, credibility, reputation, and trustworthiness of the source. Remark: data should be obtained from an original source.

Process of data collection
• Cleaning data
  • Cleaning represents the set of final editing and imputation procedures used to enhance data quality and to prepare the data for analysis.
  • Cleaning is an extremely delicate process that, if not conducted properly, can seriously compromise the data collected and the whole analysis.
  • Cleaning can prove very beneficial, since it can:
    – improve, or at the very least retain, the quality of the data collected
    – make the data more user-friendly for analysis
    – increase the credibility of the data collected
  • Yet elaborate cleaning can seriously compromise the information collected, since it can:
    – significantly change the data collected
    – introduce errors into the final data
    – destroy evidence of poor-quality data
Process of data collection
• The data cleaning approach is based on two fundamental principles:
  • The integrity of the data collected is paramount (the principal concern).
  • Extensive imputation is not allowed (the 5% ceiling).
• Practical steps in cleaning data
  • Structural checks: verify the structural composition of the data file(s) and its correspondence to the questionnaire. These controls are conducted to ensure that (a sketch of a few such checks follows this list):
    • The data file(s) contain all the sections of the questionnaire
    • Each record has a unique ID
    • The IDs used correspond to the selected sample
    • Each variable has a unique label
    • Each label in the data corresponds to the labels in the paper questionnaire
    • All variables in the data set appear in the questionnaire and vice versa
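A minimal pandas sketch of a few of these structural checks (the file names, column names, and questionnaire variables are hypothetical):

```python
import pandas as pd

data = pd.read_csv("survey_responses.csv")           # hypothetical data file
sample_ids = set(pd.read_csv("sample.csv")["id"])    # hypothetical selected sample
questionnaire_vars = {"id", "age", "sex", "income"}  # from the paper questionnaire

# Each record has a unique ID.
assert data["id"].is_unique, "duplicate record IDs found"

# The IDs used correspond to the selected sample.
assert set(data["id"]) <= sample_ids, "IDs outside the selected sample"

# All variables in the data set appear in the questionnaire and vice versa.
assert set(data.columns) == questionnaire_vars, "variable/label mismatch"
```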

Data presentation
• Tabular method of data presentation
  • Tables can be more easily understood than the raw facts,
  • they help to facilitate statistical treatment of the data,
  • and they help to avoid repetition.

• Types of tables
  • Simple (one-way) table: shows one characteristic,
  • Two-way table: shows two characteristics,
  • Higher-order table: shows three or more characteristics.

• Frequency distribution: steps (a small sketch follows this list)
  • Begin by arranging the data from smallest to largest
  • Count values that repeat by making tallies
  • Group observations of comparable magnitude
  • Stop the classification when you are sure that the first and the last classes contain the smallest and largest values, respectively
  • Indicate how many values are included in each class
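A small Python sketch of the first two steps, using made-up observations:

```python
from collections import Counter

observations = [4, 7, 2, 7, 9, 4, 4, 11, 2, 7]

# Arrange the data from smallest to largest.
ordered = sorted(observations)

# Count values that repeat by making tallies.
tallies = Counter(ordered)
for value, count in sorted(tallies.items()):
    print(value, "|" * count, count)
```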
Data presentation
• Graphic methods of data presentation
  • Data in a frequency distribution can be presented graphically or diagrammatically,
  • Graphs are the natural choice for representing continuous data.

• For discrete or qualitative data, the data can be presented as a:
  • Pie chart (multiply each relative frequency by 360°),
  • Pictogram,
  • Bar graph, etc.

• For continuous data (a sketch follows below):
  • Histogram (class boundaries and absolute frequencies),
  • Frequency polygon (class marks and absolute frequencies),
  • Cumulative frequency graph (class marks and cumulative frequencies).
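For example, converting relative frequencies to pie-chart angles and drawing a histogram with matplotlib (the data are made up):

```python
import matplotlib.pyplot as plt

# Pie chart: multiply each relative frequency by 360 to get the slice angle.
counts = {"A": 20, "B": 30, "C": 50}
total = sum(counts.values())
for category, n in counts.items():
    print(category, "angle:", n / total * 360, "degrees")

# Histogram for continuous data.
heights = [160, 172, 168, 180, 175, 165, 158, 170, 177, 169]
plt.hist(heights, bins=5)
plt.xlabel("Height (cm)")
plt.ylabel("Absolute frequency")
plt.show()
```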

Feature selection
• In machine learning, data mining, and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction,
• Feature selection techniques are used for several reasons:
  • simplification of models to make them easier to interpret by researchers/users [1],
  • shorter training times,
  • avoiding the curse of dimensionality,
  • enhanced generalization by reducing overfitting (formally, reduction of variance).
• The central premise when using a feature selection technique is that the data contains some features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information.
Feature selection
• Feature selection techniques should be distinguished from feature extraction:
• Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features,
• Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points).
• A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets, as in the sketch below.
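One concrete instance is univariate selection, sketched here with scikit-learn (assuming scikit-learn is installed; the choice of the ANOVA F-score and k=2 is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Evaluation measure: ANOVA F-score of each feature against the target;
# the "search" here is simply keeping the k highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("original features:", X.shape[1])           # 4
print("selected features:", X_selected.shape[1])  # 2
print("feature scores:", selector.scores_)
```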



Training, Testing and Validation Data Sets
• This is meant as a short primer for anyone who needs to know the difference between the various dataset splits used while training machine learning / data mining models.
• In ML/DM, a common task is the study and construction of algorithms that can learn from and make predictions on data,
• Such algorithms work by making data-driven predictions or decisions through building a mathematical model from input data,
• The data used to build the final model usually comes from multiple datasets.
• In particular, three data sets are commonly used in different stages of the creation of the model.
• The model is initially fit on a training dataset, that is, a set of examples used to fit the parameters (e.g. the weights of connections between neurons in artificial neural networks) of the model.
Training, Testing and Validation Data Sets
• The training dataset: the sample of data used to fit the model. Training data is labeled data used to train the machine learning algorithms and increase accuracy.
• How do we make machines intelligent?
  • The answer to this question: feed them relevant data. This is also referred to as training data.
• The model (e.g. a neural net or a naive Bayes classifier) is trained on the training dataset using a supervised learning method (e.g. gradient descent or stochastic gradient descent),
• In practice, the training dataset often consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar), which is commonly denoted as the target (or label).
Training, Testing and Validation Data Sets
• The training set is the actual dataset that we use to train the model (the weights and biases, in the case of a neural network),
• The model sees and learns from this data: a training dataset is a dataset of examples used for learning, that is, to fit the parameters (e.g. weights) of, for example, a classifier.
• The test dataset is a dataset used to provide an unbiased evaluation of a final model fit on the training dataset,
• The test set is a set of observations used to evaluate the performance of the model using some performance metric; it is important that no observations from the training set are included in the test set,
• If the data in the test dataset has never been used in training (for example in cross-validation), the test dataset is also called a holdout dataset.
• In cross-validation, a dataset is repeatedly split into a training dataset and a holdout dataset; the holdout dataset is the part of the original dataset that is set aside and used as a test set.
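A scikit-learn sketch of a holdout split and of cross-validation (assuming scikit-learn; the 80/20 ratio and the classifier are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Holdout: no observation from the training set appears in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Cross-validation: repeatedly split the training data into train/holdout parts.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("cross-validation accuracies:", scores)
```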
Training, Testing and Validation Data Sets
• Subsequently, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset,
• The validation dataset provides an unbiased evaluation of a model fit on the training dataset while tuning the model's hyperparameters (e.g. the number of hidden units in a neural network),
• Validation datasets can be used for regularization by early stopping: stop training when the error on the validation dataset increases, as this is a sign of overfitting to the training dataset,
• This simple procedure is complicated in practice by the fact that the validation dataset's error may fluctuate during training, producing multiple local minima.
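A minimal sketch of early stopping with a patience counter (to tolerate the fluctuating validation error just mentioned); train_step and validation_error are hypothetical stand-ins for a real training loop:

```python
import random

def train_step():
    pass  # hypothetical: one pass over the training data

def validation_error():
    return random.random()  # hypothetical: error measured on the validation set

best_error = float("inf")
patience, bad_epochs = 5, 0

for epoch in range(1000):
    train_step()
    err = validation_error()
    if err < best_error:
        best_error, bad_epochs = err, 0  # validation error improved
    else:
        bad_epochs += 1  # validation error increased: possible overfitting
        if bad_epochs >= patience:
            break  # early stopping
```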

Training, Testing and Validation Data Sets
• Visualization of the splits

• A test dataset is a dataset that is independent of the training dataset, but that follows the same probability distribution as the training dataset.
• If a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place.
Training, Testing and Validation Data Sets
• Better fitting of the training dataset as opposed to the test dataset usually points to overfitting,
• A test set is therefore a set of examples used only to assess the performance (i.e. generalization) of a fully specified classifier.
• (Figure: a training set (left) and a test set (right) from the same statistical population, shown as blue points; two predictive models are fit to the training data, and both fitted models are plotted with both the training and test sets.)
The Data Split Ratio
• Now that we know what these datasets do, we might be looking for recommendations on how to split the dataset into train, validation and test sets (a sketch of one common split follows below).
• This mainly depends on two things:
  • first, the total number of samples in the data, and
  • second, the actual model we are training.
• Some models need substantial data to train on, so in this case we would optimize for larger training sets,
• Models with very few hyperparameters will be easy to validate and tune, so we can probably reduce the size of the validation set; but if the model has many hyperparameters, we would want a large validation set as well (although we should also consider cross-validation).
• Also, if we happen to have a model with no hyperparameters, or ones that cannot be easily tuned, we probably don't need a validation set at all!
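A common pattern is a two-step split, e.g. 60/20/20 (the ratio is illustrative; assuming scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set (20% of the total), then split the rest
# into training (60% of the total) and validation (20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```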
The Data Split Ratio
• Should training data always be more than testing data?
• The answer is intuitive: it takes multiple inputs of a similar kind for the model to identify the common underlying patterns,
• whereas this ability to identify patterns can be tested quite well using fewer samples.
• In general, data can be described in terms of instances, features, and train-test datasets.
• Instance: a single row of data is called an instance. It is an observation from the domain.


Interpreting the data
• Interpreting the data means several things. In particular, it means:
• Relating the findings to the original research problem and to the specific research questions and hypotheses.
  • Researchers must eventually come full circle to their starting point – why they conducted a research study in the first place and what they hoped to discover – and relate their results to their initial concerns and questions.
• Relating the findings to preexisting literature, concepts, theories, and research studies.
  • To be useful, research findings must in some way be connected to the larger picture – to what people already know or believe about the topic in question.
  • Perhaps the new findings confirm a current theoretical perspective, perhaps they cast doubt on common "knowledge", or perhaps they simply raise new questions that must be addressed before we can truly understand the phenomenon in question.
Interpreting the data

• Determining whether the findings have practical significance as well as statistical significance.
  • Statistical significance is one thing; practical significance – whether the findings are actually useful – is something else altogether.
• Identifying limitations of the study. Finally, interpreting the data involves outlining the weaknesses of the study that yielded them.
  • No research study can be perfect, and its imperfections inevitably cast at least a hint of doubt on its findings. Good researchers know – and they also report – the weaknesses along with the strengths of their research.

Sampling and sampling process
• Types of sampling methods:
  • Probability samples: simple random, systematic, stratified, cluster.
  • Non-probability samples: judgment, quota, chunk.
Sampling and sampling process
• Probability sampling
  • Subjects of the sample are chosen based on known probabilities.
• Simple random sampling
  • Every individual or item in the frame has an equal chance of being selected.
  • Selection may be with replacement or without replacement.
  • Samples can be obtained from tables of random numbers or computer random-number generators.
• Systematic sampling (a sketch follows the worked example below)
  • Decide on the sample size: n
  • Divide the frame of N individuals into groups of k individuals: k = N/n
  • Randomly select one individual from the 1st group
  • Select every k-th individual thereafter

Sampling and sampling process
• Systematic sampling, worked example: N = 64 and n = 8, so k = N/n = 8; randomly select one individual from the first group of 8, then take every 8th individual thereafter.
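A minimal Python sketch of this worked example:

```python
import random

N, n = 64, 8
k = N // n  # k = 8

frame = list(range(1, N + 1))  # sampling frame of N individuals
start = random.randrange(k)    # random individual from the first group
sample = [frame[start + i * k] for i in range(n)]

print(sample)  # e.g. [3, 11, 19, 27, 35, 43, 51, 59]
```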

• Stratified samples
  • The population is divided into two or more groups according to some common characteristic (mutually exclusive).
  • A simple random sample is selected from each group.
  • The two or more samples are combined into one.

Sampling and sampling process
• Cluster sampling
  • The population is divided into several "clusters," each representative of the population.
  • A simple random sample is selected from each cluster.
  • The samples are combined into one.
  • (Figure: population divided into 4 clusters.)

• Multi-stage sampling
  • This is where the researcher divides the population into strata, samples the strata, and then resamples,
  • repeating the process until the ultimate sampling units are selected.

