
Research Design and Methodology

Data theories and processing

Gebeyehu B. (Dr. of Eng.) Associate Professor


gebeyehu2009@gmail.com

BDU: Bahir Dar Institute of Technology: Computing Faculty


General concepts of data
• Data are the core component of research, and they need good presentation (research design),
• Wrong data will lead to wrong conclusions; therefore, if data are taken from databases, check them for reliability,
• The two main ways of summarizing data are by using tables and charts or graphs.
• A table is the simplest way of summarizing a set of observations,
• A table has rows and columns containing data, which can be in the form of absolute numbers or percentages, or both.
  o Simplest way to summarize data
  o Data are presented as absolute numbers or percentages
• Charts and graphs are visual representations of numerical data and, if well designed, will convey the general patterns of the data.
  o Visual representation of data
  o Data are presented as absolute numbers or percentages
Theory of Data
• Data (singular: datum) are individual units of information,
• A datum describes a single quality or quantity of some object or phenomenon,
• In analytical processes, data are represented by variables, whereas in programming languages there are two stages to creating a variable:
  • creating the container and sticking an identifying label on it; this is called initialization,
  • putting a value into the variable, called assignment; variables are like little boxes or containers to put different things in.
• We can create an imaginary box with anything we want to put in it, and define/label that imaginary box (or variable) by any name we choose, with its corresponding value or meaning.

Theory of Data
• A variable is a reserved memory location in the computer's storage used to store values,
• Creating a variable means reserving some space in memory and giving it a distinct name or label that identifies its location in memory,
• The label is also used as a reference when interacting with the computer to access, edit, save, retrieve or store data,
• The variable creation and management process is flexible and straightforward, especially in Python,
• A variable can be created or declared simply by using a syntactically appropriate name and assigning a value to it with the assignment operator (=); the assigned value tells the language what type of data the variable contains, as sketched below.
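A minimal Python sketch of the two stages described above (the variable names are illustrative):

```python
# Assignment creates the variable and binds a value to it in one step;
# Python infers the type from the assigned value.
height_cm = 172        # int
subject_id = "S-001"   # str
weight_kg = 65.5       # float

# The same label can later be rebound to a new value.
weight_kg = 66.0
print(subject_id, height_cm, weight_kg)
```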
Theory of Data
• Data are measured, collected, reported, and analyzed, whereupon they can be visualized using graphs, images or other analysis tools,
• Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing,
• Raw data ("unprocessed data") is a collection of numbers or characters before it has been "cleaned" and corrected by researchers,
• Raw data needs to be corrected to remove outliers or obvious instrument or data entry errors,
• Data processing commonly occurs in stages, and the "processed data" from one stage may be considered the "raw data" of the next stage.
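As a small illustration of staged processing, a pandas sketch (the column name and sentinel value are hypothetical) that removes obvious entry errors, then treats the result as the raw data of an outlier-removal stage:

```python
import pandas as pd

# Hypothetical raw measurements; -999 is a made-up entry-error sentinel.
raw = pd.DataFrame({"height_cm": [172, 168, -999, 180, 1750, 165]})

# Stage 1: drop the sentinel values (obvious data entry errors).
stage1 = raw[raw["height_cm"] != -999]

# Stage 2: the "processed data" of stage 1 is the "raw data" here;
# remove values outside a plausible range (instrument errors/outliers).
stage2 = stage1[stage1["height_cm"].between(100, 250)]
print(stage2)
```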
Theory of Data
• Data sets
  • A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables,
  • where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question,
  • The data set lists values for each of the variables, such as the height and weight of an object, for each member of the data set,
  • Each value is known as a datum. Data sets can also consist of a collection of documents or files.
• In the open data discipline, the data set is the unit used to measure the information released in a public open data repository.
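A tiny tabular data set in pandas matching this description (the values are made up); each column is a variable, each row a record, and each cell a datum:

```python
import pandas as pd

dataset = pd.DataFrame(
    {"height_cm": [172, 168, 180],      # variable 1
     "weight_kg": [65.5, 59.0, 77.2]},  # variable 2
    index=["member_1", "member_2", "member_3"],  # one row per record
)
print(dataset)
```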
Techniques of data collection
• Basic approaches are:
  • Using available information: using data that has already been collected by others (tools: checklist, data compilation forms).
  • Observing: systematically selecting, watching and recording behavior and characteristics of people, objects or events (tools: eyes and ears, data compilation forms).
  • Interviewing: oral questioning of respondents, either individually or as a group (tools: interview guide, data compilation forms).
  • Administering written questionnaires: collecting data based on answers provided by respondents in written form (tools: survey, questionnaire).
  • Conducting focus groups: facilitating free discussions on specific topics with a selected group of participants (tools: flip charts).
Techniques of data collection

• Importance of combining different data collection techniques

Qualitative techniques (flexible):
• Produce qualitative data that is often recorded in narrative form
• Useful in answering the "why", "what", and "how" questions
• Typically include:
  – loosely structured interviews using open-ended questions
  – focus group discussions
  – observations

Quantitative techniques (less flexible):
• Structured questionnaires designed to quantify pre- or post-categorized answers to questions
• Useful in answering the "how many", "how often", "how significant", etc. questions
• Answers to questions can be counted and expressed numerically

• A skillful use of a combination of qualitative and quantitative techniques will give a more comprehensive understanding of the topic.
Process of data collection

• Criteria for evaluating secondary data
  • Specifications & methodology. Issues: data collection method, response rate, quality and analysis of data, sampling technique and size, questionnaire design, fieldwork. Remark: data should be reliable, valid, and generalizable to the problem.
  • Error & accuracy. Issues: examine errors in approach, research design, sampling, data collection and analysis, and reporting. Remark: assess accuracy by comparing data from different sources.
  • Currency. Issues: time lag between collection and publication, frequency of updates. Remark: census data are updated by syndicated firms.
  • Objective. Issue: why were the data collected? Remark: the objective determines the relevance of the data.
  • Nature. Issues: definition of key variables, units of measurement, categories used, relationships examined. Remark: reconfigure the data to increase their usefulness.
  • Dependability. Issues: expertise, credibility, reputation, and trustworthiness of the source. Remark: data should be obtained from an original source.

Process of data collection
• Cleaning data
  • Cleaning represents the set of final editing and imputation procedures used to enhance data quality and to prepare the data for analysis.
  • Cleaning is an extremely delicate process that, if not conducted properly, can seriously compromise the data collected and the whole analysis.
  • Cleaning can prove very beneficial, since it can:
    – improve, or at the very least retain, the quality of the data collected
    – make the data more user-friendly for analysis
    – increase the credibility of the data collected
  • Yet elaborate cleaning can seriously compromise the information collected, since it can:
    – significantly change the data collected
    – introduce errors into the final data
    – destroy evidence of poor-quality data
Process of data collection
• The data cleaning approach is based on two fundamental principles:
  • The integrity of the data collected is paramount (the principal concern).
  • Extensive imputation is not allowed (the 5% ceiling).
• Practical steps in cleaning data
  • Structural checks: verify the structural composition of the data file(s) and its correspondence to the questionnaire. These controls are conducted to ensure that (a sketch of a few such checks follows this list):
    • The data file(s) contain all the sections of the questionnaire
    • Each record has a unique ID
    • The IDs used correspond to the selected sample
    • Each variable has a unique label
    • Each label in the data corresponds to the labels in the paper questionnaire
    • All variables in the data set appear in the questionnaire and vice versa
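A minimal pandas sketch of a few of these structural checks (the file names, column names, and questionnaire variables are hypothetical):

```python
import pandas as pd

data = pd.read_csv("survey_responses.csv")           # hypothetical data file
sample_ids = set(pd.read_csv("sample.csv")["id"])    # hypothetical selected sample
questionnaire_vars = {"id", "age", "sex", "income"}  # from the paper questionnaire

# Each record has a unique ID.
assert data["id"].is_unique, "duplicate record IDs found"

# The IDs used correspond to the selected sample.
assert set(data["id"]) <= sample_ids, "IDs outside the selected sample"

# All variables in the data set appear in the questionnaire and vice versa.
assert set(data.columns) == questionnaire_vars, "variable/label mismatch"
```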

Data presentation
• Tabular method of data presentation
  • Tables can be more easily understood than the raw facts,
  • they help to facilitate statistical treatment of the data,
  • and they help to avoid repetition.

• Types of tables
  • Simple (one-way) table: shows one characteristic,
  • Two-way table: shows two characteristics,
  • Higher-order table: shows three or more characteristics.

• Frequency distribution: steps (a small sketch follows this list)
  • Begin by arranging the data from smallest to largest
  • Count values that repeat by making tallies
  • Group observations of comparable magnitude
  • Stop the classification when you are sure that the first and the last classes contain the smallest and largest values, respectively
  • Indicate how many values are included in each class
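A small Python sketch of the first two steps, using made-up observations:

```python
from collections import Counter

observations = [4, 7, 2, 7, 9, 4, 4, 11, 2, 7]

# Arrange the data from smallest to largest.
ordered = sorted(observations)

# Count values that repeat by making tallies.
tallies = Counter(ordered)
for value, count in sorted(tallies.items()):
    print(value, "|" * count, count)
```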
Data presentation
• Graphic methods of data presentation
  • Data in a frequency distribution can be presented graphically or diagrammatically,
  • Graphs are the natural choice for representing continuous data.

• For discrete or qualitative data, the data can be presented as a:
  • Pie chart (multiply each relative frequency by 360°),
  • Pictogram,
  • Bar graph, etc.

• For continuous data (a sketch follows below):
  • Histogram (class boundaries and absolute frequencies),
  • Frequency polygon (class marks and absolute frequencies),
  • Cumulative frequency graph (class marks and cumulative frequencies).
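For example, converting relative frequencies to pie-chart angles and drawing a histogram with matplotlib (the data are made up):

```python
import matplotlib.pyplot as plt

# Pie chart: multiply each relative frequency by 360 to get the slice angle.
counts = {"A": 20, "B": 30, "C": 50}
total = sum(counts.values())
for category, n in counts.items():
    print(category, "angle:", n / total * 360, "degrees")

# Histogram for continuous data.
heights = [160, 172, 168, 180, 175, 165, 158, 170, 177, 169]
plt.hist(heights, bins=5)
plt.xlabel("Height (cm)")
plt.ylabel("Absolute frequency")
plt.show()
```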

Feature selection
• In machine learning, data mining, and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction,
• Feature selection techniques are used for several reasons:
  • simplification of models to make them easier to interpret by researchers/users [1],
  • shorter training times,
  • avoiding the curse of dimensionality,
  • enhanced generalization by reducing overfitting (formally, reduction of variance).
• The central premise when using a feature selection technique is that the data contains some features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information.
Feature selection
• Feature selection techniques should be distinguished from feature extraction:
• Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features,
• Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points).
• A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets, as in the sketch below.
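One concrete instance is univariate selection, sketched here with scikit-learn (assuming scikit-learn is installed; the choice of the ANOVA F-score and k=2 is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Evaluation measure: ANOVA F-score of each feature against the target;
# the "search" here is simply keeping the k highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("original features:", X.shape[1])           # 4
print("selected features:", X_selected.shape[1])  # 2
print("feature scores:", selector.scores_)
```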



Training, Testing and Validation Data Sets
• This is meant as a short primer for anyone who needs to know the difference between the various dataset splits used while training machine learning / data mining models.
• In ML/DM, a common task is the study and construction of algorithms that can learn from and make predictions on data,
• Such algorithms work by making data-driven predictions or decisions through building a mathematical model from input data,
• The data used to build the final model usually comes from multiple datasets.
• In particular, three data sets are commonly used in different stages of the creation of the model.
• The model is initially fit on a training dataset, that is, a set of examples used to fit the parameters (e.g. the weights of connections between neurons in artificial neural networks) of the model.
Training, Testing and Validation Data Sets
• The training dataset: the sample of data used to fit the model. Training data is labeled data used to train the machine learning algorithms and increase accuracy.
• How do we make machines intelligent?
  • The answer to this question: feed them relevant data. This is also referred to as training data.
• The model (e.g. a neural net or a naive Bayes classifier) is trained on the training dataset using a supervised learning method (e.g. gradient descent or stochastic gradient descent),
• In practice, the training dataset often consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar), which is commonly denoted as the target (or label).
Training, Testing and Validation Data Sets
• The training set is the actual dataset that we use to train the model (the weights and biases, in the case of a neural network),
• The model sees and learns from this data: a training dataset is a dataset of examples used for learning, that is, to fit the parameters (e.g. weights) of, for example, a classifier.
• The test dataset is a dataset used to provide an unbiased evaluation of a final model fit on the training dataset,
• The test set is a set of observations used to evaluate the performance of the model using some performance metric; it is important that no observations from the training set are included in the test set,
• If the data in the test dataset has never been used in training (for example in cross-validation), the test dataset is also called a holdout dataset.
• In cross-validation, a dataset is repeatedly split into a training dataset and a holdout dataset; the holdout dataset is the part of the original dataset that is set aside and used as a test set.
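A scikit-learn sketch of a holdout split and of cross-validation (assuming scikit-learn; the 80/20 ratio and the classifier are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Holdout: no observation from the training set appears in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Cross-validation: repeatedly split the training data into train/holdout parts.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("cross-validation accuracies:", scores)
```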
Training, Testing and Validation Data Sets
• Subsequently, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset,
• The validation dataset provides an unbiased evaluation of a model fit on the training dataset while tuning the model's hyperparameters (e.g. the number of hidden units in a neural network),
• Validation datasets can be used for regularization by early stopping: stop training when the error on the validation dataset increases, as this is a sign of overfitting to the training dataset,
• This simple procedure is complicated in practice by the fact that the validation dataset's error may fluctuate during training, producing multiple local minima.
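A minimal sketch of early stopping with a patience counter (to tolerate the fluctuating validation error just mentioned); train_step and validation_error are hypothetical stand-ins for a real training loop:

```python
import random

def train_step():
    pass  # hypothetical: one pass over the training data

def validation_error():
    return random.random()  # hypothetical: error measured on the validation set

best_error = float("inf")
patience, bad_epochs = 5, 0

for epoch in range(1000):
    train_step()
    err = validation_error()
    if err < best_error:
        best_error, bad_epochs = err, 0  # validation error improved
    else:
        bad_epochs += 1  # validation error increased: possible overfitting
        if bad_epochs >= patience:
            break  # early stopping
```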

Training, Testing and Validation Data Sets
• Visualization of the splits

• A test dataset is a dataset that is independent of the training dataset, but that follows the same probability distribution as the training dataset.
• If a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place.
Training, Testing and Validation Data Sets
• Better fitting of the training dataset as opposed to the test dataset usually points to overfitting,
• A test set is therefore a set of examples used only to assess the performance (i.e. generalization) of a fully specified classifier.
• (Figure: a training set (left) and a test set (right) from the same statistical population, shown as blue points; two predictive models are fit to the training data, and both fitted models are plotted with both the training and test sets.)
The Data Split Ratio
• Now that we know what these datasets do, we might be looking for recommendations on how to split the dataset into train, validation and test sets (a sketch of one common split follows below).
• This mainly depends on two things:
  • first, the total number of samples in the data, and
  • second, the actual model we are training.
• Some models need substantial data to train on, so in this case we would optimize for larger training sets,
• Models with very few hyperparameters will be easy to validate and tune, so we can probably reduce the size of the validation set; but if the model has many hyperparameters, we would want a large validation set as well (although we should also consider cross-validation).
• Also, if we happen to have a model with no hyperparameters, or ones that cannot be easily tuned, we probably don't need a validation set at all!
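A common pattern is a two-step split, e.g. 60/20/20 (the ratio is illustrative; assuming scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set (20% of the total), then split the rest
# into training (60% of the total) and validation (20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```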
The Data Split Ratio
• Should training data always be more than testing data?
• The answer is intuitive: it takes multiple inputs of a similar kind for the model to identify the common underlying patterns,
• whereas this ability to identify patterns can be tested quite well using fewer samples.
• In general, data can be described in terms of instances, features, and train-test datasets.
• Instance: a single row of data is called an instance. It is an observation from the domain.


Interpreting the data
• Interpreting the data means several things. In particular, it means:
• Relating the findings to the original research problem and to the specific research questions and hypotheses.
  • Researchers must eventually come full circle to their starting point – why they conducted a research study in the first place and what they hoped to discover – and relate their results to their initial concerns and questions.
• Relating the findings to preexisting literature, concepts, theories, and research studies.
  • To be useful, research findings must in some way be connected to the larger picture – to what people already know or believe about the topic in question.
  • Perhaps the new findings confirm a current theoretical perspective, perhaps they cast doubt on common "knowledge", or perhaps they simply raise new questions that must be addressed before we can truly understand the phenomenon in question.
Interpreting the data

• Determining whether the findings have practical significance as well as statistical significance.
  • Statistical significance is one thing; practical significance – whether the findings are actually useful – is something else altogether.
• Identifying limitations of the study. Finally, interpreting the data involves outlining the weaknesses of the study that yielded them.
  • No research study can be perfect, and its imperfections inevitably cast at least a hint of doubt on its findings. Good researchers know – and they also report – the weaknesses along with the strengths of their research.

Sampling and sampling process
• Types of sampling methods:
  • Probability samples: simple random, systematic, stratified, cluster.
  • Non-probability samples: judgment, quota, chunk.
Sampling and sampling process
• Probability sampling
  • Subjects of the sample are chosen based on known probabilities.
• Simple random sampling
  • Every individual or item in the frame has an equal chance of being selected.
  • Selection may be with replacement or without replacement.
  • Samples can be obtained from tables of random numbers or computer random-number generators.
• Systematic sampling (a sketch follows the worked example below)
  • Decide on the sample size: n
  • Divide the frame of N individuals into groups of k individuals: k = N/n
  • Randomly select one individual from the 1st group
  • Select every k-th individual thereafter

Sampling and sampling process
• Systematic sampling, worked example: N = 64 and n = 8, so k = N/n = 8; randomly select one individual from the first group of 8, then take every 8th individual thereafter.
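A minimal Python sketch of this worked example:

```python
import random

N, n = 64, 8
k = N // n  # k = 8

frame = list(range(1, N + 1))  # sampling frame of N individuals
start = random.randrange(k)    # random individual from the first group
sample = [frame[start + i * k] for i in range(n)]

print(sample)  # e.g. [3, 11, 19, 27, 35, 43, 51, 59]
```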

• Stratified samples
  • The population is divided into two or more groups according to some common characteristic (mutually exclusive).
  • A simple random sample is selected from each group.
  • The two or more samples are combined into one.

Sampling and sampling process
• Cluster sampling
  • The population is divided into several "clusters," each representative of the population.
  • A simple random sample is selected from each cluster.
  • The samples are combined into one.
  • (Figure: population divided into 4 clusters.)

• Multi-stage sampling
  • This is where the researcher divides the population into strata, samples the strata, and then resamples,
  • repeating the process until the ultimate sampling units are selected.

