
Unit 2

MBA/BBA/B.Com/BCA/UGC NET

By
Dr. Anand Vyas
Data Collection
Data Collection Process
Step 1: Identify issues and/or opportunities for collecting data.
Step 2: Select issue(s) and/or opportunity(ies) and set goals.
Step 3: Plan an approach and methods.
Step 4: Collect data.
Step 5: Analyze and interpret data.
Step 6: Act on results.
Data Classification
• Data classification is broadly defined as the process of organizing data by relevant categories so that it may be used and protected more efficiently. On a basic level, the classification process makes data easier to locate and retrieve.
Data Classification
• Qualitative Classification
• Quantitative Classification
• Data Tables or Tabular Presentation
Data Management
• Meaning
• Data management is the process of ingesting, storing, organizing and
maintaining the data created and collected by an organization. Effective data
management is a crucial piece of deploying the IT systems that run business
applications and provide analytical information to help drive operational
decision-making and strategic planning by corporate executives, business
managers and other end users.
• Need for data management
• Data increasingly is seen as a corporate asset that can be used to make more-
informed business decisions, improve marketing campaigns, optimize business
operations and reduce costs, all with the goal of increasing revenue and
profits. But a lack of proper data management can saddle organizations with
incompatible data silos, inconsistent data sets and data quality problems that
limit their ability to run business intelligence (BI) and analytics applications —
or, worse, lead to faulty findings.
Big Data Management, Organization/Sources of Data
• Big data management is the organization,
administration and governance of large
volumes of both structured and unstructured
data.
• Big data management refers to the efficient
handling, organization or use of large volumes
of structured and unstructured data belonging
to an organization.
• CHARACTERISTICS
• Variety: To the existing landscape of transactional and demographic data
such as phone numbers and addresses, information in the form of
photographs, audio streams, video, and a host of other formats now
contributes to a multiplicity of data types, about 80% of which are
completely unstructured.
• Volume: This trait refers to the immense amounts of information
generated every second via social media, cell phones, cars, transactions,
connected sensors, images, video, and text. In petabytes, terabytes, or
even zettabytes, these volumes can only be managed by big data
technologies.
• Velocity: Information is streaming into data repositories at a prodigious
rate, and this characteristic alludes to the speed of data accumulation. It
also refers to the speed with which big data can be processed and
analyzed to extract the insights and patterns it contains. These days, that
speed is often real-time.
• Organization/sources of data: Primary data (collected first-hand for the purpose at hand) and Secondary data (obtained from existing, already published sources).
Importance of data quality
• Quality data is key to making accurate, informed decisions. And while all
data has some level of “quality,” a variety of characteristics and factors
determines the degree of data quality (high-quality versus low-quality).
Furthermore, different data quality characteristics will likely be more
important to various stakeholders across the organization.
• Data quality characteristics and dimensions include the following (a completeness sketch follows this list):
• Completeness
• Accuracy
• Consistency
• Reasonability
• Integrity
• Timeliness
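A minimal sketch of how one of these dimensions, completeness, might be measured for a tabular data set in pandas (the data frame and column names are illustrative assumptions, not from the notes):

```python
import pandas as pd

# Hypothetical customer records with some gaps (illustrative only)
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "email":       ["a@x.com", None, "c@x.com", None, "e@x.com"],
    "city":        ["Pune", "Delhi", None, "Mumbai", "Indore"],
})

# Completeness per column: share of non-missing values
completeness = df.notna().mean().round(2)
print(completeness)
# customer_id    1.0
# email          0.6
# city           0.8
```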
Dealing with noisy data
• Noisy data is meaningless data. It includes any
data that cannot be understood and
interpreted correctly by machines, such as
unstructured text. Noisy data unnecessarily
increases the amount of storage space
required and can also adversely affect the
results of any data mining analysis.
• 1. Binning
• Binning is a technique in which we sort the data and then partition it into equal-frequency bins. Each value may then be replaced with the bin mean, the bin median, or the nearest bin boundary. This smooths the data and reduces the effect of noise.
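A minimal sketch of equal-frequency binning in Python (the sample values and the three-bin split are illustrative assumptions, not from the notes):

```python
import pandas as pd

# Hypothetical noisy measurements (illustrative values only)
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

n_bins = 3
# Equal-frequency (equal-depth) bins: each bin holds roughly the same number of points
bins = pd.qcut(values, q=n_bins)

# Smoothing by bin means: replace every value with the mean of its bin
smoothed_by_mean = values.groupby(bins).transform("mean")

# Smoothing by bin boundaries: replace every value with the nearer bin edge
def to_boundary(x, interval):
    return interval.left if abs(x - interval.left) < abs(x - interval.right) else interval.right

smoothed_by_boundary = [to_boundary(x, b) for x, b in zip(values, bins)]

print(smoothed_by_mean.tolist())
print(smoothed_by_boundary)
```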

• 2. Regression
• Regression is used to smooth the data and to handle noise when unnecessary variation is present. For analysis purposes, regression also helps decide which variables are suitable.
• A regression is a statistical technique that relates a dependent variable to one or more independent (explanatory) variables.
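A minimal sketch of regression-based smoothing, assuming a simple linear relationship between one explanatory variable and a noisy dependent variable (synthetic data, not from the notes):

```python
import numpy as np

# Hypothetical noisy observations of a roughly linear relationship (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)  # true line plus noise

# Fit a simple linear regression (degree-1 polynomial) of y on x
slope, intercept = np.polyfit(x, y, deg=1)

# Smoothed values: replace each noisy y with the value predicted by the fitted line
y_smoothed = slope * x + intercept

print(f"fitted line: y = {slope:.2f} x + {intercept:.2f}")
```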

• 3. Clustering
• Clustering is used for grouping the data and also for finding outliers; it is generally applied in unsupervised learning (see the clustering-based sketch after the next point).

• 4. Outlier Analysis
• Outliers may be detected by clustering, where similar or close values are organized into the same groups or clusters. Values that fall far from any cluster may be considered noise or outliers. Outliers are extreme values that deviate markedly from the other observations in the data.
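A minimal sketch of clustering-based outlier detection with scikit-learn's KMeans (the synthetic points and the mean-plus-three-standard-deviations cut-off are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points: two tight groups plus one far-away point (illustrative only)
rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=(0, 0), scale=0.5, size=(20, 2))
cluster_b = rng.normal(loc=(5, 5), scale=0.5, size=(20, 2))
points = np.vstack([cluster_a, cluster_b, [[12.0, -3.0]]])  # last row is the outlier

# Group the points into two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Distance of every point to its own cluster centre
centres = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(points - centres, axis=1)

# Flag points that sit unusually far from their cluster centre as outliers
threshold = distances.mean() + 3 * distances.std()
print("outlier rows:", np.where(distances > threshold)[0])
```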
Dealing with missing or incomplete data
• Listwise or case deletion
• Pairwise deletion
• Mean substitution: Mean imputation (MI) is one such method in which the mean of the observed values for each variable is computed and the missing values for that variable are imputed by this mean (see the sketch after this list).
• Regression imputation: Regression imputation fits a statistical model on a variable with missing values. Predictions of this regression model are used to substitute the missing values in this variable.
• Last observation carried forward
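A minimal sketch of three of these approaches in pandas, using a small made-up data frame (the column names and values are illustrative assumptions, not from the notes):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values (illustrative only)
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan, 29],
    "income": [32000, 45000, np.nan, 52000, 38000, 41000],
})

# Listwise (case) deletion: drop every row that has any missing value
listwise = df.dropna()

# Mean substitution: replace each missing value with the column mean
mean_imputed = df.fillna(df.mean())

# Last observation carried forward: fill a gap with the previous observed value
locf = df.ffill()

print(mean_imputed)
```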
Outlier Analysis, Methods to Deal with Outliers
• An outlier is an object that deviates significantly from the rest of
the objects. They can be caused by measurement or execution
error. The analysis of outlier data is referred to as outlier analysis or
outlier mining. An outlier is an element of a data set that distinctly
stands out from the rest of the data.

• There are also different degrees of outliers, as illustrated in the sketch after this list:
• Extreme outliers lie beyond an "outer fence," conventionally 3 times the interquartile range (IQR) below the first quartile or above the third quartile.
• Mild outliers lie beyond an "inner fence" on either side, conventionally 1.5 times the IQR below the first quartile or above the third quartile.
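A minimal sketch of the fence calculation, assuming the conventional 1.5×IQR and 3×IQR multipliers and a made-up sample:

```python
import numpy as np

# Hypothetical sample with one extreme value (illustrative only)
data = np.array([12, 14, 14, 15, 16, 17, 18, 19, 20, 45])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Inner fences (mild outliers lie beyond these)
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Outer fences (extreme outliers lie beyond these)
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

mild = data[(data < inner_low) | (data > inner_high)]
extreme = data[(data < outer_low) | (data > outer_high)]
print("mild or worse:", mild, "extreme:", extreme)
```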
Data Visualization
• Data visualization is an interdisciplinary field that deals with the graphic representation of data. It is a particularly efficient way of communicating when the data is numerous, as for example a time series.

• Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

• From an academic point of view, this representation can be considered as a mapping between the original data (usually numerical) and graphic elements (for example, lines or points in a chart). The mapping determines how the attributes of these elements vary according to the data. In this light, a bar chart is a mapping of the length of a bar to a magnitude of a variable. Since the graphic design of the mapping can adversely affect the readability of a chart, mapping is a core competency of data visualization.
• Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend (see the sketch after this list).

• Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the sales persons.

• Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.

• Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period. A bar chart can show comparison of the actual versus the reference amount.
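A minimal sketch of two of these chart types with matplotlib (the years, unemployment rates, and sales figures are illustrative assumptions, not real data):

```python
import matplotlib.pyplot as plt

# Time-series: a single variable tracked over time (hypothetical unemployment rates)
years = list(range(2014, 2024))
unemployment_rate = [6.2, 5.9, 5.5, 5.1, 4.8, 4.6, 7.9, 6.3, 5.4, 5.0]

# Ranking: one measure compared across categories (hypothetical sales by person)
sales_by_person = {"Asha": 120, "Ben": 95, "Chen": 150, "Dia": 80}
ranked = dict(sorted(sales_by_person.items(), key=lambda kv: kv[1], reverse=True))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(years, unemployment_rate, marker="o")  # line chart shows the trend
ax1.set_title("Time-series: unemployment rate")
ax1.set_xlabel("Year")
ax1.set_ylabel("Rate (%)")

ax2.bar(list(ranked.keys()), list(ranked.values()))  # bar chart shows the ranking
ax2.set_title("Ranking: sales by sales person")
ax2.set_ylabel("Sales")

plt.tight_layout()
plt.show()
```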
