
UNIT III

1. Explain errors.
->
Errors are the norm, not the exception, when working with data. By now, you've
probably heard the statistic that 88% of spreadsheets contain errors. Since we cannot
safely assume that any of the data we work with is error-free, our mission should be
to find and tackle errors in the most efficient way possible.

2. Explain the different ways to deal with errors.
->
Errors are the norm, not the exception, when working with data. By now, you've
probably heard the statistic that 88% of spreadsheets contain errors. Since we cannot
safely assume that any of the data we work with is error-free, our mission should be
to find and tackle errors in the most efficient way possible. Ways to handle errors are
as follows (a short pandas sketch after option (d) illustrates options (b), (c), and (d)):
(a) Accept the Error: If the error falls within an acceptable standard, it can be accepted,
and we can move on to the next data entry. If an error is accepted, it will affect data
science techniques and algorithms that perform classification, such as binning,
regression, clustering, and decision trees, because these processes will treat the
erroneous value and the correct value as if they were different values. This option is
the easy option, but not always the best one.
(b) Reject the Error: Occasionally, predominantly with first-time data imports, the
information is so severely damaged that it is better to simply delete the data entry
methodically and not try to correct it. Removing data should be a last resort; instead,
a quality flag can be set so that the erroneous data is excluded from the data science
techniques and algorithms that it would negatively affect.
(c) Correct the Error: A major part of the assess step is dedicated to this option.
Spelling mistakes in customer names, addresses, and locations are a common source
of errors, and they are methodically corrected. If there are variations on a name, it is
recommended to set one data source as the "master" and keep the data consolidated
and correct across all the databases using that master as the primary source. The
original error can be stored as a separate value, as it is useful for discovering patterns
in data sources that consistently produce errors.
(d) Create a Default Value: Most system developers assume that if the business
doesn't enter a value, they should enter a default value. Common values that have
been observed are "unknown" or "n/a." There are many undesirable choices, such as
birthdays for dates, or pets' names for first name and last name, or parents' addresses.
The address choice goes awry, of course, when more than 300 marketing letters with
sample products are sent to parents' addresses by several companies that are using
the same service to distribute their marketing work. Default values must be discussed
with the customer in detail, and an official "missing data" value should be agreed on.
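
A minimal pandas sketch of options (b), (c), and (d), using a hypothetical customer table
with made-up column names (name, city, quality_flag); this is an illustration, not the
text's own example:

import numpy as np
import pandas as pd

# Hypothetical customer data: one damaged row, one spelling variant, one missing value
customers = pd.DataFrame({
    "name": ["Joe", "???", "Sam"],
    "city": ["Mumbai", "Mumbay", None],
})

# (b) Reject the error: instead of deleting the damaged row, set a quality flag
# so that later techniques can exclude it
customers["quality_flag"] = np.where(customers["name"] == "???", "rejected", "ok")

# (c) Correct the error: consolidate spelling variants against a "master" value,
# keeping the original value to discover error patterns later
master_city = {"Mumbay": "Mumbai"}
customers["city_original"] = customers["city"]
customers["city"] = customers["city"].replace(master_city)

# (d) Create a default value: use an agreed "missing data" value, not a misleading one
customers["city"] = customers["city"].fillna("unknown")

print(customers)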

3. Explain the principles of Data Analysis.
->
Data analysis is a process of inspecting, cleansing, transforming,
and modelling data with the goal of discovering useful information, informing
conclusions, and supporting decision-making. Following are the principles of data
analysis.
(a) Completeness: The number of blank or incorrect entries in each data source's fields
is recorded as a percentage of the total data. If the data source holds specific
importance because it contains critical data (customer names, phone numbers, e-mail
addresses, etc.), the analysis starts with that data first, to ensure that the data source
is fit to progress to the next phase of analysis for completeness on noncritical data.
For example, for personal data to be unique, you need, as a minimum, a first name,
last name, and date of birth. If any of this information is not part of the data, it is an
incomplete personal data entry. Completeness is specific to the business area of the
data you are processing.
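As a rough illustration of how completeness could be profiled, assuming hypothetical
column names (first_name, last_name, date_of_birth):

import pandas as pd

people = pd.DataFrame({
    "first_name": ["Joe", "Sam", None],
    "last_name": ["Smith", None, "Brown"],
    "date_of_birth": ["1999-01-01", "2000-05-12", None],
})

# Percentage of missing entries per field
print(people.isna().mean() * 100)

# Critical fields: an entry is complete only if all three are present
complete = people[["first_name", "last_name", "date_of_birth"]].notna().all(axis=1)
print("complete entries:", complete.sum(), "of", len(people))
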
(b) Uniqueness: The specific value is evaluated in comparison to the rest of the data
in the field. The value is also tested against other known sources of the same data
sets. The last test for uniqueness is to show where the same field appears in many
data sources. Uniqueness is normally reported as a histogram across all unique values
in each data source.
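A small sketch of how uniqueness might be profiled, assuming a hypothetical email
field:

import pandas as pd

emails = pd.Series(["a@x.com", "b@x.com", "a@x.com", "c@x.com"])

# Histogram across all values in the field (counts per unique value)
print(emails.value_counts())

# How many values are duplicated versus unique
print("duplicates:", emails.duplicated().sum(), "unique:", emails.nunique())
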
(c) Timeliness: The impact of date and time on the data source is recorded, and the
data is checked for periods of stability or instability. This check is useful when
scheduling extracts from source systems. We should work closely with the customer's
operational people to ensure that data extracts are performed at the correct point in
the business cycle.
(d) Validity: Validity is tested against known and approved standards and is recorded
as a percentage of nonconformance against the standard. It has been found that most
data entries are covered by a standard. For example, country codes use ISO 3166-1;
currencies use ISO 4217. Customer-specific standards should also be considered, for
example, the International Classification of Diseases (ICD) standard ICD-10. Standards
change over time. For example, ICD-10 is the tenth version of the standard: ICD-7
took effect in 1958, ICD-8A in 1968, ICD-9 in 1979, and ICD-10 in 1999. So, when you
validate data, make sure that you apply the correct standard to the correct data period.
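For example, validity against a standard could be checked roughly as below; the codes
listed are only a tiny hand-picked subset of ISO 3166-1 alpha-2, used for illustration:

import pandas as pd

# Tiny illustrative subset of ISO 3166-1 alpha-2 country codes (not the full standard)
iso_3166_1 = {"IN", "US", "GB", "ZA", "DE"}

countries = pd.Series(["IN", "US", "XX", "GB", "U.S."])

# Percentage of nonconformance against the standard
invalid = ~countries.isin(iso_3166_1)
print("nonconformance:", invalid.mean() * 100, "%")
print(countries[invalid])
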
(e) Accuracy: Accuracy is a measure of the data against the real-world person or
object that is recorded in the data source. There are regulations, such as the European
Union’s General Data Protection Regulation (GDPR), that require data to be compliant
for accuracy. It is always recommended to investigate standards and regulations for
complying with for accuracy.
(f) Consistency: This measure is recorded as the shift in the patterns in the data, i.e.,
how the data changes load after load. Patterns and checksums should be measured
for each data source.
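One simple way to watch for such shifts is to compare a checksum and basic patterns
between the previous and the current load; a minimal sketch with made-up data:

import hashlib
import pandas as pd

def load_checksum(df: pd.DataFrame) -> str:
    # Checksum of the whole load; a changed value signals that the data has shifted
    return hashlib.md5(df.to_csv(index=False).encode()).hexdigest()

previous = pd.DataFrame({"amount": [10, 20, 30]})
current = pd.DataFrame({"amount": [10, 20, 45]})

print("previous:", load_checksum(previous))
print("current :", load_checksum(current))

# A simple pattern check: compare summary statistics load after load
print(previous["amount"].describe())
print(current["amount"].describe())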

4. How will you handle missing values in pandas?
->
• In a pandas DataFrame, the dropna() function is used to remove rows and columns
with Null/NaN/NaT values. The dropna() function parameters are listed below (a short
sketch of thresh, subset, and inplace follows the list):
a. axis: {0 or 'index', 1 or 'columns'}, default 0. If 0, drop rows with null values.
If 1, drop columns with missing values.
b. how: {'any', 'all'}, default 'any'. If 'any', drop the row/column if any of the values
is null. If 'all', drop the row/column if all the values are missing.
c. thresh: an int value specifying the minimum number of non-null values required
to keep the row/column.
d. subset: specifies the rows/columns to look for null values in.
e. inplace: a boolean value; if True, the source DataFrame is changed and None
is returned.
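
A small sketch of the thresh, subset, and inplace parameters, using a made-up
DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": [np.nan, np.nan, 6],
    "c": [7, 8, 9],
})

# thresh: keep only rows that have at least 2 non-null values
print(df.dropna(thresh=2))

# subset: drop rows that have nulls in column "a" only
print(df.dropna(subset=["a"]))

# inplace: modify df itself and return None
df.dropna(axis="columns", how="any", inplace=True)
print(df)
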
• Missing values in pandas can be handled in the following ways (approaches 3 and 4
are sketched after this list; approaches 1 and 2 correspond to the programs in
question 5):
1. Drop the Columns Where All Elements Are Missing Values
2. Drop the Columns Where Any of the Elements Is a Missing Value
3. Keep Only the Rows That Contain a Maximum of Two Missing Values
4. Fill All Missing Values with the Mean, Median, Mode, or Minimum
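
Approaches 3 and 4 might look roughly like this; the DataFrame is made up for
illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [4, np.nan, 6],
    "c": [7, np.nan, np.nan],
    "d": [np.nan, np.nan, 10],
})

# 3. Keep only the rows that contain a maximum of two missing values,
#    i.e. at least (number of columns - 2) non-null values
print(df.dropna(thresh=len(df.columns) - 2))

# 4. Fill all missing values with the column mean (median/mode/min work the same way)
print(df.fillna(df.mean(numeric_only=True)))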

5. (a) Write a python program to drop the columns where any of the elements are
missing values.
->
import pandas as pd
import numpy as np
student_dict = {"name": ["Joe", "Sam", "Harry"], "age": [20, 21, 19], "marks": [85.10,
np.nan, 91.54]}
# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)
# drop columns that contain any NaN values ('marks' is dropped here)
student_df = student_df.dropna(axis='columns', how='any')
print(student_df)

(b) Write a python program to drop the columns where all the elements are
missing values.
->
import pandas as pd
import numpy as np
student_dict = {"name": ["Joe", "Sam", "Harry"], "age": [20, 21, 19], "marks": [85.10,
np.nan, 91.54]}
# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)
# drop columns where all the values are NaN (no column is entirely NaN here, so nothing is dropped)
student_df = student_df.dropna(axis='columns', how='all')
print(student_df)
