1. Explain errors
->
Errors are the norm, not the exception, when working with data. By now, you've
probably heard the statistic that 88% of spreadsheets contain errors. Since we cannot
safely assume that any of the data we work with is error-free, our mission should be
to find and tackle errors in the most efficient way possible.
3. Explain the principles of Data Analysis.
->
Data analysis is a process of inspecting, cleansing, transforming,
and modelling data with the goal of discovering useful information, informing
conclusions, and supporting decision-making. The following are the principles of data
analysis.
(a) Completeness: The number of incomplete entries in each data source's fields is
measured as a percentage of the total data. If a data source holds specific importance
because it contains critical data (customer names, phone numbers, e-mail addresses,
etc.), it is analysed first, to ensure that the source is fit to progress to the next phase
of analysis before completeness is checked on noncritical data. For example, for
personal data to be unique, you need, as a minimum, a first name, last name, and
date of birth. If any of this information is not part of the data, it is an incomplete personal
data entry. Completeness is specific to the business area of the data you are
processing.
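As a sketch of this completeness check in pandas (the column names and sample data below are hypothetical), the required-field test described above could look like:

```python
import pandas as pd
import numpy as np

# Hypothetical personal-data source; column names are assumptions for illustration.
people = pd.DataFrame({
    "first_name": ["Ann", "Bob", np.nan],
    "last_name": ["Lee", np.nan, "Khan"],
    "date_of_birth": ["1990-01-01", "1985-06-15", "2000-12-31"],
})

required = ["first_name", "last_name", "date_of_birth"]
# An entry is complete only if every required field is present.
complete_mask = people[required].notna().all(axis=1)
completeness_pct = 100 * complete_mask.mean()
print(f"Complete entries: {completeness_pct:.1f}%")
```

Here only the first row has all three required fields, so completeness is reported as roughly 33.3%.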
(b) Uniqueness: The specific value is evaluated in comparison to the rest of the data
in the field. The value is also tested against other known sources of the same data
sets. The last test for uniqueness is to show where the same field appears in many
data sources. Uniqueness is normally reported as a histogram across all unique values
in each data source.
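The histogram of unique values described above can be produced with pandas' value_counts(); the sample values below are purely illustrative:

```python
import pandas as pd

# value_counts() gives the histogram of occurrences for each distinct value
# in a field, which is how uniqueness is normally reported.
emails = pd.Series([
    "a@example.com", "b@example.com", "a@example.com", "c@example.com",
])
counts = emails.value_counts()
print(counts)

# Values that occur more than once violate uniqueness.
duplicates = counts[counts > 1]
print("Duplicated values:", list(duplicates.index))
```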
(c) Timeliness: The impact of date and time on the data source is recorded, and the
data is checked for periods of stability or instability. This check is useful when
scheduling extracts from source systems. We should work closely with the customer's
operational people to ensure that data extracts are performed at the correct point in
the business cycle.
(d) Validity: Validity is tested against known and approved standards and is recorded
as a percentage of nonconformance against the standard. It has been found that most
data entries are covered by a standard. For example, country codes use ISO 3166-1;
currencies use ISO 4217. Customer-specific standards should also be considered, for
example, the International Classification of Diseases (ICD) standard ICD-10. Standards
change over time: ICD-10 is the tenth version of the standard; ICD-7
took effect in 1958, ICD-8A in 1968, ICD-9 in 1979, and ICD-10 in 1999. So, when you
validate data, make sure that you apply the correct standard to the correct data period.
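A minimal validity sketch in pandas, assuming a hand-picked subset of ISO 3166-1 alpha-2 codes stands in for the full standard:

```python
import pandas as pd

# Only a tiny, hand-picked subset of ISO 3166-1 alpha-2 codes is listed here
# for illustration; a real check would use the full standard.
valid_codes = {"US", "GB", "IN", "DE", "FR"}

countries = pd.Series(["US", "GB", "XX", "IN", "ZZ"])
invalid_mask = ~countries.isin(valid_codes)
# Validity is reported as a percentage of nonconformance against the standard.
nonconformance_pct = 100 * invalid_mask.mean()
print(f"Nonconformance: {nonconformance_pct:.1f}%")
```

Two of the five codes ("XX" and "ZZ") fail the check, so nonconformance is reported as 40%.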
(e) Accuracy: Accuracy is a measure of the data against the real-world person or
object that is recorded in the data source. There are regulations, such as the European
Union's General Data Protection Regulation (GDPR), that require data to be accurate.
It is always recommended to investigate the standards and regulations your data must
comply with for accuracy.
(f) Consistency: This measure is recorded as the shift in the patterns in the data:
measure how the data changes from load to load. Patterns and checksums should be
measured for data sources.
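One way to sketch a load-over-load checksum comparison (the serialization choice here is an assumption for illustration, not a prescribed method):

```python
import hashlib
import pandas as pd

def checksum(df: pd.DataFrame) -> str:
    # Serialize deterministically (sorted columns, fixed row order) so the
    # same data always yields the same digest.
    canonical = df[sorted(df.columns)].to_csv(index=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

load_1 = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
load_2 = pd.DataFrame({"id": [1, 2], "value": [10, 25]})  # one value drifted

print(checksum(load_1) == checksum(load_1))  # identical loads match
print(checksum(load_1) == checksum(load_2))  # drift is detected
```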
4. How will you handle missing values in pandas?
->
• In a pandas DataFrame, the dropna() function is used to remove rows and columns
with Null/NaN/NaT values. The dropna() function parameters are:
a. axis: {0 or 'index', 1 or 'columns'}, default 0. If 0, drop rows with null values.
If 1, drop columns with missing values.
b. how: {'any', 'all'}, default 'any'. If 'any', drop the row/column if any of the values
is null. If 'all', drop the row/column only if all the values are missing.
c. thresh: An int value specifying the minimum number of non-null values required
to keep the row/column.
d. subset: Specifies the labels along the other axis to look for null values in.
e. inplace: A boolean value; if True, the source DataFrame is changed in place
and None is returned.
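The thresh and subset parameters above can be illustrated on a small, made-up DataFrame:

```python
import pandas as pd
import numpy as np

# Illustrative data: row 1 has only one non-null value.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# thresh: keep only rows with at least 2 non-null values.
print(df.dropna(thresh=2))

# subset: drop rows that have a null in column 'a' only.
print(df.dropna(subset=["a"]))
```

In both calls the middle row is dropped: it has only one non-null value, and its 'a' entry is NaN.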
• Missing values in pandas can be handled in the following ways:
1. Drop the columns where all elements are missing values.
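Since the answer also mentions replacing missing values, here is a brief fillna() sketch on hypothetical marks data:

```python
import pandas as pd
import numpy as np

# Hypothetical marks data; replace the NaN instead of dropping it.
marks = pd.Series([85.10, np.nan, 91.54])

# Replace missing values with a constant...
print(marks.fillna(0))
# ...or with a statistic such as the mean of the non-null values.
print(marks.fillna(marks.mean()))
```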
5. (a) Write a python program to drop the columns where any of the elements are
missing values.
->
import pandas as pd
import numpy as np

student_dict = {"name": ["Joe", "Sam", "Harry"],
                "age": [20, 21, 19],
                "marks": [85.10, np.nan, 91.54]}

# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)

# Drop columns containing any NaN value
student_df = student_df.dropna(axis='columns', how='any')
print(student_df)
(b) Write a python program to drop the columns where all the elements are
missing values.
->
import pandas as pd
import numpy as np

student_dict = {"name": ["Joe", "Sam", "Harry"],
                "age": [20, 21, 19],
                "marks": [85.10, np.nan, 91.54]}

# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)

# Drop columns where all values are NaN
student_df = student_df.dropna(axis='columns', how='all')
print(student_df)