DSBDA2

Uploaded by

403 Chaudhari Sanika Sagar


import pandas as pd

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from scipy.stats import zscore, skew, shapiro, probplot
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("Test_Data.csv")

data

           age     sex        bmi  health_gradient  smoker     region  children
0    40.000000    male  29.900000      35760.40000      no  southwest       2.0
1    47.000000    male  32.300000      49034.63000      no  southwest       1.0
2    54.000000  female  28.880000      45038.93760      no  northeast       2.0
3          NaN    male  30.568094          0.00000      no  northeast       3.0
4    59.130049    male  33.132854      64912.13924     yes  northeast       4.0
..         ...     ...        ...              ...     ...        ...       ...
487  51.000000    male  27.740000      39244.88760      no  northeast       5.0
488  33.000000    male  42.400000      59326.08000      no  southwest       5.0
489  47.769999    male  29.064615      40353.79402      no  northeast       5.0
490  41.530738  female  24.260852      24444.53324      no  southeast       5.0
491  36.000000    male  33.400000      40160.16000     yes  southwest       5.0

[492 rows x 7 columns]

data.isnull().sum()

age 1
sex 0
bmi 1
health_gradient 0
smoker 0
region 0
children 1
dtype: int64
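
The counts above can also be read as a share of rows, which makes it easier to judge whether imputing is reasonable. The following is a minimal sketch on a small, made-up frame (the `toy` values are hypothetical, not taken from Test_Data.csv); it relies only on `isnull().sum()` and `isnull().mean()`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one gap in "age" and one in "bmi".
toy = pd.DataFrame({
    "age": [40.0, 47.0, np.nan, 54.0],
    "bmi": [29.9, np.nan, 28.88, 30.5],
    "smoker": ["no", "no", "yes", "no"],
})

missing_count = toy.isnull().sum()   # number of NaN entries per column
missing_share = toy.isnull().mean()  # fraction of rows that are NaN per column

summary = pd.DataFrame({"count": missing_count, "share": missing_share})
print(summary)
```

With only one missing value per column out of 492 rows, as in the output above, a simple per-column fill is unlikely to distort the data much.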

for column in data.columns:
    print(f"\nColumn: {column}")
    print(data[column].head())

Column: age
0 40.000000
1 47.000000
2 54.000000
3 NaN
4 59.130049
Name: age, dtype: float64

Column: sex
0 male
1 male
2 female
3 male
4 male
Name: sex, dtype: object

Column: bmi
0 29.900000
1 32.300000
2 28.880000
3 30.568094
4 33.132854
Name: bmi, dtype: float64

Column: health_gradient
0 35760.40000
1 49034.63000
2 45038.93760
3 0.00000
4 64912.13924
Name: health_gradient, dtype: float64

Column: smoker
0 no
1 no
2 no
3 no
4 yes
Name: smoker, dtype: object

Column: region
0 southwest
1 southwest
2 northeast
3 northeast
4 northeast
Name: region, dtype: object

Column: children
0 2.0
1 1.0
2 2.0
3 3.0
4 4.0
Name: children, dtype: float64

# Impute categorical (string) columns with the mode, i.e. the most frequent value.
handle_missing_values_categorical = SimpleImputer(strategy='most_frequent')
data_categorical = data.select_dtypes(exclude='number')
data[data_categorical.columns] = handle_missing_values_categorical.fit_transform(data_categorical)
# fit_transform computes the most frequent value of each categorical column
# in data_categorical and replaces missing values with it.

# Impute numeric columns with the column mean.
handle_missing_values_numeric_mean = SimpleImputer(strategy='mean')
data_numeric = data.select_dtypes(include='number')
data[data_numeric.columns] = handle_missing_values_numeric_mean.fit_transform(data_numeric)
# fit_transform computes the mean of each numeric column in data_numeric
# and replaces missing values with it.

# Impute numeric columns with the column median.
handle_missing_values_numeric_median = SimpleImputer(strategy='median')
data_numeric = data.select_dtypes(include='number')
data[data_numeric.columns] = handle_missing_values_numeric_median.fit_transform(data_numeric)
# fit_transform computes the median of each numeric column in data_numeric
# and replaces missing values with it. If mean imputation has already run on
# these columns, no NaN values remain and this step leaves the data unchanged.

print(data)

           age     sex        bmi  health_gradient  smoker     region  children
0    40.000000    male  29.900000      35760.40000      no  southwest       2.0
1    47.000000    male  32.300000      49034.63000      no  southwest       1.0
2    54.000000  female  28.880000      45038.93760      no  northeast       2.0
3    38.844276    male  30.568094          0.00000      no  northeast       3.0
4    59.130049    male  33.132854      64912.13924     yes  northeast       4.0
..         ...     ...        ...              ...     ...        ...       ...
487  51.000000    male  27.740000      39244.88760      no  northeast       5.0
488  33.000000    male  42.400000      59326.08000      no  southwest       5.0
489  47.769999    male  29.064615      40353.79402      no  northeast       5.0
490  41.530738  female  24.260852      24444.53324      no  southeast       5.0
491  36.000000    male  33.400000      40160.16000     yes  southwest       5.0

[492 rows x 7 columns]
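
To make the difference between the mean and median strategies concrete, here is a minimal sketch on a hypothetical single-column frame (the `toy` values are invented so that the mean and median differ clearly). It also illustrates that a second imputation pass over already-filled data changes nothing, because no NaN values are left to replace.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with one gap and one large value that pulls the mean up.
toy = pd.DataFrame({"age": [20.0, 30.0, np.nan, 100.0]})

# Fill the gap with the column mean: (20 + 30 + 100) / 3 = 50.
mean_filled = toy.copy()
mean_filled[["age"]] = SimpleImputer(strategy="mean").fit_transform(toy[["age"]])

# Fill the gap with the column median of the observed values: 30.
median_filled = toy.copy()
median_filled[["age"]] = SimpleImputer(strategy="median").fit_transform(toy[["age"]])

# Running a median imputer on the mean-filled data is a no-op:
# there are no missing values left for it to replace.
second_pass = mean_filled.copy()
second_pass[["age"]] = SimpleImputer(strategy="median").fit_transform(mean_filled[["age"]])

print(mean_filled["age"].tolist())
print(median_filled["age"].tolist())
```

The median is usually the safer choice for skewed columns, since a few extreme values do not shift it the way they shift the mean.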

# Calculate z-scores for the numeric columns:
z_scores = zscore(data.select_dtypes(include='number'), axis=0)

# Identify outliers (more than 3 standard deviations from the column mean):
outliers = (z_scores > 3) | (z_scores < -3)

# Mask outliers in the DataFrame by replacing them with NaN:
data_no_outliers = data.select_dtypes(include='number').mask(outliers, np.nan)

for column in data_no_outliers.columns:
    print(f"\nColumn: {column}")
    print(data_no_outliers[column].head())

Column: age
0 40.000000
1 47.000000
2 54.000000
3 38.844276
4 59.130049
Name: age, dtype: float64

Column: bmi
0 29.900000
1 32.300000
2 28.880000
3 30.568094
4 33.132854
Name: bmi, dtype: float64

Column: health_gradient
0 35760.40000
1 49034.63000
2 45038.93760
3 0.00000
4 64912.13924
Name: health_gradient, dtype: float64

Column: children
0 NaN
1 NaN
2 NaN
3 NaN
4 4.0
Name: children, dtype: float64
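
The z-score rule used above can be checked on data where the outlier is known in advance. The sketch below builds a hypothetical `bmi` column (evenly spaced values with one planted extreme; none of it comes from Test_Data.csv) and shows two ways of handling flagged values: masking them to NaN, as the notebook does, or dropping the rows.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical column: 200 evenly spaced values plus one planted extreme value.
values = np.linspace(25.0, 35.0, 200)
values[10] = 80.0
toy = pd.DataFrame({"bmi": values})

# zscore uses the population standard deviation (ddof=0) by default.
z = zscore(toy["bmi"].to_numpy())

# Flag values more than 3 standard deviations from the mean.
is_outlier = np.abs(z) > 3

masked = toy["bmi"].mask(is_outlier)  # outliers become NaN, row count is kept
dropped = toy.loc[~is_outlier]        # outlier rows are removed entirely

print(int(is_outlier.sum()), len(masked), len(dropped))
```

Masking keeps the row for the other columns but reintroduces NaN values, so any later step that cannot handle NaN needs another imputation or a `dropna()`.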

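# Skewness of 'age' after outlier masking (NaN values are skipped by default).
# Note: this measures 'age', whereas the square-root transformation further
# down is applied to 'health_gradient'.
skew_before = data_no_outliers['age'].skew()
print(f"\nSkewness before transformation: {skew_before}")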

Skewness before transformation: 0.0453252970458881

data

           age     sex        bmi  health_gradient  smoker     region  children
0    40.000000    male  29.900000      35760.40000      no  southwest       2.0
1    47.000000    male  32.300000      49034.63000      no  southwest       1.0
2    54.000000  female  28.880000      45038.93760      no  northeast       2.0
3    38.844276    male  30.568094          0.00000      no  northeast       3.0
4    59.130049    male  33.132854      64912.13924     yes  northeast       4.0
..         ...     ...        ...              ...     ...        ...       ...
487  51.000000    male  27.740000      39244.88760      no  northeast       5.0
488  33.000000    male  42.400000      59326.08000      no  southwest       5.0
489  47.769999    male  29.064615      40353.79402      no  northeast       5.0
490  41.530738  female  24.260852      24444.53324      no  southeast       5.0
491  36.000000    male  33.400000      40160.16000     yes  southwest       5.0

[492 rows x 7 columns]

# Square-root transform of health_gradient (requires non-negative values).
data['health_gradient'] = np.sqrt(data['health_gradient'])

data

           age     sex        bmi  health_gradient  smoker     region  children
0    40.000000    male  29.900000       189.104204      no  southwest       2.0
1    47.000000    male  32.300000       221.437644      no  southwest       1.0
2    54.000000  female  28.880000       212.223791      no  northeast       2.0
3    38.844276    male  30.568094         0.000000      no  northeast       3.0
4    59.130049    male  33.132854       254.778608     yes  northeast       4.0
..         ...     ...        ...              ...     ...        ...       ...
487  51.000000    male  27.740000       198.103225      no  northeast       5.0
488  33.000000    male  42.400000       243.569456      no  southwest       5.0
489  47.769999    male  29.064615       200.882538      no  northeast       5.0
490  41.530738  female  24.260852       156.347476      no  southeast       5.0
491  36.000000    male  33.400000       200.400000     yes  southwest       5.0

[492 rows x 7 columns]
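
The notebook stops after applying the square root, without measuring its effect. A minimal sketch of that comparison follows, assuming a right-skewed, non-negative column. The series here is synthetic (drawn from an exponential distribution with a fixed seed), so the numbers it prints say nothing about the real `health_gradient` column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, non-negative data standing in for health_gradient.
rng = np.random.default_rng(42)
toy = pd.Series(rng.exponential(scale=20000.0, size=1000), name="health_gradient")

skew_before = toy.skew()

# The square root compresses large values more than small ones,
# which shortens a long right tail. It is only defined for values >= 0.
transformed = np.sqrt(toy)
skew_after = transformed.skew()

print(f"Skewness before transformation: {skew_before:.3f}")
print(f"Skewness after transformation:  {skew_after:.3f}")
```

Running the same two `skew()` calls on the actual column, once before and once after the transformation, would show whether the square root helps in this dataset.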
