
Exp - 1

February 22, 2024

1 a) Creating and loading different datasets in Python.


[27]: import pandas as pd
import numpy as np

[28]: df = pd.DataFrame({
    'Fruits': ['banana', 'apple', 'pear', 'grapes', 'orange',
               'kiwi', 'watermelon', 'pomegranate', 'pineapple', 'mango'],
    'Vegetable': ['cucumber', 'carrot', 'capsicum', 'onion', 'potato',
                  'lemon', 'tomato', 'raddish', 'beetroot', 'cabbage'],
})

[29]: df

[29]: Fruits Vegetable


0 banana cucumber
1 apple carrot
2 pear capsicum
3 grapes onion
4 orange potato
5 kiwi lemon
6 watermelon tomato
7 pomegranate raddish
8 pineapple beetroot
9 mango cabbage

[30]: df.to_csv("fruits_and_Vegetable.csv",index=False)

[31]: a1 = pd.read_csv('fruits_and_Vegetable.csv')

[32]: a1

[32]: Fruits Vegetable


0 banana cucumber
1 apple carrot
2 pear capsicum
3 grapes onion
4 orange potato
5 kiwi lemon
6 watermelon tomato

7 pomegranate raddish
8 pineapple beetroot
9 mango cabbage
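The write-then-read round trip above can be checked without touching the disk. The following is a minimal sketch, assuming a hypothetical two-row frame and an in-memory buffer in place of `fruits_and_Vegetable.csv`; the names `frame`, `buffer` and `restored` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the fruits/vegetables table.
frame = pd.DataFrame({"Fruits": ["banana", "apple"],
                      "Vegetable": ["cucumber", "carrot"]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)   # index=False keeps the row labels out of the CSV
buffer.seek(0)                      # rewind so read_csv starts at the beginning

restored = pd.read_csv(buffer)
print(restored)
```

Without `index=False`, the row labels would be written as an extra unnamed column and would come back as data when the file is read again.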

[ ]: #######################################################################################

2 Netflix_titles dataset operations


[33]: import numpy as np
import pandas as pd
df = pd.DataFrame()
df = pd.read_csv("netflix_titles.csv")

[34]: df

[34]: show_id type title director \


0 s1 Movie Dick Johnson Is Dead Kirsten Johnson
1 s2 TV Show Blood & Water NaN
2 s3 TV Show Ganglands Julien Leclercq
3 s4 TV Show Jailbirds New Orleans NaN
4 s5 TV Show Kota Factory NaN
… … … … …
8802 s8803 Movie Zodiac David Fincher
8803 s8804 TV Show Zombie Dumb NaN
8804 s8805 Movie Zombieland Ruben Fleischer
8805 s8806 Movie Zoom Peter Hewitt
8806 s8807 Movie Zubaan Mozez Singh

cast country \
0 NaN United States
1 Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban… South Africa
2 Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi… NaN
3 NaN NaN
4 Mayur More, Jitendra Kumar, Ranjan Raj, Alam K… India
… … …
8802 Mark Ruffalo, Jake Gyllenhaal, Robert Downey J… United States
8803 NaN NaN
8804 Jesse Eisenberg, Woody Harrelson, Emma Stone, … United States
8805 Tim Allen, Courteney Cox, Chevy Chase, Kate Ma… United States
8806 Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan… India

date_added release_year rating duration \


0 September 25, 2021 2020 PG-13 90 min
1 September 24, 2021 2021 TV-MA 2 Seasons
2 September 24, 2021 2021 TV-MA 1 Season
3 September 24, 2021 2021 TV-MA 1 Season

4 September 24, 2021 2021 TV-MA 2 Seasons
… … … … …
8802 November 20, 2019 2007 R 158 min
8803 July 1, 2019 2018 TV-Y7 2 Seasons
8804 November 1, 2019 2009 R 88 min
8805 January 11, 2020 2006 PG 88 min
8806 March 2, 2019 2015 TV-14 111 min

listed_in \
0 Documentaries
1 International TV Shows, TV Dramas, TV Mysteries
2 Crime TV Shows, International TV Shows, TV Act…
3 Docuseries, Reality TV
4 International TV Shows, Romantic TV Shows, TV …
… …
8802 Cult Movies, Dramas, Thrillers
8803 Kids' TV, Korean TV Shows, TV Comedies
8804 Comedies, Horror Movies
8805 Children & Family Movies, Comedies
8806 Dramas, International Movies, Music & Musicals

description
0 As her father nears the end of his life, filmm…
1 After crossing paths at a party, a Cape Town t…
2 To protect his family from a powerful drug lor…
3 Feuds, flirtations and toilet talk go down amo…
4 In a city of coaching centers known to train I…
… …
8802 A political cartoonist, a crime reporter and a…
8803 While living alone in a spooky town, a young g…
8804 Looking to survive in a world taken over by zo…
8805 Dragged from civilian life, a former superhero…
8806 A scrappy but poor boy worms his way into a ty…

[8807 rows x 12 columns]

[35]: df.head()

[35]: show_id type title director \


0 s1 Movie Dick Johnson Is Dead Kirsten Johnson
1 s2 TV Show Blood & Water NaN
2 s3 TV Show Ganglands Julien Leclercq
3 s4 TV Show Jailbirds New Orleans NaN
4 s5 TV Show Kota Factory NaN

cast country \
0 NaN United States

1 Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban… South Africa
2 Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi… NaN
3 NaN NaN
4 Mayur More, Jitendra Kumar, Ranjan Raj, Alam K… India

date_added release_year rating duration \


0 September 25, 2021 2020 PG-13 90 min
1 September 24, 2021 2021 TV-MA 2 Seasons
2 September 24, 2021 2021 TV-MA 1 Season
3 September 24, 2021 2021 TV-MA 1 Season
4 September 24, 2021 2021 TV-MA 2 Seasons

listed_in \
0 Documentaries
1 International TV Shows, TV Dramas, TV Mysteries
2 Crime TV Shows, International TV Shows, TV Act…
3 Docuseries, Reality TV
4 International TV Shows, Romantic TV Shows, TV …

description
0 As her father nears the end of his life, filmm…
1 After crossing paths at a party, a Cape Town t…
2 To protect his family from a powerful drug lor…
3 Feuds, flirtations and toilet talk go down amo…
4 In a city of coaching centers known to train I…

[36]: df.tail()

[36]: show_id type title director \


8802 s8803 Movie Zodiac David Fincher
8803 s8804 TV Show Zombie Dumb NaN
8804 s8805 Movie Zombieland Ruben Fleischer
8805 s8806 Movie Zoom Peter Hewitt
8806 s8807 Movie Zubaan Mozez Singh

cast country \
8802 Mark Ruffalo, Jake Gyllenhaal, Robert Downey J… United States
8803 NaN NaN
8804 Jesse Eisenberg, Woody Harrelson, Emma Stone, … United States
8805 Tim Allen, Courteney Cox, Chevy Chase, Kate Ma… United States
8806 Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan… India

date_added release_year rating duration \


8802 November 20, 2019 2007 R 158 min
8803 July 1, 2019 2018 TV-Y7 2 Seasons
8804 November 1, 2019 2009 R 88 min
8805 January 11, 2020 2006 PG 88 min

8806 March 2, 2019 2015 TV-14 111 min

listed_in \
8802 Cult Movies, Dramas, Thrillers
8803 Kids' TV, Korean TV Shows, TV Comedies
8804 Comedies, Horror Movies
8805 Children & Family Movies, Comedies
8806 Dramas, International Movies, Music & Musicals

description
8802 A political cartoonist, a crime reporter and a…
8803 While living alone in a spooky town, a young g…
8804 Looking to survive in a world taken over by zo…
8805 Dragged from civilian life, a former superhero…
8806 A scrappy but poor boy worms his way into a ty…

[38]: df.shape

[38]: (8807, 12)

[39]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB

[40]: df.describe()

[40]: release_year
count 8807.000000
mean 2014.180198
std 8.819312

min 1925.000000
25% 2013.000000
50% 2017.000000
75% 2019.000000
max 2021.000000

[41]: df.isnull()

[41]: show_id type title director cast country date_added \


0 False False False False True False False
1 False False False True False False False
2 False False False False False True False
3 False False False True True True False
4 False False False True False False False
… … … … … … … …
8802 False False False False False False False
8803 False False False True True True False
8804 False False False False False False False
8805 False False False False False False False
8806 False False False False False False False

release_year rating duration listed_in description


0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
… … … … … …
8802 False False False False False
8803 False False False False False
8804 False False False False False
8805 False False False False False
8806 False False False False False

[8807 rows x 12 columns]

[42]: df.isnull().sum()

[42]: show_id 0
type 0
title 0
director 2634
cast 825
country 831
date_added 10
release_year 0
rating 4

duration 3
listed_in 0
description 0
dtype: int64
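Raw counts are easier to judge as a share of the rows: 2634 missing directors out of 8807 titles is roughly 30% of the column. A minimal sketch of that calculation follows, assuming a hypothetical four-row miniature of the table (the values in `toy` are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Netflix table; values are illustrative only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "release_year": [2020, 2021, 2021, 2019],
})

missing_count = toy.isnull().sum()     # same call as in the cell above
missing_share = toy.isnull().mean()    # mean of booleans = fraction of rows missing

# Columns that contain at least one missing value, worst first.
incomplete = missing_share[missing_share > 0].sort_values(ascending=False)
print(incomplete)
```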

[98]: #######################################################################################

3 Housing dataset operations


[44]: df1 = pd.read_csv('Housing.csv')

[45]: df1.head()

[45]: price area bedrooms bathrooms stories mainroad guestroom basement \


0 13300000 7420 4 2 3 yes no no
1 12250000 8960 4 4 4 yes no no
2 12250000 9960 3 2 2 yes no yes
3 12215000 7500 4 2 2 yes no yes
4 11410000 7420 4 1 2 yes yes yes

hotwaterheating airconditioning parking prefarea furnishingstatus


0 no yes 2 yes furnished
1 no yes 3 no furnished
2 no no 2 yes semi-furnished
3 no yes 3 yes furnished
4 no yes 2 no furnished

[46]: df1.tail()

[46]: price area bedrooms bathrooms stories mainroad guestroom basement \


540 1820000 3000 2 1 1 yes no yes
541 1767150 2400 3 1 1 no no no
542 1750000 3620 2 1 1 yes no no
543 1750000 2910 3 1 1 no no no
544 1750000 3850 3 1 2 yes no no

hotwaterheating airconditioning parking prefarea furnishingstatus


540 no no 2 no unfurnished
541 no no 0 no semi-furnished
542 no no 0 no unfurnished
543 no no 0 no furnished
544 no no 0 no unfurnished

[47]: df1['mainroad']=='yes'

[47]: 0 True
1 True

2 True
3 True
4 True

540 True
541 False
542 True
543 False
544 True
Name: mainroad, Length: 545, dtype: bool

[48]: min(df1['area'])

[48]: 1650

[49]: max(df1['area'])

[49]: 16200
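The boolean Series from cell [47] becomes a filter once it is used to index the frame. A small sketch of that step, assuming three hypothetical rows shaped like `Housing.csv` (the frame `homes` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows shaped like Housing.csv; the numbers are illustrative.
homes = pd.DataFrame({
    "price": [13300000, 1767150, 1750000],
    "area": [7420, 2400, 3620],
    "mainroad": ["yes", "no", "yes"],
})

on_main_road = homes["mainroad"] == "yes"   # boolean Series, as in cell [47]
main_road_homes = homes[on_main_road]       # keep only the rows marked True

# Conditions combine with & (and) or | (or); each one needs its own parentheses.
large_main_road = homes[on_main_road & (homes["area"] > 5000)]

# Series methods are the vectorized counterparts of the built-in min() and max().
smallest_area = homes["area"].min()
largest_area = homes["area"].max()
```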

[50]: df1.isna()

[50]: price area bedrooms bathrooms stories mainroad guestroom \


0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
.. … … … … … … …
540 False False False False False False False
541 False False False False False False False
542 False False False False False False False
543 False False False False False False False
544 False False False False False False False

basement hotwaterheating airconditioning parking prefarea \


0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
.. … … … … …
540 False False False False False
541 False False False False False
542 False False False False False
543 False False False False False
544 False False False False False

furnishingstatus
0 False
1 False
2 False
3 False
4 False
.. …
540 False
541 False
542 False
543 False
544 False

[545 rows x 13 columns]

[51]: df1.value_counts()

[51]: price area bedrooms bathrooms stories mainroad guestroom basement


hotwaterheating airconditioning parking prefarea furnishingstatus
1750000 2910 3 1 1 no no no no
no 0 no furnished 1
5229000 7085 3 1 1 yes yes yes no
no 2 yes semi-furnished 1
5110000 11410 2 1 2 yes no no no
no 0 yes furnished 1
5145000 3410 3 1 2 no no no no
yes 0 no semi-furnished 1
7980 3 1 1 yes no no no
no 1 yes semi-furnished 1
..
3675000 3630 2 1 1 yes no no no
yes 0 no unfurnished 1
3600 2 1 1 yes no no no
no 0 no furnished 1
3640000 5960 3 1 2 yes yes yes no
no 0 no unfurnished 1
4280 2 1 1 yes no no no
yes 2 no semi-furnished 1
13300000 7420 4 2 3 yes no no no
yes 2 yes furnished 1
Name: count, Length: 545, dtype: int64
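Called on the whole frame, `value_counts()` counts identical rows; here every count is 1 and the result has 545 entries, so no row is duplicated. Per-column counts are usually more informative. A minimal sketch, assuming a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical rows; only the column names follow Housing.csv.
homes = pd.DataFrame({
    "furnishingstatus": ["furnished", "unfurnished", "furnished", "semi-furnished"],
    "parking": [2, 0, 2, 0],
})

# How often each category occurs, most frequent first.
status_counts = homes["furnishingstatus"].value_counts()

# The same counts expressed as fractions of all rows.
status_share = homes["furnishingstatus"].value_counts(normalize=True)
print(status_counts)
```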

4 b) Reshaping, Filtering, Scaling, Merging the data and Handling the missing values in datasets.

5 Merging
[52]: house_price=pd.DataFrame()
house_area=pd.DataFrame()

[53]: house_price=pd.read_csv('Housing.csv')
house_price=house_price[['price','mainroad']]
house_price

[53]: price mainroad


0 13300000 yes
1 12250000 yes
2 12250000 yes
3 12215000 yes
4 11410000 yes
.. … …
540 1820000 yes
541 1767150 no
542 1750000 yes
543 1750000 no
544 1750000 yes

[545 rows x 2 columns]

[54]: house_area=pd.read_csv('Housing.csv')
house_area=house_area[['area','mainroad']]
house_area

[54]: area mainroad


0 7420 yes
1 8960 yes
2 9960 yes
3 7500 yes
4 7420 yes
.. … …
540 3000 yes
541 2400 no
542 3620 yes
543 2910 no
544 3850 yes

[545 rows x 2 columns]

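[55]: # Merging the dataframes on 'mainroad'. The key is not unique, so each row is
# paired with every row that has the same 'mainroad' value (a many-to-many join).
house_data=house_price.merge(house_area, how='inner',on='mainroad')
house_data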

[55]: price mainroad area


0 13300000 yes 7420
1 13300000 yes 8960
2 13300000 yes 9960
3 13300000 yes 7500
4 13300000 yes 7420
… … … …
224948 1750000 no 3000
224949 1750000 no 3420
224950 1750000 no 2990
224951 1750000 no 2400
224952 1750000 no 2910

[224953 rows x 3 columns]
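The jump from 545 to 224,953 rows follows from the join key: 224,953 = 468² + 77², which is consistent with 468 'yes' rows and 77 'no' rows each being paired with every row of the same value. The sketch below reproduces the effect on a hypothetical three-row frame and shows one possible alternative, joining on the row index; it is an illustration, not the notebook's method.

```python
import pandas as pd

# Hypothetical three-row stand-in for Housing.csv.
homes = pd.DataFrame({
    "price": [13300000, 12250000, 1767150],
    "area": [7420, 8960, 2400],
    "mainroad": ["yes", "yes", "no"],
})
prices = homes[["price", "mainroad"]]
areas = homes[["area", "mainroad"]]

# Non-unique key: 2 'yes' rows x 2 'yes' rows + 1 'no' row x 1 'no' row = 5 rows.
exploded = prices.merge(areas, how="inner", on="mainroad")

# Joining on the row index keeps exactly one output row per input row.
aligned = prices.merge(areas[["area"]], left_index=True, right_index=True)
print(len(exploded), len(aligned))
```

`DataFrame.merge` also accepts `validate="one_to_one"`, which raises an error when the key is not unique on both sides and so catches this kind of row explosion early.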

6 Handling Missing Values


[56]: df.isnull().sum()

[56]: show_id 0
type 0
title 0
director 2634
cast 825
country 831
date_added 10
release_year 0
rating 4
duration 3
listed_in 0
description 0
dtype: int64

[57]: from sklearn.impute import SimpleImputer


import numpy as np

[58]: imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

[ ]: df_trans1 = imputer.fit_transform(df)
# Fails: the 'mean' strategy cannot handle this DataFrame because it contains categorical (text) columns.
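One way around the failure is to fill each column according to its type: the mean for numeric columns and the most frequent value for text columns. The sketch below does this with plain pandas on a hypothetical four-row frame; it is an assumption about what a workable approach could look like, not the notebook's own solution. `SimpleImputer` offers a comparable `strategy='most_frequent'` option.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame; values are illustrative only.
toy = pd.DataFrame({
    "director": ["X", np.nan, "X", "Y"],
    "release_year": [2020.0, np.nan, 2021.0, 2019.0],
})

filled = toy.copy()
for column in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[column]):
        # Numeric column: replace missing values with the column mean.
        filled[column] = filled[column].fillna(filled[column].mean())
    else:
        # Text column: replace missing values with the most frequent value.
        filled[column] = filled[column].fillna(filled[column].mode().iloc[0])

print(filled)
```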

7 Label Encoding
[59]: from sklearn.preprocessing import LabelEncoder

[60]: df1.head()

[60]: price area bedrooms bathrooms stories mainroad guestroom basement \


0 13300000 7420 4 2 3 yes no no
1 12250000 8960 4 4 4 yes no no
2 12250000 9960 3 2 2 yes no yes
3 12215000 7500 4 2 2 yes no yes
4 11410000 7420 4 1 2 yes yes yes

hotwaterheating airconditioning parking prefarea furnishingstatus


0 no yes 2 yes furnished
1 no yes 3 no furnished
2 no no 2 yes semi-furnished
3 no yes 3 yes furnished
4 no yes 2 no furnished

[61]: encoder = LabelEncoder()

[62]: new_house_data=pd.DataFrame()
new_house_data

[62]: Empty DataFrame


Columns: []
Index: []

[63]: new_house_data['area'] = encoder.fit_transform(df1['area'])

[64]: new_house_data

[64]: area
0 232
1 260
2 268
3 237
4 232
.. …
540 39
541 15
542 72
543 35
544 90

[545 rows x 1 columns]

[65]: new_house_data['price'] = df1['price']
new_house_data['bedrooms'] = df1['bedrooms']
new_house_data['stories'] = encoder.fit_transform(df1['stories'])
new_house_data['bathrooms'] = df1['bathrooms']

[66]: new_house_data

[66]: area price bedrooms stories bathrooms


0 232 13300000 4 2 2
1 260 12250000 4 3 4
2 268 12250000 3 1 2
3 237 12215000 4 1 2
4 232 11410000 4 1 1
.. … … … … …
540 39 1820000 2 0 1
541 15 1767150 3 0 1
542 72 1750000 2 0 1
543 35 1750000 3 0 1
544 90 1750000 3 1 1

[545 rows x 5 columns]
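The cells above apply `LabelEncoder` to `area` and `stories`, which are already numeric; the encoder replaces each value with its position among the sorted distinct values (7420 becomes 232), so the original magnitudes are lost. Text columns such as `mainroad` are the more usual candidates. A minimal sketch on a hypothetical three-row frame, using one encoder per column so the learned classes stay available:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; only the column names follow Housing.csv.
homes = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished"],
})

encoded = pd.DataFrame(index=homes.index)
encoders = {}
for column in homes.columns:
    encoder = LabelEncoder()                       # a fresh encoder per column
    encoded[column] = encoder.fit_transform(homes[column])
    encoders[column] = encoder                     # kept for inverse_transform later

# classes_ lists the labels in sorted order; the position is the integer code.
mainroad_classes = list(encoders["mainroad"].classes_)
print(encoded)
```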

[68]: # Now that all columns are numeric, we can handle missing data.

[67]: from sklearn.impute import SimpleImputer


import numpy as np

[69]: imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

[70]: house_NaN_handle = imputer.fit_transform(new_house_data)

[71]: house_NaN_handle

[71]: array([[2.320e+02, 1.330e+07, 4.000e+00, 2.000e+00, 2.000e+00],


[2.600e+02, 1.225e+07, 4.000e+00, 3.000e+00, 4.000e+00],
[2.680e+02, 1.225e+07, 3.000e+00, 1.000e+00, 2.000e+00],
…,
[7.200e+01, 1.750e+06, 2.000e+00, 0.000e+00, 1.000e+00],
[3.500e+01, 1.750e+06, 3.000e+00, 0.000e+00, 1.000e+00],
[9.000e+01, 1.750e+06, 3.000e+00, 1.000e+00, 1.000e+00]])

[72]: house_NaN_handle=pd.DataFrame(house_NaN_handle)

[73]: house_NaN_handle

[73]: 0 1 2 3 4
0 232.0 13300000.0 4.0 2.0 2.0
1 260.0 12250000.0 4.0 3.0 4.0

2 268.0 12250000.0 3.0 1.0 2.0
3 237.0 12215000.0 4.0 1.0 2.0
4 232.0 11410000.0 4.0 1.0 1.0
.. … … … … …
540 39.0 1820000.0 2.0 0.0 1.0
541 15.0 1767150.0 3.0 0.0 1.0
542 72.0 1750000.0 2.0 0.0 1.0
543 35.0 1750000.0 3.0 0.0 1.0
544 90.0 1750000.0 3.0 1.0 1.0

[545 rows x 5 columns]

[74]: house_NaN_handle.isnull().sum()

[74]: 0 0
1 0
2 0
3 0
4 0
dtype: int64
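`fit_transform` returns a NumPy array, which is why the columns above are labelled 0 to 4 instead of by name. A minimal sketch of keeping the labels, assuming a hypothetical two-column frame with one missing value:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric frame with a single gap in 'area'.
numeric = pd.DataFrame({
    "area": [7420.0, np.nan, 9960.0],
    "bedrooms": [4.0, 4.0, 3.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Re-attach the original column names and index to the returned array.
imputed = pd.DataFrame(
    imputer.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)
print(imputed)
```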

8 Creating Dependent and Independent columns: X and Y


[75]: housedata=pd.DataFrame(house_NaN_handle)

[76]: housedata.shape

[76]: (545, 5)

[77]: X = pd.DataFrame(housedata.iloc[:, 0:4].values)
# housedata.iloc[:, 0:4].values extracts all rows of the first four columns as a NumPy array

[78]: X

[78]: 0 1 2 3
0 232.0 13300000.0 4.0 2.0
1 260.0 12250000.0 4.0 3.0
2 268.0 12250000.0 3.0 1.0
3 237.0 12215000.0 4.0 1.0
4 232.0 11410000.0 4.0 1.0
.. … … … …
540 39.0 1820000.0 2.0 0.0
541 15.0 1767150.0 3.0 0.0
542 72.0 1750000.0 2.0 0.0
543 35.0 1750000.0 3.0 0.0
544 90.0 1750000.0 3.0 1.0

[545 rows x 4 columns]

[79]: Y = pd.DataFrame(housedata.iloc[:, -1].values)

[80]: Y = pd.DataFrame(housedata.iloc[:, 4:].values)

[81]: Y

[81]: 0
0 2.0
1 4.0
2 2.0
3 2.0
4 1.0
.. …
540 1.0
541 1.0
542 1.0
543 1.0
544 1.0

[545 rows x 1 columns]

9 Feature Scaling of Dataset: MinMaxScaler


[82]: from sklearn.preprocessing import MinMaxScaler

[83]: scalar = MinMaxScaler()

[84]: X_scaled = pd.DataFrame(scalar.fit_transform(X))

[85]: X_scaled

[85]: 0 1 2 3
0 0.819788 1.000000 0.6 0.666667
1 0.918728 0.909091 0.6 1.000000
2 0.946996 0.909091 0.4 0.333333
3 0.837456 0.906061 0.6 0.333333
4 0.819788 0.836364 0.6 0.333333
.. … … … …
540 0.137809 0.006061 0.2 0.000000
541 0.053004 0.001485 0.4 0.000000
542 0.254417 0.000000 0.2 0.000000
543 0.123675 0.000000 0.4 0.000000
544 0.318021 0.000000 0.4 0.333333

[545 rows x 4 columns]
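With its default range of 0 to 1, `MinMaxScaler` maps each column through x' = (x - min) / (max - min). The sketch below checks that formula by hand against the scaler, using three of the prices shown in the tables above; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three prices from the tables above, as a single-column array.
prices = np.array([[13300000.0], [12250000.0], [1750000.0]])

# Manual min-max scaling, column by column.
column_min = prices.min(axis=0)
column_max = prices.max(axis=0)
manual = (prices - column_min) / (column_max - column_min)

# The scaler with default settings should agree with the manual result.
scaled = MinMaxScaler().fit_transform(prices)
print(manual.ravel())
```

The middle value works out to 10,500,000 / 11,550,000 ≈ 0.909091, matching the second row of column 1 in the output above.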

10 Train Test Split


[93]: from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,
                                                    random_state=2)

[88]: X_train

[88]: 0 1 2 3
166 243.0 5320000.0 3.0 0.0
378 12.0 3640000.0 3.0 2.0
349 141.0 3780000.0 3.0 1.0
368 168.0 3675000.0 2.0 0.0
306 142.0 4165000.0 3.0 1.0
.. … … … …
299 220.0 4200000.0 3.0 0.0
534 39.0 2100000.0 4.0 1.0
493 95.0 2800000.0 3.0 0.0
527 2.0 2275000.0 2.0 0.0
168 115.0 5250000.0 4.0 1.0

[381 rows x 4 columns]

[89]: X_train.shape

[89]: (381, 4)

[90]: X_test

[90]: 0 1 2 3
333 39.0 3920000.0 3.0 1.0
84 84.0 6510000.0 3.0 1.0
439 93.0 3255000.0 2.0 0.0
396 75.0 3500000.0 2.0 0.0
161 188.0 5460000.0 3.0 2.0
.. … … … …
117 80.0 5950000.0 4.0 1.0
314 102.0 4095000.0 2.0 1.0
340 160.0 3850000.0 5.0 1.0
444 46.0 3220000.0 3.0 1.0
307 107.0 4165000.0 3.0 1.0

[164 rows x 4 columns]

[91]: X_test.shape

[91]: (164, 4)

[94]: Y_train

[94]: 0
166 1.0
378 1.0
349 1.0
368 1.0
306 1.0
.. …
299 1.0
534 1.0
493 1.0
527 1.0
168 1.0

[381 rows x 1 columns]

[95]: Y_train.shape

[95]: (381, 1)

[96]: Y_test

[96]: 0
333 1.0
84 1.0
439 1.0
396 1.0
161 1.0
.. …
117 1.0
314 1.0
340 2.0
444 1.0
307 1.0

[164 rows x 1 columns]

[97]: Y_test.shape

[97]: (164, 1)
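With `test_size=0.3` and 545 rows, the test set size is rounded up from 163.5 to 164, leaving 381 rows for training, which matches the shapes above. Note that the split above is applied to `X`, not `X_scaled`, so the printed rows are unscaled. A common alternative, sketched here on hypothetical synthetic data, is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data with the same number of rows as the housing table.
features = np.arange(545 * 4, dtype=float).reshape(545, 4)
target = np.arange(545)

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=2
)

# Fit the scaler on the training rows only, then reuse it for the test rows,
# so that no information from the test set influences the scaling.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train.shape, X_test.shape)
```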
