DataWrangling - Jupyter Notebook
import pandas as pd
[Table output truncated in extraction. Partially visible rows: region "florida keys", region_url https://keys.craigslist.org, price 21000; region "worcester / central MA", region_url https://worcester.craigslist.org, price 1500.]
In [5]: ## Multiple row and column selections using iloc and DataFrame
# The iloc indexer syntax is data.iloc[<row selection>, <column selection>]
datav2.iloc[0:5] # first five rows of the DataFrame
[Table output truncated in extraction. Partially visible rows: 2 (id 7221797935, region "florida keys") and 3 (id 7222270760, region "worcester / central MA").]
5 rows × 26 columns
In [6]: datav2.iloc[:,0:2]# first two columns of data frame with all rows
Out[6]: id url
0 7222695916 https://prescott.craigslist.org/cto/d/prescott...
1 7218891961 https://fayar.craigslist.org/ctd/d/bentonville...
2 7221797935 https://keys.craigslist.org/cto/d/summerland-k...
3 7222270760 https://worcester.craigslist.org/cto/d/west-br...
4 7210384030 https://greensboro.craigslist.org/cto/d/trinit...
In [48]: datav2.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th rows + 1st, 6th, 7th columns
In [49]: datav2.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of the DataFrame
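The positional selections above can be tried on a small stand-in frame. This is only a sketch: `toy` and its columns are hypothetical and much narrower than the 26-column `datav2`. It shows that `iloc` counts from 0 and that the end of a slice is exclusive.

```python
import pandas as pd

# Hypothetical toy frame standing in for datav2 (assumed columns, not the real data)
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "region": ["prescott", "erie", "auburn", "el paso"],
    "price": [6000, 3000, 33590, 0],
    "year": [2012, 2008, 2019, 2015],
})

first_two_rows = toy.iloc[0:2]      # rows at positions 0 and 1
first_two_cols = toy.iloc[:, 0:2]   # all rows, columns at positions 0 and 1
picked = toy.iloc[[0, 3], [0, 2]]   # rows 0 and 3, columns 0 and 2 (lists of positions)
block = toy.iloc[1:3, 1:4]          # rows 1-2 and columns 1-3; the slice end is exclusive

print(block)
```

Because the slice end is excluded, `1:4` covers positions 1, 2 and 3, that is, the 2nd, 3rd and 4th columns.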
Out[50]: 0 prescott
1 fayetteville
2 florida keys
3 worcester / central MA
4 greensboro
5 hudson valley
6 hudson valley
7 hudson valley
8 medford-ashland
9 erie
10 el paso
11 el paso
12 el paso
13 el paso
14 el paso
15 bellingham
16 bellingham
17 bellingham
18 bellingham
19 bellingham
20 bellingham
21 bellingham
22 bellingham
23 bellingham
24 skagit / island / SJI
25 skagit / island / SJI
26 la crosse
27 auburn
28 auburn
29 auburn
30 auburn
31 auburn
32 auburn
33 auburn
34 auburn
35 auburn
36 auburn
37 auburn
38 auburn
39 auburn
40 auburn
41 auburn
42 auburn
43 auburn
44 auburn
45 auburn
46 auburn
47 auburn
48 auburn
49 auburn
Name: region, dtype: object
In [51]: # display the first two rows using slicing
datav2[0:2]
2 rows × 26 columns
In [52]: # display or access contents of more than one column
datav2[['id','url','region']]
Out[52]: id url region
2 7221797935 https://keys.craigslist.org/cto/d/summerland-k... florida keys
3 7222270760 https://worcester.craigslist.org/cto/d/west-br... worcester / central MA
5 7222379453 https://hudsonvalley.craigslist.org/cto/d/west... hudson valley
6 7221952215 https://hudsonvalley.craigslist.org/cto/d/west... hudson valley
7 7220195662 https://hudsonvalley.craigslist.org/cto/d/poug... hudson valley
2 7221797935 florida keys https://keys.craigslist.org 21000 NaN NaN
3 7222270760 worcester / central MA https://worcester.craigslist.org 1500 NaN NaN
5 7222379453 hudson valley https://hudsonvalley.craigslist.org 1600 NaN NaN
6 7221952215 hudson valley https://hudsonvalley.craigslist.org 1000 NaN NaN
7 7220195662 hudson valley https://hudsonvalley.craigslist.org 15995 NaN NaN
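The bracket selections used in the cells above behave differently depending on what goes inside `[]`. A minimal sketch, assuming a hypothetical three-column frame `toy` rather than the real `datav2`:

```python
import pandas as pd

# Hypothetical stand-in for datav2 with only three columns
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "url": ["u1", "u2", "u3"],
    "region": ["prescott", "erie", "auburn"],
})

top_two = toy[0:2]               # a slice inside [] selects rows by position
subset = toy[["id", "region"]]   # a list of labels selects columns and returns a DataFrame
one_col = toy["region"]          # a single label returns a Series

print(type(subset).__name__, type(one_col).__name__)
```

Note that `toy[0:2]` returns two rows, not one; a single row by position would be `toy[0:1]` or `toy.iloc[[0]]`.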
In [59]: #datav2['age']
In [61]: #datav2['region']
Out[69]: 0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
In [70]: # function (definition) to double the price in a particular column
def pricedouble(x):
    return x*2
datav2['pricedouble'] = datav2['price'].apply(pricedouble)
In [71]: datav2[['price','pricedouble']]
Out[71]: price pricedouble
0 6000 12000
1 11900 23800
2 21000 42000
3 1500 3000
4 4900 9800
5 1600 3200
6 1000 2000
7 15995 31990
8 5000 10000
9 3000 6000
10 0 0
11 0 0
12 0 0
13 0 0
14 0 0
15 13995 27990
16 24999 49998
17 21850 43700
18 26850 53700
19 11999 23998
20 24999 49998
21 21850 43700
22 26850 53700
23 11999 23998
24 24999 49998
25 21850 43700
26 500 1000
27 33590 67180
28 22590 45180
29 39590 79180
30 30990 61980
31 15000 30000
32 27990 55980
33 34590 69180
34 35000 70000
35 29990 59980
36 38590 77180
37 4500 9000
38 32990 65980
39 24590 49180
40 30990 61980
41 27990 55980
42 37990 75980
43 33590 67180
44 30990 61980
45 27990 55980
46 0 0
47 34590 69180
48 30590 61180
49 32990 65980
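For a simple arithmetic transform like this, `apply` and a vectorized expression should give the same column. The sketch below uses a hypothetical three-row frame `prices` with values taken from the output above; the `pricedouble_vec` column name is an assumption added for comparison.

```python
import pandas as pd

def pricedouble(x):
    # Called once per element by Series.apply
    return x * 2

# Hypothetical small frame; the values mirror rows 0, 1 and 10 above
prices = pd.DataFrame({"price": [6000, 11900, 0]})

prices["pricedouble"] = prices["price"].apply(pricedouble)  # element-wise function call
prices["pricedouble_vec"] = prices["price"] * 2             # vectorized equivalent

print(prices)
```

The vectorized form avoids a Python-level call per row, which usually matters on larger frames; `apply` remains useful when the per-element logic cannot be written as a column expression.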
price pr
0 6000 24000
1 11900 47600
2 21000 84000
3 1500 6000
4 4900 19600
5 1600 6400
6 1000 4000
7 15995 63980
8 5000 20000
9 3000 12000
10 0 0
11 0 0
12 0 0
13 0 0
14 0 0
15 13995 55980
16 24999 99996
17 21850 87400
18 26850 107400
19 11999 47996
20 24999 99996
21 21850 87400
22 26850 107400
23 11999 47996
24 24999 99996
25 21850 87400
26 500 2000
27 33590 134360
28 22590 90360
29 39590 158360
30 30990 123960
31 15000 60000
32 27990 111960
33 34590 138360
34 35000 140000
35 29990 119960
36 38590 154360
37 4500 18000
38 32990 131960
39 24590 98360
40 30990 123960
41 27990 111960
42 37990 151960
43 33590 134360
44 30990 123960
45 27990 111960
46 0 0
47 34590 138360
48 30590 122360
49 32990 131960
Out[66]: 0 prescott
1 fayetteville
2 florida keys
3 worcester / central MA
4 greensboro
5 hudson valley
6 hudson valley
7 hudson valley
8 medford-ashland
9 erie
10 el paso
11 el paso
12 el paso
13 el paso
14 el paso
15 bellingham
16 bellingham
17 bellingham
18 bellingham
19 bellingham
20 bellingham
21 bellingham
22 bellingham
23 bellingham
24 skagit / island / SJI
25 skagit / island / SJI
26 la crosse
27 auburn
28 auburn
29 auburn
30 auburn
31 auburn
32 auburn
33 auburn
34 auburn
35 auburn
36 auburn
37 auburn
38 auburn
39 auburn
40 auburn
41 auburn
42 auburn
43 auburn
44 auburn
45 auburn
46 auburn
47 auburn
48 auburn
49 auburn
Name: region, dtype: object
In [67]: #REMOVING DUPLICATES
In [69]: datav2['region']
# dict.fromkeys() removes duplicates while keeping first-seen order
datav5 = list(dict.fromkeys(datav2['region']))
print(datav5)
# using set() to remove duplicates from the list
# (set() does not preserve the original order)
test_list = list(set(datav2['region']))
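The two approaches above differ in ordering, and pandas also has its own de-duplication methods. A minimal sketch on a hypothetical `regions` Series (not the real `datav2['region']` column):

```python
import pandas as pd

# Hypothetical Series with repeated region names
regions = pd.Series(["auburn", "erie", "auburn", "el paso", "erie"])

ordered = list(dict.fromkeys(regions))   # keeps first-seen order
unordered = list(set(regions))           # unique values, order not guaranteed
via_pandas = regions.unique().tolist()   # pandas method, also keeps first-seen order
deduped = regions.drop_duplicates()      # Series that keeps the original index labels

print(ordered)
print(deduped)
```

`drop_duplicates()` is usually the more convenient choice when the result should stay a Series (or when whole DataFrame rows are being de-duplicated), since it preserves the index of the first occurrence.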
Out[72]: 0.0 2
Name: size, dtype: int64
Out[78]: id 0
region 0
region_url 0
price 0
year 27
manufacturer 27
model 27
condition 27
cylinders 34
fuel 27
odometer 27
title_status 27
transmission 27
VIN 31
drive 38
size 48
type 28
paint_color 29
image_url 27
description 27
county 50
state 0
lat 27
long 27
posting_date 27
age 27
price_mile 27
pricedouble 0
pr 0
dtype: int64
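Output of this shape (one count per column, dtype int64) is what `DataFrame.isnull().sum()` produces; the cell that generated it is not shown, so treating it as that call is an assumption. A sketch on a hypothetical frame `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values
toy = pd.DataFrame({
    "price": [6000, 0, 1500],
    "year": [2012.0, np.nan, np.nan],
    "county": [np.nan, np.nan, np.nan],
})

missing = toy.isnull().sum()                                  # missing values per column
all_missing = missing[missing == len(toy)].index.tolist()    # columns with no data at all

print(missing)
print(all_missing)
```

In the counts above, `county` shows 50 missing values, which matches the 50 rows visible in the earlier outputs, so that column appears to be entirely empty in this sample.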