Midterm Exam

Instructions
1. The exam may be completed in any IDE (Jupyter Notebook, VS Code, Google Colab, etc.)
2. The exam must be completed individually
3. The exam must be submitted no later than May 8, 2023, at 23:59
4. Submit the exam via the following link (https://forms.gle/GWTx3dW89DMrnNYbA) in PDF or HTML format
5. The exam is open book
6. Notebook tidiness also counts toward the grade (describe the code using markdown)

Problem Statement
🔻 You are a data analyst working in law enforcement. You will analyze shooting incidents in the US from 2014 to 2021 to provide insights to the police.
Dataset Preparation
🔻 Prepare the data using the techniques covered in class. The dataset is in the file US-Gun-Violence.csv
In [74]: import pandas as pd

In [75]: df= pd.read_csv("US-Gun-Violence.csv")

In [76]: df.head()

Out[76]: incident_id incident_date state city_or_county address killed injured

0 2201535 December 31 2021 Maryland Capitol Heights Cindy Ln 0 4.0

1 2201716 December 31 2021 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2 2201216 December 31 2021 California Los Angeles 10211 S. Avalon Blvd 0 6.0

3 2200968 December 30 2021 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

4 2201052 December 30 2021 Missouri Kirksville 700 block of E Dodson St 2 2.0

In [77]: df.tail()

Out[77]: incident_id incident_date state city_or_county address killed injured

3386 95550 January 12 2014 Alabama Huntsville University Drive 0 5.0

3387 95146 January 11 2014 Mississippi Jackson 3430 W. Capitol Street 0 4.0

3388 94514 January 5 2014 Pennsylvania Erie 829 Parade St 1 3.0

3389 92704 January 3 2014 New York Queens Farmers Boulevard and 133rd Avenue 1 3.0

3390 92194 January 1 2014 Virginia Norfolk Rockingham Street and Berkley Avenue Extended 2 2.0

Read Data
In [78]: df_uncl = pd.read_csv("US-Gun-Violence.csv", index_col = 0,parse_dates = ["incident_date"])

In [79]: df.describe()

Out[79]: incident_id killed injured

count 3.391000e+03 3391.000000 3389.000000

mean 1.236033e+06 1.054851 4.182650

std 6.459642e+05 2.046927 7.841665

min 9.219400e+04 0.000000 0.000000

25% 6.347590e+05 0.000000 3.000000

50% 1.314253e+06 1.000000 4.000000

75% 1.799726e+06 1.000000 5.000000

max 2.201716e+06 59.000000 441.000000

In [80]: df.size

Out[80]: 23737

Dataset Information
Check the structure of the dataset before analyzing it further.

In [81]: #code here


df.shape

Out[81]: (3391, 7)

In [82]: df.columns

Out[82]: Index(['incident_id', 'incident_date', 'state', 'city_or_county', 'address',
'killed', 'injured'],
dtype='object')

In [83]: df.dtypes

Out[83]: incident_id int64
incident_date object
state object
city_or_county object
address object
killed int64
injured float64
dtype: object

In [84]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3391 entries, 0 to 3390
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 incident_id 3391 non-null int64
1 incident_date 3391 non-null object
2 state 3391 non-null object
3 city_or_county 3391 non-null object
4 address 3386 non-null object
5 killed 3391 non-null int64
6 injured 3389 non-null float64
dtypes: float64(1), int64(2), object(4)
memory usage: 185.6+ KB

📄 Dataset Description
1. incident_id: incident ID
2. incident_date: date of the incident
3. state: state where the incident occurred
4. city_or_county: city or county where the incident occurred
5. address: address of the incident
6. killed: number of people killed
7. injured: number of people injured

Business Question
🔍 Think of at least one question that could yield interesting insight from the data.
Of the 3391 rows in the dataset, is the data complete and free of data errors (missing or duplicated values)? Answer: [Not entirely. The dataset has 3391 rows and 7 columns:

1. incident_id
2. incident_date
3. state
4. city_or_county
5. address
6. killed
7. injured

According to df.info(), address has 5 missing values and injured has 2; the other five columns are complete. With incident_id used as the index, duplicated() finds 1 row that repeats another row's values, and dropping it leaves 3390 rows.]
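The completeness question above can be checked in one pass. The following is a minimal sketch on a tiny hand-made sample, not the real file: the fourth row, its incident_id 9999999, and the blanked address are made up for illustration, and `quality_report` is a hypothetical helper name.

```python
import pandas as pd

# Tiny illustrative sample. Rows 0-2 follow the preview of the dataset;
# row 3 repeats row 0 under a made-up incident_id, and one address is
# blanked on purpose so that both checks have something to find.
sample = pd.DataFrame(
    {
        "incident_id": [2201535, 2201716, 2201216, 9999999],
        "incident_date": ["December 31 2021"] * 4,
        "state": ["Maryland", "Mississippi", "California", "Maryland"],
        "city_or_county": ["Capitol Heights", "Gulfport", "Los Angeles", "Capitol Heights"],
        "address": ["Cindy Ln", None, "10211 S. Avalon Blvd", "Cindy Ln"],
        "killed": [0, 3, 0, 0],
        "injured": [4.0, 4.0, 6.0, 4.0],
    }
)


def quality_report(frame: pd.DataFrame, id_col: str = "incident_id") -> dict:
    """Summarize row count, missing values per column, and duplicate rows."""
    without_id = frame.drop(columns=[id_col])
    return {
        "rows": len(frame),
        "missing": frame.isna().sum().to_dict(),
        # Duplicates when the id column takes part in the comparison.
        "duplicates_with_id": int(frame.duplicated().sum()),
        # Duplicates when the id column is ignored (as happens when it is the index).
        "duplicates_without_id": int(without_id.duplicated().sum()),
    }


report = quality_report(sample)
print(report)
```

Reporting duplicates both with and without the id column makes it visible when two records differ only in their identifier, which is worth inspecting before deleting anything.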

Data Preprocessing
Next, apply several data preprocessing techniques.

DateTime Conversion
Are there any columns that should be converted to datetime? Convert them if so. Answer: yes, incident_date is stored as object and should be converted to datetime.

In [85]: df.dtypes

Out[85]: incident_id int64
incident_date object
state object
city_or_county object
address object
killed int64
injured float64
dtype: object

In [86]: df.head()

Out[86]: incident_id incident_date state city_or_county address killed injured

0 2201535 December 31 2021 Maryland Capitol Heights Cindy Ln 0 4.0

1 2201716 December 31 2021 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2 2201216 December 31 2021 California Los Angeles 10211 S. Avalon Blvd 0 6.0

3 2200968 December 30 2021 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

4 2201052 December 30 2021 Missouri Kirksville 700 block of E Dodson St 2 2.0

In [87]: df_conv = pd.read_csv("US-Gun-Violence.csv", parse_dates = ["incident_date"])

In [88]: df_conv.dtypes

Out[88]: incident_id int64
incident_date datetime64[ns]
state object
city_or_county object
address object
killed int64
injured float64
dtype: object

In [89]: df2 = df.copy()
df2["incident_date"] = pd.to_datetime(df2["incident_date"])

In [90]: df2.dtypes

Out[90]: incident_id int64
incident_date datetime64[ns]
state object
city_or_county object
address object
killed int64
injured float64
dtype: object

In [91]: df

Out[91]: incident_id incident_date state city_or_county address killed injured

0 2201535 December 31 2021 Maryland Capitol Heights Cindy Ln 0 4.0

1 2201716 December 31 2021 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2 2201216 December 31 2021 California Los Angeles 10211 S. Avalon Blvd 0 6.0

3 2200968 December 30 2021 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

4 2201052 December 30 2021 Missouri Kirksville 700 block of E Dodson St 2 2.0

... ... ... ... ... ... ... ...

3386 95550 January 12 2014 Alabama Huntsville University Drive 0 5.0

3387 95146 January 11 2014 Mississippi Jackson 3430 W. Capitol Street 0 4.0

3388 94514 January 5 2014 Pennsylvania Erie 829 Parade St 1 3.0

3389 92704 January 3 2014 New York Queens Farmers Boulevard and 133rd Avenue 1 3.0

3390 92194 January 1 2014 Virginia Norfolk Rockingham Street and Berkley Avenue Extended 2 2.0

3391 rows × 7 columns

In [92]: ex = pd.Series(["31-12-2021", "01-01-2014"])
pd.to_datetime(ex, dayfirst=True)

Out[92]: 0 2021-12-31
1 2014-01-01
dtype: datetime64[ns]

In [93]: ex2 = pd.Series(["12 31 2021", "01 01 2014"])
pd.to_datetime(ex2, format="%m %d %Y")

Out[93]: 0 2021-12-31
1 2014-01-01
dtype: datetime64[ns]
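The two examples above use numeric dates, while incident_date in this dataset is written as a month name, day, and year (for example "December 31 2021"). The sketch below shows how an explicit format could be passed for that layout. The third value is a made-up invalid entry, included only to show what `errors="coerce"` does.

```python
import pandas as pd

# Two dates in the same layout as incident_date, plus one deliberately
# invalid value (made up for illustration).
dates = pd.Series(["December 31 2021", "January 5 2014", "not a date"])

# %B = full month name, %d = day of month, %Y = four-digit year.
# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%B %d %Y", errors="coerce")

print(parsed)
print("unparseable values:", parsed.isna().sum())
```

An explicit format avoids relying on pandas to guess the layout, and counting NaT values afterwards is a quick way to see whether any entries failed to parse.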

Checking Missing Values


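Does the data contain missing values? Answer: [Yes. isna().sum() reports 5 missing values in address and 2 in injured. The truncated isna() and notna() previews show only the first and last rows, which happen to be complete, so the missing entries are not visible there.]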

In [94]: #code here


df_uncl

Out[94]: incident_date state city_or_county address killed injured

incident_id

2201535 2021-12-31 Maryland Capitol Heights Cindy Ln 0 4.0

2201716 2021-12-31 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2201216 2021-12-31 California Los Angeles 10211 S. Avalon Blvd 0 6.0

2200968 2021-12-30 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

2201052 2021-12-30 Missouri Kirksville 700 block of E Dodson St 2 2.0

... ... ... ... ... ... ...

95550 2014-01-12 Alabama Huntsville University Drive 0 5.0

95146 2014-01-11 Mississippi Jackson 3430 W. Capitol Street 0 4.0

94514 2014-01-05 Pennsylvania Erie 829 Parade St 1 3.0

92704 2014-01-03 New York Queens Farmers Boulevard and 133rd Avenue 1 3.0

92194 2014-01-01 Virginia Norfolk Rockingham Street and Berkley Avenue Extended 2 2.0

3391 rows × 6 columns

Which handling is most appropriate for this data?

In [95]: df_uncl.isna()

Out[95]: incident_date state city_or_county address killed injured

incident_id

2201535 False False False False False False

2201716 False False False False False False

2201216 False False False False False False

2200968 False False False False False False

2201052 False False False False False False

... ... ... ... ... ... ...

95550 False False False False False False

95146 False False False False False False

94514 False False False False False False

92704 False False False False False False

92194 False False False False False False

3391 rows × 6 columns

In [96]: df_uncl.isna().sum()

Out[96]: incident_date 0
state 0
city_or_county 0
address 5
killed 0
injured 2
dtype: int64
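The counts above show gaps only in address and injured. One possible way to handle them is sketched below on a small hand-made sample. The choices are assumptions, not something the notebook prescribes: a missing address is labeled "Unknown" so the incident is kept, and rows with a missing injured count are dropped because the number cannot be guessed safely.

```python
import numpy as np
import pandas as pd

# Small illustrative sample with one missing address and one missing
# injured count (the gaps are placed by hand for this sketch).
sample = pd.DataFrame(
    {
        "address": ["Cindy Ln", None, "829 Parade St", "University Drive"],
        "killed": [0, 3, 1, 0],
        "injured": [4.0, 4.0, np.nan, 5.0],
    }
)

cleaned = (
    sample
    # Assumption: the incident is still valid without an address.
    .assign(address=sample["address"].fillna("Unknown"))
    # Assumption: a missing victim count should not be imputed.
    .dropna(subset=["injured"])
    # With no NaN left, the count can be stored as an integer.
    .astype({"injured": "int64"})
)

print(cleaned)
print(cleaned.isna().sum())
```

Filling injured with 0 or with the median would be alternatives, but either one changes totals computed later, so dropping two rows out of 3391 is the more conservative choice here.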

In [97]: #code here


df_uncl.notna()

Out[97]: incident_date state city_or_county address killed injured

incident_id

2201535 True True True True True True

2201716 True True True True True True

2201216 True True True True True True

2200968 True True True True True True

2201052 True True True True True True

... ... ... ... ... ... ...

95550 True True True True True True

95146 True True True True True True

94514 True True True True True True

92704 True True True True True True

92194 True True True True True True

3391 rows × 6 columns

In [98]: df_uncl[df_uncl["incident_date"].notna()]

Out[98]: incident_date state city_or_county address killed injured

incident_id

2201535 2021-12-31 Maryland Capitol Heights Cindy Ln 0 4.0

2201716 2021-12-31 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2201216 2021-12-31 California Los Angeles 10211 S. Avalon Blvd 0 6.0

2200968 2021-12-30 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

2201052 2021-12-30 Missouri Kirksville 700 block of E Dodson St 2 2.0

... ... ... ... ... ... ...

95550 2014-01-12 Alabama Huntsville University Drive 0 5.0

95146 2014-01-11 Mississippi Jackson 3430 W. Capitol Street 0 4.0

94514 2014-01-05 Pennsylvania Erie 829 Parade St 1 3.0

92704 2014-01-03 New York Queens Farmers Boulevard and 133rd Avenue 1 3.0

92194 2014-01-01 Virginia Norfolk Rockingham Street and Berkley Avenue Extended 2 2.0

3391 rows × 6 columns

In [99]: df_uncl.isnull()

Out[99]: incident_date state city_or_county address killed injured

incident_id

2201535 False False False False False False

2201716 False False False False False False

2201216 False False False False False False

2200968 False False False False False False

2201052 False False False False False False

... ... ... ... ... ... ...

95550 False False False False False False

95146 False False False False False False

94514 False False False False False False

92704 False False False False False False

92194 False False False False False False

3391 rows × 6 columns

Checking Duplicates
Does the data contain duplicate rows? Yes, there is 1 duplicate row, because its values are identical to those of another row.

In [100… #code here


df_uncl.duplicated().sum()

Out[100]: 1

In [101… df_uncl.drop_duplicates()

Out[101]: incident_date state city_or_county address killed injured

incident_id

2201535 2021-12-31 Maryland Capitol Heights Cindy Ln 0 4.0

2201716 2021-12-31 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2201216 2021-12-31 California Los Angeles 10211 S. Avalon Blvd 0 6.0

2200968 2021-12-30 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

2201052 2021-12-30 Missouri Kirksville 700 block of E Dodson St 2 2.0

... ... ... ... ... ... ...

95550 2014-01-12 Alabama Huntsville University Drive 0 5.0

95146 2014-01-11 Mississippi Jackson 3430 W. Capitol Street 0 4.0

94514 2014-01-05 Pennsylvania Erie 829 Parade St 1 3.0

92704 2014-01-03 New York Queens Farmers Boulevard and 133rd Avenue 1 3.0

92194 2014-01-01 Virginia Norfolk Rockingham Street and Berkley Avenue Extended 2 2.0

3390 rows × 6 columns

In [102… df_uncl.head()

Out[102]: incident_date state city_or_county address killed injured

incident_id

2201535 2021-12-31 Maryland Capitol Heights Cindy Ln 0 4.0

2201716 2021-12-31 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2201216 2021-12-31 California Los Angeles 10211 S. Avalon Blvd 0 6.0

2200968 2021-12-30 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

2201052 2021-12-30 Missouri Kirksville 700 block of E Dodson St 2 2.0

In [103… df_uncl.drop_duplicates(keep="first").head()

Out[103]: incident_date state city_or_county address killed injured

incident_id

2201535 2021-12-31 Maryland Capitol Heights Cindy Ln 0 4.0

2201716 2021-12-31 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2201216 2021-12-31 California Los Angeles 10211 S. Avalon Blvd 0 6.0

2200968 2021-12-30 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

2201052 2021-12-30 Missouri Kirksville 700 block of E Dodson St 2 2.0

In [104… df_uncl.drop_duplicates(keep="last").head()

Out[104]: incident_date state city_or_county address killed injured

incident_id

2201535 2021-12-31 Maryland Capitol Heights Cindy Ln 0 4.0

2201716 2021-12-31 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2201216 2021-12-31 California Los Angeles 10211 S. Avalon Blvd 0 6.0

2200968 2021-12-30 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

2201052 2021-12-30 Missouri Kirksville 700 block of E Dodson St 2 2.0

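Should the duplicate be dropped?

Answer: [Yes, it should be dropped, because a row whose values exactly match another row in every column adds no information. Note that the check ran on df_uncl, where incident_id is the index and is not compared, so the two rows match on the six remaining columns but carry different incident_id values.]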
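Before dropping anything, it can help to look at the duplicated rows side by side. The sketch below uses a three-row sample indexed by incident_id, like df_uncl; the second row and its incident_id 9999999 are made up so that a duplicate exists.

```python
import pandas as pd

# Sample indexed by incident_id. The second row is a made-up copy of the
# first under a different (hypothetical) id.
sample = pd.DataFrame(
    {
        "incident_date": pd.to_datetime(["2021-12-31", "2021-12-31", "2021-12-30"]),
        "state": ["Maryland", "Maryland", "Missouri"],
        "city_or_county": ["Capitol Heights", "Capitol Heights", "Kirksville"],
        "killed": [0, 0, 2],
        "injured": [4.0, 4.0, 2.0],
    },
    index=pd.Index([2201535, 9999999, 2201052], name="incident_id"),
)

# keep=False flags every member of a duplicate group, so both rows show up.
dupes = sample[sample.duplicated(keep=False)]
print(dupes)

# drop_duplicates returns a new DataFrame; assign it, otherwise the
# original frame is unchanged.
deduped = sample.drop_duplicates(keep="first")
print(len(sample), "->", len(deduped))
```

In the cells above, `df_uncl.drop_duplicates()` is called without assigning the result, so df_uncl itself still holds 3391 rows afterwards.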

Feature Engineering
Apply feature engineering if needed.

In [105… df2

Out[105]: incident_id incident_date state city_or_county address killed injured

0 2201535 2021-12-31 Maryland Capitol Heights Cindy Ln 0 4.0

1 2201716 2021-12-31 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2 2201216 2021-12-31 California Los Angeles 10211 S. Avalon Blvd 0 6.0

3 2200968 2021-12-30 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

4 2201052 2021-12-30 Missouri Kirksville 700 block of E Dodson St 2 2.0

... ... ... ... ... ... ... ...

3386 95550 2014-01-12 Alabama Huntsville University Drive 0 5.0

3387 95146 2014-01-11 Mississippi Jackson 3430 W. Capitol Street 0 4.0

3388 94514 2014-01-05 Pennsylvania Erie 829 Parade St 1 3.0

3389 92704 2014-01-03 New York Queens Farmers Boulevard and 133rd Avenue 1 3.0

3390 92194 2014-01-01 Virginia Norfolk Rockingham Street and Berkley Avenue Extended 2 2.0

3391 rows × 7 columns

In [106… df2["incident_date"].dt.day

Out[106]: 0 31
1 31
2 31
3 30
4 30
..
3386 12
3387 11
3388 5
3389 3
3390 1
Name: incident_date, Length: 3391, dtype: int64

In [107… df2["incident_date"].dt.month_name()

Out[107]: 0 December
1 December
2 December
3 December
4 December
...
3386 January
3387 January
3388 January
3389 January
3390 January
Name: incident_date, Length: 3391, dtype: object

In [108… df2["incident_date"].dt.day_name()

Out[108]: 0 Friday
1 Friday
2 Friday
3 Thursday
4 Thursday
...
3386 Sunday
3387 Saturday
3388 Sunday
3389 Friday
3390 Wednesday
Name: incident_date, Length: 3391, dtype: object

In [109… df2.head()

Out[109]: incident_id incident_date state city_or_county address killed injured

0 2201535 2021-12-31 Maryland Capitol Heights Cindy Ln 0 4.0

1 2201716 2021-12-31 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2 2201216 2021-12-31 California Los Angeles 10211 S. Avalon Blvd 0 6.0

3 2200968 2021-12-30 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

4 2201052 2021-12-30 Missouri Kirksville 700 block of E Dodson St 2 2.0

In [110… df2['incident_date'].dt.to_period('D')

Out[110]: 0 2021-12-31
1 2021-12-31
2 2021-12-31
3 2021-12-30
4 2021-12-30
...
3386 2014-01-12
3387 2014-01-11
3388 2014-01-05
3389 2014-01-03
3390 2014-01-01
Name: incident_date, Length: 3391, dtype: period[D]

In [111… df2['incident_date'].dt.to_period('M')

Out[111]: 0 2021-12
1 2021-12
2 2021-12
3 2021-12
4 2021-12
...
3386 2014-01
3387 2014-01
3388 2014-01
3389 2014-01
3390 2014-01
Name: incident_date, Length: 3391, dtype: period[M]

In [112… df2['incident_date'].dt.to_period('Q')

Out[112]: 0 2021Q4
1 2021Q4
2 2021Q4
3 2021Q4
4 2021Q4
...
3386 2014Q1
3387 2014Q1
3388 2014Q1
3389 2014Q1
3390 2014Q1
Name: incident_date, Length: 3391, dtype: period[Q-DEC]
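The cells above compute date parts but do not store them. A minimal sketch of keeping them as new columns follows, using three rows taken from the preview of the dataset. The column names `year`, `month`, `day_name`, `quarter`, and `victims` are illustrative choices, not names from the original notebook.

```python
import pandas as pd

# Three rows from the dataset preview, with dates already parsed.
sample = pd.DataFrame(
    {
        "incident_date": pd.to_datetime(["2021-12-31", "2021-12-30", "2014-01-05"]),
        "killed": [3, 2, 1],
        "injured": [4.0, 2.0, 3.0],
    }
)

features = sample.assign(
    year=sample["incident_date"].dt.year,
    month=sample["incident_date"].dt.month_name(),
    day_name=sample["incident_date"].dt.day_name(),
    # Periods are converted to strings so they are easy to group and print.
    quarter=sample["incident_date"].dt.to_period("Q").astype(str),
    # Hypothetical derived column: total people hit per incident.
    victims=sample["killed"] + sample["injured"],
)

print(features)
```

Storing the parts as columns makes later grouping, for example by `day_name` or `year`, a single `groupby` call.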

Category Conversion
Are there any columns that you think should be converted to the category data type? If so, convert them here.

In [113… df2 = df.copy()

In [114… df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3391 entries, 0 to 3390
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 incident_id 3391 non-null int64
1 incident_date 3391 non-null object
2 state 3391 non-null object
3 city_or_county 3391 non-null object
4 address 3386 non-null object
5 killed 3391 non-null int64
6 injured 3389 non-null float64
dtypes: float64(1), int64(2), object(4)
memory usage: 185.6+ KB

In [115… df2.nunique()

Out[115]: incident_id 3391
incident_date 1747
state 49
city_or_county 936
address 3354
killed 19
injured 26
dtype: int64

In [116… df2["state"].unique()

Out[116]: array(['Maryland', 'Mississippi', 'California', 'Pennsylvania',
'Missouri', 'Colorado', 'Ohio', 'Alabama', 'New York', 'Texas',
'Virginia', 'Louisiana', 'Georgia', 'Illinois', 'North Carolina',
'Florida', 'Michigan', 'Wisconsin', 'Tennessee', 'New Jersey',
'South Carolina', 'Oregon', 'Arizona', 'South Dakota', 'Iowa',
'Kentucky', 'Oklahoma', 'Nevada', 'Idaho', 'Washington',
'Arkansas', 'Minnesota', 'Delaware', 'District of Columbia',
'Kansas', 'West Virginia', 'Indiana', 'Massachusetts',
'Rhode Island', 'New Hampshire', 'Connecticut', 'Nebraska',
'Alaska', 'Utah', 'New Mexico', 'Maine', 'Montana', 'Wyoming',
'Vermont'], dtype=object)

In [117… df2.nunique()

Out[117]: incident_id 3391
incident_date 1747
state 49
city_or_county 936
address 3354
killed 19
injured 26
dtype: int64

In [118… df_conv.dtypes

Out[118]: incident_id int64
incident_date datetime64[ns]
state object
city_or_county object
address object
killed int64
injured float64
dtype: object

In [119… df_before = df2.copy()

In [120… df2['state'] = df2['state'].astype('category')

In [121… df2.dtypes

Out[121]: incident_id int64
incident_date object
state category
city_or_county object
address object
killed int64
injured float64
dtype: object
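In Out[121], incident_date is object again because df2 was re-created from df, which discards the earlier datetime conversion. The sketch below shows one way to apply both conversions in a single step so that neither is lost; it runs on a four-row sample taken from the dataset preview, and the name `typed` is illustrative.

```python
import pandas as pd

# Four rows in the raw layout of the CSV (dates still stored as text).
sample = pd.DataFrame(
    {
        "incident_date": [
            "December 31 2021",
            "December 30 2021",
            "January 11 2014",
            "January 5 2014",
        ],
        "state": ["Mississippi", "Pennsylvania", "Mississippi", "Pennsylvania"],
    }
)

# Convert both columns at once so that a later copy does not undo one of them.
typed = sample.assign(
    incident_date=pd.to_datetime(sample["incident_date"], format="%B %d %Y"),
    state=sample["state"].astype("category"),
)

print(typed.dtypes)
print(typed["state"].cat.categories.tolist())
```

With 49 distinct values across 3391 rows, state is a reasonable candidate for the category type, whereas address (3354 distinct values) is not.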

In [122… df2

Out[122]: incident_id incident_date state city_or_county address killed injured

0 2201535 December 31 2021 Maryland Capitol Heights Cindy Ln 0 4.0

1 2201716 December 31 2021 Mississippi Gulfport 1200 block of Lewis Ave 3 4.0

2 2201216 December 31 2021 California Los Angeles 10211 S. Avalon Blvd 0 6.0

3 2200968 December 30 2021 Pennsylvania Philadelphia 5100 block of Germantown Ave 0 6.0

4 2201052 December 30 2021 Missouri Kirksville 700 block of E Dodson St 2 2.0

... ... ... ... ... ... ... ...

3386 95550 January 12 2014 Alabama Huntsville University Drive 0 5.0

3387 95146 January 11 2014 Mississippi Jackson 3430 W. Capitol Street 0 4.0

3388 94514 January 5 2014 Pennsylvania Erie 829 Parade St 1 3.0

3389 92704 January 3 2014 New York Queens Farmers Boulevard and 133rd Avenue 1 3.0

3390 92194 January 1 2014 Virginia Norfolk Rockingham Street and Berkley Avenue Extended 2 2.0

3391 rows × 7 columns

Your Analysis
🔻 Analyze the data based on the business question you defined, using the methods covered in class. Write down the insights you find.
On which date did the most shooting incidents occur?: [July 5 2020, with 15 incidents, followed by July 4 2021 (11), May 23 2020 (9), June 20 2020 (9), and June 20 2021 (8). The incidents are spread over 1747 distinct dates.]

In [123… #code here


df["incident_date"].value_counts()

Out[123]: July 5 2020 15
July 4 2021 11
May 23 2020 9
June 20 2020 9
June 20 2021 8
..
April 4 2018 1
April 2 2018 1
March 31 2018 1
March 24 2018 1
January 1 2014 1
Name: incident_date, Length: 1747, dtype: int64

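Insight :

The date with the most recorded incidents is July 5 2020 (15 incidents), followed by July 4 2021 (11), May 23 2020 (9), June 20 2020 (9), and June 20 2021 (8). All five of the busiest dates fall in 2020 or 2021, between late May and early July. Note that value_counts() counts incidents per date, not the number of people killed, and that the 3391 incidents are spread over 1747 distinct dates.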
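Because `value_counts()` counts incidents rather than deaths, a question about killings needs the killed column to be summed. The sketch below shows one way to do that with `groupby`, using the first five rows of the dataset preview as a sample; `per_date`, `busiest_date`, `deadliest_date`, and `by_weekday` are illustrative names.

```python
import pandas as pd

# First five rows of the dataset preview, with dates already parsed.
sample = pd.DataFrame(
    {
        "incident_date": pd.to_datetime(
            ["2021-12-31", "2021-12-31", "2021-12-31", "2021-12-30", "2021-12-30"]
        ),
        "killed": [0, 3, 0, 0, 2],
        "injured": [4.0, 4.0, 6.0, 6.0, 2.0],
    }
)

# Per calendar date: number of incidents and total number of people killed.
per_date = sample.groupby("incident_date").agg(
    incidents=("killed", "size"),
    killed=("killed", "sum"),
)

busiest_date = per_date["incidents"].idxmax()   # most incidents
deadliest_date = per_date["killed"].idxmax()    # most people killed

# Total killed per day of the week.
by_weekday = sample.groupby(sample["incident_date"].dt.day_name())["killed"].sum()

print(per_date)
print(busiest_date, deadliest_date)
print(by_weekday)
```

On the full dataset the busiest date and the deadliest date need not coincide, which is why the two are computed separately.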
In [ ]:
