Abdulmu'min 2210631150051
Instructions
1. The midterm exam (UTS) may be completed in any IDE (Jupyter Notebook, VS Code, Google Colab, etc.)
2. The exam must be completed individually
3. The exam must be submitted no later than 8 May 2023 at 23:59
4. Submit the exam at the following link (https://forms.gle/GWTx3dW89DMrnNYbA) in PDF or HTML format
5. The exam is open book
6. Notebook tidiness is also graded (describe the code using markdown)
Problem Statement
🔻 You are a data analyst working in law enforcement. You will analyze the data to give the police insight into shooting incidents in the US that occurred between 2014 and 2021.
Dataset Preparation
🔻 Prepare the data using the techniques you have learned. The dataset is in the file US-Gun-Violence.csv
In [74]: import pandas as pd
In [76]: df.head()
In [77]: df.tail()
      incident_id  incident_date     state         city_or_county  address                                         killed  injured
3387  95146        January 11 2014   Mississippi   Jackson         3430 W. Capitol Street                          0       4.0
3389  92704        January 3 2014    New York      Queens          Farmers Boulevard and 133rd Avenue              1       3.0
3390  92194        January 1 2014    Virginia      Norfolk         Rockingham Street and Berkley Avenue Extended   2       2.0
Read Data
In [78]: df_uncl = pd.read_csv("US-Gun-Violence.csv", index_col=0, parse_dates=["incident_date"])
In [79]: df.describe()
In [80]: df.size
Out[80]: 23737
Dataset Information
Check the structure of the dataset before analyzing it further
Out[81]: (3391, 7)
In [82]: df.columns
In [83]: df.dtypes
In [84]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3391 entries, 0 to 3390
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 incident_id 3391 non-null int64
1 incident_date 3391 non-null object
2 state 3391 non-null object
3 city_or_county 3391 non-null object
4 address 3386 non-null object
5 killed 3391 non-null int64
6 injured 3389 non-null float64
dtypes: float64(1), int64(2), object(4)
memory usage: 185.6+ KB
📄 Dataset Description
1. incident_id: incident ID
2. incident_date: date of the incident
3. state: state where the incident occurred
4. city_or_county: city or county where the incident occurred
5. address: address of the incident
6. killed: number of people killed
7. injured: number of people injured
Business Question
🔍 Think of at least one question that could yield an interesting insight from the data.
Of the 3391 rows in the dataset, is all of the data correct, with no data errors (missing data)? Answer: [Not entirely. The checks in this notebook find 1 duplicated row, which should be dropped (3391 rows become 3390), and 7 missing values: 5 in address and 2 in injured. The other columns have no missing values. Per column:
1. incident_id: complete
2. incident_date: complete
3. state: complete
4. city_or_county: complete
5. address: 5 missing
6. killed: complete
7. injured: 2 missing]
Data Preprocessing
Next, apply several data preprocessing techniques.
DateTime Conversion
Does any column need to be converted to datetime? If so, convert it. Yes: incident_date is stored as object (string) and needs to be converted to datetime.
In [85]: df.dtypes
In [86]: df.head()
      incident_id  incident_date     state         city_or_county  address                       killed  injured
1     2201716      December 31 2021  Mississippi   Gulfport        1200 block of Lewis Ave       3       4.0
2     2201216      December 31 2021  California    Los Angeles     10211 S. Avalon Blvd          0       6.0
3     2200968      December 30 2021  Pennsylvania  Philadelphia    5100 block of Germantown Ave  0       6.0
In [88]: df_conv.dtypes
In [90]: df2.dtypes
In [91]: df
      incident_id  incident_date     state         city_or_county  address                                         killed  injured
1     2201716      December 31 2021  Mississippi   Gulfport        1200 block of Lewis Ave                         3       4.0
2     2201216      December 31 2021  California    Los Angeles     10211 S. Avalon Blvd                            0       6.0
3     2200968      December 30 2021  Pennsylvania  Philadelphia    5100 block of Germantown Ave                    0       6.0
3387  95146        January 11 2014   Mississippi   Jackson         3430 W. Capitol Street                          0       4.0
3389  92704        January 3 2014    New York      Queens          Farmers Boulevard and 133rd Avenue              1       3.0
3390  92194        January 1 2014    Virginia      Norfolk         Rockingham Street and Berkley Avenue Extended   2       2.0
In [92]: ex = pd.Series(["31-12-2021",
"01-01-2014"])
pd.to_datetime(ex, dayfirst=True)
Out[92]: 0 2021-12-31
1 2014-01-01
dtype: datetime64[ns]
Out[93]: 0 2021-12-31
1 2014-01-01
dtype: datetime64[ns]
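The same function can parse the dataset's own date strings, which look like "December 31 2021". The following is a minimal sketch, assuming the format string "%B %d %Y" fits every row; the three sample values are copied from the outputs above, and the frame name `sample` is made up for illustration.

```python
import pandas as pd

# Three date strings copied from the notebook output; `sample` is a hypothetical name.
sample = pd.DataFrame({
    "incident_date": ["December 31 2021", "January 11 2014", "January 3 2014"],
})

# %B = full month name, %d = day of month, %Y = four-digit year.
sample["incident_date"] = pd.to_datetime(sample["incident_date"], format="%B %d %Y")

print(sample.dtypes)
print(sample)
```

Passing an explicit format makes the parse fail loudly on a row that does not match, instead of guessing the day and month order.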
             incident_date  state     city_or_county  address                                         killed  injured
incident_id
92704        2014-01-03     New York  Queens          Farmers Boulevard and 133rd Avenue              1       3.0
92194        2014-01-01     Virginia  Norfolk         Rockingham Street and Berkley Avenue Extended   2       2.0
In [95]: df_uncl.isna()
In [96]: df_uncl.isna().sum()
Out[96]: incident_date 0
state 0
city_or_county 0
address 5
killed 0
injured 2
dtype: int64
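To see which rows are behind counts like these, the same isna() mask can be used as a row filter. This is a small sketch on made-up rows: the frame `toy` and the way its values are combined are illustrative, not records from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; `toy` is a hypothetical stand-in for df_uncl.
toy = pd.DataFrame({
    "address": ["3430 W. Capitol Street", None, "10211 S. Avalon Blvd"],
    "killed": [0, 1, 0],
    "injured": [4.0, 3.0, np.nan],
})

# Missing values per column, as in the cell above.
missing_per_column = toy.isna().sum()

# Rows that have at least one missing value in any column.
rows_with_gaps = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```

Whether such rows are dropped or filled depends on the analysis; a missing address does not affect counts of killed or injured, for example.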
In [98]: df_uncl[df_uncl["incident_date"].notna()]
             incident_date  state     city_or_county  address                                         killed  injured
incident_id
92704        2014-01-03     New York  Queens          Farmers Boulevard and 133rd Avenue              1       3.0
92194        2014-01-01     Virginia  Norfolk         Rockingham Street and Berkley Avenue Extended   2       2.0
In [99]: df_uncl.isnull()
Checking Duplicates
Does the data contain duplicate rows? Yes, there is 1 duplicate row: one record repeats another exactly.
Out[100]: 1
In [101… df_uncl.drop_duplicates()
             incident_date  state     city_or_county  address                                         killed  injured
incident_id
92704        2014-01-03     New York  Queens          Farmers Boulevard and 133rd Avenue              1       3.0
92194        2014-01-01     Virginia  Norfolk         Rockingham Street and Berkley Avenue Extended   2       2.0
In [102… df_uncl.head()
In [103… df_uncl.drop_duplicates(keep="first").head()
In [104… df_uncl.drop_duplicates(keep="last").head()
Answer: [They should be dropped, because rows whose values are identical in every column add no information.]
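One detail worth keeping in mind when dropping duplicates: drop_duplicates() returns a new DataFrame and leaves the original untouched unless the result is assigned. The sketch below shows this on made-up rows; `toy` and `deduped` are hypothetical names.

```python
import pandas as pd

# Illustrative rows only: the second row repeats the first exactly.
toy = pd.DataFrame({
    "state": ["Virginia", "Virginia", "New York"],
    "killed": [2, 2, 1],
})

# duplicated() marks rows that repeat an earlier row; the sum is the duplicate count.
n_dupes = toy.duplicated().sum()

# The result must be assigned; `toy` itself still holds all three rows.
deduped = toy.drop_duplicates(keep="first")

print(n_dupes, len(toy), len(deduped))
```

With keep="first" the earliest copy survives and with keep="last" the latest one does. For exact duplicates the surviving values are the same either way; only the retained index label differs.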
Feature Engineering
Apply feature engineering if needed
In [105… df2
      incident_id  incident_date  state     city_or_county  address                                         killed  injured
3389  92704        2014-01-03     New York  Queens          Farmers Boulevard and 133rd Avenue              1       3.0
3390  92194        2014-01-01     Virginia  Norfolk         Rockingham Street and Berkley Avenue Extended   2       2.0
In [106… df2["incident_date"].dt.day
Out[106]: 0 31
1 31
2 31
3 30
4 30
..
3386 12
3387 11
3388 5
3389 3
3390 1
Name: incident_date, Length: 3391, dtype: int64
In [107… df2["incident_date"].dt.month_name()
Out[107]: 0 December
1 December
2 December
3 December
4 December
...
3386 January
3387 January
3388 January
3389 January
3390 January
Name: incident_date, Length: 3391, dtype: object
In [108… df2["incident_date"].dt.day_name()
Out[108]: 0 Friday
1 Friday
2 Friday
3 Thursday
4 Thursday
...
3386 Sunday
3387 Saturday
3388 Sunday
3389 Friday
3390 Wednesday
Name: incident_date, Length: 3391, dtype: object
In [109… df2.head()
In [110… df2['incident_date'].dt.to_period('D')
Out[110]: 0 2021-12-31
1 2021-12-31
2 2021-12-31
3 2021-12-30
4 2021-12-30
...
3386 2014-01-12
3387 2014-01-11
3388 2014-01-05
3389 2014-01-03
3390 2014-01-01
Name: incident_date, Length: 3391, dtype: period[D]
In [111… df2['incident_date'].dt.to_period('M')
Out[111]: 0 2021-12
1 2021-12
2 2021-12
3 2021-12
4 2021-12
...
3386 2014-01
3387 2014-01
3388 2014-01
3389 2014-01
3390 2014-01
Name: incident_date, Length: 3391, dtype: period[M]
In [112… df2['incident_date'].dt.to_period('Q')
Out[112]: 0 2021Q4
1 2021Q4
2 2021Q4
3 2021Q4
4 2021Q4
...
3386 2014Q1
3387 2014Q1
3388 2014Q1
3389 2014Q1
3390 2014Q1
Name: incident_date, Length: 3391, dtype: period[Q-DEC]
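The accessors above only display their results. To use them in later analysis they can be stored as new columns. The following is a sketch, assuming the column names `year`, `month`, `day_name` and `quarter` (they are not in the original notebook); the three dates are taken from the outputs above.

```python
import pandas as pd

# Three dates that appear in the outputs above; `toy` is a hypothetical stand-in for df2.
toy = pd.DataFrame({
    "incident_date": pd.to_datetime(["2021-12-31", "2014-01-11", "2014-01-01"]),
})

# Each derived feature becomes its own column (the column names are assumptions).
toy["year"] = toy["incident_date"].dt.year
toy["month"] = toy["incident_date"].dt.month_name()
toy["day_name"] = toy["incident_date"].dt.day_name()
toy["quarter"] = toy["incident_date"].dt.to_period("Q").astype(str)

print(toy)
```

Columns like these make it straightforward to group incidents by weekday, month, or quarter afterwards.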
Category Conversion
Are there any columns that you think should be converted to the category data type? If so, convert them here.
In [114… df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3391 entries, 0 to 3390
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 incident_id 3391 non-null int64
1 incident_date 3391 non-null object
2 state 3391 non-null object
3 city_or_county 3391 non-null object
4 address 3386 non-null object
5 killed 3391 non-null int64
6 injured 3389 non-null float64
dtypes: float64(1), int64(2), object(4)
memory usage: 185.6+ KB
In [115… df2.nunique()
In [116… df2["state"].unique()
In [117… df2.nunique()
In [118… df_conv.dtypes
In [121… df2.dtypes
In [122… df2
      incident_id  incident_date     state         city_or_county  address                                         killed  injured
1     2201716      December 31 2021  Mississippi   Gulfport        1200 block of Lewis Ave                         3       4.0
2     2201216      December 31 2021  California    Los Angeles     10211 S. Avalon Blvd                            0       6.0
3     2200968      December 30 2021  Pennsylvania  Philadelphia    5100 block of Germantown Ave                    0       6.0
3387  95146        January 11 2014   Mississippi   Jackson         3430 W. Capitol Street                          0       4.0
3389  92704        January 3 2014    New York      Queens          Farmers Boulevard and 133rd Avenue              1       3.0
3390  92194        January 1 2014    Virginia      Norfolk         Rockingham Street and Berkley Avenue Extended   2       2.0
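The category conversion asked for in this section can be sketched as follows. It assumes that state is the column worth converting, since a column of US state names has few distinct values relative to the number of rows; the rows below reuse state and city names from the outputs above, and `toy` is a hypothetical name.

```python
import pandas as pd

# Illustrative rows only; `toy` is a hypothetical stand-in for df2.
toy = pd.DataFrame({
    "state": ["Mississippi", "California", "Pennsylvania", "Mississippi"],
    "city_or_county": ["Gulfport", "Los Angeles", "Philadelphia", "Jackson"],
})

# A column with few distinct values relative to its length suits the category dtype.
toy["state"] = toy["state"].astype("category")

print(toy.dtypes)
print(toy["state"].cat.categories)
```

Comparing nunique() with the row count, as the cells above do, is a reasonable way to pick candidates: address is nearly unique per row and gains little from conversion.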
Your Analysis
🔻 Analyze the data based on the business question you formulated, using the methods you have learned. Write down the insights you obtain
On which day did the most killings occur? [Number of incidents recorded per date:
July 5 2020       15
July 4 2021       11
May 23 2020        9
June 20 2020       9
June 20 2021       8
                  ..
April 4 2018       1
April 2 2018       1
March 31 2018      1
March 24 2018      1
January 1 2014     1
Name: incident_date, Length: 1747, dtype: int64]
Insight:
The date with the most recorded incidents is July 5 2020 (15 incidents), followed by July 4 2021 (11), May 23 2020 and June 20 2020 (9 each), and June 20 2021 (8). The incidents are spread over 1747 distinct dates. These figures count incidents per date, not the number of people killed.
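Row counts per date measure how many incidents happened, not how many people were killed. The sketch below computes both on made-up rows; the values and the names `toy`, `incidents_per_date` and `killed_per_date` are illustrative only.

```python
import pandas as pd

# Illustrative rows only: two incidents on one date, one deadlier incident on another.
toy = pd.DataFrame({
    "incident_date": pd.to_datetime(["2020-07-05", "2020-07-05", "2021-07-04"]),
    "killed": [0, 1, 3],
})

# Number of incidents (rows) per date.
incidents_per_date = toy["incident_date"].value_counts()

# Number of people killed per date, highest first.
killed_per_date = (
    toy.groupby("incident_date")["killed"].sum().sort_values(ascending=False)
)

busiest_date = incidents_per_date.idxmax()
deadliest_date = killed_per_date.idxmax()

print(busiest_date, deadliest_date)
```

In this toy data the date with the most incidents and the date with the most deaths differ, which is why the two questions need separate aggregations.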