Assignments 1-4 Data Analysis, Talitha Syahda Aguslin (20037061)
Student ID (NIM): 20037061
Importing the Data
import pandas as pd
Data_BreastCancer = "/content/Breast_Cancer.xlsx"
Data = pd.read_excel(Data_BreastCancer)
df = pd.DataFrame(Data)
print(df)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 4024 non-null int64
1 Race 4024 non-null object
2 Marital Status 4024 non-null object
3 T Stage 4024 non-null object
4 N Stage 4024 non-null object
5 6th Stage 4024 non-null object
6 differentiate 4024 non-null object
7 Grade 4024 non-null object
8 A Stage 4024 non-null object
9 Tumor Size 4024 non-null int64
10 Estrogen Status 4024 non-null object
11 Progesterone Status 4024 non-null object
12 Regional Node Examined 4024 non-null int64
13 Reginol Node Positive 4024 non-null int64
14 Survival Months 4024 non-null int64
15 Status 4024 non-null object
dtypes: int64(5), object(11)
memory usage: 503.1+ KB
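# Select the integer-typed columns (the name Data_Float is kept from the original, although these columns are int64)
Data_Float = df.select_dtypes(include=["int64"])
print(Data_Float)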
Survival Months
0 60
1 62
2 75
3 84
4 50
... ...
4019 49
4020 69
4021 69
4022 72
4023 100
datalist = ["Age","Race","Marital Stage","T Stage","N Stage","6th Stage","Grade","A Stage","Tumor Size","Estrogen Status","Progesteron Status
print(datalist)
type(datalist)
['Age', 'Race', 'Marital Stage', 'T Stage', 'N Stage', '6th Stage', 'Grade', 'A Stage', 'Tumor Size', 'Estrogen Status', 'Progesteron S
list
dataset = {"Age","68","Face","White","Age","50"}
print(dataset)
type(dataset)
Marital Stage
str
a = {
"Age" : 65,
"Face" : "White",
"Marital Stage" : "Married",
"Tumor Size" : 4,
"Status" : "Alive"
}
print(a)
type(a)
{'Age': 65, 'Face': 'White', 'Marital Stage': 'Married', 'Tumor Size': 4, 'Status': 'Alive'}
dict
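The cells above build a list, a set, and a dict from the same kind of values. As a minimal sketch of how the three containers differ, the snippet below reuses those values; the variable names as_list, as_set, and as_dict are illustrative and do not appear in the original notebook.

```python
# Values borrowed from the cells above; the variable names are illustrative.
as_list = ["Age", "68", "Face", "White", "Age", "50"]
as_set = set(as_list)  # a set keeps each value once, so the second "Age" is dropped
as_dict = {"Age": 65, "Face": "White", "Status": "Alive"}

print(len(as_list), len(as_set))  # 6 5
print(as_dict["Age"])             # 65, values are looked up by key
print("Status" in as_dict)        # True, membership tests check the keys
```

A list preserves order and duplicates, a set discards duplicates and has no guaranteed order, and a dict maps each key to one value.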
import collections
datacoll = collections.deque(["Married","Married","Divorced","Married","Married","Married"])
datacoll.append("Single")  # add to the right end
print(datacoll)
datacoll.appendleft("separated")  # add to the left end
print(datacoll)
datacoll.pop()  # remove from the right end
print(datacoll)
datacoll.popleft()  # remove from the left end
print(datacoll)
type(datacoll)
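The printed output of the deque cell did not survive in this export. The sketch below repeats the same four operations and captures what each one does; the names statuses, after_push, right, and left are illustrative assumptions, not names from the notebook.

```python
from collections import deque

statuses = deque(["Married", "Married", "Divorced", "Married", "Married", "Married"])
statuses.append("Single")         # push on the right end
statuses.appendleft("separated")  # push on the left end
after_push = list(statuses)       # snapshot with both new items in place

right = statuses.pop()            # removes and returns the rightmost item
left = statuses.popleft()         # removes and returns the leftmost item

print(after_push)
print(right, left)                # Single separated
print(len(statuses))              # 6, back to the original six items
```

A deque supports pushes and pops at both ends in constant time, which is the reason to prefer it over a list when items are added or removed on the left.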
import heapq
T = [4,35,63,18,41,20,8,30,103]
heapq.heapify (T)
heapq.heapreplace (T,2)
print(T)
type(T)
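The heap cell above prints no explanation of what heapreplace does. As a small sketch on the same numbers: heapify reorders the list in place into a min-heap, and heapreplace pops the smallest element and pushes the new one in a single step. The name smallest is an illustrative assumption.

```python
import heapq

T = [4, 35, 63, 18, 41, 20, 8, 30, 103]
heapq.heapify(T)                    # reorder T in place so that T[0] is the minimum
smallest = heapq.heapreplace(T, 2)  # pop the minimum (4), then push 2

print(smallest)    # 4
print(T[0])        # 2, the new minimum sits at index 0
print(sorted(T))   # [2, 8, 18, 20, 30, 35, 41, 63, 103]
```

The heap is still an ordinary Python list, which is why type(T) reports list; only the ordering of its elements follows the heap property.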
list1 = [68,50,58,58,47]
list2 = ["White","White","White","White","White"]
list3 = ["Married","Married","Divorced","Married","Married"]
import pandas as pd
list1 = list1 + [48]
list2 = list2 + ["Black"]
list3 = list3 + ["Divorced"]
dataframe_list = pd.DataFrame(list(zip(list1,list2,list3)),columns = ['Age',"Race","Marital Stage"], index = [1,2,3,4,5,6])
dataframe_list
   Age  Race   Marital Stage
1  68   White  Married
2  50   White  Married
3  58   White  Divorced
4  58   White  Married
5  47   White  Married
6  48   Black  Divorced
tuple1 = ('Regional','Alive')
tuple2 = ('Distant','Dead')
dataframe_tuple = pd.DataFrame(tuple((tuple1,tuple2)), columns = ['a stage','status'])
dataframe_tuple
a stage status
0 Regional Alive
1 Distant Dead
xlsx_file = pd.read_excel("/content/Breast_Cancer.xlsx")
xlsx_file.head()
   Age  Race   Marital Status  T Stage  N Stage  6th Stage  differentiate              Grade  A Stage   Tumor Size  Estrogen Status  Progesterone Status  Regional Node Examined  Reginol Node Positive
0  68   White  Married         T1       N1       IIA        Poorly differentiated      3      Regional  4           Positive         Positive             24                      1
1  50   White  Married         T2       N2       IIIA       Moderately differentiated  2      Regional  35          Positive         Positive             14                      5
2  58   White  Divorced        T3       N3       IIIC       Moderately differentiated  2      Regional  63          Positive         Positive             14                      7
3  58   White  Married         T1       N1       IIA        Poorly differentiated      3      Regional  18          Positive         Positive             2                       1
4  47   White  Married         T2       N1       IIB        Poorly differentiated      3      Regional  41          Positive         Positive             3                       1
4. Creating a DataFrame from a URL
import pandas as pd
download_url=("https://storage.googleapis.com/kagglesdsdata/datasets/2396275/4045493/Breast_Cancer.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Go
df = pd.read_csv(download_url)
type(df)
df
      Age  Race   Marital Status  T Stage  N Stage  6th Stage  differentiate              Grade  A Stage   Tumor Size  Estrogen Status  Progesterone Status  Regional Node Examined  ...
0     68   White  Married         T1       N1       IIA        Poorly differentiated      3      Regional  4           Positive         Positive             24                      ...
1     50   White  Married         T2       N2       IIIA       Moderately differentiated  2      Regional  35          Positive         Positive             14                      ...
2     58   White  Divorced        T3       N3       IIIC       Moderately differentiated  2      Regional  63          Positive         Positive             14                      ...
3     58   White  Married         T1       N1       IIA        Poorly differentiated      3      Regional  18          Positive         Positive             2                       ...
4     47   White  Married         T2       N1       IIB        Poorly differentiated      3      Regional  41          Positive         Positive             3                       ...
...   ...  ...    ...             ...      ...      ...        ...                        ...    ...       ...         ...              ...                  ...                     ...
4019  62   Other  Married         T1       N1       IIA        Moderately differentiated  2      Regional  9           Positive         Positive             1                       ...
4020  56   White  Divorced        T2       N2       IIIA       Moderately differentiated  2      Regional  46          Positive         Positive             14                      ...
4021  68   White  Married         T2       N1       IIB        Moderately differentiated  2      Regional  22          Positive         Negative             11                      ...
4022  58   Black  Divorced        T2       N1       IIB        Moderately differentiated  2      Regional  44          Positive         Positive             11                      ...
4023  46   White  Married         T2       N1       IIB        Moderately differentiated  2      Regional  30          Positive         Positive             7                       ...
csv_file = pd.read_csv("/content/Breast_Cancer.csv")
csv_file.head()
   Age  Race   Marital Status  T Stage  N Stage  6th Stage  differentiate              Grade  A Stage   Tumor Size  Estrogen Status  Progesterone Status  Regional Node Examined  Reginol Node Positive
0  68   White  Married         T1       N1       IIA        Poorly differentiated      3      Regional  4           Positive         Positive             24                      1
1  50   White  Married         T2       N2       IIIA       Moderately differentiated  2      Regional  35          Positive         Positive             14                      5
2  58   White  Divorced        T3       N3       IIIC       Moderately differentiated  2      Regional  63          Positive         Positive             14                      7
3  58   White  Married         T1       N1       IIA        Poorly differentiated      3      Regional  18          Positive         Positive             2                       1
4  47   White  Married         T2       N1       IIB        Poorly differentiated      3      Regional  41          Positive         Positive             3                       1
import numpy as np
array = np.array([['White',68,4],['White',50,35],['White',58,63]])
dataframe_numpy = pd.DataFrame(array,columns = ['Race','Age','Tumor Size'])
dataframe_numpy
   Race   Age  Tumor Size
0  White  68   4
1  White  50   35
2  White  58   63
7. Creating a DataFrame from pandas Series
series1 = pd.Series(['Married','Married','Divorced'])
series2 = pd.Series([3,2,2])
dataframe_series = pd.DataFrame({'Marital Status':series1, 'grade':series2})
dataframe_series
   Marital Status  grade
0  Married         3
1  Married         2
2  Divorced        2
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
import numpy as np
titanic.describe()
titanic.describe(exclude = np.number)
        sex  embarked  class  who  adult_male  deck  embark_town  alive  alone
count   891  889       891    891  891         203   889          891    891
unique  2    3         3      3    2           7     3            2      2
Data Exploration
a. Viewing the first 10 rows and the last 5 rows
titanic.tail()
     survived  pclass  sex     age   sibsp  parch  fare     embarked  class   who    adult_male  deck  embark_town  alive  alone
886  0         2       male    27.0  0      0      13.00    S         Second  man    True        NaN   Southampton  no     True
887  1         1       female  19.0  0      0      30.00    S         First   woman  False       B     Southampton  yes    True
888  0         3       female  NaN   1      2      23.45    S         Third   woman  False       NaN   Southampton  no     False
889  1         1       male    26.0  0      0      30.00    C         First   man    True        C     Cherbourg    yes    True
890  0         3       male    32.0  0      0      7.75     Q         Third   man    True        NaN   Queenstown   no     True
titanic.head(10)
     survived  pclass  sex     age   sibsp  parch  fare     embarked  class   who    adult_male  deck  embark_town  alive  alone
2    1         3       female  26.0  0      0      7.9250   S         Third   woman  False       NaN   Southampton  yes    True
8    1         3       female  27.0  0      2      11.1333  S         Third   woman  False       NaN   Southampton  yes    False
9    1         2       female  14.0  1      0      30.0708  C         Second  child  False       NaN   Cherbourg    yes    False
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 889 non-null object
8 class 891 non-null category
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 203 non-null category
12 embark_town 889 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
titanic.describe()
titanic.describe().T
titanic.size
9801
titanic.shape
(891, 11)
titanic.who.value_counts()
man 537
woman 271
child 83
Name: who, dtype: int64
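d. Variance-covariance matrix
# Restrict to numeric columns first; covariance and correlation are not defined for text columns
titanic.select_dtypes(include='number').cov().style.background_gradient(cmap='coolwarm')
titanic.select_dtypes(include='number').corr().style.background_gradient(cmap='coolwarm')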
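To make the covariance and correlation step concrete, here is a minimal sketch on a small made-up table. The frame toy and its values are assumptions for illustration only, not rows from the Titanic data.

```python
import pandas as pd

# Made-up values; only the column names echo the Titanic dataset.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "female"],
})

numeric = toy.select_dtypes(include="number")  # keeps age and fare, drops sex
cov = numeric.cov()    # diagonal entries are the sample variances
corr = numeric.corr()  # diagonal entries are 1, off-diagonal entries lie in [-1, 1]

print(cov)
print(corr)
```

Both matrices are symmetric, and the diagonal of the covariance matrix matches what numeric.var() returns for each column.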
persentase_data_kosong = titanic.isna().sum()*100/len(titanic)
nilaikosong_titanic = pd.DataFrame({'Persentase Data Kosong' : persentase_data_kosong})
nilaikosong_titanic
survived 0.0
pclass 0.0
sex 0.0
sibsp 0.0
parch 0.0
fare 0.0
class 0.0
who 0.0
adult_male 0.0
alive 0.0
alone 0.0
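The percentage-of-missing-values calculation above can be checked by hand on a tiny table. The sketch below uses a made-up frame named toy (an assumption, not notebook data) and also contrasts dropping rows with dropping columns.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks a missing entry.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "deck": [np.nan, "C", np.nan, np.nan],
})

pct_missing = toy.isna().sum() * 100 / len(toy)  # per-column share of missing values
print(pct_missing)

rows_kept = toy.dropna()        # keep only rows with no missing value
cols_kept = toy.dropna(axis=1)  # keep only columns with no missing value
print(rows_kept.shape, cols_kept.shape)  # (1, 3) (4, 1)
```

Dropping rows keeps every column but can discard most of the data when one column is sparse, while dropping columns keeps every row at the cost of losing variables.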
titanic=titanic.dropna()
titanic
   survived  pclass  sex     age   sibsp  parch  fare     embarked  class  who    adult_male  deck  embark_town  alive  alone
1  1         1       female  38.0  1      0      71.2833  C         First  woman  False       C     Cherbourg    yes    False
182 rows × 15 columns
h. Visualizing the standard deviation and variance
<matplotlib.lines.Line2D at 0x7f5e95ab93a0>
i. Visualizing quartiles
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head(15)
   survived  pclass  sex     age   sibsp  parch  fare     embarked  class   who    adult_male  deck  embark_town  alive  alone
8  1         3       female  27.0  0      2      11.1333  S         Third   woman  False       NaN   Southampton  yes    False
9  1         2       female  14.0  1      0      30.0708  C         Second  child  False       NaN   Cherbourg    yes    False
Data Visualization
1. Scatter Plot
scatter_plot = titanic.plot.scatter(x = 'age', y = 'fare', c = 'green', title = 'Titanic Dataset', s = 30, figsize=(5,5))
Changing the Background Color
scatter_plot = titanic.plot.scatter(x = 'age', y = 'fare', c = 'green', title = 'Titanic Dataset', s = 30, figsize=(5,5))
scatter_plot.set_facecolor('plum')
2. Line Chart
titanic_plot_line = titanic['age'].plot.line(color = ('pink'), title = 'Titanic Dataset', subplots =True, figsize=(8,8), layout=(2,2))
titanic_plot_line
array([[<AxesSubplot:>, <AxesSubplot:>],
[<AxesSubplot:>, <AxesSubplot:>]], dtype=object)
3. Histogram
array([[<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>],
[<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>],
[<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>],
[<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>,
<AxesSubplot:ylabel='Frequency'>]], dtype=object)
<AxesSubplot:>
5. Area (Numeric)
array([<AxesSubplot:>], dtype=object)
titanic['fare'].plot.box()
<AxesSubplot:>
titanic.boxplot(fontsize=8)
<AxesSubplot:>
8. Pie Chart
<AxesSubplot:ylabel='who'>
1. Scatter Plot
x1 = [2,3,4]
y1 = [5,5,5]
x2 =[1,2,3,4,5]
y2 =[2,3,2,3,4]
y3 =[6,8,7,8,7]
plt.scatter(x1,y1)
plt.scatter(x2,y2,marker='v',color='r')
plt.scatter(x2,y3,marker='^',color='m')
plt.show()
2. Line Chart
x =[1,2,3,4,5,6,7,8,9]
y1=[1,3,5,3,1,3,5,3,1]
y2=[2,4,6,4,2,4,6,4,2]
plt.plot(x,y1,label='line L')
plt.plot(x,y2,label='line H')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.title('Line Graph Example')
plt.legend()
plt.show()
3. Histogram
4. Bar Chart
5. Stack Plot
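The code cells for the histogram, bar chart, and stack plot did not survive in this export. The following is only a sketch of what such cells could look like with matplotlib, not a reconstruction of the original cells; all data values and variable names are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]  # made-up sample
fig, (ax_hist, ax_bar, ax_stack) = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: counts of values falling into 5 equal-width bins
counts, edges, _ = ax_hist.hist(ages, bins=5)
ax_hist.set_title("Histogram")

# Bar chart: one bar per category
bars = ax_bar.bar(["S1", "S2", "S3"], [56, 66, 24])
ax_bar.set_title("Bar Chart")

# Stack plot: the second series is drawn on top of the first
days = [1, 2, 3, 4, 5]
layers = ax_stack.stackplot(days, [1, 2, 3, 2, 1], [2, 2, 2, 3, 3], labels=["A", "B"])
ax_stack.legend(loc="upper left")
ax_stack.set_title("Stack Plot")

fig.tight_layout()
```

In a notebook, the matplotlib.use("Agg") line can be dropped so that the figure renders inline.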
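6. Pie Chart
labels = 'S1','S2','S3'
sections = [56,66,24]
colors=['c','y','m']
# Draw the pie; this call was missing, so the figure would otherwise be empty
plt.pie(sections, labels=labels, colors=colors)
plt.axis('equal')
plt.title('Pie Chart Example')
plt.show()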
7. 3D Scatter Plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
X = titanic['age']
Y = titanic['fare']
ax.scatter(X,Y, c='r', marker='<')
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
plt.show()
8. 3D Bar Plot
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = np.random.randint(10, size=10)
z = np.zeros(10)
dx = np.ones(10)
dy = np.ones(10)
dz = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Draw the bars; without this call the axes stay empty
ax.bar3d(x, y, z, dx, dy, dz)
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
ax.set_zlabel('z axis')
plt.title('3D Bar Chart Example')
plt.tight_layout()
plt.show()
9. Wireframe Plot
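from mpl_toolkits.mplot3d import axes3d
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x,y,z = axes3d.get_test_data()
# Draw the wireframe surface from the sample data
ax.plot_wireframe(x, y, z)
plt.title("Wireframe plot")
plt.tight_layout()
plt.show()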
ax.view_init(-140, 60)
plt.show()
1. Displot
<seaborn.axisgrid.FacetGrid at 0x7f13149d1880>
2. Scatter Plot
df = sns.load_dataset('titanic')
sns.regplot(x=df['age'], y=df['fare'], color='indigo', marker='x')
<AxesSubplot:xlabel='age', ylabel='fare'>
3. Box Plot
sns.boxplot(x=titanic['age'], color='forestgreen')
4. Correlation Matrix
<seaborn.axisgrid.JointGrid at 0x7f13140233d0>
6. Violin Plot
<AxesSubplot:xlabel='who', ylabel='age'>
7. Pair Plot
iris = sns.load_dataset('iris')
sns.pairplot(hue='species',data=iris)
<seaborn.axisgrid.PairGrid at 0x7fa426dcad00>