Reading data
In [28]: df_gender = pd.read_csv('data/prepared/Survey_on_Gender_Equality_At_Home/Survey_on_Gender_Equality_At_Home_2_c
localhost:8888/nbconvert/html/Analysis.ipynb?download=false 1/13
8/27/2021 Analysis
Out[29]:
   Year  Region         Country                   Internet_Penetration  Gender    a1_agree  a1_neutral  a1_disagree  a2_opps_other  a3_yes  ...  d6_wait  d7_bore
0  2020  North America  Canada                    95 to 100%            Female    93        1           5            7.40           54      ...  33       3
1  2020  North America  Canada                    95 to 100%            Male      92        2           5            7.62           59      ...  31       3
2  2020  North America  Canada                    95 to 100%            Combined  93        2           5            7.55           56      ...  32       3
3  2020  North America  United States of America  70 to 75%             Female    90        3           7            7.42           50      ...  36       3
4  2020  North America  United States of America  70 to 75%             Male      89        4           8            7.88           64      ...  28       2
5 rows × 92 columns
The column names are not self-explanatory, so we rename them and filter the data to keep only the variables that are
of interest here.
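In [30]: # Keep only the columns of interest; .copy() avoids renaming on a view of the original frame
         df = df[['Country','Gender','Internet_Penetration','a1_agree','b7_full']].copy()
         df.rename(columns={'a1_agree':'Equal_Rights','b7_full':'Access_Money'},inplace=True)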
In [31]: print(df['Internet_Penetration'].unique())
['95 to 100%' '70 to 75%' '75 to 80%' '40 to 45%' '45 to 50%' '35 to 40%'
'80 to 85%' '85 to 90%' '15 to 20%' '65 to 70%' '5% or less' '60 to 65%'
'.' '10 to 15%' '5 to 10%' '50 to 55%' '55 to 60%' '25 to 30%'
'20 to 25%' '90 to 95%' '30 to 35%' '50 to 65%']
This column needs some cleaning. We will remove the dot (which corresponds to regions, not countries), keep the first number of each
label, and convert the column to a numeric type.
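In [32]: # Drop the '.' rows, which are regions rather than countries
         df = df[df['Internet_Penetration']!='.']
         # Take the first two characters when the second one is a digit ('95 to 100%' -> 95),
         # otherwise only the first character ('5 to 10%' or '5% or less' -> 5)
         df['Internet_Penetration'] = [float(df.loc[i,'Internet_Penetration'][:2])
                                       if df.loc[i,'Internet_Penetration'][1].isalnum()
                                       else float(df.loc[i,'Internet_Penetration'][:1]) for i in df.index]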
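The same cleaning step can also be written with vectorized string methods. The following is a minimal sketch, not the notebook's own code: the `labels` series is a hypothetical sample built from a few of the unique values printed above, and the regular expression simply captures the leading digits of each label.

```python
import pandas as pd

# Hypothetical sample of labels, taken from the unique values printed above
labels = pd.Series(['95 to 100%', '5% or less', '.', '10 to 15%', '5 to 10%'])

# Drop the '.' placeholder (regions, not countries)
kept = labels[labels != '.']

# Capture the leading digits of each label and convert them to float
first_value = kept.str.extract(r'^(\d+)', expand=False).astype(float)

print(first_value.tolist())
```

Note that for a label such as '5% or less' the first number is the upper end of the range, so the result is only a rough numeric proxy for the category.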
In [33]: print(df['Equal_Rights'].unique())
[93 92 90 89 91 94 84 76 80 82 79 86 87 95 81 78 83 88 97 96 49 65 69 71
75 61 73 68 74 85 67 77 64 19 72 70 66]
There is an outlier (19). Before removing it, we can check which country it corresponds to.
In [34]: df[df['Equal_Rights']==19]['Country']
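The removal itself is not shown in this export. A minimal sketch of how the outlier could be inspected and then dropped, assuming a frame with the `Country` and `Equal_Rights` columns used above (the rows and country names here are placeholders, not survey data):

```python
import pandas as pd

# Hypothetical miniature of the survey table; country names are placeholders
sample = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Equal_Rights': [93, 19, 88],
})

# Boolean mask marking the outlier rows
is_outlier = sample['Equal_Rights'] == 19

# Look at the affected countries first, then keep everything else
outlier_countries = sample.loc[is_outlier, 'Country'].tolist()
cleaned = sample.loc[~is_outlier].reset_index(drop=True)

print(outlier_countries)
print(cleaned)
```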
Analysis
Now we will plot three relations: the first between Internet penetration and the perception of equal rights, the second between Internet
penetration and full access to money, and the third between the perception of equal rights and full access to money. For each pair we
show a scatter plot and a linear regression to see the tendency of the relation between the variables.
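The plotting cells are not visible in this export, so the following is only a sketch of one way to draw a scatter plot with a fitted line. The helper names `fit_line` and `plot_relation` and the sample values are hypothetical; the fit uses an ordinary least-squares line from `numpy.polyfit`.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the notebook would use the cleaned survey columns
sample = pd.DataFrame({
    'Internet_Penetration': [10.0, 20.0, 30.0, 40.0],
    'Equal_Rights': [72.0, 74.0, 76.0, 78.0],
})

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)

def plot_relation(data, x_col, y_col):
    """Scatter plot of two columns with the fitted regression line on top."""
    import matplotlib.pyplot as plt  # imported here so fit_line works without matplotlib
    slope, intercept = fit_line(data[x_col], data[y_col])
    xs = np.linspace(data[x_col].min(), data[x_col].max(), 50)
    fig, ax = plt.subplots()
    ax.scatter(data[x_col], data[y_col])
    ax.plot(xs, slope * xs + intercept)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    return ax

slope, intercept = fit_line(sample['Internet_Penetration'], sample['Equal_Rights'])
print(slope, intercept)
```

Calling `plot_relation(sample, 'Internet_Penetration', 'Equal_Rights')` once per pair of variables would give the three plots described above.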
There appears to be a positive relation between each pair of the three variables. We can test this statistically with a correlation test,
which gives a measure of the strength of the relation between the variables.
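The correlation cell is not visible in this export. As a minimal sketch, assuming the three renamed columns, the pairwise Pearson coefficients can be obtained with `DataFrame.corr`; the numbers below are made up for illustration.

```python
import pandas as pd

# Hypothetical values standing in for the cleaned survey columns
sample = pd.DataFrame({
    'Internet_Penetration': [10.0, 20.0, 30.0, 40.0, 50.0],
    'Equal_Rights': [70.0, 75.0, 74.0, 82.0, 85.0],
    'Access_Money': [30.0, 35.0, 33.0, 45.0, 50.0],
})

# Pairwise Pearson correlation coefficients (values between -1 and 1)
corr_matrix = sample.corr(method='pearson')

# Coefficient for a single pair of variables
r = corr_matrix.loc['Internet_Penetration', 'Equal_Rights']

print(corr_matrix.round(2))
```

`DataFrame.corr` returns only the coefficients. If a p-value is also wanted, `scipy.stats.pearsonr` is one option.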
In [38]: df_gender = df
In [40]: df.head()
First, the variable "statistic" is the number of responses in each category. To calculate percentages, this column is divided by
"total_asked" and multiplied by 100.
The countries are identified by their alpha-2 codes, so to interpret them we need to match each code with the name of the corresponding
country. To do that, we downloaded a CSV file with the alpha-2 codes and their corresponding countries from
https://www.iban.com/country-codes. Let us open this list.
   Country      Alpha-2 code  Alpha-3 code  Numeric
0  Afghanistan  AF            AFG           4
2  Albania      AL            ALB           8
3  Algeria      DZ            DZA           12
There are more variables here; we will take only the relevant ones, "Country" and "Alpha-2 code".
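A minimal sketch of this matching step, assuming the survey stores the code in a column named `country` (as the later `drop` call suggests) and the code list has the columns "Country" and "Alpha-2 code". The survey rows and the `Alpha-3 code` and `Numeric` column names are illustrative assumptions.

```python
import pandas as pd

# Code list, built inline here instead of read from the downloaded CSV
codes = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria'],
    'Alpha-2 code': ['AF', 'AL', 'DZ'],
    'Alpha-3 code': ['AFG', 'ALB', 'DZA'],
    'Numeric': [4, 8, 12],
})

# Hypothetical survey rows keyed by alpha-2 code
survey = pd.DataFrame({'country': ['AL', 'DZ', 'AL'], 'statistic': [10, 20, 30]})

# Keep only the two relevant columns of the code list
codes = codes[['Country', 'Alpha-2 code']]

# A left merge keeps every survey row and attaches the country name
named = survey.merge(codes, left_on='country', right_on='Alpha-2 code', how='left')

print(named)
```

When the code list is loaded with `pd.read_csv`, the code "NA" (Namibia) is parsed as a missing value by default; passing `keep_default_na=False` is one way to avoid that.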
We will take only the variables of interest: "gen_opn_1_text", "gen_opn_2_text" and "own_fem_text". The process differs from before
because here the variables are listed as values in a column. First we drop the columns that we no longer need, such as "country",
"who_was_asked" and "total_asked"; then we keep only the rows whose "variable" column matches the variables we are interested in,
and rename these variables.
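In [45]: # Drop the columns that are no longer needed
         df = df.drop(columns=['country','who_was_asked','total_asked'])
         # Keep only the rows for the chosen variables; .copy() makes later in-place edits safe
         df = df[df['variable'].isin(['gen_opn_1_text','own_fem_text'])].copy()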
In [46]: df.replace(['gen_opn_1_text','own_fem_text'],['Self_perp_equal_rights','Owner_female'],inplace=True)
Now we can take a look at the values that the chosen variables take.
In [47]: print(df['value'].unique())
In [48]: df.dropna(inplace=True)
df.reset_index(drop=True,inplace=True)
In [49]: df
The table in this form is too difficult to read. We will create new variables that combine the variables and their values, to obtain a structure
Country
United Kingdom of Great Britain and Northern Ireland (the)  24.37  3.05  56.35  3.55
United States of America (the)                              28.26  1.09  52.90  4.35
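A wide table like the one above, with one row per country, can be obtained by combining each variable with its value and pivoting. This is a minimal sketch under assumptions: the `percentage` and `var_value` column names, the answer labels and the numbers are hypothetical, while `variable`, `value` and the two variable names come from the cells above.

```python
import pandas as pd

# Hypothetical long-format table: one row per country, variable and answer
long_df = pd.DataFrame({
    'Country': ['X', 'X', 'Y', 'Y'],
    'variable': ['Self_perp_equal_rights', 'Owner_female',
                 'Self_perp_equal_rights', 'Owner_female'],
    'value': ['Agree', 'Half', 'Agree', 'Half'],
    'percentage': [60.0, 10.0, 55.0, 12.0],
})

# Combine the variable and its answer into a single label
long_df['var_value'] = long_df['variable'] + '_' + long_df['value']

# One row per country, one column per combined label
wide = long_df.pivot(index='Country', columns='var_value', values='percentage')

print(wide)
```

`pivot` requires each (Country, label) pair to appear only once; if duplicates are possible, `pivot_table` with an explicit aggregation is the usual alternative.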
That is a lot of variables for a simple analysis, so we will group them into only four categories. The opinions about equal rights
will be grouped into "agree" (containing "agree" and "strongly agree") and "disagree" (containing "disagree" and "strongly disagree"),
together with "Female_owner_Half_or_more", which is the percentage of respondents who said that half or more of businesses are owned by women.
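The grouping amounts to summing the percentage columns that belong to the same category. A minimal sketch, in which every column name except `Female_owner_Half_or_more` and all the numbers are hypothetical:

```python
import pandas as pd

# Hypothetical wide table of percentages, one row per country
wide = pd.DataFrame({
    'equal_rights_Strongly agree': [50.0, 40.0],
    'equal_rights_Agree': [30.0, 35.0],
    'equal_rights_Disagree': [10.0, 15.0],
    'equal_rights_Strongly disagree': [5.0, 5.0],
    'owner_female_Half': [8.0, 6.0],
    'owner_female_More than half': [2.0, 3.0],
}, index=pd.Index(['X', 'Y'], name='Country'))

# Sum the columns that fall into each grouped category
grouped = pd.DataFrame({
    'Equal_rights_agree': wide[['equal_rights_Strongly agree',
                                'equal_rights_Agree']].sum(axis=1),
    'Equal_rights_disagree': wide[['equal_rights_Disagree',
                                   'equal_rights_Strongly disagree']].sum(axis=1),
    'Female_owner_Half_or_more': wide[['owner_female_Half',
                                       'owner_female_More than half']].sum(axis=1),
})

print(grouped)
```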
Finally, we can see the relation between the variables in the following scatter and regression plot.
Now we can see how to use more than one survey at a time. We can merge these databases using the country as the key variable.
236  South Africa  Female  50.0  88  19  77.93  9.01
82 rows × 9 columns
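The merge cell is not visible in this export. The sketch below shows one way to join two such tables on the country name, with made-up numbers and a hypothetical `df_business` frame. It also strips the trailing "(the)" that appears in the country names of the second table above, because names that differ between the two surveys would otherwise be dropped by an inner merge.

```python
import pandas as pd

# Hypothetical rows from the gender-equality survey
df_gender = pd.DataFrame({
    'Country': ['Canada', 'United States of America'],
    'Access_Money': [60, 55],
})

# Hypothetical rows from the business survey; note the trailing "(the)"
df_business = pd.DataFrame({
    'Country': ['Canada', 'United States of America (the)', 'France'],
    'Female_owner_Half_or_more': [12.0, 10.0, 11.0],
})

# Harmonize the country names before merging
df_business['Country'] = df_business['Country'].str.replace(
    r'\s*\(the\)$', '', regex=True)

# An inner merge keeps only countries present in both surveys
merged = df_gender.merge(df_business, on='Country', how='inner')

print(merged)
```

Passing `indicator=True` to `merge` with `how='outer'` is a convenient way to list the countries that fail to match.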
Last, let's see the relation between "Access_Money" and "Female_owner_Half_or_more", and between "Internet_Penetration" and
"Female_owner_Half_or_more". Remember that these variables originally came from different surveys.
plt.show()
Conclusions
From the survey on gender equality we can conclude that Internet penetration is positively correlated with the
perception of equality and with women's full access to household funds. The perception of equal
rights also has a positive relation with access to funds.
In the survey about the future of business, the percentage of people who think that
women and men should have equal rights is positively correlated with the percentage of people who said that
half or more of businesses are owned by women.
We were able to merge the two databases and compare variables from both of them, and we saw a
positive relation of both full access to household funds and Internet penetration with a larger share of female
business owners.
This is a very simple example of the information that can be extracted from these two surveys using data analysis
techniques and computational tools. This kind of information shows the importance of having access to good,
organized and meaningful data.