
Data analysis

get the data – from your simulated experiments


import pandas as pd

aloha = pd.read_csv('aloha.csv')
explore the data
You can view the contents of the data frame by simply entering the name of the variable (aloha).
Alternatively, you can use the head() method of the data frame to view just the first few lines.
aloha.head()

The complementary tail() method shows the last few lines. There is also an iloc method that
we use in places in this tutorial to show rows from the middle of the data frame. It accepts a range:
aloha.iloc[20:30] selects the 10 rows starting at row 20, aloha.iloc[:5] is like head(), and
aloha.iloc[-5:] is like tail().

aloha.iloc[1200:1205]
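For instance, the tail() method and the equivalent negative iloc range mentioned above can be tried
directly (these calls only read from the data frame):

aloha.tail()        # last five rows
aloha.iloc[-5:]     # the same rows selected with iloc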
Hint: If you are in the terminal and you find that the data frame printout does not make use of the
whole width of the terminal, you can increase the display width for better readability with the
following commands:
pd.set_option('display.width', 180)
pd.set_option('display.max_colwidth', 100)

You can refer to a column as a whole with the array index syntax: aloha['run']. Alternatively,
the more convenient member access syntax (aloha.run) can also be used, with some restrictions:
the column name must be a valid Python identifier, and must not collide with existing methods
of the data frame. (Names that are known to cause trouble include name, min, max and mean.)

aloha.run.head() # .head() is for limiting the output to 5 lines here


Selecting multiple columns is also possible, one just needs to use a list of column names as index.
The result will be another data frame. (The double brackets in the command are due to the fact that
both the array indexing and the list syntax use square brackets.)
tmp = aloha[['run', 'attrname', 'attrvalue']]
tmp.head()

The describe() method can be used to get an idea about the contents of a column. When
applied to a non-numeric column, it prints the number of non-null elements in it (count), the
number of unique values (unique), the most frequently occurring value (top) and its multiplicity
(freq), and the inferred data type (more about that later.)

aloha.module.describe()
You can get a list of the unique values using the unique() method. For example, the following
command lists the names of modules that have recorded any statistics:
aloha.module.unique()
When you apply describe() to a numeric column, you get a statistical summary with things like
mean, standard deviation, minimum, maximum, and various quantiles.
aloha.value.describe()
Applying describe() to the whole data frame creates a similar report about all numeric
columns.
aloha.describe()
Let's spend a minute on data types and column data types. Every column has a data type
(abbreviated dtype) that determines what type of values it may contain. Column dtypes can be
printed with dtypes:
aloha.dtypes
The two most commonly used dtypes are float64 and object. A float64 column contains floating-
point numbers, and missing values are represented with NaNs. An object column may contain
basically anything -- usually strings, but we'll also have NumPy arrays (np.ndarray) as elements
in this tutorial. Numeric values and booleans may also occur in an object column. Missing values in
an object column are usually represented with None, but Pandas also interprets the floating-point
NaN like that. Some degree of confusion arises from the fact that some Pandas functions check the
column's dtype, while others are already happy if the contained elements are of the required type.
To clarify: applying describe() to a column prints a type inferred from the individual elements,
not the column dtype. The column dtype can be changed with the astype() method; we'll
see an example of using it later in this tutorial.
The column dtype can be accessed as the dtype property of a column, for example
aloha.stddev.dtype yields dtype('float64'). There are also convenience functions
such as is_numeric_dtype() and is_string_dtype() for checking column dtype. (They
need to be imported from the pandas.api.types package though.)
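As a minimal sketch of the dtype machinery described above (it uses only columns already shown in
this tutorial; the astype() call converts a copy of the value column purely for illustration):

from pandas.api.types import is_numeric_dtype, is_string_dtype

print(aloha.stddev.dtype)               # dtype('float64')
print(is_numeric_dtype(aloha.value))    # True for a float64 column
print(is_string_dtype(aloha.module))    # typically True for an object column of strings

# change a column's dtype with astype(); here we only convert a copy, for illustration
values_as_str = aloha.value.astype(str)
print(values_as_str.dtype)              # object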

filtering
tmp = aloha[(aloha.type=='scalar') & (aloha.module=='Aloha.server') &
(aloha.name=='channelUtilization:last')]
tmp.head()

Conditions can be combined with AND/OR using the "&" and "|" operators, but you need
parentheses around the individual conditions because of operator precedence. The above command
selects the rows that contain scalars with a certain name and owner module.
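An OR condition works the same way; as a sketch (the second module name, 'Aloha.host[0]', is only a
hypothetical example and may not exist in your results file):

tmp = aloha[(aloha.module == 'Aloha.server') | (aloha.module == 'Aloha.host[0]')]
tmp.head()

# isin() is often a more readable alternative for this kind of OR condition
tmp = aloha[aloha.module.isin(['Aloha.server', 'Aloha.host[0]'])]
tmp.head()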
add column
You'll also need to know how to add a new column to the data frame. Now that is a somewhat controversial
topic, because at the time of writing, there is a "convenient" syntax and an "official" syntax for it.
The "convenient" syntax is a simple assignment, for example:
aloha['qname'] = aloha.module + "." + aloha.name
aloha[aloha.type=='scalar'].head() # print excerpt

It looks nice and natural, but it is not entirely correct. It often results in a warning:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.... The
message essentially says that the operation (here, adding the new column) might have been applied
to a temporary object instead of the original data frame, and thus might have been ineffective.
Luckily, that is not the case most of the time (the operation does take effect). Nevertheless, for
production code, i.e. scripts, the "official" solution, the assign() method of the data frame is
recommended, like this:
aloha = aloha.assign(qname = aloha.module + "." + aloha.name)
aloha[aloha.type=='scalar'].head()
delete column
For completeness, one can remove a column from a data frame using either the del operator or the
drop() method of the data frame. Here we show the former (also to remove the column we added
above, as we won't need it for now):
del aloha['qname']
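The drop() method is the alternative; a sketch (it returns a new data frame rather than modifying
the original in place, and running it here would raise a KeyError because the column has already
been removed by the del statement above):

aloha = aloha.drop(columns=['qname'])   # equivalent to: del aloha['qname']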
plots
import sys
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot()
plt.show()

# Two lines to make an online editor without a display (such as the w3schools
# "Try it" compiler) able to draw; not needed when running locally:
plt.savefig(sys.stdout.buffer)
sys.stdout.flush()
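When working locally, you may prefer saving the figure straight to an image file instead (a sketch;
'myplot.png' is just an example filename):

plt.savefig('myplot.png')   # writes the current figure to a PNG file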

note: the scatter plot and histogram examples below work with 'data.csv', available at
https://www.w3schools.com/python/pandas/data.csv.txt

scatter plot
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
A scatter plot needs an x- and a y-axis.
In the example below we will use "Duration" for the x-axis and "Calories" for the y-axis.
Include the x and y arguments like this:
x = 'Duration', y = 'Calories'
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

plt.show()

note: In a previous study with this data, it was found that the correlation between "Duration" and
"Calories" was 0.922721, and the conclusion was made that a higher duration means more calories
burned.
Looking at the scatter plot, I would agree. Do you?
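If you want to check that number yourself, Pandas can compute the correlation directly (a quick
sketch, assuming data.csv has been loaded into df as above):

print(df['Duration'].corr(df['Calories']))   # about 0.92 for this dataset

# the full correlation matrix of the numeric columns (numeric_only requires a recent Pandas version)
print(df.corr(numeric_only=True))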

Let's create another scatter plot, where there is a weak relationship between the columns, like
"Duration" and "Maxpulse", with a correlation of 0.009403:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')

plt.show()
exercise: histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
A histogram needs only one column.
A histogram shows us the frequency of each interval, e.g. how many workouts lasted between 50
and 60 minutes?
In the example below we will use the "Duration" column to create the histogram:
df["Duration"].plot(kind = 'hist')
Note: The histogram tells us that there were over 100 workouts that lasted between 50 and 60
minutes.
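The bin width is chosen automatically; if you want narrower or wider intervals, the bins argument
can be passed along (a sketch, the value 20 is arbitrary):

df["Duration"].plot(kind = 'hist', bins = 20)   # more bins, narrower intervals
plt.show()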

pivot tables
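As a minimal sketch of what a pivot table looks like in Pandas (using only the aloha columns already
seen above), pivot_table() can rearrange the scalar rows so that each module becomes a row and each
statistic name a column, averaging the values per cell by default:

scalars = aloha[aloha.type == 'scalar']
scalars.pivot_table(index='module', columns='name', values='value')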

get info about your data

The .info() method provides the essential information about a data frame: the number of rows, the
column names with their dtypes and non-null counts, and the memory usage. E.g.:
df.info()
