
Import pandas with import pandas as pd so that the library can be referenced through the alias pd.

pd.Series
Data=pd.Series([list], index=[list]) the first parameter is like a numpy
array, while the second parameter indexes it like a dictionary, where
the index is the key and the first parameter holds the values. If index is not passed,
then the indexing works like normal: 0, 1, 2, 3, 4, 5, … . A Series handles only
1D arrays.
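A minimal sketch of both forms (the values and labels here are made-up examples):

```python
import pandas as pd

# Explicit index: labels act like dictionary keys.
data = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# No index passed: pandas falls back to 0, 1, 2, ...
plain = pd.Series([10, 20, 30])
```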
Data.values printing this gives you the values present in the
first parameter. It is a numpy array; Data itself is a pandas object.
Data.index printing this returns either the second parameter or,
if none was passed, a range from 0 up to the last index-able
value. Its type is a pandas index object.
Printing ‘Data’ will return two columns, where the left column is the
index and the right column is the values.
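For example, with the hypothetical data Series above:

```python
print(data.values)  # numpy array: [10 20 30]
print(data.index)   # Index(['a', 'b', 'c'], dtype='object')
print(data)         # two columns: index on the left, values on the right
```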
We can pass a dictionary to pd.Series to turn the keys into the index and
the values into the values.
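A quick sketch with a made-up dictionary:

```python
ages = pd.Series({'abel': 25, 'sara': 30})
# keys 'abel' and 'sara' become the index; 25 and 30 become the values
```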
If we pass values in the second parameter, i.e. index=[n1, n2, n3, … nx], this is
called an explicit index, in which slicing with [nd:nx] includes both
endpoints. But if we slice in the traditional way, i.e. [x:d] where
both are positional numbers, then x is included but d is not; this is called implicit
indexing.
How to index
Data[x] returns the value that goes with the given index.
Data[x:y] returns the values from x to (y-1) with an implicit index, or returns
the values from x to y with an explicit index.
Note: for Data[x:y], even if the explicit index is made of numbers,
slicing by default accesses the implicit index, whereas Data[x] by
default accesses the explicit index. In order to avoid confusion between
the explicit and implicit index when the index we passed is also numeric,
do this: -
Data.iloc[x:y] or Data.iloc[x] to forcefully use the implicit index.
Data.loc[x:y] or Data.loc[x] to forcefully use the explicit index.
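A sketch of the numeric-index pitfall and how .loc/.iloc resolve it (values are illustrative):

```python
s = pd.Series(['w', 'x', 'y', 'z'], index=[3, 5, 7, 9])

s[3]         # 'w'  -> a single number uses the explicit index
s[1:3]       # 'x', 'y' -> a slice uses the implicit (positional) index
s.loc[5:9]   # 'x', 'y', 'z' -> explicit index, both endpoints included
s.iloc[1:3]  # 'x', 'y' -> implicit index, end excluded
```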
pd.DataFrame
The first way to build this object is to pass multiple pd.Series into a
dictionary, like:
Super_data=pd.DataFrame({data:pd.Series, data1:pd.Series, …})
Printing this returns columns composed of data, data1, …, the
index of each pd.Series as rows, and the intersection of these rows and
columns filled with the corresponding values present in the
pd.Series.
You can transpose it like in numpy, by accessing the .T attribute.
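A sketch of this construction with hypothetical Series:

```python
pop = pd.Series({'addis': 5, 'dire': 1})
area = pd.Series({'addis': 527, 'dire': 1213})

super_data = pd.DataFrame({'pop': pop, 'area': area})
print(super_data)    # rows 'addis'/'dire', columns 'pop'/'area'
print(super_data.T)  # transposed, like numpy: rows and columns swapped
```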
Super_data.values this returns a 2D array of the values present in the cells.
Apply all the indexing methods from the numpy notes for
matrices; remember that the matrix is Super_data.values. Alternatively, we can use
the .iloc accessor on our DataFrame in order to utilize every matrix indexing
method present in the numpy notes. So basically, if we
use .iloc on our DataFrame, we can index it as a matrix.
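Continuing the hypothetical super_data above:

```python
super_data.values      # 2D numpy array of the cell values
super_data.iloc[0, 1]  # matrix-style indexing: row 0, column 1
super_data.iloc[:, 0]  # all rows of the first column
```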
Super_data.columns returns all columns.
To add another column do this:-
If we want to add a column new_data then:
Super_data[new_data] = {pass the values for that column}
del Super_data[new_data] will delete the column.
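For example, adding a derived column and deleting it again ('density' is a made-up name):

```python
super_data['density'] = super_data['pop'] / super_data['area']  # new column
del super_data['density']                                       # remove it again
```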
In order to access a sub-matrix of Super_data:
Super_data[Super_data[column] >, <, !=, == value] then it will
keep the rows according to the comparison given.
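A sketch of that boolean filtering, using the hypothetical super_data:

```python
super_data[super_data['pop'] > 2]  # keeps only the rows where the condition holds
```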
pd.DataFrame([{key:value}, {key:value}]) returns data composed of
rows and columns. The keys become the columns, the values
fill the cells, and the index labels the rows. Since no index is given,
the index will be 0, 1, 2, 3, 4, …
Note: each index here represents one dictionary; plus, mind that the
dictionaries are passed in a list.
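A minimal sketch with made-up keys:

```python
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
# columns 'a' and 'b'; index 0 and 1, one row per dictionary
```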

How to handle NaN values in pandas


If we pass multiple dictionaries or pd.Series into pd.DataFrame and
there are index labels (for pd.Series) or keys (for dictionaries in a list) that
are not common to all of them, then the missing cells are filled with NaN, which stands
for Not a Number.
(pd.DataFrame).fillna(anything) fills every cell holding NaN
with the value passed to .fillna().
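A sketch of the NaN behaviour and of fillna (keys chosen arbitrarily):

```python
df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
# 'a' is missing in the second row and 'c' in the first, so those cells are NaN
filled = df.fillna(0)  # every NaN replaced with 0
```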

Working with real data:


Df=pd.read_(csv, excel, json, …)(absolute location of the file with the
matching extension)
Df.head(n) returns the first n rows (n is an integer).
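For example, assuming a hypothetical CSV file at /path/to/data.csv:

```python
df = pd.read_csv('/path/to/data.csv')  # hypothetical file location
df.head(10)  # first 10 rows; df.head() with no argument defaults to 5
```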
Df.drop([column/row, column/row …], axis = 1 or 0, inplace = True /
False) drops / removes the selected columns or rows depending on the second
parameter, where axis=0 means rows and axis=1 means columns.
The third parameter decides whether to change the actual data
frame or to create a copy of it in which the column or row is
removed. If True is passed, it operates on the actual data frame, whereas
if False is passed, the actual data frame is not touched and a copy
is created instead.
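A sketch with hypothetical labels:

```python
df.drop(['price'], axis=1, inplace=True)       # remove the column 'price' in place
copy = df.drop([0, 1], axis=0, inplace=False)  # copy of df without the first two rows
```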
Df.rename(columns={existing column name:new column name,
existing column name:new column name, …}, inplace = True/False)
renames columns to the new names passed as the dictionary values,
while the inplace parameter works as mentioned above.
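For instance, with made-up column names:

```python
df.rename(columns={'old_name': 'new_name'}, inplace=True)
```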
Df[column containing dates]=pd.to_datetime(Df[column containing
dates]) converts the date format present in the file into the format pandas
utilizes. Pandas expects the format: year-month-day.
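Assuming a hypothetical 'date' column:

```python
df['date'] = pd.to_datetime(df['date'])  # parsed into pandas datetime values
```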
Df.describe() will provide summary statistics for each numerical column
in the DataFrame.

The output of `df.describe()` includes the following statistics:

- **count**: Number of non-null values.
- **mean**: Mean (average) value.
- **std**: Standard deviation, a measure of the spread of the data.
- **min**: Minimum value.
- **25%**: First quartile (25th percentile).
- **50%**: Median or second quartile (50th percentile).
- **75%**: Third quartile (75th percentile).
- **max**: Maximum value.
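A minimal call, reusing the df from earlier:

```python
stats = df.describe()
stats.loc['mean']  # e.g. pull just the mean row out of the summary table
```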
Df.info() provides a concise summary of a DataFrame, including
information about the data types of each column, the number of non-null
values, and the memory usage. This method is useful for quickly
understanding the structure of the DataFrame and identifying any
missing or null values.
The output will include the following information:

- The number of non-null values in each column.
- The data type of each column.
- The total memory usage of the DataFrame.

Additionally, it will display a count of non-null values for each column,
helping you identify columns with missing data. This summary is helpful
for initial data exploration and understanding the basic characteristics of
the DataFrame.
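A minimal call:

```python
df.info()  # prints dtypes, non-null counts, and memory usage to stdout
```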
Df[column name].value_counts() returns how many times each value
is repeated in the column.
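For example, with a hypothetical 'city' column:

```python
df['city'].value_counts()  # each distinct value with its number of occurrences
```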
Df.groupby([column, column, …]).column.agg(len, min, max) ==
df.set_index([column, column, …], inplace=True/False) ==
df.pivot_table(values=column, index=column, columns=column)
Each is like creating multiple indexes (a MultiIndex) where the rest are
values. We can think of this as a 2D array.
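A sketch of the three related operations on a made-up DataFrame:

```python
df = pd.DataFrame({
    'city': ['addis', 'addis', 'dire'],
    'year': [2020, 2021, 2020],
    'pop':  [5, 6, 1],
})

df.groupby('city')['pop'].agg([len, min, max])              # aggregates per city
df.set_index(['city', 'year'])                              # MultiIndex of both columns
df.pivot_table(values='pop', index='city', columns='year')  # 2D view: city x year
```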
