Popular Python Data Science Libraries
Data manipulation and analysis
• Pandas: a library for the manipulation and analysis of
data.
• It provides many easy-to-use functions for
data manipulation and analysis, built around its own
data structures.
• The data structures supported by Pandas are the
DataFrame (which holds two-dimensional data) and the
Series (which holds one-dimensional data).
• It works with labelled and relational
datasets.
• It can convert other data structures into a
DataFrame and supports data analysis tasks such as
finding missing values, plotting the data as a
histogram, dropping columns that contain null values, and more.
• It offers features for data manipulation,
data wrangling, and visualization.
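A minimal sketch of the Pandas tasks listed above, assuming a small made-up table (the column names `name`, `score` and `notes` are hypothetical and not from the slides):

```python
import pandas as pd

# Made-up records for illustration; None marks a missing value.
records = {
    "name": ["ann", "bob", "cy"],
    "score": [88.0, 92.5, 79.0],
    "notes": [None, "late", None],
}

# Convert a plain dict of lists into a two-dimensional DataFrame.
df = pd.DataFrame(records)

# Selecting one column gives a one-dimensional Series.
scores = df["score"]

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Drop every column that contains at least one null value.
cleaned = df.dropna(axis="columns")

print(missing_per_column)
print(cleaned)
```

Plotting a histogram, for example with `df["score"].hist()`, additionally requires Matplotlib to be installed.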
Data processing
• NumPy: short for Numerical Python;
it is used for scientific computing.
• It provides a large number of functions for
working with multi-dimensional arrays, matrices
and linear algebra.
• A wide range of operations can be
performed on arrays and matrices using the
functions and methods provided by this Python package.
• It also provides tools for integrating
code written in C, C++, and Fortran.
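A short sketch of the array, matrix and linear algebra operations mentioned above. The values are made up for illustration:

```python
import numpy as np

# A 2x2 matrix and a vector (made-up values).
a = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Element-wise arithmetic applies to every entry at once.
doubled = a * 2

# Matrix product and transpose.
product = a @ a
transposed = a.T

# Linear algebra: solve the system a @ x = b for x.
x = np.linalg.solve(a, b)

# A three-dimensional array with shape (2, 3, 4).
cube = np.arange(24).reshape(2, 3, 4)
summed = cube.sum(axis=0)  # collapses the first axis, leaving shape (3, 4)

print(product)
print(x)
print(summed.shape)
```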
Data processing
• SciPy: a Python library used in both
data science and scientific computing.
• It provides routines for math, science and engineering.
• NumPy, Matplotlib and pandas are libraries
that fall under the SciPy project umbrella.
• It includes modules for image
processing, linear algebra, integration,
interpolation of data, etc.
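A small sketch touching three of the SciPy modules named above (integration, interpolation and linear algebra). The function and sample points are made up for illustration:

```python
import numpy as np
from scipy import integrate, interpolate, linalg

# Numerical integration: the integral of x**2 from 0 to 3 is 9.
area, error_estimate = integrate.quad(lambda x: x**2, 0.0, 3.0)

# Interpolation: estimate a value between known sample points.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.0, 2.0, 4.0, 6.0])   # made-up samples of y = 2x
line = interpolate.interp1d(xs, ys)   # linear interpolation by default
y_mid = float(line(1.5))

# Linear algebra: determinant of a 2x2 matrix (1*4 - 2*3 = -2).
det = linalg.det(np.array([[1.0, 2.0], [3.0, 4.0]]))

print(area, y_mid, det)
```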
Machine Learning and NLP
• Scikit-learn: a popular Python library for
implementing machine learning algorithms.
• It helps in quickly applying popular
machine learning algorithms such as linear
regression and logistic regression, as well as
data preprocessing and dimensionality reduction
tasks.
• This Python library is built on NumPy,
SciPy, and Matplotlib.
• NLTK: NLTK stands for Natural Language Toolkit.
It is an open-source library for working with
human language data. It is very useful for
problems like text analytics, sentiment analysis,
analyzing linguistic structure, etc.
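For the scikit-learn bullets above, a minimal sketch of linear regression, preprocessing and dimensionality reduction. The data is made up, and the choice of `StandardScaler` and `PCA` is an assumption for illustration. The slides do not name specific classes:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up training data that follows y = 3x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 4.0, 7.0, 10.0])

# Linear regression: fit, then predict for an unseen input.
model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
prediction = model.predict(np.array([[4.0]]))[0]

# Preprocessing: rescale each feature to zero mean and unit variance.
features = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
])
scaled = StandardScaler().fit_transform(features)

# Dimensionality reduction: project three features down to one.
reduced = PCA(n_components=1).fit_transform(scaled)

print(slope, intercept, prediction)
print(reduced.shape)
```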
Deep Learning
• TensorFlow: an open-source framework
by Google for end-to-end machine
learning and deep learning solutions.
• It gives users low-level control to
design and train highly scalable and
complex neural networks.
• It is available for both desktop and mobile and
supports a large number of
programming languages through
wrappers.
Try:
• Create first_six: the first 6 rows, selected with a slice that omits the begin index.
• Create last_four: the last 4 rows, selected with a slice that omits the end index.
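One possible solution to the exercise, assuming `data` is a DataFrame. The ten-row frame below is made up so the sketch runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the dataset used in the slides.
data = pd.DataFrame({
    "Id": range(1, 11),
    "Species": ["setosa"] * 5 + ["virginica"] * 5,
})

# Omitting the begin index starts the slice at the first row.
first_six = data[:6]     # same rows as data[0:6]

# Omitting the end index runs the slice through the last row.
last_four = data[-4:]    # starts four rows before the end

print(first_six)
print(last_four)
```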
Selecting columns
• specific_data = data["Species"]
• specific_data = data[["Id", "Species"]]
• # data[["column_name1", "column_name2", "column_name3"]]
• data['name'].unique()
• data.groupby('name').size()
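The calls above can be tried on a small made-up frame. The rows below are hypothetical, and the sketch groups by `Species` because this stand-in frame has no `name` column:

```python
import pandas as pd

# Hypothetical stand-in for the dataset used in the slides.
data = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3],
    "Species": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})

# One column name returns a Series.
species = data["Species"]

# A list of column names returns a DataFrame with just those columns.
id_and_species = data[["Id", "Species"]]

# Distinct values of a column, in order of first appearance.
distinct_species = data["Species"].unique()

# Number of rows in each group.
rows_per_species = data.groupby("Species").size()

print(distinct_species)
print(rows_per_species)
```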
Select a column or columns in a dataframe
• For pandas objects (Series, DataFrame), the indexing operator [] only accepts:
1. a column name or a list of column names, to select column(s)
2. a slice or a Boolean array, to select row(s); i.e. it only refers to one dimension of the DataFrame at a time.
• So if you’re choosing one column, you can get away with passing in just the name
of the column.
df["columnname"]
• But if you’re choosing multiple columns, you have to pass in a container that
contains multiple values. The most commonly used data type is the list, which
also happens to be defined using square brackets.
columns = ["c1", "c2"]
df[columns]
• Or to simplify
df[["c1", "c2"]]
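The four kinds of argument that `[]` accepts can be compared side by side. This is a sketch on a made-up frame; `c3` and the values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "c1": [1, 2, 3, 4],
    "c2": [10, 20, 30, 40],
    "c3": ["a", "b", "c", "d"],
})

# 1. A single column name selects one column (a Series).
one_column = df["c1"]

# 2. A list of column names selects several columns (a DataFrame).
columns = ["c1", "c2"]
two_columns = df[columns]
same_two = df[["c1", "c2"]]   # the same list written inline

# 3. A slice selects rows by position.
first_two_rows = df[0:2]

# 4. A Boolean array (one True/False per row) selects matching rows.
mask = df["c1"] > 2
matching_rows = df[mask]

print(two_columns)
print(matching_rows)
```

Note that the outer brackets are the indexing operator and the inner brackets build the list. Writing `df["c1", "c2"]` without the inner list passes a single tuple key, which would raise a `KeyError` for a frame like this one.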