
COURSE CODE: MGNM801

ACADEMIC TASK NO: 02    ACADEMIC TASK TITLE: PRACTICAL

REG NO: 12303726
STUDENT'S ROLL NO: RQ2341A26    SECTION: Q2341
GUIDED BY: ANAND KUMAR

LEARNING OUTCOMES

DECLARATION –

I DECLARE THAT THIS ASSIGNMENT IS MY INDIVIDUAL WORK. I HAVE NOT COPIED IT FROM ANY
OTHER STUDENT'S WORK OR FROM ANY OTHER SOURCE EXCEPT WHERE DUE
ACKNOWLEDGEMENT IS MADE EXPLICITLY IN THE TEXT, NOR HAS ANY PART BEEN WRITTEN FOR ME
BY ANY OTHER PERSON.

STUDENT’S SIGNATURE: SAURABH MAURYA

EVALUATOR’S COMMENTS (FOR INSTRUCTOR’S USE ONLY)

GENERAL OBSERVATIONS    SUGGESTIONS FOR IMPROVEMENT    BEST PART OF ASSIGNMENT

EVALUATOR'S SIGNATURE AND DATE

MARKS OBTAINED - ____________ MAX. MARKS - ___________
UNIT-4
Part 1: Pandas
Q.1. List at least three real-world scenarios where pandas can be used for data
analysis. Explain the specific use cases in each scenario.
Ans: Three real-world scenarios where pandas can be used are as follows:

1. Financial Analysis: Pandas is frequently used to analyse and work with
financial data in the finance sector. For instance, investment firms can
use pandas to examine stock prices, compute portfolio returns, and
carry out risk evaluations. Its efficient data manipulation, statistical
analysis, and visualization capabilities aid well-informed financial
decisions.
a. Identify potential defaulters: Group loan data by credit score, income,
and loan type, then calculate average repayment rates for each group
to identify segments with higher default risk.
b. Target marketing campaigns: Create customer groupings based on
loan amounts, purchasing patterns, and demographics, and use them
to build customized marketing campaigns for different audiences.
2. Market Research: Pandas can be used to process and evaluate big
datasets in market research. Pandas can be used by businesses to clean
and preprocess survey data, aggregate and segment data, and produce
insightful reports. Researchers can use it to find patterns, trends, and
customer preferences, which helps with strategic planning and product
development.
a. Identify top-selling products and trends: Analyse sales statistics by
price, brand, and product category. Determine the revenue and
average selling price of each product to find the most popular items
and emerging trends.
b. Basket analysis: Identify goods that are usually bought together, and
use this information to boost average order value by suggesting
complementary items.
3. Healthcare Analytics: Pandas is useful in the field of healthcare analytics
as well. Patient data, including demographics, medical records, and
treatment results, can be analysed with it. Healthcare workers can use
pandas to forecast patient outcomes, assess the efficacy of therapies,
and find trends in the occurrence of diseases. Making data-driven
decisions, allocating resources optimally, and delivering better healthcare
are all aided by it.
a. Compare treatment groups: Examine clinical trial data according to
patient demographics, treatment group, and health results. To find
out if the new medication is more effective than current therapies,
use statistical tests.
b. Identify side effects: Keep track of any adverse events that trial
participants report. Use pandas to filter and organize the data by
incident type, severity, and patient attributes to find any safety
issues.
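The grouping ideas above can be sketched with pandas. This is a minimal illustration with made-up data; the column names (`credit_band`, `repaid`) are hypothetical, not from an actual loan dataset:

```python
import pandas as pd

# Hypothetical loan data; a real dataset would have different columns
loans = pd.DataFrame({
    "credit_band": ["low", "low", "high", "high", "mid"],
    "loan_type":   ["auto", "home", "auto", "home", "auto"],
    "repaid":      [0, 1, 1, 1, 0],   # 1 = repaid on time, 0 = defaulted
})

# Average repayment rate per credit band: segments with low values
# are the higher default-risk groups
rates = loans.groupby("credit_band")["repaid"].mean()
print(rates)
```

The same `groupby` pattern applies to the market-research and healthcare scenarios: only the grouping keys and the aggregated column change.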

Q.1.2. Describe the primary data structures in pandas, namely Series and
DataFrame. Explain the differences and use cases for each.
Ans: Pandas Series:

a. One-dimensional array of indexed data: It is similar to a spreadsheet
column; it contains values of only one kind of data (strings, integers,
etc.).
Key components:

1. Values: The actual data elements.
2. Index: A label or position for each value, enabling fast lookup and
alignment.
3. Data type: All elements share a single data type; the Series is
homogeneous.

Use cases:

1. Keeping and handling single-column data, such as a list of names or
prices.
2. Extracting or altering a particular column of a DataFrame.
3. Representing time series data as time-indexed values.
4. Storing categorical data, using distinct labels to represent
categories.
Pandas DataFrame:

1. Two-dimensional table with adjustable size.
2. Visualize it as a SQL table or spreadsheet, with labeled columns and
rows that can store different kinds of data.
Use cases:

1. Tabular data representation: storing and using information in an
organized manner.
2. Reading and writing data from a variety of sources, such as
databases, Excel, and CSV files.
3. Data cleaning and preparation: data formatting, variable
transformation, and handling of missing values.
4. Data analysis and exploration: calculations, aggregations, and
visualizations.
5. Feature engineering: generating new features from existing data.
CODE OF BOTH ONE AND TWO DIMENSIONAL:
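The original code screenshot is not reproduced here; a minimal sketch of both structures, using illustrative data, might look like:

```python
import pandas as pd

# One-dimensional: a Series of prices with a custom (label-based) index
prices = pd.Series([19.99, 5.49, 12.00], index=["book", "pen", "mug"])
print(prices["pen"])       # fast label-based lookup

# Two-dimensional: a DataFrame, like a spreadsheet with labeled columns
df = pd.DataFrame({
    "product": ["book", "pen", "mug"],
    "price":   [19.99, 5.49, 12.00],
    "stock":   [12, 100, 30],
})
print(df["price"].mean())  # column-wise aggregation
```

Note that selecting one column of the DataFrame (`df["price"]`) itself returns a Series, which is how the two structures relate in practice.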

Part 2: NumPy
Q.1. Write a brief description of what NumPy is and why it is important for
scientific computing and data analysis in Python.
Ans: NumPy stands for "Numerical Python" and is a well-known Python
package. It supports large, multi-dimensional arrays and matrices and provides
a broad collection of mathematical functions for manipulating these arrays
efficiently. NumPy is used extensively in scientific computing, data analysis,
and machine learning. It enables efficient manipulation and calculation of
numerical data and provides high-performance numerical operations. NumPy is
frequently used in conjunction with other libraries, such as pandas and
Matplotlib, to carry out intricate data analysis and visualization tasks.

1. Efficient handling of multidimensional arrays:
a. The ndarray (NumPy array) is a powerful, high-performance data
structure offered by NumPy. Unlike normal Python lists, this array can
store and manage enormous quantities of data efficiently.
b. NumPy's built-in C routines greatly speed up operations like
vectorization, broadcasting, and matrix multiplication.
2. Providing a rich set of mathematical and statistical functions:
a. Numerous built-in functions for statistical analysis, random number
generation, linear algebra, and other applications are available in
NumPy. Because of this, intricate numerical calculations can be
performed without writing a lot of code.
b. These functions can apply calculations to full arrays at once since they
are optimized for vectorized operations, which further increases
efficiency.
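A short sketch of the built-in, vectorized functions described above (the values are illustrative):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

print(np.mean(a))          # statistical summary over the whole array
print(np.std(a, axis=0))   # column-wise standard deviation
print(a @ a)               # matrix multiplication (linear algebra support)

rng = np.random.default_rng(seed=0)
print(rng.normal(size=3))  # random number generation
```

Each call operates on the full array at once; no explicit Python loop is written.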
NUMPY IS IMPORTANT FOR SCIENTIFIC COMPUTING:

a. Speed and memory efficiency: When compared to conventional
Python, NumPy's specialized arrays and functions greatly increase the
performance and memory utilization of scientific computations.
b. Versatility: It is a fundamental tool for data analysis and visualization
and forms the basis for numerous other scientific libraries, including
SciPy, pandas, and Matplotlib.
c. Simplicity and elegance: NumPy's succinct and intuitive syntax
facilitates the expression and understanding of complex computations.

The foundation of Python's scientific computing and data analysis is NumPy. For
anyone working with numerical data, its robust array structures, well-optimized
functions, and broad integration make it a vital tool.
Q.2. Explain the significance of NumPy in terms of performance and efficiency
when working with large datasets and numerical computations.
Ans: Vectorization and memory optimization are two important aspects of
NumPy that contribute to its efficiency and performance for huge datasets and
numerical computations.
Vectorization:

a. Using NumPy, you can work with complete data arrays (vectors) as
opposed to single elements. This lets you take full advantage of the
Single Instruction, Multiple Data (SIMD) features of your CPU. When
compared to iterating through each data element separately, SIMD
significantly speeds up the process by executing the same instruction on
several components at once.
b. Consider the task of adding up a million numbers. With Python lists,
each element would be added one after the other using a for loop.
NumPy, by contrast, processes the elements in parallel, greatly
cutting down on execution time.
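The million-number example above can be sketched as follows; the absolute timings are machine-dependent, but the vectorized sum is typically far faster:

```python
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

# Element-by-element Python loop
t0 = time.perf_counter()
total_loop = 0
for x in py_list:
    total_loop += x
t_loop = time.perf_counter() - t0

# Single vectorized call executed in compiled C
t0 = time.perf_counter()
total_vec = np_arr.sum()
t_vec = time.perf_counter() - t0

assert total_loop == total_vec
print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.4f}s")
```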

Memory optimization:
Unlike Python lists, which can be dispersed throughout memory, NumPy saves
data in blocks of contiguous memory. There are two main benefits to this
continuous storage:

a. Cache efficiency: Performance can be further enhanced by efficiently


loading data from closer together in memory into the CPU's cache.
b. Smaller memory footprint: In general, contiguous storage uses less
memory than scattered lists, especially when dealing with large
datasets. The reduced memory utilization minimizes memory pressure
and makes it possible to work with greater data volumes on the same
hardware.
Here are some of the additional points to highlight the significance of
NumPy:

a. Reduced code complexity: Writing complicated computations in a
clear, understandable manner using vectorized operations results in
code that is clearer and easier to maintain.
b. Interoperability: Workflow is made easier by NumPy's seamless
integration with other well-known scientific and machine learning
libraries like TensorFlow, SciPy, and Pandas.
c. Open-source and mature: NumPy is an established and popular library
that has a large documentation base and strong community support,
making it a dependable option for your applications.

UNIT-5
Data Visualization:

1. Create a matplotlib bar plot showing the sales of products in a store for
a given month. Label the axes, add a title, and customize the
appearance (e.g., colour, width).
Ans: Matplotlib is a robust Python package that can be used to create a
wide variety of visualizations, from straightforward line plots to complex
three-dimensional models. It is a fundamental component of Python data
visualization, providing adaptability, customization, and integration with
other scientific tools.
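A sketch of the requested bar plot, using hypothetical product names and sales figures (the original code screenshot is not reproduced here):

```python
import matplotlib.pyplot as plt

# Hypothetical sales figures for one month
products = ["Laptop", "Phone", "Tablet", "Monitor", "Headphones"]
sales = [120, 200, 90, 60, 150]

fig, ax = plt.subplots(figsize=(8, 5))
# Customize colour, bar width, and edge colour
ax.bar(products, sales, color="steelblue", width=0.6, edgecolor="black")
ax.set_xlabel("Product")
ax.set_ylabel("Units sold")
ax.set_title("Product Sales for the Month")
plt.tight_layout()
plt.show()
```

The `color` and `width` arguments to `ax.bar` cover the customization the question asks for; any other matplotlib styling (gridlines, tick rotation) can be layered on the same `ax` object.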

Q.2. Provide at least three examples of data visualization scenarios where
seaborn is the preferred library over Matplotlib. Describe the type of plots or
charts involved and why seaborn is a better choice.
Ans:
Scenario 1: Visualizing Distributions

a. Type of plots: Histograms, KDE plots, violin plots, empirical cumulative
distribution functions (ECDFs)
Why seaborn excels:
a. Simpler syntax for creating these plots
b. Integrated statistical estimation for smoother distributions
c. Better visual clarity and aesthetics
Example: Comparing weight distributions of different animal species
Scenario 2: Exploring relationships between variables

a. Type of plots: Scatter plots, joint plots, pair plots, heatmaps
Why seaborn excels:
a. Simplifies handling multiple variables and groups
b. Provides built-in functions for regression and correlation analysis
c. Offers options for customization and colour palettes
Example: Visualizing the relationship between exam scores, study hours, and
student demographics
Scenario 3: Visualizing categorical data

a. Type of plots: Bar plots, count plots, box plots, violin plots, swarm plots
Why seaborn excels:
a. Streamlines handling of categorical variables
b. Offers a wider variety of plot types for categorical data
c. Facilitates visual comparisons between groups
Example: Comparing customer satisfaction ratings across different product
categories

Some main points:

a. Seaborn expands upon Matplotlib, providing a more advanced interface
for statistical data visualization.
b. Seaborn excels at handling statistical computations, simplifying
complicated charts, and offering visually appealing defaults.
c. Because of its effectiveness, clarity, and emphasis on statistical analysis,
seaborn is frequently chosen when working with distributions,
relationships between variables, or categorical data.
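Scenario 3 can be sketched in a few lines; the ratings data here is made up for illustration. Note how one `violinplot` call handles the grouping, statistics, and aesthetics that would take several steps in plain Matplotlib:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical customer satisfaction ratings across product categories
df = pd.DataFrame({
    "category": ["Electronics"] * 5 + ["Clothing"] * 5 + ["Groceries"] * 5,
    "rating":   [4, 5, 3, 4, 5,  2, 3, 3, 4, 2,  5, 4, 4, 5, 3],
})

# One call groups by category and draws the per-group distributions
ax = sns.violinplot(data=df, x="category", y="rating")
ax.set_title("Customer satisfaction by product category")
plt.show()
```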
UNIT-6
Q.1. Figure, Data, and Layout. Explain the purpose of each structure in
creating visualizations.
Ans: 1. Figure:

a. Purpose: The total holding area for the data and graphic components. It
determines the visualization's dimensions and bounds.
b. Components:
1. Canvas: The region where images are placed.
2. Axes: Establish the coordinate system and scales on which the
data is plotted.
3. Gridlines: Optional lines that facilitate reading values and serve
as reference points.
4. Titles and labels: Give the visualization's material some context
and clarification.
5. Legends (for multi-series charts): Describe the meanings of the
various colours, forms, and symbols.

2. Data:
a. Purpose: The core information being visualized, presented in a visual
form.
b. Representation:
1. Numerical values: Shown as regions, bars, points, lines, or
other shapes.
2. Categorical data: Depicted using text labels, shapes, or
colours.
3. Spatial data: Mapped to a visual space's coordinates.
4. Textual data: Shown as headings, comments, or as part of an
image.

3. Layout:
a. Purpose: The arrangement of the data and visual elements within the
figure to improve comprehension and communication.
b. Elements:
1. Positioning: Deciding on the placement of the pieces on the
canvas.
2. Spacing: Modifying the spacing between components to
improve clarity of vision.
3. Hierarchy: Highlighting specific components to direct
attention.
4. Alignment: Establishing visual coherence and organization.
5. Grouping: Putting similar components in order.
Q.2. Load a sales dataset with a ‘sales’ column and create a plotly line chart
to visualize the total sales trend. Include axis labels, a title, and customize
the appearance.
Ans:
