
COURSE TITLE: BUSINESS ANALYTICS-I

COURSE CODE: MGNM801 ACADEMIC TASK TYPE: INDIVIDUAL

ACADEMIC TASK: 02 DATE OF SUBMISSION: 23-12-23

DATE OF ALLOTMENT: GUIDED BY: PUNEET PAHADIA

NAME: RAUSHAN KUMAR REGISTRATION NO: 12313538

SECTION: Q2327 ROLL NO: RQ2327A08

Declaration:
I declare that this Assignment is my individual work. I have not copied it from any other student’s
work or from any other source except where due acknowledgement is made explicitly in the text,
nor has any part been written for me by any other person.

EVALUATOR’S COMMENTS (FOR INSTRUCTOR’S USE ONLY)

GENERAL OBSERVATIONS | SUGGESTIONS FOR IMPROVEMENT | BEST PART OF ASSIGNMENTS

EVALUATOR’S SIGNATURE AND DATE

MARKS OBTAINED - ____________ MAX. MARKS - ___________


UNIT 4
1 PART 1: PANDAS

1.1 Q1. LIST AT LEAST THREE REAL-WORLD SCENARIOS WHERE PANDAS CAN BE USED FOR DATA ANALYSIS. EXPLAIN THE SPECIFIC USE CASES IN EACH SCENARIO.

Some real-world scenarios where Pandas can be used for data analysis:

1. Sports: Analyzing player performance statistics, tracking team trends, identifying factors that
contribute to wins and losses, optimizing training and strategies.

Pandas use:

• Loading and cleaning data from various sources (game scores, player statistics, sensor
readings, etc.).
• Calculating key performance metrics (averages, shooting percentages, assists, rebounds,
etc.).
• Visualizing trends and patterns (player performance over time, team comparisons, win-loss
distributions).
• Building predictive models to forecast player performance, game outcomes, or injury risks.

2. Social Media: Understanding user behaviour, identifying popular topics and trends, analyzing
sentiment and engagement, optimizing marketing campaigns.

Pandas use:

• Collecting and preparing social media data (tweets, posts, comments, likes, shares).
• Cleaning and preprocessing text data (removing noise, handling
emojis, stemming/lemmatizing words).
• Conducting sentiment analysis (classifying positive, negative, or neutral sentiment in text).
• Identifying trending topics and influencers.
• Visualizing social network structures and interactions.

3. HR Analytics for Employee Performance: An HR manager wants to assess employee performance by analyzing data related to key performance indicators (KPIs), training records, and employee feedback.

Pandas use:

• Import and merge HR datasets using Pandas to consolidate information on employee performance, training records, and feedback.
• Aggregate and summarize data to gain insights into performance at various levels, such as
individual, team, or department.
• Apply time-series analysis with Pandas to identify trends in employee performance over
different time periods.
• Utilize Pandas to calculate key performance metrics, such as productivity scores, completion
rates for training programs, and overall performance indicators.
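For illustration, here is a minimal Pandas sketch of the HR workflow described above, assuming hypothetical files performance.csv and training.csv that share an employee_id column (all file and column names are invented):

import pandas as pd

# Load the two HR datasets (file and column names are assumed for illustration)
performance = pd.read_csv("performance.csv")  # employee_id, department, productivity_score
training = pd.read_csv("training.csv")        # employee_id, courses_completed, courses_assigned

# Merge on the shared employee identifier
hr = performance.merge(training, on="employee_id", how="left")

# Training completion rate per employee
hr["completion_rate"] = hr["courses_completed"] / hr["courses_assigned"]

# Aggregate performance at the department level
summary = hr.groupby("department")[["productivity_score", "completion_rate"]].mean()
print(summary)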

1.2 Q2. DESCRIBE THE PRIMARY DATA STRUCTURES IN PANDAS, NAMELY SERIES AND DATAFRAME. EXPLAIN
THE DIFFERENCES AND USE CASES FOR EACH.

Here's a description of Series and DataFrame, the primary data structures in Pandas, along with their
differences and use cases:

Series:

• One-dimensional array of labeled data: Think of it as a single column in a spreadsheet.

• Holds any data type: Numbers, strings, dates, booleans, or even custom objects.

• Two key components:

o Values: The actual data elements.

o Index: A label for each value, often used for selection and alignment.

Use cases:

• Representing a single feature or variable in a dataset.

• Storing time series data (e.g., stock prices over time).

• Acting as a dictionary-like mapping, with meaningful index labels as keys.
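A short sketch of a Series holding time-series data (the prices and dates below are invented):

import pandas as pd

# A Series of daily closing prices: values plus a labeled (datetime) index
prices = pd.Series(
    [101.5, 103.2, 102.8],
    index=pd.to_datetime(["2023-12-01", "2023-12-04", "2023-12-05"]),
    name="close",
)
print(prices["2023-12-04"])  # select a value by its index label -> 103.2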

DataFrame:

• Two-dimensional labeled data structure: Think of it as a spreadsheet or a table.

• Collection of Series objects: Each column is a Series, and each row represents an
observation.

• Can have columns of different data types.

• Three key components:

o Rows: Represent individual observations or records.

o Columns: Represent different variables or features.

o Index: Labels for both rows and columns, enabling flexible access and manipulation.

Use cases:

• Representing tabular data, such as datasets imported from CSV, Excel, or databases.

• Storing and analyzing multivariate data with multiple features.


• Performing operations like filtering, grouping, aggregation, and joining on tabular data.
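A short sketch of a DataFrame and the tabular operations listed above (the sample values are invented):

import pandas as pd

# Each column is a Series; each row is one observation
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "units": [120, 85, 60],
    "price": [9.99, 14.50, 3.25],
})

df["revenue"] = df["units"] * df["price"]  # derived column
print(df[df["units"] > 70])                # filtering rows
print(df["revenue"].sum())                 # aggregation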

2 PART 2: NUMPY

2.1 Q1. WRITE A BRIEF DESCRIPTION OF WHAT NUMPY IS AND WHY IT IS IMPORTANT FOR SCIENTIFIC
COMPUTING AND DATA ANALYSIS IN PYTHON

NumPy, short for Numerical Python, is a fundamental library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays.

Key features and reasons why NumPy is important for scientific computing and data analysis in
Python include:

Efficient Array Operations: NumPy provides a powerful N-dimensional array object (ndarray), which
allows for efficient storage and manipulation of large datasets. The ndarray supports a variety of
data types and enables vectorized operations, which significantly enhances the performance of
numerical computations.

Mathematical Functions: NumPy includes a comprehensive set of mathematical functions that operate element-wise on arrays. These functions are optimized for performance and are crucial for
scientific computations, linear algebra, signal processing, and more. Examples include trigonometric,
logarithmic, statistical, and linear algebra functions.

Broadcasting: NumPy's broadcasting capability enables operations on arrays of different shapes and
sizes, making it easier to perform element-wise operations without the need for explicit loops. This
enhances code readability and reduces the need for unnecessary duplication of data.
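A minimal sketch of broadcasting in action:

import numpy as np

# The 1-D row vector is "stretched" across each row of the 2-D array,
# so the addition happens element-wise without an explicit loop
matrix = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])
print(matrix + row)
# [[10 21 32]
#  [13 24 35]]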

Memory Management: NumPy efficiently manages memory and provides tools for creating views on
arrays without copying data, saving both time and resources. This is particularly beneficial when
working with large datasets, as it minimizes memory overhead.

2.2 Q2. EXPLAIN THE SIGNIFICANCE OF NUMPY IN TERMS OF PERFORMANCE AND EFFICIENCY WHEN WORKING
WITH LARGE DATASETS AND NUMERICAL COMPUTATIONS.

When it comes to handling large datasets and complex numerical computations in Python, NumPy
reigns supreme in terms of performance and efficiency. Here's why:

1. Memory Efficiency:

▪ Contiguous Memory Layout: NumPy stores data in contiguous blocks of memory, unlike
Python lists which can be scattered. This allows for faster access and manipulation of
elements as data doesn't need to be searched across memory fragments.
▪ Optimized Data Types: NumPy offers specialized data types like float64 or int32 designed for
numerical operations. These are more compact and efficient than generic Python types like
"float" or "int", reducing memory footprint and boosting processing speed.

2. Vectorized Operations:

▪ Single Instruction, Multiple Data (SIMD): NumPy leverages vectorized operations, utilizing
SIMD instructions on modern CPUs. This allows performing the same operation on multiple
data elements simultaneously, leading to significant speedups compared to looping over
elements one by one.
▪ Broadcasting: NumPy automatically broadcasts operations between arrays of different sizes,
eliminating the need for manual loop-based iteration and further enhancing performance.
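A rough sketch comparing a pure-Python loop with the vectorized equivalent (exact timings depend on the machine):

import time
import numpy as np

data = np.random.rand(1_000_000)

# Pure-Python loop over each element
start = time.perf_counter()
total = 0.0
for x in data:
    total += x
loop_time = time.perf_counter() - start

# Vectorized NumPy equivalent of the same reduction
start = time.perf_counter()
total = data.sum()
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")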

3. C-optimized Backend:

NumPy relies heavily on optimized C code under the hood, making it significantly faster than pure
Python implementations. This C code takes advantage of hardware capabilities and low-level
memory access, further pushing the boundaries of performance.

4. Reduced Code Complexity:

Concise syntax: NumPy provides vectorized functions and operators that eliminate the need for long
and intricate loops, simplifying code and making it more readable. This not only improves efficiency
but also reduces the risk of errors.

Overall, NumPy's efficiency advantages translate to:

• Faster execution times: Analyzing large datasets and performing complex calculations
become significantly faster with NumPy compared to pure Python or other less optimized
libraries.
• Reduced CPU and memory usage: Smaller memory footprint and efficient computations
translate to lower resource consumption, enabling smooth processing of even massive
datasets on smaller machines.
• Simplified code and easier maintenance: Concise and readable code thanks to vectorization
improves maintainability and reduces debugging time.
UNIT 5
3 DATA VISUALIZATION

3.1 Q1. CREATE A MATPLOTLIB BAR PLOT SHOWING THE SALES OF PRODUCTS IN A STORE FOR A GIVEN
MONTH. LABEL THE AXES, ADD A TITLE, AND CUSTOMIZE THE APPEARANCE (E.G., COLOUR, WIDTH)

Code:
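A minimal sketch that answers the question, using invented sales figures for one month:

import matplotlib.pyplot as plt

# Hypothetical sales figures for one month (values invented for illustration)
products = ["Laptops", "Phones", "Tablets", "Headphones"]
sales = [150, 320, 110, 210]

plt.figure(figsize=(8, 5))
plt.bar(products, sales, color="steelblue", width=0.6, edgecolor="black")  # customized colour and width
plt.xlabel("Product")
plt.ylabel("Units Sold")
plt.title("Product Sales for December")
plt.tight_layout()
plt.show()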
3.2 Q2. PROVIDE AT LEAST THREE EXAMPLES OF DATA VISUALIZATION SCENARIOS WHERE SEABORN IS THE
PREFERRED LIBRARY OVER MATPLOTLIB. DESCRIBE THE TYPE OF PLOTS OR CHARTS INVOLVED AND WHY
SEABORN IS A BETTER CHOICE.

Here are three examples where Seaborn often excels over Matplotlib for specific visualization tasks:
1. Visualizing Statistical Relationships:

• Plot types: Pair plots, joint plots, distributions, heatmaps, violin plots

• Why Seaborn is better:

o Simplifies creation of multi-faceted plots with minimal code.

o Integrates statistical estimation and visual representation for informative plots.

o Automatically handles data alignment and labeling for complex relationships.

o Offers aesthetically pleasing default styles and color palettes.

Example: Visualizing correlations between multiple variables in a dataset using a pair plot, revealing
patterns and potential interactions.
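A minimal sketch, using Seaborn's built-in iris sample dataset (downloaded on first use):

import seaborn as sns
import matplotlib.pyplot as plt

# One call produces a full matrix of pairwise scatter plots plus
# per-variable distributions, colored by species
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")
plt.show()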
2. Exploring Categorical Data:

• Plot types: Bar plots, box plots, violin plots, strip plots, point plots

• Why Seaborn is better:

o Simplifies visualization of multi-categorical data with automatic grouping and visual distinction.

o Provides informative summaries of distributions within categories, highlighting outliers and central tendencies.

o Offers built-in estimation of confidence intervals and statistical significance.

Example: Comparing distributions of customer satisfaction scores across different product categories
using box plots to identify potential issues.
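A minimal sketch, with Seaborn's built-in tips dataset standing in for satisfaction scores per category:

import seaborn as sns
import matplotlib.pyplot as plt

# Box plots summarize the distribution (median, quartiles, outliers)
# of a numeric value within each category
tips = sns.load_dataset("tips")
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()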
3. Handling Data with Facets:

• Plot types: FacetGrid, relplot, catplot, jointplot

• Why Seaborn is better:

o Simplifies creation of multi-panel plots to visualize relationships across multiple variables or groups.

o Automatically handles data splitting and layout, ensuring consistent visual comparisons.

o Offers flexible customization options for facet arrangement and labeling.

Example: Comparing sales trends across regions and product categories using a faceted line plot to
identify regional differences and potential market opportunities.
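A minimal sketch, with Seaborn's built-in flights dataset standing in for regional sales data:

import seaborn as sns
import matplotlib.pyplot as plt

# relplot splits the data into one panel per category (here, month)
# and draws a consistent line plot in each facet
flights = sns.load_dataset("flights")
sns.relplot(data=flights, x="year", y="passengers", col="month", col_wrap=4, kind="line")
plt.show()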
In summary, Seaborn shines when:

• Statistical relationships and distributions are central to the analysis.

• Categorical data requires clear comparisons and summaries.

• Multi-faceted visualizations are needed to explore complex interactions.

• Aesthetic appeal and concise code are desired for effective communication.

UNIT 6
4 PLOTLY

4.1 Q1. DESCRIBE THE THREE KEY STRUCTURES IN PLOTLY: FIGURE, DATA, AND LAYOUT. EXPLAIN THE PURPOSE OF EACH STRUCTURE IN CREATING VISUALIZATIONS.

1. Figure:
• The overall container: It acts as the canvas or window that holds all the elements of your
visualization.

• Foundation for visual elements: It provides the space where you'll create and arrange
plots, axes, titles, legends, annotations, and other visual components.

• Management and customization: It allows you to manage the overall size, aspect
ratio, background color, and other stylistic properties of the entire visualization.

2. Data:
• The heart of the visualization: It consists of the numerical values, categorical information, or
text that you want to visualize.

• Source and format: It can come from various sources like arrays, dataframes, or external
files, and it's typically structured in a format that visualization libraries can understand.

• Mapping to visual elements: It's used to create the visual representations within the
figure, such as bars in a bar chart, lines in a line plot, or points in a scatter plot.

3. Layout:
• Organization and arrangement: It determines the spatial arrangement of visual elements
within the figure, ensuring clarity and readability.

• Grid-based or hierarchical: It can involve a grid-based layout for multiple subplots or a hierarchical structure for nested plots.

• Customization and control: It allows you to adjust spacing, margins, alignment, and the
overall visual hierarchy of elements to effectively guide the viewer's attention.
How they work together:

1. Create a figure: You typically start by creating a figure object to establish the overall
container for your visualization.

2. Load and prepare data: You then load your data, ensuring it's in a suitable format for the
visualization library you're using.

3. Map data to visual elements: You create visual elements like plots, axes, and
markers, mapping the data to their properties (e.g., x-axis values, y-axis values, colors, sizes).

4. Arrange elements within layout: You position and organize these visual elements within the
figure using layout tools, ensuring a clear and informative presentation.

5. Customize appearance: You can apply stylistic choices to both the figure and individual
elements to enhance readability and visual appeal.
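A minimal sketch of the three structures working together in Plotly (the values are invented):

import plotly.graph_objects as go

# Data: a trace object that maps values to a visual form (a bar chart here)
trace = go.Bar(x=["A", "B", "C"], y=[10, 15, 7])

# Layout: titles, axis labels, and other arrangement/styling of the figure
layout = go.Layout(
    title="Example Figure",
    xaxis=dict(title="Category"),
    yaxis=dict(title="Value"),
)

# Figure: the overall container that combines the data and the layout
fig = go.Figure(data=[trace], layout=layout)
fig.show()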

4.2 Q2. LOAD A SALES DATASET WITH A 'SALES' COLUMN AND CREATE A PLOTLY LINE CHART TO VISUALIZE THE
TOTAL SALES TREND. INCLUDE AXIS LABELS, A TITLE, AND CUSTOMIZE THE APPEARANCE.
Code:
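A minimal sketch answering the question; since no dataset is attached here, a small 'Sales' dataset is invented for illustration:

import pandas as pd
import plotly.graph_objects as go

# Hypothetical sales dataset with a 'Sales' column (values invented)
df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "Sales": [1200, 1350, 1100, 1600, 1750, 1900],
})

fig = go.Figure(
    data=[go.Scatter(
        x=df["Month"], y=df["Sales"],
        mode="lines+markers",
        line=dict(color="royalblue", width=3),  # customized appearance
    )],
    layout=go.Layout(
        title="Total Sales Trend",
        xaxis=dict(title="Month"),
        yaxis=dict(title="Sales"),
    ),
)
fig.show()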
