1. Structured Data: Structured data is organized according to a predefined schema, typically in rows and columns with fixed data types. Common examples include:
Relational databases: Data stored in tables with predefined columns and data types, such as customer information in a CRM system.
Spreadsheets: Excel files containing sales data with columns for date, product,
quantity, and price.
Transaction records: Bank transactions stored in a database with columns for
transaction date, amount, account number, and transaction type.
Structured data is typically analyzed using SQL queries and traditional analytics techniques, such as aggregation, filtering, and joining tables; a short pandas sketch of these operations follows this list.
2. Unstructured Data: Unstructured data refers to data that lacks a predefined
schema or organization. It doesn't fit neatly into rows and columns and can come
in various formats, including text documents, images, videos, audio files, social
media posts, emails, and more. Unstructured data is often human-generated and
can contain valuable insights, but it requires advanced techniques for analysis
due to its complexity and variability.
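To make the structured-data techniques above concrete, here is a minimal pandas sketch; the sales and products tables and their column names are hypothetical, chosen only to illustrate filtering, joining, and aggregation:

```python
import pandas as pd

# Hypothetical sales table with a predefined schema
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02"]),
    "product_id": [1, 2, 1],
    "quantity": [3, 1, 2],
    "price": [9.99, 24.50, 9.99],
})
products = pd.DataFrame({"product_id": [1, 2], "name": ["pen", "notebook"]})

# Filtering: keep orders with at least two units
bulk = sales[sales["quantity"] >= 2]

# Joining: attach product names, like a SQL LEFT JOIN
joined = bulk.merge(products, on="product_id", how="left")

# Aggregation: total revenue per product, like SQL GROUP BY ... SUM
revenue = joined.assign(revenue=joined["quantity"] * joined["price"])
print(revenue.groupby("name")["revenue"].sum())
```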
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves
examining and understanding the characteristics of a dataset before applying more
complex statistical techniques. EDA helps data scientists gain insights into the data,
identify patterns, detect outliers, and formulate hypotheses for further investigation.
Three statistical notions used throughout EDA are central tendency, dispersion, and proximity:
Central Tendency:
Central tendency describes how data cluster around a central or typical value. It summarizes the "average" or "typical" value of a dataset.
There are several measures of central tendency, including:
Mean: The arithmetic average of a set of values calculated by summing all values
and dividing by the total number of values.
Median: The middle value in a sorted list of values. It separates the higher half
from the lower half of the data.
Mode: The value that appears most frequently in a dataset.
Central tendency measures help summarize the data and provide insights into its typical
values or behaviors.
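A quick sketch in pandas, using a small made-up sample, shows how the three measures can disagree when the data contain an extreme value:

```python
import pandas as pd

values = pd.Series([2, 3, 3, 5, 8, 100])  # made-up sample with one extreme value

print(values.mean())     # ~20.17 -- pulled upward by the 100
print(values.median())   # 4.0   -- middle of the sorted values
print(values.mode()[0])  # 3     -- most frequent value
```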
Dispersion:
Dispersion, also known as variability or spread, measures the extent to which data points
deviate from the central tendency. It provides information about the spread or
distribution of values within a dataset. Common measures of dispersion include:
Range: The difference between the maximum and minimum values in a dataset.
Variance: The average of the squared differences between each data point and
the mean. It quantifies the overall variability of the data.
Standard Deviation: The square root of the variance. It measures the average
distance of data points from the mean and provides a more interpretable
measure of dispersion.
Dispersion measures help assess the degree of variability within the data and provide
insights into its consistency or variability.
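Continuing with the same made-up sample, the dispersion measures can be computed directly in pandas (note that pandas returns the sample variance, dividing by n - 1 rather than n):

```python
import pandas as pd

values = pd.Series([2, 3, 3, 5, 8, 100])

print(values.max() - values.min())  # range: 98
print(values.var())                 # sample variance (pandas uses ddof=1)
print(values.std())                 # standard deviation, in the data's own units
```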
Proximity:
Proximity, in the context of EDA, refers to the degree of closeness or similarity between
data points or observations. It measures the relationship or distance between individual
data points in a dataset. Proximity can be quantified using various distance metrics, such
as Euclidean distance, Manhattan distance, or cosine similarity.
Proximity measures are commonly used in clustering analysis, pattern recognition, and
similarity-based recommendation systems to identify groups or patterns within the data
based on their proximity to each other.
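As a minimal illustration, assuming SciPy is available, the three metrics can be computed on a pair of toy vectors:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(a, b))  # straight-line distance: ~3.74
print(distance.cityblock(a, b))  # Manhattan distance: 6.0
print(distance.cosine(a, b))     # cosine distance (1 - similarity): 0.0, same direction
```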
Q NO 4:
What are the core points of data pre-processing, and why is it necessary to pre-process data
before model training?
Data preprocessing is a crucial step in the data analysis pipeline that involves cleaning,
transforming, and preparing raw data into a format suitable for analysis and modeling.
Here are the core points of data preprocessing:
1. Data Cleaning:
Handling missing values: Identifying and dealing with missing data by
imputation (replacing missing values with estimated ones) or deletion
(removing rows or columns with missing values).
Removing duplicates: Identifying and removing duplicate records to
ensure data integrity and prevent bias in analysis.
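A minimal pandas sketch of both cleaning steps, on a made-up table with hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 31, 31],
    "city": ["Lahore", "Karachi", "Multan", "Multan"],  # last two rows are duplicates
})

# Imputation: replace the missing age with the median age
df["age"] = df["age"].fillna(df["age"].median())
# (Deletion is the alternative: df.dropna() removes rows with missing values)

# Remove exact duplicate records
df = df.drop_duplicates()
```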
2. Data Transformation:
Encoding categorical variables: Converting categorical variables into
numerical representations suitable for machine learning algorithms using
techniques like one-hot encoding or label encoding.
Scaling and normalization: Standardizing numerical features to a common
scale to prevent features with larger magnitudes from dominating the
analysis. Common techniques include min-max scaling and z-score
normalization.
Feature engineering: Creating new features or transforming existing ones
to better represent the underlying patterns in the data. This may involve
mathematical transformations, aggregation, or extracting information from
text or temporal data.
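A short sketch of these transformations, assuming scikit-learn is available and using hypothetical columns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"color": ["red", "blue", "red"], "price": [10.0, 200.0, 55.0]})

# Encoding: one-hot encode the categorical variable
df = pd.get_dummies(df, columns=["color"])

# Scaling: min-max to [0, 1] and z-score standardization
df["price_minmax"] = MinMaxScaler().fit_transform(df[["price"]]).ravel()
df["price_zscore"] = StandardScaler().fit_transform(df[["price"]]).ravel()

# Feature engineering: a derived log feature to tame the skew
df["log_price"] = np.log1p(df["price"])
```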
3. Data Reduction:
Dimensionality reduction: Reducing the number of features in the dataset
to mitigate the curse of dimensionality and improve computational
efficiency. Techniques like principal component analysis (PCA) or feature
selection methods can be employed for this purpose.
Sampling: If dealing with large datasets, sampling techniques like random
sampling or stratified sampling may be used to reduce the dataset's size
while preserving its characteristics.
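A brief sketch of both reduction techniques on synthetic data, assuming scikit-learn's PCA:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 10)))

# Dimensionality reduction: keep enough components to explain 90% of the variance
reduced = PCA(n_components=0.9).fit_transform(df)
print(reduced.shape)  # fewer than the original 10 columns

# Sampling: a 10% random sample to shrink the dataset
sample = df.sample(frac=0.1, random_state=0)
# Stratified sampling would instead use df.groupby(key_column).sample(frac=0.1)
```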
4. Handling Outliers:
Detecting and treating outliers: Identifying outliers that may skew analysis
results and deciding whether to remove, transform, or impute them based
on domain knowledge and the specific context of the data.
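One common detection rule, among several, is the 1.5 × IQR fence; a minimal sketch on made-up fares:

```python
import pandas as pd

fares = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33])

# Flag values outside the 1.5 * IQR fences
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
mask = (fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)
print(fares[mask])  # 512.33 is flagged as an outlier
```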
5. Data Integration and Aggregation:
Integrating data from multiple sources: Combining data from different
sources or databases to create a unified dataset for analysis.
Aggregating data: Summarizing and aggregating data at different levels of
granularity (e.g., daily, weekly, or monthly) to extract meaningful insights
and patterns.
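A minimal pandas sketch of integration and aggregation, with hypothetical orders and customers tables:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 1],
    "date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-15"]),
    "amount": [100.0, 40.0, 60.0],
})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["north", "south"]})

# Integration: combine two sources into one unified dataset
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregation: weekly totals per region
weekly = (merged.set_index("date")
                .groupby("region")["amount"]
                .resample("W").sum())
print(weekly)
```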
6. Data Formatting:
Ensuring data consistency and format: Standardizing data formats, units,
and conventions across different variables to facilitate analysis and
interpretation.
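A short sketch of formatting fixes on made-up data; the to_grams helper is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "signup": ["2024-01-05", "2024-01-09"],
    "weight": ["1.2kg", "800g"],  # inconsistent units
})

# Standardize types: parse date strings into a proper datetime dtype
df["signup"] = pd.to_datetime(df["signup"])

def to_grams(value: str) -> float:
    # Hypothetical helper: normalize mixed weight units to grams
    return float(value[:-2]) * 1000 if value.endswith("kg") else float(value[:-1])

df["weight_g"] = df["weight"].map(to_grams)
```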
Q NO 5:
Observe the attributes of the Titanic dataset and mention some of the most important
attributes. Mention your reasons why they are important. (20 Marks) Which attributes are
not important, and why? Write some initial observations about the data. Is it a cleaned
dataset? If not, then which EDA steps need to be performed to make it clean? Mention the
target variable and specify whether the problem is classification or regression.
Important attributes:
1. Survived: This attribute indicates whether a passenger survived or not. It is
crucial as it serves as the target variable for predictive modeling. Understanding
factors influencing survival can help improve safety measures in similar situations.
2. Pclass: This attribute represents the passenger class (1st, 2nd, or 3rd). It is
important as it may correlate with socio-economic status, which could influence
survival rates. Higher-class passengers might have had better access to lifeboats
or other resources.
3. Sex: The gender of the passenger is significant as it could impact survival rates
due to societal norms and priority given to women and children during rescue
operations.
4. Age: Age is important as it may influence survival rates. Children and elderly
passengers might have had different chances of survival compared to adults due
to their physical condition or priority during evacuation.
5. Fare: Fare paid by the passenger may reflect their socio-economic status, which
could correlate with survival rates.
6. Embarked: The port of embarkation could indirectly indicate the passenger's
socio-economic background or travel circumstances, which might affect their
survival chances.
Less important attributes:
1. PassengerId: This attribute is merely an identifier and does not provide any
meaningful information for predictive modeling.
2. Ticket: Ticket numbers might not directly influence survival and may not provide
valuable insights unless further feature engineering is performed.
3. Cabin: While cabin numbers could potentially indicate proximity to lifeboats or
other factors influencing survival, the dataset contains many missing values for
this attribute, which may limit its usefulness.
Initial observations about the data suggest that it may not be entirely clean. Missing
values are present in attributes such as Age, Cabin, and Embarked. Thus, the following
EDA steps need to be performed to make it clean:
1. Handling Missing Values: Missing values in attributes like Age, Cabin, and
Embarked need to be addressed through techniques like imputation or deletion,
depending on the extent of missingness and the importance of the attribute.
2. Outlier Detection: Identifying and handling outliers, especially in numerical
attributes like Age and Fare, to prevent them from skewing analysis results.
3. Data Transformation: Encoding categorical variables like Sex and Embarked into
numerical representations for modeling purposes.
4. Feature Engineering: Creating new features from existing ones or extracting
additional information from attributes like Name or Ticket to enhance the
predictive power of the model.
The target variable is Survived. Because it takes only two values (survived or did not
survive), this is a binary classification problem rather than a regression problem.
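A minimal sketch of these cleaning steps, assuming the standard Kaggle Titanic columns (Age, Embarked, Cabin, Fare, Name, Sex) and a local CSV copy; the file name is hypothetical:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed local copy of the Kaggle training set

# Missing values: impute Age with the median and Embarked with the mode
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Cabin is mostly missing, so dropping it is a common choice
df = df.drop(columns=["Cabin"])

# Outliers: cap extreme fares at the 99th percentile rather than deleting rows
df["Fare"] = df["Fare"].clip(upper=df["Fare"].quantile(0.99))

# Feature engineering: extract the title ("Mr", "Mrs", ...) from Name
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Encoding: convert categorical variables to numeric indicators
df = pd.get_dummies(df, columns=["Sex", "Embarked", "Title"], drop_first=True)
```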