
Q NO 1:

Define Data Science. Explain the important key points of data science.


Data Science is an interdisciplinary field that utilizes scientific methods, algorithms,
processes, and systems to extract knowledge and insights from structured and
unstructured data. It combines various domains such as statistics, computer science,
mathematics, and domain-specific knowledge to uncover patterns, trends, correlations,
and other valuable information from data.

Here are some important key points about data science (a short illustrative code sketch
follows the list):

1. Interdisciplinary Approach: Data science integrates techniques and theories
from multiple disciplines, including statistics, mathematics, computer science, and
domain expertise from various fields such as business, healthcare, finance, and
more.
2. Data Acquisition and Collection: Data scientists gather, extract, and collect data
from various sources including databases, APIs, sensors, social media, websites,
and other platforms. This process involves cleaning and preprocessing data to
ensure its quality and suitability for analysis.
3. Data Exploration and Visualization: Exploring and visualizing data help in
understanding its underlying patterns, trends, and relationships. Data
visualization techniques such as charts, graphs, and dashboards are used to
communicate insights effectively.
4. Statistical Analysis and Modeling: Data scientists apply statistical methods and
machine learning algorithms to analyze data and build predictive models. These
models can be used for forecasting, classification, clustering, anomaly detection,
and other tasks to solve real-world problems.
5. Big Data Technologies: With the exponential growth of data, data scientists
often work with big data technologies such as Hadoop, Spark, and NoSQL
databases to handle and process large volumes of data efficiently.
6. Ethical Considerations and Privacy: Data scientists need to consider ethical
implications and privacy concerns when dealing with sensitive data. They must
adhere to ethical guidelines and regulations to ensure responsible data handling
and usage.
7. Continuous Learning and Adaptation: Data science is a rapidly evolving field
with new techniques, algorithms, and tools emerging constantly. Data scientists
need to stay updated with the latest advancements and continuously learn new
skills to remain effective in their roles.
8. Communication Skills: Effective communication is essential for data scientists to
convey their findings and insights to stakeholders, including non-technical
audiences. They must be able to translate complex technical concepts into
actionable insights and recommendations.
9. Business Understanding: Data scientists must have a deep understanding of the
business or domain they are working in to align data analysis with organizational
goals and priorities. This involves collaborating closely with stakeholders to
identify key business questions and opportunities.
10. Iterative Process: Data science projects often follow an iterative process
involving data exploration, model building, evaluation, and refinement. This
iterative approach allows data scientists to continuously improve models and
insights based on feedback and new data.
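
To make several of these points concrete (data acquisition, exploration, modeling, and
iteration), here is a minimal Python sketch. The file name customers.csv and the columns
age, income, and churned are invented for illustration only.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # 1. Data acquisition: load a (hypothetical) structured dataset
    df = pd.read_csv("customers.csv")

    # 2. Data exploration: summary statistics and missing-value counts
    print(df.describe())
    print(df.isna().sum())

    # 3. Modeling: predict a binary outcome from numeric features (invented columns)
    X = df[["age", "income"]].fillna(df[["age", "income"]].median())
    y = df["churned"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression().fit(X_train, y_train)

    # 4. Evaluation: results feed back into the next iteration of the process
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
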
Q NO 2:
What are the different types of data? Provide real-world examples of each type.
How do structured, unstructured, and semi-structured data differ in terms of
organization and analysis methods?
1. Structured Data: Structured data refers to data that has a well-defined schema
and is organized in a tabular format with rows and columns. Each column
typically represents a specific attribute or variable, and each row represents a
single record or observation. Structured data is highly organized and can be
easily processed, queried, and analyzed using traditional database management
systems (DBMS) and structured query language (SQL).

Real-world examples of structured data include:

● Relational databases: Data stored in tables with predefined columns and data
types, such as customer information in a CRM system.
● Spreadsheets: Excel files containing sales data with columns for date, product,
quantity, and price.
● Transaction records: Bank transactions stored in a database with columns for
transaction date, amount, account number, and transaction type.

Structured data is typically analyzed using SQL queries and traditional analytics
techniques, such as aggregation, filtering, and joining tables.
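
For example, the aggregation and filtering described above can be done either with SQL
or with pandas in Python; the following sketch uses an invented sales table with the date,
product, quantity, and price columns mentioned above.

    import pandas as pd

    # Invented structured sales data (columns as in the spreadsheet example above)
    sales = pd.DataFrame({
        "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
        "product": ["A", "B", "A"],
        "quantity": [10, 5, 7],
        "price": [2.5, 4.0, 2.5],
    })

    # Filtering and aggregation: the pandas equivalent of a SQL GROUP BY query
    sales["revenue"] = sales["quantity"] * sales["price"]
    revenue_by_product = sales.groupby("product")["revenue"].sum()
    print(revenue_by_product)
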
2. Unstructured Data: Unstructured data refers to data that lacks a predefined
schema or organization. It doesn't fit neatly into rows and columns and can come
in various formats, including text documents, images, videos, audio files, social
media posts, emails, and more. Unstructured data is often human-generated and
can contain valuable insights, but it requires advanced techniques for analysis
due to its complexity and variability.

Real-world examples of unstructured data include:

● Text documents: Articles, reports, emails, and social media posts.
● Multimedia files: Images, videos, and audio recordings.
● Sensor data: Raw data from IoT devices or sensors, such as temperature readings
or GPS coordinates.
● Webpages: HTML documents containing text, images, and hyperlinks.

Analyzing unstructured data requires techniques such as natural language processing
(NLP), computer vision, audio processing, and sentiment analysis. Machine learning and
deep learning algorithms are often used to extract meaningful information from
unstructured data sources.
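
As a small illustration, the sketch below turns a handful of invented free-text posts into a
numeric term-frequency matrix, a common first step before applying NLP or machine
learning techniques; scikit-learn's CountVectorizer is used here, though many other tools
would work.

    from sklearn.feature_extraction.text import CountVectorizer

    # Invented examples of unstructured text (e.g., social media posts)
    posts = [
        "Great service, very happy with the product!",
        "Terrible experience, the delivery was late.",
        "The product is okay, nothing special.",
    ]

    # Convert free text into a structured term-frequency matrix for further analysis
    vectorizer = CountVectorizer(stop_words="english")
    term_matrix = vectorizer.fit_transform(posts)

    print(vectorizer.get_feature_names_out())  # vocabulary discovered in the text
    print(term_matrix.toarray())               # rows = posts, columns = term counts
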

3. Semi-Structured Data: Semi-structured data falls somewhere between
structured and unstructured data. It has some organization, but it doesn't
conform to the strict schema of structured data. Semi-structured data may
contain tags, metadata, or other markers that provide a partial structure, making
it more flexible than structured data but less chaotic than unstructured data.

Real-world examples of semi-structured data include:

● XML (eXtensible Markup Language) files: Documents with hierarchical
structures and tags, such as configuration files or RSS feeds.
● JSON (JavaScript Object Notation) files: Data interchange format commonly
used in web development for storing and transmitting structured data.
● Log files: Records of system activities or events, often in a text-based format with
some structured elements.

Analyzing semi-structured data may involve techniques such as parsing, extracting
relevant information, and transforming data into a more structured format for further
analysis. NoSQL databases and document-oriented databases are commonly used to
store and manage semi-structured data.
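
As an illustration of the parsing and flattening described above, the following Python
sketch converts one small, invented JSON log record into a flat tabular row.

    import json
    import pandas as pd

    # Invented semi-structured record, e.g. one entry from a JSON log file
    raw = '{"user": {"id": 42, "name": "Alice"}, "event": "login", "timestamp": "2024-01-01T10:00:00"}'
    record = json.loads(raw)

    # Flatten the nested structure into a tabular (structured) form
    flat = pd.json_normalize(record)
    print(flat)  # columns: user.id, user.name, event, timestamp
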
Q NO 3:
What is the importance of EDA in data analysis? (20 Marks)
● Central Tendency (explore and define it)
● Dispersion (explore and define it)
● Proximity (explore and define it)

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves
examining and understanding the characteristics of a dataset before applying more
complex statistical techniques. EDA helps data scientists gain insights into the data,
identify patterns, detect outliers, and formulate hypotheses for further investigation.
Here's why EDA is important:

1. Understanding the Data: EDA allows data scientists to gain a deep
understanding of the dataset they are working with. By exploring the data's
structure, distributions, and relationships between variables, they can identify
potential challenges and limitations that may impact subsequent analysis.
2. Data Quality Assessment: EDA helps in assessing the quality and integrity of the
data. By examining summary statistics, visualizing distributions, and detecting
missing or erroneous values, data scientists can identify data errors,
inconsistencies, or anomalies that need to be addressed before further analysis.
3. Pattern Discovery: EDA helps uncover patterns, trends, and relationships within
the data. By visualizing data using plots, charts, and graphs, data scientists can
identify correlations between variables, understand how variables interact with
each other, and uncover hidden insights that may not be apparent initially.
4. Identifying Outliers and Anomalies: EDA allows for the detection of outliers
and anomalies in the data. Outliers can skew statistical measures and distort
analysis results, so identifying and understanding their nature is essential for
accurate analysis and modeling.
5. Hypothesis Generation: EDA facilitates the formulation of hypotheses for further
analysis. By exploring the data and identifying interesting patterns or
relationships, data scientists can generate hypotheses about potential cause-and-
effect relationships or factors influencing certain outcomes, which can then be
tested using more advanced statistical methods.

Now, let's define and explore three important concepts in EDA:

Central Tendency:
Central tendency refers to the tendency of data to cluster around a central value or
typical value. It provides information about the "average" or "typical" value of a dataset.
There are several measures of central tendency, including:

● Mean: The arithmetic average of a set of values calculated by summing all values
and dividing by the total number of values.
● Median: The middle value in a sorted list of values. It separates the higher half
from the lower half of the data.
● Mode: The value that appears most frequently in a dataset.

Central tendency measures help summarize the data and provide insights into its typical
values or behaviors.
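
These measures can be computed directly with Python's built-in statistics module; the
small list of values below is purely illustrative.

    import statistics

    values = [2, 3, 3, 5, 7, 9, 11]  # illustrative data

    print("mean:  ", statistics.mean(values))    # arithmetic average
    print("median:", statistics.median(values))  # middle value of the sorted list
    print("mode:  ", statistics.mode(values))    # most frequent value (here, 3)
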

Dispersion:

Dispersion, also known as variability or spread, measures the extent to which data points
deviate from the central tendency. It provides information about the spread or
distribution of values within a dataset. Common measures of dispersion include:

● Range: The difference between the maximum and minimum values in a dataset.
● Variance: The average of the squared differences between each data point and
the mean. It quantifies the overall variability of the data.
● Standard Deviation: The square root of the variance. It measures the average
distance of data points from the mean and provides a more interpretable
measure of dispersion.

Dispersion measures help assess the degree of variability within the data and provide
insights into its consistency or variability.
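
Continuing with the same illustrative list of values, the sketch below computes the range,
variance, and standard deviation. Note that statistics.variance and statistics.stdev use the
sample (n - 1) formulas, while pvariance and pstdev use the population formulas.

    import statistics

    values = [2, 3, 3, 5, 7, 9, 11]  # same illustrative data as above

    print("range:              ", max(values) - min(values))
    print("sample variance:    ", statistics.variance(values))   # divides by n - 1
    print("sample std dev:     ", statistics.stdev(values))      # square root of the variance
    print("population variance:", statistics.pvariance(values))  # divides by n
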

Proximity:

Proximity, in the context of EDA, refers to the degree of closeness or similarity between
data points or observations. It measures the relationship or distance between individual
data points in a dataset. Proximity can be quantified using various distance metrics, such
as Euclidean distance, Manhattan distance, or cosine similarity.

Proximity measures are commonly used in clustering analysis, pattern recognition, and
similarity-based recommendation systems to identify groups or patterns within the data
based on their proximity to each other.
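
The distance metrics mentioned above can be computed in a few lines of Python; the two
example points are invented.

    import math

    a = [1.0, 2.0, 3.0]  # invented data points
    b = [4.0, 0.0, 3.0]

    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    manhattan = sum(abs(x - y) for x, y in zip(a, b))
    dot = sum(x * y for x, y in zip(a, b))
    cosine_similarity = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    print("Euclidean distance:", euclidean)  # sqrt(9 + 4 + 0) ≈ 3.61
    print("Manhattan distance:", manhattan)  # 3 + 2 + 0 = 5
    print("Cosine similarity: ", cosine_similarity)
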
Q NO 4:

What are the core points of data pre-processing, and why is it necessary to pre-process
data before model training?

Data preprocessing is a crucial step in the data analysis pipeline that involves cleaning,
transforming, and preparing raw data into a format suitable for analysis and modeling. It
is necessary before model training because raw data typically contains missing values,
inconsistent formats, mixed scales, and noise, and models trained on such data produce
unreliable or biased results. Here are the core points of data preprocessing (a minimal
code sketch follows the list):

1. Data Cleaning:
● Handling missing values: Identifying and dealing with missing data by
imputation (replacing missing values with estimated ones) or deletion
(removing rows or columns with missing values).
● Removing duplicates: Identifying and removing duplicate records to
ensure data integrity and prevent bias in analysis.
2. Data Transformation:
● Encoding categorical variables: Converting categorical variables into
numerical representations suitable for machine learning algorithms using
techniques like one-hot encoding or label encoding.
● Scaling and normalization: Standardizing numerical features to a common
scale to prevent features with larger magnitudes from dominating the
analysis. Common techniques include min-max scaling and z-score
normalization.
● Feature engineering: Creating new features or transforming existing ones
to better represent the underlying patterns in the data. This may involve
mathematical transformations, aggregation, or extracting information from
text or temporal data.
3. Data Reduction:
● Dimensionality reduction: Reducing the number of features in the dataset
to mitigate the curse of dimensionality and improve computational
efficiency. Techniques like principal component analysis (PCA) or feature
selection methods can be employed for this purpose.
● Sampling: If dealing with large datasets, sampling techniques like random
sampling or stratified sampling may be used to reduce the dataset's size
while preserving its characteristics.
4. Handling Outliers:
● Detecting and treating outliers: Identifying outliers that may skew analysis
results and deciding whether to remove, transform, or impute them based
on domain knowledge and the specific context of the data.
5. Data Integration and Aggregation:
● Integrating data from multiple sources: Combining data from different
sources or databases to create a unified dataset for analysis.
● Aggregating data: Summarizing and aggregating data at different levels of
granularity (e.g., daily, weekly, or monthly) to extract meaningful insights
and patterns.
6. Data Formatting:
● Ensuring data consistency and format: Standardizing data formats, units,
and conventions across different variables to facilitate analysis and
interpretation.
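
The following minimal scikit-learn sketch illustrates a few of these steps (median
imputation, one-hot encoding, and standard scaling) on a small invented DataFrame; the
column names and values are assumptions made only for illustration.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Invented raw data with a missing value and a categorical column
    df = pd.DataFrame({
        "age": [25, 32, None, 41],
        "income": [40000, 52000, 61000, 58000],
        "city": ["Lahore", "Karachi", "Lahore", "Islamabad"],
    })

    # Numeric columns: impute missing values with the median, then standardize (z-score)
    numeric_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])

    # Categorical columns: one-hot encode
    preprocess = ColumnTransformer([
        ("num", numeric_pipeline, ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])

    X = preprocess.fit_transform(df)
    print(X)  # cleaned, encoded, and scaled feature matrix ready for model training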

Q NO 5:
Observe the attributes of the Titanic dataset and mention some of the most important
attributes, giving your reasons why they are important. (20 Marks) Which attributes are
not important, and why? Write some initial observations about the data. Is it a cleaned
dataset? If not, which EDA steps need to be performed to make it clean? Mention the
target variable and specify whether the problem is classification or regression.
1. Survived: This attribute indicates whether a passenger survived or not. It is
crucial as it serves as the target variable for predictive modeling. Understanding
factors influencing survival can help improve safety measures in similar situations.
2. Pclass: This attribute represents the passenger class (1st, 2nd, or 3rd). It is
important as it may correlate with socio-economic status, which could influence
survival rates. Higher-class passengers might have had better access to lifeboats
or other resources.
3. Sex: The gender of the passenger is significant as it could impact survival rates
due to societal norms and priority given to women and children during rescue
operations.
4. Age: Age is important as it may influence survival rates. Children and elderly
passengers might have had different chances of survival compared to adults due
to their physical condition or priority during evacuation.
5. Fare: Fare paid by the passenger may reflect their socio-economic status, which
could correlate with survival rates.
6. Embarked: The port of embarkation could indirectly indicate the passenger's
socio-economic background or travel circumstances, which might affect their
survival chances.

Attributes that might not be as important include:

1. PassengerId: This attribute is merely an identifier and does not provide any
meaningful information for predictive modeling.
2. Ticket: Ticket numbers might not directly influence survival and may not provide
valuable insights unless further feature engineering is performed.
3. Cabin: While cabin numbers could potentially indicate proximity to lifeboats or
other factors influencing survival, the dataset contains many missing values for
this attribute, which may limit its usefulness.

Initial observations about the data suggest that it may not be entirely clean. Missing
values are present in attributes such as Age, Cabin, and Embarked. Thus, the following
EDA steps need to be performed to make it clean (a brief code sketch follows the list):

1. Handling Missing Values: Missing values in attributes like Age, Cabin, and
Embarked need to be addressed through techniques like imputation or deletion,
depending on the extent of missingness and the importance of the attribute.
2. Outlier Detection: Identifying and handling outliers, especially in numerical
attributes like Age and Fare, to prevent them from skewing analysis results.
3. Data Transformation: Encoding categorical variables like Sex and Embarked into
numerical representations for modeling purposes.
4. Feature Engineering: Creating new features from existing ones or extracting
additional information from attributes like Name or Ticket to enhance the
predictive power of the model.

The target variable is Survived. Because it takes only two values (0 = did not survive,
1 = survived), this is a binary classification problem rather than a regression problem.
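
A brief pandas sketch of these cleaning steps is shown below; it assumes the standard
Kaggle Titanic training file train.csv and the usual column names.

    import pandas as pd

    df = pd.read_csv("train.csv")  # assumed file name for the Titanic training data

    # 1. Inspect missing values (expected mainly in Age, Cabin, and Embarked)
    print(df.isna().sum())

    # 2. Handle missing values: impute Age and Embarked, drop the sparse Cabin column
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    df = df.drop(columns=["Cabin", "Ticket", "PassengerId"])

    # 3. Check numeric attributes such as Age and Fare for extreme values (possible outliers)
    print(df[["Age", "Fare"]].describe())

    # 4. Encode categorical variables for modeling
    df = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)

    # Target variable and features for the binary classification problem
    y = df["Survived"]
    X = df.drop(columns=["Survived", "Name"])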
