Data Science Unit 1

UNIT-I
Data Science Introduction

Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.
What is Data Science?

Data Science is about data gathering, analysis and decision-making.
Data Science is about finding patterns in data through analysis, and making future predictions.
By using Data Science, companies are able to make:
• Better decisions (should we choose A or B)

• Predictive analysis (what will happen next?)
• Pattern discoveries (find patterns, or maybe hidden information, in the data)
Where is Data Science Needed?

Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare,
and manufacturing.
Examples of where Data Science is needed:
• For route planning: to discover the best shipping routes
• To foresee delays for flight/ship/train etc. (through predictive analysis)
• To create promotional offers
• To find the best suited time to deliver goods
• To forecast the next year's revenue for a company
• To analyze the health benefits of training
• To predict who will win elections
Data Science can be applied in nearly every part of a business where data is available. Examples
are:
• Consumer goods
• Stock markets
• Industry
• Politics
• Logistics companies
• E-commerce
How Does a Data Scientist Work?

A Data Scientist requires expertise in several areas:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
A Data Scientist must find patterns within the data. Before the patterns can be found, the data
must be organized in a standard format.
Here is how a Data Scientist works:
1. Ask the right questions - To understand the business problem.

2. Explore and collect data - From database, web logs, customer feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and replace them with a
suitable value (e.g. an average value).
6. Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than 1.8 m,
yet the number 140 is larger than the number 1.8, so bringing values to a common scale is important).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a way the "company" can
understand.
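Steps 3 to 6 above can be illustrated with a minimal Python sketch. The height records below are
hypothetical and chosen only to mirror the 140 cm versus 1.8 m example; replacing missing values with
the average and using min-max scaling are just one reasonable choice for each step, not the only one.

```python
from statistics import mean

# Hypothetical raw height records: mixed units, one missing and one impossible value.
raw_heights = [
    {"value": 140, "unit": "cm"},
    {"value": 1.8, "unit": "m"},
    {"value": None, "unit": "cm"},  # missing value
    {"value": -5, "unit": "cm"},    # erroneous value (a height cannot be negative)
    {"value": 1.6, "unit": "m"},
]


def to_cm(record):
    """Step 3: transform every measurement to one standard unit (centimetres)."""
    if record["value"] is None:
        return None
    factor = 100 if record["unit"] == "m" else 1
    return record["value"] * factor


standardized = [to_cm(r) for r in raw_heights]

# Step 4: clean the data by treating impossible values as missing.
cleaned = [v if v is not None and v > 0 else None for v in standardized]

# Step 5: replace missing values with the average of the known values.
known = [v for v in cleaned if v is not None]
average = mean(known)
imputed = [v if v is not None else average for v in cleaned]

# Step 6: normalize with min-max scaling so that every value lies in [0, 1].
lowest, highest = min(imputed), max(imputed)
scaled = [(v - lowest) / (highest - lowest) for v in imputed]

print(imputed)
print(scaled)
```

After the unit conversion, 1.8 m correctly ranks above 140 cm, which is the point of the scaling
example in step 6.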
What is Data?
Data is a collection of information.
One purpose of Data Science is to structure data, making it interpretable and easy to work with.
Data can be categorized into two groups:
• Structured data
• Unstructured data
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
Structured Data
Structured data is organized and easier to work with.
How to Structure Data?

We can use an array or a database table to structure or present data.
Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
Database Table
A database table is a table with structured data.
The following table shows a database table with health data extracted from a sports watch:
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
45        100            140        280              0           7
60        105            140        290              7           8
60        110            145        300              7           8
60        115            145        310              8           8
75        120            150        320              0           8
75        125            150        330              8           8
This dataset contains information about typical training sessions, such as duration, average pulse,
calorie burnage, etc.
Database Table Structure
A database table consists of column(s) and row(s):
        Column 1  Column 2       Column 3   Column 4         Column 5    Column 6
        Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
Row 1   30        80             120        240              10          7
Row 2   30        85             120        250              10          7
Row 3   45        90             130        260              8           7
Row 4   45        95             130        270              8           7
Row 5   45        100            140        280              0           7
Row 6   60        105            140        290              7           8
Row 7   60        110            145        300              7           8
Row 8   60        115            145        310              8           8
Row 9   75        120            150        320              0           8
Row 10  75        125            150        330              8           8
A row is a horizontal representation of data.
A column is a vertical representation of data.
Variables
A variable is defined as something that can be measured or counted.
Examples can be characters, numbers or time.
In the example below, we can observe that each column represents a variable.
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
45        100            140        280              0           7
60        105            140        290              7           8
60        110            145        300              7           8
60        115            145        310              8           8
75        120            150        320              0           8
75        125            150        330              8           8
There are 6 columns, meaning that there are 6 variables (Duration, Average_Pulse, Max_Pulse,
Calorie_Burnage, Hours_Work, Hours_Sleep).
There are 11 rows, meaning that each variable has 10 observations.
But if there are 11 rows, how come there are only 10 observations?
It is because the first row is the label, meaning that it is the name of the variable.
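The counting above can be checked with a short Python sketch. It assumes the table has been saved as
comma-separated text (the name table_text is only illustrative) and uses the standard csv module to
separate the label row from the observations.

```python
import csv
import io

# The sports-watch table from above, written as comma-separated text.
table_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work,Hours_Sleep
30,80,120,240,10,7
30,85,120,250,10,7
45,90,130,260,8,7
45,95,130,270,8,7
45,100,140,280,0,7
60,105,140,290,7,8
60,110,145,300,7,8
60,115,145,310,8,8
75,120,150,320,0,8
75,125,150,330,8,8
"""

rows = list(csv.reader(io.StringIO(table_text)))
header, observations = rows[0], rows[1:]  # the first row holds the variable names

print(len(rows))          # total rows, including the label row
print(len(header))        # number of variables (columns)
print(len(observations))  # number of observations per variable

# One variable is one column: collect every Average_Pulse value into an array.
pulse_column = header.index("Average_Pulse")
average_pulse = [int(row[pulse_column]) for row in observations]
print(average_pulse)
```

The resulting average_pulse list is the same array shown earlier as the example of an array.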
Evolution of Data Science: Growth & Innovation
The term “data science” — and the practice itself — has evolved over the years. In
recent years, its popularity has grown considerably due to innovations in data
collection, technology, and mass production of data worldwide. Gone are the days
when those who worked with data had to rely on expensive programs and
mainframes. The proliferation of programming languages like Python and
procedures to collect, analyze, and interpret data paved the way for data science to
become the popular field it is today.
Data science began in statistics. Part of the evolution of data science was the
inclusion of concepts such as machine learning, artificial intelligence, and the
internet of things. With the flood of new information coming in and businesses
seeking new ways to increase profit and make better decisions, data science started
to expand to other fields, including medicine, engineering, and more.
Origins, Predictions, Beginnings
We could say that data science was born from the idea of merging applied statistics with
computer science. The resulting field of study would use the extraordinary power of modern
computing. Scientists realized they could not only collect data and solve statistical problems
but also use that data to solve real-world problems and make reliable fact-driven predictions.
1962: American mathematician John W. Tukey first articulated the data science dream. In his
now-famous article "The Future of Data Analysis," he foresaw the inevitable emergence of a
new field nearly two decades before the first personal computers. While Tukey was ahead of
his time, he was not alone in his early appreciation of what would come to be known as "data
science." Another early figure was Peter Naur, a Danish computer engineer whose book
Concise Survey of Computer Methods offers one of the very first definitions of data science:
"The science of dealing with data, once they have been established, while the relation of the
data to what they represent is delegated to other fields and sciences."
1977: The theories and predictions of "pre" data scientists like Tukey and Naur became more
concrete with the establishment of The International Association for Statistical Computing
(IASC), whose mission was "to link traditional statistical methodology, modern computer
technology, and the knowledge of domain experts in order to convert data into information and
knowledge."
1980s and 1990s: Data science began taking more significant strides with the first Knowledge
Discovery in Databases (KDD) workshop and the founding of the International Federation of
Classification Societies (IFCS). Both were among the first efforts to focus on educating and
training professionals in the theory and methodology of data science (though that term had not
yet been formally adopted).
It was at this point that data science started to garner more attention from leading professionals
hoping to monetize big data and applied statistics.
1994: BusinessWeek published a story on the new phenomenon of "Database Marketing." It
described the process by which businesses were collecting and leveraging enormous amounts
of data to learn more about their customers, competition, or advertising techniques. The only
problem at the time was that these companies were flooded with more information than they
could possibly manage. Massive amounts of data were sparking the first wave of interest in
establishing specific roles for data management. It began to seem like businesses would need
a new kind of worker to make the data work in their favor.
1990s and early 2000s: Data science emerged as a recognized and specialized field. Several data
science academic journals began to circulate, and data science proponents like Jeff Wu and
William S. Cleveland continued to help develop and expound upon the necessity and potential of
data science.
2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.
2005: Big data enters the scene. With tech giants such as Google and Facebook accumulating
large amounts of data, new technologies capable of processing it became necessary.
Hadoop rose to the challenge, and later on Spark and Cassandra made their debuts.
2014: Due to the increasing importance of data, and organizations’ interest in finding patterns
and making better business decisions, demand for data scientists began to see dramatic growth
in different parts of the world.
2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially enter the
realm of data science. These technologies have driven innovations over the past decade, from
personalized shopping and entertainment to self-driving vehicles, along with the insights needed
to bring these real-life applications of AI efficiently into our daily lives.
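2018: New data protection regulations, such as the European Union's General Data Protection
Regulation (GDPR), which took effect in 2018, are perhaps one of the biggest aspects in the
evolution of data science.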
2020s: We are seeing additional breakthroughs in AI and machine learning, and an ever-increasing
demand for qualified professionals in Big Data.
The Future of Data Science

Seeing how much of our world is currently powered by data and data science, we can
reasonably ask, Where do we go from here? What does the future of data science hold? While
it's difficult to know exactly what the hallmark breakthroughs of the future will be, all signs
seem to indicate the critical importance of machine learning. Data scientists are searching for
ways to use machine learning to produce more intelligent and autonomous AI.
In other words, data scientists are working tirelessly toward developments in deep learning to
make computers smarter. These developments can bring about advanced robotics paired with
a powerful AI. Experts predict the AI will be capable of understanding and interacting
seamlessly with humans, self-driving vehicles, and automated public transportation in a world
interconnected like never before. This new world will be made possible by data science.
Perhaps, on the more exciting side, we may see an age of extensive automation of labor in the
near future. This is expected to revolutionize the healthcare, finance, transportation, and
defense industries.
Data Science Roles

Today’s world revolves around data—there’s no getting around that fact. Every time we use
our smartphones or log into our computers, we leave a trail of data behind us. That data holds
key insights into our behavior, our likes and dislikes and the products we purchase. How do
businesses make sense of all that data? By hiring skilled individuals to perform careful analysis
and uncover trends, improve business practices and make more profitable decisions.
With data becoming increasingly important to companies’ bottom lines, data scientists are
some of the most valuable individuals in the professional world today. But with more open
roles in the field than ever before, the initial excitement from seeing all those job postings can
quickly turn to anxiety when you realize just how many options are out there. Not to mention
the positions that you don’t even know about yet.
What is a Data Scientist?
The first thing to keep in mind is that the phrase “data scientist” tends to be used as an umbrella
term for just about every job in the field. In fact, if you ask ten different people what they think
a data scientist does, you’ll likely get ten different answers.
The reality is, data science is a vast field that employs individuals in a variety of roles and
responsibilities. They may hold the title of Data Analyst, Business Analyst, Software
Developer or Marketing Data Scientist—just to name a few. Because data has become so
prevalent in our everyday lives, it might surprise you to learn which industries are actively
searching for and hiring data science professionals.
Unfortunately, many companies’ job listings don’t always get the distinction right, so it’s
crucial to understand your personal and professional goals before starting your search for a data
boot camp and eventually, a job in the field.
Data Science Roles
Data science teams are constantly faced with complex problems they need to solve using—you
guessed it—data. Whether it’s analyzing the sentiment of incoming communications (like
Tweets or survey responses), tracking sales leads, or devising a new marketing campaign, there
are a variety of data science jobs assigned to perform the myriad processes required of the field.
While many of these positions share some of the same tools and responsibilities, the day-to-
day experience for each can vary drastically.
Whether you’re preparing to enroll in a boot camp or you’re starting the job hunt for a position
in the field, you should have a basic idea of how you want to apply your skills. Take a look at
some of the most in-demand data science jobs to get a better understanding of how they fit into
their respective teams.
Data Analyst
As a typical entry-level position, a Data Analyst’s primary job is to develop systems that collect
and sift through company data, then use it to extract insights that answer business questions
with actionable solutions. Individuals in this role should have a keen eye for detail and the
ability to brainstorm new approaches to analyzing data. Oftentimes, Data Analysts are tapped
to work with a variety of departments and individuals, so collaboration and communication
skills are a must, especially when explaining technical ideas to non-technical teams.
Responsibilities: Accessing and cleaning data, performing statistical analysis, visualizing and
communicating the results
Programming languages required: Python, R, SQL
Tools/skills required: Data science programming, probability and statistics, collaboration,
communication
Growth potential: Many Data Analysts go on to become senior analysts or take on a
management role at larger companies with data teams
Top industries: Finance, insurance, gambling, retail banking, consumer products, healthcare,
energy
Data Scientist
Think of a Data Scientist as taking the Data Analyst role another step further down the data
science funnel. Data Scientists take on many of the same responsibilities as analysts, but they’re
also responsible for building machine learning models and working with algorithms to make
accurate predictions based on collected data—ultimately making Data Analysts’ jobs a little
easier. Of course, it’s always good to know how analysis fits into the larger picture, and
successful Data Scientists have a solid understanding of handling raw data, analyzing it and
sharing insights in a compelling way. Since the role tends to be more independent, motivation
and curiosity go a long way for these professionals.
Responsibilities: Analyzing data, building and training machine learning models to make
reliable future predictions
Programming languages required: Python, R
Tools/skills required: Everything required from a data analyst, plus strong foundations in math,
analytics and computer science, knowledge of machine learning methods, statistical models,
advanced data science programming and familiarity with Apache Spark
Growth potential: Data Scientists may move on to become a senior data scientist, while some
decide to take the path to become a machine learning engineer or a chief data officer
Top industries: Healthcare, telecommunications, energy, automotive
Business Analyst
In order for Data Analysts’ insights to be communicated throughout a company, it’s up to the
Business Analyst to use storytelling techniques to turn them into actionable business insights.
The main goal for individuals in this role is to facilitate potential solutions to organizational
problems, but they should also be prepared to take on additional responsibilities like quality
assurance and management. Needless to say, time management and prioritization are common
traits shared among successful Business Analysts—and you’re not likely to get hired as one
without them. While it’s not a heavily tech-focused role, understanding how to apply a variety
of business processes using high-level strategic thinking is a crucial skill for these data science
specialists.
Responsibilities: Use data-driven insights to clearly communicate initiatives throughout entire
organizations, often acting as the intermediary between a company’s business and tech teams
Programming languages and tools required: SQL, Tableau
Tools/skills required: Understanding of business processes, data visualization tools, listening
and storytelling, data modeling
Growth potential: With experience, many Business Analysts take on a leadership title or move
on to more senior roles in product management
Top industries: Telecom, utilities, real estate, healthcare, government, pharmaceuticals
Software Engineer
Nowadays, most software companies want to leverage their users’ data to optimize their
offerings, while data-driven businesses have turned to creating custom software built around
their specific needs or goals. That’s where Software Engineers come in. Depending on the type
of company, a Software Engineer might be tasked with optimizing certain product features
based on user data, or they might be responsible for building a new program that will ultimately
increase a company’s bottom line. Needless to say, individuals holding these roles should be
well-versed in programming and data analytics to truly be successful.
Responsibilities: Collaborate with data scientists and business analysts to ensure alignment
between the business objectives and the analytics back-end of the software they are working to
produce or modify, as well as ensure the scalability and security of the final product
Programming languages required: Java, Python, C, C++
Tools/skills required: Experience with machine learning and deep learning frameworks,
understanding of mathematics including linear algebra and statistics, strong programming and
debugging skills, data processing, writing and communication and attention to detail
Growth potential: Given the fact that this is a relatively new role within the industry, the
opportunities for individuals holding this role are virtually endless
Top industries: Retail, healthcare, research and development, government and defense, IT
services
Marketing Data Scientist
When a company builds a new campaign, it’s up to the Marketing Data Scientist to analyze
company data and user research to inform the marketing strategy around the launch and
measure its outcomes. On a granular level, this could involve anything from email marketing
and search engine optimization (SEO) to web analytics and growth hacking—and everything
in between! To be a successful Marketing Data Scientist, candidates need to have the ability to
leverage data to enhance key marketing components and achieve desired company outcomes.
Because market data tends to change rapidly, Marketing Data Scientists should be able to adapt
to the pace at which campaigns progress.
Responsibilities: Gather and analyze data to objectively strategize the launch and evolution of
a business’s promotions and marketing campaigns while communicating between stakeholders
Programming languages and tools required: SQL, Python, R, Tableau
Tools/skills required: Solid understanding of data analytics, objective thinking, strong
communication and adaptability
Growth potential: With so many specialties to choose from, the sky’s the limit for individuals
holding a Marketing Data Scientist role, some of whom go on to hold senior-level positions or
even start their own companies
Top industries: Banking and finance, advertising, retail, technology, travel
Machine Learning Engineer
While Data Scientists build a company’s machine learning models and Data Analysts
determine which data is worthy of exploring, it’s the Machine Learning Engineer who wrangles
and applies the algorithms to the datasets. Usually, the ultimate goal for individuals in this role
is to eventually create artificial intelligence. There’s plenty of trial-and-error involved in the
job, so persistence and resilience are key contributors to success. In addition, having a solid
understanding of how long it takes to apply various approaches will also prove advantageous
in this field.
Responsibilities: Processing data provided by a company’s Data Analyst using machine
learning algorithms developed by the Data Scientist to glean insights that will ultimately drive
business decisions
Programming languages required: R, Java, Python, C++
Tools/skills required: Strong communication paired with an understanding of data structures,
vectors, matrices, derivatives and integrals, as well as statistical concepts and probability theory
Growth potential: Many Machine Learning Engineers progress to become more specialized in
deep learning methods, while others transition to machine learning researchers or leads on data
engineering teams
Top industries: Healthcare, financial services, retail, government, transportation

Stages in a Data Science Project
1. Business Understanding: The complete cycle revolves around the business goal. What will
you solve if you do not have a specific problem? It is essential to understand the business
objective clearly, because that will be the ultimate aim of the analysis. Only after a proper
understanding can we set a specific analysis goal that is in sync with the business objective.
You need to know, for example, whether the customer wants to minimize losses or to predict the
price of a commodity.
2. Data Understanding: After business understanding, the next step is data understanding. This
involves collecting all the available data. Here you need to work closely with the business
team, as they know what data is present, what data should be used for this business problem,
and other relevant information. This step includes describing the data, its structure, its
relevance, and its data types. Explore the data using graphical plots; in short, extract
whatever information you can about the data simply by exploring it.
3. Preparation of Data: Next comes the data preparation stage. This consists of steps like
selecting the relevant data, integrating the data by merging data sets, cleaning it, treating
missing values by either removing or imputing them, treating erroneous data by removing it, and
checking for outliers using box plots and handling them. It also includes constructing new data
by deriving new features from existing ones, formatting the data into the desired structure, and
removing unwanted columns and features. Data preparation is the most time-consuming but
arguably the most important step in the complete life cycle. Your model will only be as good as
your data.
4. Exploratory Data Analysis: This step involves getting some idea about the solution and the
factors affecting it before building the actual model. The distribution of data within
individual variables is explored graphically using bar graphs, and relations between different
features are captured through graphical representations such as scatter plots and heat maps.
Many data visualization techniques are used extensively to explore every feature individually
and in combination with other features.
5. Data Modeling: Data modeling is the heart of data analysis. A model takes the prepared data
as input and produces the desired output. This step includes choosing the appropriate type of
model, depending on whether the problem is a classification, regression, or clustering problem.
After choosing the model family, we need to carefully select, from the algorithms in that
family, the ones to implement. We need to tune the hyperparameters of each model to achieve the
desired performance. We also need to make sure there is the right balance between performance
and generalizability: we do not want the model to memorize the training data and perform poorly
on new data.
6. Model Evaluation: Here the model is evaluated to check whether it is ready to be deployed.
The model is tested on unseen data and evaluated on a carefully thought-out set of evaluation
metrics. We also need to make sure that the model conforms to reality. If we do not obtain a
satisfactory result in the evaluation, we must repeat the complete modeling process until the
desired level of the metrics is achieved. Any data science solution or machine learning model,
just like a human, must evolve: it must be able to improve itself with new data and adapt to
new evaluation metrics. We can build multiple models for a given phenomenon, but many of them
may be imperfect. Model evaluation helps us choose and build the best model.
7. Model Deployment: After a rigorous evaluation, the model is finally deployed in the desired
format and channel. This is the last step in the data science life cycle. Each step in the life
cycle described above must be worked on carefully. If any step is performed improperly, it
affects the subsequent step and the complete effort goes to waste. For example, if data is not
collected properly, you will lose information and will not build an ideal model. If data is not
cleaned properly, the model will not work. If the model is not evaluated properly, it will fail
in the real world. From business understanding to model deployment, every step has to be given
appropriate attention, time, and effort.
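The modeling and evaluation stages can be sketched in a few lines of Python. The example below
reuses the Average_Pulse and Calorie_Burnage columns from the sports-watch table earlier in this
unit. The choice of a straight-line (simple linear regression) model, the hold-out rows, and the
mean absolute error metric are assumptions made for illustration, not a prescribed method.

```python
# Two variables (columns) from the sports-watch table.
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

# Hold out some rows as unseen data for evaluation (an arbitrary, illustrative split).
test_rows = {2, 5, 8}
pairs = list(zip(average_pulse, calorie_burnage))
train = [p for i, p in enumerate(pairs) if i not in test_rows]
test = [p for i, p in enumerate(pairs) if i in test_rows]


def fit_line(points):
    """Data modeling: least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(train)


def predict(pulse):
    """Predict calorie burnage from average pulse with the fitted line."""
    return slope * pulse + intercept


# Model evaluation: mean absolute error on the rows the model has not seen.
mae = sum(abs(predict(x) - y) for x, y in test) / len(test)

print(slope, intercept)
print(mae)
```

Because this small table follows an exact straight line, the error on the held-out rows is
essentially zero. Real data is noisier, which is why evaluation on unseen data matters.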
Applications of Data Science
1. In Search Engines
Search engines are among the most visible applications of Data Science. When we want to search
for something on the internet, we mostly use search engines such as Google and Yahoo. Data
Science is used to return relevant results faster.
For example, when we search for "Data Structure and algorithm courses", the first link returned
may be GeeksforGeeks Courses. This happens because the GeeksforGeeks website is among the most
visited for information on Data Structure courses and computer-related subjects. This analysis
is done using Data Science, which surfaces the most visited web links.
2. In Transport
Data Science has also entered the transport field, for example through driverless cars.
Driverless cars are expected to help reduce the number of accidents.
For example, in driverless cars, training data is fed into an algorithm, and Data Science
techniques are used to analyze it: what the speed limit is on highways, busy streets, and narrow
roads, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in the financial industry. Financial institutions always face the
problem of fraud and the risk of losses, so they need to automate risk-of-loss analysis in order
to make strategic decisions. They also use Data Science analytics tools for prediction, for
instance to estimate customer lifetime value and stock market moves.
For example, Data Science is central to stock market analysis. It is used to examine past
behavior using historical data, with the goal of estimating future outcomes. Data is analyzed in
a way that makes it possible to forecast future stock prices over a set time frame.
4. In E-Commerce
E-commerce websites like Amazon and Flipkart use Data Science to create a better user
experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar
to our past choices, based on our past data, as well as recommendations based on the most-bought,
top-rated, and most-searched products. This is all done with the help of Data Science.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
• Detecting tumors
• Drug discovery
• Medical image analysis
• Virtual medical bots
• Genetics and genomics
• Predictive modeling for diagnosis, etc.
6. Image Recognition
Data Science is also used in image recognition. For example, when we upload a photo with a
friend on Facebook, Facebook suggests tagging the people in the picture. This is done with the
help of machine learning and Data Science. When an image is recognized, the faces in the picture
are compared against the profiles of one’s Facebook friends, and if a face matches someone’s
profile, Facebook suggests auto-tagging.
7. Targeted Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever a
user searches for on the internet, they will later see related posts and advertisements
everywhere. For example, suppose I want a mobile phone, so I search for it on Google, and then I
change my mind and decide to buy offline. Data Science helps the companies that pay for
advertisements for that phone: everywhere on the internet, on social media, on websites, and in
apps, I will see recommendations for the phone I searched for, which nudges me to buy online.
8. Airline Route Planning
Data Science is also helping the airline sector grow: it becomes easier to predict flight
delays. It also helps to decide whether to fly directly to the destination or to make a stop in
between; for example, a flight from Delhi to the U.S.A. can take a direct route or halt on the
way before reaching its destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, Data Science concepts are used
together with machine learning, so that the computer improves its performance based on past
data. Many games, such as chess and EA Sports titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating medicine is difficult and time-consuming, and it has to be done with
full discipline because it is a matter of someone’s life. Without Data Science, developing a new
medicine or drug takes a lot of time, resources, and money; with Data Science it becomes easier,
because the likelihood of success can be estimated from biological data or factors. Algorithms
based on Data Science can forecast how a drug will react in the human body without first running
lab experiments.
11. In Delivery Logistics
Various logistics companies, such as DHL and FedEx, make use of Data Science. It helps these
companies find the best route for shipping their products, the best time for delivery, the
best mode of transport to reach the destination, etc.
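As a rough illustration of what finding the best route can mean in code, the sketch below runs
Dijkstra's shortest-path algorithm on a small, made-up delivery network. The depot names, the
travel times, and the function name cheapest_route are all assumptions for illustration; real
logistics systems also account for traffic, capacity, and delivery windows.

```python
import heapq


def cheapest_route(graph, start, goal):
    """Return (total_cost, path) for the cheapest route, or (inf, []) if none exists."""
    # Each queue entry holds the cost so far, the current node, and the path taken.
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical network: travel time in hours between locations.
network = {
    "Warehouse": {"HubA": 4, "HubB": 2},
    "HubB": {"HubA": 1, "Customer": 7},
    "HubA": {"Customer": 3},
}

cost, path = cheapest_route(network, "Warehouse", "Customer")
print(cost, path)
```

In this toy network the route through both hubs is cheaper than either more direct option,
which is the kind of result that is hard to spot by eye once the network grows.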
12. Autocomplete
Autocomplete is an important application of Data Science: the user types just a few letters
or words and is offered a completion of the rest of the line. In Google Mail, for example,
when we write a formal email, autocomplete suggests an efficient way to complete the whole
sentence. Autocomplete is also widely used in search engines, on social media, and in
various apps.
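A minimal sketch of the idea, assuming a simple prefix lookup over a sorted list of known
phrases. The phrase list and the function name autocomplete are hypothetical; production
systems typically rank suggestions with models learned from past user data rather than
returning plain alphabetical matches.

```python
from bisect import bisect_left


def autocomplete(prefix, sorted_phrases, limit=3):
    """Return up to `limit` phrases from a sorted list that start with `prefix`."""
    # Binary search finds the first phrase that is >= the prefix.
    start = bisect_left(sorted_phrases, prefix)
    matches = []
    for phrase in sorted_phrases[start:]:
        if not phrase.startswith(prefix):
            break  # Sorted order means no later phrase can match.
        matches.append(phrase)
        if len(matches) == limit:
            break
    return matches


# Hypothetical vocabulary of previously seen phrases.
phrases = sorted(
    ["data privacy", "data science", "data security", "database", "deep learning"]
)

suggestions = autocomplete("data s", phrases)
print(suggestions)
```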
What is Data Security?
Data security is the protection of data from unknown, unwanted, or external access. It covers
protection from data breaches, corruption, modification, and theft. Strategies for setting up
data security include hashing, data encryption, and tokenization. In other words, data
security means protecting information from unauthorized access throughout its lifecycle. The
components that require protection include software, user and storage devices, hardware, the
organization’s policies and procedures, and access and administrative controls.
Data security is achieved with tools that enable encryption, data masking, and redaction of
confidential information. It also depends on following strict regulations and setting up a
practical, efficient management process, which reduces data security breaches and human
error.
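The sketch below illustrates three of the techniques named above (hashing, tokenization, and
data masking) using only the Python standard library. The helper names hash_value, mask_card,
and TokenVault are assumptions made for this example, and the in-memory vault stands in for
what would be a hardened, access-controlled service in practice. Note also that a plain
SHA-256 hash is shown only to illustrate the concept; stored passwords call for a salted,
deliberately slow scheme instead.

```python
import hashlib
import secrets


def hash_value(value: str) -> str:
    """One-way fingerprint: the same input always gives the same digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask_card(card_number: str) -> str:
    """Data masking: hide all but the last four digits."""
    digits = card_number.replace(" ", "")
    return "*" * (len(digits) - 4) + digits[-4:]


class TokenVault:
    """Tokenization: swap a sensitive value for a random token kept in a lookup table."""

    def __init__(self):
        self._store = {}

    def tokenize(self, value: str) -> str:
        token = secrets.token_hex(8)  # Random, so it reveals nothing about the value.
        self._store[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._store[token]


vault = TokenVault()
token = vault.tokenize("4111 1111 1111 1111")
print(mask_card("4111 1111 1111 1111"))
print(len(hash_value("customer@example.com")))
print(vault.detokenize(token))
```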
Why is Data Security Important?
Data security is of utmost importance in today’s digital age. It refers to data protection from
unauthorized access, use, disclosure, alteration, or destruction. Here are several reasons why
data security is crucial:
Protection of Confidential Information: Data security protects your sensitive and
confidential information, including personal data, financial records, intellectual property,
trade secrets, and customer information. Preventing unauthorized access to this information is
essential to safeguard privacy, prevent identity theft and financial fraud, and maintain trust.
Compliance with Regulations: Many industries are subject to strict regulations regarding the
protection and privacy of data. Compliance with these regulations, such as the General Data
Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act
(HIPAA), is mandatory. Failure to comply can lead to legal consequences, fines, and
reputational damage.
Prevention of Data Breaches: Data breaches can have severe consequences for businesses.
They can lead to financial loss, reputational damage, and loss of customer trust. Implementing
robust data security measures reduces the risk of data breaches and helps protect valuable
assets.
Business Continuity and Disaster Recovery: Data security is crucial in business continuity and disaster
recovery planning. Regular data backups, secure storage, and disaster recovery plans ensure
that critical business data can be recovered in the event of a data loss or system failure.
Competitive Advantage: Data security can be a differentiator in a highly competitive market.
Businesses prioritizing data security demonstrate their commitment to protecting sensitive
information, giving customers and partners confidence in their operations.
Trust and Customer Confidence: Data security is vital for building and maintaining customer confidence.
Customers are more likely to engage with organizations that prioritize data protection and are
transparent about their security measures.
Data Security vs Data Privacy
Data Security
Data security protects data from unauthorized access, use, disclosure, alteration, or destruction.
It involves implementing technical, physical, and procedural measures to protect data integrity,
confidentiality, and availability.
Its measures include encryption, firewalls, access controls, intrusion detection systems, and
data backup.
The main goal of data security is to prevent data breaches and unauthorized access and protect
data from external threats.
Data Privacy
Data privacy, on the other hand, focuses on controlling the collection, use, disclosure, and
sharing of personal data.
It ensures that individuals control their personal information and how organizations use it.
Data privacy involves implementing policies, procedures, and measures to comply with privacy
laws and regulations.
It includes obtaining consent for data collection and processing, providing transparency about
data usage, and respecting individuals’ rights.
Data privacy also addresses issues such as data anonymization, data retention, and data subject
rights.
Common Threats to Data Security
Malware and Viruses
Malware, short for malicious software, is a broad category of software designed to harm
computer systems. It includes variants such as spyware, viruses, and ransomware, any of which
can contribute to a data breach. Malware is code created by cyber attackers to damage a
system or its data or to gain unauthorized access to them. It is often activated when a user
clicks on a malicious attachment or link. Once activated, malware can carry out a variety of
harmful actions:
• Installing additional harmful software
• Damaging parts of the system and rendering them useless
• Transmitting data without permission
• Blocking access to network components
One well-known mobile data breach leaked the data of around 37 million customers through
malware. The company eventually agreed to pay around $350 million to customers who filed
class action lawsuits.
Phishing Attacks
Phishing attacks are fraudulent communications sent with malicious intent. Users often receive
them as emails that appear to come from a trusted source. The message contains a set of
instructions for the receiver to follow, such as revealing confidential information like
credit card numbers, login details, or CVV codes. The message may also contain links that
compromise data when clicked.
Social Engineering
Social engineering is a well-planned and well-researched attack. It begins by studying
specific targets and their behavior, preferences, and needs. The attacker gathers this
information, gains the target’s trust, and then uses the target to get past security
protocols. Techniques include pretexting, spear phishing, baiting, phishing, scareware, quid
pro quo, water holing, vishing, tailgating, rogue software, and honey traps.
Insider Threats
These are threats that originate inside the company or organization. They can be intentional
or unintentional, and they fall into three groups:
• Malicious insiders aim to steal data or harm the organization for personal benefit.
• Non-malicious insiders are unaware individuals who accidentally expose the organization to
a threat.
• Compromised insiders are unaware that their system or account has been compromised. The
harmful activity happens from the person’s account without their knowledge.
Physical Theft or Loss of Devices
Portable devices such as laptops, pen drives, and hard drives are easy to steal and can cause
serious harm to the company and the user if they are lost. Limiting access to such devices is
one of the standard methods of protecting data.
Best Practices for Improving Data Security
Here are some of the best practices for improving data security:
Use Strong Passwords and Multi-factor Authentication
Many online services already come with enhanced data security built in. One such feature is
accepting only strong passwords that mix different character types, which increases the
number of possible combinations an attacker would have to guess. In addition, multi-factor
authentication requires a second proof of identity, such as a code from a separate device in
the user’s possession, before allowing login to the account. Passing several independent
security checks is much harder for an attacker than defeating a single one.
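To make the point about possible combinations concrete, the sketch below counts how many
passwords exist for a given length and character set. The function name search_space and the
choice of an 8-character password are assumptions for illustration, and the count is only a
rough upper bound on guessing effort: it assumes characters are chosen at random, which
human-chosen passwords usually are not.

```python
import string


def search_space(length: int, alphabet_size: int) -> int:
    """Number of distinct passwords of a given length over a given alphabet."""
    return alphabet_size ** length


# Alphabet sizes taken from Python's built-in character sets.
lowercase = len(string.ascii_lowercase)
mixed = (
    len(string.ascii_lowercase)
    + len(string.ascii_uppercase)
    + len(string.digits)
    + len(string.punctuation)
)

lowercase_only_space = search_space(8, lowercase)
mixed_space = search_space(8, mixed)

print(f"8 lowercase letters: {lowercase_only_space:,} combinations")
print(f"8 mixed characters:  {mixed_space:,} combinations")
print(f"Mixed is about {mixed_space // lowercase_only_space:,} times larger")
```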
Keep Software and Systems Up-to-date
Software and systems often contain bugs. Software updates aim to fix such shortcomings and
provide enhanced security, closing the window for internal or external data security
breaches.
Limit Access to Sensitive Data
Access control is essential in providing data security by limiting access to a restricted number
of users. It promotes accountability and responsibility among a selected group of individuals.
Every organization and department must take this crucial step to ensure data security. Access
control grants permission or visibility only for the sections that correspond to a user’s
job role. For instance, the finance team does not need access to the software workflow, and
vice versa. By implementing access control measures, an organization can ensure that only
authorized individuals access sensitive data, reducing the risk of unauthorized access and data
breaches.
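A minimal sketch of role-based access control along these lines. The role names, resource
names, and the function can_access are hypothetical; a real system would load permissions
from a managed policy store and log every access decision.

```python
# Hypothetical mapping from job role to the resources that role may use.
ROLE_PERMISSIONS = {
    "finance": {"invoices", "payroll"},
    "engineering": {"source_code", "build_pipeline"},
}


def can_access(role: str, resource: str) -> bool:
    """Allow access only if the resource is listed for the user's role."""
    # Unknown roles get an empty permission set, so access is denied by default.
    return resource in ROLE_PERMISSIONS.get(role, set())


print(can_access("finance", "payroll"))
print(can_access("finance", "source_code"))
print(can_access("intern", "invoices"))
```

Denying by default, as the empty set does here, means a missing or misspelled role can never
grant access by accident.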
Encrypt Sensitive Data in Transit and at Rest
Encrypt data regardless of its current usage status. Encryption converts data into a format
that is unreadable without the correct key. It works through an algorithm and a key, which
help protect the integrity and confidentiality of the data. Data in transit and data at rest
are both prone to attack and must be encrypted.
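The toy sketch below shows the algorithm-and-key idea with a one-time pad: each byte of the
message is XOR-ed with a random key byte, and applying the same key again restores the
original. The function name xor_bytes and the sample message are assumptions for
illustration. This is a teaching example only; real systems should rely on vetted
implementations of standard ciphers and protocols such as AES and TLS rather than hand-written
code.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte (same call encrypts and decrypts)."""
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"account=12345"
key = secrets.token_bytes(len(message))  # Random key, used once and kept secret.

ciphertext = xor_bytes(message, key)    # Unreadable without the key.
recovered = xor_bytes(ciphertext, key)  # The same key reverses the operation.

print(ciphertext.hex())
print(recovered)
```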
Backup Data Regularly
The data security threats described above include system compromise, which can make data
unavailable and bring activities to a halt. Regular data backups make it possible to restore
the data and keep working. They reduce the damage, because information lost in a data breach
may otherwise take a long time to recover.
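A minimal sketch of a verified backup, assuming a single file copied to a backup folder. The
file name customers.csv, its contents, and the helper names sha256_of and backup_file are
made up for this example, and a temporary directory stands in for real storage. In practice
backups are scheduled, versioned, and kept on separate, secured media.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum used to confirm that two files have identical contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy a file into backup_dir and verify the copy against the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps.
    if sha256_of(source) != sha256_of(target):
        raise IOError("backup verification failed")
    return target


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "customers.csv"
    source.write_text("id,name\n1,Asha\n")
    copy = backup_file(source, Path(tmp) / "backups")
    backup_ok = copy.read_text() == source.read_text()
    backup_name = copy.name

print(backup_ok)
```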
Train Employees on Security Awareness
Up-to-date information on possible attacks and prevention methods can protect company data
from many losses. It enables employees to act carefully and take precautions when dealing
with unknown or suspicious data, and it helps them identify social engineering attacks. Teach
them what data security is, along with other crucial topics such as data security regulations
and standards like PCI DSS and HIPAA.
