
Unit 1 Hints

UNIT I
PART A

1. What is data science?


Data science is the study of data. It involves developing methods of recording, storing, and analyzing data to effectively extract useful information. The goal of data science is to gain insights and knowledge from any type of data, both structured and unstructured.

2. Explain the role of data science in business?


Every business needs data science to make forecasts and predictions based on facts and figures, which are collected in the form of data and processed through data science techniques. Data science is also important for marketing promotions and campaigns.

3. Explain the role of data science in Medical Research?


Data science is necessary for research and analysis in health care, making it easier for practitioners and researchers to understand the challenges and extract results through analysis and insights based on data. The advanced technologies of machine learning and artificial intelligence have been infused into medical science to conduct biomedical and genetic research more easily.

4. Explain the role of data science in health care?


A patient's clinical history and the medical treatment given are computerized and kept as digital records, which makes it easier for practitioners and doctors to detect complex diseases at an early stage and understand their complexity.

5. Explain the role of data science in education?


Online personality tests and career counseling help students and individuals select a career path based on their choices. This is done with the help of data science, whose algorithms and predictive models draw conclusions from the provided set of choices.

6. Explain the role of data science in social media?


Most data on consumer behavior, choices, and preferences is collected through online platforms, which helps businesses grow. The systems are smart enough to predict and analyze a user's mood, behavior, and opinion towards a specific product, incident, or event with the help of texts, feedback, thoughts, views, reactions, and suggestions.

7. Explain the role of data science in Technology?


The new technology emerging around us to ease life and alter our lifestyle is largely due to data science. Some applications collect sensory data to track our movements and activities and also provide knowledge and suggestions concerning our health, such as blood pressure and heart rate. Data science is also being used to advance security and safety technology, which is a major concern nowadays.

8. Explain the role of data science in financial institutions?


Data Science plays a key role in financial industries, which always face issues of fraud and risk of losses. Financial institutions therefore need to automate risk-of-loss analysis in order to carry out strategic decisions for the company. They also use Data Science analytics tools to predict the future, which allows companies to estimate customer lifetime value and anticipate stock market moves. For example, in the stock market, Data Science is used to examine past behavior from historical data with the goal of estimating the future outcome. Data is analyzed in such a way that it becomes possible to predict future stock prices over a set time frame.

9. Explain the main types/categories of data?


* Structured
* Unstructured
* Natural language
* Machine-generated
* Graph-based
* Audio, video, and images
* Streaming

10. Explain natural language. Is natural language unstructured data?


Yes, natural language is a special type of unstructured data; it is challenging to process because it requires knowledge of specific data science techniques and of linguistics. Natural language processing is used in entity recognition, topic recognition, summarization, text completion, and sentiment analysis.
Example: predictive text, search results.
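A minimal sketch of one of these tasks, sentiment analysis, using a tiny hand-made word list; the lexicon and the review texts below are invented purely for illustration (real systems use trained language models):

# Toy lexicon-based sentiment analysis: count positive vs. negative words.
POSITIVE = {"good", "great", "excellent", "love", "helpful"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "useless"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The course content was great and very helpful"))    # positive
print(sentiment("The delivery was terrible and the packaging poor")) # negative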

11. What is Machine-generated data with example?


Machine-generated data is information that is automatically created by a computer, process, application, or other machine without human intervention. Machine-generated data is becoming a major data resource and will continue to do so. The analysis of machine data relies on highly scalable tools because of its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
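As a rough illustration of one of these examples, a single web server log line can be parsed into structured fields with a regular expression; the log line and the field layout below are assumptions made only for the sketch:

import re

# One hypothetical line in common log format (machine-generated, no human typed it).
log_line = '192.168.1.10 - - [12/Mar/2023:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326'

pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)'
match = re.match(pattern, log_line)
if match:
    ip, timestamp, method, path, status, size = match.groups()
    print(ip, timestamp, method, path, status, size)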

12. What is Graph-based or network data?


Graph or network data is, in short, data that focuses on the relationship or adjacency of objects. The graph
structures use nodes, edges, and properties to represent and store graphical data. Graph-based data is a natural way to
represent social networks.
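A tiny sketch of graph-based data using a plain adjacency list; the people and friendships are invented for illustration, with each key a node and each list entry an edge:

# A small social network stored as an adjacency list (node -> connected nodes).
friends = {
    "Asha":   ["Bala", "Chitra"],
    "Bala":   ["Asha", "Deepak"],
    "Chitra": ["Asha"],
    "Deepak": ["Bala"],
}

# Friends-of-friends for one node, excluding the node itself and its direct friends.
node = "Asha"
fof = {f2 for f1 in friends[node] for f2 in friends[f1]} - set(friends[node]) - {node}
print(fof)  # {'Deepak'}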
13. List the steps involved in the data science process.
1. Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modelling
6. Presentation and automation.

14. What are OUTLIERS?


An outlier is an observation that seems to be distant from other observations or, more specifically, an observation that follows a different logic or generative process than the other observations. It is an observation that lies an abnormal distance from other values in a random sample from a population, i.e. a value that "lies outside" (is much smaller or larger than) most of the other values in a set of data. For example, in the scores 25, 29, 3, 32, 85, 33, 27, 28, both 3 and 85 are "outliers".

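A minimal sketch that flags the outliers in the scores above using the common interquartile-range (IQR) rule; the 1.5 * IQR fences are one conventional choice, not the only possible definition:

import numpy as np

scores = [25, 29, 3, 32, 85, 33, 27, 28]
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # standard 1.5 * IQR fences
outliers = [x for x in scores if x < low or x > high]
print(outliers)  # [3, 85]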
15. What are the different ways of combining data?

The two operations that are performed on different data sets to combine them are as follows:

i. Joining- enriching an observation from one table with information from another table.

ii. Appending or stacking- adding the observations of one table to those of another table.
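A brief pandas sketch of both operations above; the table contents (customers and monthly orders) are invented for illustration:

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Bala"]})
orders_jan = pd.DataFrame({"cust_id": [1, 2], "amount": [250, 400]})
orders_feb = pd.DataFrame({"cust_id": [2, 1], "amount": [150, 300]})

# Joining: enrich each order with customer information from another table.
joined = pd.merge(orders_jan, customers, on="cust_id", how="left")

# Appending / stacking: add the February observations below the January ones.
stacked = pd.concat([orders_jan, orders_feb], ignore_index=True)

print(joined)
print(stacked)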

16. What is big data?


Big data is a blanket term for any collection of data sets so large or complex that they become difficult to process using traditional data management techniques. They are characterized by the four Vs: velocity, variety, volume, and veracity. Big data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. Examples of big data analytics include stock exchanges, social media sites, jet engines, etc. Big data can be 1) structured, 2) unstructured, or 3) semi-structured.

17. What is data Cleansing?


Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so that your data becomes a true and consistent representation of the processes it originates from. Data cleaning is a process by which inaccurate, poorly formatted, or otherwise messy data is organized and corrected. For example, if you conduct a survey and ask people for their phone numbers, people may enter their numbers in different formats.
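A small sketch of the phone-number example: strip everything but the digits and reformat into one consistent pattern. The survey entries and the chosen target format are assumptions for the illustration:

import re

raw_numbers = ["(044) 2462-1234", "044 24621234", "04424621234"]

def clean_phone(value):
    digits = re.sub(r"\D", "", value)          # keep digits only
    return f"{digits[:3]}-{digits[3:]}"        # one consistent format

print([clean_phone(n) for n in raw_numbers])   # ['044-24621234', '044-24621234', '044-24621234']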

18. Define the term setting the research goal (in data science processing).
Understanding the business or activity that our data science project is part of is key to ensuring its success and is the first phase of any sound data analytics project. Defining the what, the why, and the how of our project in a project charter is the foremost task. Every data science project should aim to fulfill a precise and measurable goal that is clearly connected to the purposes, workflows, and decision-making processes of the business.

19. Explain the term Retrieving data.


Retrieving data means finding and getting access to the data needed in your project. This data is either found within the company or retrieved from a third party. Data retrieval means obtaining data from a database management system (DBMS), such as an ODBMS. In this case it is assumed that the data is represented in a structured way and that there is no ambiguity in the data. In order to retrieve the desired data, the user presents a set of criteria in the form of a query.
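A minimal sketch of retrieving data from a DBMS by presenting criteria as a query; here an in-memory SQLite database with an invented sales table stands in for the company's data source:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 1200.0), ("South", 900.0), ("North", 450.0)])

# The query expresses the retrieval criteria; the DBMS returns the matching rows.
rows = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
print(rows)  # e.g. [('North', 1650.0), ('South', 900.0)]
conn.close()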

20. What is the data preparation process?


Data preparation is the process of checking and remediating data errors, enriching the data with data from other data sources, and transforming it into a suitable format for your models. It means preparing raw data so that it is suitable for further processing and analysis. Key steps include collecting, cleaning, and labeling raw data into a form suitable for machine learning (ML) algorithms, and then exploring and visualizing the data.
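A short pandas sketch of typical preparation steps: fixing an impossible value, filling a missing value, and converting a column to a suitable type. The records and the chosen rules (age cap, median imputation) are assumptions for the example:

import pandas as pd

raw = pd.DataFrame({
    "age":    [34, None, 290, 41],                    # one missing, one impossible value
    "salary": ["45000", "52000", "61000", "58000"],   # stored as text
})

prepared = raw.copy()
prepared.loc[prepared["age"] > 120, "age"] = None     # treat impossible ages as missing
prepared["age"] = prepared["age"].fillna(prepared["age"].median())
prepared["salary"] = prepared["salary"].astype(int)   # transform to a numeric type

print(prepared)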

21. Define the data exploration process.


Data exploration is the process of diving deeper into your data using descriptive statistics and visual techniques. It is the first step of data analysis, used to explore and visualize data in order to uncover insights from the start or to identify areas or patterns to dig into further. Using interactive dashboards and point-and-click data exploration, users can better understand the bigger picture and get to insights faster.
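A minimal sketch of first-pass exploration with descriptive statistics and a simple plot; the dataset is invented, and matplotlib is assumed to be available:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"daily_sales": [120, 135, 128, 90, 300, 140, 132, 125]})

print(df.describe())              # count, mean, std, min, quartiles, max
df["daily_sales"].hist(bins=5)    # visual check of the distribution
plt.title("Daily sales")
plt.show()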
22. What is data modelling?
Building a model is an iterative process that involves selecting the variables for the model, executing the model, and running model diagnostics. In general, data modelling means using machine learning and statistical techniques to achieve the project goal.
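A brief scikit-learn sketch of this step; the tiny dataset (advertising spend versus sales) is invented, and in practice the variable selection, model choice, and diagnostics would be revisited over several iterations:

from sklearn.linear_model import LinearRegression

# Selected variable: advertising spend; target: sales (hypothetical numbers).
X = [[10], [20], [30], [40], [50]]
y = [25, 45, 65, 85, 105]

model = LinearRegression().fit(X, y)    # execute the model
print(model.coef_, model.intercept_)    # diagnostics: inspect the fitted parameters
print(model.predict([[60]]))            # use the model for a new observation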

23. What are the types of databases?


• Column databases
• Document stores
• Streaming data
• Key-value stores
• SQL on Hadoop
• New SQL
• Graph databases

26. What is a column database?


In this method, data is stored in columns, which allows algorithms to perform much faster queries. Newer technologies use cell-wise storage. Table-like structures are still important.

27. Explain document stores?


Document stores no longer use tables, but store every observation in a document. This allows for a much more
flexible data scheme.
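A tiny illustration of the document idea using plain JSON (not a real document database; the records are invented): two observations do not need to share the same fields, which is the flexibility a document store offers.

import json

# Each observation is a self-contained document; schemas may differ per document.
doc1 = {"_id": 1, "name": "Asha", "orders": [{"item": "book", "qty": 2}]}
doc2 = {"_id": 2, "name": "Bala", "newsletter": True}   # different fields, still valid

for doc in (doc1, doc2):
    print(json.dumps(doc))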

28. What is Streaming data?


Data is collected, transformed, and aggregated not in batches but in real time. Although we have categorized it here as a database to help with tool selection, it is more a particular type of problem that drove the creation of technologies such as Storm.

29. Define Key-value stores?


Data isn't stored in a table; rather, you assign a key for every value, such as org.marketing.sales.2015: 20000. This scales well but places almost all the implementation burden on the developer.
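A minimal sketch of the key-value idea with a plain Python dictionary; a real key-value store works on the same principle, and the keys below simply reuse the example above plus one invented variant:

# The meaning of the data is encoded in the key; the store only maps key -> value.
kv_store = {}
kv_store["org.marketing.sales.2015"] = 20000
kv_store["org.marketing.sales.2016"] = 23500

print(kv_store.get("org.marketing.sales.2015"))   # 20000
print(kv_store.get("org.marketing.sales.2017"))   # None: the developer must handle misses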
30. Explain SQL on Hadoop?
Batch queries on Hadoop are expressed in a SQL-like language that uses the map-reduce framework in the background.
31. What is New SQL?
This class combines the scalability of NoSQL databases with the advantages of relational databases. They all
have a SQL interface and a relational data model.
PART B

1. What are the advantages and disadvantages of Data Science?


Advantages of Data Science :-
In today's world, data is being generated at an alarming rate. Every second, a huge amount of data is produced, be it from users of Facebook or any other social networking site, from the calls that one makes, or from the data generated by different organizations. Because of this huge amount of data, the value of the field of Data Science has grown, and it brings a number of advantages. Some of them are mentioned below:

• Multiple job options :- Being in demand, Data Science has given rise to a large number of career opportunities in its various fields, such as Data Scientist, Data Analyst, Research Analyst, Business Analyst, Analytics Manager, and Big Data Engineer.
• Business benefits :- Data Science helps organizations know how and when their products sell best, so the products are always delivered to the right place at the right time. Faster and better decisions are taken by the organization to improve efficiency and earn higher profits.
• Highly paid jobs and career opportunities :- Data Scientist continues to be called the sexiest job, and the salaries for this position are also grand. According to a Dice Salary Survey, the annual average salary of a Data Scientist is $106,000 per year.
• Hiring benefits :- Data Science has made it comparatively easier to sort data and look for the best candidates for an organization. Big Data and data mining have made the processing and selection of CVs, aptitude tests, and games easier for recruitment teams.
Disadvantages of Data Science :-
Everything that comes with a number of benefits also has some consequences. Some of the disadvantages of Data Science are:
• Data privacy :- Data is the core component that can increase the productivity and revenue of an industry by enabling game-changing business decisions. But the information or insights obtained from the data can be misused against an organization, a group of people, a committee, etc. Information extracted from structured as well as unstructured data for further use can also be misused against a group of people of a country or some committee.
• Cost :- The tools used for data science and analytics can cost an organization a lot, as some of the tools are complex and require people to undergo training in order to use them. It is also very difficult to select the right tools for the circumstances, because their selection depends on proper knowledge of the tools as well as on their accuracy in analyzing the data and extracting information.
2. Briefly explain the architecture of Data Mining
Data mining is a significant method by which previously unknown and potentially useful information is extracted from vast amounts of data. The data mining process involves several components, and these components constitute the data mining system architecture.

Data Mining Architecture


The significant components of data mining systems are a data source, data mining engine, data warehouse
server, the pattern evaluation module, graphical user interface, and knowledge base.

Data Source:
The actual source of data is the Database, data warehouse, World Wide Web (WWW), text files, and other
documents. You need a huge amount of historical data for data mining to be successful. Organizations typically store
data in databases or data warehouses. Data warehouses may comprise one or more databases, text files spreadsheets,
or other repositories of data. Sometimes, even plain text files or spreadsheets may contain information. Another
primary source of data is the World Wide Web or the internet.

Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and
selected. As the information comes from various sources and in different formats, it can't be used directly for the data
mining procedure because the data may not be complete and accurate. So, the first data requires to be cleaned and
unified. More information than needed will be collected from various data sources, and only the data of interest will
have to be selected and passed to the server. These procedures are not as easy as we think. Several methods may be
performed on the data as part of selection, integration, and cleaning.

Database or Data Warehouse Server:


The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data based on the user's data mining request.

Data Mining Engine:


The data mining engine is a major component of any data mining system. It contains several modules for performing data mining tasks, including association, characterization, classification, clustering, prediction, time-series analysis, etc. In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.

Pattern Evaluation Module:


The pattern evaluation module is primarily responsible for measuring how interesting a pattern is by using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns. This segment commonly employs interestingness measures that cooperate with the data mining modules to focus the search towards interesting patterns, and it might use an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining techniques used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure, so as to confine the search to only the interesting patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the user. This
module helps the user to easily and efficiently use the system without knowing the complexity of the process. This
module cooperates with the data mining system when the user specifies a query or a task and displays the results.

Knowledge Base:
The knowledge base is helpful throughout the data mining process. It can be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences that might be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.

3. KDD in data mining


KDD- Knowledge Discovery in Databases
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of discovering knowledge in data and emphasizes the high-level application of specific Data Mining techniques. It is a field of interest to researchers in various areas, including artificial intelligence, machine learning, pattern recognition, databases, statistics, knowledge acquisition for expert systems, and data visualization.
The main objective of the KDD process is to extract information from data in the context of large databases. It does this by using Data Mining algorithms to identify what is deemed knowledge. Knowledge Discovery in Databases can be considered an automated, exploratory analysis and modeling of vast data repositories.
KDD is the organized procedure of recognizing valid, useful, and understandable patterns in huge and complex data sets. Data Mining is the core of the KDD procedure and involves the inferring of algorithms that investigate the data, develop models, and discover previously unknown patterns. The models are used for extracting knowledge from the data, analyzing the data, and making predictions.
The availability and abundance of data today make knowledge discovery and Data Mining a matter of considerable significance and need. Given the recent development of the field, it is not surprising that a wide variety of techniques is presently available to specialists and experts.

The KDD Process


The knowledge discovery process (illustrated in the figure) is iterative and interactive and comprises nine steps. The process is iterative at each stage, implying that moving back to previous steps might be required. The process has many imaginative aspects, in the sense that one cannot present a single formula or a complete scientific categorization of the correct decisions for each step and application type. Thus, it is necessary to understand the process and the different requirements and possibilities of each stage.
The process begins with determining the KDD objectives and ends with the implementation of the discovered knowledge. At that point, the loop is closed and active Data Mining starts. Subsequently, changes would need to be made in the application domain, for example offering different features to cell phone users in order to reduce churn. This closes the loop, the effects are then measured on the new data repositories, and the KDD process is started again. The following is a concise description of the nine-step KDD process, beginning with a managerial step:

1. Building up an understanding of the application domain


This is the initial preliminary step. It sets the scene for understanding what should be done with the various decisions such as transformation, algorithms, and representation. The individuals in charge of a KDD project need to understand and characterize the objectives of the end-user and the environment in which the knowledge discovery process will take place (including relevant prior knowledge).
2. Choosing and creating a data set on which discovery will be performed
Once the objectives are defined, the data that will be utilized for the knowledge discovery process should be determined. This incorporates discovering what data is accessible, obtaining additional important data, and afterwards integrating all the data for knowledge discovery into one data set, including the attributes that will be considered for the process. This step is important because Data Mining learns and discovers from the available data, which is the evidence base for building the models. If some significant attributes are missing, the entire study may be unsuccessful; from this respect, the more attributes that are considered, the better. On the other hand, organizing, collecting, and operating advanced data repositories is expensive, so there is a trade-off against the opportunity for best understanding the phenomena. This trade-off is one aspect where the interactive and iterative nature of KDD shows itself: the process begins with the best available data sets and later expands them and observes the effect in terms of knowledge discovery and modeling.
3. Preprocessing and cleansing
In this step, data reliability is improved. It incorporates data cleaning, for example handling missing values and removing noise or outliers. It might involve complex statistical techniques or the use of a Data Mining algorithm in this context. For example, when one suspects that a specific attribute lacks reliability or has many missing values, this attribute could become the target of a supervised Data Mining algorithm: a prediction model is created for it, after which the missing data can be predicted. The extent to which one pays attention to this step depends on many factors. Regardless, studying these aspects is important and is often revealing in itself with respect to enterprise data frameworks.
4. Data Transformation
In this stage, appropriate data for Data Mining is created and developed. Techniques here include dimension reduction (for example, feature selection and extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformations). This step can be essential for the success of the entire KDD project, and it is typically very project-specific. For example, in medical assessments, the ratio between attributes may often be the most significant factor, rather than each attribute by itself. In business, we may need to consider effects beyond our control as well as efforts and transient issues, for example studying the effect of accumulated advertising. However, even if we do not use the right transformation at the start, we may obtain a surprising effect that hints at the transformation needed in the next iteration. Thus, the KDD process feeds back on itself and leads to an understanding of the transformation required.
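A short pandas sketch of two of the transformations mentioned above, discretization of a numerical attribute and a simple functional transformation; the income values and bin edges are assumptions made for the example:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [18000, 42000, 75000, 120000, 56000]})

# Discretization: turn the numerical attribute into ordered categories.
df["income_band"] = pd.cut(df["income"],
                           bins=[0, 30000, 80000, np.inf],
                           labels=["low", "medium", "high"])

# Functional transformation: a log transform to reduce skew.
df["log_income"] = np.log(df["income"])

print(df)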
5. Prediction and description
We are now prepared to decide which kind of Data Mining to use, for example classification, regression, or clustering. This mainly depends on the KDD objectives and on the previous steps. There are two major objectives in Data Mining: the first is prediction, and the second is description. Prediction is usually referred to as supervised Data Mining, while descriptive Data Mining includes the unsupervised and visualization aspects of Data Mining. Most Data Mining techniques rely on inductive learning, where a model is built explicitly or implicitly by generalizing from an adequate number of training examples. The fundamental assumption of the inductive approach is that the trained model applies to future cases. The technique also takes into account the level of meta-learning for the specific set of accessible data.
6. Selecting the Data Mining algorithm
Having chosen the technique, we now decide on the strategy. This stage involves choosing a particular method to be used for searching patterns, possibly involving multiple inducers. For example, considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees. For each strategy of meta-learning there are several possibilities for how it can be applied. Meta-learning focuses on explaining what causes a Data Mining algorithm to be successful or not on a specific problem. Thus, this methodology attempts to understand the conditions under which a Data Mining algorithm is most suitable. Each algorithm has parameters and strategies of learning, such as ten-fold cross-validation or another division into training and testing sets.
7. Utilizing the Data Mining algorithm
At last, the implementation of the Data Mining algorithm is reached. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained, for example by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.
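A brief scikit-learn sketch of this stage: running a decision tree several times with different values of the minimum-instances-per-leaf parameter and comparing cross-validated scores. The Iris data set is used only because it is a convenient built-in example, not because the source prescribes it:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Re-run the algorithm with different control parameters until the outcome is satisfying.
for min_leaf in (1, 5, 20):
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    score = cross_val_score(tree, X, y, cv=10).mean()   # ten-fold cross-validation
    print(min_leaf, round(score, 3))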
8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to reliability and to the objectives defined in the first step. Here we also reconsider the preprocessing steps with regard to their impact on the Data Mining algorithm results, for example by adding a feature in step 4 and repeating from there. This step focuses on the comprehensibility and utility of the induced model. The discovered knowledge is also documented for further use. The last step is the use of, and overall feedback on, the discovery results acquired by Data Mining.
9. Using the discovered knowledge
Now we are ready to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the effects. The success of this step determines the effectiveness of the whole KDD process. There are numerous challenges in this step, such as losing the "laboratory conditions" under which we have worked. For example, the knowledge was discovered from a certain static snapshot (usually a set of data), but now the data becomes dynamic. Data structures may change (certain quantities may become unavailable), and the data domain may be modified, for example an attribute may take a value that was not expected previously.
4. Briefly explain the architecture of Data warehouse
A data warehouse is a relational database that is designed for query and analysis rather than for transaction
processing. It usually contains historical data derived from transaction data, but it can include data from other sources.
It separates analysis workload from transaction workload and enables an organization to consolidate data from several
sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales.
Data Warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise.

Operational System
An operational system is a term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.

Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the system must
have a different name.

Meta Data
A set of data that defines and gives information about other data. Metadata is used in a Data Warehouse for a variety of purposes, including:
• Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
• Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data


This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. The goal of the summarized information is to speed up query performance. The summarized record is updated continuously as new information is loaded into the warehouse.

End-User access Tools


The principal purpose of a data warehouse is to provide information to the business managers for strategic decision-
making. These customers interact with the warehouse using end-client access tools.
Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Types of Data Warehouse Architectures

Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies. The figure shows that the only layer physically available is the source layer. In this method, data warehouses are virtual: the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are submitted to operational data after the middleware interprets them. In this way, queries affect transactional workloads.

Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system, as shown in the figure. Although it is typically called a two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of data. The data is initially stored in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into a data warehouse (a small sketch of this stage follows this list).
3. Data Warehouse layer: Information is saved to one logically centralized individual repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.
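A minimal sketch of the data-staging (ETL) stage referenced above: extract records from a source, transform them into a standard schema, and load them into a warehouse table. The source data, column names, cleaning rule, and database file name are all assumptions made for the illustration:

import sqlite3
import pandas as pd

# Extract: read raw records from a (hypothetical) source export.
source = pd.DataFrame({"cust": ["Asha", "Bala", None], "amt": ["250", "400", "150"]})
# In practice this might come from something like pd.read_csv("sales_export.csv").

# Transform: clean inconsistencies, fill gaps, and map to the standard schema.
source = source.dropna(subset=["cust"])
source["amt"] = source["amt"].astype(float)
source = source.rename(columns={"cust": "customer_name", "amt": "sale_amount"})

# Load: write the integrated data into the warehouse layer.
warehouse = sqlite3.connect("warehouse.db")
source.to_sql("sales_fact", warehouse, if_exists="append", index=False)
warehouse.close()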

Three-Tier Architecture

The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically in order to benefit from cleaning and integration.

This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used by the redundant reconciled layer. It also makes the analytical tools a little further away from being real-time.
5. Explain briefly the need for Data Science with real-time applications

Data Science is the deep study of large quantities of data, which involves extracting meaningful insights from raw, structured, and unstructured data. Extracting meaningful information from large amounts of data requires processing, and this processing can be done using statistical techniques and algorithms, scientific techniques, different technologies, etc. Data Science uses various tools and techniques to extract meaningful information from raw data. It is also known as the future of Artificial Intelligence.

For example, Jagroop loves to read books, but every time he wants to buy some he is confused about which book to choose, as there are plenty of options in front of him. This is where Data Science techniques are useful: when he opens Amazon, he gets product recommendations based on his previous data, and when he chooses one of them, he also gets a recommendation to buy other books along with it, because that set is frequently bought together. Such product recommendations, and showing the set of books bought together, are examples of Data Science.

Applications of Data Science

1. In Search Engines
The most useful application of Data Science is in search engines. As we know, when we want to search for something on the internet, we mostly use search engines such as Google or Yahoo through browsers like Safari and Firefox. Data Science is used to make searches faster. For example, when we search for something such as "Data Structure and algorithm courses", we get a link to GeeksforGeeks Courses at the top of the results. This happens because the GeeksforGeeks website is visited most often for information regarding Data Structure courses and computer-related subjects. This analysis is done using Data Science, which identifies the most visited web links.
2. In Transport
Data Science has also entered the transport field, for example with driverless cars. With the help of driverless cars, it is easy to reduce the number of accidents. For example, in driverless cars the training data is fed into the algorithm, and with the help of Data Science techniques the data is analyzed: what the speed limit is on highways, busy streets, and narrow roads, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in financial industries, which always face issues of fraud and risk of losses. Financial institutions therefore need to automate risk-of-loss analysis in order to carry out strategic decisions for the company. They also use Data Science analytics tools to predict the future, which allows companies to estimate customer lifetime value and anticipate stock market moves. For example, in the stock market, Data Science is used to examine past behavior from historical data with the goal of estimating the future outcome. Data is analyzed in such a way that it becomes possible to predict future stock prices over a set time frame.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to provide a better user experience through personalized recommendations. For example, when we search for something on an e-commerce website, we get suggestions similar to our choices based on our past data, and we also get recommendations based on the most-bought, most-rated, and most-searched products. This is all done with the help of Data Science.
5. In Health Care
In the healthcare industry, data science acts as a boon. Data Science is used for:
• Detecting tumors.
• Drug discovery.
• Medical image analysis.
• Virtual medical bots.
• Genetics and genomics.
• Predictive modeling for diagnosis, etc.
6. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a photo with a friend on Facebook, Facebook suggests tagging the people in the picture. This is done with the help of machine learning and Data Science. When an image is recognized, data analysis is performed on one's Facebook friends, and if a face in the picture matches someone's profile, Facebook suggests auto-tagging that person.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever the user searches for on the internet, he or she will then see related advertisements everywhere. This can be explained with an example: suppose I want a mobile phone, so I search for it on Google, and after that I change my mind and decide to buy it offline. Data Science helps the companies that pay for advertisements for their phones, so everywhere on the internet, in social media, on websites, and in apps, I will see recommendations for the mobile phone I searched for. This nudges me to buy it online.
8. Airline Route Planning
With the help of Data Science, the airline sector is also growing; for example, it becomes easy to predict flight delays. Data Science also helps to decide whether to fly directly to the destination or take a halt in between: a flight can have a direct route from Delhi to the U.S.A., or it can halt in between before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used together with machine learning: with the help of past data, the computer improves its performance. Many games, such as Chess and EA Sports titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming and has to be done with full discipline, because someone's life depends on it. Without Data Science it takes a lot of time, resources, and money to develop a new medicine or drug, but with the help of Data Science it becomes easier, because the likely success rate can be predicted based on biological data and factors. Algorithms based on data science can forecast how a compound will react in the human body without lab experiments.
11. In Delivery Logistics
Various logistics companies like DHL, FedEx, etc. make use of Data Science. It helps these companies find the best route for the shipment of their products, the best time for delivery, the best mode of transport to reach the destination, and so on.
12. Autocomplete
The autocomplete feature is an important application of Data Science: the user types just a few letters or words, and the system offers to complete the rest of the line. In Google Mail, when we are writing a formal mail to someone, the data science behind the autocomplete feature suggests an efficient way to complete the whole sentence. The autocomplete feature is also widely used in search engines, in social media, and in various apps.
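A toy sketch of the idea behind autocomplete, ranking known phrases by how often they were typed before and matching on the current prefix; the phrase history is invented, and real systems use far richer language models:

# Hypothetical history of previously typed phrases with their frequencies.
history = {
    "thank you for your time": 42,
    "thank you for the update": 30,
    "thanks and regards": 18,
}

def autocomplete(prefix, k=2):
    matches = [p for p in history if p.startswith(prefix.lower())]
    return sorted(matches, key=history.get, reverse=True)[:k]

print(autocomplete("thank you"))   # ['thank you for your time', 'thank you for the update']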

6. How do you set the research goal, retrieve data, and prepare data in the Data Science process?
The data science process is a systematic approach to solving a data problem. It provides a structured framework for articulating your problem as a question, deciding how to solve it, and then presenting the solution to stakeholders. The typical data science process consists of six steps.
Step 1: Setting a research goal
Step 2: Data retrieval
Step 3: Data preparation
Step 4: Data exploration
Step 5: Data modeling / model building
Step 6: Presenting your results and automating the analysis
Step 1: Defining research goals and creating a project charter
A project starts by understanding the what, the why, and the how of your project (figure 2.2). What does the company
expect you to do? And why does management place such a value on your research? Is it part of a bigger strategic
picture or a “lone wolf” project originating from an opportunity someone detected? Answering these three questions
(what, why, how) is the goal of the first phase, so that everybody knows what to do and can agree on the best course of
action.
Figure 2.2. Step 1: Setting the research goal

The outcome should be a clear research goal, a good understanding of the context, well-defined deliverables, and a
plan of action with a timetable. This information is then best placed in a project charter. The length and formality can,
of course, differ between projects and companies. In this early phase of the project, people skills and business acumen
are more important than great technical prowess, which is why this part will often be guided by more senior personnel.

Spend time understanding the goals and context of your research


An essential outcome is the research goal that states the purpose of your assignment in a clear and focused manner.
Understanding the business goals and context is critical for project success. Continue asking questions and devising
examples until you grasp the exact business expectations, identify how your project fits in the bigger picture, appreciate
how your research is going to change the business, and understand how they’ll use your results.

Create a project charter


Clients like to know upfront what they’re paying for, so after you have a good understanding of the business problem,
try to get a formal agreement on the deliverables. All this information is best collected in a project charter. For any
significant project this would be mandatory.
A project charter requires teamwork, and your input covers at least the following:
• A clear research goal
• The project mission and context
• How you're going to perform your analysis
• What resources you expect to use
• Proof that it's an achievable project, or proof of concepts
• Deliverables and a measure of success
• A timeline
Your client can use this information to estimate the project costs and the data and people required for your project to become a success.

7. Explain the data exploration, data modeling, and presentation processes in Data Science.

You might also like