
PROJECT ON: Analysis of Trends in Automobile Industry Using Business Intelligence and Business Analytics

1. EXECUTIVE SUMMARY
Sales prediction is a major current trend on which businesses thrive; it helps an organization or concern set its future goals and decide the plans and procedures needed to achieve them. The data about car sales are derived from various sources. Car sales have no single independent variable: features such as horsepower, model, width, fuel type, height, price, city mileage, highway mileage and manufacturer all influence sales.

For car sales prediction we first apply the analytic hierarchy process to get a varied idea of how well the various criteria in our dataset work; after this we apply machine learning algorithms such as linear regression, and we feed the results into a random forest to extract the most influential feature.
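The pipeline described above can be sketched with scikit-learn. This is a hedged illustration on synthetic data, not the report's actual code; the column names (horsepower, price, city_mileage) merely mirror the features listed in the summary, and the target is fabricated.

```python
# Illustrative sketch: fit a linear regression, then use a random forest's
# impurity-based importances to rank features, on synthetic data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "horsepower": rng.uniform(60, 300, n),
    "price": rng.uniform(10_000, 60_000, n),
    "city_mileage": rng.uniform(8, 25, n),   # deliberately irrelevant here
})
# Synthetic sales target driven mostly by price and horsepower.
y = 500 - 0.005 * X["price"] + 0.8 * X["horsepower"] + rng.normal(0, 5, n)

linreg = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Rank features by the forest's importance scores (they sum to 1).
importance = pd.Series(forest.feature_importances_, index=X.columns)
```

On data like this, `importance.sort_values(ascending=False)` surfaces the features that actually drive the target, which is the role the random forest plays in the report's methodology.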

The work is carried out in the Jupyter file browser of Anaconda Navigator, a tool that helps the researcher reach a verdict when facing one or more pattern-selection problems. The final result derived from all these methods gives the feature that most strongly influences a customer's decision to purchase a car, which in turn gives the company or the research market a basis for predicting future car sales. The Jupyter notebooks contain Python code.

Using a regression analysis with predicted variables, the asymmetric relationship between
attribute-level performance and overall customer satisfaction could be confirmed. However,
the analysis of time-series data reveals that a positive and significant relationship exists
between changes in customer satisfaction and changes in the performance of the firm.
Therefore, the impact of an increase in customer satisfaction on profits, although obscured in
the short run by many factors, is significantly positive in the long run.

2. INTRODUCTION

Every human being has a business of some kind in this world, and the term business comes with certain associated words such as sales, profit and loss. A concern's success is determined by its sales, and the performance of its product in the industry is likewise identified by its sales. Hence, strategies, techniques and methodologies are built in order to improve the standard of the firm or concern. In this development process, forecasting key terms such as profit, loss and return on investment may pave the way for a successful venture; yet when we look deeply into the details, all these key terms are ultimately related to the term sales.

Sales prediction is one of the master trades of business: it opens the gateway to knowledge about existing market trends and the ways to conquer the market. Planning is the first step in every activity we perform, and knowing what lies ahead in terms of sales greatly aids the concern or organization in this planning process. Sales prediction can be a massive support for tracking, cash flow and purchasing. Sales forecasting gives business minds an idea of how much to buy and how much not to buy, what risks can be taken in terms of revenue, how to plan the budget, market trends, the introduction of new products according to the organization's capability and ability, and what changes may follow if a plan fails. Let us now shift focus from the generic business term to its detailed streams, such as education, medicine, automobiles and many others.

One area of great interest in recent days is the automobile industry. The first boon to the automobile industry came only after the period of industrialization. Today the automobile industry creates, packs, runs and moves the world on its wheels, and allows other fields to flourish in parallel with it. Yet it also has its own drawbacks: it is not cost friendly, a single defect can affect the whole system and cause a huge market failure for the brand, and a huge amount of revenue is spent while huge quantities of goods are produced and consumed on a daily basis. Hence, sales forecasting in this industry can lead to organized patterns of selling, buying and producing goods, and even the taxes to be levied come into play.

Forecasting sales in the automobile industry can be performed with a variety of technologies, one of which is data analytics using machine learning techniques. Consider the prediction for a particular automobile, say a car: if its yearly sales are known in advance, the manufacturer gains a huge boost in designing it, procuring spare and key parts, reducing waste, and tracking its revenue model and various other activities. Classifiers such as logistic regression and random forest provide us with accurate predicted results.

2.1. COMPANY OVERVIEW

Konigtronics (OPC) Private Limited in Figure 1.1 was started in August 2016 by Mr. Vishesh S, B.E. (Telecommunication), to educate students and the public on innovative methods of Research and Development in the field of Engineering and Technology. Konigtronics (OPC) Private Limited was incorporated as a corporate company by the Ministry of Corporate Affairs (MCA) on 9th January 2017 (pursuant to sub-section (2) of section 7 of the Companies Act, 2013 and Rule 18 of the Companies (Incorporation) Rules, 2014).

Konigtronics (OPC) Pvt Ltd bears CIN U80904KA2017OPC099075, and its Managing Director Mr. Siddesh Vishesh bears DIN 07685912.

Vishesh S hails from Bangalore, Karnataka, India. He completed a B.E. in Telecommunication Engineering from VTU, Belgaum, Karnataka in 2015. His research interests include Embedded Systems, Wireless Communication, BAN and Medical Electronics. He is also the Founder and Managing Director of the corporate company Konigtronics Private Limited. He has guided over a hundred students, interns and professionals in their research work and projects, and is a co-author of many international research papers.

Konigtronics Pvt Ltd has extended its services into web and application development and software engineering, and has produced and successfully commercialized numerous products. Konigtronics has conducted many technical workshops on topics such as embedded systems, IoT (Internet of Things), cloud computing, network simulation and bio-medical signal processing. Konigtronics has now extended its services into real estate and civil works, career planning and entrepreneurship. The company is located in a prime location in central Bangalore and has as many as 3 dedicated engineers and several interns working in it. Konigtronics has to date guided over 400 students and professionals in their projects and research and development activities.

A. Research and Development

• Research on various topics like WBAN, signal processing, software engineering, data engineering, etc.
• Submission of the outcome of the research as a report to an international jury (international research papers/journals).
• Development of various products after in-depth research or following an iterative approach in development.

B. Software engineering

• Web design
• Application design
• Mobile application design

C. Technical and Non-technical training

• Training for students of engineering
• Professional training
• Training for non-engineers
• Placement training and job consultancy

3. PROJECT PROFILE

3.1 OBJECTIVES OF THE STUDY


In today’s automobile market, competitors are huge in number, resulting in high competition. Almost all automobile manufacturers offer the same or similar products with minor changes that make each unique in its own way. As a result, product pricing becomes a major issue. In any industry there are huge data sets about all the resources (human, raw materials and so on) and about forecasts (inventory, production, sales and so on). Evaluating the product using these huge data sets is a difficult and fatiguing process. Using business analytics and business intelligence we can simplify the work, which is more profitable for the industry.

3.2. METHODOLOGY

3.2.1 Data acquisition


We study data acquisition for business analytics considering both data quality and acquisition
cost.

Customer data is the lifeblood of today’s data-driven business environment.


Organizations rely on data to perform business analytics to improve operations, gain managerial insights, and develop business strategy in order to gain competitive advantages. Organizations often already have a large amount of customer data in their databases. In many applications, however, it is necessary to acquire new data for business analytics. The two scenarios below illustrate two different customer-data acquisition tasks for different business applications.

Scenario 1. An online company considers whether or not to carry an additional product line. The product line is new to the company, but related products exist on the market. The company has a database of its customers, but no data for the new product line. If the company can acquire its customers’ purchase data for the related products from other market sources, it will be able to make a reasonable estimate of the sales of the new products. In this scenario, the objective is to obtain aggregated information for descriptive analytics about some new attributes (e.g., average purchase amount per customer for the new products).

Scenario 2. As in Scenario 1, the company would like to acquire its customers’ purchase data for the new products from other market sources. But the objective now is to develop a predictive model of the relationships between the purchase amount for a new product and customers’ profiles (e.g., demographics, purchase amounts for existing products, etc.), so that the company can focus its marketing effort on the right customers. In this case, the acquired data will be used for the new target variable (purchase amount for a new product) in predictive analytics. Alternatively, the objective for predictive analytics can also be to acquire data for predictor variables (rather than the target variable) that are not available in the current customer databases.

A company may have millions of customer records, and it is practically impossible to acquire data from all of its customers due to cost constraints. For the purposes of information gathering and business analytics, it suffices to acquire the data from a representative sample of the customer population.
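Drawing a representative sample, as described above, might look like this in pandas. The customer table and segment labels are invented for illustration.

```python
# Simple and stratified random sampling of a customer table with pandas.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": range(1, 1001),
    "segment": ["retail"] * 700 + ["corporate"] * 300,
})

# Simple random sample of 10% of customers (fixed seed for repeatability).
sample = customers.sample(frac=0.10, random_state=42)

# Stratified sample: preserve the 70/30 segment mix of the full population.
stratified = customers.groupby("segment").sample(frac=0.10, random_state=42)
```

Stratifying by a key attribute keeps the sample representative of the population mix, which is exactly the property the paragraph above asks for.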

Data acquisition includes abstracting information from hard copy records, acquiring
information from electronic records, acquiring data from files that reside on the internet or in
other repositories, linking extant data files to other data repositories or to existing primary
survey data. Data acquisition is growing in demand because it allows researchers to take
advantage of a wide variety of existing data sources.

Survey

Survey data is the data collected from a sample of respondents who took a survey: comprehensive information gathered from a target audience about a particular topic of interest, on the basis of which research is conducted.

There are many methods used to gather survey data for statistical analysis in research. Various mediums are used to collect feedback and opinions from the desired sample of individuals. While conducting survey research, researchers prefer multiple sources for gathering data, such as online surveys, telephonic surveys, face-to-face surveys, etc. The medium used for gathering survey data determines the sample of people to be reached out to in order to collect the requisite number of survey responses.

Factors in collecting survey data, such as how the interviewer will contact the respondent (online or offline) and how the information is communicated to the respondents, decide the effectiveness of the gathered data.

There are four main survey data collection methods: telephonic surveys, face-to-face surveys, online surveys, and paper surveys.

• Online Surveys
Online surveys are the most cost-effective and can reach the maximum number of people compared with the other mediums. Their reach is much more widespread than that of the other data collection methods. When there is more than one question to be asked of the target sample, some researchers prefer online surveys over traditional face-to-face or telephone surveys.

• Face-to-face Surveys
Gathering information from respondents face to face is much more effective than the other mediums, because respondents tend to trust the surveyors and provide honest and clear feedback on the subject at hand.

• Telephone Surveys
Telephone surveys require a much smaller investment than face-to-face surveys. Depending on the required reach, they cost about as much as, or a little more than, online surveys. Contacting respondents by telephone requires less effort and manpower than the face-to-face medium.

• Paper Surveys
The other commonly used survey method is the paper survey. Paper surveys can be used where laptops, computers and tablets cannot go, falling back on the age-old method of data collection: pen and paper. This method helps collect survey data in field research and strengthens both the number of responses collected and their validity.

Reports

A report is a document that presents information in an organized format for a specific audience and purpose. Although summaries of reports may be delivered orally, complete reports are almost always written documents.

The types are: 1. Formal or Informal Reports 2. Short or Long Reports 3. Informational or Analytical Reports 4. Proposal Reports 5. Vertical or Lateral Reports 6. Internal or External Reports 7. Periodic Reports 8. Functional Reports.

3.2.2 Data mining


Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. It is an interdisciplinary subfield of computer science and statistics whose overall goal is to extract information (with intelligent methods) from a data set and transform it into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

The term "data mining" is something of a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It is also a buzzword, frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of computerized decision support, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data Mining: Practical Machine Learning Tools and Techniques with Java (which covers mostly machine learning material) was originally to be named just Practical Machine Learning, and the term data mining was only added for marketing reasons. Often the more general terms (large-scale) data analysis and analytics (or, when referring to actual methods, artificial intelligence and machine learning) are more appropriate.


Consider a basic set of tasks that constitute a data acquisition process:

• A need for data is identified, perhaps with use cases
• Prospecting for the required data is carried out
• Data sources are disqualified, leaving a set of qualified sources
• Vendors providing the sources are contacted and legal agreements entered into for
evaluation
• Sample data sets are provided for evaluation
• Semantic analysis of the data sets is undertaken, so they are adequately understood
• The data sets are evaluated against originally established use cases
• Legal, privacy and compliance issues are understood, particularly with respect to
permitted use of data
• Vendor negotiations occur to purchase the data
• Implementation specifications are drawn up, usually involving Data Operations who will
be responsible for production processes
• Source onboarding occurs, such that ingestion is technically accomplished

3.2.3 Data Inspection


Data scientists spend a large amount of their time cleaning datasets and getting them down to
a form with which they can work. In fact, a lot of data scientists argue that the initial steps of
obtaining and cleaning data constitute 80% of the job.
Therefore, if you are just stepping into this field or planning to step into this field, it is
important to be able to deal with messy data, whether that means missing values, inconsistent
formatting, malformed records, or nonsensical outliers.
Steps involved
• Dropping unnecessary columns in a DataFrame
• Changing the index of a DataFrame
• Using .str() methods to clean columns
• Using the DataFrame.applymap() function to clean the entire dataset, element-wise
• Renaming columns to a more recognizable set of labels
• Skipping unnecessary rows in a CSV file
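The cleaning steps listed above can be sketched in pandas. The column names and values here are invented for illustration, not taken from the report's dataset.

```python
# A minimal pandas sketch of common cleaning steps: dropping columns,
# re-indexing, .str methods, and renaming. Skipping rows in a CSV would
# use pd.read_csv(path, skiprows=n).
import pandas as pd

df = pd.DataFrame({
    "Identifier": [101, 102, 103],
    "Place of Publication": ["London", "London;\n", "Oxford "],
    "Unwanted": [0, 0, 0],
})

df = df.drop(columns=["Unwanted"])      # drop unnecessary columns
df = df.set_index("Identifier")         # change the index
df["Place of Publication"] = (
    df["Place of Publication"].str.strip().str.rstrip(";")  # .str cleanup
)
df = df.rename(columns={"Place of Publication": "place"})   # clearer labels
```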
3.2.4 Data preprocessing
When we talk about data, we usually think of large datasets with huge numbers of rows and columns. While that is a likely scenario, it is not always the case: data can come in many different forms, such as structured tables, images, audio files and videos.

Machines don’t understand free text, image or video data as it is; they understand 1s and 0s. So it probably won’t be good enough to put on a slideshow of all our images and expect our machine learning model to be trained just by that!


In any machine learning process, data preprocessing is the step in which the data gets transformed, or encoded, to bring it to a state that the machine can easily parse. In other words, the features of the data can then be easily interpreted by the algorithm.

• Categorical: Features whose values are taken from a defined set of values. For instance,
days in a week {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} is a
category because its value is always taken from this set. Another example could be the
Boolean set {True, False}

• Numerical: Features whose values are continuous or integer-valued. They are represented by numbers and possess most of the properties of numbers, for instance the number of steps you walk in a day, or the speed at which you drive your car.

Because data is often taken from multiple sources, which are normally not entirely reliable and which use different formats, more than half our time is consumed by data quality issues when working on a machine learning problem. It is simply unrealistic to expect the data to be perfect. There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process. Let’s go over a few of them and the methods for dealing with them.

1. Missing values:
It is very common to have missing values in a dataset. They may have arisen during data collection, or due to some data validation rule, but regardless, missing values must be taken into consideration.

• Eliminate rows with missing data:
A simple and sometimes effective strategy, which fails if many objects have missing values. If a feature consists mostly of missing values, that feature itself can also be eliminated.


• Estimate missing values:
If only a reasonable percentage of values are missing, we can run simple interpolation methods to fill them in. However, the most common method of dealing with missing values is to fill them in with the mean, median or mode of the respective feature.
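Both strategies above can be sketched in pandas; the toy columns are illustrative only.

```python
# Dropping rows with missing values versus filling them with a summary
# statistic (mean for a continuous feature, mode for a discrete one).
import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 14.0, 12.0],
                   "doors": [2, 4, None, 4]})

dropped = df.dropna()                                 # eliminate rows
filled = df.fillna({"price": df["price"].mean(),      # estimate values
                    "doors": df["doors"].mode()[0]})
```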

2. Inconsistent values:
Data can contain inconsistent values, and most of us have probably faced this issue at some point. For instance, the ‘Address’ field may contain a phone number, whether due to human error or because the information was misread while being scanned from a handwritten form.

• It is therefore always advised to perform data assessment, such as knowing what the data type of each feature should be and whether it is the same for all the data objects.

3. Duplicate values :
A dataset may include data objects which are duplicates of one another. It may happen when
say the same person submits a form more than once. The term deduplication is often used to
refer to the process of dealing with duplicates.

• In most cases, the duplicates are removed so as to not give that particular data object an
advantage or bias, when running machine learning algorithms.
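Deduplication in pandas might look like this; the form data is made up for the example.

```python
# Remove duplicate records so no data object is over-weighted by the model.
import pandas as pd

responses = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "rating": [5, 3, 5],
})

deduplicated = responses.drop_duplicates()            # exact-row duplicates
# Or treat the email alone as the identity key, keeping the first entry:
by_email = responses.drop_duplicates(subset="email", keep="first")
```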

3.2.5 Data visualization


Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.

When you think of data visualization, your first thought probably immediately goes to simple
bar graphs or pie charts. While these may be an integral part of visualizing data and a
common baseline for many data graphics, the right visualization must be paired with the right
set of information. Simple graphs are only the tip of the iceberg. There’s a whole selection of
visualization methods to present data in effective and interesting ways.

Data visualization involves specific terminology, some of which is derived from statistics. For example, author Stephen Few defines two types of data, which are used in combination to support a meaningful analysis or visualization:


• Categorical: Text labels describing the nature of the data, such as "Name" or "Age". This
term also covers qualitative (non-numerical) data.
• Quantitative: Numerical measures, such as "25" to represent the age in years.
Two primary types of information displays are tables and graphs.
• A table contains quantitative data organized into rows and columns with categorical
labels. It is primarily used to look up specific values. In the example above, the table
might have categorical column labels representing the name (a qualitative variable) and
age (a quantitative variable), with each row of data representing one person (the
sampled experimental unit or category subdivision).
• A graph is primarily used to show relationships among data and portrays values encoded
as visual objects (e.g., lines, bars, or points). Numerical values are displayed within an
area delineated by one or more axes. These axes provide scales (quantitative and
categorical) used to label and assign values to the visual objects. Many graphs are also
referred to as charts.
A bar chart or bar graph is a chart or graph that presents categorical
data with rectangular bars with heights or lengths proportional to the values that they
represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes
called a column chart.

A bar graph shows comparisons among discrete categories. One axis of the chart shows the
specific categories being compared, and the other axis represents a measured value. Some bar
graphs present bars clustered in groups of more than one, showing the values of more than
one measured variable.
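A minimal matplotlib sketch of such a bar chart; the fuel-type counts are invented for illustration.

```python
# A vertical bar chart: categorical labels on one axis, measured values
# on the other. Rendered off-screen so no display is required.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

fuel_types = ["petrol", "diesel", "electric"]
counts = [120, 80, 25]

fig, ax = plt.subplots()
ax.bar(fuel_types, counts)          # use ax.barh() for horizontal bars
ax.set_xlabel("Fuel type")
ax.set_ylabel("Number of cars")
n_bars = len(ax.patches)            # one rectangle per category
```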


A histogram is an approximate representation of the distribution of numerical data. Taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data and is particularly useful for large data sets. A histogram divides the variable’s values into equal-sized intervals, so we can see the number of individuals in each interval.
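A histogram along these lines, on synthetic mileage data (the figures are invented):

```python
# Divide a continuous variable into 10 equal-sized intervals and count
# the observations in each.
import matplotlib
matplotlib.use("Agg")   # off-screen rendering
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
mileage = rng.normal(15, 3, 500)    # synthetic city-mileage sample

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(mileage, bins=10)
ax.set_xlabel("City mileage")
ax.set_ylabel("Number of cars")
```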

Scatter plot

A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
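A scatter plot with a third variable encoded by point color, on synthetic data:

```python
# Two variables positioned on Cartesian axes; color carries a third cue.
import matplotlib
matplotlib.use("Agg")   # off-screen rendering
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
horsepower = rng.uniform(60, 300, 100)
price = 200 * horsepower + rng.normal(0, 5000, 100)  # loosely related

fig, ax = plt.subplots()
points = ax.scatter(horsepower, price, c=price, s=20)
ax.set_xlabel("Horsepower")
ax.set_ylabel("Price")
```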


A heat map (or heatmap) is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space. There are two fundamentally different categories of heat map: the cluster heat map and the spatial heat map. In a cluster heat map, magnitudes are laid out into a matrix of fixed cell size whose rows and columns are discrete phenomena and categories; the sorting of rows and columns is intentional and somewhat arbitrary, with the goal of suggesting clusters or portraying them as discovered via statistical analysis. The size of the cell is arbitrary but large enough to be clearly visible. By contrast, the position of a magnitude in a spatial heat map is forced by the location of the magnitude in that space; there is no notion of cells, and the phenomenon is considered to vary continuously.
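A cluster-style heat map can be sketched with matplotlib's imshow; the matrix of sales counts and its labels are invented.

```python
# Cluster heat map: rows and columns are discrete categories laid out in
# a fixed-cell grid, and cell color encodes magnitude.
import matplotlib
matplotlib.use("Agg")   # off-screen rendering
import matplotlib.pyplot as plt
import numpy as np

data = np.array([[30, 12, 5],
                 [22, 18, 9],
                 [10, 25, 14]])

fig, ax = plt.subplots()
image = ax.imshow(data, cmap="viridis")
ax.set_xticks(range(3), ["sedan", "SUV", "hatchback"])
ax.set_yticks(range(3), ["2019", "2020", "2021"])
fig.colorbar(image)
```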


KDE

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. The plotting function used here applies Gaussian kernels and includes automatic bandwidth determination.
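A sketch using scipy's gaussian_kde, which applies Gaussian kernels with automatic bandwidth selection (Scott's rule by default), matching the description above:

```python
# Non-parametric density estimate of a synthetic sample; evaluating the
# estimate near the mode of a standard normal should give roughly 0.4.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
sample = rng.normal(0, 1, 1000)

kde = gaussian_kde(sample)          # bandwidth chosen automatically
density_at_zero = kde(0.0)[0]       # estimated PDF at x = 0
```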


3.2.6 Correlation
Correlation is used to test relationships between quantitative variables or categorical
variables. In other words, it’s a measure of how things are related. The study of how variables
are correlated is called correlation analysis.

Correlation values range between -1 and 1.

There are two key components of a correlation value:

• magnitude – the larger the magnitude (closer to 1 or -1), the stronger the correlation
• sign – if negative, there is an inverse correlation; if positive, there is a regular correlation

A correlation could be positive, meaning both variables move in the same direction, or
negative, meaning that when one variable’s value increases, the other variables’ values
decrease. Correlation can also be neutral or zero, meaning that the variables are unrelated.

• Positive Correlation: both variables change in the same direction.
• Neutral Correlation: no relationship in the change of the variables.
• Negative Correlation: variables change in opposite directions.
The performance of some algorithms can deteriorate if two or more variables are tightly related, a situation called multicollinearity. An example is linear regression, where one of the offending correlated variables should be removed in order to improve the skill of the model.

We may also be interested in the correlation between the input variables and the output variable, in order to provide insight into which variables may or may not be relevant as inputs for developing a model.

The structure of the relationship may be known, e.g. it may be linear, or we may have no idea
whether a relationship exists between two variables or what structure it may take. Depending
what is known about the relationship and the distribution of the variables, different
correlation scores can be calculated.

A correlation matrix is closely related to the covariance matrix (also known as the auto-covariance matrix, dispersion matrix, variance matrix, or variance-covariance matrix): it is the covariance matrix of the standardized variables. It is a matrix in which the (i, j) position gives the correlation between the i-th and j-th parameters of the given dataset.
When the data points follow a roughly straight-line trend, the variables are said to have an
approximately linear relationship. In some cases, the data points fall close to a straight line,
but more often there is quite a bit of variability of the points around the straight-line trend. A
summary measure called the correlation describes the strength of the linear association.
Correlation summarizes the strength and direction of the linear (straight-line) association
between two quantitative variables. Denoted by r, it takes values between -1 and +1. A
positive value for r indicates a positive association, and a negative value for r indicates a negative association.
The closer r is to 1 (or -1), the closer the data points fall to a straight line and the stronger the linear association; the closer r is to 0, the weaker the linear association.
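Computing a correlation matrix with pandas, on synthetic data where horsepower and price are deliberately related and door count is not:

```python
# df.corr() returns Pearson r for every pair of numeric columns; each
# value lies in [-1, 1], and the diagonal is exactly 1.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
horsepower = rng.uniform(60, 300, 200)
df = pd.DataFrame({
    "horsepower": horsepower,
    "price": 150 * horsepower + rng.normal(0, 2000, 200),  # strongly related
    "doors": rng.integers(2, 5, 200),                      # unrelated
})

corr = df.corr()
r = corr.loc["horsepower", "price"]
```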

4. OBSERVATIONS AND ANALYSIS


Amid the huge amount of data available in the automobile industry, gathering and analysing the data became a tedious job; to overcome this, data analytics/business analytics comes into the picture.

Analysis of automobile data had to be done on:


• Attributes
• Standard features
• Sales
• Manufacturing competitors
• Customer satisfaction
Before data analytics came into the picture, documentation and analysis of data was done manually, aided by computers using MS Excel as a tool.


But the use of data analytics improves readability and helps in analysing huge volumes of data.
Our objective was to analyse the attributes of automobile data, making the analysis of the data clearer and easier; this led to effective time management in the analysis, comparison of data to reduce redundancy, and a future prediction of sales.
The basic steps involved in obtaining these results are:
1. Data acquisition
The first step in the analysis of data is its acquisition. The data acquired for our purpose comes from survey reports.
2. Data inspection
The acquired data is then inspected; inspection helps to identify and record problems for corrective action. During inspection, properties and methods such as import, size, shape, head, tail, describe, data types, isnull().any() and isnull().sum() are examined.

IMPORT
Import the required class libraries and the master data.

SIZE
Determine the size (total number of elements) of the data.


SHAPE
Determine the number of rows and columns.

HEAD
Specified number of rows are displayed from top of the data.

TAIL
Specified number of rows are displayed from bottom of the data.

DESCRIBE
Gives the value of count, mean, standard deviation and so on.


DATA TYPES
Describes the type of data which is present such as int, float, object and so on.

isnull().any()
Indicates, for each column, whether any missing data are present (boolean).


isnull().sum()
Gives the number of missing values in each column.
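In pandas, these inspection steps map onto the following calls; a minimal sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Made-up slice of the car dataset with one missing horsepower value
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi"],
    "horsepower": [110.0, np.nan, 140.0],
})

print(df.size)            # total number of cells (rows x columns)
print(df.shape)           # (number of rows, number of columns)
print(df.head(2))         # first rows
print(df.tail(1))         # last rows
print(df.describe())      # count, mean, standard deviation, ...
print(df.dtypes)          # data type of each column
print(df.isnull().any())  # per column: does it contain missing data?
print(df.isnull().sum())  # per column: how many values are missing?
```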

3. Data pre-processing
The obtained data contained unstructured, raw, crude, missing and string values that are
not recognizable to the machine. Therefore, the data needed to be pre-processed. Further,
these data were converted into categorical form with an integer datatype.
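One common way to carry out that string-to-integer conversion, sketched here with pandas' category dtype (the column name and values are illustrative):

```python
import pandas as pd

# String column that the model cannot consume directly
df = pd.DataFrame({"fuel_type": ["diesel", "petrol", "diesel", "petrol"]})

# Encode each distinct string as an integer categorical code
df["fuel_type_code"] = df["fuel_type"].astype("category").cat.codes
print(df["fuel_type_code"].tolist())  # [0, 1, 0, 1]
```

Each distinct string gets a stable integer code (here "diesel" becomes 0 and "petrol" becomes 1, in lexicographic order), giving the machine-readable categorical data the pre-processing step requires.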

Some of the keys used here are:


4. Data visualization


• It is the graphic representation of data using visual elements such as charts, graphs
and maps.
• These tools help to see and understand trends, outliers and patterns in data.
• The goal is to communicate information clearly and efficiently to users.
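A sketch of one such visual, a bar chart of make counts rendered with matplotlib (the counts are placeholders in the spirit of the graphs described later in this section):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder make counts
counts = pd.Series({"alfa-romero": 18, "audi": 48, "bmw": 40})

fig, ax = plt.subplots()
counts.plot.bar(ax=ax)  # one bar per make
ax.set_xlabel("Make")
ax.set_ylabel("Count (units)")
ax.set_title("Car models and their counts")
fig.tight_layout()
fig.savefig("make_counts.png")
```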

SOME OF THE ATTRIBUTES PRESENT ARE


• Make/Brand
• Fuel type
• Engine type
• Number of doors
• Body Style
• Drive wheels
• Engine location
• Wheel base
• Length
• Width
• Height
• Curb weight
• Number of cylinders
• Engine size
• Bore
• Compression ratio
• Horsepower
• Peak RPM
• City MPG
• Highway MPG
• Price
• Airbags
Below are a few graphs representing the counts of the different attributes used:
• Bar graph representing the car models and their counts in units.
• Alfa-Romero, represented by 1, has a count of 18.
• Audi, represented by 2, has a count of 48.
• BMW, represented by 3, has a count of 40.
The remaining brands have their counts likewise.


• The second graph represents the Satisfaction vs. Make count.

Yellow-Satisfied customers
Blue-Unsatisfied customers
• Fuel Type


Green-DIESEL
Blue-PETROL

5. Correlation

Correlation is a statistical technique that can show whether and how strongly pairs of
variables are related.
Below is a heat map of the dataset.
A heat map is a representation of data in the form of a map or diagram in which data
values are represented as colours.
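A sketch of producing such a correlation heat map with pandas and matplotlib (the columns and values are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative numeric columns from a car dataset
df = pd.DataFrame({
    "horsepower": [70, 95, 110, 140, 200],
    "price": [9000, 12500, 16000, 21000, 35000],
    "city_mpg": [38, 31, 27, 24, 17],
})

# Entry (i, j) of this matrix is the correlation between columns i and j
corr = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)  # colour scale from -1 to +1
fig.tight_layout()
fig.savefig("heatmap.png")
```

In this toy data, horsepower and price are strongly positively correlated, while horsepower and city mileage are strongly negatively correlated, which would show up as warm and cool cells respectively.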


COEFFICIENTS OF HEAT MAP

6. Machine learning

Machine learning is an application of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being explicitly
programmed. This is carried out in three stages:
• Implementation
• Training and Testing
• Accuracy Testing
After the implementation of machine learning, training and testing are done to obtain
the accuracy, by randomly splitting and rearranging all the available data.
Data splitting is the act of partitioning the available data into two portions, usually for
cross-validation purposes. One portion of the data is used to develop a predictive model
and the other to evaluate the model's performance.
The first random split uses 70% of the data for training and 30% for testing, giving a
test accuracy of 84.39%.


Similarly, the second random split uses 80% for training and 20% for testing, giving a
test accuracy of 86.07%.
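As a hypothetical sketch of that split-and-score procedure with scikit-learn; synthetic data stands in for the project's dataset, so the accuracies will differ from the 84.39% and 86.07% reported here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-processed car data (binary target)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

accuracies = []
for test_size in (0.30, 0.20):  # 70/30 and 80/20 train/test splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    acc = model.score(X_test, y_test)  # fraction of correct test predictions
    accuracies.append(acc)
    print(f"test_size={test_size}: accuracy={acc:.2%}")
```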


Logistic regression is used to predict the probability of a target variable.
The nature of the target or dependent variable is dichotomous, which means there are only
two possible classes.
In simple words, the dependent variable is binary in nature, with data coded as either 1
(stands for success/yes) or 0 (stands for failure/no).
Types of Logistic Regression
Generally, logistic regression means binary logistic regression having binary target
variables, but there can be two more categories of target variables that can be predicted by it.


Based on the number of categories, logistic regression can be divided into the following
types:

• Binary or Binomial
In this kind of classification, the dependent variable has only two possible values, 1 or 0.
For example, these variables may represent success or failure, yes or no, win or loss, etc.
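A toy sketch of the binary case: a linear score is squashed through the sigmoid into a probability, then thresholded at 0.5 to give class 1 or 0 (the weight and bias here are arbitrary, not fitted values):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

w, b = 1.5, -3.0  # arbitrary weight and bias for illustration
for x in (1.0, 3.0):
    p = sigmoid(w * x + b)        # probability of class 1
    label = 1 if p >= 0.5 else 0  # threshold at 0.5
    print(f"x={x}: p={p:.3f}, predicted class={label}")
```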

PREDICTION
When the reasons behind a model’s outcomes are as important as the outcomes themselves,
Prediction Explanations can uncover the factors that most contribute to those outcomes.


PERFORMANCE MEASURES
The next step after implementing a machine learning algorithm is to find out how effective
the model is, based on metrics and datasets. Different performance metrics are used to
evaluate different machine learning algorithms.


Performance metrics for the evaluation of machine learning algorithms include accuracy,
precision, recall and specificity; such metrics are also used for ranking algorithms,
for example by search engines.
The formulae for the performance metrics are represented in the figure above.
CONFUSION MATRIX
The confusion matrix is one of the most intuitive and easiest metrics used for finding the
correctness and accuracy of a model. It is used for classification problems where the
output can be of two or more classes.

A confusion matrix is a table that is often used to describe the performance of a classification
model on a set of test data for which the true values are known.
After evaluating the model, we obtained the following values:
Accuracy=79.74%
Precision=64.37%
Sensitivity=46.66%
Recall=42.66%
F1 score=54.10%
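As a sketch of how such figures follow from the four confusion-matrix cells (the counts below are invented, so the percentages differ from those reported above):

```python
# Invented confusion-matrix cells for a binary classifier
tp, fp, fn, tn = 40, 10, 15, 85  # true/false positives, false/true negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # also known as sensitivity
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2%}  precision={precision:.2%}  "
      f"recall={recall:.2%}  F1={f1:.2%}")
```

The F1 score is the harmonic mean of precision and recall, which is why it sits between the two whenever they differ.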


5. FINDINGS
After analysing the heat map and determining the correlations between the different
columns/attributes, logistic regression was carried out to create a prediction model.
Figure 12 shows the results of the logistic regression model. Figure 13 shows the accuracy
score of the designed model. From these data, precision, F1 score and reliability can be
calculated. Figure 14 shows the R-squared calculation for the linear regression models. [6]

6. CONCLUSION
Evaluating the product from such huge volumes of data is a difficult and fatiguing process.
Using business analytics and business intelligence, we can simplify the work, which is
more profitable for the industry.

7. LEARNING OUTCOME

7.1 Process control: Activities involved in ensuring a process is predictable, stable, and
consistently operating at the target level of performance with only normal variation.
For this project's methodology we have used Anaconda as a statistical tool for process control.

7.2 Anaconda

Anaconda is a complete, open-source data science package with a community of over 6
million users. It is easy to download and install, and it is supported on Linux, macOS, and
Windows. It is developed and maintained by Anaconda, Inc., which was founded by Peter
Wang and Travis Oliphant in 2012. As an Anaconda, Inc. product, it is also known
as Anaconda Distribution or Anaconda Individual Edition, while other products from the
company are Anaconda Team Edition and Anaconda Enterprise Edition, which are both not
free.
Package versions in Anaconda are managed by the package management system conda. This
package manager was spun out as a separate open-source project because it proved useful
on its own, and for things other than Python. There is also a small, bootstrap version of
Anaconda called Miniconda, which includes only conda, Python, the packages they depend
on, and a small number of other packages.

• Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface (GUI) included in Anaconda
distribution that allows users to launch applications and manage conda packages,
environments and channels without using command-line commands. Navigator can search for
packages on Anaconda Cloud or in a local Anaconda Repository, install them in an
environment, run the packages and update them. It is available
for Windows, macOS and Linux.
The following applications are available by default in Navigator[17]:


• JupyterLab
• Jupyter Notebook
• QtConsole
• Spyder
• Glue
• Orange
• RStudio
• Visual Studio Code

Jupyter Notebook

It was created to make it easier to show one’s programming work, and to let others join in.
Jupyter Notebook allows you to combine code, comments, multimedia, and visualizations in
an interactive document — called a notebook, naturally — that can be shared, re-used, and
re-worked. 

And because Jupyter Notebook runs via a web browser, the notebook itself could be hosted
on your local machine or on a remote server. 

Originally developed for data science applications written in Python, R, and Julia, Jupyter
Notebook is useful in all kinds of ways for all kinds of projects:

• Data visualizations. Most people have their first exposure to Jupyter Notebook by way of a
data visualization, a shared notebook that includes a rendering of some data set as a graphic.
Jupyter Notebook lets you author visualizations, but also share them and allow interactive
changes to the shared code and data set.


• Code sharing. Cloud services like GitHub and Pastebin provide ways to share code, but
they’re largely non-interactive. With a Jupyter Notebook, you can view code, execute it, and
display the results directly in your web browser.
• Live interactions with code. Jupyter Notebook code isn’t static; it can be edited and re-run
incrementally in real time, with feedback provided directly in the browser. Notebooks can
also embed user controls (e.g., sliders or text input fields) that can be used as input sources
for code.
• Documenting code samples. If you have a piece of code and you want to explain line-by-
line how it works, with live feedback all along the way, you could embed it in a Jupyter
Notebook. Best of all, the code will remain fully functional—you can add interactivity along
with the explanation, showing and telling at the same time.

REFERENCES
[1]. Interactions between kidney disease and diabetes- dangerous liaisons- Roberto Pecoits-
Filho, Hugo Abensur, Carolina C.R. Betônico, Alisson Diego Machado, Erika B. Parente,
Márcia Queiroz, João Eduardo Nunes Salles, Silvia Titan and Sergio Vencio- 2016- article
50.
[2]. The Python Standard Library — Python 3.7.1rc2 documentation
https://docs.python.org/3/library/
[3]. Data Warehousing Architecture & PreProcessing- Vishesh S, Manu Srinath, Akshatha C
Kumar, Nandan A.S.- IJARCCE, vol6, iss5, May 2017
[4]. Data Mining and Analytics: A Proactive Model -
http://www.ijarcce.com/upload/2017/february-17/IJARCCE%20117.pdf
[5]. A comparative analysis on linear regression and support vector regression- DOI:
10.1109/GET.2016.7916627- https://ieeexplore.ieee.org/abstract/document/7916627
[6]. Stock market predication using a linear regression- DOI: 10.1109/ICECA.2017.8212716,
https://ieeexplore.ieee.org/abstract/document/8212716
