
Assignment Cover Sheet

Qualification: Higher National Diploma in Computing & Software Engineering
Module Number and Title: CSE5014 - Business Analytics
Student Name & No.:
Assessor:
Hand out date:
Submission Date:
Assessment type: Coursework - Individual
Duration/Length of Assessment Type: 3000 words equivalent
Weighting of Assessment: 100%

Learner declaration

I, ………………………………………….<name of the student and registration number>,
certify that the work submitted for this assignment is my own and research sources are fully
acknowledged.

Marks Awarded
First assessor

IV marks

Agreed grade

Signature of the assessor Date

FEEDBACK FORM

Module:
Student:
Assessor:
Assignment:

Strong features of your work:

Areas for improvement:

Marks Awarded:

Coursework

Learning outcomes covered

• Explain Business Analytics methodologies, tools and techniques.

• Evaluate the business advantages produced by business analytics.

• Develop a proposed solution for a business problem, or create an opportunity, using
appropriate business analytics methodologies, tools and techniques.
Scenario and the Task

Introduction

Business analytics is one of the major areas that companies and other institutions, both
for-profit and non-profit, rely on for decision support throughout their operational life cycles.
As the economy grows rapidly, almost all government and private organizations seeking tangible
and intangible, financial and non-financial benefits must consider the precision and accuracy of
their management decisions at all three managerial levels: operational, tactical and strategic.
The amount of information generated in the modern agile economic environment is very high, and
maintaining its consistency is also a challenge. For this reason, big data analysis is considered
a prime concern for credible and valuable informed decision making. When converting data into
decision-supportive information, it is very important to use the different data analysis methods,
techniques and tools that are comprehensively explained in data analytics. Including data
analytics elements in modern management information systems and decision support systems, so that
they enable effective and efficient online analytical processing together with connected
operational databases, data marts and data warehouses, is considered vital.

In performing big data analysis, it is very important to use good statistical software. At present
many such products are available, as generic or bespoke software and as open source or closed
source. Open source products are usually more financially feasible for many organizations than
closed source products. At present, rapid growth in big data analysis is most visible in the open
source category, with frequently released updates and many feature extensions compared to closed
source. On the other hand, many highly reliable software products have been released by
pioneering solution providers in the industry. Therefore, the relevant authorities in an
organization need to select the best data analysis product wisely so that their objectives are
precisely achieved.

SCENARIO

The government of Sri Lanka is planning to take effective and credible measures to overcome the
prevailing escalation of accidents involving motorists in the country. As an initial step, the
government has considered what measures would be possible to avoid and mitigate airbag and other
influences on accident fatalities. The main reason behind this is the high number of motor
accidents reported in many parts of the island during the last few years. Many experts claim that
the causes of motor traffic accidents relate to many technical, economic, social and legal
factors.

The government is seeking a sophisticated mechanism that uses business analytics to predict
future accidents and their causes. The Ministry of Law and Order, the Department of Motor Traffic,
the Ministry of Finance, the Ministry of Urban Development and the Ministry of Health have been
identified as the priority entities to advance the new initiative. The government is considering
full-scale data science support to develop the business intelligence required for motor traffic
accident management within the country.

The government has appointed a panel of experts in the field of data science to provide the
recommendations needed to bring in new laws, tariffs and other useful mechanisms to mitigate and
avoid accidents, based on airbag and other influences in the island, using local and foreign
intelligence. Assume that you are one of the data scientists on the panel and have been assigned
a special set of tasks to accomplish and report on to the panel head. You are required to
complete the tasks below and prepare a report based on the findings.

Key information about the survey

The data were gathered primarily by the U.S. Department of Transportation National Highway
Traffic Safety Administration (NHTSA). The primary contributor was the U.S. Department of
Transportation. The data set comprises US data for the period 1997-2002, from police-reported car
crashes in which there was a harmful event (to people or property) and from which at least one
vehicle was towed. The data are restricted to front-seat occupants and include only a subset of
the variables recorded.

Survey Data Dictionary

Data Field          Description
(Vector/Variable)

dvcat               Ordered factor with levels (estimated impact speeds) 1-9km/h, 10-24,
                    25-39, 40-54, 55+
weight              Observation weights, albeit of uncertain accuracy, designed to account
                    for varying sampling probabilities
dead                Factor with levels alive, dead
airbag              Factor with levels none, airbag
seatbelt            Factor with levels none, belted
frontal             Numeric vector; 0 = non-frontal, 1 = frontal impact
sex                 Factor with levels f, m
ageOFocc            Age of occupant in years
yearacc             Year of accident
yearVeh             Year of model of vehicle; a numeric vector
abcat               Did one or more (driver or passenger) airbag(s) deploy? Factor with
                    levels deploy, nodeploy, unavail
occRole             Factor with levels driver, pass
deploy              Numeric vector; 0 if an airbag was unavailable or did not deploy, 1 if
                    one or more bags deployed
injSeverity         Numeric vector; 0: none, 1: possible injury, 2: no incapacity,
                    3: incapacity, 4: killed, 5: unknown, 6: prior death
caseid              Character, created by pasting together the population's sampling unit,
                    the case number, and the vehicle number. Within each year, use this to
                    uniquely identify the vehicle.

The entire data analysis should be done using the AccFat_Info.csv dataset provided with the
assignment.

Tasks
1. Describe, with credible examples, the possible advantages that analytics and business
intelligence from data science can generate for the aforementioned Sri Lankan government
authorities in making informed decisions to solve problems related to motor traffic fatalities.
(5 Marks)

2. Explain the tools, techniques and methodologies you are going to use for this case-study-based
analysis. (6 Marks)

3. Find the minimum, maximum, mean, median and mode of the age in years of the occupants involved
in accidents. (6 Marks)

4. Find the summary statistics of age of occupant in years, year of accident and injury severity
of the accidents. (6 Marks)

5. Conduct a central tendency analysis for age of occupant in years, year of accident and injury
severity, and find the standard deviation of each. Represent the findings graphically using bell
curves. (10 Marks)

6. Explain whether airbags contribute to reducing the injuries of passengers or drivers during an
accident, supported by suitable statistical hypothesis testing and graphical analysis.
(12 Marks)

7. Using statistical hypothesis testing, show whether a statistically significant relationship
exists between injury severity and age of occupant of the vehicle. (10 Marks)

8. Using statistical hypothesis testing, show whether a statistically significant relationship
exists between injury severity and year of accident. (10 Marks)

9. Using statistical hypothesis testing, show whether a statistically significant relationship
exists between injury severity and year of model of the vehicle. (10 Marks)

10. Write a conclusion based on the findings of the data analysis and suggest necessary
recommendations as the solution to the problems identified. The answer should be followed
by descriptive justifications. (15 Marks)

• Proper report format carries 10 Marks separately.

• Questions 7, 8 and 9 should incorporate hypothesis-based normality testing.

• Note: The conclusion can include the findings of a suitable regression analysis as well.

Submission Guidelines
Report Structure:
• Executive Summary
• Table of contents, Table of Figures, Table of Tables
• Introduction of the Organization & its operational environment / Scenario
• Data Analysis and Discussion
• Conclusion
• Future Recommendation
• Gantt chart & its Description
• Referencing
• Appendix (Appendix A, Appendix B, etc.) for Group meetings, Samples of Questionnaire

Report Format:
• Submission format: Report
• Paper Size: A4
• Words: 3000 words
• Printing Margins: LHS; RHS: 1 Inch
• Binding Margin: ½ Inch
• Header and Footer: 1 Inch
• Basic Font Size: 12
• Line Spacing: 1.5
• Font Style: Times New Roman
• Referencing should be done strictly using the Harvard system

1. The Sri Lankan government and its affiliated authorities stand to gain a great deal from the
application of analytics and business intelligence in the context of motor traffic fatalities. Here
are a few reliable instances:

1. Accident Prediction and Prevention: - Analytics can look for patterns and trends in past
data related to auto accidents. The government may anticipate possible accident hotspots
by using predictive modelling, which gives them the ability to implement preventive
measures like better road signage, traffic calming techniques, and increased law
enforcement in high-risk regions.

2. Identifying Root Causes: - By examining numerous variables like vehicle speed,
seatbelt use, and airbag deployment, business intelligence technologies can assist in
determining the underlying causes of accidents. For example, authorities can concentrate
on awareness campaigns, tighter enforcement, or the implementation of rules to address
these concerns if the data reveals a high incidence of accidents involving the failure to
deploy airbags or the failure to wear seatbelts.

3. Resource Optimisation: - Analytics can help with emergency services and law
enforcement resource allocation. The government can more effectively allocate resources
to ensure a quicker response, lessening the impact of accidents and enhancing overall
safety by identifying peak accident times, locations, and severity.

4. Legislation and Policy Development: - Data science can shed light on how well current
motor traffic safety legislation and regulations are working. The government can make
well-informed judgements on whether to strengthen or amend current laws to improve road
safety by using analytics to assess the effects of airbag and seatbelt rules.

5. Public Awareness Campaigns: - Analytics can be used to understand demographic
aspects, such as age and gender, that contribute to accidents. By using this data to target
particular populations with pertinent safety messages, public awareness programmes can
be made more effective.

6. Vehicle Safety Standards: - The government can evaluate the efficacy of current
vehicle safety standards by examining data on the year of the vehicle and airbag
deployment. Policymakers can use this information to advise updates to safety
regulations, promote the adoption of cutting-edge safety features in newer cars, and even
impose mandatory safety upgrades on older ones.

7. Weighted Decision-Making: - A more in-depth analysis is made possible by the
dataset's inclusion of observation weights. By taking varying sampling probabilities into
account, decision-makers can make sure that judgements are not biased by the sample but
are based on a representative understanding of the whole situation.

8. Evaluation of Emergency Response: - Information on the severity of injuries and the
rates of fatalities can be utilised to assess how well emergency response systems are
working. Adjustments can be made to emergency medical services, ensuring they are
properly staffed and equipped to manage such circumstances, if certain areas frequently
demonstrate increased injury severity.

In conclusion, the Sri Lankan government and related authorities can be better equipped to make
decisions, allocate resources effectively, and create focused interventions to lower the number of
road fatalities if analytics and business intelligence are used in the management of motor traffic
accidents.

2. Data scientists must apply a variety of tools, techniques, and approaches to analyse the data on
motor vehicle accidents in order to make well-informed decisions. An outline of the essential
components that can be used for this analysis based on a case study is provided below:

1. Data Cleaning and Preprocessing: - Tools: Python (NumPy, pandas), R, and SQL -
Techniques: Managing outliers, transforming data types, and standardizing/normalizing variables.

2. Exploratory Data Analysis (EDA): - Tools: R (ggplot2), Python (matplotlib, seaborn) -
Techniques: Using descriptive statistics, data visualisation, and correlation analysis to obtain
a preliminary understanding of the relationships between variables.

3. Predictive Modelling: - Tools: R (caret), Python (scikit-learn) - Techniques: Applying
machine learning algorithms, such as decision trees, regression analysis, and ensemble methods,
to discover influential factors, estimate accident probability, and evaluate the effectiveness
of interventions.

4. Time Series Analysis: - Tools: R (forecast), Python (pandas, statsmodels) - Techniques:
Examining trends, seasonality, and cyclical patterns in accidents over time.

5. Spatial Analysis: - Tools: Geographic Information System (GIS) tools, Python (geopandas), R
(sf) - Techniques: Mapping accident locations to identify spatial clusters, hotspot analysis, and
evaluating the influence of geographical factors on accident incidence.

6. Weighted Analysis: - Tools: Python, R - Techniques: Using observation weights to account for
varying sampling probabilities, so that the results are representative.

7. Association Rules Mining: - Tools: R (arules), Python (mlxtend) - Techniques: Finding
associations between various elements (such as airbag deployment and injury severity) to reveal
hidden patterns and dependencies.

8. Advanced Statistical Analysis: - Tools: Python (statsmodels, scipy), R - Techniques: Applying
statistical tests (chi-square, t-tests, etc.) to test hypotheses and determine the significance
of the patterns observed.

9. Feature Engineering: - Tools: Python, R - Techniques: Creating new variables or transforming
existing ones in order to enhance the performance of predictive models and provide more
meaningful insights.

10. Information Sharing and Reporting: - Tools: RMarkdown, Tableau, Power BI, Jupyter
Notebooks - Techniques: Using storytelling techniques to create in-depth reports and
visualisations that effectively explain findings to policymakers and highlight the impact of
suggested solutions.

11. Collaborative Decision-Making: - Tools: Version control (Git), collaboration platforms
(Slack, Microsoft Teams) - Techniques: Making sure the panel of experts takes a collaborative
approach, using version control to track changes, and preserving transparency in the
decision-making process.

The data scientists on the panel will be able to analyse the motor traffic accident data in a
comprehensive and perceptive manner by using these tools, techniques, and methodologies. As a
result, they will be able to offer the Sri Lankan government useful recommendations for
efficiently addressing the problem.
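
The weighted analysis technique listed above can be sketched in R as follows. This is only a minimal illustration under stated assumptions: the small data frame toy is invented stand-in data that reuses the column names airbag, dead and weight from the data dictionary, and the helper name weighted_death_rate is hypothetical.

```r
# Invented stand-in rows; the real analysis would read AccFat_Info.csv
toy <- data.frame(
  airbag = c("airbag", "airbag", "airbag", "none", "none", "none"),
  dead   = c("alive", "alive", "dead", "alive", "dead", "dead"),
  weight = c(50, 30, 20, 10, 10, 80)
)

# Hypothetical helper: share of the total weight that belongs to fatal cases
weighted_death_rate <- function(d) {
  sum(d$weight[d$dead == "dead"]) / sum(d$weight)
}

# Unweighted rate: every row counts once
unweighted <- tapply(toy$dead == "dead", toy$airbag, mean)

# Weighted rate: every row counts in proportion to its observation weight
weighted <- sapply(split(toy, toy$airbag), weighted_death_rate)

print(round(unweighted, 3))
print(round(weighted, 3))
```

Comparing the two vectors shows how far the sampling weights move the estimate. Since the data dictionary describes the weights as being of uncertain accuracy, it may be sensible to report both versions.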

3.
R Code

# Load the survey data (assumes AccFat_Info.csv is in the working directory)
data <- read.csv('AccFat_Info.csv')

# Custom function to calculate the mode (most frequent value).
# Base R's mode() returns the storage type, not the statistical mode.
calculate_mode <- function(x) {
  x <- x[!is.na(x)]
  uniq_x <- unique(x)
  uniq_x[which.max(tabulate(match(x, uniq_x)))]
}

# Calculate statistics for 'ageOFocc', ignoring missing values
age_stats <- c(Minimum = min(data$ageOFocc, na.rm = TRUE),
               Maximum = max(data$ageOFocc, na.rm = TRUE),
               Mean = mean(data$ageOFocc, na.rm = TRUE),
               Median = median(data$ageOFocc, na.rm = TRUE),
               Mode = calculate_mode(data$ageOFocc))

# Display the results
cat("Age of Occupants Statistics:\n")
for (stat in names(age_stats)) {
  cat(paste(stat, ":", age_stats[stat]), "\n")
}

4.
# Load the survey data (assumes AccFat_Info.csv is in the working directory)
data <- read.csv('AccFat_Info.csv')

# Extract relevant columns
selected_columns <- c('ageOFocc', 'yearacc', 'injSeverity')
selected_data <- data[selected_columns]

# Calculate summary statistics (minimum, quartiles, median, mean, maximum, NA count)
summary_statistics <- summary(selected_data)

# Display the results
cat("Summary Statistics:\n")
print(summary_statistics)
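
Because injSeverity is a coded category rather than a measurement, a frequency table can be easier to read than quartiles. The following is a minimal sketch, assuming the codes given in the data dictionary; the vector inj is invented sample data, and severity_labels is a name introduced here for illustration.

```r
# Invented sample of injury codes (the real column would be data$injSeverity)
inj <- c(0, 0, 1, 2, 3, 3, 3, 4, 5, NA)

# Labels taken from the data dictionary (codes 0 to 6)
severity_labels <- c("none", "possible injury", "no incapacity",
                     "incapacity", "killed", "unknown", "prior death")

# Convert codes to a labelled factor so every level appears in the table
inj_factor <- factor(inj, levels = 0:6, labels = severity_labels)

# Counts per level, with missing values reported separately
counts <- table(inj_factor, useNA = "ifany")
print(counts)

# Percentages among the non-missing observations
percentages <- round(100 * prop.table(table(inj_factor)), 1)
print(percentages)
```

Levels with no observations still appear with a count of zero, which makes gaps in the data visible.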

5.
# Load the required libraries
library(ggplot2)
library(gridExtra)

# Load the survey data (assumes AccFat_Info.csv is in the working directory)
data <- read.csv('AccFat_Info.csv')

# Extract relevant columns
selected_columns <- c('ageOFocc', 'yearacc', 'injSeverity')
selected_data <- data[selected_columns]

# Remove rows with missing values
selected_data <- na.omit(selected_data)

# Calculate mean and standard deviation
mean_values <- sapply(selected_data, mean)
std_dev_values <- sapply(selected_data, sd)

# Display the central tendency and spread of each variable
print(mean_values)
print(std_dev_values)

# Create a list to store individual plots
plots <- list()

# Plot bell curves
for (i in seq_along(selected_columns)) {
  column <- selected_columns[i]

  # Histogram with the fitted normal (bell) curve in red. Dashed lines mark the
  # mean (red) and one standard deviation either side of it (orange).
  # !!as.name(column) fixes the column now rather than when the plot is drawn.
  p <- ggplot(selected_data, aes(x = !!as.name(column))) +
    geom_histogram(aes(y = after_stat(density)), bins = 30,
                   fill = 'skyblue', color = 'black') +
    stat_function(fun = dnorm,
                  args = list(mean = mean_values[[i]], sd = std_dev_values[[i]]),
                  color = 'red') +
    geom_vline(xintercept = mean_values[[i]], linetype = 'dashed',
               color = 'red', linewidth = 1) +
    geom_vline(xintercept = mean_values[[i]] - std_dev_values[[i]],
               linetype = 'dashed', color = 'orange', linewidth = 1) +
    geom_vline(xintercept = mean_values[[i]] + std_dev_values[[i]],
               linetype = 'dashed', color = 'orange', linewidth = 1) +
    labs(title = paste('Distribution of', column),
         x = column,
         y = 'Density') +
    theme_minimal() +
    theme(legend.position = 'none')

  # Add the plot to the list
  plots[[i]] <- p
}

# Arrange and show the plots in a single-column grid
grid.arrange(grobs = plots, ncol = 1)

6.
To determine whether airbags contribute to reducing injuries of passengers or drivers during an
accident, you can perform statistical hypothesis testing and graphical analysis using the available
data. Here's a step-by-step guide:

1. Define Hypotheses:

- Null Hypothesis (H0): Airbags do not contribute to reducing injuries.

- Alternative Hypothesis (H1): Airbags contribute to reducing injuries.

2. Select Data:

- Extract relevant columns from the dataset, including 'airbag' and 'injSeverity' (injury severity).

3. Data Exploration:

- Graphical Analysis:
- Create a boxplot or violin plot to visually compare the distribution of injury severity for cases
with and without airbags.
- Use histograms or bar charts to visualize the frequency of different injury severity levels for
each group.

4. Statistical Hypothesis Testing:

- Independent Samples t-test:


- Perform an independent samples t-test to compare the mean injury severity for cases with and
without airbags.
- The null hypothesis assumes that there is no significant difference in injury severity between
the two groups.

# Load the survey data (assumes AccFat_Info.csv is in the working directory)
data <- read.csv('AccFat_Info.csv')

# Extract relevant columns and drop rows with missing values
selected_columns <- c('airbag', 'injSeverity')
selected_data <- na.omit(data[selected_columns])

# Keep codes 0-4 only: 5 (unknown) and 6 (prior death) are not points on the
# severity scale and would distort the group means
selected_data <- selected_data[selected_data$injSeverity %in% 0:4, ]

# Separate data into two groups: with and without airbags
with_airbag <- selected_data[selected_data$airbag == 'airbag', 'injSeverity']
without_airbag <- selected_data[selected_data$airbag == 'none', 'injSeverity']

# Perform Welch's t-test (does not assume equal variances)
t_test_result <- t.test(with_airbag, without_airbag,
                        alternative = 'two.sided', var.equal = FALSE)

# Display results
cat("Mean severity with airbag:", mean(with_airbag), "\n")
cat("Mean severity without airbag:", mean(without_airbag), "\n")
cat("T-statistic:", t_test_result$statistic, "\n")
cat("P-value:", t_test_result$p.value, "\n")

5. Interpret Results:

- Graphical Analysis:
- Compare the graphical representations to observe any visual differences in the injury severity
distribution between the two groups.
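- Statistical Hypothesis Testing:
- If the p-value is below a predetermined significance level (e.g., 0.05), reject the null
hypothesis. Because the test is two-sided, also compare the group means: the result supports
the claim that airbags reduce injuries only if mean severity is lower in the airbag group.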

6. Conclusion:

- Based on the graphical analysis and statistical hypothesis testing, draw a conclusion regarding
whether airbags significantly contribute to reducing injuries during accidents.

Remember to interpret the results cautiously, considering factors such as sample size and
potential confounding variables. Additionally, statistical significance does not imply practical
significance, so the impact of airbags on injury reduction should be considered in a broader
context.
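
Because airbag status and the outcome are both categorical, a chi-square test of independence on their cross-tabulation is a common complement to the t-test. The sketch below is illustrative only: the counts in crash_table are invented, and the two outcome labels are an assumed grouping of the injury codes rather than something defined in the dataset.

```r
# Invented counts for illustration only; rows = airbag status, columns = outcome
crash_table <- matrix(c(880, 120,
                        800, 200),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(airbag = c("airbag", "none"),
                                      outcome = c("not severe", "severe")))

# Row proportions: share of each outcome within each airbag group
row_props <- prop.table(crash_table, margin = 1)
print(round(row_props, 3))

# Chi-square test of independence (H0: outcome is independent of airbag status)
chi_result <- chisq.test(crash_table)
cat("Chi-square statistic:", chi_result$statistic, "\n")
cat("P-value:", chi_result$p.value, "\n")
```

For the graphical side, barplot(t(row_props), beside = TRUE, legend.text = TRUE) would draw these proportions as grouped bars. With the real data, the table could be built with table(data$airbag, data$dead).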

7. To assess whether a statistically significant relationship exists between injury severity and the
age of the occupant, you can use statistical hypothesis testing. In this case, a suitable test is the
Analysis of Variance (ANOVA) or Kruskal-Wallis test, depending on the distribution of the
data.

Here's how you can perform the analysis:

1. Define Hypotheses:

- Null Hypothesis (H0): There is no significant relationship between injury severity and the age
of the occupant.

- Alternative Hypothesis (H1): There is a significant relationship between injury severity and the
age of the occupant.

2. Select Data:

- Extract relevant columns from the dataset, including 'ageOFocc' (age of the occupant) and
'injSeverity' (injury severity).

3. Data Exploration:

- Graphical Analysis:
- Create boxplots or violin plots to visualize the distribution of injury severity across different
age groups.

4. Statistical Hypothesis Testing:

- ANOVA or Kruskal-Wallis Test:


- If the data meets the assumptions of normality and homogeneity of variances, perform an
ANOVA test.
- If the assumptions are violated, use the Kruskal-Wallis test, a non-parametric alternative.

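# Load the survey data (assumes AccFat_Info.csv is in the working directory)
data <- read.csv('AccFat_Info.csv')

# Extract relevant columns, drop missing values, and keep severity codes 0-4
# (5 = unknown and 6 = prior death are not points on the severity scale)
selected_columns <- c('ageOFocc', 'injSeverity')
selected_data <- na.omit(data[selected_columns])
selected_data <- selected_data[selected_data$injSeverity %in% 0:4, ]
selected_data$injSeverity <- factor(selected_data$injSeverity)

# Fit the ANOVA model and test its residuals for normality.
# shapiro.test() accepts at most 5000 values, so a random subsample is tested.
anova_fit <- aov(ageOFocc ~ injSeverity, data = selected_data)
res <- residuals(anova_fit)
set.seed(1)
normality <- shapiro.test(sample(res, min(5000, length(res))))

# Use ANOVA if normality is not rejected, otherwise Kruskal-Wallis
if (normality$p.value > 0.05) {
  test_name <- "ANOVA"
  anova_table <- summary(anova_fit)[[1]]
  statistic <- anova_table[["F value"]][1]
  p_value <- anova_table[["Pr(>F)"]][1]
} else {
  test_name <- "Kruskal-Wallis"
  kw <- kruskal.test(ageOFocc ~ injSeverity, data = selected_data)
  statistic <- kw$statistic
  p_value <- kw$p.value
}

# Display results
cat("Shapiro-Wilk normality p-value:", normality$p.value, "\n")
cat(test_name, "Test Results:\n")
cat("Test Statistic:", statistic, "\n")
cat("P-value:", p_value, "\n")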

5. Interpret Results:

- If the p-value is below a predetermined significance level (e.g., 0.05), reject the null hypothesis,
indicating a significant relationship between injury severity and the age of the occupant.

6. Conclusion:

- Based on the statistical analysis, draw a conclusion regarding the existence of a significant
relationship between injury severity and the age of the occupant.

Remember to interpret the results considering the assumptions of the chosen test and potential
confounding variables. Additionally, if the p-value is significant, consider further post-hoc tests
to identify specific differences between age groups.
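
The post-hoc step mentioned above could look like the following sketch, which uses pairwise Wilcoxon rank-sum tests with Holm adjustment after a Kruskal-Wallis test. The data frame sim is simulated for illustration and only borrows the column names ageOFocc and injSeverity; it does not reflect the real dataset.

```r
set.seed(42)

# Simulated ages for three injury-severity groups (illustration only)
sim <- data.frame(
  injSeverity = factor(rep(c("0", "3", "4"), each = 40)),
  ageOFocc    = c(rnorm(40, mean = 30, sd = 8),
                  rnorm(40, mean = 32, sd = 8),
                  rnorm(40, mean = 55, sd = 8))
)

# Overall test first: do the groups differ at all?
overall <- kruskal.test(ageOFocc ~ injSeverity, data = sim)
cat("Kruskal-Wallis p-value:", overall$p.value, "\n")

# Post-hoc: pairwise Wilcoxon rank-sum tests with Holm adjustment
posthoc <- pairwise.wilcox.test(sim$ageOFocc, sim$injSeverity,
                                p.adjust.method = "holm")

# Lower-triangular matrix of adjusted p-values, one cell per pair of groups
print(posthoc$p.value)
```

Each cell of the printed matrix compares one pair of severity levels. If an ANOVA was used instead, TukeyHSD() on the fitted aov object plays the same role.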

8. To assess whether a statistically significant relationship exists between injury severity and the
year of the accident, you can use statistical hypothesis testing. A suitable test for this analysis is
the Analysis of Variance (ANOVA) test. Here's a step-by-step guide:

1. Define Hypotheses:

- Null Hypothesis (H0): There is no significant relationship between injury severity and the year
of the accident.
- Alternative Hypothesis (H1): There is a significant relationship between injury severity and the
year of the accident.

2. Select Data:

- Extract relevant columns from the dataset, including 'yearacc' (year of the accident) and
'injSeverity' (injury severity).

3. Data Exploration:

- Graphical Analysis:
- Create boxplots or violin plots to visualize the distribution of injury severity across different
years.

4. Statistical Hypothesis Testing:

- ANOVA Test:
- Perform an ANOVA test to assess whether there are statistically significant differences in the
means of injury severity across different years.

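# Load the survey data (assumes AccFat_Info.csv is in the working directory)
data <- read.csv('AccFat_Info.csv')

# Extract relevant columns, drop missing values, and keep severity codes 0-4
# (5 = unknown and 6 = prior death are not points on the severity scale)
selected_columns <- c('yearacc', 'injSeverity')
selected_data <- na.omit(data[selected_columns])
selected_data <- selected_data[selected_data$injSeverity %in% 0:4, ]

# Perform ANOVA test; factor() makes each year its own group instead of
# fitting a straight-line trend across years
result <- aov(injSeverity ~ factor(yearacc), data = selected_data)

# Display results (the first row of the ANOVA table holds the year effect)
anova_table <- summary(result)[[1]]
cat("ANOVA Test Results:\n")
cat("Test Statistic:", anova_table[["F value"]][1], "\n")
cat("P-value:", anova_table[["Pr(>F)"]][1], "\n")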

5. Interpret Results:

- If the p-value is below a predetermined significance level (e.g., 0.05), reject the null hypothesis,
indicating a significant relationship between injury severity and the year of the accident.

6. Conclusion:

- Based on the statistical analysis, draw a conclusion regarding the existence of a significant
relationship between injury severity and the year of the accident.

Remember to interpret the results considering the assumptions of the ANOVA test and potential
confounding variables. Additionally, if the p-value is significant, consider post-hoc tests to
identify specific differences between individual years.
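
Checking the ANOVA assumptions can itself be framed as a hypothesis test. The sketch below applies a Shapiro-Wilk test to the ANOVA residuals (H0: the residuals are normally distributed) and falls back to Kruskal-Wallis when normality is rejected. The helper name choose_group_test and the simulated vectors are assumptions made for illustration, not part of the dataset or of any library.

```r
# Hypothetical helper: ANOVA when residuals look normal, otherwise Kruskal-Wallis
choose_group_test <- function(values, groups, alpha = 0.05, max_n = 5000) {
  groups <- factor(groups)
  fit <- aov(values ~ groups)
  res <- residuals(fit)
  # shapiro.test() accepts between 3 and 5000 values, so subsample large inputs
  if (length(res) > max_n) res <- sample(res, max_n)
  normal_p <- shapiro.test(res)$p.value
  if (normal_p > alpha) {
    tab <- summary(fit)[[1]]
    list(test = "ANOVA", normality_p = normal_p,
         statistic = tab[["F value"]][1], p_value = tab[["Pr(>F)"]][1])
  } else {
    kw <- kruskal.test(values ~ groups)
    list(test = "Kruskal-Wallis", normality_p = normal_p,
         statistic = unname(kw$statistic), p_value = kw$p.value)
  }
}

set.seed(7)
# Simulated, strongly skewed outcome across three accident years (illustration only)
yearacc <- rep(c(1997, 1998, 1999), each = 100)
severity <- rexp(300, rate = 1) + ifelse(yearacc == 1999, 1, 0)

result <- choose_group_test(severity, yearacc)
cat("Chosen test:", result$test, "\n")
cat("Normality p-value:", result$normality_p, "\n")
cat("P-value:", result$p_value, "\n")
```

With very large samples the Shapiro-Wilk test rejects even small departures from normality, so a Q-Q plot of the residuals (qqnorm() followed by qqline()) is a useful visual cross-check.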

9. To assess whether a statistically significant relationship exists between injury severity and the
year of the model of the vehicle, you can use statistical hypothesis testing. A suitable test for this
analysis is the Analysis of Variance (ANOVA) test. Here's a step-by-step guide:

1. Define Hypotheses:

- Null Hypothesis (H0): There is no significant relationship between injury severity and the year
of the model of the vehicle.
- Alternative Hypothesis (H1): There is a significant relationship between injury severity and the
year of the model of the vehicle.

2. Select Data:

- Extract relevant columns from the dataset, including 'yearVeh' (year of the model of the
vehicle) and 'injSeverity' (injury severity).

3. Data Exploration:

- Graphical Analysis:
- Create boxplots or violin plots to visualize the distribution of injury severity across different
years of the model of the vehicle.

4. Statistical Hypothesis Testing:

- ANOVA Test:
- Perform an ANOVA test to assess whether there are statistically significant differences in the
means of injury severity across different years of the model of the vehicle.

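# Load the survey data (assumes AccFat_Info.csv is in the working directory)
data <- read.csv('AccFat_Info.csv')

# Extract relevant columns, drop missing values, and keep severity codes 0-4
# (5 = unknown and 6 = prior death are not points on the severity scale)
selected_columns <- c('yearVeh', 'injSeverity')
selected_data <- na.omit(data[selected_columns])
selected_data <- selected_data[selected_data$injSeverity %in% 0:4, ]

# Perform ANOVA test; factor() makes each model year its own group instead of
# fitting a straight-line trend across model years
result <- aov(injSeverity ~ factor(yearVeh), data = selected_data)

# Display results (the first row of the ANOVA table holds the model-year effect)
anova_table <- summary(result)[[1]]
cat("ANOVA Test Results:\n")
cat("Test Statistic:", anova_table[["F value"]][1], "\n")
cat("P-value:", anova_table[["Pr(>F)"]][1], "\n")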

5. Interpret Results:

- If the p-value is below a predetermined significance level (e.g., 0.05), reject the null hypothesis,
indicating a significant relationship between injury severity and the year of the model of the
vehicle.

6. Conclusion:

- Based on the statistical analysis, draw a conclusion regarding the existence of a significant
relationship between injury severity and the year of the model of the vehicle.

Remember to interpret the results considering the assumptions of the ANOVA test and potential
confounding variables. Additionally, if the p-value is significant, consider post-hoc tests to
identify specific differences between individual years of the model of the vehicle.

10.
Data scientists must apply a variety of tools, techniques, and approaches to analyse the data on
motor vehicle accidents in order to make well-informed decisions. An outline of the essential
components that can be used for this analysis based on a case study is provided below:

1. Data Cleaning and Preprocessing: - Tools: Python (NumPy, pandas), R, and SQL -
Techniques: Managing outliers, transforming data types, and standardizing/normalizing variables.

2. Exploratory Data Analysis (EDA): - Tools: R (ggplot2), Python (matplotlib, seaborn) -
Techniques: Using descriptive statistics, data visualisation, and correlation analysis to obtain
a preliminary understanding of the relationships between variables.

3. Predictive Modelling: - Tools: R (caret), Python (scikit-learn) - Techniques: Applying
machine learning algorithms, such as decision trees, regression analysis, and ensemble methods,
to discover influential factors, estimate accident probability, and evaluate the effectiveness
of interventions.

4. Time Series Analysis: - Tools: R (forecast), Python (pandas, statsmodels) - Techniques:
Examining trends, seasonality, and cyclical patterns in accidents over time.

5. Spatial Analysis: - Tools: Geographic Information System (GIS) tools, Python (geopandas), R
(sf) - Techniques: Mapping accident locations to identify spatial clusters, hotspot analysis, and
evaluating the influence of geographical factors on accident incidence.

6. Weighted Analysis: - Tools: Python, R - Techniques: Using observation weights to account for
varying sampling probabilities, so that the results are representative.

7. Association Rules Mining: - Tools: R (arules), Python (mlxtend) - Techniques: Finding
associations between various elements (such as airbag deployment and injury severity) to reveal
hidden patterns and dependencies.

8. Advanced Statistical Analysis: - Tools: Python (statsmodels, scipy), R - Techniques: Applying
statistical tests (chi-square, t-tests, etc.) to test hypotheses and determine the significance
of the patterns observed.

9. Feature Engineering: - Tools: Python, R - Techniques: Creating new variables or transforming
existing ones in order to enhance the performance of predictive models and provide more
meaningful insights.

10. Information Sharing and Reporting: - Tools: RMarkdown, Tableau, Power BI, Jupyter
Notebooks - Techniques: Using storytelling techniques to create in-depth reports and
visualisations that effectively explain findings to policymakers and highlight the impact of
suggested solutions.

11. Collaborative Decision-Making: - Tools: Version control (Git), collaboration platforms
(Slack, Microsoft Teams) - Techniques: Making sure the panel of experts takes a collaborative
approach, using version control to track changes, and preserving transparency in the
decision-making process.

The data scientists on the panel will be able to analyse the motor traffic accident data in a
comprehensive and perceptive manner by using these tools, techniques, and methodologies. As a
result, they will be able to offer the Sri Lankan government useful recommendations for
efficiently addressing the problem.
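
The regression analysis listed under predictive modelling could take the form of a logistic regression on the fatality outcome. The sketch below is a minimal illustration on simulated data: the column names follow the data dictionary, but the effect sizes used to generate sim are assumptions chosen for the example and say nothing about the real dataset.

```r
set.seed(123)
n <- 2000

# Simulated occupants (illustration only; column names follow the data dictionary)
sim <- data.frame(
  airbag   = factor(sample(c("none", "airbag"), n, replace = TRUE),
                    levels = c("none", "airbag")),
  seatbelt = factor(sample(c("none", "belted"), n, replace = TRUE),
                    levels = c("none", "belted")),
  ageOFocc = sample(16:90, n, replace = TRUE)
)

# Assumed data-generating model: airbags and belts lower the odds of death,
# age raises them
log_odds <- -2 - 0.5 * (sim$airbag == "airbag") -
  1.0 * (sim$seatbelt == "belted") + 0.03 * sim$ageOFocc
sim$dead <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: probability of death as a function of the predictors
fit <- glm(dead ~ airbag + seatbelt + ageOFocc, data = sim, family = binomial)

# Odds ratios: values below 1 indicate lower odds of death
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))
```

With the real data, the dead column would first need recoding to 0/1, for example as.integer(data$dead == "dead"). Including several predictors at once helps separate the airbag effect from confounders such as seatbelt use and age.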

References:

U.S. Department of Transportation National Highway Traffic Safety Administration (NHTSA). (1997-2002).
Pandas Development Team. (2022). pandas 1.3.3 documentation.
Scipy Developers. (2022). SciPy v1.7.3 Reference Guide.
