
Table of Contents

Task 1
    Task 1.1 Briefly discuss the key results of your exploratory data analysis
    Task 1.2 Decision Tree Model
    Task 1.3 Logistic Regression Model
    Task 1.4 Comparison of the Decision Tree and Logistic Regression Models
Task 2
    Task 2.1 High-level data warehouse architecture design
    Task 2.2 Main Components of the Proposed High-Level Data Warehouse Architecture Design
    Task 2.3 Key security, privacy, and ethical concerns for a large state-owned water utility in using big data analytics capability combined with data warehousing
Task 3
    Tasks 3.1 to 3.4 Discussion of the Dashboards
    Task 3.5 Provide a rationale for the graphic design and functionality that is provided in your LAPD Crime Events Dashboard
List of References
List of Appendices
Task 1

Task 1.1. Briefly discuss the key results of your exploratory data analysis.

The given tables present, in both words and numbers, the values of the factors that most strongly affect rainfall in the listed cities. The goal of this task is to create a prediction model using RapidMiner. Based on the overall data, the factors that I believe could affect the prediction of whether it will rain tomorrow are: (a) evaporation, which measures the amount of water that enters the atmosphere and may return to the ground surface as precipitation; (b) humidity at 9 am and (c) humidity at 3 pm, which indicate the amount of water vapour in the air around a given location; and (d) maximum temperature and (e) minimum temperature, which make it easier to say whether it will be a hot or a cold day. According to the research literature, these are also the factors that most affect the likelihood of rain. In addition to these factors, the Rain Today variable was used, since it is one of the main indicators of whether it will rain tomorrow. All of these factors form the inputs to the prediction models built to decide whether it will rain tomorrow or not. Note that the prediction is independent of location, since I consider that factor to be captured already by humidity.

Task 1.2. Decision Tree Model

The decision tree created is based on the parameters identified in Task 1.1. These are the input factors from which the model predicts whether it will rain tomorrow. Many different combinations can be formed from the values of these factors, but with the help of RapidMiner the prediction model was generated easily.
The process used in RapidMiner Studio is as follows (a minimal code sketch of the same pipeline is given after these steps):
Retrieve. The whole table was imported into RapidMiner.
Select Attributes. All the relevant variables were kept and the other variables were disregarded.
Set Role. The Rain Tomorrow column was chosen as the label, making it the governing variable among all the variables and the desired output of the prediction model.
Filter. Rows with missing values were removed from the table, and the value NA was removed from the columns that contained it. After this step, the data table has a complete set of values for all columns, with no blank cells.
Decision Tree. Finally, a decision tree model was trained to determine which parameters increase the probability of rain tomorrow.
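
Although the model was built in RapidMiner's visual workflow, the same pipeline can be sketched in code. The following is a minimal scikit-learn equivalent; the file name and column names (for example weather.csv, RainTomorrow, Humidity9am) are assumptions about the provided weather table, not the actual RapidMiner configuration.

    # Minimal scikit-learn sketch of the RapidMiner process above.
    # File and column names are assumed for illustration only.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Retrieve: import the whole table.
    df = pd.read_csv("weather.csv")

    # Select Attributes: keep only the relevant variables from Task 1.1.
    features = ["Evaporation", "Humidity9am", "Humidity3pm",
                "MaxTemp", "MinTemp", "RainToday"]
    df = df[features + ["RainTomorrow"]]

    # Filter: drop rows with missing or "NA" values, leaving no blank cells.
    df = df.replace("NA", pd.NA).dropna()
    df["RainToday"] = (df["RainToday"] == "Yes").astype(int)

    # Set Role: RainTomorrow is the label the model predicts.
    X, y = df[features], df["RainTomorrow"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Decision Tree: train the classifier and check its accuracy.
    tree = DecisionTreeClassifier(max_depth=5, random_state=42)
    tree.fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))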
The figure below shows the decision tree produced by the program. Since many combinations must be represented, the resulting decision tree requires a lot of space; to make the values easier to read, a zoomed-in part of the main figure was also captured.

Task 1.3 Logistic Regression Model

Logistic regression is a statistical analysis method used to predict a data value based on prior observations of a data set. Logistic regression has become an important tool in the discipline of machine learning: the approach allows an algorithm being used in a machine learning application to classify incoming data based on historical data, and as more relevant data comes in, the algorithm should get better at predicting classifications within data sets. Logistic regression can also play a role in data preparation activities by allowing data sets to be put into specifically predefined buckets during the extract, transform, load (ETL) process in order to stage the information for analysis.
A logistic regression model predicts a dependent variable by analyzing the relationship between one or more existing independent variables. For example, a logistic regression could be used to predict whether a political candidate will win or lose an election, or whether a high school student will be admitted to a particular college. Here, a logistic regression model was created to determine whether it will rain tomorrow.
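
As a hedged sketch, the corresponding logistic regression model can be fitted the same way as the decision tree, under the same assumed file and column names (these are illustrative, not confirmed by the assignment data):

    # Minimal logistic regression sketch under the same assumed data layout
    # as the decision tree example; names are illustrative only.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("weather.csv").replace("NA", pd.NA).dropna()
    df["RainToday"] = (df["RainToday"] == "Yes").astype(int)
    features = ["Evaporation", "Humidity9am", "Humidity3pm",
                "MaxTemp", "MinTemp", "RainToday"]
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["RainTomorrow"], random_state=42)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # predict_proba gives the estimated probability of each class;
    # predict thresholds that probability to a yes/no answer.
    print("P(rain tomorrow), first test row:", model.predict_proba(X_test)[0, 1])
    print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))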
Task 1.4 Comparison of the Decision Tree and Logistic Regression Models

The problem of logistic regression being hard to interpret is much more serious than it first appears. Because most people cannot interpret it correctly, they often do not even notice when they have made a mistake, leading to a double error: they inadvertently create a model that is rubbish, and then go on to misinterpret it. The great thing about decision trees is that they are as simple as they appear. No advanced statistical knowledge is required to use or interpret them correctly. There are certainly ways an expert can improve them, but all that is really required to use them successfully is common sense.

The resulting analytical model can take into consideration multiple input criteria. In
the case of college acceptance, the model could consider factors such as the
student’s grade point average, SAT score and number of extracurricular activities.
Based on historical data about earlier outcomes involving the same input criteria,
it then scores new cases on their probability of falling into a particular outcome
category.
Logistic regression is one of the most commonly used machine learning algorithms for binary classification problems: problems with two class values, such as predictions of "this or that," "yes or no," and "A or B." The purpose of logistic regression is to estimate the probabilities of events, including determining the relationship between features and the probabilities of particular outcomes. One example is predicting whether a student will pass or fail an exam when the number of hours spent studying is provided as a feature and the response variable has two values: pass and fail. Organizations can use insights from logistic regression outputs to improve their business strategies and achieve their goals, for example by reducing expenses or losses and increasing the ROI of marketing campaigns.
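
To make the pass/fail example concrete, here is a toy computation of the logistic function; the intercept and weight are invented purely for illustration and are not fitted from any real data.

    # Toy logistic function for the hours-studied example; b0 and b1 are
    # made-up illustrative values, not fitted coefficients.
    import math

    b0, b1 = -4.0, 1.5  # assumed intercept and hours-studied weight

    def pass_probability(hours: float) -> float:
        # p = 1 / (1 + e^-(b0 + b1 * hours))
        return 1.0 / (1.0 + math.exp(-(b0 + b1 * hours)))

    for hours in (1, 2, 3, 4):
        print(f"{hours} hours studied -> P(pass) = {pass_probability(hours):.2f}")

With these assumed coefficients, the estimated probability of passing rises from about 0.08 after one hour of study to about 0.88 after four hours, which is exactly the feature-to-probability relationship the paragraph above describes.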
An e-commerce company that mails expensive promotional offers to customers would like to know whether a particular customer is likely to respond to the offers or not; that is, whether that consumer will be a "responder" or a "non-responder." In marketing, this is called propensity-to-respond modeling.
Likewise, a credit card company that develops a model to decide whether to issue a credit card to a customer will try to predict whether the customer will default on the card, based on characteristics such as annual income, monthly credit card payments, and number of past defaults. In banking parlance, this is known as default propensity modeling.
Logistic regression has become particularly popular in online advertising, enabling marketers to predict, as a yes/no probability, whether specific website users will click on particular advertisements. Logistic regression can also be used in:
Healthcare, to identify risk factors for diseases and plan preventive measures.
Weather forecasting apps, to predict snowfall and weather conditions.
Voting apps, to determine whether voters will vote for a particular candidate.
Insurance, to predict the chances that a policyholder will die before the term of the policy expires, based on criteria such as gender, age, and physical examination.
Banking, to predict the chances that a loan applicant will default on a loan, based on annual income, past defaults, and past debts.
Task 2

Task 2.1 High-level data warehouse architecture design

[Figure 1.1 - Big Data Analytics and Data Warehouse Combined: a four-stage pipeline of Data Capture (transaction databases, an operational system database, and an external transaction database), Processing (ETL tools), Data Storage (the data warehouse database and data marts), and Presentation (reporting, analysis, and mining tools producing interactive reports, ad hoc reports, and static reports).]

Task 2.2 Main Components of the Proposed High-Level Data Warehouse Architecture Design

The utilities industry is made up of organizations that provide services, such as water, energy, and oil, to people in different parts of the world; these services are daily necessities for all of us, and the utilities are primarily responsible for distributing them safely. Water is essential to survival: without it, we would not live a week. Water utilities are highly regulated, so keeping personal and sensitive information safe is a must for every organization. This is where data warehouse architecture design comes in. Choosing a design that suits your organization is one of the most important decisions to make in order to work well with your data, and understanding your organization's data warehouse helps you understand the organization itself. A data warehouse allows an organization to easily store and retrieve valuable data about owners, employees, machinery, and products, and the data warehouse database relates data drawn from the various systems inside the organization.

Data capture: Data varies depending on where it comes from, and many areas can serve as sources of data that will be helpful to the organization. Organizations can extract data from devices that users commonly carry, such as mobile phones, smart watches, and laptops. It is important to know where your data will come from, so the organization can set a scope and limitations on the gathered data.
Operational system - a system that processes the daily transactions of the organization; it is essential in data warehousing. The goal of this system is to make sure daily operations run efficiently and that transactional data is kept secure.
External data - data that comes from outside databases. This is also known as a federated data source, which can be accessed directly even though the data is not stored in the data warehouse. The organization has to maintain a set of references to each external data source.
Processing: After data has been collected, it must be prepared for transfer, cleansing, and transformation, to improve its quality and make it more efficient to use.
Transaction database - in the context of a database, a transaction is executed independently, supporting data recovery and system updates. In relational databases, a transaction must satisfy the following properties (a small code illustration of atomicity follows this list):
Atomicity: for a transaction to be considered successful, it must be either entirely completed and retained, or rolled back.
Consistency: transactions must leave the database in a valid state and obey the database's constraints. For example, if a column accepts only alphabetic letters, then every value written to it must consist of letters; the database would not accept an entry with numeric symbols.
Isolation: transaction data is visible only within the original transaction; it is not available to others until the original transaction is committed or rolled back.
Durability: changes made by a transaction must survive, even in the event of a database failure.
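
As a small illustration of atomicity, the following sketch uses an in-memory SQLite database and a hypothetical meter-readings table (not any specific utility schema): when a failure occurs mid-transaction, the whole transaction is rolled back as a unit.

    # Demonstrates atomicity with sqlite3: the failed transaction is rolled
    # back as a unit, so neither inserted row survives.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (meter_id TEXT, usage_litres REAL)")

    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("INSERT INTO readings VALUES ('M-001', 120.5)")
            conn.execute("INSERT INTO readings VALUES ('M-002', 98.0)")
            raise ValueError("simulated mid-transaction failure")
    except ValueError:
        pass

    # Prints 0: both inserts were undone together.
    print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])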
ETL tools - a good ETL framework increases the chances of achieving good connectivity and scalability. A good ETL tool must communicate efficiently with the various relational databases and read the numerous file formats used throughout the organization. ETL tools are used by a wide range of people, from students studying computer programs and components to the top executives who manage their organizations, and they have become a reliable way to process data through the Extract, Transform, Load steps.
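
A minimal sketch of what such a tool does under the hood is given below, assuming a hypothetical CSV of meter readings and a local SQLite target; real ETL tools wrap these same three steps with connectors, scheduling, and error handling.

    # Minimal Extract-Transform-Load sketch; the file, table, and column
    # names are hypothetical examples, not part of the proposed design.
    import csv
    import sqlite3

    def run_etl(source_csv: str, conn: sqlite3.Connection) -> None:
        conn.execute("""CREATE TABLE IF NOT EXISTS fact_usage
                        (meter_id TEXT, usage_litres REAL)""")
        with open(source_csv, newline="") as f:
            for row in csv.DictReader(f):                      # Extract
                litres = float(row["usage_gallons"]) * 3.785   # Transform
                conn.execute("INSERT INTO fact_usage VALUES (?, ?)",
                             (row["meter_id"], litres))        # Load
        conn.commit()

    run_etl("daily_readings.csv", sqlite3.connect("warehouse.db"))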

Data storage: After being processed and prepared, the transformed data is ready for storage: keeping data where it can easily be retrieved, with room to store billions of further records. Data storage is composed of the computer components used to store digital data.
Data warehouse database - a system designed for queries rather than transaction processing. The data gathered here comes from historical data storage, but the warehouse can also accept data from other sources, such as external data sources. The analysis workload is separated from the transaction workload, which enables the organization to consolidate data extracted from different sources.
Data mart - serves as an archive that stores a retrievable subset of the data to meet the requirements of a particular part of the organization. Data marts exist within the organization's single data warehouse repository. They save users time by giving a grouped view of the data tailored to each user's needs (a small sketch follows).
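
For illustration, a departmental data mart can be sketched as a view carved out of the warehouse; the table, view, and column names below are hypothetical, not taken from the proposed design.

    # Sketch of a data mart as a department-specific view over the warehouse.
    # Schema names are invented for illustration.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE warehouse_usage
                    (region TEXT, meter_id TEXT, usage_litres REAL)""")

    # The billing department's mart: only the rows and columns that
    # department needs, grouped the way its users consume them.
    conn.execute("""CREATE VIEW billing_mart AS
                    SELECT meter_id, SUM(usage_litres) AS total_litres
                    FROM warehouse_usage
                    GROUP BY meter_id""")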
Presentation: Data that has been retrieved is prepared for presentation, that is, the exhibition of information in a way that lets the organization use the data well and efficiently.
Reporting - the gathering and interpretation of data, resulting in an analysis of the facts the data contains. Using inaccurate data leads to poor decision-making and may cause the organization to fail. A simple example is a teacher collecting data from students' assessment outputs in order to determine their grades. Bear in mind that the effectiveness of the data also depends on how the organization reports on it, since reporting shapes the organization's agenda and future plans.
Analysis - determines how the organization will classify the data: whether it is useful for campaigns, marketing, or maintenance. Through this process, the organization examines the data and may discover information that benefits its reputation and goals.
Mining tools - the organization searches for the best methodologies to apply to the data in order to extract its true and fullest value; this also includes classifying the data and identifying relationships within it.
Interactive reports - present data in its most understandable state. Their goal is to give the audience the most knowledge within a limited amount of time. They usually include highly customized reports in which the important parts are highlighted so they can be addressed directly.
Ad hoc reports - reports created as fliers, displays, charts, and similar formats that the audience can take in easily. In short, an ad hoc report is a quick way to answer questions that were not predicted or expected.
Static reports - reports on areas such as inventory or resources that are generated only periodically. They require a minimal amount of time to maintain.
The processes stated above are the components of the proposed architecture design for a large data warehouse that will store data over the long term and serve the organization's future planning. Creating an architectural design that you actually understand is the foundation of a good organization, and understanding it thoroughly is a key to business success. It is best to start by sketching your data warehouse architecture and then develop it with the best people in your organization.

Task 2.3 Key security, privacy, and ethical concerns for a large state-owned water utility in using big data analytics capability combined with data warehousing

Data can be gathered in many ways, even through small everyday actions. Advances in technology have made data gathering far easier than it used to be, when internet access was hard to come by, some remote areas had weak signals, and internet fees were more than many people could afford. Now data is available with a single click on a mobile phone, a smart watch can retrieve data from its user, and data is transmitted simply by typing on a laptop. The internet is a great benefit to all of us, making life convenient and accessible, but it can also be a dangerous place if we do not keep it under control. For an organization, data privacy is of the utmost importance, because it protects millions of records belonging to management, employees, and users. As a user, you must submit your personal information and credentials to register with an organization; an employee must submit information that may contain sensitive data; and the management stores all of its past, present, and future plans and files in its database. We must place great trust in the data warehouse when we entrust our personal information to it. An organization must therefore have high-level security and privacy safeguards that protect all the information in its community; failure to do so can amount to a serious crime. Imagine the sensitive personal information of employees and customers leaking online: it would be a disaster.

In its privacy notices, an organization should address the ethical issues so that data is handled properly and no offence is committed. Consent is the top priority when asking users for their data: they should offer their information willingly, as their own decision. Users also choose who can control their data, and can decide whether a given person may alter or transform it. Confidentiality means giving only authorized persons access to your data and information: the user can pick the person responsible for holding their information, that person must take care of the data and may also relinquish the responsibility, and the user has the right to share their data with the authorized person only. Data from previous research can also be reused in future research. Implementing proper data management should be a top priority for the organization.
Access to and preservation of digital data is a must. When data privacy rules are implemented correctly, the organization runs smoothly because its data is organized by category: by size, class, and purpose. The world changes quickly, and data privacy practice must keep pace with advancing technology, as devices gather data ever faster; humans are evolving, and so are their creations. The data warehouse must likewise provide ever more space, because every day we create more data on our advanced devices. The ethical issues stated above should be taken into account in the data warehouse architecture shown in Figure 1.1, which is designed with high-level security and privacy so that an organization can trust it. Proper handling of data should be the organization's foremost answer to users' concerns about their data: the more data an organization or institution acquires, the greater its responsibility to protect it.

An organization can implement safeguards such as passwords and notification emails sent when an unauthorized login from an untrusted device attempts to connect to the data warehouse. It is always important to maximize security and data privacy, because this secures not only general data but personal data as well. Keeping private and sensitive information to yourself is also a big step toward improving your data security: the fewer people who know your personal account details, the safer your data is. You can also use modern technology to enhance your privacy; for example, Google offers notifications and alerts whenever someone tries to access your personal account, and banks, whose customer service involves monetary accounts, are required to make their security especially advanced because data privacy is crucial to them.
Task 3

The given design consists of four dashboards in which a reader or any concerned individual, most often police officers, can directly see the various events and most important happenings, namely crimes, within each area or location included in the tabulated list.
Tasks 3.1 to 3.4 Discussion of the Dashboards

The first sheet shows the specific crimes within each crime category, such as theft, kidnapping, and sexual harassment, for a given area or location, with its own police department area, during a given year. The second sheet shows the frequency of occurrence of a selected crime over a 24-hour period for a given location within a police department area. The third sheet shows the frequency of crime classification by police department area and by time, specifically by time of day. From the first view, it can be seen that theft is the most common crime and far outnumbers the others, across all areas and all time periods. Together, these sheets summarize everything to be made into dashboards. With the help of the dashboard, we can easily see the differences between them and identify which places are more dangerous and which crimes are more likely to happen in a given place.
Finally, the last sheet contains a map that shows the locations of the places involved, using the latitude and longitude given. This is very helpful, since you can see everything in a map view: with each crime shown at the location where it happened, you can easily find the position and anticipate where a crime is likely to happen again.

Task 3.5 Provide a rationale for the graphic design and functionality that is provided in your LAPD Crime Events Dashboard

For many of us, a car is an essential part of daily life. We invest thousands of
dollars in our cars, and some of us see them as extensions of ourselves and can’t
make it through the day without them. We value our cars—but we’re not the only
ones who do. Our cars are valued by thieves, too.
In 2016, there were 765,484 car thefts nationwide, an increase of 7.4% over 2015. That's a statistic no car owner wants to be a part of.
Modern security systems help deter car theft, but you can further minimize your
risk by understanding what thieves look for and how they think, and by taking
proactive steps to keep your car safe. It also helps to know what to do should you
become an unwitting contributor to the FBI’s statistic.
Car thieves are opportunists. They’ll steal any car that’s an easy target, but
certain makes and models rank high on their hit list.
According to the National Insurance Crime Bureau's (NICB) 2017 Hot Wheels Report, the Honda Civic was the number-one stolen car in 2017 and the Honda Accord a close second, a distinction both cars have held since 2007.
No matter what area of the country you live in, the odds of having your car stolen
are highest in urban areas. Dark, secluded places are also prime sites favored by
thieves because they can work undisturbed.
These include parking garages, shopping centers, large apartment complexes
and anywhere large groups of cars are parked together for extended periods of
time. Areas like these offer choice and also make it easier for thieves to see and
hear when people are coming.
Certain anti-theft systems are easy to acquire; for instance, you can get a steering wheel lock, which fits both old and new cars, at your local Walmart or auto parts store. For other anti-theft systems, such as kill switches, you may want to hire an auto professional: installing them yourself is an option, but it can be tricky. Auto mechanics and other professionals can answer any questions you have about anti-theft devices for your specific vehicle.
List of References:

Data Privacy. 2019. Retrieved from http://theconversation.com/big-data-is-useful-but-we-need-to-protect-your-privacy-too-40971
Data Privacy. 2019. Retrieved from https://www.rappler.com/technology/features/221542-things-to-know-about-data-privacy-notices
Data Mining Tools. 2019. Retrieved from https://towardsdatascience.com/data-mining-tools-f701645e0f4c?gi=b4d3e2cae1d
Decision Trees: A Simple Way of Visualizing a Decision. Retrieved from https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb
Data Mart. 2019. Retrieved from https://www.techopedia.com/definition/134/data-mart
Logistic Regression. Retrieved from https://searchbusinessanalytics.techtarget.com/definition/logistic-regression
Ad Hoc Reporting. 2019. Retrieved from https://www.inetsoft.com/info/ad_hoc_report_definition/
What Car Thieves Look For. Retrieved from https://extramile.thehartford.com/auto/car-theft-prevention/
Predictive Analytics. Retrieved from https://en.wikipedia.org/wiki/Predictive_analytics
Logistic Regression vs. Decision Trees. Retrieved from https://blog.bigml.com/2016/09/28/logistic-regression-versus-decision-trees/
List of Appendices
