
Root Cause Analysis





What is Root Cause Analysis?
Root Cause Analysis (or RCA) is the process of remotely diagnosing the root cause of a data quality (DQ)
issue after a UDR or UDI is completed. Properly diagnosing the root causes of cases helps the Field
Ops team repair the issues on future dispatches, and helps us understand which issues are causing us
the most harm, so that we can try to solve them through long-term means.

Beginning Root Cause Analysis: Hardware Used in Site Enablement

RCA requires a basic understanding of how sites collect data and how parts commonly malfunction.
Below is a list of the various parts of a metering setup, with a short description of each. These
descriptions are by no means exhaustive and do not cover the specifics of every enablement, but they
cover the setup of a site in the majority of circumstances.

Every site has a Metering Device, installed either by us or by the utility, that collects data about
electrical usage directly from the site's electrical input. When a DQ issue lies with the meter itself, it
usually shows up as incorrect readings on the DataStream. Drops, spikes, and zeros are usually caused
by a malfunctioning meter. Two devices are used on the majority of enablements.

Veris Meter: Veris meters are EnerNOC metering devices, meaning that we oversee their installation and
are responsible for their repair or replacement. Veris meters are attached to the mains entering a site
and measure electricity as it passes through them. The most common DQ issues caused by Veris meters
are continuous under-reading of actual electricity usage (usually by about 2%), spikes, drops, and zeros.

Utility Pulse Meters: Utility Pulse Meters are used by the utility to collect its own data. They are
connected to the rest of our equipment through a pulse board, which the utility sets up to split the
pulses it reads so that our equipment receives them as well. Utility pulses are more reliable than Veris
meters, but still have many similar issues (i.e. spikes, drops, and zeros). A DQ issue unique to Utility
Pulse Meters is a Pulse Multiplier Issue. A Pulse Multiplier Issue is created when the Pulse Multiplier
(the number that converts a pulse into a value in kWh) is set incorrectly.
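
The arithmetic behind a Pulse Multiplier Issue can be sketched as follows. This is only an illustration: the function name and the numbers are made up, and the real multiplier for a site comes from the utility. It shows that a wrong multiplier scales every reading by the same factor, rather than producing isolated spikes or drops.

```python
def pulses_to_kwh(pulse_count: int, pulse_multiplier: float) -> float:
    """Convert a raw pulse count into kWh using the pulse multiplier."""
    return pulse_count * pulse_multiplier


# Hypothetical values, for illustration only.
pulses = 250                 # pulses counted in one interval
correct_multiplier = 0.5     # kWh per pulse the site should use (assumed)
configured_multiplier = 5.0  # kWh per pulse actually configured (assumed)

actual_kwh = pulses_to_kwh(pulses, correct_multiplier)
reported_kwh = pulses_to_kwh(pulses, configured_multiplier)

# Every interval is off by the same ratio of the two multipliers.
error_ratio = reported_kwh / actual_kwh
print(actual_kwh, reported_kwh, error_ratio)
```

With these assumed numbers the reported usage is ten times the actual usage in every interval.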



The metering device sends this data to a Site Server. A site server takes in this raw data, and attempts
to send it, over the internet, to our database. The site server is one of the largest sources of issues, but
most site server issues are evidenced by data not reaching our database. Estimations and missing
readings can be issues with the site server. The site server can also cause several miscellaneous issues,
such as spike drops or negative spikes, in specific circumstances. There are five major site servers that
we use.

Nexus: The Nexus was the first kind of site server used. Only a few are still in use, but there are
many problems with the Nexus. Nexus site servers do not store data; they only forward it to the
database. Because of this, Nexus site servers are the ones most likely to have communication issues.

ILON E3 (also called ILON 100): The ILON E3 is the second kind of site server used by EnerNOC. Due to
the inaccuracy of the clock on ILON E3s, they are known for having massive spike drops, which usually
happen early on Sunday morning (due to the configuration of our platform). Unlike the Nexus, the ILON E3
stores historical data, but it does not store historical data concerning connectivity.

ILON E4: ILON E4s are the first site servers to use chat protocols to communicate. Unlike the previous
two site servers, ILON E4s automatically check for timestamp issues, so smaller spike drops are expected.
ILON E4s are also easier to communicate with remotely. PowerChat can be used to communicate with
an E4 to recollect data, restart the device, and remotely gather information. E4s do have one unique
issue, the Negative Spike, and are the only devices that cause it.

S1 and S2 Servers: S1 and S2 servers are both site servers designed by EnerNOC. S2s are slightly more
reliable, but both function similarly to the E4s (using a program called ToeChat instead of the E4's
PowerChat). Unlike E4s, though, S1 and S2 servers continuously check their timestamp, so there should
not be spike drops on these servers.
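
The site server descriptions above can be condensed into a small lookup table for quick reference during RCA. This is only a sketch: the dictionary layout and field names are illustrative, not an EnerNOC schema, and it records only the distinctive issues and chat tools named in this document (any site server can also be behind estimations or missing readings).

```python
# Distinctive traits of each site server, as described in this document.
# "chat_tool" is None where this document names no chat tool for the server.
SITE_SERVERS = {
    "Nexus":   {"chat_tool": None,        "issues": ("communication loss",)},
    "ILON E3": {"chat_tool": None,        "issues": ("spike drop",)},
    "ILON E4": {"chat_tool": "PowerChat", "issues": ("spike drop", "negative spike")},
    "S1":      {"chat_tool": "ToeChat",   "issues": ()},
    "S2":      {"chat_tool": "ToeChat",   "issues": ()},
}


def servers_with_issue(issue: str) -> list[str]:
    """Return the site servers this document associates with a given issue."""
    return [name for name, info in SITE_SERVERS.items() if issue in info["issues"]]


print(servers_with_issue("negative spike"))
print(servers_with_issue("spike drop"))
```

For example, a Negative Spike on a site narrows the site server down to an ILON E4, while a spike drop on an S1 or S2 site suggests looking beyond the site server's clock.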

We usually try to connect the metering device to the site server directly, to reduce DQ errors and the
amount of equipment needed at a site, but sometimes this is not possible. If not, the metering device
will connect with the site server using a wireless transmitter. Wireless transmitters usually fail because
of an interrupted signal, although they can be the reason for bigger problems if they break or their
batteries run out.

















[Figure callout: Wireless Transmitter is one of the Root Cause categories.]

A Wireless Transmitter is usually one of two devices:

Spinwave: A Spinwave sends a count of pulses (the signal from the metering device) to the site server.
When a Spinwave does not connect, information is not sent until it reconnects. This leads to drop
spikes, since when the Spinwave eventually connects it reports to the site server the amount of
electricity, in pulses, read by the meter since it last connected. The result is an interval with a lower
reading than it should have had, followed by an interval with a higher reading.
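
The Spinwave behavior described above can be sketched with a small simulation. It is a simplified model, not the device's actual firmware logic: it assumes the link is either up or down for a whole interval, and that all pulses counted while the link is down are delivered with the first interval after it returns. The function and variable names are illustrative.

```python
def spinwave_interval_pulses(true_pulses, connected):
    """Return the pulses the site server sees in each interval.

    true_pulses: pulses the meter actually produced in each interval.
    connected:   whether the Spinwave link was up in each interval.
    """
    seen = []
    backlog = 0  # pulses counted but not yet delivered
    for pulses, is_up in zip(true_pulses, connected):
        backlog += pulses
        if is_up:
            seen.append(backlog)  # deliver everything since the last connection
            backlog = 0
        else:
            seen.append(0)        # nothing arrives while the link is down
    return seen


true_pulses = [10, 10, 10, 10, 10]
connected = [True, False, False, True, True]
seen = spinwave_interval_pulses(true_pulses, connected)
print(seen)  # [10, 0, 0, 30, 10]
```

In this model the total energy is preserved (the drop and the spike cancel out), which is what distinguishes this pattern from a case where pulses are lost outright.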


Modhopper: A Modhopper simply sends the pulses directly to the site server. Because of this, when the
signal is not properly received, no data is collected. This is seen as gaps by the site server, and can lead
to estimations after the data has been processed by the database.




General Description of Database




The final part of any metering setup is how the site server communicates with our database. The site
server through some means sends this information over the internet, and the complexity of that process
leads to many problems, although they are all evidenced by estimated or non-existent readings. It is
usually done in one of three ways.

LAN: LAN stands for Local Area Network. Whenever possible, a site server will be configured to
connect with our database via the client's internet connection. In this case, it is directly
connected to the internet through the customer network, and problems with the customer
network can cause the server to lose communications.

VPN: VPN stands for Virtual Private Network. If a client requests a VPN, the server is
connected physically in the same way as with a LAN, but information is more secure. For the
purpose of Root Cause Analysis, there is little to no difference between a VPN and a LAN
connection.

Wireless: If necessary, the site server will connect with our database using a wireless signal
(over a cell phone network). In this case, a DQ issue can be caused by a bad cell signal, a
malfunctioning modem (the part that actually communicates with the cell towers), or a
general issue with the cellular carrier's network.

Completing RCA: Diagnosing a Root Cause
After assigning the case to yourself and understanding which parts commonly malfunction during
operation, the next step is to identify which malfunction is occurring. Some typical examples show how
these malfunctions appear at sites. The RCA process is described below.

The Flow Chart
In a large portion of cases, the root cause can be diagnosed from only one or two pieces of
evidence. For these cases, the RCA flow chart can be used to diagnose the root cause directly. The
flow chart can be found on the O drive here:
O:\NetworkOperations\Data_Quality\Jake\Root Cause Flow Chart.pdf
An additional resource is the DQ Symptoms + Root Cause 2.0 spreadsheet, which provides a few
examples of DQ issues. The document can be found here:
O:\NetworkOperations\Data_Quality\Jake\DQ Symptoms + Root Cause 2.0.xlsx
However, in two cases (Zeros and Missing Data) the flow chart does not give a clear diagnosis. The text
below details how to go about finding the root cause in these specific circumstances.

Invalid Zeros
Invalid Zeros are usually caused by an improperly functioning meter, but they can be caused by a
variety of different problems. When looking at a case for invalid zeros, you should first check the Cases
page. To navigate to the Cases page, first go to the site page. The site page can be reached by
clicking on the site name on the original case. Once on the site page (which will be your main resource
for most of the Root Cause Analysis process), scroll down the page to find the Cases field.
Click on the Go to list (#) link to get to the Cases page.






[Figure callout: Go to list shows the cases on which we have to perform root cause analysis. Action shows the updated case owner name.]



Once you are on the Cases page, sort the cases by Last Modified Date/Time. This is done
by clicking on that field's name near the top of the table. In the case of invalid zeros, look for
Zeros cases for the site (cases titled with Zeros and the site name) from the time period of the DQ
error. Click on each of the Zeros cases from the time period (from the date of the actual error to about
6 months afterward), and open them in new windows.
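
The selection of relevant zeros cases can be sketched as a filter over a list of cases. This is an illustration only: the cases here are plain dictionaries with made-up `subject` and `last_modified` keys rather than an export from ECRM, and "about 6 months" is approximated as 183 days.

```python
from datetime import datetime, timedelta


def relevant_zeros_cases(cases, error_date, window_days=183):
    """Return Zeros cases modified between the error date and ~6 months later,
    sorted by last modified date/time."""
    window_end = error_date + timedelta(days=window_days)
    matches = [
        case for case in cases
        if "zeros" in case["subject"].lower()
        and error_date <= case["last_modified"] <= window_end
    ]
    return sorted(matches, key=lambda case: case["last_modified"])


# Hypothetical cases for a single site.
cases = [
    {"subject": "Zeros - Example Site", "last_modified": datetime(2012, 3, 10)},
    {"subject": "Gaps - Example Site", "last_modified": datetime(2012, 3, 12)},
    {"subject": "Zeros - Example Site", "last_modified": datetime(2011, 1, 5)},
    {"subject": "Zeros - Example Site", "last_modified": datetime(2012, 2, 20)},
]
found = relevant_zeros_cases(cases, error_date=datetime(2012, 2, 15))
for case in found:
    print(case["last_modified"].date(), case["subject"])
```

Only the two Zeros cases modified after the error date are kept; the Gaps case and the Zeros case from the previous year are ignored.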
Now that you have found the relevant zeros cases, see whether they can help you with the
root cause. The simplest way this can happen is if the zeros case itself has a root cause and root cause
category. If so, that root cause is the root cause of your case. (Note that the root cause No DQ
Issue/Resolved is not used on its own for DQ. If this is the root cause on the zeros case, do not use it.)
If there is no root cause, read over the case resolution and comments and ask yourself some
questions about the case:

Was any hardware replaced?
Was any device rebooted remotely?
Was there an issue with the power source at the site?

Zeros are most often caused by our metering devices, but they can be caused by a fault in any of the
hardware at a site (whether it be the site server, the wireless transmitter, or an error with the
customer's power). As a rule, if you can identify any hardware that was replaced as a result of the case,
or any hardware that the comments show caused the issue, then that hardware is the root cause. It is
possible that no root cause can be found from the zeros cases. If that is true, there is a good chance
the issue is with the meter.
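
The rules above for invalid zeros can be summarized as a short decision function. This is a sketch of the reasoning, not an official tool: the dictionary keys (`root_cause`, `hardware_implicated`) and the fallback label "Metering Device" are hypothetical names chosen for the example, and reading the case resolution and comments remains a manual step.

```python
# Root causes that should not be copied from a zeros case on their own.
UNUSABLE_ROOT_CAUSES = {"No DQ Issue/Resolved"}


def root_cause_from_zeros_cases(zeros_cases):
    """Return (root_cause, basis) following the invalid zeros rules."""
    # 1. Prefer a usable root cause already recorded on a zeros case.
    for case in zeros_cases:
        recorded = case.get("root_cause")
        if recorded and recorded not in UNUSABLE_ROOT_CAUSES:
            return recorded, "recorded on a zeros case"
    # 2. Otherwise use hardware that was replaced or blamed in the comments.
    for case in zeros_cases:
        hardware = case.get("hardware_implicated")
        if hardware:
            return hardware, "hardware named in resolution or comments"
    # 3. With no evidence, the meter is the most likely cause.
    return "Metering Device", "no evidence found; meter is most likely"


example = [
    {"root_cause": "No DQ Issue/Resolved"},
    {"root_cause": None, "hardware_implicated": "Site Server"},
]
print(root_cause_from_zeros_cases(example))
print(root_cause_from_zeros_cases([]))
```

The first call skips the unusable root cause and falls through to the hardware named in the second case; the second call has no evidence at all and falls back to the meter.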

[Figure callout: Case reason shows zeros for this site.]


Description of Gaps, Estimation and Missing Data

This might be a bit confusing, but Gaps, Excessive Estimations, and Missing data are all very similar DQ
problems. They are all caused by information not getting back to our database, where sometimes the
VEE program makes estimations to fill gaps and missing data.

When assessing one of these issues, first check whether the issue is caused by a Modhopper. When
estimations are caused by a Modhopper, they are usually short and relatively frequent. These kinds
of estimations are caused by an interruption in the signal between the Modhopper transmitting the pulse
data and the Modhopper receiving it. If the estimations only last for about five to twenty
minutes, they might be caused by a Modhopper. If the estimations are short, the way to confirm whether
the Modhopper is the issue is to check whether the site has a Modhopper.

This is where it starts to get a little tricky, since Modhoppers are not always properly listed on ECRM.
There are two places where a Modhopper might be listed. First, the Modhopper can be listed in the
EnerNOC Site Servers field on ECRM. This is found on the site page directly above the meter field. It
will usually list whether a Modhopper is on the site in the Communications or Additional Notes sections.
If the Modhopper is not listed here, you can check whether the site survey indicated that a Modhopper
was needed. To do this, look at the Installation Notes in the Site Survey and Design field, found on the
site page. If evidence of a Modhopper is found, and the estimations are short, then the root cause is
Wireless Transmitter/Interrupted Signal.
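
The Modhopper check can be written out as a small function. This is a sketch under stated assumptions: the estimation lengths and the two evidence flags are entered by hand after looking at the DataStream and ECRM, the parameter names are made up for the example, and "short" is taken to mean no longer than the twenty minutes mentioned above.

```python
MODHOPPER_ROOT_CAUSE = "Wireless Transmitter/Interrupted Signal"


def modhopper_root_cause(estimation_minutes,
                         listed_in_site_servers_field,
                         noted_in_site_survey,
                         max_short_minutes=20):
    """Return the Modhopper root cause if the evidence supports it, else None.

    estimation_minutes: length in minutes of each estimated stretch.
    """
    if not estimation_minutes:
        return None  # no estimations to explain
    all_short = all(m <= max_short_minutes for m in estimation_minutes)
    has_modhopper = listed_in_site_servers_field or noted_in_site_survey
    return MODHOPPER_ROOT_CAUSE if all_short and has_modhopper else None


# Short estimations, Modhopper found only in the site survey notes.
print(modhopper_root_cause([5, 10, 15], False, True))
# One long outage: not the typical Modhopper pattern.
print(modhopper_root_cause([5, 240], True, False))
```

Both conditions are required: short estimations without any evidence of a Modhopper, or a Modhopper on site with long outages, should be investigated further.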












[Figure callout: Site Survey notes help us determine whether a Modhopper is installed at the site.]

ZEUS PAGE









If the root cause still has not been identified, and the issue may or may not be due to a communication
problem at the site, there is one more thing to try: the Zeus tool.
To check for a communication problem, go to the ZEUS page (http://encorpops07:8080/zeusweb/ZEUS.jsp)
and filter on the correct site. Once you find the correct site, click on the Site Server link for that site.
If the server is an E4, S1, or S2, this will bring up the server page.
Once you are on the server page, scroll to the bottom and review the table labeled Device Public IP.
When a server loses communications with our database, its IP address is listed as null. To confirm
whether communications were lost, see if the IP address was null during the issue. If it was null, the
root cause category is Communications.
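
The null IP check can be expressed as an overlap test between the issue period and the periods where the IP address was null. This is only a sketch: the actual layout of the Device Public IP table is not described here, so the rows are modeled as hypothetical (start, end, ip) tuples typed in by hand, with `None` standing for a null entry and a documentation-range IP address used as a placeholder.

```python
from datetime import datetime


def lost_comms_during(ip_history, issue_start, issue_end):
    """Return True if any null-IP period overlaps the issue period.

    ip_history: list of (start, end, ip) rows; ip is None when listed as null.
    """
    for start, end, ip in ip_history:
        overlaps = start < issue_end and end > issue_start
        if ip is None and overlaps:
            return True
    return False


# Hypothetical history: the server was offline for six hours on May 3.
history = [
    (datetime(2012, 5, 1, 0, 0), datetime(2012, 5, 3, 8, 0), "203.0.113.10"),
    (datetime(2012, 5, 3, 8, 0), datetime(2012, 5, 3, 14, 0), None),
    (datetime(2012, 5, 3, 14, 0), datetime(2012, 5, 10, 0, 0), "203.0.113.10"),
]
print(lost_comms_during(history, datetime(2012, 5, 3, 9), datetime(2012, 5, 3, 12)))
print(lost_comms_during(history, datetime(2012, 5, 5), datetime(2012, 5, 6)))
```

If the function returns True for the period of the DQ issue, the evidence points to the Communications root cause category; if it returns False, the server was reachable and the cause lies elsewhere.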










[Figure callout: Null shows that there is a communication issue at the site.]