
NESUG 2010

Coders' Corner

Know What You Are Missing: How to Catalogue and Manage Missing Pieces of Historical Data
Shankar Yaddanapudi, SAS® Consultant, Washington DC

ABSTRACT
In certain applications it is necessary to maintain and update historical data on a periodic basis. However, the data might be missing in one or more chunks, resulting in gaps in the history. This can have an adverse impact on the performance of the application consuming the historical data, and typically various techniques are used to fill in these missing chunks. Before such methods are employed, however, the number and size of the missing chunks need to be assessed to decide the appropriate course of action. This is important because the best course of action when a few missing values are scattered throughout the data might differ from the case where a large number of missing values are concentrated in a few spots. This situation is common in Risk Management applications. In this paper a situation is presented where it is critical to maintain a historical database without any missing values. SAS coding techniques are then presented to assess the data and make an inventory of it, including the number and size of the gaps. Possible courses of action based on this information are also briefly discussed.

INTRODUCTION
The problem of dealing with missing data has been studied extensively from several viewpoints. Much of this work has traditionally focused on imputation techniques, ranging from simple methods to sophisticated ones such as those encapsulated in PROC MI. Numerous papers from SUGI conferences describe how to count and tabulate missing data as a percentage of total data, as well as how to impute the missing values. In this paper we present a situation where it is helpful to examine the missing data more deeply than is usually done. SAS code is then presented to help with this analysis, followed by a brief discussion of the options available for dealing with the missing data.

A typical Risk Management System (RMS) consumes significant amounts of input data, and the SAS flagship product, SAS Risk Dimensions, is no exception. A major input to such an RMS, when used for market risk, is market data in the form of prices of various bonds, stocks, derivatives and other financial instruments. This data is typically used by the RMS to compute risk metrics such as VaR (Value-at-Risk), using methods such as historical simulation and Monte Carlo simulation. While there is some flexibility in choosing the risk computation method based on the objectives, it is generally agreed that the input market data should be of high quality, i.e., enough historical data should be available, and it should be accurate and without missing values. Financial firms go to great lengths to secure market data, Bloomberg and Reuters being two major vendors of such data. However, in spite of best efforts, the input market data sometimes does have missing values, in one or more chunks spread across the time series. There could be several reasons for this; for example, certain low-volume equity options might not have been traded for several business days at a stretch.

The data analyst needs to know why the data is missing in order to come up with a solution that suits his or her objectives. A first step in studying the missing data is simply to tabulate the number and size of the gaps in the data, and to look for patterns.
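For contrast with the gap inventory developed in this paper, the conventional tally of missing values per security can be sketched as follows. This is a minimal illustration, assuming an input data set named prices with variables ID, date and price; the output column names nObs, nMissing and pctMissing are chosen here for illustration only.

```sas
/* Count missing prices per security and express them as a share of all rows. */
/* Assumes a data set PRICES with variables ID and PRICE.                      */
PROC SQL;
  SELECT ID,
         COUNT(*)                AS nObs,
         NMISS(price)            AS nMissing,
         NMISS(price) / COUNT(*) AS pctMissing FORMAT=PERCENT8.1
  FROM prices
  GROUP BY ID;
QUIT;
```

A tally of this kind shows how much data is missing for each security, but not whether the missing values are scattered or concentrated in a few long runs, which is the question the rest of the paper addresses.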

SUMMARIZING THE MISSING DATA
The missing data can be summarized in several ways, but a simple DATA step method is presented here. Example data is presented in Fig 1.0, which lists three fictitious stock symbols with their closing prices. Stock AAXX has three gaps, of sizes 2, 1 and 3, while BBCD has two gaps, of sizes 6 and 1. CCCD has no gaps in the data. The SAS code in Fig 2 shows the approach taken to count the number and size of the gaps.


Fig 2. Sample Code to Extract Size and Number of Gaps in a Data Set

    /* Sort the input data set */
    PROC SORT DATA=prices;
      BY ID date;
    RUN;

    /* count the number and size of the gaps */
    DATA missgaps;
      SET prices;
      BY ID date;
      ** ID is stock symbol;
      ** ngaps is the number of gaps, size is the gap size, and flag indicates a missing value;
      RETAIN ngaps size flag;
      IF FIRST.ID THEN DO;
        ngaps=0;
        size=0;
        flag=0;
      END;
      /* identify missing data */
      IF MISSING(price) EQ 1 THEN gap=1;
      ELSE gap=0;
      /* start counting from the first missing data point */
      IF gap EQ 1 THEN DO;
        flag=1;
        size+1;
      END;
      /* when a non-missing value is encountered, increment gap number and reset */
      IF (gap EQ 0 AND flag EQ 1) OR (flag EQ 1 AND LAST.ID) THEN DO;
        ngaps+1;
        size=0;
        flag=0;
      END;
    RUN;

    /* extract the gaps and their sizes, and their starting and ending dates */
    PROC SQL NOPRINT;
      CREATE TABLE missSummary AS
      SELECT   ID,
               1+MAX(ngaps) as GapNumber,
               MAX(size) as GapSize,
               MIN(date) as StartDate format=mmddyy10.,
               MAX(date) as EndDate format=mmddyy10.
      FROM     missgaps
      WHERE    gap GT 0
      GROUP BY ID, ngaps
      ORDER BY ID;
    QUIT;
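The query in Fig 2 lists every gap. A variation that keeps only the largest gap per security might look like the sketch below. It reads the intermediate data set missgaps produced by the Fig 2 DATA step; the output names maxGap and MaxGapSize are hypothetical.

```sas
/* One row per security that has at least one gap, holding its longest gap. */
/* Assumes MISSGAPS was built by the DATA step in Fig 2.                    */
PROC SQL;
  CREATE TABLE maxGap AS
  SELECT   ID,
           MAX(size) AS MaxGapSize
  FROM     missgaps
  WHERE    gap GT 0
  GROUP BY ID
  ORDER BY ID;
QUIT;
```

This sketch assumes that no series ends in a missing value. The LAST.ID branch in Fig 2 resets size on a final missing observation, so a gap that runs to the end of a series would be under-counted here and is worth checking separately.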

The input data set is sorted first by the variable ID, which is the stock symbol. In the following DATA step, three variables are initialized using the RETAIN statement: ngaps is the number of gaps, size is the gap size, and flag indicates a missing value. The MISSING function is then used to identify the missing values, and the flag indicator is set accordingly. The variable size is incremented with each successive missing value. When the missing series ends, either when a non-missing value is encountered or when data for a new security begins, flag and size are reset to zero. Finally, a PROC SQL query extracts the number of gaps and their sizes, along with the dates showing the starting and ending points of each gap. The contents of the intermediate and the final data sets are presented in figures 3 and 4. As seen in Fig 4, for each security all the gaps and their sizes are listed. If a security does not have any gaps, it is not listed. If needed, the SQL query can be modified to extract the size of the maximum gap. The code presented in Figure 2 demonstrates the basic idea; it can be modified to be more efficient and compact, and can easily be converted into a macro.

Fig. 3.0 Listing of the data set missgaps
("." = missing price; "..." = non-missing price, value not recoverable)

    Obs  Date        ID    Price  ngaps  size  flag  gap
      1  07/26/2010  AAXX  ...      0      0     0    0
      2  07/27/2010  AAXX  .        0      1     1    1
      3  07/28/2010  AAXX  .        0      2     1    1
      4  07/29/2010  AAXX  ...      1      0     0    0
      5  07/30/2010  AAXX  ...      1      0     0    0
      6  08/02/2010  AAXX  .        1      1     1    1
      7  08/03/2010  AAXX  ...      2      0     0    0
      8  08/04/2010  AAXX  ...      2      0     0    0
      9  08/05/2010  AAXX  .        2      1     1    1
     10  08/06/2010  AAXX  .        2      2     1    1
     11  08/09/2010  AAXX  .        2      3     1    1
     12  08/10/2010  AAXX  ...      3      0     0    0
     13  08/11/2010  AAXX  ...      3      0     0    0
     14  08/12/2010  AAXX  ...      3      0     0    0
     15  07/26/2010  BBCD  .        0      1     1    1
     16  07/27/2010  BBCD  .        0      2     1    1
     17  07/28/2010  BBCD  .        0      3     1    1
     18  07/29/2010  BBCD  .        0      4     1    1
     19  07/30/2010  BBCD  .        0      5     1    1
     20  08/02/2010  BBCD  .        0      6     1    1
     21  08/03/2010  BBCD  ...      1      0     0    0
     22  08/04/2010  BBCD  .        1      1     1    1
     23  08/05/2010  BBCD  ...      2      0     0    0
     24  08/10/2010  CCCD  ...      0      0     0    0
     25  08/11/2010  CCCD  ...      0      0     0    0
     26  08/12/2010  CCCD  ...      0      0     0    0

ANALYZING THE MISSING DATA
The question arises how this summarized information can be of use to a data analyst. As an example, consider the case of Constant Maturity Treasury (CMT) rates, an important piece of market data used in a typical RMS. A listing of this data can be found at: http://www.treas.gov/offices/domestic-finance/debt-management/interest-rate/yield.shtml

If this data is summarized as described above, one would notice a very large gap for the CMT 30 series, during the period from February 18, 2002 to February 8, 2006. Further investigation would reveal that the data is missing because this CMT series was discontinued during this period, and that the Treasury has published alternate rates to this series. Armed with this information, the analyst can decide whether to use the alternate rates or some other method to fill in the gap, depending on his or her objectives.

Fig. 4.0 Listing of the data set missSummary

    ID    GapNumber  GapSize  StartDate   EndDate
    AAXX      1         2     07/27/2010  07/28/2010
    AAXX      2         1     08/02/2010  08/02/2010
    AAXX      3         3     08/05/2010  08/09/2010
    BBCD      1         6     07/26/2010  08/02/2010
    BBCD      2         1     08/04/2010  08/04/2010

On the other hand, there might be other reasons why the data is missing. For example, the market data vendor's database might not have been updated, or the security in question might not have been traded on some days. Where the missing data is scattered, and the number of gaps and the gap sizes are small, the analyst might conclude that more traditional methods such as LOCF or substitution by regression are more appropriate. These choices can easily be automated in a SAS program, which would first compute the number and size of the gaps, and then use this information to implement various options to fill in the missing data.

DEALING WITH THE MISSING DATA
There are several options for dealing with missing data, as elaborated in numerous papers. Some of the simplest methods include LOCF (last observation carried forward), substitution by mean and substitution by regression, while the multiple imputation methods available in PROC MI represent the state of the art. The choice of method depends on several factors, including the objectives of the analysis as well as time and computational constraints.

In the case of market data, the data sets tend to have hundreds of risk factors (variables) to be managed. A typical RMS also consumes a significant amount of computational resources and can take several hours each day for gathering and transforming portfolio and market data, generating simulations for various market states, and computing risk measures. Some of these steps are time intensive, and there is pressure to minimize processing time in each step. The advanced methods available in PROC MI are not always feasible under these circumstances.

However, due to the unique nature of the objectives of an RMS, some unconventional methods are available to analysts to fill in the missing data. The major objective of an RMS is to measure risk across a portfolio, as reflected by the market states. To accomplish this, a covariance matrix is built which represents correlations between all the risk factors. When some of these risk factors are missing, a security which represents either the entire market or the security in question can be used as a "proxy". For example, the S&P500 is a well-regarded market index which represents the overall market fairly well.
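One way the gap inventory could drive the choice of fill method is sketched below: gaps no longer than a chosen threshold are filled by LOCF, and longer gaps are left missing for separate treatment (alternate rates, a proxy, or another method). This is an illustration under stated assumptions, not a recommended procedure. It reads the data sets missgaps and missSummary from Fig 2; the macro variable maxLocfGap, its value of 2, and the data set names tofill and filled are all hypothetical.

```sas
%LET maxLocfGap = 2;   /* hypothetical threshold: largest gap filled by LOCF */

/* Attach to every missing row the size of the gap it belongs to.           */
/* In MISSGAPS a missing row carries the zero-based gap index in NGAPS,     */
/* while MISSSUMMARY numbers gaps from one, hence the ngaps + 1 in the join. */
PROC SQL;
  CREATE TABLE tofill AS
  SELECT   m.ID, m.date, m.price, s.GapSize
  FROM     missgaps AS m
           LEFT JOIN missSummary AS s
           ON  m.ID = s.ID
           AND m.gap = 1
           AND m.ngaps + 1 = s.GapNumber
  ORDER BY m.ID, m.date;
QUIT;

/* Carry the last known price forward, but only inside short gaps. */
DATA filled;
  SET tofill;
  BY ID date;
  RETAIN lastPrice;
  IF FIRST.ID THEN lastPrice = .;          /* never carry a price across securities */
  IF NOT MISSING(price) THEN lastPrice = price;
  ELSE IF NOT MISSING(GapSize) AND GapSize LE &maxLocfGap
    THEN price = lastPrice;                /* short gap: LOCF */
  DROP lastPrice GapSize;
RUN;
```

A gap at the very start of a series has no earlier price to carry forward and therefore stays missing. A gap that runs to the last observation of a security is recorded differently by the Fig 2 logic and should be reviewed separately before relying on this sketch.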

CONCLUSIONS
In some situations it is beneficial to consider the number and size of the gaps in the missing data before filling in the data. This can be easily done using standard SAS techniques.

ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION
Please contact the author if you have any questions or comments:
Shankar Yaddanapudi
Paradigm Infotec
Columbia, MD 21045
Email: shankar.stat@gmail.com

Fig. 1.0 Example Input data
("." = missing price; "..." = non-missing price, value not recoverable)

    ID    DATE        PRICE
    AAXX  07/26/2010  ...
    AAXX  07/27/2010  .
    AAXX  07/28/2010  .
    AAXX  07/29/2010  ...
    AAXX  07/30/2010  ...
    AAXX  08/02/2010  .
    AAXX  08/03/2010  ...
    AAXX  08/04/2010  ...
    AAXX  08/05/2010  .
    AAXX  08/06/2010  .
    AAXX  08/09/2010  .
    AAXX  08/10/2010  ...
    AAXX  08/11/2010  ...
    AAXX  08/12/2010  ...
    BBCD  07/26/2010  .
    BBCD  07/27/2010  .
    BBCD  07/28/2010  .
    BBCD  07/29/2010  .
    BBCD  07/30/2010  .
    BBCD  08/02/2010  .
    BBCD  08/03/2010  ...
    BBCD  08/04/2010  .
    BBCD  08/05/2010  ...
    CCCD  08/10/2010  ...
    CCCD  08/11/2010  ...
    CCCD  08/12/2010  ...
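To try the Fig 2 code, the example data can be keyed in with a DATA step. The sketch below follows the dates and the missing-value pattern of Fig 1. The non-missing prices are illustrative placeholders chosen for this sketch, not the values from the original figure.

```sas
/* Example input in the layout of Fig 1: one row per security per date.  */
/* A period in the PRICE column is read as a SAS missing value.          */
/* Non-missing prices are hypothetical; only the gap pattern is sourced. */
DATA prices;
  INPUT ID $ date :MMDDYY10. price;
  FORMAT date MMDDYY10.;
  DATALINES;
AAXX 07/26/2010 35.00
AAXX 07/27/2010 .
AAXX 07/28/2010 .
AAXX 07/29/2010 35.50
AAXX 07/30/2010 35.25
AAXX 08/02/2010 .
AAXX 08/03/2010 36.00
AAXX 08/04/2010 36.25
AAXX 08/05/2010 .
AAXX 08/06/2010 .
AAXX 08/09/2010 .
AAXX 08/10/2010 36.50
AAXX 08/11/2010 36.75
AAXX 08/12/2010 37.00
BBCD 07/26/2010 .
BBCD 07/27/2010 .
BBCD 07/28/2010 .
BBCD 07/29/2010 .
BBCD 07/30/2010 .
BBCD 08/02/2010 .
BBCD 08/03/2010 77.00
BBCD 08/04/2010 .
BBCD 08/05/2010 77.50
CCCD 08/10/2010 77.25
CCCD 08/11/2010 77.50
CCCD 08/12/2010 77.75
;
RUN;
```

Because the gap logic depends only on which prices are missing, running the Fig 2 code on this data set should yield the same gap numbers, sizes and dates as listed in Fig 4.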