Professional Documents
Culture Documents
2
O International Epidemiological Association 1996 Printed In Great Britain
Development in technology has led to a considerable primary data. Other advantages include the size of the
increase in the number of individual-based data sample, its representativeness, and the reduced likeli-
sources, registers, data bases, and information systems hood of bias due to, for example, recall, non-response
that may be of value in epidemiological research, and and effect on the diagnostic process of attention caused
the number of studies that are based on secondary data by the research question.1'7
may be expected to increase. Secondary data in The disadvantages of secondary data are related to
research are data which have not been collected with the fact that their selection and quality, and the methods
a specific research purpose.1 Such data are often of their collection, are not under the control of the
collected for: 1) management, claims, administration researcher, and that they are sometimes impossible to
and planning;2 2) evaluation of activities within health validate.
care;2 3) control functions;3'4 and 4) surveillance or Despite comprehensive use of secondary data
research.5'6 sources, the literature concerning this is relatively
The main advantage of using secondary data sources modest.7"24 The purpose of this article is to address
is that they already exist; the time spent on the study is issues that are of importance to the use of a secondary
therefore likely to be considerably less than the time data source for research, and to illustrate the subject
spent on studies that use primary data collection. Fur- matter with some examples.
thermore, the costs of the project are reduced markedly, Individual-based data sources usually consist of a
as is the waste of data, compared with collection of series of records for each individual, each record
containing several items of information, but since
data collection and compilation have usually not been
• Department of Internal Medicine V, Aartius University HospitaJ,
DK-8000 Aarhus C, Denmark.
done with a current research purpose in mind,
•* Department of Epidemiology and Social Medicine, University of secondary data will usually not cover all aspects of
Aarhus, DK-8000 Aarhus C, Denmark. interest.
* The Danish Epidemiology Science Centre, The Steno Institute of As in all research, planning a study should aim at
Public Health, University of Aarhus, DK-8000 Aarhus C, Denmark.
Reprint requests to: Henrik Toft Soretuen, The Danish Epidemiology
reducing both systematic and random errors. Any study
Science Centre, Norrebrogade 44, Building 2c, Aarhus University based on secondary data should be designed with the
Hospital, DK-8000 Aarhui C, Denmark. same rigour as other studies, i.e. specifying hypotheses
435
436 INTERNATIONAL JOURNAL OF EPIDEMIOLOGY
TABLE 1 Factors affecting the value of secondary data in epi- based on absolute values obtained by subtracting
demiological research occurrence data in one population from occurrence data
in another. In order to obtain a high specificity,
1. Completeness of registration of individuals all those with the diagnosis in question may only be
a. Comparing the data source with one or more independent accepted after close scrutiny, e.g. by going through
reference sources existing medical files, asking for additional infor-
b. Comprehensive records review
c. Aggregated methods mation, etc. The principle is similar to applying screen-
2. The accuracy and degree of completeness of variables ing tests in sequence when only test positives are
a. Precision screened twice. Should it be possible to exclude every-
b. Validity one without the diagnosis, the case ascertainment
3. The size of the data sources will have a specificity of 1, making all relative mea-
4. Registration period
FACTORS AFFECTING THE VALUE OF TABLE 2 Terminology in relation to evaluation of data source I.
SECONDARY DATA Data source 2 is used for comparison, but is seldom a gold
standard and cannot immediately be used as a reference for the
Completeness of Registration of Individuals true disease occurrence in a population. A prerequisite in the
By this we mean the proportion of individuals in the design of the table is to go through records ensuring that
target population which is correctly classified in the the cases fulfil the criteria of being cases
data source. In this respect, it is important to know
whether the data source is population-based, like many Data source I Data source 2
of the European health insurance systems,2 or whether
it has been through one or more selection procedures Registered cases Non-registered cases
(for instance in Medicaid which may exclude large
groups of people who were included in previous Registered cases a b a+b
comorbidities, d) limits in the specificity of available identified; for example, the results of certain diagnostic
codes, e) errors and variation in clinical diagnosis. It is tests, diagnoses, age, gender, contacts with medical
important to underline that discharge systems are often doctors, and demographic data.
event-based and not person-based. Errors of variables can be divided into two groups:
C. In aggregated methods the total number of cases in 1) random errors and 2) systematic errors. To secure
the data source is compared with the total number in the accuracy of a study it is necessary to try to reduce
other sources, or the expected number of cases is calcu- the occurrence of both types.
lated by applying epidemiological rates from demo- The concept of precision is complementary to the
graphically similar populations or by simulation. concept of random error. Validity is the extent to which
Simulation uses the information system to simulate pat- the study measures what it is intended to measure. Lack
terns of incomplete reporting to examine the possible of validity is referred to as bias or systematic error, and
effect on a specific dependent variable.8'33
Another recurring problem with data base studies criteria and classifications (e.g. the recent change from
concerns missing information about potential con- ICD-9-CM to ICD-10 disease classification system)
founders;43'44 a potential confounder which is often frequently cause problems when comparing data over
lacking in secondary data sources is smoking. It has longer periods.
been calculated that even when one is dealing with Example: Based on data from the Danish Cancer
large effects (i.e. risk ratios of 5=2) different smoking Register in the period 1943-1985 it has been shown
habits will usually only be able to 'explain' part of the empirically that the increase in the incidence of regis-
association.44 tered primary liver cancer in Denmark may be ascribed
In summary, data quality problems can be categ- to changes that have taken place in diagnostic proced-
orized as follows: 1) errors in the data set may reflect ures, which resulted in changes in the classification of
incorrect data entry or lack of entry of available liver cancer from unspecified'liver cancer to primary
The Size of the Data Source Data Accessibility, Availability and Cost
It is essential to know how many people and how many It is often not clear who owns the data and who has the
variables are registered in the data source. Furthermore, right to use them (accessibility).49 It is important to
it may be relevant to know the distribution of the clarify these points and to find out which authorities
various variables since this may be of importance in should approve the use of the data for research pur-
designing the study to provide it with proper dimen- poses. It is well-known that general practitioners and
sion.45 Use of restriction and matching in the control of hospitals do not always respond or do not accept the use
confounding factors and sources of selection often of their records for research.3-37-50 Records may have
require progressively more subjects as the number of been destroyed.51
matching variables increases. It is also important to know the financial costs of
If the data source is very large, even small associ- using the data and for having them made available.
ations will give statistically significant results. It is Data are sometimes available on-line, but more
therefore essential to relate the size of the data source to often special programmes are needed. Information
the clinical relevance of any difference, rather than to on data confidentiality is also essential in order to
look at the P-values. In registry studies, as in other ensure protection of confidentiality of data on indi-
types of study, it is more important to calculate the viduals which are reported to the data sources, so
confidence interval around the estimated difference that information on those registered cannot reach
than to calculate the /"-values."' 46 unauthorized third parties. 52 Confidentiality can
often be maintained by using multiple passwords for
Registration Period data access and by using abbreviated identifiers in
Often data sources only contain cross-sectional regist- the researchers' data.13
rations,47 which reduce the possibility for analytical
studies. With respect to longitudinal studies, informa- Data Format
tion concerning the registration period(s) is essential Data will often be in the form of paper records from
for the design in order to relate exposure and effect to hospitals5 or be computerized, e.g. many health insur-
possible induction and latent periods. The induction ance data. 2 '"- 18 ' 21 " 23 Even computerized records can be
period is the period required for a specific cause to formatted or structured in such a way that their use is
produce disease, the latent period is the delay between made difficult for research, e.g. inappropriate format of
the exposure and the period of manifestation of the variables (e.g. diagnostic categories, age bands).
disease. Data sources with observation periods of a few
years will seldom be suitable for aetiological cancer Possibilities of Linkage with Other Data Sources
research. The length of registration may also be import- (Record Linkage)
ant in ascertaining cases where the diagnosis is delayed, Important research results have been obtained by
e.g. congenital heart diseases are often not diagnosed linkage of different data sources.53f54 Record linkage
until after the neonatal period. techniques can help to identify the same person in dif-
Furthermore, codes, and even the layout of records, ferent files. There are for example several computerized
are often changed periodically. Changes in diagnostic health care data bases in North America. 1018 ' 2 '- 23 ' 35 ' 35 ' 3 *
440 INTERNATIONAL JOURNAL OF EPIDEMIOLOGY
By using computerized billing records, drug exposure inform the people concerned when personal data are
was linked to files which included diagnosis (internal used. The passing of this proposal will pose a serious
record linkage). Linkage is done via a personal identi- problem to research based on information systems.63"65
fication number. Consequently, these data bases con- The proposal has just been examined by the EU, but the
stitute powerful tools for drug evaluation. However, a final text is not yet known.66 Several of these systems
complete high-quality record linkage may not always are made for administrative purposes focusing on
be possible. The best population-based data sources are individuals, while researchers are interested in groups
probably the extensive data linkage networks in and their interaction. The proposal only affects research
Scandinavia, where each person is assigned a unique and has no protective effect on misuse by the
personal registration number at birth (CPR-number), administrative apparatus which collected the data.
allowing record linkage between several independent
to that obtained by interview. Am J Epidemiol 1989; 129: rheumatoid arthritis and lymphoma in the province of
616-24. Saskatchewan, Canada. J Clin Epidemiol 1993; 46: 685-
44
Axelson O. Aspects of confounding and effect modification 95.
in the assessment of occupational cancer risk. J Toiicol ^Rodriquez L A G , Walker A M, Gutthann S P. Nonsteroidal
Environ Health 1980; 6: 1127-31. antiinflammatory drugs and gastrointestinal hospitaliza-
43
Strom B L. Sample size considerations for pharmaco- tions in Saskatchewan: a cohort study. Epidemiology 1992;
epidemiologic studies. In: Strom B L (ed). Pharmaco- 3: 337^2.
37
epidemiology. New York: Churchill Livingstone, 1989, Kramer M S. Clinical Epidemiology and Biostatistics. Berlin:
pp. 27-37. Springer- Verlag, 1988.
31
** Walker A M. Reporting results of epidemiologic studies. Am J Frisch M, Olsen J H, Bautz A, Melbye M. Benign anal lesions and
Public Health 1986; 76: 556-58. the risk of anal cancer. N Engl J Med 1994; 331: 300-02.
47 39
Gable C B. A compendium of public health data sources. Am J S0rensen H T, Larsen B O. A population-based Danish data
Epidemiol 1990; 131: 381-94. resource with possible high validity in pharmaco-
41
epidemiological research. J Med System 1994; 18: 33-38.