

Initial data capture may be done on: paper; a computer entry screen at the investigator's (INV's) site; a laboratory instrument; a central lab system; etc. Whatever the source, the data must be collected and stored in a computer system or systems that allow complex data cleaning, review, and reporting.

Collection of the data into the central system is done either manually (data entry from paper or image) or electronically (transfer of data received from other computer applications, e.g., central lab data). The main focus of study setup is to collect and store study data for further processing.

The study setup steps combine:

Definition and creation of a database; preparation of manual data entry applications; design and programming of transfer or loading programs.

Creating a database means creating a structured set of data, e.g., an Excel spreadsheet, a Microsoft Access application, or a collection of SAS tables. Deciding on the design is the first step in creating a database. After it is built, the database is tested before it is released for production use.

The main purpose of a database is to store data accurately. A good database design balances various needs, preferences, and limitations, such as: clarity, ease, and speed of data entry; efficient creation of analysis data sets for biostatisticians; source data transfer formats; database design theory; and database application requirements.


A data capture instrument is the means of collecting the data from the site. Examples include CRF pages, electronic data entry screens, and files containing lab data. Because the data is collected from the sites, data capture instruments are carefully designed to be clear and easy to use.

All data managers, whether or not they are responsible for creating a database design from scratch, should understand the kinds of fields and organization that affect storage, analysis, and processing of the trial data, and should be aware of the options to balance when making a design decision for those fields.

The most common fields that have a significant impact on data management are: hidden text fields, where text appears along with numeric values; dates of all kinds; text fields and annotations; header information; single check boxes; calculated or derived values; and special integers.


Text in numeric fields occurs in all kinds of studies and in all kinds of clinical data. It creates a tension between the need to record exactly what is on the CRF and the need to collect data that can actually be analyzed in a sensible way. Examples include the word "trace" found among lab measurements, a value of "<5" found in an efficacy measure, or a range of "10-15" found in a count of lesions.

Options for handling this type of data include: designing the database field as a text field so that both numeric and text values can be entered; using a numeric field to store the data and issuing a discrepancy if there is text in the field (this makes the most sense when the data represents a critical measurement expected to be numeric); or creating two fields, one text and one numeric.

With two fields, the numeric field stores the numeric value and an associated text or comment field holds text values when they appear. Another option is to set data entry guidelines so that a numeric value is always chosen: for example, "trace" may be entered as 0, "<5" as 4, and "10-15" as 13.
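The two-field option can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `LabEntry` and `split_entry` are hypothetical, and a real system would also decide how to treat special strings such as "nan" that Python's `float` accepts.

```python
from typing import NamedTuple, Optional


class LabEntry(NamedTuple):
    numeric: Optional[float]  # value usable for analysis, if any
    text: Optional[str]       # original text when the entry was not a plain number


def split_entry(raw: str) -> LabEntry:
    """Store a plain number in the numeric field; keep anything else as text."""
    value = raw.strip()
    try:
        return LabEntry(numeric=float(value), text=None)
    except ValueError:
        # Not a plain number ("trace", "<5", "10-15"): preserve it verbatim.
        return LabEntry(numeric=None, text=value)


for raw in ["4.2", "trace", "<5", "10-15"]:
    print(raw, "->", split_entry(raw))
```

Nothing written on the CRF is lost with this layout, and the numeric column stays clean for analysis. The text column can then drive discrepancy review.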

Problems frequently arise if the complete date is not known, that is, if the value is given as 6/98 for June of 1998 instead of June 12, 1998. Dates on a CRF typically fall into three categories: known dates related to the study (visit date, lab sample date); historical dates (previous surgery, prior treatment); and dates closely related to the study (concomitant medication, adverse events).

The first kind of date is needed for proper analysis of the study and adherence to the protocol and should always be complete. The second kind of date is often not exactly known to the patient and so partial dates are common. The last type is particularly difficult because the dates are actually useful but not always known exactly if there is a significant time span between visits.

A normal database data type of "date" usually works just fine for known dates related to the study. If the date is incomplete, nothing is stored in the database, a discrepancy can be issued, and it is likely that a full resolution can be found. Historical dates are frequently not analyzed; they are collected and stored for reference and medical review. One option for storing these kinds of dates is a simple text field that allows dates of any kind.
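A minimal sketch of these two storage choices follows. The function names are hypothetical and the MM/DD/YYYY input format is an assumption made for illustration; the source does not prescribe a format.

```python
from datetime import date, datetime
from typing import Optional, Tuple


def store_study_date(raw: str) -> Tuple[Optional[date], bool]:
    """Return (stored_date, discrepancy_flag) for a date that must be complete.

    The MM/DD/YYYY input format is an assumption for this sketch.
    """
    try:
        return datetime.strptime(raw.strip(), "%m/%d/%Y").date(), False
    except ValueError:
        # Incomplete or malformed: store nothing and flag a discrepancy.
        return None, True


def store_historical_date(raw: str) -> str:
    """Historical dates are kept as text, so partial values are accepted as written."""
    return raw.strip()


print(store_study_date("06/12/1998"))   # complete date is stored
print(store_study_date("6/98"))         # partial date is flagged
print(store_historical_date("6/98"))    # partial date is kept as text
```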


The most common text-type fields are: categorical (coded) values; short comments; reported terms; long comments; and annotations on the page. All of these text data types require special handling, especially when the text data is to be analyzed.

Coded values comprise the largest number of text fields. These fields have a limited list of possible answers and usually can contain only those answers. Examples include Yes/No answers, Male/Female answers for gender, and Mild/Moderate/Severe answers for severity.
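As a rough illustration, a coded field can be checked against its codelist as below. The codelist names (`YES_NO`, `GENDER`, `SEVERITY`) and the function `is_valid_code` are hypothetical; only the answer values come from the examples above.

```python
# Hypothetical codelist names mapped to their permitted answers.
CODELISTS = {
    "YES_NO": {"Yes", "No"},
    "GENDER": {"Male", "Female"},
    "SEVERITY": {"Mild", "Moderate", "Severe"},
}


def is_valid_code(codelist: str, value: str) -> bool:
    """True only when the value is one of the answers the codelist permits."""
    return value in CODELISTS[codelist]


print(is_valid_code("SEVERITY", "Moderate"))  # permitted answer
print(is_valid_code("SEVERITY", "Bad"))       # not in the codelist
```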

Short, free-text values that may be part of the analysis plan, or part of a secondary analysis, require special attention. Any kind of review or analysis of these values depends on there being some consistency in the data.

Large codelists (sometimes called dictionaries or thesauri) are available for a very common class of free-text fields: adverse events, medications, and diagnoses. This kind of free text is often called reported terms, and matching the terms to the codelists is a complex coding process.

Longer texts, those that cover several lines, are more frequently associated with a block of fields or even an entire page or visit. Long comments can be stored in several ways: as one large text field; as several numbered text fields; in a separate storage area; or in CRF images only, with a cross reference.

Patient information, such as investigator, patient number, and patient initials, always appears with data as it is collected to identify where the data comes from. Patient header information is common to all trials, but other fields may be needed to identify the data, e.g., page number, page type or name (AE, DEMOG, PE, etc.), and document identifier. Good database design theory calls for techniques that ensure the header information is stored only once and is then related to all the remaining data through one or more key codes.
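The idea of storing the header once and relating it to the data through a key can be sketched with Python's built-in `sqlite3` module. All table names, column names, and values here are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE header (
    header_id        INTEGER PRIMARY KEY,   -- the key code
    investigator     TEXT,
    patient_no       TEXT,
    patient_initials TEXT,
    page_type        TEXT
);
CREATE TABLE adverse_event (
    header_id     INTEGER REFERENCES header(header_id),
    reported_term TEXT,
    severity      TEXT
);
""")

# Header information is stored once...
conn.execute("INSERT INTO header VALUES (1, 'INV01', '010004', 'ABC', 'AE')")
# ...and every data row carries only the key.
conn.executemany(
    "INSERT INTO adverse_event VALUES (?, ?, ?)",
    [(1, "Headache", "Mild"), (1, "Nausea", "Moderate")],
)

rows = conn.execute("""
    SELECT h.patient_no, a.reported_term, a.severity
    FROM adverse_event a JOIN header h ON h.header_id = a.header_id
    ORDER BY a.reported_term
""").fetchall()
print(rows)
```

The join recovers the patient number for each event without it being repeated in every data row.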


In good CRF design, questions do not have "blank" as an acceptable answer, since blanks may be confused with overlooked fields or unavailable answers. A single check box breaks this rule. Example: Check if any adverse events: [ ]

All options for a database design to support single check boxes store the information of whether or not the box was checked, but they have different philosophical angles: associate a Yes/No codelist with the field; associate a Yes-only codelist with the field; or don't use a codelist. Since a codelist offers little here, some companies use a single character (not associated with a codelist) to indicate that a box or field was checked.


Calculated and derived values are internal fields that can be convenient, and even very important, to the processing of data. They are calculated from other data using mathematical expressions, or derived from other data using text algorithms or other logic. Examples of calculated values include: age (if date of birth is collected); number of days in treatment (when collecting treatment dates); weight in kilograms (if collected in pounds); and SI lab values (when a variety of lab units are collected). Examples of derived values include: extracting a site identifier from a long patient identifier; assigning a value to indicate whether a date was complete; and matching dictionary codes to reported adverse event terms.

Some of these values are calculated or derived in analysis data sets; others are calculated or derived in the central database. Database designers should identify the necessary calculated and derived fields and determine whether they will be assigned values as part of analysis or as part of the central database.
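A few of the calculated and derived values listed above can be sketched as small functions. The function names are hypothetical, and two details are assumptions made only for this illustration: treatment days are counted inclusively, and the site identifier is taken to be the leading two characters of the patient identifier.

```python
from datetime import date

LB_PER_KG = 2.20462  # pounds per kilogram (approximate)


def age_in_years(birth: date, on: date) -> int:
    """Completed years between date of birth and a reference date."""
    before_birthday = (on.month, on.day) < (birth.month, birth.day)
    return on.year - birth.year - (1 if before_birthday else 0)


def days_in_treatment(start: date, end: date) -> int:
    """Number of treatment days, counting both start and end (an assumption)."""
    return (end - start).days + 1


def pounds_to_kg(pounds: float) -> float:
    """Weight in kilograms, rounded to one decimal place."""
    return round(pounds / LB_PER_KG, 1)


def site_from_patient_id(patient_id: str, site_chars: int = 2) -> str:
    """Derive the site identifier, assuming it is the leading characters of the ID."""
    return patient_id[:site_chars]


print(age_in_years(date(1960, 6, 12), date(1998, 6, 11)))
print(days_in_treatment(date(1998, 6, 1), date(1998, 6, 12)))
print(pounds_to_kg(154.0))
print(site_from_patient_id("010004"))
```

Whether such functions run in the central database or in the analysis data sets is exactly the design decision described above.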

Patient number fields look perfectly benign and numeric. However, when a patient number field is defined as an integer field and the values are very long, many databases and analysis systems will display the number in scientific notation; e.g., 110001999 may display as 1.1e8. Document IDs and batch numbers behave the same way. These fields may also have a leading-zero problem; e.g., patient number 010004 at site 01 will be stored as 10004, and the site as 1. It is better to define these special integer fields as text fields.
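Both problems are easy to reproduce in Python. This is a small demonstration only; the exact display format varies between database and analysis systems.

```python
patient_id = "010004"            # as written on the CRF

as_integer = int(patient_id)     # numeric storage drops the leading zero
as_text = patient_id             # text storage keeps the value as written

print(as_integer)                # 10004
print(as_text[:2])               # 01

# A long identifier shown in one common scientific-notation format:
print(f"{float(110001999):.1e}") # 1.1e+08
```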


DATABASE NORMALIZATION: The process of creating a design that allows for efficient access and storage. It usually involves a series of steps to avoid duplication or repetition of data, often achieved by reducing the size of data groupings or records.

Data Storage in a Short-fat form:

Pt. ID   Visit   BP_DIA_1   BP_SYS_1   BP_DIA_2   BP_SYS_2   BP_DIA_3   BP_SYS_3

Data Storage in a Tall-skinny form:

Pt. ID   Visit   Measurement   BP_DIA   BP_SYS
1001     2       1             72       120
1001     2       2             70       118
1001     2       3             68       117

Both kinds of structures store the data accurately and allow for its appropriate retrieval, yet the choice affects data management and analysis in several ways. Clinical data contains many examples of repeated measurements that lend themselves well to storage in tall-skinny form. These include physical exam (PE), medical history (MH), adverse events (AE), concomitant medications (CM), etc.
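The relationship between the two forms can be sketched as a conversion from one short-fat record into several tall-skinny records. The function name, the dictionary keys `PT_ID` and `VISIT`, and the sample values are illustrative assumptions.

```python
from typing import Dict, List


def to_tall_skinny(row: Dict[str, object], repeats: int = 3) -> List[Dict[str, object]]:
    """Turn one short-fat record into one tall-skinny record per measurement."""
    tall = []
    for n in range(1, repeats + 1):
        tall.append({
            "PT_ID": row["PT_ID"],
            "VISIT": row["VISIT"],
            "MEASUREMENT": n,               # which repeat this row holds
            "BP_DIA": row[f"BP_DIA_{n}"],
            "BP_SYS": row[f"BP_SYS_{n}"],
        })
    return tall


# Illustrative values for one patient visit with three blood pressure readings.
short_fat = {
    "PT_ID": "1001", "VISIT": 2,
    "BP_DIA_1": 72, "BP_SYS_1": 120,
    "BP_DIA_2": 70, "BP_SYS_2": 118,
    "BP_DIA_3": 68, "BP_SYS_3": 117,
}

tall = to_tall_skinny(short_fat)
for record in tall:
    print(record)
```

In the tall-skinny form, a fourth reading adds a row. In the short-fat form it would require two new columns.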


After considering all the design options, the data manager or programmer starts building the database. The design must first be committed to paper as a specification, the database design document. The database design is the specification of the program that stores the data, and it acts as the basis of testing.

The design document can be quite concise, but to provide the information necessary to build a test set, it must list all the database objects that are to be built or referenced. The design document is essential to the validation and testing of a study, but it can provide other benefits. It can be presented to staff in other groups, such as clinical and biostatistics, for review and comment on how the study will be performed.

Printouts of the structure and descriptions of the database objects are the technical documentation that shows how the design specifications were implemented. The combination of design specification and technical documentation provides the information needed to test the system before production use. Common forms of documentation include: screen prints; results of database queries; discrepancy reports from cleaning rules; and other standard reports.


Preparation for the manual entry of data into the database usually involves the creation of data entry screens and may require several additional setup or configuration steps for related applications. Screens may be set up for: single-pass entry; double-pass entry; entry through OCR (optical character recognition); and remote data entry.

PRINCIPLES OF DATA ENTRY SCREENS: Ideally, data entry screens would look like a picture of the CRF page, with the entry fields in exactly those places where they appear on the CRF. Users of entry screens may be: 1. Heads-down entry staff, who look only at the paper or image and not at the entry fields or keyboard. 2. Heads-up entry staff, who review the data as they enter it. 3. Data coordinators or managers, who are resolving discrepancies and making edits.


Some clinical data may never be collected on paper. Lab data, for example, may be stored directly in a central database. This kind of data is instead loaded or transferred to the central database directly. The preparation for receiving data through transfer or loading should address some process issues.

The issues might include: how to track the transfer; how to handle data that cannot be loaded; how and when to run cleaning checks; and what to do if edits, changes, or updates are required. All of these need to be addressed before setting up the database.
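Two of these issues, tracking the transfer and handling data that cannot be loaded, can be sketched as a loader that separates good rows from rejected ones and reports counts. The CSV layout, the required column names, and the function `load_lab_file` are all hypothetical; a real transfer specification would define them.

```python
import csv
import io
from typing import Dict, List, Tuple

REQUIRED = ("PT_ID", "TEST", "RESULT")  # hypothetical column names


def load_lab_file(text: str) -> Tuple[List[Dict[str, str]], List[Tuple[int, str]]]:
    """Split a CSV transfer into loadable rows and rejected rows with reasons."""
    loaded, rejected = [], []
    # Data rows start on line 2 of the file; line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        missing = [c for c in REQUIRED if not (row.get(c) or "").strip()]
        if missing:
            rejected.append((line_no, "missing " + ", ".join(missing)))
        else:
            loaded.append(row)
    return loaded, rejected


transfer = "PT_ID,TEST,RESULT\n1001,GLUC,5.4\n,GLUC,6.1\n1002,GLUC,\n"
loaded, rejected = load_lab_file(transfer)

# A simple tracking record for the transfer.
print(f"received={len(loaded) + len(rejected)} loaded={len(loaded)} rejected={len(rejected)}")
for line_no, reason in rejected:
    print(f"line {line_no}: {reason}")
```

Keeping the rejected rows with their line numbers and reasons gives the data manager something concrete to send back to the data provider.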

USING STANDARDS: Database design document. QUALITY ASSURANCE. SOPs FOR STUDY SET-UP: Database design; using standards; database creation and testing; loading or transferring data.