
Data Structures and CAATTs for Data Extraction


Data Structures
Two fundamental components:
• Organization: the way records are physically arranged on the secondary storage device
• Access method: the technique used to locate records and to navigate through the database or file

[Figure: INDEX file and DATA file. Access, non-index methods: hashing, pointers. Access, index methods: sequential, random. Data organization: sequential, random. Also labeled: ISAM.]

File Processing Operations
1. Retrieve a record by key
2. Insert a record
3. Update a record
4. Read a file [slide label: Individual Records]
5. Find next record
6. Scan a file
7. Delete a record

Data Structures
• Flat file structures
  • Sequential structure
    • All records in contiguous storage spaces in specified sequence (key field)
    • Sequential files are simple and easy to process
    • Application reads from the beginning, in sequence
    • Inefficient method if only a small portion of the file is being processed
    • Does not permit accessing a record directly
    • Efficient: 4, 5 – sometimes 3
    • Inefficient: 1, 2, 6, 7 – usually 3

Data Structures
• Flat file structures
  • Indexed structure
    • In addition to the data file, there is a separate index file
    • The index contains the physical address in the data file of each indexed record

Data Structures
• Flat file structures
  • Indexed random file [Figure 8-2]
    • Records are created without regard to physical proximity to other related records
    • Physical organization of the index file itself may be sequential or random
    • Random indexes are easier to maintain, sequential more difficult
    • Advantage over sequential: rapid searches
    • Other advantages: processing individual records, efficient usage of disk storage
    • Efficient: 1, 2, 3, 7
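
The indexed random idea can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory byte buffer stands in for the data file, a dictionary stands in for the index file, records are fixed length, and the names RECORD_SIZE, write_records, and fetch are hypothetical (none of them come from the slides).

```python
import io

RECORD_SIZE = 16  # assumed fixed record length in bytes (illustrative)

def write_records(records):
    """Write records in arrival order (no key ordering) and build a separate index."""
    data_file = io.BytesIO()   # stands in for the data file on disk
    index = {}                 # stands in for the index file: key -> byte offset
    for key, name in records:
        index[key] = data_file.tell()
        # 4-digit key followed by a 12-character name = 16 bytes per record
        data_file.write(f"{key:04d}{name:<12}".encode("ascii"))
    return data_file, index

def fetch(data_file, index, key):
    """Retrieve one record directly via its address in the index (operation 1)."""
    data_file.seek(index[key])
    raw = data_file.read(RECORD_SIZE).decode("ascii")
    return int(raw[:4]), raw[4:].rstrip()

data_file, index = write_records([(305, "Bolt"), (17, "Washer"), (212, "Nut")])
print(fetch(data_file, index, 17))
```

The lookup jumps straight to the stored offset instead of reading the file from the beginning, which is why single-record operations are the efficient ones for this structure.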


Data Structures
• Flat file structures
  • Virtual Storage Access Method (VSAM)
    • Large files, routine batch processing
    • Moderate degree of individual record processing
    • Used for files across cylinders
    • Uses a number of indexes, with summarized content
    • Access time for a single record is slower than Indexed Sequential or Indexed Random
    • Disadvantage: does not perform record insertions efficiently – requires physical relocation of all records beyond that point – SOS
    • Has 3 physical components: indexes, prime data storage area, overflow area [Figure 8-4]
    • Might have to search index, prime data area, and overflow area – slowing down access time
    • Integrating overflow records into prime data area, then reconstructing indexes reorganizes ISAM files
    • Very Efficient: 4, 5, 6
    • Moderately Efficient: 1, 3
    • Inefficient: 2, 7

[Figure: Evolution of organization/access methods, 1960 to 1990, from legacy systems to DBMS etc.]

[Figure: Efficiency of the structures (efficient vs. inefficient) for accessing single records vs. accessing entire files]

Hashing Structure
• Employs an algorithm to convert the primary key into a physical record storage address
• No separate index necessary
• Advantage: access speed
• Disadvantage
  • Inefficient use of storage
  • Different keys may create the same address
• Efficient: 1, 2, 3, 6
• Inefficient: 4, 5, 7

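The hashing idea above can be illustrated with a small sketch. The slides do not name a specific algorithm, so this assumes division-remainder hashing with linear probing to resolve collisions; NUM_SLOTS, hash_address, store, and retrieve are hypothetical names chosen for the example.

```python
NUM_SLOTS = 7  # assumed number of storage locations (illustrative)

def hash_address(key: int) -> int:
    """Convert a primary key to a storage address by division-remainder hashing."""
    return key % NUM_SLOTS

def store(slots, key, value):
    """Place a record at its hashed address; on a collision, probe the next slot."""
    addr = hash_address(key)
    for step in range(NUM_SLOTS):
        probe = (addr + step) % NUM_SLOTS
        if slots[probe] is None or slots[probe][0] == key:
            slots[probe] = (key, value)
            return probe
    raise RuntimeError("storage area is full")

def retrieve(slots, key):
    """Recompute the address from the key; no separate index is consulted."""
    addr = hash_address(key)
    for step in range(NUM_SLOTS):
        probe = (addr + step) % NUM_SLOTS
        if slots[probe] is None:
            return None
        if slots[probe][0] == key:
            return slots[probe][1]
    return None

slots = [None] * NUM_SLOTS
a = store(slots, 15, "record A")   # 15 % 7 == 1
b = store(slots, 22, "record B")   # 22 % 7 == 1 as well: a collision
```

Both disadvantages show up here: keys 15 and 22 hash to the same address, and most of the seven slots stay empty.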
Pointer Structure
• Stores the address (pointer) of a related record in a field with each data record
• Records are stored randomly
• Pointers provide connections between records
• Pointers may also link records between files
• Types of pointers:
  • Physical address: actual disk storage location
    • Advantage: access speed
    • Disadvantage: if the related record moves, the pointer must be changed; without a logical reference, a pointer could be lost, causing the referenced record to be lost
  • Relative address: relative position in the file (e.g., the 135th record)
    • Must be manipulated to convert to a physical address
  • Logical address: primary key of the related record
    • Key value is converted by hashing to a physical address
• Efficient: 1, 2, 3, 6
• Inefficient: 4, 5, 7

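A rough sketch of two ideas from the pointer structure: converting a relative address to a physical one, and following pointer fields from record to record. The record length, starting address, and invoice keys are assumptions made up for the example, not values from the slides.

```python
RECORD_LENGTH = 100   # assumed bytes per record (illustrative)
FILE_START = 4096     # assumed physical address of the first record (illustrative)

def relative_to_physical(position: int) -> int:
    """Convert a relative address (1 = first record) to a physical byte address."""
    return FILE_START + (position - 1) * RECORD_LENGTH

# Records stored in no particular order; each carries a pointer field holding
# the relative address of the next related record (None ends the chain).
records = {
    1: {"key": "INV-03", "next": None},
    2: {"key": "INV-01", "next": 3},
    3: {"key": "INV-02", "next": 1},
}

def follow_chain(start: int):
    """Walk the pointer chain from a starting relative address."""
    keys, position = [], start
    while position is not None:
        keys.append(records[position]["key"])
        position = records[position]["next"]
    return keys

print(relative_to_physical(135))
print(follow_chain(2))
```

If a record were moved, every pointer holding its old address would have to be rewritten, which is the maintenance cost noted for physical addresses.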
Database Conceptual Models
• Refers to the particular method used to organize
records in a database.
• a.k.a. “logical data structures”
• Objective: develop the database efficiently so that
data can be accessed quickly and easily.
• There are three main models:
• hierarchical (tree structure)
• network
• relational
• Most existing databases are relational. Some legacy
systems use hierarchical or network databases.

The Relational Model
• The relational model portrays data in the form
of two dimensional ‘tables’.
• Its strength is the ease with which tables may be
linked to one another.
• a major weakness of hierarchical and network
databases
• Relational model is based on the relational
algebra functions of restrict, project, and join.

The Relational Algebra Functions
Restrict, Project, and Join
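
As an illustration of the three functions, here is a minimal sketch over small in-memory tables. The customer and invoice data, and the column names, are invented for the example; a real DBMS would express the same operations in its query language.

```python
customers = [
    {"cust_id": 1, "name": "Acme", "city": "Austin"},
    {"cust_id": 2, "name": "Birch", "city": "Boston"},
]
invoices = [
    {"inv_id": 10, "cust_id": 1, "amount": 500},
    {"inv_id": 11, "cust_id": 2, "amount": 75},
    {"inv_id": 12, "cust_id": 1, "amount": 120},
]

def restrict(table, predicate):
    """Keep only the rows that satisfy a condition."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Keep only the named columns (duplicate rows are not removed in this sketch)."""
    return [{c: row[c] for c in columns} for row in table]

def join(left, right, key):
    """Combine rows from two tables that share the same value of `key`."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

large = restrict(invoices, lambda r: r["amount"] > 100)
report = project(join(customers, large, "cust_id"), ["name", "inv_id", "amount"])
```

Restrict selects rows, project selects columns, and join builds a new table from two tables that share a key value.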

Associations and Cardinality
• Association
• Represented by a line connecting two entities
• Described by a verb, such as ships, requests, or receives
• Cardinality – the degree of association between
two entities
• The number of possible occurrences in one table that
are associated with a single occurrence in a related table
• Used to determine primary keys and foreign keys

Examples of Entity Associations

Properly Designed Relational Tables
• Each row in the table must be unique in at least one
attribute, which is the primary key.
• Tables are linked by embedding the primary key into the
related table as a foreign key.
• The attribute values in any column must all be of the
same class or data type.
• Each column in a given table must be uniquely
named.
• Tables must conform to the rules of normalization,
i.e., free from structural dependencies or anomalies.
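
The primary key and foreign key rules can be tried out with Python's built-in sqlite3 module. This is a sketch with hypothetical customer and invoice tables; note that SQLite enforces foreign keys only after the `PRAGMA foreign_keys = ON` statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE invoice ("
    " inv_id INTEGER PRIMARY KEY,"
    " cust_id INTEGER NOT NULL REFERENCES customer(cust_id),"  # embedded foreign key
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.execute("INSERT INTO invoice VALUES (10, 1, 500.0)")

def try_insert(sql):
    """Return True if the insert is accepted, False if a key rule rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# A second row with the same primary key is rejected.
duplicate_key_ok = try_insert("INSERT INTO customer VALUES (1, 'Duplicate')")
# An invoice pointing at a customer that does not exist is rejected.
orphan_ok = try_insert("INSERT INTO invoice VALUES (11, 99, 20.0)")
```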

Three Types of Anomalies
• Insertion Anomaly: A new item cannot be added
to the table until at least one entity uses a
particular attribute item.
• Deletion Anomaly: If an attribute item used by
only one entity is deleted, all information about
that attribute item is lost.
• Update Anomaly: A modification on an attribute
must be made in each of the rows in which the
attribute appears.
• Anomalies can be corrected by creating additional
relational tables.
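
The update and deletion anomalies can be reproduced on a small unnormalized table. The part and supplier values below are invented for illustration only.

```python
# One unnormalized table: supplier details are repeated on every part row.
inventory = [
    {"part": "P1", "supplier": "S22", "supplier_phone": "555-0100"},
    {"part": "P2", "supplier": "S22", "supplier_phone": "555-0100"},
    {"part": "P3", "supplier": "S27", "supplier_phone": "555-0199"},
]

def update_phone_one_row(table, part, phone):
    """Update anomaly: changing one row leaves other rows for that supplier stale."""
    for row in table:
        if row["part"] == part:
            row["supplier_phone"] = phone

def phones_for(table, supplier):
    """Collect every phone number recorded for a supplier."""
    return {row["supplier_phone"] for row in table if row["supplier"] == supplier}

def delete_part(table, part):
    """Return the table without the rows for the given part."""
    return [row for row in table if row["part"] != part]

update_phone_one_row(inventory, "P1", "555-0142")
conflicting = phones_for(inventory, "S22")      # two different numbers now

# Deletion anomaly: removing the only part supplied by S27 erases S27 itself.
after_delete = delete_part(inventory, "P3")
suppliers_left = {row["supplier"] for row in after_delete}

# Insertion anomaly: a new supplier cannot be recorded here until it
# supplies at least one part, because every row needs a part key.
```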

Advantages of Relational Tables
• Removes all three types of anomalies.
• Various items of interest (customers,
inventory, sales) are stored in separate
tables.
• Space is used efficiently.
• Very flexible – users can form ad hoc
relationships.

The Normalization Process
• A process which systematically splits unnormalized
complex tables into smaller tables that meet two
conditions:
• all nonkey (secondary) attributes in the table are
dependent on the primary key
• all nonkey attributes are independent of the other
nonkey attributes
• When unnormalized tables are split and reduced to
third normal form, they must then be linked
together by foreign keys.
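
A small sketch of the split described above, assuming each part has exactly one supplier. The table contents and the function name normalize are hypothetical; the point is that supplier attributes move to their own table and the part table keeps the supplier key as a foreign key.

```python
unnormalized = [
    {"part": "P1", "descr": "Bolt", "supplier": "S22", "supplier_name": "Northside Supply"},
    {"part": "P2", "descr": "Nut", "supplier": "S22", "supplier_name": "Northside Supply"},
    {"part": "P3", "descr": "Washer", "supplier": "S27", "supplier_name": "Delta Metals"},
]

def normalize(rows):
    """Split rows into a part table and a supplier table linked by a foreign key."""
    parts, suppliers = {}, {}
    for row in rows:
        # supplier_name depends on the supplier, not on the part key,
        # so it moves to its own table keyed by supplier.
        suppliers[row["supplier"]] = {"supplier_name": row["supplier_name"]}
        # The part table keeps only attributes that depend on the part key,
        # plus the supplier key embedded as a foreign key.
        parts[row["part"]] = {"descr": row["descr"], "supplier": row["supplier"]}
    return parts, suppliers

parts, suppliers = normalize(unnormalized)

# Following the foreign key recovers the original combined view.
rejoined = {p: suppliers[v["supplier"]]["supplier_name"] for p, v in parts.items()}
```

After the split, each supplier name is stored once, so changing it is a single update.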

Steps in the Normalization Process

Accountants and Data Normalization
• Update anomalies can generate conflicting and
obsolete database values.
• Insertion anomalies can result in unrecorded
transactions and incomplete audit trails.
• Deletion anomalies can cause the loss of accounting
records and the destruction of audit trails.
• Accountants should understand the data
normalization process and be able to determine
whether a database is properly normalized.

Six (6) Phases in Designing
Relational Databases
1. Identify entities
• identify the primary entities of the
organization
• construct a data model of their relationships
2. Construct a data model showing entity
associations
• determine the associations between entities
• model associations into an ER diagram

3. Add primary keys and attributes
• assign primary keys to all entities in the model
to uniquely identify records
• every attribute should appear in one or more
user views
4. Normalize and add foreign keys
• remove repeating groups, partial and transitive
dependencies
• assign foreign keys to be able to link tables

5. Construct the physical database
• create physical tables
• populate tables with data
6. Prepare the user views
• normalized tables should support all required
views of system users
• user views restrict users from having access
to unauthorized data
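
Phase 6 can be illustrated with a view in SQLite, using Python's built-in sqlite3 module. The employee table and the view name are invented for the example. SQLite has no user accounts, so this only shows how a view exposes a subset of columns; in a multi-user DBMS, access would be granted on the view rather than on the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL);
    INSERT INTO employee VALUES (1, 'Lee', 'Sales', 52000), (2, 'Ortiz', 'Audit', 61000);
    -- A user view that leaves out the salary column.
    CREATE VIEW employee_directory AS SELECT emp_id, name, dept FROM employee;
""")

cursor = conn.execute("SELECT * FROM employee_directory ORDER BY emp_id")
view_columns = [col[0] for col in cursor.description]
view_rows = cursor.fetchall()
print(view_columns, view_rows)
```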

Auditors and Data Normalization
• Database normalization is a technical matter that is usually the responsibility of systems professionals.
• The subject has implications for internal control that make it the concern of auditors also.
• Although most auditors will never be responsible for normalizing an organization’s databases, they should understand the process and be able to determine whether a table is properly normalized.
• In order to extract data from tables to perform audit procedures, the auditor first needs to know how the data are structured.

Embedded Audit Module (EAM)
• Identify important transactions live while they are being processed and extract them
• Examples
  • Errors
  • Fraud
  • Compliance
• SAS 109, SAS 94, SAS 99 / S-OX

Embedded Audit Module
• Disadvantages:
  • Operational efficiency: can decrease performance, especially if testing is extensive
  • Verifying EAM integrity: difficult in environments with a high level of program maintenance
• Status: increasing need, demand, and usage of COA/EAM/CA
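
A toy sketch of the EAM idea: an audit hook embedded in the host application's processing loop copies transactions that meet auditor-defined criteria into a separate audit file. The criteria used here (an amount threshold, and the same person preparing and approving), the class name, and the sample transactions are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class EmbeddedAuditModule:
    """Captures transactions that meet auditor-defined criteria as they are processed."""
    threshold: float                               # assumed materiality threshold
    audit_file: list = field(default_factory=list)

    def inspect(self, txn: dict) -> None:
        if txn["amount"] >= self.threshold or txn["approver"] == txn["preparer"]:
            self.audit_file.append(dict(txn))      # copy so later changes do not alter it

def process_transactions(txns, eam):
    """Host application loop: normal processing with the audit hook embedded in it."""
    posted_total = 0.0
    for txn in txns:
        eam.inspect(txn)          # extra work on every transaction (the efficiency cost)
        posted_total += txn["amount"]
    return posted_total

txns = [
    {"id": 1, "amount": 250.0, "preparer": "ann", "approver": "bob"},
    {"id": 2, "amount": 12000.0, "preparer": "ann", "approver": "bob"},
    {"id": 3, "amount": 90.0, "preparer": "cy", "approver": "cy"},
]
eam = EmbeddedAuditModule(threshold=10000.0)
total = process_transactions(txns, eam)
flagged_ids = [t["id"] for t in eam.audit_file]
```

Because the hook runs inside the production loop, every change to the host program is also a chance to alter or break the module, which is the integrity concern noted above.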
Generalized Audit Software (GAS)
• Brief history
• Most widely used CAATT
• Usages include:
  1) Footing and balancing entire files or selected data items (e.g., extending inventory)
  2) Selecting and reporting detail data
  3) Selecting stratified statistical samples from data files
  4) Formatting results into audit reports (auto work papers!)
  5) Printing confirmations
  6) Screening / filtering data
  7) Comparing multiple files for differences
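
Three of the usages above (footing with extension, stratified sampling, and file comparison) can be sketched in plain Python. This is not how any particular GAS product works; the function names, inventory data, and stratum boundary are assumptions for illustration.

```python
import random

inventory = [
    {"item": "A", "qty": 10, "unit_cost": 2.5},
    {"item": "B", "qty": 4, "unit_cost": 100.0},
    {"item": "C", "qty": 200, "unit_cost": 0.5},
    {"item": "D", "qty": 1, "unit_cost": 1200.0},
]
count_sheet = [{"item": "A"}, {"item": "B"}, {"item": "D"}, {"item": "E"}]

def foot(records):
    """Extend each item (qty x unit cost) and foot the file total."""
    return sum(r["qty"] * r["unit_cost"] for r in records)

def stratified_sample(records, boundary, n_per_stratum, seed=0):
    """Sample separately from items below and at/above a value boundary."""
    rng = random.Random(seed)   # fixed seed so the selection can be reproduced
    low = [r for r in records if r["qty"] * r["unit_cost"] < boundary]
    high = [r for r in records if r["qty"] * r["unit_cost"] >= boundary]
    return (rng.sample(low, min(n_per_stratum, len(low))),
            rng.sample(high, min(n_per_stratum, len(high))))

def compare_files(file_a, file_b, key):
    """Report key values that appear in one file but not the other."""
    keys_a = {r[key] for r in file_a}
    keys_b = {r[key] for r in file_b}
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)

total = foot(inventory)
low_sample, high_sample = stratified_sample(inventory, boundary=300, n_per_stratum=1)
only_in_ledger, only_in_count = compare_files(inventory, count_sheet, "item")
```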


Generalized Audit Software
• Popular because:
  1. GAS is easy to use and requires little computer background
  2. Many products are platform independent and work on mainframes and PCs
  3. Auditors can perform tests independently of IT staff
  4. GAS can be used to audit the data currently being stored in most file structures and formats

Generalized Audit Software
• Simple structures [Figure 8-27]
• Complex structures [Figures 8-28, 8-29]
• Auditing issues:
  • Auditor must sometimes rely on IT personnel to produce files/data
  • Risk that data integrity is compromised by extraction procedures
  • Auditors skilled in programming are better prepared to avoid these pitfalls

ACL
• ACL is a proprietary version of GAS
• Leader in the industry
• Designed as an auditor-friendly meta-language (i.e., contains commonly used auditor tests)
• Access to data is generally easy with the ODBC interface
