Unit IV: Data and Knowledge Management
OBJECTIVES:
6) explain the concepts of data warehousing and data mining and their use in
business;
Data are usually not collected in a way that makes them immediately useful to business people.
Imagine building a model palace from a pile of building blocks. You have a good idea of what you
want to build, but first you have to organize the blocks so that it is easy to find and select only the
blocks you need. Then you can combine them into substructures that eventually will be integrated
into your model. Similarly, data collected by organizations and knowledge gathered by their members
must be organized and stored so that useful information can be extracted from them flexibly.
We can roughly distinguish between two different approaches to maintaining data:
traditional file organization, which has no mechanism for tagging, retrieving, and manipulating
data, and the database approach, which does have that mechanism. To appreciate the benefits of
the database approach, you must keep in mind the inconvenience involved in accessing and
manipulating data in the traditional file approach: program/data dependency, high data redundancy,
and low data integrity.
Program/Data Dependency
There are different ways that data can be stored in files. In sequential file storage, the bits and bytes
that make up records are laid down on the storage medium one after another. In direct file
storage, the bits and bytes may be written to the medium in any order because they are accessed via
physical addresses. In indexed sequential file storage, the bits and bytes are organized
sequentially but can also be accessed directly by physical address. These storage methods are all
considered traditional, or flat file, types of organization.
For example, consider a human resource file in traditional file format. Suppose a programmer wants to
retrieve and print out only the last name and department number of each employee from this file.
The programmer must explicitly instruct the computer to first retrieve the data between position 10 and
position 20. Then he must instruct the computer to skip the positions up to position 35 and retrieve
the data between positions 36 and 39. He cannot instruct the computer to retrieve a piece of data by
its column or category name, because column and category names do not exist in this format. To create
the report, the programmer must insert the appropriate headings, “Last Name” and “Department”,
so that the reader of the output can understand what the data are. If the programmer miscounts
the positions, the printout may include output like “677Rapap” as a last name instead of “Rapaport”.
This illustrates one major problem with traditional file storage: the interdependency of programs and
data. The programmer must know how data are stored to use them. Perhaps most importantly, the
very fact that manipulating the data requires a programmer is probably the greatest disadvantage of
the file approach.
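The positional retrieval described above can be sketched in a few lines of Python. This is only an illustration: the record layout (a 9-character ID, the last name in positions 10 to 20, a first name up to position 35, and the department in positions 36 to 39) and the sample values are assumptions chosen to match the positions mentioned in the text.

```python
# Hypothetical fixed-width layout (1-based positions), assumed for illustration:
#   1-9 employee ID, 10-20 last name, 21-35 first name, 36-39 department number
def make_record(emp_id, last_name, first_name, department):
    """Pack one employee into a 39-character flat-file record."""
    return f"{emp_id:<9}{last_name:<11}{first_name:<15}{department:<4}"


def last_name_and_department(record):
    """Pull two fields out by position; the program must know the layout."""
    last_name = record[9:20].strip()     # positions 10-20
    department = record[35:39].strip()   # positions 36-39
    return last_name, department


record = make_record("028345677", "Rapaport", "Sharon", "4230")
correct = last_name_and_department(record)

# Miscounting the start position by three characters drags in part of the ID.
miscounted = record[6:14]
```

Here `correct` holds the intended last name and department, while `miscounted` holds a mix of ID digits and a truncated name. Nothing in the file itself can flag the mistake, because the file carries no field names.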
Moving to Databases
Continuing our analysis of the university records, would it not be easier if the four pieces of data
used by all three departments were stored only once and accessed and manipulated by the different
departments to produce whatever reports they needed? In fact, would it not be easier if all the data were
stored only once and accessible to everyone in the university system to do with as they please? The answer is
yes, and that is the basic idea behind the database approach.
In the database approach, we maintain and manipulate data about entities. An entity is any
object about which an organization chooses to collect data. It may be a student enrolled in a class, a
sales transaction in a business, or a part in an inventory. In the context of data management, entity
refers to all the occurrences sharing the same types of data. Therefore, it does not matter if we
maintain a record of one class or many classes; the entity is “class”. To understand how data are
organized in a database, you must first understand the data hierarchy. Consider a compilation of
information about students: their first names, last names, years of birth, SSNs, majors (department),
and campus phone numbers. The smallest piece of data is a character (a letter in a first or last name
or address, and so on). Several characters make up data in a field (also called a data item), such as last
name, first name, and the like. A field is one piece of information about an entity, such as the last
name or first name of a student. Several fields related to the same entity make up a record. A
collection of related records is called a file. Often, several related files must be kept together. A
collection of such files is referred to as a database. However, the features of a database can be
enjoyed by builders and users of databases even when a database consists of a single file.
To continue with the above example, let’s assume the university is now storing all nine
unique pieces of student data once in a single database, which is a union of the necessary data.
Once the fields are assigned names, including Last Name, First Name, SSN, and the like, the data in
each field carry a tag (a field name) and can be easily accessed by that field name, no matter where
the data are physically stored. One of the greatest strengths of databases is their promotion of
application-data independence. In other words, if an application is written to process data in a
database, the application designer only needs to know the names of the fields, not their physical
organization or their length.
While a database itself is a collection of several related files, the program used to build databases,
populate them with data, and manipulate the data is called a database management system
(DBMS). The files themselves are the database, but the DBMS does all the work: structuring files,
storing data, and linking records. If you wanted to access data from files that were stored in a
traditional file approach, the records would have to be organized in a very specific way, and you
would have to know exactly how many characters were designated for each type of data. A DBMS,
however, does much of this work (and a lot of other work) for you.
Queries
Data are accessed in a database by sending messages called queries, which request data from specific
fields and direct the computer to display the results on the monitor. Queries are also entered to
manipulate data. Usually, the same software that is used to construct and populate the database, that
is, the DBMS, is also used to present queries. Modern DBMS programs provide fairly user-friendly
means of querying a database.
Security
The use of databases raises security and privacy issues. The fact that data are stored only once in a
database does not mean that everyone with access to that database should have access to all the data in it.
Restricting access is easily handled by customizing access for different users and requiring users to enter
codes that limit access to certain fields or records. As a result, users have different views of the
database. The ability to limit users’ views to only specific columns or records gives the database
administrator (the person who plans the database and ensures that it is up and running) another
advantage: the ability to implement security measures. The measures are implemented once for the
database, rather than multiple times for different files. For instance, a human resource manager
has access to all fields of the employee file, the payroll personnel have access only to four fields of
the employee file, and a project manager has access only to the Name and Hours Worked fields. Views
may be limited to certain fields in a database, or certain records, or a combination thereof.
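One way to picture these per-role views is with SQL views, sketched here using Python's built-in `sqlite3` module. The table layout, the column names, and the choice of which four fields payroll sees are assumptions made for illustration. SQLite itself has no user accounts or access codes, so this sketch shows only how a view narrows the visible columns, not how a DBMS checks who is asking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT,
    department TEXT,
    salary INTEGER,
    hours_worked REAL,
    home_address TEXT
);
INSERT INTO employee VALUES (1, 'Weinrib', 'English', 52000, 38.5, '12 Elm St');

-- Payroll sees four fields; the project manager sees two.
CREATE VIEW payroll_view AS
    SELECT id, name, salary, hours_worked FROM employee;
CREATE VIEW project_manager_view AS
    SELECT name, hours_worked FROM employee;
""")


def columns(relation):
    """List the column names visible through a table or view."""
    cursor = conn.execute(f"SELECT * FROM {relation}")
    return [column[0] for column in cursor.description]


hr_columns = columns("employee")                  # all six fields
payroll_columns = columns("payroll_view")
manager_columns = columns("project_manager_view")
```

The data are stored once in `employee`; each view is just a saved query over it, so restricting a role to a view does not duplicate any records.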
The advantages of storing data in database files far outweigh those of storing them in flat files. While
there are some trade-offs, databases generally allow much greater flexibility, easier access by different
applications, easier maintenance of data currency and integrity, and savings in both cost and time, all
of which make them far superior to disparate flat files. Database advantages include the following:
1. Reduced data redundancy – although there may still be some redundancy in a database, it is
significantly less than in the traditional file approach. This streamlining saves storage space.
2. Application-data independence – writing an application to use data from a database is much
simpler than writing one to use data from flat files. To access data in a database, a program
can use field names and the names of the data sets in which the data exist, such as a list of
patient records in a hospital. This programming efficiency saves programming time and
allows users with limited knowledge to access data through queries or even to develop
simple applications.
3. Better control – since all data are concentrated in one place in a database, it is easier to control
access and maintain data, and it is easier to get an overall view of data about an entity.
In general, the opposites of these database advantages are the disadvantages of using traditional files
to store data. The traditional file approach creates data redundancy and application-data
dependence. It does not support as tight control over data currency, accuracy, and integrity as the
database approach, and it provides less flexibility in data maintenance. However, the traditional file
approach does have some advantages, including the following:
1. Efficiency – applications written for flat files run more efficiently than those written for
databases because they do not use the additional CPU time and memory space required by
preprogrammed functions that are part of the DBMS. Often, the easier a program is to use,
the more CPU time and memory space it needs.
3. Customization – the preprogrammed features of a DBMS allow only certain relationships
among data. However, using the more flexible procedural features of a third-generation or
fourth-generation language to build files and to access them allows tight tailoring of applications
to business needs, more so than using only the preprogrammed features of a DBMS.
The overwhelming advantages of databases raise the question: why use flat files at all? If you were
starting with a clean slate, you probably wouldn’t choose to use flat files. However, businesses have
accumulated a considerable amount of historical data in flat file format that they will be dealing with
for many years to come. Because considerable amounts of data in businesses are still stored in flat
files and accessed through applications that were written with third-generation languages such as
COBOL (which by their nature are designed to access flat files), it may be too costly to switch to
databases. However, almost all new data banks are developed and maintained with the aid of DBMSs.
DATABASE MODELS
A database model is the general logical structure in which records are stored within a database.
There are three different database models. They differ in the manner in which records are linked to
each other. These differences, in turn, dictate the manner in which a user can navigate the database
and retrieve desired records. Each model has advantages and disadvantages when compared with the
other two.
To understand the different models, let’s consider a database for storing university data: there are
records about colleges, departments, professors, and students. Logically, these four types of university
records are hierarchical, meaning that each category is a subcategory of the next higher level. The
highest level is college; each college has several departments; each department consists of several
professors; and each professor has several students. The hierarchical model follows the pattern of an
upside-down tree and is sometimes referred to as the tree model. Therefore, if the university chose to
follow a hierarchical model, the records would be stored in a way that reflects the relationships
among the various levels.
There are as many College records as there are colleges in the university. Each record contains
the appropriate values of the following data items: College Name, Dean’s Last Name, College
Address, and College Telephone Number. Each college record has the records of its departments
linked to it. For example, the College of Business is linked to the records of the Accounting,
Marketing, Finance, and other departments within the College of Business. The record of each
department is linked to the record of each of its professors. And the record of each professor is
linked to the record of every student he or she has.
How is a child record linked to a parent record? By adding pointer fields to the records.
Pointers maintain the address of the parent record, the first child of their own record, the previous
record in the file, and the next record in the file. The advantage of hierarchical databases is their
suitability for maintaining data on hierarchical environments. But hierarchical databases also have
several disadvantages. To retrieve a certain record, users must start the search at the root, which is the
set of records at the very top level, and then navigate the hierarchy until they find the desired record.
If, for some reason, a link is broken, the entire branch that was connected through that pointer to the
other records is lost. And because child records can have only one parent, hierarchical databases
require considerable data redundancy. For example, the records of students who take several classes
with several professors must be stored multiple times, each time as a child of another professor.
Thus, in our example, the entire records of students Khori and Williams must appear twice, once
under Professor AlNajjar and once under Professor Munro.
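The single-parent rule and the search from the root can be sketched as a small tree in Python. The class and function names are hypothetical, and placing both professors in the Accounting department is an assumption made only to keep the example short.

```python
class Node:
    """One record in the hierarchy, with a pointer to its single parent."""

    def __init__(self, kind, name, parent=None):
        self.kind = kind
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def find_from_root(root, kind, name):
    """Walk down from the root and collect every matching record."""
    matches, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.kind == kind and node.name == name:
            matches.append(node)
        stack.extend(node.children)
    return matches


college = Node("college", "Business")
accounting = Node("department", "Accounting", college)
alnajjar = Node("professor", "AlNajjar", accounting)
munro = Node("professor", "Munro", accounting)

# A child has only one parent, so each shared student is stored twice.
for professor in (alnajjar, munro):
    Node("student", "Khori", professor)
    Node("student", "Williams", professor)

khori_copies = find_from_root(college, "student", "Khori")
```

The search finds two separate Khori records, one under each professor, which is exactly the redundancy the hierarchical model forces.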
The reverse of the last disadvantage of the hierarchical model is the greatest advantage of the
network model: the ability to store a record only once in the entire database while creating links that
establish relationships with several records of another type of entity. Remember that in the
hierarchical model there was data redundancy because separate repetitive records for students Khori
and Williams had to be maintained in two different student files, one linked to Professor AlNajjar
and one to Professor Munro. The network model, on the other hand, would allow the same record to
be linked to more than one parent. The records of Williams and Khori are stored only once, in the file
containing the records of Professor Munro’s students, but they are also linked to Professor AlNajjar’s
record. When the database user lists Professor AlNajjar’s students, these two records are included in
the list. Similarly, the records of students Hans Bohr and Paul Harmon will appear in only one of the
professors’ student files, AlNajjar’s, and will be linked to both parents: AlNajjar and Munro. Now
imagine many such relationships. If you draw the relationships as lines connecting the records, you
will create a network of relationships. These networked links give the model its name. Unlike the
hierarchical model, the network model supports many-to-many (M:M) relationships.
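A minimal sketch of the same idea in Python: each record exists once, and links are plain object references in both directions. The `Record` class and `link` helper are hypothetical names introduced for illustration, not part of any particular network DBMS.

```python
class Record:
    """A record stored once; links point to related records of another type."""

    def __init__(self, name):
        self.name = name
        self.links = []


def link(professor, student):
    """Connect a professor and a student in both directions."""
    professor.links.append(student)
    student.links.append(professor)


alnajjar, munro = Record("AlNajjar"), Record("Munro")
khori, williams = Record("Khori"), Record("Williams")

# Each student record is created once and linked to both professors.
for professor in (alnajjar, munro):
    link(professor, khori)
    link(professor, williams)

alnajjar_students = [student.name for student in alnajjar.links]
khori_professors = sorted(professor.name for professor in khori.links)
```

Both professors reach the very same `khori` object, so a change to that record is seen from either side without any duplicate copy to keep in step.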
Network databases create significantly less data redundancy than hierarchical databases, but
they are complicated to build and difficult to maintain. While the user does not have to start a search
at the root, it is difficult to navigate in the database. The complex network of relationships creates
“spaghetti” that is hard to follow. For these reasons, the network structure is the least popular model.
Relational databases are easier to conceptualize and maintain than hierarchical and network ones. To
build a relational database, you only need to have a clear idea of the different entities. In our example,
the entities are college, department, professor, and student. A single table is built for each entity.
Remember that entity in our context refers to a record structure of all the occurrences of a subject.
Thus, when database designers think of “professor,” they know the professor table may include
records of many professors.
Retrieving a desired record is easy. To find a record of a certain professor, you need to
access the Professor table and make an inquiry. Maintenance is easy because the user does not have
to recall any relationships. Each table stands alone. To add a student record, the user accesses the
Student table. Similar actions take place to change or delete a record. The advantages of this model
make relational database management systems the most popular in the software market. Virtually all
DBMSs that are offered for microcomputers accommodate the relational model.
Keys
Primary Key
If there is more than one record with “Munro” in the L. Name field, you may not retrieve
the record you want. Depending on the software you use, you may receive the first one that
meets the condition, or a list of all the records with that value in the field. The only way to be
sure you are retrieving the desired record is to use a unique key (such as Social Security number).
A unique key is called a primary key. If your query specified that you wanted the record whose
ID# value is “3343,” the system would retrieve the record of Sarah Munro.
Usually, a table in a relational database must have a primary key, and most relational DBMSs
enforce this rule; if the designer does not designate a field as a key, the DBMS creates its own
serial number field as the primary key field for the table. Once the designer of the table
determines the primary key when constructing the records’ format, the DBMS will not allow a
user to enter two records with the same value in that column. Note that there may be situations
in which more than one field may be used as a primary key. Such is the case with motor vehicles,
because both the vehicle identification number (VIN) and the license plate number uniquely
identify a car. Thus, a database designer may establish either field as primary key to retrieve
records.
Many DBMSs will force you to designate a primary key in each table you construct. Usually,
the software requires that the primary key be the leftmost field in the record. By default, many
DBMSs automatically sort the records the user enters in ascending order of the primary key,
which is not the case in our College, Department, and Professor tables.
Some relational databases use composite keys, a combination of two or more fields that
together serve as a primary key. For example, the last name, first name, and department in a table
that holds professors’ records could together be considered a primary key. Unless we expect two
people with the same name to lecture for the same department, the combination will be a valid
primary key.
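The behavior of primary and composite keys can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table and column names are illustrative, and every row other than Sarah Munro with ID 3343 is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE professor (
    id         INTEGER PRIMARY KEY,   -- single-field primary key
    last_name  TEXT,
    first_name TEXT
);
CREATE TABLE lecturer (
    last_name  TEXT,
    first_name TEXT,
    department TEXT,
    PRIMARY KEY (last_name, first_name, department)  -- composite key
);
""")


def try_insert(sql, params):
    """Return True if the DBMS accepts the row, False if a key is violated."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


add_professor = "INSERT INTO professor VALUES (?, ?, ?)"
first = try_insert(add_professor, (3343, "Munro", "Sarah"))
other_munro = try_insert(add_professor, (5120, "Munro", "John"))  # new ID
duplicate_id = try_insert(add_professor, (3343, "Bohr", "Hans"))  # ID reused

by_name = conn.execute(
    "SELECT first_name FROM professor WHERE last_name = ? ORDER BY id", ("Munro",)
).fetchall()
by_id = conn.execute(
    "SELECT first_name FROM professor WHERE id = ?", (3343,)
).fetchall()

add_lecturer = "INSERT INTO lecturer VALUES (?, ?, ?)"
lecturer_first = try_insert(add_lecturer, ("Munro", "Sarah", "Accounting"))
lecturer_repeat = try_insert(add_lecturer, ("Munro", "Sarah", "Accounting"))
lecturer_other_dept = try_insert(add_lecturer, ("Munro", "Sarah", "Finance"))
```

Querying by last name returns both Munro rows, while querying by the primary key returns exactly one. In the `lecturer` table the same name is accepted again only when the department differs, because uniqueness is judged on the three fields together.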
Linking
To link records from one table with records in another table, the tables must have one field in
common (that is, one column in each table must contain the same type of data), and that field
must be a primary key field for one of the tables. We say that this repeated field is a primary key
in one table and a foreign key field in the other table. In our example, to create a report showing
the last name of every professor and, next to each name, a listing of all the students of that
professor, both the Professor and Student tables must contain the unique ID number of the
professor. In the Professor table, the ID number is the primary key. In the Student table, the
Professor ID number is a foreign key field. To generate a list of the students of the professor
whose ID number is 4467, the DBMS calls for the last name of the professor whose ID number
is 4467 (AlNajjar), and a list of names of all student records for which the Professor ID number
field contains 4467. The resulting table is called a join table. As long as each student’s record contains
a field with his or her professor’s ID number, it is possible to create the join table.
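The link through a shared ID field can be sketched with `sqlite3`. The professor IDs 4467 (AlNajjar) and 3343 (Munro) and the student surnames come from the text; the student ID numbers and the column names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE professor (
    id        INTEGER PRIMARY KEY,
    last_name TEXT
);
CREATE TABLE student (
    id           INTEGER PRIMARY KEY,
    last_name    TEXT,
    professor_id INTEGER REFERENCES professor(id)   -- foreign key
);
INSERT INTO professor VALUES (4467, 'AlNajjar');
INSERT INTO professor VALUES (3343, 'Munro');
INSERT INTO student VALUES (1, 'Bohr', 4467);
INSERT INTO student VALUES (2, 'Harmon', 4467);
INSERT INTO student VALUES (3, 'Khori', 3343);
""")

# Match each student row to the professor row with the same ID value.
join_table = conn.execute(
    """
    SELECT p.last_name, s.last_name
    FROM professor AS p
    JOIN student AS s ON s.professor_id = p.id
    WHERE p.id = ?
    ORDER BY s.last_name
    """,
    (4467,),
).fetchall()
```

The result pairs AlNajjar with each of the two students whose `professor_id` is 4467. It exists only as a query result; nothing new is stored in either table.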
As you can see, all database design requires careful forethought. The designer must include
fields for foreign keys from other tables so that join tables can be created in the future. The
inclusion of foreign keys may cause considerable data redundancy. This complexity has not
diminished the popularity of relational databases, however. Since the relationships between tables
are created as part of manipulating the tables, the relational model supports both one-to-many
(1:M) and many-to-many (M:M) relationships between records of different tables.
While the move from traditional file systems to databases was a leap forward in data
management efficiency, recent years have seen a new development that may lead to even greater
benefits: object-oriented databases. In object-oriented technology, an object consists of both data
and the procedures that manipulate the data. So, in addition to the attributes of an entity, it also
contains the relationships with other entities. The combined storage of both data and the
procedures that manipulate them is referred to as “encapsulation.” Thus, an object can be
“planted” in different data sets. The ability in object-oriented structures to automatically create a
new object by replicating all or some of the characteristics of a previously developed object
(called the parent object) is called inheritance.
All these capabilities make object-oriented DBMSs (OODBMSs) handy in computer-aided
design (CAD) because they can handle a wide range of data, such as graphics, voice, and text,
more easily than the other models.
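Encapsulation and inheritance are general object-oriented ideas, so they can be illustrated with ordinary Python classes. This sketch is not an OODBMS; the class names, attributes, and salary figure are hypothetical.

```python
class Employee:
    """Encapsulation: the data and the procedure that uses them live together."""

    def __init__(self, name, monthly_salary):
        self.name = name
        self.monthly_salary = monthly_salary

    def annual_salary(self):
        return 12 * self.monthly_salary


class Professor(Employee):
    """Inheritance: Professor reuses everything defined in its parent object."""

    def __init__(self, name, monthly_salary, department):
        super().__init__(name, monthly_salary)
        self.department = department


professor = Professor("Weinrib", 5000, "English")
yearly = professor.annual_salary()   # procedure inherited from Employee
```

`Professor` adds only the `department` attribute; the salary calculation comes from the parent without being rewritten.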
Entity-relationship Diagrams
Many business databases consist of multiple files with relationships among them. For example, a
hospital may use a database that has a file holding the records of all its physicians, another one
with all its nurses, another one with all the current patients, and so on. The administrative staff must
be able to create reports that link data from multiple files. Thus, databases must be carefully planned
to allow useful data manipulation and report generation. The planning tasks often involve the
creation of a conceptual blueprint of the database. This blueprint is called an entity-relationship
(ER) diagram. An ER diagram is a graphical representation of all entity relationships, and it is
often consulted to determine a problem with a query or to implement changes. Boxes are used to
identify entities, also referred to as objects. Lines are used to indicate a relationship between entities.
When crow’s-feet are pointing to an object, there may be many instances of that object. When a link
with crow’s-feet also includes a cross-bar, then all instances of the object on the side of the
crow’s-feet are linked with a single instance of the object on the side of the cross-bar. A second
cross-bar would denote “mandatory,” which means that the relationship must occur, such as between
a book title and author: a book title must have an author with which it is associated. A circle close to
the box denotes “optional.”
● The crow’s-feet on the Department end of the Department/School relationship indicate
that there are several departments in one school, indicating a one-to-many relationship
between school and department. In addition, the cross-bar at the School end of the
School/Department link indicates that a department belongs to only one school.
● A department has many professors, but a professor may belong to more than one
department; thus, the relationship between Professor and Department is many-to-many,
represented by the crow’s-feet at both ends of the link.
● A course is offered by a single department, indicated by the cross-bar at the Department
end of the Department/Course link.
● A professor may teach more than one student, and a student may have more than one
professor, thus the crow’s-feet at both the Professor and Student ends of the many-to-many
relationship between professor and student.
● However, the ring at the Student end indicates that a professor does not have to have
students at all. The ring means “optional”, and is there for cases in which professors do not
teach.
The designers must also detail the attributes of each object, which will determine the fields for
each record of that object. The attributes are listed in each object box, and the primary key attribute
is underlined. Usually, the primary key attribute appears at the top of the attribute list in the box. You
should be aware that database designers may use different notations. Therefore, before you review an
ER diagram, be sure you first understand what each symbol means.
When designers have a clear understanding of how a database should be structured to
accommodate the different data sets and the relationships among them, they select a DBMS to
construct the new database. While DBMSs have different interfaces, they share similar components.
These components allow the user to create sets of data about entities, define fields, organize record
structures, populate the database with data, and manipulate the data in the different fields, records,
and files. Simple databases can often be designed by lay users, but more complex databases usually
require the involvement of an experienced database designer. The components of a DBMS are the
data definition language (which enables the building of schemas and data dictionaries, described later)
and the data manipulation language (which allows the user to manipulate data).
Note the difference between a record structure and a record: a “record structure” is the general
structure of a record, defining the types of fields that make it up; a “record” is the actual data that
pertain to a specific instance. Therefore, for a file that holds the records of professors, we need to
design a record structure that describes which fields will appear in every actual data record (for
instance, ID number, last name, first name, department name, and telephone number). A record will
be the row of data describing a specific professor in the professors’ file (such as 123-33-76-85,
Weinrib, Janet, English, 209-8256). That is, a record contains the actual data values.
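The distinction maps neatly onto a class and its instances. In this Python sketch the dataclass plays the role of the record structure and one instance plays the role of a record; the field names are transliterations of the fields listed above, and treating every field as text is an assumption.

```python
from dataclasses import dataclass, fields


@dataclass
class ProfessorRecord:
    """Record structure: which fields every professor record will have."""
    id_number: str
    last_name: str
    first_name: str
    department_name: str
    telephone_number: str


# A record: the actual values for one specific professor.
record = ProfessorRecord("123-33-76-85", "Weinrib", "Janet", "English", "209-8256")

# The structure can be inspected without looking at any data.
structure = [field.name for field in fields(ProfessorRecord)]
```

`structure` lists the five field names and stays the same no matter how many records are created, whereas `record` holds the values for Janet Weinrib only.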
The Schema
When building a new database, users must first build a schema (from the Greek word for “plan”).
The schema describes the structure of the database being designed: the names and types of fields in
each record type and the general relationships among different sets of records or files. It includes a
description of the database’s structure, the names and sizes of fields, and details such as which field is
a primary key. The number of records is never specified because it may change, and the maximum
number of records is determined by the capacity of the storage medium.
Types of Data – fields can hold different types of data: numeric, alphanumeric, graphic, or
time related. Numeric fields hold numbers that can be manipulated by addition, multiplication,
averaging, and the like. Alphanumeric fields hold textual values: words, numerals, and special
symbols, which make up names, addresses, and identification numbers. Numerals entered in
alphanumeric fields, such as Social Security numbers or zip codes, cannot be manipulated as numbers.
In addition to numeric and alphanumeric fields, DBMSs offer date fields and graphic fields. While
date fields are displayed in the standard form of mm/dd/yy, mmm dd, yyyy, or some other format, the
dates are actually stored as the number of days elapsed from a certain base date, such as January 1,
1901.
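The idea of storing a date as a day count and formatting it only for display can be sketched with Python's `datetime` module. The base date follows the example in the text; actual DBMSs choose their own base dates, and the function names here are hypothetical.

```python
from datetime import date, timedelta

BASE_DATE = date(1901, 1, 1)   # base date taken from the text's example


def to_stored(day):
    """Convert a calendar date to the number of days since the base date."""
    return (day - BASE_DATE).days


def from_stored(days):
    """Convert a stored day count back to a calendar date."""
    return BASE_DATE + timedelta(days=days)


def display(days):
    """Format a stored day count as mm/dd/yy for the user."""
    return from_stored(days).strftime("%m/%d/%y")


stored = to_stored(date(1901, 1, 31))
shown = display(stored)
```

Because the stored value is a plain number, subtracting two stored dates gives the number of days between them directly, which is one practical reason for this representation.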
An increasing number of microcomputer-based DBMSs offer graphical fields in which
pictures and animation clips can be maintained. For example, employee pictures can be scanned and
entered into the Photo field in a human resource database. Many manufacturers encourage their
maintenance and customer-support personnel to click the picture or number of an assembly in a
database, which invokes a video clip showing movement of the assembly to help solve a mechanical
problem.
In a hierarchical DBMS, the schema includes the relationships between parent and child
record structures. Similarly, relationships must be detailed in the schema of a network database. The
schema of a relational database is simpler. It describes only the record structure of each table, namely,
the fields of which each record in that table will consist.
The builder of a new database must also indicate which fields will be used as primary keys.
Many DBMSs also allow a builder to indicate when a field is not unique, meaning the value in that
field may be the same for more than one record.
All the information supplied by the database developer when constructing the schema is maintained
in the data dictionary, which includes the file names, record names and types, field names and types,
and if applicable, the relationships among record types. In addition, the data dictionary contains the
notation of who is responsible for updating each part of the database and descriptions (such as titles)
or names of the people who are authorized to access the different parts of the database.
Data dictionaries are often referred to as metadata, meaning “data about the data.” They are
useful when trying to understand a database designed by someone else. Many PC DBMSs do not allow
the users direct access to the data dictionary. The user can view, and to a certain extent change, only
the schema. But some mainframe DBMSs provide users with a facility to add to the data dictionary
information such as the name of the database designer, the date the database was built, the purpose
of each field and its minimum and maximum values, the people who may make changes in the
schema, the people who are authorized to access which data in the database, and other valuable
information.
The data definition language (DDL) is what the designer uses to define and name the files, records,
and fields in a database before beginning to populate them. In most PC DBMSs, the user interface of
the DDL presents screens and prompts the designer to enter the appropriate parameters from a menu.
These interfaces are intuitive and allow a database to be created by someone who may have relatively
little development experience. In other DBMSs, the user must know the commands used in the DDL to
construct the schema.
It is unlikely that you will have to deal with DDLs directly, unless you choose a career in MIS.
If you use a modern relational DBMS, you will use a graphical user interface to create a schema, and
the DDL will be transparent to you. However, consider the statements used by the DDL of a DBMS called
NOMAD. The word MASTER indicates a record structure. ITEM indicates a field type, that is, a
column. The letter A indicates that the field is alphanumeric, and the number next to it specifies its
length. When a field is defined with 9s, it is numeric. The number of 9s determines the maximum
number of digits that the field will display.
Data manipulation language (DML) is the software that serves the user who is querying the
database. Some DBMSs require the user to type in commands. For example, consider a database
holding personnel data: ID, Last_Name, First_Name, Department, and Salary. Suppose you want a
list showing the last names, department number, and salaries of employees whose department
number is 4530 and whose salary is less than $25,000. It is the DML that allows such a query to be
placed and executed.
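The query described above can be expressed in SQL, the data manipulation language of most relational DBMSs, and run through Python's `sqlite3` module. This is a sketch, not the NOMAD syntax: the field names follow the text, while the three employee rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    ID         INTEGER PRIMARY KEY,
    Last_Name  TEXT,
    First_Name TEXT,
    Department INTEGER,
    Salary     INTEGER
);
INSERT INTO employee VALUES (1, 'Weinrib', 'Janet', 4530, 24000);
INSERT INTO employee VALUES (2, 'Munro',   'Sarah', 4530, 31000);
INSERT INTO employee VALUES (3, 'Harmon',  'Paul',  4120, 22000);
""")

# Last name, department, and salary for department 4530 earning under $25,000.
rows = conn.execute(
    """
    SELECT Last_Name, Department, Salary
    FROM employee
    WHERE Department = ? AND Salary < ?
    """,
    (4530, 25000),
).fetchall()
```

Only the first row satisfies both conditions. The query names fields and conditions but says nothing about where or how the records are stored, which is the point of a DML.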
In NOMAD, for instance, the user asks for the EMPLOYEE master, the record set that
holds employee records, and specifies the required output.
Some DBMSs hide the DML from the user. Instead of statements, the user expresses a
query by example (QBE). The user invokes the query module of the program, which displays the
fields available, and then places check marks in the fields to be listed and conditions in the proper
fields. Virtually all the popular PC relational DBMSs provide QBE dialog interfaces. Many DBMSs
are now part of 4GLs that are flexible enough to allow programmers to use the language both to develop
applications that retrieve and manipulate data from a database and also to perform tasks that have
nothing to do with the database, all in the same application.
RELATIONAL OPERATIONS
If desired, the user can save the newly created table. Often, the temporary table is needed only for ad
hoc reporting and is immediately discarded.
Data Manipulation
Project is the selection of certain columns from a table, such as the salaries of all the
employees. A query may specify a combination of selection and projection. In the preceding example,
the manager may require only the ID number, last name (project), and salary of employees whose
salaries are greater than $30,000 (select).
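A minimal sketch of select and project, written in plain Python so that the two operations stay visibly separate. The function names, field names, and sample records are assumptions made for illustration; a real DBMS performs these operations internally.

```python
# Hypothetical employee records; names and values are illustrative only.
employees = [
    {"ID": 101, "Last_Name": "Cruz", "First_Name": "Ana", "Salary": 28000},
    {"ID": 102, "Last_Name": "Reyes", "First_Name": "Ben", "Salary": 34000},
    {"ID": 103, "Last_Name": "Lim", "First_Name": "David", "Salary": 41000},
]


def select(table, predicate):
    """Select: keep only the rows (records) that satisfy a condition."""
    return [row for row in table if predicate(row)]


def project(table, columns):
    """Project: keep only the named columns (fields) of every row."""
    return [{col: row[col] for col in columns} for row in table]


# Select the employees earning more than 30,000, then project three columns.
high_paid = select(employees, lambda row: row["Salary"] > 30000)
report = project(high_paid, ["ID", "Last_Name", "Salary"])
for line in report:
    print(line)
```

The result is a new, temporary table with fewer rows (because of the select) and fewer columns (because of the project) than the original.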
One of the most useful manipulations of a relational database is the creation of a new table
from two or more other tables. As you may recall from our discussion of the relational model, the
joining of data from multiple tables is called a join. For example, a relational business database may
have four tables: salespeople, catalog, order log, and customer. The sales manager may wish to create
a report showing, for each salesperson, a list of all of the customers who purchased anything last
month, the items each customer purchased, and the total amount spent by each customer. The new
table is created from a relational operation that draws data from different tables.
In our university example, a report showing the name of every professor with his or her
students’ names is a join table. Note that some DBMSs will not allow the same field name to be used
more than once in a table (even a join table), so the second “L. Name” (which refers to students
rather than professors) may automatically be changed to “L. Name-1.” Also, in this example the user
indicated that for display purposes she does not desire the professor name and ID number repeated
for each student.
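The professor and student join can be sketched as follows, using SQLite through Python's sqlite3 module. The table layouts (professor, student, and the linking Prof_ID field) and the sample names are assumptions, since the original tables are not reproduced here. The two last-name columns are given distinct names in the query, which is one way to avoid the duplicate field name problem noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: each student row points to one professor via Prof_ID.
conn.executescript("""
CREATE TABLE professor (Prof_ID INTEGER PRIMARY KEY, L_Name TEXT);
CREATE TABLE student (Stud_ID INTEGER PRIMARY KEY, L_Name TEXT, Prof_ID INTEGER);
INSERT INTO professor VALUES (1, 'Garcia'), (2, 'Tan');
INSERT INTO student VALUES (10, 'Aquino', 1), (11, 'Bautista', 1), (12, 'Cortez', 2);
""")

# The join: match every student row to its professor row on Prof_ID.
rows = conn.execute(
    "SELECT p.L_Name AS Prof_L_Name, s.L_Name AS Stud_L_Name "
    "FROM professor AS p JOIN student AS s ON s.Prof_ID = p.Prof_ID "
    "ORDER BY p.L_Name, s.L_Name"
).fetchall()

# For display only, do not repeat the professor's name for each student.
previous = None
lines = []
for prof, stud in rows:
    shown = prof if prof != previous else ""
    lines.append(f"{shown:<10}{stud}")
    previous = prof
print("\n".join(lines))
```

The joined table itself still carries the professor's name on every row; suppressing the repeats is purely a matter of report layout.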
The join operation is a powerful manipulation that can create very useful reports for decision
making. A join table is created “on the fly” as a result of a query and exists only for as long as the user
wishes to view it or to create a paper report from it. Design features allow the user to change the field
headings (although the field names are kept the same in the internal table), and to add graphics and
text to the report. However, the new table may be saved at any time. The DBMS then treats it like any
other table.
Statements like this can be used for ad hoc queries or integrated in a program that is saved
for repeated use. Commands for updating the database are also easy to remember: INSERT,
DELETE, and UPDATE.
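The three update commands can be tried on a small table. The sketch below assumes the SQL dialect of SQLite, reached through Python's sqlite3 module; the employee table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (ID INTEGER PRIMARY KEY, Last_Name TEXT, Salary INTEGER)"
)

# INSERT adds new records.
conn.execute(
    "INSERT INTO employee (ID, Last_Name, Salary) VALUES (?, ?, ?)",
    (1, "Cruz", 22000),
)
conn.execute(
    "INSERT INTO employee (ID, Last_Name, Salary) VALUES (?, ?, ?)",
    (2, "Lim", 24500),
)

# UPDATE changes field values in the records that match the condition.
conn.execute("UPDATE employee SET Salary = ? WHERE ID = ?", (26000, 1))

# DELETE removes the records that match the condition.
conn.execute("DELETE FROM employee WHERE ID = ?", (2,))

remaining = conn.execute("SELECT ID, Last_Name, Salary FROM employee").fetchall()
print(remaining)
```

After the three statements run, one record remains, and it carries the updated salary.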
There are numerous database management packages for mainframes and PCs. One of the oldest
mainframe DBMSs is Information Management System (IMS), a hierarchical DBMS developed by
IBM. Others in the mainframe arena include Data Base 2 (DB2) and FOCUS. DB2 was originally
developed by IBM in 1982 and has been continuously improved. Information Builders International
has successfully marketed FOCUS, a 4GL and hierarchical DBMS. Some vendors have created PC
versions of their mainframe packages, such as PC FOCUS and PC NOMAD. Virtually all of these
DBMSs now have a GUI, and hence have the word “Visual” as part of their name. Other popular
DBMSs include Visual dBASE, which started as a mere file manager in its early versions. Microsoft
Access, Paradox, Oracle, and Ingres II are also widely used by organizations and individuals. All
include a 4GL for development of a database application. Since so many databases are now accessed
through networks, many of these DBMSs include software tools for development of capabilities for
networked databases and operations on the Web.
DATABASE ARCHITECTURE
Database architecture refers to both the physical and logical layouts of databases in an organization.
In the past, most organizations’ databases – data and programs alike – were centrally located on
mainframes and accessed from remote locations throughout the company from dumb terminals.
There have been significant changes in database architecture as both databases and the programs
running them have moved from mainframes to PCs and from a centralized to a distributed model.
Distributed Databases
The database administrator (DBA) can either replicate the database so there are exact
copies in many locations, or fragment it, so that different parts of the database are maintained on
different machines. Replication of the database means that a full copy of the entire database is
stored at all the sites that need access to it. This approach is expensive and not conducive to data
integrity, because all the updates must be performed at all the sites, and the chance of errors
occurring due to delayed updates and copying errors is high.
Many organizations have opted for the other alternative: in a fragmented database, different
parts of the database are stored in the locations where they are accessed most often, but they
continue to be fully accessible to others through telecommunications. Together, all the parts make up
the database. The result is just one copy of the database, distributed among the various sites by way
of communication lines. Applications’ use of remote fragments of the database is transparent to the
users. The users do not know, and need not bother to know, which part of the database resides
locally at their site and which is processed remotely. One advantage of a fragmented database is
lower communications costs. With only one copy of the database, another advantage is better data
integrity. Many experts refer to fragmented databases as distributed databases. Note that the
telecommunications lines through which data are accessed do not differ in these two approaches.
Nowadays, many multisite companies enable employees and business partners to access databases
through intranets and extranets using Web software.
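The idea of a fragmented database can be illustrated with a toy sketch in Python. It is not how any particular DBMS implements distribution: the site names, record numbers, and the lookup function are all assumptions. Each record exists in exactly one fragment, and a lookup tries the local fragment before turning to the remote ones.

```python
# Each site holds only the fragment it uses most often (hypothetical data).
fragments = {
    "Manila": {1001: "Customer A", 1002: "Customer B"},
    "Cebu": {2001: "Customer C"},
}


def lookup(record_id, local_site):
    """Return (record, source): check the local fragment first, then remote ones."""
    if record_id in fragments[local_site]:
        return fragments[local_site][record_id], "local"
    for site, fragment in fragments.items():
        if site != local_site and record_id in fragment:
            return fragment[record_id], "remote"
    raise KeyError(record_id)


print(lookup(1001, "Manila"))  # found in the local fragment
print(lookup(2001, "Manila"))  # fetched from the Cebu fragment
```

The second return value is included only so the sketch can show where a record came from. In a real application that detail would be hidden, which is what makes the use of remote fragments transparent to the user.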
As we mentioned earlier, some organizations store their databases and the applications that
run them on mainframes or on minicomputers accessible remotely from dumb terminals. Others
distribute their databases but leave the processing centralized. Some experts refer to these
arrangements as shared resources architecture. The central resource is used by remote terminals and PCs
not only for the data in its databases, but also for the applications that process the data.
To use a human analogy, thoughts are processed throughout an office, not just in the mind of the
boss. And thoughts are communicated as requirements of the collective process. In a client/server
network, users may have much computing power at their local PC, where they can process data,
produce information, and then decide what to save on the server and what to save locally in their
own computers. This additional computing power is the reason that many experts say the
client/server architecture empowers employees; it gives them more independence and the ability to
make their own decisions regarding information.
Organizations are spending a significant portion of their IS budgets moving to client/server
architecture. Since the early 1990s, client/server budgets of U.S. corporations have increased at a
significantly higher rate than total IS budgets. The trend of devoting an increasing proportion of the
IS budget to client/server systems is expected to continue in the foreseeable future.
WEB DATABASE
The Internet and its user-friendly application, the Web, would be practically useless if people
could not access databases online. The premise of the Web is that people can not only browse
appealing Web pages but also search for and find information in databases. When a shopper accesses
an online store, he or she can look for information about any of thousands or hundreds of thousands
of items offered for sale. For example, when you access the site of CDNow, you can receive online
information (such as an image of a CD’s cover, its popularity ranking, price, and shipping time) for
any of a half million music CDs. If you access IBM’s site, you can retrieve information on each of
thousands of products and services or select an article from a huge electronic library. In
business-to-business e-commerce, wholesalers make their catalogs and special prices available to
retailers online. Applications at auction sites receive inquiries by type of item, color, date and other
attributes and identify records of matching items, which often include pictures and detailed
descriptions. Behind each of these sites is a database. The only way for organizations to conduct
these web-based businesses is to give people outside the organizations access to their databases. In
other words, the organizations must link their databases to the Internet.
● Libraries – of books, articles, CDs, and movie clips. These sites also often include a local
search engine that allows a user to search for the key words in a title, author name, or an entire
article. University faculty, staff and students often have access to such large databases
through their schools. Most of these databases are not owned by the school but are operated
by organizations that specialize in library databases such as ABI/Inform and UMI.
● Directories – which can include names, addresses, telephone numbers and email addresses.
For instance, professional associations can provide members with access to membership
lists.
● Client lists and profiles – usually, individual users have access to these databases only for the
purpose of inserting or updating their own records. A registered user name and password are
usually required to gain access to these databases. For example, ValuPage, a Web site that
provides supermarket coupons online, collects data on shoppers. To receive periodic email
messages with coupons that you can print out and use for supermarket discounts, you must
first enter personal data, including your address, email address and shopping preferences.
The data are sold for profit to other organizations.
Organizations must understand, however, that once a computer is linked to a public network, there is
a risk of unauthorized access to it and any other computer that is linked to it. Thus security measures
are critical to prevent unauthorized access. People often gain unauthorized access to deface Web
pages or even destroy data in databases. To screen and block access, a special type of software called
a firewall is used on the server.
DATA WAREHOUSING
The great majority of data collections in business are used for daily transactions and operations:
records of customers and their purchases and information on employees, patients, and other parties
for monitoring, collection, payment, and other business or legal purposes. However, many
organizations have found that if they archive transaction data, they can use them for important
management decisions, such as researching market trends or tracking down fraud. The accumulated
data are like a huge heap of dirt in which precious gems are hidden. If the data are organized well,
and if proper tools are used to analyze the data, those gems may be found. Uncovering the gems is
the purpose of data warehousing.
A data warehouse is a huge collection of data that support management decision making. It
maintains snapshots of business conditions at predetermined points in time, such as the end of each
business day or the first of every month. A data warehouse is a large – usually relational – database.
The purpose of a data warehouse is to let managers produce reports or analyze large amounts of
archival data and make decisions. Data warehousing experts must be familiar with the types of
business analyses that will be done with the data. They also have to design the data warehouse tables
to be flexible enough for modification in years to come, when business activities change or when
different information must be extracted.
Once an organization has ensured that it has adequate hardware and software, it can begin building
the data warehouse. Several phases are involved in building a data warehouse from transactional data:
chiefly, the extraction, cleansing, and loading phases. In the extraction phase, the builders create the
files from transactional databases and save them on the server that will hold the data warehouse.
In the cleansing phase, the builders modify the data into a form that allows insertion into
the data warehouse. For example, they ascertain whether the data contain any spelling errors, and if
there are any, they fix them. They make sure that all data are consistent. For instance, Pennsylvania
may be denoted as Pa., Pa, Penna, or Pennsylvania. Only one form would be used in a warehouse.
Warehouse builders ensure that all addresses follow the same form, using upper- or lowercase letters
consistently and defining fields uniformly (such as one field for the entire street address and a
separate field for zip codes). All data that express the same type of quantities are “cleansed” to use
the same measurement units.
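The cleansing rules just described can be sketched in a few lines of Python. The record layout, the choice of “Pennsylvania” as the single accepted form, and the choice of kilograms as the common unit are assumptions made for illustration; each organization defines its own rules.

```python
# Every known spelling of the state maps to the one form kept in the warehouse.
STATE_VARIANTS = {
    "pa": "Pennsylvania",
    "pa.": "Pennsylvania",
    "penna": "Pennsylvania",
    "pennsylvania": "Pennsylvania",
}


def clean_state(value):
    """Return the single accepted spelling, or the trimmed input if unknown."""
    key = value.strip().lower()
    return STATE_VARIANTS.get(key, value.strip())


def cleanse(record):
    """Standardize the state name and express every weight in kilograms."""
    weight = record["weight"]
    if record["unit"] == "lb":
        weight = round(weight * 0.45359237, 2)  # pounds to kilograms
    return {"state": clean_state(record["state"]), "weight_kg": weight}


# Hypothetical records extracted from two transactional systems.
raw = [
    {"state": "Pa.", "weight": 10.0, "unit": "lb"},
    {"state": " Penna ", "weight": 4.0, "unit": "kg"},
]
cleaned = [cleanse(record) for record in raw]
for record in cleaned:
    print(record)
```

After cleansing, both records use the same state spelling and the same unit, so they can be grouped and compared reliably.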
In the loading phase, the builders transfer the cleansed files to the database that will serve
as the data warehouse. They then compare the data in the data warehouse with the original data
from the transactional database to ascertain completeness. They document the data for users, so the
users know what they can find and analyze in the data warehouse. The new data warehouse is then
ready for use. It is a single source for all the data required for analysis, is accessible to more users
than the transactional databases (whose access is limited to only those who record transactions), and
provides a “one-stop shopping” place for data. In fact, it is not unusual for an organization to have
one very large table of data with numerous fields.
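A minimal sketch of the loading phase and its completeness check, using two in-memory SQLite databases to stand in for the transactional database and the data warehouse. The table names, the sample rows, and the comparison of row counts and totals are assumptions; real warehouse builders may apply more thorough checks.

```python
import sqlite3

# Stand-in for the transactional database (hypothetical sales records).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount INTEGER)")
source.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 120), (2, 75), (3, 310)])

# Stand-in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_history (sale_id INTEGER PRIMARY KEY, amount INTEGER)"
)

# Extract the rows and load them into the warehouse table.
extracted = source.execute("SELECT sale_id, amount FROM sales").fetchall()
warehouse.executemany("INSERT INTO sales_history VALUES (?, ?)", extracted)


def summary(conn, table):
    """Return (row count, total amount) for a table, used to compare both sides."""
    return conn.execute(f"SELECT COUNT(*), SUM(amount) FROM {table}").fetchone()


# Completeness check: both sides must report the same count and the same total.
complete = summary(source, "sales") == summary(warehouse, "sales_history")
print("Load complete:", complete)
```

Comparing simple totals on both sides is a quick way to detect rows that were lost or duplicated during the transfer.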
DATA MINING
One of the main purposes of maintaining a data warehouse is to be able to “mine” it for useful
information. Data mining is the process of selecting, exploring, and modeling large amounts of data
to discover previously unknown relationships. Data mining software searches through large amounts
of data for meaningful patterns of information. Data mining is most often used by marketing
managers, who are constantly analyzing purchasing patterns, so that potential buyers can be targeted
more efficiently through special sales, product displays, or direct mail and email campaigns. Data
mining is an especially powerful tool in an environment in which businesses are shifting from
mass-marketing a product to targeting the individual consumer with a variety of products that are
likely to satisfy him or her. Some observers call this approach “marketing to one”.
However, data mining is also used in banking, where it is employed to find profitable customers
and patterns of fraud. For example, when Bank of America (BofA) looked for new approaches to
retain customers, it used data mining techniques. It merged various behavior patterns into finely
tuned customer profiles. The data were clustered into smaller groups of individuals who were using
banking services that didn’t best support their activities. Bank employees contacted these customers
and offered advice on services that would serve them better. The result was greater customer loyalty
(measured in fewer accounts closed and fewer moves to other banks). The people who were
contacted thought that the bank was trying to take good care of their money.
KNOWLEDGE MANAGEMENT
Samuel Johnson, the author of one of the most influential early English dictionaries, said that one type of knowledge is what
we know about a subject and the other type is knowing where to find information about the subject. The
purpose of knowledge management is mainly to gain the second type of knowledge. Knowledge
management is the combination of activities involved in gathering, organizing, sharing, analyzing, and
disseminating knowledge to improve an organization’s performance.
Knowledge is usually perceived as “know-how,” which is usually accumulated through
experience combined with accumulating certain information or, at least, knowing where information
can be found. Much knowledge is kept in people’s minds, on paper notes, on discussion transcripts,
and in other places that are not readily accessible to a company’s employees. Therefore, knowledge
management is a great challenge.