2 LifeCycle

CSE494/598 Principles of Information Engineering
Information Life Cycle
Lesson Objectives:
1. Describe the parts of the Information Life Cycle.
2. Explain the advantages and disadvantages of coding and compression.
3. Discuss considerations for information storage.
4. Describe common factors that must be addressed for proper information presentation.
5. Relate information analysis to information retrieval.
6. Classify characteristics of various generations of database management.
Advance Reading Material:

Read The Top 10 Data Mining Questions. It can be found at: http://www.datamining.com/top10
Information Life Cycle

[Figure: the information life cycle. Stages shown: Acquisition, Coding/Compression, Analysis/mining/processing, Storage, Re-engineering, Preservation, Retrieval, Presentation (Packaging/Visualization), Transport, Discard]
Information Science Life Cycle
1. Information Acquisition
Acquisition of business-related information in digital form
-Traditionally, record-based data, mostly in table form
-Now, multimedia data
Conversion to digital form for on-line processing
Overall organization for seamless integration
2. Coding & Compression

Coding of data to minimize its representation, reducing both the storage requirement and the bandwidth requirement in communication
Need different techniques for each type of media and even each type of object
Facsimile vs. aerial pictures vs. portrait
Technique must be fast, one-pass, adaptive and invertible, and must not impose unreasonable requirements on resources.
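As a minimal illustration of an invertible, one-pass coding step, the sketch below uses Python's standard zlib module. The sample record data is made up, and zlib is only a general-purpose stand-in: the slide's point is that real systems would pick a technique per media type.

```python
import zlib

# Hypothetical, highly repetitive record data (an assumption for illustration).
original = b"customer_id,balance\n" + b"000123,100.00\n" * 500

# Whole-buffer coding and decoding: the round trip must be exact (invertible).
compressed = zlib.compress(original, 6)
restored = zlib.decompress(compressed)

# One-pass, streaming coding: data is fed in chunks and never revisited.
stream = zlib.compressobj()
chunks = [stream.compress(part) for part in (original[:1000], original[1000:])]
chunks.append(stream.flush())
streamed = zlib.decompress(b"".join(chunks))

ratio = len(compressed) / len(original)
print(f"{len(original)} bytes -> {len(compressed)} bytes (ratio {ratio:.3f})")
```

The disadvantage side of the trade-off is visible here as well: the compressed bytes cannot be searched or edited without first spending CPU time to decode them.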
3. Analysis & mining

Raw numbers, words, images and sounds are not immediately useful; their contents must be analyzed and represented in machine-processable form:
Mining of databases for useful information
Extraction of contents from images and video
Conceptualization of text
Feature analysis of audio segments
4. Storage
Business data can be very large and heterogeneous with respect to all parameters
Appropriate storage techniques ensure proper management, location and distribution, and the flow of objects.
Among issues to be considered:
Data placement
What technology (medium) to use for storage
Distribution: local, remote, out-sourced
Speed of delivery
5. Re-engineering
Legacy systems make up most of the business data systems
Maintenance and modernization of these systems represent a large portion of IT efforts
Important decisions:
Maintenance
Replace & migrate
Modernize for co-existence
Is legacy code like Chernobyl?

Remember Chernobyl? The meltdown of the nuclear reactor.
Officials poured concrete over it and hoped that, someday, it would just go away!
Legacy code and Chernobyl: too messy to clean up but too dangerous to ignore!
Legacy code...
Theory: rebuild the legacy system from the ground up with
-a relational (or OO) database
-graphical user interfaces
-client/server architecture
Practice: expensive and risky, because of size, complexity and poor documentation.
Case study 1
700 clients
120,000,000 credit cards (mid-90s figure)
Over 14 terabytes of data
2 billion transactions per month
19 billion disk/tape I/O per month
Around 23 million transactions are processed from 8:00 pm to 2:00 am
Case study 2
22 million telephone customers
Zero downtime must be guaranteed
COBOL code: hundreds of millions of lines
Many terabytes of data owned by applications
-no sharing -> redundant storage
Regulatory change: rate of return to price cap
Reengineer 80% of the business process
Case study 2 (contd..)
Incremental migration into a client/server computing architecture
Began in the late 80s, still ongoing
Around 10,000 workstations, and growing
Biggest challenge: inability of the mainframe to participate in distributed C/S computing
-CICS unable to cooperate in a nested subtransaction
-Integrity?
About Legacy Systems

Large, with millions of lines of code
Over 10 years old
Mission-critical: 24-hour operation
Difficulty in supporting current/future business requirements
Account for 80-90% of the IT budget
Instinctive solution: Migrate!
Migration Strategies
Complete rewrite of legacy code
Many problems
Risky
Prone to failure
Incremental migration
Migrate the legacy system in place by small incremental steps
Control risk by choosing the increment size
One-step Migration Impediments

Business conditions never stand still
Specifications rarely exist
Undocumented dependencies
Management of large projects
-too hard, tend to bloat
Migration with live data
Analysis paralysis sets in
Fear of change
Incremental Migration
Incrementally analyze the legacy IS
Incrementally decompose
Incrementally design the target interfaces
Incrementally design the target applications
Incrementally design the target database
Incrementally install the target environment
Create and install the necessary gateways
Incrementally migrate the legacy database
Incrementally migrate the legacy applications
Incrementally migrate the legacy interfaces
Incrementally cut over to the target IS
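The role of a gateway during incremental migration can be sketched as follows. This is a toy model under stated assumptions: the `Gateway` class, the dictionary "databases", and the account keys are all hypothetical, and a real gateway would also have to handle transactions and schema translation.

```python
class Gateway:
    """Routes each request to the legacy or target store, hiding the split from callers."""

    def __init__(self, legacy, target):
        self.legacy = legacy      # dict standing in for the legacy database
        self.target = target      # dict standing in for the target database

    def read(self, key):
        # Migrated records live in the target; everything else is still legacy.
        if key in self.target:
            return self.target[key]
        return self.legacy[key]

    def write(self, key, value):
        # Write to whichever system currently owns the record; new records go to the target.
        if key in self.target or key not in self.legacy:
            self.target[key] = value
        else:
            self.legacy[key] = value

    def migrate(self, keys):
        """Move one increment of records; the increment size bounds the risk."""
        for key in keys:
            if key in self.legacy:
                self.target[key] = self.legacy.pop(key)


legacy_db = {f"acct{i}": i * 10 for i in range(6)}
gateway = Gateway(legacy_db, {})

all_keys = sorted(legacy_db)
increment = 2
for start in range(0, len(all_keys), increment):
    gateway.migrate(all_keys[start:start + increment])
    # Callers see the same answer before, during, and after each step.
    assert gateway.read("acct3") == 30
```

Choosing `increment` is the risk control knob: a failed step affects only the records in that step, and the legacy store keeps serving everything not yet moved.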
A Comparison
Aspect        One step                        Incremental
Suited for    Non-decomposable                Decomposable
Risk          Huge                            Controllable
Failure       Entire project                  Step at a time
Benefits      Immediate                       Incremental
Outlook       Unpredictable until deadline    Conservatively optimistic
6. Preservation
Similar to physical security measures for protecting buildings, cash and other tangible assets, information must be protected while recorded, processed, stored, shared, transmitted, or retrieved.
Must protect against loss, alteration, and disclosure
Must prevent unauthorized access and unauthorized use of
-Computer systems
-Networks
-Information
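Protection against alteration, one of the three threats listed above, can be illustrated with a keyed hash. The sketch below uses Python's standard hmac and hashlib modules; the key, the record layout, and the function names `seal` and `is_unaltered` are assumptions made for this example, and it does not address loss or disclosure.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"   # hypothetical key, for illustration only

def seal(record: bytes) -> str:
    """Compute an integrity tag that only a holder of the key can reproduce."""
    return hmac.new(SECRET_KEY, record, hashlib.sha256).hexdigest()

def is_unaltered(record: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(seal(record), tag)

record = b"account=000123;balance=100.00"
tag = seal(record)                              # stored or transmitted with the record

tampered = b"account=000123;balance=900.00"
print(is_unaltered(record, tag), is_unaltered(tampered, tag))
```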
7. Retrieval
Query languages have come a long way from old-style navigational queries to today's content-based query languages
Important: Any constraint (e.g., a processable feature) may be used as the criterion for search
Require efficient retrieval techniques, similar to those for data retrieval, for all types of information
Web Search Engines

A text retrieval system with a Web interface
The document collection of a search engine can be either a pre-compiled special collection or a set of Web pages collected from many Web servers by a program called a Web robot.
Each document is preprocessed and represented as a vector of terms with weights.
Web Search Engines (contd..)

The steps are:
Stopword removal: Remove non-content words such as 'a' and 'is' from each document.
Stemming: Map variations of the same word into a single term.
Term weighting: Assign a weight to each term in a document to indicate the importance of the term in representing the contents of the document.

A query is also transformed into a vector with weights.
The similarity between a query and a document can be measured by the dot product of their respective vectors.
When documents are HTML Web pages, other factors can influence a term's weight in a document:
-whether the term appears in the title or a header, enclosed in special tags or set in special fonts
-Google and AltaVista use tag and location information
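The preprocessing steps and the dot-product similarity can be sketched in a few lines. Everything specific here is an assumption for illustration: the tiny stopword list, the crude suffix-stripping `stem` function (a stand-in for a real stemmer), the tf-idf style weighting formula, and the three sample documents.

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "and", "is", "the", "of", "to", "in"}   # tiny illustrative list

def stem(word):
    # Crude suffix stripping, standing in for a real stemming algorithm.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text):
    words = [w.strip(".,").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOPWORDS]

def weights(text, doc_freq, n_docs):
    # Term weighting: frequency in this text, scaled up for terms rare in the collection.
    counts = Counter(terms(text))
    return {t: tf * math.log(1 + n_docs / doc_freq.get(t, 1)) for t, tf in counts.items()}

def dot(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = {
    "d1": "Mining of databases is the search for useful patterns.",
    "d2": "Compression reduces the storage requirement of images.",
    "d3": "Search engines index web pages.",
}
doc_terms = {name: set(terms(text)) for name, text in docs.items()}
doc_freq = Counter(t for ts in doc_terms.values() for t in ts)
doc_vecs = {name: weights(text, doc_freq, len(docs)) for name, text in docs.items()}

query_vec = weights("searching databases", doc_freq, len(docs))
ranking = sorted(docs, key=lambda name: dot(query_vec, doc_vecs[name]), reverse=True)
print(ranking)
```

Stemming is what lets the query word "searching" match "search" in d1 and d3; d1 ranks first because it also matches the rarer term derived from "databases".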

Web pages are hyperlinked: there are pointers going from one page to another.
Associated with each pointer are words (anchor terms), which show users what they are likely to find if the pointer is followed.
Anchor terms are utilized to index referenced/pointed pages.
Linkage information can also be combined with similarity information to improve the retrieval effectiveness.
Meta-Search Engines
A metasearch engine has a number of modules.
The user interface module accepts the user's query, which is forwarded, with necessary reformatting, by the query dispatcher module to the various search engines.
When the search engines return their sets of retrieved documents to the metasearch engine, the result merger module merges these sets into a single ranked list of documents.
Meta-Search Engines (contd..)

A certain number of the top documents from this list are displayed to the user.
When the number of search engines underlying a metasearch engine is large, forwarding each user query to every search engine is very inefficient.
To overcome this, a database selection module is included. Its function is to identify, for each user query, the search engines that are likely to return useful documents to the user.
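The modules described above can be sketched as plain functions. The three underlying "engines", their indexes, and their scores are hypothetical; in particular, the merger assumes scores from different engines are directly comparable, which real metasearch engines cannot take for granted.

```python
def make_engine(name, index):
    """Hypothetical search engine: returns (doc, score) pairs for a one-word query."""
    def search(query):
        return index.get(query, [])
    search.engine_name = name
    search.vocabulary = set(index)      # summary used for database selection
    return search

engines = [
    make_engine("sports", {"football": [("s1", 0.9), ("s2", 0.4)]}),
    make_engine("news", {"football": [("n1", 0.7)], "election": [("n2", 0.8)]}),
    make_engine("cooking", {"pasta": [("c1", 0.95)]}),
]

def select_databases(query, engines):
    # Database selection: skip engines unlikely to return useful documents.
    return [e for e in engines if query in e.vocabulary]

def dispatch(query, selected):
    # Query dispatcher: forward the query, reformatted here to lowercase.
    return {e.engine_name: e(query.lower()) for e in selected}

def merge(results, top_k):
    # Result merger: build a single ranked list from all returned sets.
    merged = [hit for hits in results.values() for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:top_k]

def metasearch(query, top_k=3):
    selected = select_databases(query.lower(), engines)
    return merge(dispatch(query, selected), top_k)

print(metasearch("Football"))
```

For the query "Football", the cooking engine is never contacted, which is the saving the database selection module is meant to provide.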
8. Presentation
Information must be presented to the user in a form that is usable
Cookies take care of part of the issue
Issues are diverse and range from formatting, visualization, and language to cultural barriers
In the case of multimedia information, both temporal and spatial issues must be dealt with
9. Transport
Moving of data/information from one location to another
Most common form: digital communication
Technology selection for information transport:

What communication service?
What protocols?
What quality of service?
What physical resources?
10. Information discard

Destruction of information once its useful life is over
Generally, preserve data unless discard is needed
Methods for discard
Legal issues must be taken into account
3. Analysis & mining
Additional notes
Information Analysis and Mining

In multimedia objects:
-Extraction of features
-Their representation
-Indexing on the basis of contents
For data:
-Mining in order to find useful patterns and correlations
For text:
-Conceptual representation
-Ontological classification of concepts
Analysis of Images
Extract features
Color Shape Texture Spatial relationships
Create a logical representation for the image

Semantic nets are effective
Classify and index so that the search process will be efficient
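One way to picture feature extraction followed by indexing is a coarse color histogram per image, with images bucketed by their dominant color cell. This is only a sketch of the color feature: the pixel data, the two-level quantization, and the use of histogram intersection as the similarity measure are assumptions for illustration, and shape, texture, and spatial relationships are not covered.

```python
from collections import defaultdict

def color_histogram(pixels, bins=2):
    """Quantize each RGB channel into `bins` levels and return the fraction of pixels per cell."""
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:
        cell = [min(c * bins // 256, bins - 1) for c in (r, g, b)]
        hist[(cell[0] * bins + cell[1]) * bins + cell[2]] += 1
    total = len(pixels)
    return [count / total for count in hist]

def histogram_intersection(h1, h2):
    # 1.0 means identical color distributions, 0.0 means no overlap.
    return sum(min(a, b) for a, b in zip(h1, h2))

# Hypothetical images given as flat lists of RGB pixels.
images = {
    "sunset": [(250, 120, 10)] * 6 + [(20, 20, 30)] * 2,
    "forest": [(10, 200, 30)] * 7 + [(20, 20, 30)] * 1,
    "dusk": [(240, 100, 40)] * 5 + [(20, 20, 30)] * 3,
}
features = {name: color_histogram(px) for name, px in images.items()}

# Index images by their dominant histogram cell so a search scans only one bucket.
index = defaultdict(list)
for name, hist in features.items():
    index[hist.index(max(hist))].append(name)

query = color_histogram([(255, 110, 20)] * 8)
candidates = index[query.index(max(query))]
best = max(candidates, key=lambda n: histogram_intersection(query, features[n]))
print(candidates, best)
```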
Analysis of Video
Determine video segments by detecting scene cuts (scene cut detection process)
Select a representative frame for each segment
Extract spatial features:
-color, texture, shape, and relative object positions
Extract temporal features:
-object trajectories, camera motion, viewing perspective, temporal relationships among objects
Represent each segment with an object that can be efficiently indexed by its features
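The first two steps, scene cut detection and representative frame selection, can be sketched with histogram differences between consecutive frames. The frame data, the threshold of 0.5, and the choice of the middle frame as representative are assumptions for illustration; the slides do not prescribe a particular detector.

```python
def gray_histogram(frame, bins=4):
    """Normalized histogram of a frame given as a flat list of 0-255 gray levels."""
    hist = [0] * bins
    for value in frame:
        hist[min(value * bins // 256, bins - 1)] += 1
    return [count / len(frame) for count in hist]

def detect_cuts(frames, threshold=0.5):
    """Return indices i where a cut is declared between frame i-1 and frame i."""
    cuts = []
    previous = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        current = gray_histogram(frames[i])
        # Half the L1 distance between histograms lies in [0, 1].
        difference = sum(abs(a - b) for a, b in zip(previous, current)) / 2
        if difference > threshold:
            cuts.append(i)
        previous = current
    return cuts

def segments(frames, cuts):
    bounds = [0] + cuts + [len(frames)]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]

def representative(segment):
    start, end = segment
    return (start + end - 1) // 2     # middle frame of the half-open range [start, end)

# Hypothetical clip: three dark frames followed by three bright frames.
dark, dark_2, bright = [20] * 16, [30] * 16, [220] * 16
frames = [dark, dark_2, dark, bright, bright, bright]

cuts = detect_cuts(frames)
video_segments = segments(frames, cuts)
key_frames = [representative(s) for s in video_segments]
print(cuts, video_segments, key_frames)
```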
Video Indexing process

[Figure: video indexing process. Processing stages shown: Scene Change Detection, Representative Frame Selection/Creation, Audio Analysis, Closed Caption Analysis, Text Analysis, Camera Operation + Object Motion Extraction, Object Segmentation, Spatial Features Extraction. Extracted index data shown: keywords, sound characteristics, camera operation, object trajectory, objects, sketch, spatial relationships, shape, color, texture, description text]
Analysis of Audio
For Speech:
Textual information from speech (sound retrieval then becomes text retrieval)
Speaker information (identification)
For Generic Sound:

Loudness Pitch Tone Cepstrum Derivatives
For Music:
Rhythm Event Instrument
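Two of the generic sound features, loudness and pitch, can be estimated from raw samples as sketched below. The sample rate, the synthetic test tone, and the estimators themselves (root-mean-square amplitude for loudness, zero-crossing counting for pitch) are assumptions chosen for simplicity; zero-crossing counting in particular is only reliable for clean, single-tone signals.

```python
import math

SAMPLE_RATE = 8000   # samples per second (assumed)

def sine(frequency, amplitude, seconds=1.0):
    """Synthetic test tone, standing in for a recorded audio segment."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * frequency * t / SAMPLE_RATE) for t in range(n)]

def loudness_rms(samples):
    # Root-mean-square amplitude as a simple loudness measure.
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def pitch_from_zero_crossings(samples):
    # A pure tone crosses zero twice per cycle, so crossings / 2 per second approximates Hz.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings * SAMPLE_RATE / (2 * len(samples))

tone = sine(frequency=220, amplitude=0.5)
loudness = loudness_rms(tone)
pitch = pitch_from_zero_crossings(tone)
print(f"loudness (RMS) = {loudness:.3f}, estimated pitch = {pitch:.1f} Hz")
```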
Analysis of data
The hardest task: Integration of data from multiple databases
Despite many years of work, we still have difficulty in this area
Data mining tasks: descriptive, predictive

Descriptive: Characterize general properties of the data
Predictive: Perform inference on the data to make predictions
Most common types: Specialized abstracts and integrated tables
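The difference between the two kinds of data mining task can be shown on one small data set. The figures below are invented, and the least-squares line is just one simple example of a predictive model, not a method the slides prescribe.

```python
from statistics import mean, pstdev

# Hypothetical monthly figures: (advertising spend, sales), both in thousands.
history = [(1.0, 12.0), (2.0, 14.0), (3.0, 16.0), (4.0, 18.0)]
spend = [x for x, _ in history]
sales = [y for _, y in history]

# Descriptive: characterize general properties of the data.
summary = {"mean_sales": mean(sales), "spread": pstdev(sales), "max_sales": max(sales)}

# Predictive: fit sales = slope * spend + intercept by least squares, then infer.
x_bar, y_bar = mean(spend), mean(sales)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in history)
         / sum((x - x_bar) ** 2 for x in spend))
intercept = y_bar - slope * x_bar

def predict(x):
    return slope * x + intercept

print(summary)
print(f"predicted sales at spend 5.0: {predict(5.0):.1f}")
```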
Early days of databases

Data Collection and Database Creation (1960s and earlier)
-Primitive file processing
Database management systems

Database Management Systems (1970s-early 1980s)
-Hierarchical and network database systems
-Relational database systems
-Data modeling tools: entity-relationship model, etc.
-Indexing and data organization techniques: B+-tree, hashing, etc.
-Query languages: SQL, etc.
-User interfaces, forms and reports
-Query processing and query optimization
-Transaction management: recovery, concurrency control, etc.
-On-line transaction processing (OLTP)
Current databases
Advanced Database Systems (mid-1980s-present)
-Advanced data models: extended-relational, object-oriented, object-relational, deductive
-Application-oriented: spatial, temporal, multimedia, active, scientific, knowledge bases
Data Integration
Data Warehousing and Data Mining (late 1980s-present)
-Data warehouse and OLAP technology
-Data mining and knowledge discovery
Web-based Database Systems (1990s-present)
-XML-based database systems
-Web mining
Putting it all Together
Data Collection and Database Creation (1960s and earlier)
-Primitive file processing
Database Management Systems (1970s-early 1980s)
-Hierarchical and network database systems
-Relational database systems
-Data modeling tools: entity-relationship model, etc.
-Indexing and data organization techniques: B+-tree, hashing, etc.
-Query languages: SQL, etc.
-User interfaces, forms and reports
-Query processing and query optimization
-Transaction management: recovery, concurrency control, etc.
-On-line transaction processing (OLTP)
Advanced Database Systems (mid-1980s-present)
-Advanced data models: extended-relational, object-oriented, object-relational, deductive
-Application-oriented: spatial, temporal, multimedia, active, scientific, knowledge bases
Data Warehousing and Data Mining (late 1980s-present)
-Data warehouse and OLAP technology
-Data mining and knowledge discovery
Web-based Database Systems (1990s-present)
-XML-based database systems
-Web mining
New Generation of Integrated Information Systems (2000-)
Information mining process

Data cleaning
Reformatting and conversion may be necessary
Data integration
Heterogeneity possible in any aspect
Data selection
Data transformation
Data mining and evaluation of patterns
Presentation of knowledge
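[Figure: the information mining process as a pipeline. Databases and flat files pass through cleaning and integration into a data warehouse; selection and transformation prepare the data for data mining, which yields patterns; evaluation and presentation turn patterns into knowledge]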
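The stages of the information mining process can be chained as small functions. The two source layouts, the sample baskets, and the choice of frequent item pairs as the pattern type are all assumptions for illustration; each function is deliberately minimal.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical sources with inconsistent formatting.
store_a = [{"id": 1, "items": "Bread, Milk"}, {"id": 2, "items": "bread,butter "},
           {"id": 3, "items": None}]
store_b = [{"basket": 7, "goods": ["MILK", "bread"]},
           {"basket": 8, "goods": ["milk", "bread", "eggs"]}]

def clean(records):
    # Data cleaning: drop records with missing values.
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(a, b):
    # Data integration: map both sources onto one schema.
    unified = [{"source": "a", "items": r["items"].split(",")} for r in a]
    unified += [{"source": "b", "items": r["goods"]} for r in b]
    return unified

def select_and_transform(records):
    # Selection keeps only the item lists; transformation normalizes case and spacing.
    return [frozenset(i.strip().lower() for i in r["items"]) for r in records]

def mine(baskets, min_support):
    # Data mining counts item pairs; evaluation keeps pairs that meet min_support.
    pairs = Counter(p for b in baskets for p in combinations(sorted(b), 2))
    return {p: n for p, n in pairs.items() if n >= min_support}

baskets = select_and_transform(integrate(clean(store_a), clean(store_b)))
patterns = mine(baskets, min_support=3)

# Presentation of knowledge: report the surviving patterns in readable form.
for pair, count in sorted(patterns.items()):
    print(f"{pair[0]} + {pair[1]}: {count} of {len(baskets)} baskets")
```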
Data Warehousing and ETL

An organized repository of data from multiple data sources
A unified schema for all of the participating databases
Provides data analysis capabilities, collectively known as On-Line Analytical Processing (OLAP)
A number of pieces are needed: tools, gateways, and conversion routines
Typical architecture of a data warehouse

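[Figure: typical architecture of a data warehouse. Data sources in several locations are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools]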
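The clean, transform, integrate, and load stages of this architecture can be sketched with Python's standard sqlite3 module standing in for the warehouse. The two source extracts, their date and region conventions, and the `sales` table schema are hypothetical.

```python
import sqlite3

# Hypothetical extracts from two operational sources with different conventions.
source_1 = [("2024-01-05", "east", "100.50"), ("2024-01-06", "EAST", "n/a")]
source_2 = [("06/01/2024", "West", 80.0), ("07/01/2024", "east", 20.0)]   # DD/MM/YYYY

def clean(rows):
    cleaned = []
    for date, region, amount in rows:
        try:
            cleaned.append((date, region, float(amount)))
        except ValueError:
            continue                      # drop rows whose amount cannot be parsed
    return cleaned

def transform(rows, day_first):
    out = []
    for date, region, amount in rows:
        if day_first:                     # convert DD/MM/YYYY to ISO YYYY-MM-DD
            day, month, year = date.split("/")
            date = f"{year}-{month}-{day}"
        out.append((date, region.lower(), amount))
    return out

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)")

# Integrate both sources under the unified schema, then load.
rows = (transform(clean(source_1), day_first=False)
        + transform(clean(source_2), day_first=True))
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
warehouse.commit()

# OLAP-style roll-up from the query and analysis side: total sales per region.
totals = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The unified schema is what makes the final query possible: before transformation, "EAST", "east", and the two date formats would have been treated as different values.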
