Informatica Complex Data Exchange™
Getting Started with Complex Data Exchange Version 4.4 August 2007 Copyright © 2001–2007 Informatica Corporation. All rights reserved. This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software is protected by U.S. Patent Numbers and other Patents Pending. Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable. The information in this software and documentation is subject to change without notice. Informatica Corporation does not warrant that this software or documentation is error free. Informatica, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica Complex Data Exchange and Informatica On Demand Data Replicator are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners. Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright © Sun Microsystems. All rights reserved. Copyright 1985-2003 Adobe Systems Inc. All rights reserved. Copyright 1996-2004 Glyph & Cog, LLC. All rights reserved. 
This product includes software developed by Boost (http://www.boost.org/). Permissions and limitations regarding this software are subject to terms available at http://www.boost.org/LICENSE_1_0.txt. This product includes software developed by Mozilla (http://www.mozilla.org/). Your right to use such materials is set forth in the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The Mozilla materials are provided free of charge by Informatica, “as-is”, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. This product includes software developed by the Apache Software Foundation (http://www.apache.org/) which is licensed under the Apache License, Version 2.0 (the “License”). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. This product includes software developed by SourceForge (http://sourceforge.net/projects/mpxj/). Your right to use such materials is set forth in the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The SourceForge materials are provided free of charge by Informatica, “as-is”, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. This product includes Curl software which is Copyright 1996-2007, Daniel Stenberg, <email@example.com>. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://curl.haxx.se/docs/copyright.html. 
Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies. This product includes ICU software which is Copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://www-306.ibm.com/software/globalization/icu/license.jsp This product includes OSSP UUID software which is Copyright (c) 2002 Ralf S. Engelschall, Copyright (c) 2002 The OSSP Project Copyright (c) 2002 Cable & Wireless Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mitlicense.php. This product includes Eclipse software which is Copyright (c) 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.eclipse.org/org/documents/epl-v10.php libstdc++ is distributed with this product subject to the terms related to the code set forth at http://gcc.gnu.org/onlinedocs/libstdc++/17_intro/license.html. DISCLAIMER: Informatica Corporation provides this documentation “as is” without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of non-infringement, merchantability, or use for a particular purpose. The information provided in this documentation may include technical inaccuracies or typographical errors. Informatica could make improvements and/or changes in the products described in this documentation at any time without notice. Part Number: CDE-GST-44000-0001
Table of Contents
Preface . . . vii
    About This Book . . . viii
    Quick Reference . . . viii
    Document Conventions . . . x
    Other Informatica Resources . . . xii
    Visiting Informatica Customer Portal . . . xii
    Visiting the Informatica Web Site . . . xii
    Visiting the Informatica Knowledge Base . . . xii
    Obtaining Customer Support . . . xii

Chapter 1: Introducing Complex Data Exchange . . . 1
    Overview . . . 2
    Introduction to XML . . . 2
    How Complex Data Exchange Works . . . 3
    Using Complex Data Exchange in Integration Applications . . . 4
    Installation . . . 6
    System Requirements . . . 6
    Installation Procedure . . . 6
    Default Installation Folder . . . 6
    License File . . . 6
    Tutorials and Workspace Folders . . . 7
    Exercises and Solutions . . . 7
    XML Editor . . . 7

Chapter 2: Basic Parsing Techniques . . . 9
    Overview . . . 10
    Opening Complex Data Exchange Studio . . . 11
    Importing the Tutorial_1 Project . . . 12
    A Brief Look at the Studio Window . . . 14
    Defining the Structure of a Source Document . . . 16
    Correcting Errors in the Parser Configuration . . . 22
    Techniques for Defining Anchors . . . 23
    Tab-Delimited Format . . . 23
    Running the Parser . . . 25
    Running the Parser on Additional Source Documents . . . 26
    Points to Remember . . . 28
    What's Next? . . . 29

Chapter 3: Defining an HL7 Parser . . . 31
    Overview . . . 32
    Requirements Analysis . . . 32
    Creating a Project . . . 35
    Using XSD Schemas in Complex Data Exchange . . . 37
    Defining the Anchors . . . 38
    Editing the IntelliScript . . . 41
    Testing the Parser . . . 42
    Points to Remember . . . 45

Chapter 4: Positional Parsing of a PDF Document . . . 47
    Overview . . . 48
    Requirements Analysis . . . 48
    Creating the Project . . . 53
    Defining the Anchors . . . 55
    More About Search Scope . . . 59
    Defining the Nested Repeating Groups . . . 60
    Basic and Advanced Properties . . . 66
    Using an Action to Compute Subtotals . . . 67
    Actions . . . 68
    Potential Enhancement: Handling Page Breaks . . . 69
    Points to Remember . . . 70

Chapter 5: Parsing Word and HTML Documents . . . 71
    Overview . . . 72
    Scope of the Exercise . . . 72
    Prerequisites . . . 73
    Requirements Analysis . . . 74
    Source Document . . . 74
    XML Output . . . 75
    The Parsing Problem . . . 75
    Creating the Project . . . 78
    Defining a Variable . . . 79
    Parsing the Name and Address . . . 81
    Why the Output Contains Empty Elements . . . 83
    Parsing the Optional Currency Line . . . 84
    Parsing the Order Table . . . 86
    Why the Output does not Contain HTML Code . . . 90
    Using Count to Resolve Ambiguities . . . 92
    Using Transformers to Modify the Output . . . 94
    Global Components . . . 97
    Testing the Parser on Another Source Document . . . 98
    Points to Remember . . . 100

Chapter 6: Defining a Serializer . . . 101
    Overview . . . 102
    Prerequisite . . . 102
    Requirements Analysis . . . 102
    Creating the Project . . . 104
    Determining the Project Folder Location . . . 105
    Configuring the Serializer . . . 106
    Calling the Serializer Recursively . . . 108
    Defining Multiple Components in a Project . . . 109
    Points to Remember . . . 111

Chapter 7: Defining a Mapper . . . 113
    Overview . . . 114
    Requirements Analysis . . . 114
    Creating the Project . . . 116
    Configuring the Mapper . . . 118
    Points to Remember . . . 120

Chapter 8: Running Complex Data Exchange Engine . . . 121
    Overview . . . 122
    Deploying a Data Transformation as a Service . . . 123
    COM API Application . . . 124
    Source Code . . . 124
    Explanation of the API Calls . . . 126
    Running the COM API Application . . . 127
    Points to Remember . . . 129

Index . . . 131
Preface

Welcome to Informatica Complex Data Exchange, the leading software for automating complex data transformations in high-performance, transaction-intensive applications and service oriented architectures. You can use Complex Data Exchange to transform any data format to any data format, whether the data is structured or unstructured, and whether it exists in an XML, text, or binary representation.

Complex Data Exchange enables organizations to define, deploy, and reuse data transformations without writing code. Complex Data Exchange Studio is a visual design environment for data transformations. Complex Data Exchange Engine is the runtime environment for transformation services. The Complex Data Exchange libraries provide predefined transformations supporting industry standard data formats.

Complex Data Exchange is fully integrated with Informatica PowerCenter and with numerous external systems. PowerCenter applications use the Complex Data Transformation to activate Complex Data Exchange services and perform data transformations.
About This Book

Getting Started with Complex Data Exchange is written for developers and analysts who are responsible for implementing data transformations. As you read it, you will perform several hands-on exercises that teach how to use Complex Data Exchange in real-life data transformation scenarios.

We recommend that all users perform the first and second lessons, which teach the basic techniques for working in Complex Data Exchange Studio. You can then proceed through the other lessons in sequence, or you can skim the chapters and skip to the ones that you need.

When you finish the lessons, you will be familiar with the Complex Data Exchange procedures, and you will be able to apply them to your own data transformation needs.

Quick Reference

The following table is a guide to the Complex Data Exchange concepts, features, and components that are introduced and explained in each lesson.

Concept
    Feature or component: For more information, see...

Working in Complex Data Exchange Studio
    Viewing IntelliScript: “Basic Parsing Techniques” on page 9
    Editing IntelliScript: “Defining an HL7 Parser” on page 31
    Multiple script files: “Defining a Serializer” on page 101
    Color coding of anchors: “Basic Parsing Techniques” on page 9
    Basic and advanced properties: “Positional Parsing of a PDF Document” on page 47
    Global components: “Parsing Word and HTML Documents” on page 71

Projects
    Importing: “Basic Parsing Techniques” on page 9
    Creating: “Defining an HL7 Parser” on page 31
    Containing multiple parsers or serializers: “Defining a Serializer” on page 101
    Project properties: “Defining a Serializer” on page 101
    Determining the project folder location: “Defining a Serializer” on page 101

Parsers
    Example source documents: “Basic Parsing Techniques” on page 9
    Creating: “Defining an HL7 Parser” on page 31
    Running: “Basic Parsing Techniques” on page 9
    Viewing results: “Basic Parsing Techniques” on page 9
    Testing on example source: “Defining an HL7 Parser” on page 31
    Testing on additional source documents: “Basic Parsing Techniques” on page 9; “Parsing Word and HTML Documents” on page 71
    Document processors: “Positional Parsing of a PDF Document” on page 47; “Parsing Word and HTML Documents” on page 71
    Calling a secondary parser: “Defining a Serializer” on page 101

Formats
    Text: “Basic Parsing Techniques” on page 9
    Tab-delimited: “Basic Parsing Techniques” on page 9
    HL7: “Defining an HL7 Parser” on page 31
    Positional: “Positional Parsing of a PDF Document” on page 47
    PDF: “Positional Parsing of a PDF Document” on page 47
    Microsoft Word: “Parsing Word and HTML Documents” on page 71
    HTML: “Parsing Word and HTML Documents” on page 71

Data holders
    Using an XSD schema: “Basic Parsing Techniques” on page 9
    Adding a schema to a project: “Defining an HL7 Parser” on page 31
    Creating and editing schemas: “Defining an HL7 Parser” on page 31
    Using multiple schemas: “Defining a Mapper” on page 113
    Variables: “Parsing Word and HTML Documents” on page 71

Anchors
    Marker: “Basic Parsing Techniques” on page 9
    Marker with count: “Parsing Word and HTML Documents” on page 71
    Content: “Basic Parsing Techniques” on page 9
    Content with positional offsets: “Positional Parsing of a PDF Document” on page 47
    Content with opening and closing markers: “Parsing Word and HTML Documents” on page 71
    EnclosedGroup: “Parsing Word and HTML Documents” on page 71
    RepeatingGroup: “Defining an HL7 Parser” on page 31
    Search scope: “Positional Parsing of a PDF Document” on page 47
    Nested RepeatingGroup: “Positional Parsing of a PDF Document” on page 47
    Newlines as markers and separators: “Positional Parsing of a PDF Document” on page 47

Transformers
    Default transformers: “Parsing Word and HTML Documents” on page 71
    AddString: “Parsing Word and HTML Documents” on page 71
    Replace: “Parsing Word and HTML Documents” on page 71

Actions
    SetValue: “Positional Parsing of a PDF Document” on page 47
    CalculateValue: “Positional Parsing of a PDF Document” on page 47
    Map: “Defining a Mapper” on page 113

Serializers and serialization anchors
    Creating: “Defining a Serializer” on page 101
    ContentSerializer: “Defining a Serializer” on page 101
    RepeatingGroupSerializer: “Defining a Serializer” on page 101
    EmbeddedSerializer: “Defining a Serializer” on page 101
    Calling a secondary serializer: “Defining a Serializer” on page 101

Mappers and mapper anchors
    Creating: “Defining a Mapper” on page 113
    RepeatingGroupMapping: “Defining a Mapper” on page 113

Testing
    Using color coding: “Defining an HL7 Parser” on page 31
    Viewing events: “Basic Parsing Techniques” on page 9
    Interpreting events: “Defining an HL7 Parser” on page 31
    Testing and debugging techniques: “Defining an HL7 Parser” on page 31
    Selecting which parser or serializer to run: “Defining a Serializer” on page 101

Running services in Complex Data Exchange Engine
    Deploying a Complex Data Exchange service: “Running Complex Data Exchange Engine” on page 121
    API: “Running Complex Data Exchange Engine” on page 121

Document Conventions

This guide uses the following formatting conventions:

If you see…: It means…
    italicized text: The word or set of words are especially emphasized.
    boldfaced text: Emphasized subjects.
    italicized monospaced text: This is the variable name for a value you enter as part of an operating system command. This is generic text that should be replaced with user-supplied values.
    Note: The following paragraph provides additional facts.
    Tip: The following paragraph provides suggested uses.
    Warning: The following paragraph notes situations where you can overwrite or corrupt data, unless you follow the specified procedure.
    monospaced text: This is a code example.
    bold monospaced text: This is an operating system command you enter from a prompt to run a task.
Other Informatica Resources

In addition to the product manuals, Informatica provides these other resources:
♦ Informatica Customer Portal
♦ Informatica web site
♦ Informatica Knowledge Base
♦ Informatica Global Customer Support

Visiting Informatica Customer Portal

As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica Knowledge Base, Informatica Documentation Center, and access to the Informatica user community.

Visiting the Informatica Web Site

You can access the Informatica corporate web site at http://www.informatica.com. The site contains information about Informatica, its background, upcoming events, and sales offices. You will also find product and partner information. The services area of the site includes important information about technical support, training and education, and implementation services.

Visiting the Informatica Knowledge Base

As an Informatica customer, you can access the Informatica Knowledge Base at http://my.informatica.com. Use the Knowledge Base to search for documented solutions to known technical issues about Informatica products. You can also find answers to frequently asked questions, technical white papers, and technical tips.

Obtaining Customer Support

There are many ways to access Informatica Global Customer Support. You can contact a Customer Support Center by using the telephone numbers listed in the following table, you can send email, or you can use the WebSupport Service.

Use the following email addresses to contact Informatica Global Customer Support:
♦ firstname.lastname@example.org for technical inquiries
♦ support_admin@informatica.com for general customer service requests

WebSupport requires a user name and password. You can request a user name and password at http://my.informatica.com.

North America / South America
    Informatica Corporation Headquarters
    100 Cardinal Way
    Redwood City, California 94063
    United States
    Toll Free: 877 463 2435
    Standard Rate: United States: 650 385 5800

Europe / Middle East / Africa
    Informatica Software Ltd.
    6 Waltham Park
    Waltham Road, White Waltham
    Maidenhead, Berkshire SL6 3TN
    United Kingdom
    Toll Free: 00 800 4632 4357
    Standard Rate:
        Belgium: +32 15 281 702
        France: +33 1 41 38 92 26
        Germany: +49 1805 702 702
        Netherlands: +31 306 022 797
        United Kingdom: +44 1628 511 445

Asia / Australia
    Informatica Business Solutions Pvt. Ltd.
    Diamond District
    Tower B, 3rd Floor
    150 Airport Road
    Bangalore 560 008
    India
    Toll Free:
        Australia: 00 11 800 4632 4357
        Singapore: 001 800 4632 4357
    Standard Rate: India: +91 80 4112 5738
Chapter 1

Introducing Complex Data Exchange

This chapter includes the following topics:
♦ Overview, 2
♦ Installation, 6
Overview

Informatica Complex Data Exchange is a data transformation system. Using Complex Data Exchange, you can efficiently transform data in any format to XML-based systems.

Complex Data Exchange can process fully structured, semi-structured, or unstructured data. You can configure the software to work with text, binary data, messaging formats, HTML pages, PDF documents, word-processor documents, and any other format that you can imagine.

You can configure a Complex Data Exchange parser to transform the data to any standard or custom XML vocabulary. In the reverse direction, you can configure a Complex Data Exchange serializer to transform the XML data to any other format. You can configure a Complex Data Exchange mapper to perform XML to XML transformations.

You do not need to do any programming to configure the transformation. To configure a data transformation, you can use an approach called parsing by example. You can teach Complex Data Exchange how to convert data to XML, simply by marking up an example in a visual editor environment. You can configure even a complex transformation in just a few hours or days, saving weeks or months of programming time.

This book is a tutorial introduction, intended for users who are new to Complex Data Exchange. As you perform the exercises in this book, you will learn to configure and run your own Complex Data Exchange data transformations.

Introduction to XML

XML (Extensible Markup Language) is the de facto standard for cross-platform information exchange. For the benefit of Complex Data Exchange users who may be new to XML, we present a brief introduction here. If you are already familiar with XML, you can skip this section.

The following is an example of a small XML document:

    <Company industry="metals">
      <Name>Ore Refining Inc.</Name>
      <WebSite>http://www.ore_refining.com</WebSite>
      <Field>iron and steel</Field>
      <Products>
        <Product id="1">cast iron</Product>
        <Product id="2">stainless steel</Product>
      </Products>
    </Company>

This sample is called a well-formed XML document because it complies with the basic XML syntactical rules. It has a tree structure, composed of elements. The top-level element in this example is called Company, and the nested elements are Name, WebSite, Field, Products, and Product.

Each element begins and ends with tags, such as <Company> and </Company>. The elements may also have attributes. For example, industry is an attribute of the Company element, and id is an attribute of the Product element.

To explain the hierarchical relationship between the elements, we sometimes refer to parent and child elements. For example, the Products element is the child of Company and the parent of Product.

The particular system of elements and attributes is called an XML vocabulary. The vocabulary can be customized for any application. In the example of a small XML document above, we made up a vocabulary that might be suitable for a commercial directory.

The vocabulary can be formalized in a syntax specification called a schema. The schema might specify, for example, that Company and Name are required elements, that the other elements are optional, and that the value of the industry attribute must be a member of a predefined list. If an XML document conforms to a rigorous schema definition, the document is said to be valid, in addition to being well-formed.

To make the XML document easier to read, we have indented the lines to illustrate how the elements are nested. The indentation and whitespace are not essential parts of the XML syntax. We could have written a long, unbroken string such as the following, which does not contain any extra whitespace:

    <Company industry="metals"><Name>Ore Refining Inc.</Name><WebSite>http://www.ore_refining.com</WebSite><Field>iron and steel</Field><Products><Product id="1">cast iron</Product><Product id="2">stainless steel</Product></Products></Company>

The unbroken-string representation is identical to the indented representation. In fact, a computer might store the XML as a string like this. The indented representation is how XML is conventionally presented in a book or on a computer screen because it is easier to read.

For More Information

You can get information about XML from many books, articles, or web sites. For an excellent tutorial, see http://www.w3schools.com. To obtain copies of the XML standards, see http://www.w3.org.

How Complex Data Exchange Works

The Complex Data Exchange system has two main components:

Component: Description
    Complex Data Exchange Studio: The design and configuration environment of Complex Data Exchange.
    Complex Data Exchange Engine: The data transformation engine.
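As a side note on the XML introduction earlier in this chapter, the following Python sketch parses the sample document in both its indented and unbroken-string forms and compares the resulting element trees. It uses only the Python standard library (xml.etree.ElementTree), which is not part of Complex Data Exchange. The helper name summarize is made up for this illustration, and the comparison deliberately ignores whitespace-only text between tags.

```python
import xml.etree.ElementTree as ET

# The sample document, indented for readability.
INDENTED = """<Company industry="metals">
  <Name>Ore Refining Inc.</Name>
  <WebSite>http://www.ore_refining.com</WebSite>
  <Field>iron and steel</Field>
  <Products>
    <Product id="1">cast iron</Product>
    <Product id="2">stainless steel</Product>
  </Products>
</Company>"""

# The same document as one unbroken string, with no whitespace between tags.
UNBROKEN = (
    '<Company industry="metals"><Name>Ore Refining Inc.</Name>'
    "<WebSite>http://www.ore_refining.com</WebSite>"
    "<Field>iron and steel</Field>"
    '<Products><Product id="1">cast iron</Product>'
    '<Product id="2">stainless steel</Product></Products></Company>'
)


def summarize(xml_text):
    """Return (tag, attributes, stripped text) for every element, in document order.

    ET.fromstring raises ParseError if the text is not well-formed.
    Stripping the text discards the indentation between tags (illustration only).
    """
    root = ET.fromstring(xml_text)
    return [
        (elem.tag, dict(elem.attrib), (elem.text or "").strip())
        for elem in root.iter()
    ]


indented_summary = summarize(INDENTED)
unbroken_summary = summarize(UNBROKEN)

# Navigate parent -> child: Company -> Products -> Product, reading the id attribute.
company = ET.fromstring(INDENTED)
product_ids = [product.get("id") for product in company.find("Products")]

print(indented_summary == unbroken_summary)
print(product_ids)
```

Under these assumptions, both forms yield the same seven elements. Note that this only checks well-formedness; checking validity against a schema requires a schema-aware tool, which the standard library module used here does not provide.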
Complex Data Exchange Studio

The Studio is a visual editor environment where you can design and configure data transformations such as parsers, serializers, and mappers.

Use the Studio to configure Complex Data Exchange to process data of a particular type. If you are building a parser, for example, you can use a select-and-click approach to identify the data fields in an example source document, and define how the software should transform the fields to XML. This procedure is called parsing by example.

Note that we use the term document in the broadest possible sense. A document can contain text or binary data, and it can have any size. It can be stored or accessed in a file, buffer, stream, URL, database, messaging system, or any other location.

Complex Data Exchange Engine

Complex Data Exchange Engine is an efficient data transformation processor. It has no user interface. It works entirely in the background, executing the data transformations that you have previously defined in Studio.

To move a data transformation from the Studio to the Engine, you must deploy the transformation as a Complex Data Exchange service.

An integration application can communicate with the Engine by submitting requests in a number of ways, for example, by calling the Complex Data Exchange API. Another possibility is to use a Complex Data Exchange integration agent. A request specifies the data to be transformed and the service that should perform the transformation. The Engine executes the request and returns the output to the calling application.

Using Complex Data Exchange in Integration Applications

The following paragraphs present some typical examples of how Complex Data Exchange data transformations are used in system integration applications.

As you perform the exercises in this book, you will get experience using Complex Data Exchange for these types of data transformations. The chapter on HL7 parsers, for example, illustrates how to use Complex Data Exchange to parse HL7 messages. For more information, see “Defining an HL7 Parser” on page 31. The chapters on positional parsing and parsing Word documents describe parsers that process various types of unstructured documents. For more information, see “Positional Parsing of a PDF Document” on page 47 and “Parsing Word and HTML Documents” on page 71.

HL7 Integration

HL7 is a messaging standard used in the health industry. HL7 messages have a flexible structure that supports optional and repetitive data fields. The fields are separated by a hierarchy of delimiter symbols.

In a typical integration application, a major health maintenance organization (HMO) uses Complex Data Exchange to transform messages that are transmitted to and from its HL7-based information systems.
Processing PDF Forms

The PDF file format has become a standard for formatted document exchange. The format permits users to view fully formatted documents, including the original layout, fonts, and graphics, on a wide variety of supported platforms. PDF files are less useful for information processing, however, since applications cannot access and analyze their unstructured, binary representation of data.

Complex Data Exchange solves this problem by enabling conversion of PDF documents to an XML representation. For example, Complex Data Exchange can convert invoices that suppliers send in PDF format to XML, for storage in a database.

Converting HTML Pages to XML

Information in HTML documents is usually presented in unstructured and unstandardized formats. The goal of the HTML presentation is visual display, rather than information processing.

Complex Data Exchange has many features that can navigate, locate, and store information found in HTML documents. The software enables conversion of information from HTML to a structured XML representation, making the information accessible to software applications. For example, retailers who present their stock on the web can convert the information to XML, letting them share the information with a clearing house or with other retailers.
Installation

Before you continue in this book, you should install the Complex Data Exchange software, if you have not already done so. The following paragraphs contain brief instructions to help you get started. For more information about the system requirements, installation, and registration, see the Complex Data Exchange Administrator Guide.

System Requirements

To perform the exercises in this book, you should install Complex Data Exchange on a computer that meets the following minimum requirements:
♦ Microsoft Windows 2000, XP Professional, or 2003 Server
♦ Microsoft Internet Explorer, version 6.0 or higher
♦ Microsoft .NET Framework, version 1.1 or higher
♦ At least 128 MB of RAM

Installation Procedure

To install the software, double-click the setup file and follow the instructions. Be sure to install at least the following components:

Component            Description
Engine               The Complex Data Exchange Engine component, required for all lessons in this book.
Studio               The Complex Data Exchange Studio design and configuration environment, required for all lessons in this book.
Document Processors  Optional components, required for the lessons on parsing PDF and Microsoft Word documents.

Default Installation Folder

By default, Complex Data Exchange is installed in the following location:

c:\Program Files\Informatica\ComplexDataExchange

The setup prompts you to change the location if desired.

License File

No license file is required to use the Complex Data Exchange Studio design and configuration environment. Running services in Complex Data Exchange Engine requires a license file. This is necessary only in the lesson on using the Complex Data Exchange API. For more information, see “Running Complex Data Exchange Engine” on page 121.
Tutorials and Workspace Folders

To do the exercises in this book, you need the tutorial files, located in the tutorials folder of your main Complex Data Exchange program folder. By default, the location is:

c:\Program Files\Informatica\ComplexDataExchange\tutorials

As you perform the exercises, you will import or copy some of the contents of this folder to the Complex Data Exchange Studio workspace folder. The default location of the workspace is:

My Documents\Informatica\ComplexDataExchange\4.0\workspace

You should work on the copies in the workspace. We recommend that you do not modify the originals in the tutorials folder, in case you need them again.

Exercises and Solutions

The tutorials folder contains two subfolders:
♦ Exercises. This folder contains the files that you need to do the exercises. Throughout this book, we will refer you to files in this folder.
♦ Solutions to Exercises. This folder contains our proposed solutions to the exercises. The solutions are projects having names such as TutorialSol_1 and TutorialSol_2. You can import the projects to your workspace and compare our solutions with yours. Note that there might be more than one correct solution to the exercises.

As you perform the exercises, you will create Complex Data Exchange projects that have names such as Tutorial_1 and Tutorial_2. The projects will be stored in your Complex Data Exchange Studio workspace folder.

XML Editor

By default, Complex Data Exchange Studio displays XML files in a plain-text editor. Optionally, you can configure the Studio to use Microsoft Internet Explorer as a read-only XML editor. Because Internet Explorer displays XML with color coding and indentation, it can help make a complex XML structure much easier to understand.
The XML illustrations throughout this book use the Internet Explorer display.

To select Internet Explorer as the XML editor:
1. Open Complex Data Exchange Studio.
2. On the menu, click Window > Preferences.
3. On the left side of the Preferences window, select General > Editors > File Associations.
4. On the upper right, select the *.xml file type. If it is not displayed, click Add and enter the *.xml file type.
5. On the lower right, click Add and browse to c:\Program Files\Internet Explorer\IEXPLORE.EXE.
6. Click the Default button to make IEXPLORE the default XML editor.
7. Close and re-open Complex Data Exchange Studio.
Chapter 2

Basic Parsing Techniques

This chapter includes the following topics:
♦ Overview, 10
♦ Opening Complex Data Exchange Studio, 11
♦ Importing the Tutorial_1 Project, 12
♦ A Brief Look at the Studio Window, 14
♦ Defining the Structure of a Source Document, 16
♦ Running the Parser, 25
♦ Points to Remember, 28
♦ What's Next?, 29
Overview

To help you start using Complex Data Exchange quickly, we provide a partially configured project containing a simple parser. The project is called Tutorial_1. Working in the Complex Data Exchange Studio environment, you will edit and complete the configuration. You will then use the parser to convert a few sample text documents to XML.

The main purposes of this exercise are to:
♦ Demonstrate the parsing-by-example approach
♦ Start learning how to define and use a parser

Along the way, you will use some of the important Complex Data Exchange Studio features, such as:
♦ Importing and opening a project
♦ Defining the structure of the output XML by using an XSD schema
♦ Defining the source document structure by using anchors of type Marker and Content
♦ Defining a parser based on an example source document
♦ Using the parser to transform multiple source documents to XML
♦ Viewing the event log, which displays the operations that the data transformation performed
Opening Complex Data Exchange Studio

To open the Studio:
1. On the Start menu, click Programs > Informatica > Complex Data Exchange > Studio.
2. To display the Complex Data Exchange Studio Authoring perspective, click Window > Open Perspective > Other > Complex Data Exchange Studio Authoring. The Complex Data Exchange Studio Authoring perspective is a set of windows, menus, and toolbars that you can use to edit projects in Eclipse.
3. Optionally, click Window > Reset Perspective. This is a good idea if you have previously worked in Complex Data Exchange Studio, and you have moved or resized the windows. The command restores all the windows to their default sizes and locations.
4. To display introductory instructions on how to use the Studio, click Help > Welcome and select the Complex Data Exchange Studio welcome page. You can close the welcome page by clicking the X at the right side of the editor tab.
Importing the Tutorial_1 Project

To open the partially configured Tutorial_1 project file, you must first import it to the Eclipse workspace.

To import the Tutorial_1 project:
1. On the menu, click File > Import. At the prompt, select the option to import an Existing Complex Data Exchange Project into Workspace.
2. Click the Next button and browse to the following file:

c:\Program Files\Informatica\ComplexDataExchange\tutorials\Exercises\Tutorial_1\Tutorial_1.cmw

Note: Substitute your path to the Complex Data Exchange installation folder.
3. Accept the default options on the remaining wizard pages. Click Finish to complete the import.

The Eclipse workspace folder now contains the imported Tutorial_1 folder:

My Documents\Informatica\ComplexDataExchange\4.0\workspace\Tutorial_1
4. In the upper left of the Eclipse window, the Complex Data Exchange Explorer displays the Tutorial_1 files that you have imported.

The following is a brief description of the folders and files:

Folder/File  Description
Examples     This folder contains an example source document, which is the sample input that you will use to configure the parser.
Scripts      A TGP script file, which stores the parser configuration.
XSD          An XSD schema file, which defines the XML structure that the parser will create.
Results      This folder is temporarily empty. When you configure and run the parser, Complex Data Exchange will store its output in this folder.

Most of these folders are virtual. They are used to categorize the files in the display, but they do not actually exist on your disk. Only the Results folder is a physical directory, which Complex Data Exchange creates when your data transformation generates output.

The Tutorial_1 folder contains additional files which do not display in the Complex Data Exchange Explorer, for example:

File            Description
Tutorial_1.cmw  The main project file, containing the project configuration properties.
.project        A file generated by the Eclipse development environment. This is not a Complex Data Exchange file, but you need it to open the Complex Data Exchange project in Eclipse.
A Brief Look at the Studio Window

Complex Data Exchange Studio displays numerous windows. The windows are of two types, called views and editors. A view displays data about a project or lets you perform specific operations. An editor lets you edit the configuration of a project freely.

The following paragraphs describe the views and editors, starting from the upper left and moving counterclockwise around the screen.

Upper Left

View                                Description
Complex Data Exchange Explorer view  Displays the projects and files in the Complex Data Exchange Studio workspace. By right-clicking or double-clicking in this view, you can add existing files to a project, create new files, or open files for editing.

Lower Left

The lower left corner of the Complex Data Exchange window displays two stacked views. You can switch between them by clicking the tabs on the bottom.

View                          Description
Component view                Displays the main components that are defined in a project, such as parsers, serializers, mappers, transformers, and variables. By right-clicking or double-clicking, you can open a component for editing.
IntelliScript Assistant view  The view helps you configure certain components in the IntelliScript configuration of a data transformation. For an explanation, see the description of the IntelliScript editor below.
The lower right displays several views, which you can select by clicking the tabs.
View                Description
Help view           Displays help as you work in an IntelliScript editor. When you select an item in the editor, the help scrolls automatically to an appropriate topic. You can also display the Complex Data Exchange help from the Complex Data Exchange Studio Help menu, or from the Informatica > Complex Data Exchange folder on the Windows Start menu. These approaches let you access the complete Complex Data Exchange documentation.
Events view         Displays events that occur as you run a data transformation. You can use the events to confirm that a transformation is running correctly or to diagnose problems.
Binary Source view  Displays the binary codes of the example source document. This is useful if you are parsing binary input, or if you need to view special characters such as newlines and tabs.
Schema view         Displays the XSD schemas associated with a project. The schemas define the XML structures that a data transformation can process.
Repository view     Displays the services that are deployed for running in Complex Data Exchange Engine.
Complex Data Exchange Studio displays editor windows on the upper right. You can open multiple editors, and switch between them by clicking the tabs.
Editor                Description
IntelliScript editor  Used to configure a data transformation. This is where you will perform most of the work as you do the exercises in this book. The left pane of the IntelliScript editor is called the IntelliScript pane. This is where you define the data transformation. The IntelliScript has a tree structure, which defines the sequence of Complex Data Exchange components that perform the transformation. The right pane is called the example pane. It displays the example source document of a parser. You can use this pane to configure a parser.
XSD editor            Used to configure an XSD schema. For an explanation of how to use this editor, see the Complex Data Exchange XSD Editor manual.
Figure 2-1. IntelliScript Editor
Defining the Structure of a Source Document
You are ready to start defining the structure of the source document. You will use the parsing-by-example approach to do this.
To define the structure of the source document:
1. In the Complex Data Exchange Explorer, expand the Tutorial_1 files node and double-click the Tutorial_11.tgp file. The file opens in an IntelliScript editor.
2. The IntelliScript displays a Parser component, called MyFirstParser. Expand the IntelliScript tree, and examine the properties of the Parser. They include the following values, which we have configured for you:
Property        Description
example_source  The example source document, which you will use to configure the parser. We have selected a file called File1.txt as the example source. The file is stored in the project folder.

We have also specified that the example source has a TextFormat, and that it is TabDelimited. This means that the text fields are separated from each other by tab characters.
The example source, File1.txt, should be displayed in the right pane of the editor. If it is not, right-click MyFirstParser in the left pane, and click Open Example Source.
Note: You can toggle the display of the left and right panes. To do this, open the IntelliScript menu and select IntelliScript, Example, or Both. There are also toolbar buttons for these options.
Examine the example source more closely. It contains two kinds of information. The left entries, such as First Name:, Last Name:, and Id:, are called Marker anchors. They mark the locations of data fields. The right entries, such as Ron, Lehrer, and 547329876, are the Content anchors, which are the values of the data fields.
The Marker and Content anchors are separated by tab characters, which are called tab delimiters.
Figure 2-2. Marker and Content Anchors
To make your job easier, we have already configured the parser with the basic properties, such as the TabDelimited format. To complete the configuration of the parser, you will configure the parser to search for each Marker anchor and retrieve the data from the Content anchor that follows it. The parser will then insert the data in the XML structure. As you continue in this book, you will learn about many other types of anchors and delimiters. Anchors are the places where the parser hooks into a source document and performs an operation. Delimiters are characters that separate the anchors. Complex Data Exchange lets you define anchors and delimiters in many different ways.
You are now ready to start parsing by example. In the example pane, move the mouse over the first Marker anchor, which is First Name:.
5. Click Insert Marker.
6. Your last action caused a Marker anchor to be inserted in the IntelliScript. You are now prompted for the first property that requires your input, which is the text of the anchor that the parser searches for. The default is the text that you selected in the example source. Since this is the desired setting, press Enter.
7. You are now prompted for the next property that needs your input, which is search. This property lets you specify the type of search, and the default setting is TextSearch. Press Enter again.

In the images displayed here, the selected property is white, and the background is gray. On your screen, the background might be white. You can control this behavior by using the Window > Preferences command. On the Complex Data Exchange page of the preferences, select or deselect the option to Highlight Focused Instance.

8. The IntelliScript displays the new Marker anchor as part of the MyFirstParser definition. In the example source, the Marker is highlighted in yellow. If the color coding is not immediately displayed, open the IntelliScript menu, and confirm that the option to Learn the Example Automatically is checked.
9. Now, you are ready to create the first Content anchor. In the example source, select the word Ron. Right-click the selected text. On the pop-up menu, click Insert Content.
10. This inserts a Content anchor in the IntelliScript, and prompts you for the first property that requires your input, which is value. This property lets you specify how the parsing will be performed.
The default is LearnByExample, which means that the parser finds the anchor based on the delimiters surrounding it in the example source. This is the desired behavior. To accept the default, press Enter.
11. You can accept the defaults for the next few properties, such as example, opening_marker, and closing_marker, or you can skip them by clicking elsewhere with the mouse.
12. Your next task is to specify where the Content anchor stores the data that it extracts from the source document. The output location is called a data holder. This is a generic term which means “an XML element, an XML attribute, or a variable”. In this exercise, you will use data holders that are elements or attributes. In a later exercise, you will use variables.

To define the data holder, select the data_holder property, and double-click or press Enter. This opens a Schema view, which displays the XML elements and attributes that are defined in the XSD schema of the project, Person.xsd. Expand the no target namespace node, and select the First element. This means that the Content anchor stores its output in an XML element called First, like this:

<First>Ron</First>

Figure 2-3. Selecting a Data Holder
More precisely, the First element is nested inside Name, which is nested inside Person. The actual output structure generated by the parser will be:

<Person>
  <Name>
    <First>Ron</First>
  </Name>
</Person>

13. When you click the OK button, the data_holder property is assigned the value /Person/*s/Name/*s/First. In XML terminology, this is called an XPath expression. It is a representation of the XML structure that we have outlined above. Actually, the standard XPath expression is /Person/Name/First. The *s symbols are Complex Data Exchange extensions of the XPath syntax, which you can ignore.

Complex Data Exchange Studio highlights the Content anchor in the example source. The Marker and Content anchors are in different colors, to help you distinguish them.
14. Click the File > Save command on the menu, or click the Save button on the toolbar. This ensures that your editing is saved in the project folder.

So far, you have told the parser to search for the First Name: anchor, retrieve the Ron anchor that follows it, and store the content in the First element of the XSD schema.
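The relationship between the data holder value and the nested XML can be checked outside the Studio. The following is a minimal Python sketch, not part of Complex Data Exchange. It builds the Person/Name/First structure with the standard-library xml.etree.ElementTree module and then locates the First element by path. The helper strip_extensions is hypothetical, and it assumes that the *s steps can simply be dropped, as the text above suggests. ElementTree supports only a subset of XPath and searches relative to an element, so the sketch removes the leading /Person/ step before searching.

```python
import xml.etree.ElementTree as ET

# Build the structure that the Content anchor produces for the value "Ron".
person = ET.Element("Person")
name = ET.SubElement(person, "Name")
first = ET.SubElement(name, "First")
first.text = "Ron"


def strip_extensions(data_holder: str) -> str:
    """Drop the *s steps from a data holder path (hypothetical helper)."""
    return "/".join(step for step in data_holder.split("/") if step != "*s")


standard_path = strip_extensions("/Person/*s/Name/*s/First")
print(standard_path)

# ElementTree paths are relative to the element being searched,
# so remove the leading "/Person/" step before calling find().
relative_path = standard_path[len("/Person/"):]
print(person.find(relative_path).text)
print(ET.tostring(person, encoding="unicode"))
```

The sketch only illustrates how the path mirrors the nesting of the elements. It makes no claim about how the Studio itself evaluates data holders.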
15. Proceed through the source document, defining the other Marker and Content anchors in the same way. The following table lists the anchors that you need to define.

Anchor      Anchor Type  Data Holder
Last Name:  Marker
Lehrer      Content      /Person/*s/Name/*s/Last
Id:         Marker
547329876   Content      /Person/*s/Id
Age:        Marker
27          Content      /Person/*s/Age
Gender:     Marker
M           Content      /Person/@gender

Be sure to define the anchors in the above sequence, which is their sequence in the source document. Using the correct sequence is important in establishing the correct relationship between each Marker and the corresponding Content. If you make a mistake in the sequence, Complex Data Exchange might fail to find the text.

When you have finished, the Complex Data Exchange Studio window should look like the following illustration. You can expand or collapse the tree to view the illustrated lines. Save the project again.

Correcting Errors in the Parser Configuration

As you define anchors, you might occasionally make a mistake such as selecting the wrong text, or setting the wrong property values for an anchor. If you make a mistake, you can correct it in several ways:
♦ On the menu, you can click Edit > Undo.
♦ In the IntelliScript pane, you can select a component that you have added to the configuration and press the Delete key.
♦ In the IntelliScript pane, you can right-click a component and click Delete.

As you gain more experience working in the IntelliScript pane, you can use the following additional techniques:
♦ If you create an anchor in the wrong sequence, you can drag it to the correct location in the IntelliScript.
♦ If you forget to define an anchor, right-click the anchor following the omitted anchor location and click Insert.
♦ You can copy and paste components such as anchors in the IntelliScript.
♦ You can edit the property values in the IntelliScript.

Techniques for Defining Anchors

In the above steps, you inserted anchors by using the Insert Marker and Insert Content commands. There are several alternative ways to define anchors:
♦ You can define a Content anchor by dragging text from the example source to a data holder in the Schema view. This inserts the anchor in the IntelliScript, where the data_holder property is already assigned.
♦ You can define anchors by typing in the IntelliScript pane, without using the example source.
♦ You can edit the properties of Marker and Content anchors in the IntelliScript Assistant view.

We encourage you to experiment with these features. For more information, see the book Using Complex Data Exchange Studio in Eclipse.

Tab-Delimited Format

Do you remember that MyFirstParser is defined with a TabDelimited format? The delimiters define how the parser interprets the example source. In this case, the parser understands that the Marker and Content anchors are separated by tab characters.

In the instructions, we suggested that you select the Marker anchors including the colon character, for example:

First Name:

Because the tab-delimited format is selected, the colon isn't actually important in this parser. If you had selected First Name without the colon, the parser would still find the Marker and the tab following the Marker, and it would read the Content correctly. It would ignore other characters, such as the colon.

The tab-delimited format also explains why you can select a short Content anchor such as Ron, and not be concerned about the field size. In another source document, a person might have a long first name such as Rumpelstiltskin. By default, a tab-delimited parser reads the entire string after the tab, up to the line break. This is the case, unless the line contains another anchor or tab character.
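The behavior described in this section can be imitated in a few lines of ordinary code. The following Python sketch is only an illustration under stated assumptions and is not how Complex Data Exchange is implemented. The names ANCHORS, store, and parse_person are hypothetical, the data holders are written as plain paths without the *s extensions, and the order of the output elements is not checked against Person.xsd. For each Marker, the sketch searches for the marker text, skips ahead to the next tab, and reads the Content up to the following tab or line break.

```python
import xml.etree.ElementTree as ET

# Marker text -> data holder, following the anchor table in this chapter.
# The paths are simplified: the *s extensions are omitted.
ANCHORS = [
    ("First Name", "Name/First"),
    ("Last Name", "Name/Last"),
    ("Id", "Id"),
    ("Age", "Age"),
    ("Gender", "@gender"),
]


def store(root: ET.Element, holder: str, value: str) -> None:
    """Store a value in an attribute (@name) or in a nested element."""
    if holder.startswith("@"):
        root.set(holder[1:], value)
        return
    node = root
    for step in holder.split("/"):
        child = node.find(step)
        if child is None:
            child = ET.SubElement(node, step)
        node = child
    node.text = value


def parse_person(source: str) -> ET.Element:
    person = ET.Element("Person")
    position = 0
    for marker, holder in ANCHORS:
        # Marker anchor: search for the marker text after the previous anchor.
        start = source.find(marker, position)
        if start == -1:
            continue
        # Skip any other characters, such as the colon, up to the tab delimiter.
        tab = source.find("\t", start + len(marker))
        if tab == -1:
            continue
        # Content anchor: read up to the next tab or line break.
        end = tab + 1
        while end < len(source) and source[end] not in "\t\r\n":
            end += 1
        store(person, holder, source[tab + 1:end])
        position = end
    return person


SOURCE = "First Name:\tRon\nLast Name:\tLehrer\nId:\t547329876\nAge:\t27\nGender:\tM\n"
print(ET.tostring(parse_person(SOURCE), encoding="unicode"))
```

Because the marker text in ANCHORS omits the colon, the sketch also shows why the colon is not important when the delimiter is a tab, and why the length of the Content does not matter.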
Running the Parser

You are now ready to test the parser that you have configured.

To test the parser:
1. In the IntelliScript, right-click the Parser component, which is named MyFirstParser. Click the option to Set as Startup Component. This means that when you run the project, Complex Data Exchange will activate MyFirstParser.
2. On the menu, click Run > Run MyFirstParser.
3. After a moment, the Studio displays the Events view. Among other information, the events list all the Marker and Content anchors that the parser found in the example source. You can use the Events view to examine any errors encountered during execution. Assuming that you followed the instructions carefully, the execution should be error-free.
4. Now you can examine the output of the parser process. In the Complex Data Exchange Explorer, expand the Tutorial_1 files/Results node, and double-click the file output.xml. The Studio displays the XML file in an Internet Explorer window.

If you are using Windows XP SP2 or higher, Internet Explorer might display a warning about active content in the XML results file. You can safely ignore the warning.

Examine the output carefully, and confirm that the results are correct. If the results are incorrect, examine the parser configuration that you created, correct any mistakes, and try again.
Do not worry if the XML processing instruction (<?xml version="1.0"...?>) is missing or different from the one displayed above. This is due to the settings in the project properties. You can ignore this issue for now.

Running the Parser on Additional Source Documents

To further test the parser, you can run it on additional source documents, other than the example source.

To test the parser on additional source documents:
1. At the right of the Parser component, click the >> symbol. This displays the advanced properties of the component. The >> symbol changes to <<. You can close the advanced properties by clicking the << symbol.

Figure 2-4. Displaying Advanced Properties

2. Select the sources_to_extract property, and double-click or press Enter. From the drop-down list, select LocalFile.
3. Expand the LocalFile node of the IntelliScript, and assign the file_name property. Complex Data Exchange Studio displays an Open dialog, where you can browse to the file. You can use one of the test files that we have supplied, which are in the folder

tutorials\Exercises\Tutorial_1\Additional source files

For example, select File2.txt, which has the following content:

First Name:  Dan
Last Name:   Smith
Id:          45209875
Age:         13
Gender:      M

4. Run the parser again. You should get the same XML structure as above, containing the data that the parser found in the test file.
Points to Remember

A Complex Data Exchange project contains the parser configuration and the XSD schema. It usually also contains the example source document, which the parser uses to learn the document structure, and other files such as the parsing output.

The Complex Data Exchange Explorer displays the projects that exist in your Studio workspace. To copy an existing project into the workspace, use the File > Import command. To open a data transformation for editing in the IntelliScript editor, double-click its TGP script file in the Complex Data Exchange Explorer.

In the IntelliScript, you can add components such as anchors. The anchors define locations in the source document, which the parser seeks and processes. Marker anchors label the data fields, and Content anchors extract the field values.

To define a Marker or Content anchor, select the source text, right-click, and choose the anchor type. This inserts the anchor in the IntelliScript, where you can set its properties. For example, you can set the data holder (an XML element or attribute) where a Content anchor stores its output.

Use the color coding to review the anchor definitions.

The delimiters define the relation between the anchors. A tab-delimited format means that the anchors are separated by tabs.

To test a parser, set it as the startup component. Then use the Run command on the Complex Data Exchange Studio menu. To view the results file, double-click its name in the Complex Data Exchange Explorer.
What's Next?

Congratulations! You have configured and run your first Complex Data Exchange parser.

Of course, the source documents that you parsed had a very simple structure. The documents contained a few Marker anchors, each of which was followed by a tab character and by Content. Moreover, all the documents had exactly the same Marker anchors in the same sequence. This made the parsing easy because you did not need to consider the possible variations among the source documents.

In real-life uses of parsers, very few source documents have such a simple, rigid structure. In the following chapters, you will learn how to parse complex and flexible document structures using a variety of parsing techniques. All the techniques are based, however, on the simple steps that you learned in this chapter.
Chapter 3

Defining an HL7 Parser

This chapter includes the following topics:
♦ Overview, 32
♦ Creating a Project, 35
♦ Defining the Anchors, 38
♦ Testing the Parser, 42
♦ Points to Remember, 45
Overview
In this chapter, you will parse an HL7 message. HL7 is a standard messaging format used in medical information systems. The structure is characterized by a hierarchy of delimiters and by repeating elements. After you learn the techniques for processing these features, you will be able to parse a large variety of documents that are used in real applications.
In this lesson, you will configure the parser yourself. We will provide the example source document and a schema for the output XML vocabulary. You will learn techniques such as:
♦ Creating a project
♦ Creating a parser
♦ Parsing a document selectively, that is, retrieving selected data and ignoring the rest
♦ Defining a repeating group
♦ Using delimiters to define the source document structure
♦ Testing and debugging a parser

Requirements Analysis
Before you start the exercise, we will analyze the input and output requirements of the project. As you design the parser, you will use this information to guide the configuration.

HL7 Background
HL7 is a messaging standard for the health services industry. It is used worldwide in hospital and medical information systems. For more information about HL7, see the Health Level 7 web site, http://www.hl7.org.

Input HL7 Message Structure
The following lines illustrate a typical HL7 message, which you will use as the source document for parsing.
MSH|^~\&|LAB||CDB||||ORU^R01|K172|P
PID|||PATID1234^5^M11||Jones^William||19610613|M
OBR||||80004^Electrolytes
OBX|1|ST|84295^Na||150|mmol/l|136-148|Above high normal|||Final results
OBX|2|ST|84132^K+||4.5|mmol/l|3.5-5|Normal|||Final results
OBX|3|ST|82435^Cl||102|mmol/l|94-105|Normal|||Final results
OBX|4|ST|82374^CO2||27|mmol/l|24-31|Normal|||Final results
The message is composed of segments, which are separated by carriage returns. Each segment has a three-character label, such as MSH (message header) or PID (patient identification). Each segment contains a predefined hierarchy of fields and sub-fields, which are delimited by the characters immediately following the MSH designator (|^~\&).
For example, the patient's name (Jones^William) follows the PID label by five | delimiters. The last and first names (Jones and William) are separated by a ^ delimiter.
The message type is specified by a field in the MSH segment. In the above example, the message type is ORU, subtype R01, which means Unsolicited Transmission of an Observation Message. The OBR segment specifies the type of observation, and the OBX segments list the observation results.
In this chapter, you will configure a parser that processes ORU messages such as the above example. Some key issues in the parser definition are how to define the delimiters and how to process the repeating OBX group.

Output XML Structure
The purpose of this exercise is to create a parser, which will convert the above HL7 message to the following XML output:
<Message type="..." id="...">
  <Patient id="..." gender="...">
    <f_name>...</f_name>
    <l_name>...</l_name>
    <birth_date>...</birth_date>
  </Patient>
  <Test_Type test_id="...">...</Test_Type>
  <Result num="...">
    <type>...</type>
    <value>...</value>
    <range>...</range>
    <comment>...</comment>
    <status>...</status>
  </Result>
  <Result num="...">
    <type>...</type>
    <value>...</value>
    <range>...</range>
    <comment>...</comment>
    <status>...</status>
  </Result>
  <!-- Additional Result elements as needed -->
</Message>
The XML structure contains the elements that are required for retrieval. Notice the repeating Result element. This element will store data from the repeating OBX segment of the HL7 message.
The XML has elements that can store much, but not all, of the data in the sample HL7 message. That is acceptable. In this exercise, you will build a parser that processes the data in the source document selectively, retrieving the information that it needs and ignoring the rest.
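The delimiter hierarchy described above can be tried outside the Studio. The following Python sketch is only an illustration and is not part of Complex Data Exchange. The function name split_segments is an assumption made for this example. It splits the sample message on carriage returns, then on |, then on ^, and retrieves a few fields while ignoring the rest.

```python
# Part of the sample ORU message; segments are separated by carriage returns.
MESSAGE = "\r".join([
    r"MSH|^~\&|LAB||CDB||||ORU^R01|K172|P",
    "PID|||PATID1234^5^M11||Jones^William||19610613|M",
    "OBR||||80004^Electrolytes",
    "OBX|1|ST|84295^Na||150|mmol/l|136-148|Above high normal|||Final results",
])


def split_segments(message):
    """Map each three-character segment label to a list of field lists.

    Hypothetical helper for this sketch, not a Complex Data Exchange API.
    """
    segments = {}
    for line in message.split("\r"):      # first level: one segment per line
        fields = line.split("|")          # second level: fields
        segments.setdefault(fields[0], []).append(fields)
    return segments


segments = split_segments(MESSAGE)

# The name follows the PID label by five | delimiters, so it is field index 5.
pid = segments["PID"][0]
last_name, first_name = pid[5].split("^")  # third level: sub-fields

# The message type is the part before ^ in the ORU^R01 field of MSH.
message_type = segments["MSH"][0][8].split("^")[0]

print(message_type, last_name, first_name)  # ORU Jones William
```

Counting delimiters, rather than characters, is what lets the same rule work on messages whose field values have different lengths.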
Creating a Project
First, you will create a project where Complex Data Exchange Studio can store your work.
The Eclipse workspace can contain any number of projects. You don't have to close or remove the Tutorial_1 project or any other projects that you already have.
To create a project:
1. On the Complex Data Exchange Studio menu, click File > New > Project. This opens a wizard where you can select the type of project.
2. Under the Complex Data Exchange node, select a Parser Project.
3. On the next page of the wizard, enter a project name, such as Tutorial_2. The software creates a folder and a *.cmw configuration file having this name.
4. On the following wizard page, enter a name for the Parser component. Since this parser will parse HL7 messages of type ORU, call it HL7_ORU_Parser.
5. On the next page, enter a name for the TGP script file that the wizard creates. A convenient name is Script_Tutorial_2.
6. On the next page, you can select an XSD schema, which defines the XML structure where the parser will store its output. We have provided a schema for you. In the Complex Data Exchange installation folder, the location is:
tutorials\Exercises\Files_For_Tutorial_2\HL7_tutorial.xsd
Browse to this file and click Open. The Studio copies the schema to the project folder.
7. On the next page, specify the example source type. In this exercise, it is a file, which we have provided. Select the File option.
8. The next page prompts you to browse to the example source file. The location is:
tutorials\Exercises\Files_For_Tutorial_2\hl7-obs.txt
The Studio copies the file to the project folder.
9. On the next page, select the encoding of the source document. In this exercise, the encoding is ASCII, which is the default.
10. You can skip the document preprocessors page. You do not need a document preprocessor in this project.
11. Select the format of the source document. In this project, the format is HL7. This causes the parser to assume that the message fields are separated by the HL7 delimiter hierarchy:
newline
|
Other symbols such as ^ and tab
For example, the parser might learn that a particular Content anchor starts at a count of three | delimiters after a Marker, and ends at the fourth | delimiter.
12. Review the summary page and click Finish.
13. The software creates the new project. It displays the project in the Complex Data Exchange Explorer. It opens the script that you have created, Script_Tutorial_2.tgp, in an IntelliScript editor.
In the IntelliScript, notice that the example_source and format properties have been assigned according to your specifications.
The example source should be displayed. If it is not, right-click the Parser component and click Open Example Source.

Using XSD Schemas in Complex Data Exchange
Complex Data Exchange data transformations require XSD schemas to define the structure of XML documents. Every parser, serializer, or mapper project requires at least one schema.
When you perform the tutorial exercises in this book, we provide the schemas that you need. For your own applications, you might already have the schemas, or you can create new ones.
For more information, see “Data Holders” in the Complex Data Exchange Studio User Guide.

Learning XSD
For an excellent introduction to the XSD schema syntax, see the tutorial on the W3Schools web site, http://www.w3schools.com. For definitive reference information, see the XML Schema standard at http://www.w3.org.

Editing Schemas
You can use any XSD editor to create and edit the schemas that you use with Complex Data Exchange. To select an editor, see “Complex Data Exchange Studio Preferences” in Using Complex Data Exchange Studio in Eclipse.
Defining the Anchors
Now it is time to define the data transformation anchors. If you study the HL7 message and the XML structure that should be produced, you can probably figure out which message fields need to be defined as anchors, and you can map the fields to their corresponding data holders. To save time, we have done this job for you.
You will start with the non-repeating portions of the document, which are the first three lines.
You need to define Marker anchors that identify the locations of fields in the source document, and Content anchors that identify the field values.
The most convenient Marker anchors are the segment labels, MSH, PID, and OBR. These labels identify portions of the document that have a well-defined structure. Within each segment, you need to identify the data fields to retrieve. These are the Content anchors. There are several Content anchors for each Marker.
In addition, you need to define the data holders for each Content anchor. The data holders are elements or attributes in the XML output, which is illustrated above.
We present the results in the following table:

Anchor             Anchor Type   Data Holder
MSH                Marker
ORU                Content       /Message/@type
K172               Content       /Message/@id
PID                Marker
PATID1234^5^M11    Content       /Message/*s/Patient/@id
Jones              Content       /Message/*s/Patient/*s/l_name
William            Content       /Message/*s/Patient/*s/f_name
19610613           Content       /Message/*s/Patient/*s/birth_date
M                  Content       /Message/*s/Patient/@gender
OBR                Marker
80004              Content       /Message/*s/Test_Type/@test_id
Electrolytes       Content       /Message/*s/Test_Type

Note the @ symbol in some of the XPath expressions, such as /Message/@type. The symbol means that type is an attribute, not an element.
To define the anchors:
1. Create the anchors in the parser definition, as you did in the preceding chapter. When you finish, the IntelliScript editor should appear as in the following illustration.
2. Now, you need to teach the parser how to process the OBX lines of the source document. There are several OBX lines, each having the same format. The parser should create output for each OBX line that it finds.
To do this, you will define an anchor called a RepeatingGroup. The anchor tells Complex Data Exchange to search for a repeated segment. Inside the RepeatingGroup, you will nest several Content anchors, which tell the parser how to parse each iteration of the segment.
In the IntelliScript pane, find Electrolytes, the last anchor that you defined. Immediately below the anchor, there is an empty node containing three dots (...). That is where you should define the new anchor.
Select the three dots and press the Enter key. This opens a drop-down list, which displays the names of the available anchors.
In the list, select the RepeatingGroup anchor. Alternatively, you can type the text RepeatingGroup in the box. After you type the first few letters, Complex Data Exchange Studio automatically completes the text.
Press the Enter key again to accept the new entry.
Now, you must configure the RepeatingGroup so it can identify the repeating segments. You will do this by assigning the separator property. You will specify that the segments are separated from each other by a Marker, which is the text OBX. In the IntelliScript pane, expand the RepeatingGroup. Find the line that defines the separator property. By default, the separator value is empty, which is symbolized by a ... symbol. Select the ... symbol, press Enter, and change the value to Marker. Press Enter again to accept the new value. The Marker value means that the repeating elements are separated by a Marker anchor. Expand the Marker property, and find its text property. Select the value, which is empty by default, and press Enter. Type the value OBX, and press Enter again. This means that the separator is the Marker anchor OBX. In the example pane, Complex Data Exchange Studio highlights all the OBX anchors, signifying that it found them correctly.
Now, you will insert the Content anchors, which parse an individual OBX line. To do this, keep the RepeatingGroup selected. You must nest the Content anchors within the RepeatingGroup. Define the anchors only on the first OBX line. Because the anchors are nested in a RepeatingGroup, the parser looks for the same anchors in additional OBX lines.
Define the Content anchors as follows:
Anchor               Anchor Type   Data Holder
1                    Content       /Message/*s/Result/@num
Na                   Content       /Message/*s/Result/*s/type
150                  Content       /Message/*s/Result/*s/value
136-148              Content       /Message/*s/Result/*s/range
Above high normal    Content       /Message/*s/Result/*s/comment
Final results        Content       /Message/*s/Result/*s/status
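To see what the RepeatingGroup accomplishes, it can help to imitate it in ordinary code. The following Python sketch is an analogy only. It does not use any Complex Data Exchange API, and the function name results_from_obx is an assumption made for this example. It applies the same field mapping to every OBX line and creates one Result element per line, as in the table above.

```python
import xml.etree.ElementTree as ET

# The four OBX lines of the example source.
OBX_LINES = [
    "OBX|1|ST|84295^Na||150|mmol/l|136-148|Above high normal|||Final results",
    "OBX|2|ST|84132^K+||4.5|mmol/l|3.5-5|Normal|||Final results",
    "OBX|3|ST|82435^Cl||102|mmol/l|94-105|Normal|||Final results",
    "OBX|4|ST|82374^CO2||27|mmol/l|24-31|Normal|||Final results",
]


def results_from_obx(lines):
    """Create one Result element for each line that starts with OBX.

    Hypothetical helper for this sketch; the field indexes follow the
    anchor table (num, type, value, range, comment, status).
    """
    message = ET.Element("Message")
    for line in lines:
        fields = line.split("|")
        if fields[0] != "OBX":            # the separator is the text OBX
            continue
        result = ET.SubElement(message, "Result", num=fields[1])
        ET.SubElement(result, "type").text = fields[3].split("^")[1]
        ET.SubElement(result, "value").text = fields[5]
        ET.SubElement(result, "range").text = fields[7]
        ET.SubElement(result, "comment").text = fields[8]
        ET.SubElement(result, "status").text = fields[11]
    return message


message = results_from_obx(OBX_LINES)
print(ET.tostring(message, encoding="unicode"))
```

The mapping is written once and reused for each iteration, which is the same reason that you define the Content anchors only on the first OBX line.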
Figure 3-1. Nesting Anchors in a RepeatingGroup
When you finish, the markup in the example pane should look like this:
Editing the IntelliScript
The procedure illustrated above is the general way to edit the IntelliScript.
To edit the IntelliScript:
1. Select the desired location for editing.
2. Press Enter. In most locations, you can also double-click instead of pressing Enter.
3. Choose or type a value.
4. Press Enter again to accept the edited value.
Testing the Parser
There are several ways to test the parser and confirm that it works correctly:
♦ You can view the color coding in the example source. This tests the basic anchor configuration.
♦ You can run the parser, confirm that the events are error-free, and view the XML output. This tests the parser operation on the example source.
♦ You can run the parser on additional source documents. This confirms that the parser can process variations of the source structure that occur in the documents.
In this exercise, you will use the first two methods to test the parser. We will not take the time for the third test, although it is easy enough to do. For more information, see “Running the Parser on Additional Source Documents” on page 26.
To test the parser:
1. On the menu, click IntelliScript > Mark Example. Alternatively, you can click the button near the right end of the toolbar, which is labeled Mark the Entire Example According to the Current Script.
Notice that the color-coding is extended throughout the example source document. Previously, only the anchors that you defined in the first line of the repeating group were highlighted. When you ran the Mark Example command, Complex Data Exchange ran the parser and found the anchors in the other lines.
Confirm that the marking is as you expect. For example, check that the test value, range, and comment are correctly identified in each OBX line.
If the marking is not correct, or if there is no marking, there is a mistake in the parser configuration. Review the instructions, correct the mistake, and try again.
2. As an experiment, you can test what would happen if you made a deliberate error in the configuration. Do you remember the HL7 delimiters option, which you set in the New Parser wizard? Try changing the option to another value:
♦ Save your work. This can be helpful if you make a serious error during this experiment.
♦ Edit the delimiters property. It's located under Parser/format.
♦ Change the value from HL7 to Positional. This means that the Content anchors are located at fixed positions, measured by numbers of characters, after the Marker anchors.
♦ Try the Mark Example command again. The anchors are incorrectly identified. For example, in the second OBX line, the comment is reported as rmal|||Final resu instead of Normal. The problem occurs because the parser thinks that the anchors are located at fixed positions on the line.
♦ Change the delimiters property back to HL7. Alternatively, click Edit > Undo to restore the previous configuration. Confirm that Mark Example now works correctly.
♦ Save your work again.
3. Now, it is time to run the parser. Right-click the Parser component and set it as the startup component. Then use the Run > Run command to run it.
4. After a few moments, the Events view appears. Notice that most of the events are labeled with the information event icon. This is normal when the parser contains no mistakes.
If you search the event tree, you can find an event that is labeled with an optional failure icon. The event is located in the tree under Execution/RepeatingGroup, and it is labeled Separator before 5. This means that the RepeatingGroup failed to find a fifth iteration of the OBX separator. This is expected because the example source contains only four iterations. The failure is called optional because it is permitted for the separator to be missing at the end of the iterations.
Nested within the optional failure event, you can find a failure event icon. This event means that Complex Data Exchange failed to find the Marker anchor, which defines the OBX separator. Because the failure is nested within an optional failure, it is not a cause for concern.
In general, however, you should pay attention to a failure event and make sure you understand what caused it. Often, a failure indicates a problem in the parser.
In addition to failure events, pay attention to warning event icons and to fatal event icons. Warnings are less severe than failures. Fatal errors are the most severe. They stop the data transformation from running.
5. In the right pane of the Events view, try double-clicking one of the Marker or Content events. When you do this, Complex Data Exchange highlights the anchor that caused the event in the IntelliScript and example panes. This is a good way to find the source of failure or error events.
6. In the Complex Data Exchange Explorer, double-click the output.xml file, located under the Tutorial_2 files/Results node. You should see the following XML:
Points to Remember
To create a new project, run the File > New > Project command. This displays a wizard, where you can set options such as:
♦ The parser name
♦ The XSD schema for the output XML
♦ The example source document, such as a file
♦ The source format, such as text or binary
♦ The delimiters that separate the data fields
After you create the project, edit the IntelliScript and add the anchors, such as Marker and Content for simple data structures, or RepeatingGroup for repetitive structures.
To edit the IntelliScript, use the Select-Enter-Assign-Enter approach. That is, select the location that you want to edit. Press the Enter key. Assign the property value, and press Enter again.
Click IntelliScript > Mark Example to color-code the markers. Click Run > Run to execute the parser. View the results file, which contains the output XML.
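The delimiter experiment in this chapter, where the comment in the second OBX line was reported as rmal|||Final resu, can be reproduced with a few lines of Python. This is only a sketch of the idea and does not show how Complex Data Exchange is implemented. The function names positional and delimited are assumptions made for this example.

```python
OBX1 = "OBX|1|ST|84295^Na||150|mmol/l|136-148|Above high normal|||Final results"
OBX2 = "OBX|2|ST|84132^K+||4.5|mmol/l|3.5-5|Normal|||Final results"

# Positional strategy: learn the character offsets of the comment
# from the first OBX line, then reuse them on every other line.
COMMENT = "Above high normal"
START = OBX1.index(COMMENT)
END = START + len(COMMENT)


def positional(line):
    """Retrieve the comment by fixed character positions."""
    return line[START:END]


def delimited(line):
    """Retrieve the comment by counting | delimiters."""
    return line.split("|")[8]


print(positional(OBX2))  # rmal|||Final resu
print(delimited(OBX2))   # Normal
```

Both functions agree on the first line, where the offsets were learned. They disagree on the second line because the field values there have different lengths, so the comment starts at a different character position.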
Chapter 4 Positional Parsing of a PDF Document
This chapter includes the following topics:
♦ Overview
♦ Creating the Project
♦ Defining the Anchors
♦ Defining the Nested Repeating Groups
♦ Using an Action to Compute Subtotals
♦ Points to Remember
Overview
In many parsing applications, the source documents have a fixed page layout. This is true, for example, of bills, invoices, and account statements. In such cases, you can configure a parser that uses a positional format to find the data fields.
In this chapter, you will use a positional strategy to parse an invoice form. You will define the Content anchors according to their character offsets from the Marker anchors.
In addition to the positional strategy, the exercise illustrates several other important Complex Data Exchange features:
♦ The source document is a PDF file. The parser uses a document processor to convert the document from the binary PDF format to a text format, which is suitable for further processing.
♦ The data is organized in nested repeating groups.
♦ The document contains a large quantity of irrelevant data that is not required for parsing. Some of the irrelevant data contains the same marker strings as the desired data. The exercise introduces the concept of search scope, which you can use to narrow the search for anchors and identify the desired data reliably.
♦ The parser uses actions to compute subtotals, which are not present in the source document.
♦ To configure the parser, you will use both the basic properties and the advanced properties of components. The advanced properties are hidden but can be displayed on demand.
In this exercise, you will solve a complex, real-life parsing problem. Have patience as you do the exercise. Do not be afraid to delete or undo your work if you make a mistake. With a little practice, you will be able to create parsers like this easily.

Requirements Analysis
Before you start to configure a Complex Data Exchange project, examine the source document and the desired XML output, and analyze what the data transformation needs to do.

Source Document
To view the PDF source document, you need the Adobe Reader, which you can download from http://www.adobe.com. In the Adobe Reader, open the file in the Complex Data Exchange installation folder:
tutorials\Exercises\Files_for_Tutorial_3\Invoice.pdf
You need the Reader only for viewing the document. Complex Data Exchange has a built-in component for processing PDF documents and does not require any additional PDF software.
The document is an invoice that a fictitious egg-and-dairy wholesaler, called Orshava Farms, sends to its customers. The first page of the invoice displays data such as:
♦ The customer's name, address, and account number
♦ The invoice date
♦ A summary of the current charges
♦ The total amount due
The top and bottom of the first page display boilerplate text and advertising.
The second page displays the itemized charges for each buyer. The page has a nested repeating structure:
♦ The main section is repeated for each buyer.
♦ Within the section for each buyer, there is a two-line structure, followed by a blank space, for each purchase transaction.
At the top of the second page, there is a page header. At the bottom, there is additional boilerplate text.
The sample document lists two buyers, each of whom made multiple purchases.
This structure is typical of many invoices: a summary page, followed by repeating structures for different account numbers and credit card numbers.
A business might store such invoices as PDF files instead of saving paper copies. It might use the PDF invoices for online billing by email or via a web site.
Your task, in designing the parser, is to retrieve the required data while ignoring the boilerplate. Since you are doing this, presumably for a system integration application, you must do it with a very high degree of reliability.

XML Output
For the purpose of this exercise, we assume that you want to retrieve the transaction data from the invoice. You need to store the data in an XML structure, which looks like this:
<Invoice account="12345">
  <Period_Ending>April 30, 2003</Period_Ending>
  <Current_Total>351.01</Current_Total>
  <Balance_Due>457.07</Balance_Due>
  <Buyer name="Molly" total="217.64">
    <Transaction date="Apr 02" ref="22498">
      <Product>large eggs</Product>
      <Total>29.07</Total>
    </Transaction>
    <Transaction date="Apr 08" ref="22536">
      <Product>large eggs</Product>
      <Total>58.14</Total>
    </Transaction>
    <!-- Additional transaction elements -->
  </Buyer>
  <Buyer name="Jack" total="133.37">
    <!-- Transaction elements -->
  </Buyer>
</Invoice>
The structure contains multiple Buyer elements, and each Buyer element contains multiple Transaction elements. Each Transaction element contains selected data about a transaction: the date, reference number, product, and total price.
The structure omits other data about a transaction, which we choose to ignore. For example, the structure omits the discount, the quantity of each product, and the unit price.
Each Buyer element has a total attribute, which is the total of the buyer's purchases. The total per buyer is not recorded in the invoice. We require that Complex Data Exchange compute it.

The Parsing Problem
Try opening the Invoice.pdf file in Notepad. You will see something like this:
Parsing this binary data might be possible, but it would clearly be very difficult. We would need a detailed understanding of the internal PDF file format, and we would need to work very hard to identify the Marker and Content anchors unambiguously.
If we extract the text content of the document, the parsing problem seems more tractable:
Apr 08  22536  large eggs                    61.20   3.06   58.14
               60 dozen @ 1.02 per dozen
Apr 08  22536  cheddar cheese                45.90   2.30   43.61
               30 lbs. @ 1.53 per lb.
The transaction data is aligned in columns. The position of each data field is fixed relative to the left and right margins. This is a perfect case for positional parsing: extracting the content according to its position on the page. That is why the positional format is appropriate for this exercise.
Another feature is that each transaction is recorded in a fixed pattern of lines. The first line contains the data that we wish to retrieve. The second line contains the quantity of a product and the unit price, which we do not need to retrieve. The following line is blank. We can use the repeating line structure to help parse the data.
A third feature is that the group of transactions is preceded by a heading row, such as:
Purchases by: Molly
The heading contains the buyer's name, which we need to retrieve. The heading also serves as a separator between groups of transactions.
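The idea of retrieving data by its position on the line can be sketched in Python. This is an illustration only. The column offsets in COLUMNS and the helper make_line are assumptions invented for this example, because the exact spacing that the document processor produces is not reproduced here.

```python
# Hypothetical column layout: (start, end) character offsets on the line.
# These offsets are assumptions for this sketch, not values from the invoice.
COLUMNS = {
    "date": (0, 6),
    "ref": (7, 12),
    "product": (13, 33),
    "total": (49, 57),
}


def make_line(date, ref, product, gross, discount, total):
    """Build a fixed-width line that matches the assumed layout."""
    return f"{date:<6} {ref:<5} {product:<20}{gross:>8}{discount:>8}{total:>8}"


def parse_line(line):
    """Retrieve each field by its fixed position and trim the padding."""
    return {name: line[a:b].strip() for name, (a, b) in COLUMNS.items()}


lines = [
    make_line("Apr 08", "22536", "large eggs", "61.20", "3.06", "58.14"),
    make_line("Apr 08", "22536", "cheddar cheese", "45.90", "2.30", "43.61"),
]
rows = [parse_line(line) for line in lines]
for row in rows:
    print(row)
```

Note that the gross amount and the discount are simply never sliced out. Ignoring a column costs nothing in a positional strategy, which suits the selective retrieval that this exercise requires.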
Creating the Project
Now that you understand the parsing requirements, you are ready to configure the Complex Data Exchange project.
To create the project:
1. Use the File > New > Project command to create a parser project called Tutorial_3.
2. On the first few pages of the wizard, set the following options. The options are similar to the ones that you set in the preceding tutorial. For more information, see “Defining an HL7 Parser” on page 31.
♦ Name the parser PdfInvoiceParser.
♦ Name the script file Pdf_ScriptFile.
♦ When prompted for the schema, browse to the file OrshavaInvoice.xsd, which is in the following folder: tutorials\Exercises\Files_For_Tutorial_3.
♦ Specify that the example source is a file on a local computer.
♦ Browse to the example source file, which is Invoice.pdf.
♦ Specify that the source content type is PDF.
3. When you reach the document processor page of the wizard, select the PDF to Unicode (UTF-8) processor. This processor converts the binary PDF format to the text format that the parser requires. The processor inserts spaces and line breaks in the text file, in an attempt to duplicate the format of the PDF file as closely as possible.
Note: PDF to Unicode (UTF-8) is a description of the processor. In the IntelliScript, you can view the actual name of the processor, which is PdfToTxt_3_01.
4. On the next wizard page, select the document format, which is CustomFormat. You will change this value later, when you edit the IntelliScript.
5. On the final page, click Finish.
6. After a few seconds, the Complex Data Exchange Explorer displays the Tutorial_3 project, and the script file is opened in an IntelliScript editor.
7. Notice that the example pane displays the example source in text format, which is the output of the document processor. You will configure anchors that process the text.
Try clicking the Browser tab at the bottom of the example pane. This tab displays the original document in a Microsoft Internet Explorer window, which uses the Adobe Reader as a plug-in component to display the file. The Browser tab is for display only. Return to the Source tab to configure the anchors.
8. Now that the project and parser have been created, you can do some fine tuning by editing the IntelliScript. Expand the format property of the parser, and change its value from CustomFormat to TextFormat. This is appropriate because the parser will process the text form of the document, which is the output of the document processor.
9. Under the format property, change delimiters to Positional. This means that the parser learns the structure of the example source by counting the characters between the anchors.
Under TextSearch. The option means that the parser looks only for the string ACCOUNT NO:. Define the string as a Marker anchor by selecting it. Defining the Anchors 55 . and you are prompted to enter the search property setting. On the menu. Account No:. Click the >> symbol at the end of TextSearch to display its advanced properties. and not account no:. For more information about the advanced properties. confirm that the text string is ACCOUNT NO:. which marks the beginning of the text that you need to parse.Defining the Anchors Now. In fact. which is the default. The Marker is automatically displayed in the IntelliScript. after the Marker. To define the anchors: 1. You will use the parse-by-example approach to do this. and choosing Insert Marker. click Insert Offset Content. the spelling Account No: occurs in the header of the second page. right-clicking. Although it is not strictly necessary. Select the value TextSearch. select and rightclick 12345. or any other spelling. Continuing on the same line of the source document. see “Basic and Advanced Properties” on page 66. 2. This can help prevent problems if one of the other spellings happens to occur in the document. find the string ACCOUNT NO:. Select the match case option under Marker. you will define the anchors. we suggest that you select the match case option for all the Marker anchors that you define in this exercise. Near the top of the example source.
4. 3. What will happen if a source document has an account number such as 987654321. which is longer than the number in the example source? According to the current definition of the closing_marker. the parser will retrieve only the first five digits. with property values that are appropriate for positional parsing. The properties are as follows: Property opening_marker = OffsetSearch(1) closing_marker = OffsetSearch(5) Description This means that the text 12345 starts 1 character after the Marker anchor.This inserts a Content anchor. 5. Do this by changing the value of closing_marker from OffsetSearch to NewlineSearch. 56 Chapter 4: Positional Parsing of a PDF Document . Edit the data_holder property of the Content anchor. change the closing_marker to retrieve not only until character 5. Set its value to /Invoice/ This is the data holder where the Content anchor will store the text that it retrieves from the source document. but until the end of the text line. This means that the text 12345 ends 5 characters after its start. There is a small problem in the definition of the Content anchor. To solve this potential problem. @account. Continuing to the next line of the example source. define PERIOD ENDING: as a Marker anchor. truncating the value to 98765.
6. Define April 30, 2003 as a Content anchor with the offset properties. Map the Content to the /Invoice/*s/Period_Ending data holder. As above, change closing_marker to NewlineSearch, to support source documents where the date string is longer than in the example source.
7. Scroll a few lines down, and define CURRENT INVOICE as a Marker.
8. The next data that you need to retrieve is the current invoice amount. At the end of the same line, define 351.01 as an offset Content anchor. Map the Content to the /Invoice/*s/Current_Total data holder. Change the closing_marker to NewlineSearch, in case the number is located to the right of its position in the example source.
When you select the Content, include a few space characters to the left of the string 351.01. In positional parsing, this is important because the area to the left might contain additional characters. For example, the content might be 1351.01.
9. In the same way, define BALANCE DUE as a Marker and 475.07 as Content. Map the Content to /Invoice/*s/Balance_Due.
10. Examine the IntelliScript, and confirm that the sequence of Marker and offset-Content anchors is correct. It should look like this:
The precise offsets in your solution might differ from the ones in the illustration. The offsets depend on the number of spaces that you selected to the left or right of each anchor.
11. Examine the color coding in the example pane. It should look like this:
12. Right-click and define PdfInvoiceParser as the startup component.
13. Run the parser and view the results file. The file should look like this:
14. You are now almost ready to define the repeating group, which creates the Buyer elements of the XML output. The repeating group starts at the Purchases by: line, which is on page 2 of the invoice. You will define Purchases by: as the separator of the repeating group. Afterwards, you will define a nested repeating group, which creates the Transaction elements.
There is a catch, however. The string Purchases by: appears also on the bottom of page 1. To prevent this from causing a problem, you need to start the search scope for the repeating group at the beginning of page 2.
To do this, define the string Page 2 as a Marker anchor. By default, the parser assumes that the anchors are defined in the order that they appear in the document. Therefore, it starts looking for the subsequent repeating group from the end of this Marker.

More About Search Scope
The Marker for Page 2 is used to establish the search scope of the subsequent repeating group. Every Marker anchor is used to establish the search scope of other anchors. You have already seen many examples of how this works as you performed the preceding exercises.
For example, suppose that a Content anchor is located between two Marker anchors. Complex Data Exchange finds the Marker anchors in an initial pass through the document, called the initial phase. In a second pass, called the main phase, the software searches for the Content between the Marker anchors.
There are many ways to refine the search scope and the way in which the software searches for anchors. For example, you can configure Complex Data Exchange to:
♦ Search for particular anchors in the initial, main, or final phase of the processing
♦ Search backwards for anchors
♦ Permit anchors to overlap one another
♦ Find anchors according to search criteria such as a specified text string, a pattern, which is a regular expression, or a specified data type
Adjusting the search scope, phase, and criteria is one of the most powerful features that Complex Data Exchange offers for parsing complex documents. For complete information, see “Anchors” in the Complex Data Exchange Studio User Guide.
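The effect of search scope, and the difference between a fixed-length offset and a search to the end of the line, can be sketched in Python. This is an analogy only and does not show how Complex Data Exchange works internally. The sample text, including the boilerplate line on page 1, and all of the function names are assumptions invented for this example.

```python
# Invented stand-in for the text output of the document processor.
TEXT = "\n".join([
    "ACCOUNT NO: 987654321",
    "PERIOD ENDING: April 30, 2003",
    "Purchases by: all buyers are itemized overleaf",  # page 1 boilerplate
    "Page 2",
    "Account No: 987654321",
    "Purchases by: Molly",
    "Purchases by: Jack",
])


def find_marker(text, marker, start=0):
    """Case-sensitive text search. Returns the position just past the marker."""
    return text.index(marker, start) + len(marker)


def content_fixed(text, pos, offset=1, length=5):
    """Content that starts `offset` characters after pos and has a fixed length."""
    return text[pos + offset:pos + offset + length]


def content_to_newline(text, pos, offset=1):
    """Content that starts `offset` characters after pos and runs to the line end."""
    end = text.find("\n", pos)
    if end == -1:
        end = len(text)
    return text[pos + offset:end]


account_pos = find_marker(TEXT, "ACCOUNT NO:")
truncated = content_fixed(TEXT, account_pos)      # fixed length of 5
account = content_to_newline(TEXT, account_pos)   # until the end of the line

# Without a scope, the search stops at the boilerplate on page 1.
unscoped = content_to_newline(TEXT, find_marker(TEXT, "Purchases by:"))

# With the Page 2 marker, the search starts after the page header.
scope = find_marker(TEXT, "Page 2", account_pos)
first_buyer = content_to_newline(TEXT, find_marker(TEXT, "Purchases by:", scope))

print(truncated, account, first_buyer)  # 98765 987654321 Molly
```

Each search begins where the previous marker ended, which mirrors the default assumption that anchors are defined in the order that they appear in the document.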
Defining the Nested Repeating Groups
Now, you need to define a RepeatingGroup anchor that parses the repeating Purchases by: structure. Within it, you will define a nested RepeatingGroup that parses the individual transactions.
Do you remember how to define a repeating group? For more information about the procedure, see “Defining an HL7 Parser” on page 31.
To define the repeating groups:
1. At the end of the anchor list, edit the IntelliScript to insert a RepeatingGroup.
2. Define the separator property of the RepeatingGroup as a Marker, which performs a TextSearch for the string Purchases by:. Set the separator_position to before. This is appropriate because the separator is located before each iteration of the repeating group. Complex Data Exchange highlights the Purchases by: strings in the example pane.
3. Within the RepeatingGroup, insert an offset Content anchor that maps the buyer's name, Molly, to the data holder /Invoice/*s/Buyer/@name. Do not forget to change the closing_marker to NewlineSearch, in case the buyer's name contains more than five letters.
When you define an offset Content anchor, you should confirm that it supports strings that are longer than the ones in the example source. We won't remind you about this any more.
4. On the menu, click IntelliScript > Mark Example. Check that both buyer names, Molly and Jack, are highlighted. This confirms that the parser identifies the iterations of the RepeatingGroup correctly.
If you want, you can also run the parser. Confirm that the results file contains two Buyer elements.
5. Find the three-dots symbol that is nested within the RepeatingGroup, and define another RepeatingGroup.
6. What can we use as the separator of the nested RepeatingGroup? There does not seem to be any characteristic text between the transactions. Instead of a text separator, we can use the fact that every transaction occupies exactly four lines of the document. We can use the sequence of four newline characters as a separator.
To do this, set the separator property of the nested RepeatingGroup to a Marker that uses a NewlineSearch. Click the >> symbol at the end of the Marker to display its advanced properties, and set count = 4. You can then click the << symbol to hide the advanced properties.
There is no separator before the first transaction or after the last transaction. To reflect this situation, set separator_position = between.
If you examine the original PDF document carefully, you will discover that each transaction occupies only three lines. But we are parsing the output of the document processor. For some reason, undoubtedly related to the binary structure of the PDF file, the processor inserts an extra line.
7. In the first line of the first transaction, which is Molly's first purchase, define the following offset Content anchors:
Content       Data Holder
Apr 02        /Invoice/*s/Buyer/*s/Transaction/@date
22498         /Invoice/*s/Buyer/*s/Transaction/@ref
large eggs    /Invoice/*s/Buyer/*s/Transaction/*s/Product
29.07         /Invoice/*s/Buyer/*s/Transaction/*s/Total
When you are done, the example pane should be color-coded like this. The exact position of the colors depends on the number of spaces that you selected to the left and right of each field.
The IntelliScript should display the four Content anchors within the nested RepeatingGroup.
8. Click IntelliScript > Mark Example again. Confirm that the date, reference, product, and total columns are highlighted in all the transactions.
9. Do you remember how we restricted the search scope of the outer RepeatingGroup by defining Page 2 as a Marker? It is a good idea to define a Marker at the end of the nested RepeatingGroup, too, to make sure that Complex Data Exchange does not search too far.
To do this, return to the top-level anchor list, which is the level at which the outer RepeatingGroup is defined, and define some of the boilerplate text in the footer of page 2 as a Marker. For example, you can define the string For your convenience:.
10. Run the parser. You should get the following output:
11. Collapse the tree and confirm that the two Buyer elements contain 5 and 3 transactions, respectively. This is the number of transactions in the example source.
Note: On Windows XP SP2 or higher, Internet Explorer might display a yellow information bar notifying you that it has blocked active content in the file. If this occurs, clicking the - and + icons fails to collapse or expand the tree. You can unblock the active content, and then collapse the tree.
As usual, examine the event log for any problems. Remember that events marked with the optional failure icon or the failure icon are normal at the end of a RepeatingGroup. For more information, see “Testing the Parser” on page 42.

Basic and Advanced Properties
When you configured the separator of the nested RepeatingGroup, we told you to click the >> symbol, which displays the advanced properties.
Many Complex Data Exchange components have both basic properties and advanced properties. The basic properties are the ones that you need to use most frequently, so they are displayed by default. The advanced properties are needed less often, so Complex Data Exchange hides them. When you click the >> symbol, they are displayed in gray.
Figure 4-1. Basic Properties
Figure 4-2. Basic and Advanced Properties
When you do this, the >> symbol changes to <<. Click the << symbol to hide the advanced properties.
If you assign a non-default value to an advanced property, it turns black. The IntelliScript displays it like a basic property.
The distinction between the basic and advanced properties is only in the display. The advanced properties are not harder to understand or more difficult to use. Feel free to use them as needed.
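The count = 4 setting on the NewlineSearch separator can be pictured as grouping the processor output four lines at a time. The following Python sketch is only an analogy under stated assumptions: the sample text, its line layout, and the second transaction are invented for illustration, and the function name split_transactions is hypothetical.

```python
# Hypothetical processor output: three data lines plus one extra line per transaction.
SAMPLE = (
    "Apr 02   22498\n"
    "large eggs\n"
    "29.07\n"
    "\n"
    "Apr 08   22536\n"
    "bread\n"
    "3.50\n"
)


def split_transactions(processor_output, lines_per_transaction=4):
    """Group the text into iterations, treating every 4th newline as the separator."""
    lines = processor_output.splitlines()
    groups = [
        lines[i:i + lines_per_transaction]
        for i in range(0, len(lines), lines_per_transaction)
    ]
    # Drop the blank line that pads each transaction to four lines.
    return [[line.strip() for line in group if line.strip()] for group in groups]


transactions = split_transactions(SAMPLE)
```

Because the separator sits between iterations, the last group in the sketch is allowed to end without a trailing separator, which mirrors separator_position = between.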
Using an Action to Compute Subtotals
The exercise is complete except for one feature. You need to compute the subtotal for each buyer, and insert the result in the total attribute of the Buyer elements. The desired output is:
<Buyer name="Molly" total="217.64">
You can compute the subtotals in the following way:
♦ Before processing the transactions, initialize the total attribute to 0. You will use a SetValue action to perform the initialization step.
♦ After the parser processes each transaction, add the amount of the transaction to the total. You will use a CalculateValue action to perform the addition.
To compute the subtotals:
1. Expand the IntelliScript to display the nested RepeatingGroup.
2. Right-click the nested RepeatingGroup. On the pop-up menu, click Insert. This lets you insert a new component above the RepeatingGroup.
3. Insert the SetValue action, and set its properties as follows:
quote = 0.00
data_holder = /Invoice/*s/Buyer/@total
The SetValue action assigns the quote value to the data_holder.
4. Now, expand the nested RepeatingGroup. At the three dots (...) under the fourth Content anchor, enter a CalculateValue action.
5. Set the properties of the CalculateValue action as follows:
params = /Invoice/*s/Buyer/@total
         /Invoice/*s/Buyer/*s/Transaction/*s/Total
result = /Invoice/*s/Buyer/@total
expression = $1 + $2
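The combination of the two actions amounts to a running sum. As a rough Python analogy (not the product's implementation), the SetValue action corresponds to the initialization and the CalculateValue action corresponds to the addition inside the loop. The function name buyer_subtotal and the sample amounts in the usage line are hypothetical.

```python
from decimal import Decimal


def buyer_subtotal(transaction_totals):
    """Mimic SetValue followed by one CalculateValue per transaction."""
    # SetValue: quote = 0.00 is assigned to the Buyer/@total data holder.
    total = Decimal("0.00")
    for amount in transaction_totals:
        # CalculateValue: expression = $1 + $2, where $1 is the current
        # Buyer/@total and $2 is the Transaction/Total just parsed.
        # The result is written back to Buyer/@total.
        total = total + Decimal(amount)
    return total


# Hypothetical amounts, used only to exercise the sketch.
example_total = buyer_subtotal(["29.07", "10.50"])
```

The initialization is placed above the nested RepeatingGroup so that it runs once per buyer, and the addition is placed inside it so that it runs once per transaction.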
Actions are components that perform operations on data. Complex Data Exchange provides numerous action components that perform operations such as:
♦ Computing values
♦ Concatenating strings
♦ Testing a condition for continued processing
♦ Running a secondary parser
♦ Running a database query
For more information, see “Actions” in the Complex Data Exchange Studio User Guide.

Potential Enhancement: Handling Page Breaks
So far, we have assumed that the repeating groups have a perfectly regular structure. But what happens if the transactions run over to another page? The new page would contain a page header, which would interrupt the repeating structure.
One way to support page headers might be to redefine the separator between the transactions. For example, you might use an Alternatives anchor as the separator. This lets you define multiple separators that might occur between the transactions.
We won't give you an exercise to perform this enhancement, but we encourage you to experiment with the features yourself.
Points to Remember
Here are useful hints that arise out of this tutorial:
♦ To parse a binary format such as a PDF file, you can use a document processor that converts the source to text.
♦ The positional strategy is useful when the content is laid out at a fixed number of characters, called an offset, from the margins or from Marker anchors.
♦ When you define anchors, think about the possible variations that might occur in source documents. If you define the content positionally, adjust the offsets in case a source document contains longer data than the example source.
♦ The search scope is the segment of a document where Complex Data Exchange searches for an anchor. The function of Marker anchors is to define the search scope for other anchors.
♦ Repeating groups can be nested. You can use this feature to parse nested iterative structures.
♦ Newline characters are useful as markers and separators. You can set the count of newlines, which specifies how many lines to skip.
♦ In the IntelliScript, you can click the >> symbol next to a component to display its advanced properties. Advanced properties are used less frequently than basic properties, so they are hidden to avoid cluttering the display.
♦ You can use actions to perform operations on the data, such as computing totals.
Chapter 5
Parsing Word and HTML Documents
This chapter includes the following topics:
♦ Overview
♦ Requirements Analysis
♦ Creating the Project
♦ Defining a Variable
♦ Parsing the Name and Address
♦ Parsing the Optional Currency Line
♦ Parsing the Order Table
♦ Using Transformers to Modify the Output
♦ Testing the Parser on Another Source Document
♦ Points to Remember
Overview
Many real-life documents have less structure than the ones we have described in the previous chapters. The documents might have:
♦ Few labels and delimiters to help a parser identify the data fields
♦ Wrapped lines and flexible layout, making positional parsing difficult
♦ Repeated keywords, making it hard to define unambiguous Marker anchors
Microsoft Word documents are typical examples of these phenomena. Word documents are usually created by hand, although they can also be generated programmatically. Even if the authors create the documents from a fixed template, they might vary the layout from one document to another.
In such cases, you need to use the available markup or patterns to parse the documents. For a Microsoft Word document, you can expose the formatting markup by saving the document as HTML. A parser can use the HTML tags, which are essentially format labels, to interpret the document structure. For example, you can use paragraphs, tables, and italics to identify the data fields.
In this chapter, you will construct a parser for a Word document. You will learn techniques such as:
♦ Using the WordToHtml document processor, which converts a Word document to an HTML document.
♦ Taking advantage of the opening and closing tag structure to parse an HTML document.
♦ Identifying anchors unambiguously, in situations where the same text is repeated frequently throughout the document.
♦ Using transformers to modify the output of anchors.
♦ Using variables to store retrieved data for later use.
♦ Defining a global component, which you configure once and use multiple times in a project.
♦ Testing a parser on source documents other than the example source.

Scope of the Exercise
This exercise is the longest and most complex in this book. It is typical of real-life parsing scenarios. You will create a full-featured parser that can be used in a production application.
We suggest that you skim the chapter from beginning to end to understand how the parser is intended to work, and then try the exercise.
Feel free to experiment with the parser configuration. There is more than one possible solution to this exercise. For example, you can modify the anchor configuration to support numerous variations of the source-document structure. You can modify the XSD schema to restrict the data types of the elements, ensuring that the parser retrieves exactly the correct data from a complex document.
We have omitted most of these refinements because we want to keep the scope of the exercise within bounds. When you finish the exercise, you can learn much more about the techniques described here by reading the Complex Data Exchange Studio User Guide.

Prerequisites
In order to do this exercise, you need Microsoft Word 97 or higher on the same computer as Complex Data Exchange Studio.
If you do not have Word on the same computer, you can open the source document in Word on another computer and save it as HTML. You can then transfer the HTML document back to the Complex Data Exchange computer and parse it. The rest of the exercise is unchanged, except that you do not need to use the WordToHtml processor because the document is already HTML.
We presume that you are familiar with HTML code. You can get an introduction at http://www.w3.org or from any book on web-site development.
Requirements Analysis
Before you start configuring a parser, it is always a good idea to plan what the parser must do. In this section, we examine the source document structure, the required XML output, and the nature of the parsing problem.

Source Document
You can view the example source in Microsoft Word. The document is stored in:
tutorials\Exercises\Files_for_tutorial_4\Order.doc
The document is an order form used by a fictitious organization called The Tennis Book Club. The form contains:
♦ The name of the purchaser
♦ A one-line address
♦ An optional line specifying the currency such as $ or £
♦ A table of the items ordered
The format of the items in the table is not uniform. Although the first column heading is Books, some of the possible items in this column, such as a video cassette, are not books. These items might be formatted in a different font from the books.
In some of the table cells, the text wraps to a second line. Positional text parsing, using the techniques that you learned in the previous chapter, would be difficult for these cells.
The table cells do not contain any identifying labels. The parser needs to interpret the cells according to their location in the table.
The word Total appears three times in the table. In two of the instances, the word is in a column or row heading. In the third instance, Total appears by chance in the title of the video cassette. This is significant because we want to parse the total-price row of the table, and we need an unambiguous identifier for the row.
As a further complication, the Currency line is optional. Some of the source documents that we plan to parse are missing this line.

XML Output
For the purpose of this exercise, we assume that the output XML format is as follows:
<Order>
  <To>Becky Handler</To>
  <Address>18 Cross Court, Down-the-Line, PA</Address>
  <Books>
    <Book>
      <Title>Topspin Serves Made Easy, by Roland Fasthitter</Title>
      <Price>$11.50</Price>
      <Quantity>1</Quantity>
      <Total>$11.50</Total>
    </Book>
    <!-- Additional Book elements -->
  </Books>
  <Total>$46.19</Total>
</Order>
Notice that the prices and totals include the currency symbol ($). In the source document, the currency is stated on the optional currency line, and not in the table. We can assume that this is the case in all similar source documents.

The Parsing Problem

Processor Selection
Complex Data Exchange provides several document processors that can convert a Word document to a format suitable for parsing. Among these are:
♦ WordToHtml
♦ WordToRtf
♦ WordToTxt
♦ WordToXml
We choose to use the WordToHtml processor because:
♦ The HTML code is easy to read, making it easy to configure the parser.
♦ The HTML tags expose a wealth of formatting information that can help identify the data to retrieve.
The WordToTxt processor produces plain-text output. Because the text lacks formatting information, it would probably be much harder to parse reliably than the HTML.
The WordToXml and WordToRtf processors might also be good choices, but their output is more difficult to read than that of the WordToHtml processor.

HTML Structure
The WordToHtml processor uses the Save As HTML feature of Microsoft Word to generate the HTML. If you wish, you can save the example source document as HTML manually in Word. You can view the resulting HTML file in a browser, or you can open the file in Notepad and examine the HTML code.
The HTML is exceedingly verbose. It can run to many pages, even though the original Word document is less than one page. This occurs because Word saves the complete document features, including the style and format definitions. Most of this code is in the <head> element of the HTML document. Much of this information has no significant effect on the appearance of the HTML in a web browser, but Word uses the information if you later re-import the HTML to Word.
In a broad outline, the code looks like this:
<html>
<head>
  <!-- Very long header storing Word style definitions ... -->
</head>
<body>
  <h1>The Tennis Book Club<br>Order Form</h1>
  <p>Send to:</p>
  <p>Becky Handler</p>
  <p>18 Cross Court, Down-the-Line, PA</p>
  <table>
    <tr>
      <!-- Table column headers -->
    </tr>
    <tr>
      <td><i>Topspin Serves Made Easy</i>, by Roland Fasthitter</td>
      <td>$11.50</td>
      <td>1</td>
      <td>$11.50</td>
    </tr>
    <!-- Additional tr elements containing the other table rows -->
  </table>
</body>
</html>
You will use this basic HTML tag structure to parse the document.
The actual code might be considerably more complex than shown above. The following is a typical example, which is the first <td> element of the Topspin Serves row. The code might vary somewhat on your computer, depending on your Word version and configuration.
<td width=168 valign=top style='width:125.9pt;border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt'><p class=MsoBodyText><i style='mso-bidi-font-style:normal'>Topspin Serves Made Easy</i>, by Roland <span class=SpellE>Fasthitter</span></p></td>
As you begin to work with the Word-generated HTML, we trust that you will quickly learn to find the important structural tags, such as the <td>...</td> element. You can ignore the irrelevant code such as the lengthy attributes and <span> tags.
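To see why the structural tags matter more than the attributes, it can help to pull the cell text out of the verbose markup with a general-purpose HTML parser. The following Python sketch uses the standard html.parser module. It is an illustration only and is unrelated to how Complex Data Exchange processes the document. The class and function names are hypothetical, the style attribute is shortened, the second cell is invented, and nested tables are not handled.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect the visible text of each <td> element, ignoring attributes and inner tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive in lower case, so <TD> from older Word versions also matches.
        if tag == "td":
            self._in_cell = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            # Join the text fragments and collapse runs of whitespace.
            self.cells.append(" ".join("".join(self._parts).split()))
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._parts.append(data)


def extract_cells(html_text):
    parser = CellTextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.cells


# Shortened form of the Word-generated cell shown above, plus one invented cell.
SAMPLE = (
    "<td width=168 valign=top style='width:125.9pt;padding:0cm 5.4pt'>"
    "<p class=MsoBodyText><i style='mso-bidi-font-style:normal'>"
    "Topspin Serves Made Easy</i>, by Roland "
    "<span class=SpellE>Fasthitter</span></p></td>"
    "<td>11.50</td>"
)

cells = extract_cells(SAMPLE)
```

The lengthy attributes and the <span> and <i> tags never reach the result, because only the text between the opening and closing td tags is collected.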
Creating the Project
To create the project:
1. To start the exercise, create a project called Tutorial_4.
2. In the New Project wizard, select the following options:
♦ Name the parser MyHtmlParser.
♦ Name the script file Html_ScriptFile.
♦ Select the schema TennisBookClub.xsd, which is in the tutorials\Exercises\Files_For_Tutorial_4 folder.
♦ Specify that the example source is a file on a local computer.
♦ Browse to the example source file, which is Order.doc.
♦ Select the Microsoft Word encoding.
♦ Select the document processor Microsoft Word To HTML.
♦ Select the HTML format. This format component is pre-configured with the appropriate delimiters and other components for parsing HTML code.
♦ Click Finish to create the project.
The resulting IntelliScript has the following appearance:
Note: When the Studio opens the example source, it might display a Microsoft Internet Explorer prompt to open or save the file. This occurs because the Studio uses Internet Explorer to display a browser view of the file. Click Open. The behavior depends on your Internet Explorer security settings.
Defining a Variable
Do you remember the Currency line, near the top of the example source? We need to retrieve the currency symbol, but we do not want to map it to an output element or attribute. Instead, we want to store the currency symbol temporarily, and prefix it to the prices in the output.
To do this, we will define a temporary data holder called a variable. You can use a variable in exactly the same way as an element or attribute data holder, but it isn't included in the XML output.
To define the variable:
1. At the bottom of the IntelliScript, at the global level and not nested within the Parser component, select the three dots (...) and press Enter.
2. Type the variable name, varCurrency, and press Enter.
3. At the three dots on the right side of the equals sign (=), press Enter and select Variable.
The variable appears in the IntelliScript, in the Schema view, and in the Component view.
Figure 5-1. Variable Definition in the IntelliScript
Figure 5-2. Variable Displayed in the Schema View
Notice the location of the variable in the Schema view, nested within the www.Local-Project.com/Variables namespace. Later, you will access the variable in the Schema view.
Parsing the Name and Address
You will now define the anchors that parse the name and address, which are located at the beginning of the example source.
To parse the name and address:
1. In the example pane, scroll past the header to the body tag, about 80% of the way down. To scroll quickly, you can right-click in the example pane, select the Find option, and find the string <body.
2. Define <body as a Marker anchor. This causes the parser to skip the header completely, just as you did when you scrolled through the document.
Do not select the match_case property for this anchor, or for any other HTML tags that you define as anchors in this exercise. HTML is not case sensitive. Word 2000 and higher generate HTML tags in lower case, but Word 97 generates tags in upper case. If you select match case, the parser will not support Word 97.
3. Scroll down a few paragraphs, and define the string Send to: as a Marker anchor. The string signals the beginning of the content that you need to retrieve.
Here, if you prefer, you can select the match case option. This can be helpful to ensure that you match only the indicated spelling.
4. Now, you are ready to define the first Content anchor. You will configure the anchor to retrieve the data between specified starting and ending strings. To do this, right-click the string Becky Handler and insert a Content anchor. Make the following assignments in the IntelliScript:
Property          Value
opening_marker    Select TextSearch and type the string <p, with a starting < symbol but without the ending > symbol.
closing_marker    Select TextSearch and type the string </p>.
data_holder       Map the anchor to /Order/*s/To.
A Content anchor that is defined in this way is similar to a Marker Content Marker sequence, which you could also use.
When you finish these assignments, the example pane highlights <p and </p> as markers. It highlights Becky Handler as content.
Notice that only the string Becky Handler is highlighted in blue. The rest of the HTML code between the opening_marker and the closing_marker, such as class=MsoBodyText>, is not highlighted. This is because the HtmlFormat recognizes > as a delimiter character. It learns from the example source that the desired content is located after the > delimiter.
5. Move to the next paragraph of the source, and use the same technique, defining the opening_marker and the closing_marker, to map 18 Cross Court, Down-the-Line, PA to the /Order/*s/Address data holder.
6. Run the parser, and confirm that you get the following result:
Why the Output Contains Empty Elements
Notice that the parser inserts empty Books and Total elements. The elements are empty because we have not yet parsed the table of books.
That is because the XSD schema defines Books and Total as required elements. The parser inserts the elements to ensure that the XML is valid according to the schema. If necessary, you can change this behavior in the project properties.
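The way the name and address anchors work, with an opening marker of <p, a closing marker of </p>, and the content taken after the > delimiter, can be approximated in a few lines of Python. This is a simplified sketch and not the product's algorithm. The function name content_between and the sample HTML are assumptions, the search is made case-insensitive to mirror the advice about match_case, and the lower-casing trick assumes ASCII markup.

```python
def content_between(text, opening, closing, start=0):
    """Return (content, next_position) for the first opening...closing pair after `start`.

    The search ignores case. The content is whatever follows the first ">"
    delimiter inside the enclosed text. Returns (None, start) if a marker is missing.
    """
    lowered = text.lower()
    open_at = lowered.find(opening.lower(), start)
    if open_at == -1:
        return None, start
    body_start = open_at + len(opening)
    close_at = lowered.find(closing.lower(), body_start)
    if close_at == -1:
        return None, start
    enclosed = text[body_start:close_at]
    # Skip the remainder of the opening tag, such as " class=MsoBodyText>".
    _, delimiter, after = enclosed.partition(">")
    content = after if delimiter else enclosed
    return content, close_at + len(closing)


# Hypothetical, simplified version of the Word-generated HTML.
HTML = (
    "<HEAD>style definitions</HEAD><BODY>"
    "<p class=MsoNormal>Send to:</p>"
    "<p class=MsoBodyText>Becky Handler</p>"
    "<p class=MsoBodyText>18 Cross Court, Down-the-Line, PA</p>"
)

# Markers: skip the header, then move past "Send to:".
position = HTML.lower().find("<body") + len("<body")
position = HTML.lower().find("send to:", position) + len("send to:")

# Content anchors: name first, then address, each searched after the previous one.
name, position = content_between(HTML, "<p", "</p>", position)
address, position = content_between(HTML, "<p", "</p>", position)
```

Note that a plain text search for <p would also match other tags that begin with the same letters, so a production solution would need a stricter pattern than this sketch uses.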
Parsing the Optional Currency Line
If the source document contains a Currency line, the parser should process the line and store the result in the varCurrency variable, which you created above. If the Currency line is missing, the parser should ignore the omission and continue parsing the rest of the document.
You can implement this in the following way:
1. Define a Group anchor. The purpose of this anchor is to bind a set of nested anchors together, for processing as a unit.
2. Within the group, nest Marker and Content anchors that retrieve the currency.
3. Select the optional property of the Group. This means that if the Group fails due to the Marker and/or Content being missing, the parser continues running.
To parse the optional currency line:
1. Following the anchors that you have already defined, insert a Group anchor.
2. Select the three-dots that are nested within the Group. Define the text Currency as a Marker anchor.
3. Continuing within the Group, define $ as a Content anchor. Map the anchor to the varCurrency variable, which you defined above.
4. Click the >> symbol to display the advanced properties of the Group, and select the optional property.
Be sure you select the optional property of the Group, and not the optional property of the Marker or Content.
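The behavior of an optional Group can be sketched as a function that either returns the retrieved value and a new position, or fails as a unit and leaves the position unchanged. The Python below is an analogy only. The function name, the plain-text samples, and the assumed layout of the Currency line (the word Currency, optional colon and spaces, then a one-character symbol) are all hypothetical.

```python
import re

# Assumed layout after the marker: optional colon/whitespace, then one symbol.
_SYMBOL_AFTER_MARKER = re.compile(r"[:\s]*(\S)")


def parse_optional_currency(text, position=0):
    """Return (currency, new_position), or (None, position) if the group fails."""
    marker_at = text.find("Currency", position)
    if marker_at == -1:
        # Marker missing: the whole group fails, but parsing can continue
        # from the same position because the group is optional.
        return None, position
    match = _SYMBOL_AFTER_MARKER.match(text, marker_at + len("Currency"))
    if match is None:
        # Content missing: the group still fails as a unit.
        return None, position
    return match.group(1), match.end()


with_line = parse_optional_currency("Send to: Becky Handler\nCurrency: $\nBooks ...")
without_line = parse_optional_currency("Send to: Becky Handler\nBooks ...")
```

Placing the optional behavior on the group, rather than on the individual Marker or Content, is what makes the two nested anchors succeed or fail together.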
Parsing the Order Table
As you have seen in the preceding exercises, you can parse the order table by using a RepeatingGroup. In this exercise, you will nest the RepeatingGroup in an EnclosedGroup anchor. The EnclosedGroup is useful for parsing HTML code because it recognizes the opening-and-closing tag structure, which is characteristic of the code. Specifically, you will use the EnclosedGroup to recognize the <table and </table> tags that surround an HTML table.
If you did the exercises in the preceding chapters, you should be able to work out a solution. We encourage you to try it yourself before you continue reading.
One solution is to configure the anchors by editing the IntelliScript, rather than selecting and right-clicking in the example source. The HTML code of the example source is long and confusing, and several of the anchors have non-default properties, which you can configure only in the IntelliScript.
To parse the table:
1. Select the three dots at the end of the top-level anchor sequence, that is, the level where the Group is defined, and not the nested level within the Group.
2. By editing the IntelliScript, insert an EnclosedGroup anchor.
3. Under the opening property of the EnclosedGroup, assign text = <table. Under the closing property, assign text = </table>. You can type the strings, or you can drag them from the example pane.
The example pane highlights the <table and </table> tags in the HTML code.
4. Now, start adding anchors within the EnclosedGroup.
Within the EnclosedGroup Anchor, Add    Explanation
Marker            Advances to the first row of the table, which you do not need to retrieve.
RepeatingGroup    Starts a repeating group at the second row of the table. See below for the anchors to insert within the RepeatingGroup.
Marker            Terminates the repeating group and moves to the last row of the table. To assign count=2, display the advanced properties. For more information, see “Using Count to Resolve Ambiguities” on page 92.
Marker            Advances to the fourth cell in the last row. To assign count=3, display the advanced properties.
Content           Retrieves the content of the fourth cell in the last row.
At this point, the EnclosedGroup looks like this:
5. Try the Mark Example command. If you scroll through the example source, you can view the color coding for the above anchors.
6. Run the parser. The results look like this:
7. Now, expand the RepeatingGroup and insert the following anchors within it.
Within the RepeatingGroup Anchor, Add    Explanation
Content           Retrieves the content of the first cell in a row.
Content           Retrieves the content of the second cell.
Content           Retrieves the content of the third cell.
Content           Retrieves the content of the fourth cell.
Notice that the four Content anchors have very similar configurations. Optionally, you can save time by copying one of the Content anchors, and editing the copies.
When you have finished, the EnclosedGroup and RepeatingGroup should look like this:
Run the Mark Example command, and check the color-coding again.
Run the parser. The result should look like this:
Why the Output does not Contain HTML Code
Notice that the parser removed the HTML code from the retrieved text. For example, the first Content anchor within the RepeatingGroup is configured to retrieve all the text between the <td and </td>. You might have expected it to retrieve the text:
width=168 valign=top style='width:125.9pt;border:solid windowtext 1.0pt;border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt'><p class=MsoBodyText><i style='mso-bidi-font-style:normal'>Topspin Serves Made Easy</i>, by Roland <span class=SpellE>Fasthitter</span></p>
In fact, this is exactly the color-coding of the IntelliScript:
The answer is that the Content anchor retrieves the entire text, including the HTML code. However, the HtmlFormat component of the parser is configured with a set of default transformers, which modify the output of the anchor.
One of the default transformers is RemoveTags, which removes the HTML code.
A transformer is a component that modifies the output of an anchor. Some common uses of transformers are to add or replace strings in the output. You will use transformers for these purposes in the next stage of the exercise. The format component of a parser is typically configured with a set of default transformers, which the parser applies to the output of all anchors. The purpose of the default transformers is to clean up the output. The HtmlFormat is configured with the following default transformers:
♦ RemoveTags, which removes HTML code from the output.
♦ A transformer that converts HTML entities, such as &gt; or &amp;, to plain-text symbols, > or &.
♦ A transformer that converts multiple whitespace characters (spaces, tabs, or line breaks) to single spaces.
♦ A transformer that deletes leading and trailing space characters.
For more information, see “Transformers” in the Complex Data Exchange Studio User Guide.
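The combined effect of the default transformers can be approximated as a small clean-up pipeline. The Python sketch below is an analogy under stated assumptions and does not reproduce the product's transformers: the function names are hypothetical, the regular expression is a simplification that also strips the leftover attribute text at the start of the retrieved string, and the style attribute in the sample is shortened.

```python
import html
import re

# Matches a leftover tag remainder at the very start (for example the attributes
# that follow "<td"), or any complete tag elsewhere in the text.
_TAG = re.compile(r"^[^<>]*>|<[^>]*>")
_WHITESPACE = re.compile(r"\s+")


def remove_tags(text):
    return _TAG.sub("", text)


def clean_output(raw):
    """Apply the four clean-up steps in order."""
    text = remove_tags(raw)              # 1. remove HTML code
    text = html.unescape(text)           # 2. convert entities such as &gt; and &amp;
    text = _WHITESPACE.sub(" ", text)    # 3. collapse runs of whitespace
    return text.strip()                  # 4. trim leading and trailing spaces


# Shortened form of the text that the Content anchor retrieves between <td and </td>.
RAW = (
    "width=168 valign=top style='width:125.9pt;padding:0cm 5.4pt'>"
    "<p class=MsoBodyText><i style='mso-bidi-font-style:normal'>"
    "Topspin Serves Made Easy</i>, by Roland "
    "<span class=SpellE>Fasthitter</span></p>"
)

title = clean_output(RAW)
```

In this sketch the tags are removed before the entities are converted. Reversing the order would turn an escaped &gt; into a literal > first, which the tag-removal step could then mistake for part of a tag.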
Why the Parser does not Use Delimiters to Remove the HTML Codes
Do you remember the Becky Handler anchor that you defined at the beginning of the exercise? We explained that the anchor removes HTML codes because it retrieves only the data after the > delimiter. The color-coding illustrates this:
This works because the Becky Handler anchor is configured with the LearnByExample component, whose function is to interpret the delimiters in the example source.
In the Content anchors of the RepeatingGroup, we deliberately omitted the LearnByExample component:
The LearnByExample component would not have worked correctly here. The problem is that the number of delimiters in the order table is not constant. In some of the table rows, the name of a book is italicized. In other rows, the name is not italicized. In some source documents, the table might contain other formatting variations. The formatting can change the HTML code, including the number and location of the delimiters. That is why we left the value property empty. Instead, we relied on the default transformers to clean up the output.
Using Count to Resolve Ambiguities
Within the EnclosedGroup, we told you to assign the count property of two Marker anchors. The purpose was to resolve ambiguities in the example source. In the >Total< anchor, we set count = 2. This was to avoid ending the repeating group prematurely, at the first instance of the string >Total<, which is in the column-heading row. The count = 2 property means that the parser looks for the second instance of >Total<, which is in the last row of the table. This resolves the ambiguity. Another interesting point is that we quoted the text >Total<, rather than Total. This resolves the ambiguity with the word Total that appears in the Total Tennis (video) row of the table. If we didn't do this, the repeating group would incorrectly end at the Total Tennis (video) row.
The other use of the count property is in the last <td marker of the EnclosedGroup, which lies within the last table row. There, we set count = 3 to advance the parser by three table cells. The skipped cells in this row contain no data, so we do not need to parse them.
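The way the count property resolves an ambiguity can be sketched in Python. The helper find_nth and the shortened table text are hypothetical and serve only to illustrate the idea of searching for the nth instance of a marker; they are not how Complex Data Exchange is implemented.

```python
def find_nth(text: str, marker: str, count: int = 1) -> int:
    """Return the index of the count-th instance of marker, or -1 if there are fewer."""
    position = -1
    for _ in range(count):
        position = text.find(marker, position + 1)
        if position == -1:
            return -1
    return position


# Shortened, hypothetical version of the order table.
table = (
    "<tr><td>Item</td><td>Price</td><td>Total</td></tr>"
    "<tr><td>Total Tennis (video)</td><td>11.40</td></tr>"
    "<tr><td>Total</td><td>46.19</td></tr>"
)

# count = 1 finds >Total< in the column-heading row, which is too early.
first = find_nth(table, ">Total<", 1)
# count = 2 finds >Total< in the last row, which is the one we want.
second = find_nth(table, ">Total<", 2)
# Without the > and < characters, the second instance is in the Total Tennis row.
unquoted = find_nth(table, "Total", 2)

print(table[second:second + 20])
print(table[unquoted:unquoted + 20])
```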
Using Transformers to Modify the Output
At this point, the XML output still has two small problems:
♦ Some of the book titles are punctuated incorrectly. The output is
<Title>Topspin Serves Made Easy , by Roland Fasthitter</Title>
instead of
<Title>Topspin Serves Made Easy, by Roland Fasthitter</Title>
The difference is an extra space character, located before the comma.
♦ The prices and totals do not have a currency symbol. The output is
<Price>11.40</Price> <Total>46.19</Total>
instead of
<Price>$11.40</Price> <Total>$46.19</Total>
You will apply transformers to particular Content anchors. The transformers will modify the output of the anchors and correct these problems. These transformers are in addition to the default transformers, which the parser applies to all the anchors.
To define the transformers:
1. In the IntelliScript, expand the first Content anchor of the RepeatingGroup.
2. Display the advanced properties of the Content anchor, and expand the transformers property.
3. At the three dots within the transformers property, nest a Replace transformer.
4. Set the properties of the Replace transformer as follows:
find_what = TextSearch(" ,")
replace_with = ","
In other words, the transformer replaces a space followed by a comma with a comma.
5. Click the >> symbol to display the advanced properties of the Replace transformer. Select the optional property. This is important because the second row of the order table, “Total Tennis (Video)”, does not contain a space followed by a comma. The transformer cannot perform the replacement on this row. If you do not select optional, the Content anchor fails and the output omits the second row of the order.
You will use an AddString transformer to prefix the prices and totals with the currency symbol. You could configure an independent AddString transformer in each appropriate Content anchor, but there is an easier way. You can configure a single AddString transformer as a global component. You can then use the global component wherever it is needed.
6. To configure the global component, select the three dots to the left of the equal sign, at the top level of the IntelliScript, the level where MyHtmlParser and VarCurrency are defined. This is called the global level. Press Enter, and type AddCurrencyUnit. This is the identifier of the global component.
7. Press Enter at the second three-dots symbol on the same line, following the equals sign. Insert an AddString transformer.
8. Select the pre property of the AddString transformer, and press Enter. At the right of the text box, click the browse button. This displays a Schema view, where you can select a data holder. In the Schema view, select VarCurrency. This means that the AddString transformer prefixes its input with the value of VarCurrency, which is the currency symbol. The result should look like this:
9. Now, you can insert the AddCurrencyUnit component in the appropriate locations of MyHtmlParser. For example, display the advanced properties of the last Content anchor in the EnclosedGroup, which is the one that is mapped to /Order/*s/Total. Under its transformers property, select the three dots and press Enter. The drop-down list displays the AddCurrencyUnit global component, which you have just defined. Select the global component. This applies the transformer to the anchor. In the same way, insert AddCurrencyUnit in the transformers property of the second and fourth Content anchors in the RepeatingGroup, which are mapped to Price and Total, respectively.
10. Run the parser. The result should be:
Notice that the Title element does not contain an extra space before the comma. The Price and Total elements contain the $ sign. These are the effects of the transformers that you defined.
Global Components
Global components are useful when you need to use the same component configuration repeatedly in a project. You can define any Complex Data Exchange component as a global component, for example, a transformer, an action, or an anchor. You can use the global component just like any other component. The identifier of the global component appears in the drop-down list at the appropriate locations of the IntelliScript.
In this exercise, you defined three global components:
♦ The parser, called MyHtmlParser, which you defined by using the wizard.
♦ The VarCurrency variable that you defined at the beginning of the exercise.
♦ The AddCurrencyUnit transformer that you added at the end of the exercise.
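The behavior of the two transformers in this exercise can be sketched in Python. The names replace_transformer, add_string, and TransformerFailure are hypothetical stand-ins for the Replace and AddString transformers; in particular, treating a missing match as a failure unless optional is set follows the description in this exercise, but the details of the real components are assumptions.

```python
class TransformerFailure(Exception):
    """Raised when a non-optional transformer cannot do its work (hypothetical)."""


def replace_transformer(text: str, find_what: str, replace_with: str,
                        optional: bool = False) -> str:
    """Replace find_what with replace_with, as a Replace transformer might.

    If find_what is not found, an optional transformer passes the text
    through unchanged; a non-optional transformer fails.
    """
    if find_what not in text:
        if optional:
            return text
        raise TransformerFailure(f"{find_what!r} not found in {text!r}")
    return text.replace(find_what, replace_with)


def add_string(text: str, pre: str = "") -> str:
    """Prefix the text with pre, as AddString does with its pre property."""
    return pre + text


var_currency = "$"  # value that the parser stored in VarCurrency

title = replace_transformer(
    "Topspin Serves Made Easy , by Roland Fasthitter", " ,", ",", optional=True)
video = replace_transformer("Total Tennis (Video)", " ,", ",", optional=True)
price = add_string("11.40", pre=var_currency)

print(title)
print(video)
print(price)
```

Defining add_string once and calling it wherever a currency symbol is needed mirrors the role of the AddCurrencyUnit global component.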
Testing the Parser on Another Source Document
As a final step, test the parser on the source document OrderWithoutCurrency.doc. This document is identical to the example source (Order.doc), except that it is missing the optional currency line. The purpose of this test is to confirm that the parser correctly processes a document when the currency line is missing.
To perform the test:
1. Display the advanced properties of MyHtmlParser.
2. Assign the sources_to_extract property a value of LocalFile. This means that the parser processes a file on the local computer, instead of the example source.
3. Edit the file_name property, and browse to OrderWithoutCurrency.doc.
4. Display the advanced properties of LocalFile, and assign pre_processor = WordToHtml. This means that Complex Data Exchange runs the WordToHtml processor on the document before it runs the parser.
5. Run the parser. The output should be identical to the above, except that the prices and totals are not prefixed with a $ sign.
6. When you finish, you can delete the sources_to_extract value. This was a temporary setting that you needed only for testing the parser.
Points to Remember
To parse a Microsoft Word document, you can use a document processor such as WordToHtml, which converts the document to a parser-friendly HTML format.
To parse HTML code, you can use the EnclosedGroup anchor or the opening_marker and closing_marker properties of the Content anchor. These approaches take advantage of the characteristic opening-and-closing tag structure of HTML. If you define an HTML tag as a Marker, it is usually best not to select the match_case property because HTML is not case sensitive.
To resolve ambiguities between multiple instances of the same text, you can use the count property of a Marker.
To exclude HTML tags from the parsing output, you can either define the anchors using delimiters, or you can rely on the default transformers to remove the code.
Use transformers to modify the output of anchors. You can apply default transformers to all the anchors, or you can apply transformers to specific anchors.
Use variables to store retrieved data that you do not want to display in the XML output of a parser. The variables are useful as input to other components.
Global components are useful when you need to use the same component configuration repeatedly.
To test a parser on additional documents other than the example source, assign the sources_to_extract property. Assign a document processor to the document if it needs one.
Chapter 6
Defining a Serializer
This chapter includes the following topics:
♦ Overview
♦ Creating the Project
♦ Configuring the Serializer
♦ Calling the Serializer Recursively
♦ Points to Remember
Overview
In the preceding exercises, you have defined parsers, which convert documents in various formats to XML. In this exercise, you will create a serializer that works in the opposite direction, converting XML to another format. Usually, it is easier to define a serializer than a parser because the input is a fully structured, unambiguous XML document.
The serializer that you will define is very simple. It contains only four serialization anchors, which are the opposite of the anchors that you use in parsing. Nonetheless, the serializer has some interesting points:
♦ The serializer is recursive. That is, it calls itself repetitively to serialize the nested sections of an XML document.
♦ The output of the serializer is a worksheet that you can open in Excel.
You will define the serializer by editing the IntelliScript. It is also possible to generate a serializer automatically by inverting the operation of a parser. For more information, see “Serializers” in the Complex Data Exchange Studio User Guide.
Prerequisite
The output of this exercise is a *.csv (comma separated values) file. You can view the output in Notepad, but for the most meaningful display, you need Microsoft Excel. You do not need Excel to run the serializer. It is recommended only to view the output.
Requirements Analysis
The input XML document is:
tutorials\Exercises\Files_For_Tutorial_5\FamilyTree.xml
The document is an XML representation of a family tree. Notice the inherently recursive structure: each Person element can contain a Children element, which contains additional Person elements.
<Person>
   <Name>Jake Dubrey</Name>
   <Age>84</Age>
   <Children>
      <Person>
         <Name>Mitchell Dubrey</Name>
         <Age>52</Age>
      </Person>
      <Person>
         <Name>Pamela Dubrey McAllister</Name>
         <Age>50</Age>
         <Children>
            <Person>
               <Name>Arnold McAllister</Name>
               <Age>26</Age>
            </Person>
         </Children>
      </Person>
   </Children>
</Person>
Our goal is to output the names and ages of the Person elements as a *.csv file, which has the following structure:
Jake Dubrey,84
Mitchell Dubrey,52
Pamela Dubrey McAllister,50
Arnold McAllister,26
Excel displays the output as a worksheet looking like this:
Creating the Project
To create the project:
1. On the Complex Data Exchange Studio menu, click File > New > Project.
2. Under the Complex Data Exchange node, select a Serializer Project.
3. On the following wizard pages, specify the following options:
♦ Name the project Tutorial_5.
♦ Name the serializer FamilyTreeSerializer.
♦ Name the script file Serializer_Script.
4. When you reach the Schema page, browse to the schema FamilyTree.xsd, which is in the tutorials\Exercises\Files_For_Tutorial_5 folder. The schema defines the structure of the input XML document. If you are new to XSD, you can open the schema in an XSD editor or in Notepad, and examine how the recursive data structure is defined.
5. When you finish the wizard, the Complex Data Exchange Explorer displays the new project. Double-click the Serializer_Script.tgp file to edit it.
6. Unlike a parser, a serializer does not have an example source file. Optionally, you can hide the empty example pane of the IntelliScript editor. To do this, click IntelliScript > IntelliScript or click the toolbar button that is labeled Show IntelliScript Pane Only. To restore the example pane, click IntelliScript > Both or click the button that is labeled Show Both IntelliScript and Example Panes.
7. To design the serializer, you will use the FamilyTree.xml input document, whose content is presented above. Although not required, it is a good idea to organize all the files that you use to design and test a project in the project folder, which is located in your Eclipse workspace. We suggest that you copy the FamilyTree.xml file from tutorials\Exercises\Files_For_Tutorial_5 to the project folder, which is by default
My Documents\Informatica\ComplexDataExchange\4.0\workspace\Tutorial_5
Determining the Project Folder Location
Your project folder might be in a non-default location for either of the following reasons:
♦ In the New Project wizard, you selected a non-default location.
♦ Your copy of Eclipse is configured to use a non-default workspace location.
To determine the location of the project folder, select the project in the Complex Data Exchange Explorer, and click the File > Properties command. Alternatively, click anywhere in the IntelliScript editor, and then click the Project > Properties command. The Info tab of the properties window displays the location.
Project Properties
The properties window displays many useful options, such as the input and output encoding that are used in your documents and the XML validation options. For more information, see “Project Properties” in the Complex Data Exchange Studio User Guide.
Configuring the Serializer
You are now ready to configure the serializer properties and add the serialization anchors. You must do this by editing the IntelliScript. There is no analogy to the select-and-click approach that you used to configure parsers.
To configure the serializer:
1. Display the advanced properties of the serializer, and set output_file_extension = .csv, with a leading period. When you run the serializer in the Studio, this causes the output file to have the name output.csv. By default, *.csv files open in Microsoft Excel.
2. Under the contains line of the serializer, insert a ContentSerializer serialization anchor and configure its properties as follows:
data_holder = /Person/*s/Name
closing_str = ","
This means that the serialization anchor writes the content of the /Person/*s/Name data holder to the output file. It appends the closing string "," (a comma).
3. Define a second ContentSerializer as illustrated:
This ContentSerializer writes the /Person/*s/Age data holder to the output. It appends a carriage return (ASCII code 013) and a linefeed (ASCII 010) to the output. To type the ASCII codes:
♦ Select the closing_str property and press Enter.
♦ On the keyboard, press Ctrl+a. This displays a small dot in the text box.
♦ Type 013.
♦ Press Ctrl+a again.
♦ Type 010.
♦ Press Enter to complete the property assignment.
4. Run the serializer. To do this:
♦ Set the serializer as the startup component.
♦ On the menu, click Run > Run.
♦ At the prompt, browse to the test input file, FamilyTree.xml.
♦ When the serializer has completed, examine the Events view for errors.
♦ In the Complex Data Exchange Explorer view, under Results, double-click output.csv to view the output.
Assuming that Excel is installed on the computer, Complex Data Exchange displays an Excel window, like this:
If Excel is not installed on the computer, you can view the output file in Notepad. Alternatively, you can copy the file to another computer where Excel is installed, and open it there.
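What the two ContentSerializer anchors produce at this stage can be sketched in Python. This is an illustration only: the function serialize_top_level is hypothetical, and it uses the standard xml.etree.ElementTree module rather than anything from Complex Data Exchange. The input is a shortened version of FamilyTree.xml.

```python
import xml.etree.ElementTree as ET

FAMILY_TREE = """
<Person>
  <Name>Jake Dubrey</Name>
  <Age>84</Age>
  <Children>
    <Person>
      <Name>Mitchell Dubrey</Name>
      <Age>52</Age>
    </Person>
  </Children>
</Person>
"""


def serialize_top_level(person: ET.Element) -> str:
    """Write Name and Age, each followed by its closing string."""
    # First anchor: /Person/*s/Name, closing_str = ","
    output = person.findtext("Name", default="") + ","
    # Second anchor: /Person/*s/Age, closing_str = carriage return + linefeed
    output += person.findtext("Age", default="") + "\r\n"
    return output


root = ET.fromstring(FAMILY_TREE)
print(repr(serialize_top_level(root)))
```

Only the top-level Person is written. The nested Person elements under Children are not visited yet.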
Calling the Serializer Recursively
So far, the results contain only the top-level Person element. You need to configure the serializer to move deeper in the XML tree and process the child Person elements.
To call the serializer recursively:
1. Insert a RepeatingGroupSerializer. You can use this serialization anchor to iterate over a repetitive structure in the input, and to generate a repetitive structure in the output. In this case, the RepeatingGroupSerializer will iterate over all the Person elements at a given level of nesting.
2. Within the RepeatingGroupSerializer, nest an EmbeddedSerializer. The purpose of this serialization anchor is to call a secondary serializer.
3. Assign the properties of the EmbeddedSerializer as illustrated.
The properties have the following meanings:
♦ The assignment serializer = FamilyTreeSerializer means that the secondary serializer is the same as the main serializer. In other words, the serializer calls itself recursively.
♦ The schema_connections property means that the secondary serializer should process /Person/*s/Children/*s/Person as though it were a top-level /Person element. This is what lets the serializer move down through the generations of the family tree.
♦ The optional property means that the secondary serializer does not cause the main serializer to fail when it runs out of data.
4. Run the serializer again. The result should be:
Defining Multiple Components in a Project
In the exercises throughout this book, each project contains a single parser, serializer, or mapper. We did this for simplicity. None of the exercises in this book contain multiple components.
It is quite possible for a single project to contain multiple components, for example:
♦ Multiple parsers, serializers, or mappers
♦ Multiple script (TGP) files
♦ Multiple XSD schemas
We won't give you an exercise on this subject, but you can find the instructions in the Complex Data Exchange Studio User Guide. The following paragraphs are a brief summary.
Multiple Data Transformation Components
To define multiple data transformation components such as parsers, serializers, or mappers, insert them at the global level of the IntelliScript. Here is an example:
To run one of the components, set it as the startup component and use the commands on the Run menu.
The startup component can call secondary parsers, serializers, or mappers to process portions of a document. That is what you did in this exercise, with the interesting twist that the main and secondary serializers were the same: a recursive call.
Multiple Script Files
You can use multiple script files to organize your work.
To create a script file:
1. Right-click the Scripts node in the Complex Data Exchange Explorer, and click New > Script.
2. To add a script that you created in another project, click Add File.
To open a script file for editing, double-click the file in the Complex Data Exchange Explorer.
Multiple XSD Schemas
To add multiple schemas to a project, right-click the XSD node in the Complex Data Exchange Explorer and click Add File. To create an empty XSD schema that you can edit in any editor, click New > XSD.
Points to Remember
A serializer is the opposite of a parser: it converts XML to other formats. You can design a serializer that outputs to any data format, for example, Microsoft Excel.
You can create a serializer either by generating it from an existing parser or by editing the IntelliScript.
A serializer contains serialization anchors that are analogous to the anchors that you use in a parser, but work in the opposite direction.
We recommend that you store all files associated with a project in the project folder, located in your Eclipse workspace. You can determine the folder location by clicking File > Properties or Project > Properties.
In an IntelliScript editor, you can hide or display the panes by choosing the commands on the IntelliScript menu or on the toolbar.
A single project can contain multiple data transformation components such as parsers, serializers, and mappers. The startup component can call secondary serializers, parsers, or mappers, which are defined in the same project. Secondary components can process portions of a document. Recursion, a component calling itself, is supported.
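The recursive serializer from this chapter can be sketched in Python as a function that calls itself for each child Person. The function serialize_person is a hypothetical illustration built on the standard xml.etree.ElementTree module; it mimics the roles of the ContentSerializer, RepeatingGroupSerializer, and EmbeddedSerializer anchors but is not the way Complex Data Exchange is implemented.

```python
import xml.etree.ElementTree as ET

FAMILY_TREE = """
<Person>
  <Name>Jake Dubrey</Name>
  <Age>84</Age>
  <Children>
    <Person>
      <Name>Mitchell Dubrey</Name>
      <Age>52</Age>
    </Person>
    <Person>
      <Name>Pamela Dubrey McAllister</Name>
      <Age>50</Age>
      <Children>
        <Person>
          <Name>Arnold McAllister</Name>
          <Age>26</Age>
        </Person>
      </Children>
    </Person>
  </Children>
</Person>
"""


def serialize_person(person: ET.Element) -> str:
    """Serialize one Person, then every Person nested under its Children."""
    # The two content anchors: Name + "," and Age + carriage return/linefeed.
    output = person.findtext("Name", default="") + ","
    output += person.findtext("Age", default="") + "\r\n"
    # The repeating group: iterate over the Person elements one level down.
    # A Person without Children simply contributes nothing more, which is
    # the role that the optional property plays in the exercise.
    children = person.find("Children")
    if children is not None:
        for child in children.findall("Person"):
            # The embedded call: the serializer invokes itself, treating
            # the nested Person as though it were a top-level Person.
            output += serialize_person(child)
    return output


csv_text = serialize_person(ET.fromstring(FAMILY_TREE))
print(csv_text)
```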
Chapter 7
Defining a Mapper
This chapter includes the following topics:
♦ Overview
♦ Creating the Project
♦ Configuring the Mapper
♦ Points to Remember
Overview
So far, you have worked with two major components. With parsers, you converted documents of any format to XML. With serializers, you changed XML documents to other formats.
In this chapter, you will work with a mapper, which performs XML to XML conversions. The purpose is to change the XML structure or vocabulary of the data.
The mapper design is similar to that of parsers and serializers. Within the main Mapper component, you can nest mapper anchors, Map actions, and other components that perform the mapping operations.
The exercise presents a simple mapper that generates an XML summary report. Among other features, the exercise demonstrates how to configure an initially empty project, which you create by using the Blank Project wizard. You can use a blank project to configure any kind of data transformation, in addition to mappers.
Requirements Analysis
The goal of this exercise is to use an existing XML file to generate a new XML file that has a modified data structure. The files are provided in the following folder:
tutorials\Exercises\Tutorial_6
Input XML
The input of the mapper is an XML file (Input.xml) that records the identification numbers and names of several persons. The input conforms to the schema Input.xsd.
<Persons>
   <Person ID="10">Bob</Person>
   <Person ID="17">Larissa</Person>
   <Person ID="13">Marie</Person>
</Persons>
Output XML
The expected output XML (ExpectedOutput.xml) is a summary report listing the names and IDs separately. The output conforms to the schema Output.xsd:
<SummaryData>
   <Names>
      <Name>Bob</Name>
      <Name>Larissa</Name>
      <Name>Marie</Name>
   </Names>
   <IDs>
      <ID>10</ID>
      <ID>17</ID>
      <ID>13</ID>
   </IDs>
</SummaryData>
Creating the Project
To create the mapper, you will start with a blank project.
To create the project:
1. On the Complex Data Exchange Studio menu, click File > New > Project.
2. Under the Complex Data Exchange node, select a Blank Project.
3. In the wizard, name the project Tutorial_6, and click Finish.
4. In the Complex Data Exchange Explorer, expand the project. Notice that it contains a default TGP script file, but it contains no XSD schemas or other components.
5. Right-click the XSD node and click Add File. Add the schemas Input.xsd and Output.xsd. Both schemas are necessary because you must define the structure of both the input and output XML.
6. The Schema view displays the elements that are defined in both schemas. You will work with the Persons branch of the tree for the input, and with the SummaryData branch for the output.
7. We recommend that you copy the test documents, Input.xml and ExpectedOutput.xml, into the project folder. By default, the folder is:
My Documents\Informatica\ComplexDataExchange\4.0\workspace\Tutorial_6
Configuring the Mapper
To configure the mapper:
1. Open the script file, which has the default name Tutorial_6.tgp, in an IntelliScript editor.
2. Optionally, use the commands on the IntelliScript menu or the toolbar to display the IntelliScript pane only. A mapper does not use an example source, so you do not need the example pane.
3. Define a global component called Mapper1, and give it a type of Mapper. For more information, see “Global Components”.
4. Assign the source and target properties of the mapper as illustrated. The properties define the schema branches where the mapper will retrieve its input and store its output.
5. Under the contains line of the mapper, insert a RepeatingGroupMapping component. You will use this mapper anchor to iterate over the repetitive XML structures in the input and output.
6. Within the RepeatingGroupMapping, insert two Map actions. The purpose of these actions is to copy the data from the input to the output. Configure the source property of each Map action to retrieve a data holder from the input. Configure the target property to write the corresponding data holder in the output.
7. Set the mapper as the startup component and run it. Check the Events view for errors.
8. Compare the results file, which is called output.xml, with ExpectedOutput.xml. If you have configured the mapper correctly, the files should be identical, except perhaps for the <?xml?> processing declaration, which depends on options in the project properties.
Points to Remember
A mapper converts an XML source document to an XML output document conforming to a different schema.
You can create a mapper by editing a blank project in the IntelliScript.
A mapper contains mapper anchors, which are analogous to the anchors that you use in a parser or to the serialization anchors in a serializer. It uses Map actions to copy the data from the input to the output.
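The mapping performed in this chapter can be sketched in Python. The function map_persons is a hypothetical illustration that uses the standard xml.etree.ElementTree module: the loop plays the role of the RepeatingGroupMapping, and the two assignments inside it play the role of the two Map actions. It is not the way Complex Data Exchange is implemented.

```python
import xml.etree.ElementTree as ET

INPUT_XML = (
    '<Persons>'
    '<Person ID="10">Bob</Person>'
    '<Person ID="17">Larissa</Person>'
    '<Person ID="13">Marie</Person>'
    '</Persons>'
)


def map_persons(xml_text: str) -> str:
    """Map the Persons branch of the input to the SummaryData branch of the output."""
    persons = ET.fromstring(xml_text)
    summary = ET.Element("SummaryData")
    names = ET.SubElement(summary, "Names")
    ids = ET.SubElement(summary, "IDs")
    # Iterate over the repetitive Person structure in the input.
    for person in persons.findall("Person"):
        # First Map action: copy the element text to a Name element.
        ET.SubElement(names, "Name").text = person.text
        # Second Map action: copy the ID attribute to an ID element.
        ET.SubElement(ids, "ID").text = person.get("ID")
    return ET.tostring(summary, encoding="unicode")


output_xml = map_persons(INPUT_XML)
print(output_xml)
```

The printed result is a single line without indentation or an XML declaration, so it matches ExpectedOutput.xml in content but not in layout.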
Chapter 8
Running Complex Data Exchange Engine
This chapter includes the following topics:
♦ Overview
♦ Deploying a Data Transformation as a Service
♦ COM API Application
♦ Running the COM API Application
♦ Points to Remember
Overview
After you have created and tested a data transformation, you need to move it from the development stage to production. There are two main steps to do this:
1. In Complex Data Exchange Studio, deploy the data transformation as a service. This makes it available to run in Complex Data Exchange Engine.
2. Launch an application that activates the Engine and runs the service.
In this tutorial, you will perform these steps on the HL7 parser that you already created. For more information about the parser, see “Defining an HL7 Parser”. If you prefer, you can perform the exercise on any other data transformation that you have prepared.
The application that activates the Engine will be a Microsoft Visual Basic 6 program that calls the Complex Data Exchange COM API. For the benefit of Complex Data Exchange users who do not have Visual Basic, we provide both the source code and the compiled program.
As an alternative to the approach that you will learn in this tutorial, you can run Complex Data Exchange services:
♦ From the command line.
♦ By using the Complex Data Exchange Java, C, C++, .NET, or Web Service APIs.
♦ By using the CGI web interface.
In addition, you can run services by using the Complex Data Transformation in Informatica PowerCenter. You can use integration agents to run Complex Data Exchange services within third-party systems. For more information, see the Complex Data Exchange Engine Developer Guide.
Deploying a Data Transformation as a Service
Deploying a Complex Data Exchange service means making a data transformation available to Complex Data Exchange Engine. By default, the repository location is:
c:\Program Files\Informatica\ComplexDataExchange\ServiceDB
Note: There is no relation between Complex Data Exchange services and Windows services. You cannot view or administer Complex Data Exchange services in the Windows Control Panel.
To deploy a data transformation as a service:
1. In the Complex Data Exchange Explorer, select the project. We suggest that you use the project that you already prepared, Tutorial_2. For more information about Tutorial_2, see “Defining an HL7 Parser”.
2. Confirm that the startup component is selected and that the data transformation runs correctly.
3. On the Complex Data Exchange menu, click Project > Deploy.
4. The Deploy Service window displays the service details. Edit the information as required.
5. Click the Deploy button.
6. At the lower right of the Complex Data Exchange Studio window, display the Repository view. The view lists the service that you have deployed, along with any other Complex Data Exchange services that have been deployed on the computer.
COM API Application
To illustrate how to run a Complex Data Exchange service from an API application, we have supplied a sample application that calls the Complex Data Exchange COM API. The application is programmed in Microsoft Visual Basic 6.0. The source code and the compiled executable file are stored in the folder:
tutorials\Exercises\CMComApiTutorial
The application is described in the following paragraphs.
Source Code
The Visual Basic application displays the following form:
Enter the source document path, an output document path, and the name of a Complex Data Exchange service. When you click the Run button, the application executes the following code:

Private Sub cmdRun_Click()
    Dim objCMEngine As CM_COM3Lib.CMEngine      'Engine
    Dim objCMRequest As CM_COM3Lib.CMRequest    'Request generation
    Dim objCMStatus As CM_COM3Lib.CMStatus      'Status display
    Dim strRequest As String    'Request string
    Dim strOutput As String     'Output string (not used in this sample)
    Dim strStatus As String     'Status string

    'Check for the required input
    If txtSource = "" Then
        MsgBox "Enter the path of the source document"
        Exit Sub
    End If
    If txtOutput = "" Then
        MsgBox "Enter the path of the output document"
        Exit Sub
    End If
    If txtService = "" Then
        MsgBox "Enter the name of a deployed service"
        Exit Sub
    End If

    Me.MousePointer = vbHourglass

    'Initialize Engine
    Set objCMEngine = New CM_COM3Lib.CMEngine
    objCMEngine.InitEngine

    'Generate a request string
    Set objCMRequest = New CM_COM3Lib.CMRequest
    strRequest = objCMRequest.Generate( _
        txtService.Text, _
        objCMRequest.FileInput(txtSource.Text), _
        objCMRequest.FileOutput(txtOutput.Text), _
        "", "", "", "")

    'Execute the request
    strStatus = objCMEngine.Exec(strRequest, "", "", strOutput)

    'Display the status
    Set objCMStatus = New CM_COM3Lib.CMStatus
    Call MsgBox( _
        "Return code = " & objCMStatus.IsGood(strStatus) & vbCr & _
        objCMStatus.GetDescription(strStatus), _
        vbOKOnly, "Service Status")

    Me.MousePointer = vbDefault
    Set objCMEngine = Nothing
    Set objCMRequest = Nothing
    Set objCMStatus = Nothing
End Sub
Explanation of the API Calls
The Visual Basic sample uses three Complex Data Exchange COM API objects:

COM API Object    Description
CMRequest         Generates a request string, which specifies the service name, the input, the output, and other details of the operations that Complex Data Exchange Engine should perform.
CMEngine          Executes the request and generates the output.
CMStatus          Displays information about the results of the Complex Data Exchange Engine operation, such as a return code and error messages.

The steps for using these objects, demonstrated in the sample, are as follows:
1. Define the request parameters, such as the input location, output location, and Complex Data Exchange service name.
2. Use CMRequest to generate a request string.
3. Use CMEngine to execute the request.
4. Use CMStatus to confirm that the request ran successfully.
For complete information about the COM API, see the Complex Data Exchange COM API Reference.
to run the sample HL7 parser.txt For example. If an error occurred during the service operation. 4. After a moment. which is true on most Windows computers. you might copy the source document hl7to the c:\temp directory. and enter the options as illustrated below: 3. execute the following file: tutorials\Exercises\CMComApiTutorial\CMComApiTutorial.exe 2. Running the COM API Application 127 . we would package it in a setup file that includes the required runtime library.0 run-time library on your computer. Of course. if this were a production application. Open the output file in Notepad or Internet Explorer. In your working copy of the Tutorials folder. obs. the application displays an error message. Type the name of the Complex Data Exchange service that you deployed. Type the full path of the source document and the output document. By default. it is the name of the project. you can run the sample Visual Basic application. Click the Run button.Running the COM API Application If you have the Visual Basic 6. the application displays a status message: A return code of 1 means success. To run the COM API application: 1.
The output should be identical to the output that you generated when you ran the application in the Studio.

If an error occurred, the Engine stores an event log in the Complex Data Exchange reports location, by default

c:\Documents and Settings\<USER>\Application Data\Informatica\ComplexDataExchange\CMReports

where <USER> is your user name. To view a log, you can drag the *.cme file to the Events view of the Studio. To change the reports location, see "Configuration Editor" in the Complex Data Exchange Administrator Guide.
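To find the most recent event log without browsing the folder by hand, a short script can build the default reports path and list the *.cme files it contains. This is a minimal sketch: the function names are hypothetical, and it assumes the logs may sit either directly in the reports folder or in subfolders, which this guide does not specify. If you changed the reports location, pass your own folder instead of the default.

```python
import getpass
from pathlib import Path, PureWindowsPath


def default_reports_dir(user: str) -> PureWindowsPath:
    """Build the default reports location given in this guide for one user."""
    return PureWindowsPath(
        "c:/Documents and Settings", user, "Application Data",
        "Informatica", "ComplexDataExchange", "CMReports",
    )


def find_event_logs(reports_dir) -> list:
    """Return the *.cme files under reports_dir, newest first.

    Searches subfolders too (an assumption about the folder layout).
    Returns an empty list if the folder does not exist.
    """
    root = Path(reports_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.cme"), key=lambda p: p.stat().st_mtime, reverse=True)


# List the logs for the current user, newest first.
for log in find_event_logs(default_reports_dir(getpass.getuser())):
    print(log)
```

You can then drag the first file listed to the Events view of the Studio.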
Points to Remember

To move a data transformation from development to production, deploy it as a Complex Data Exchange service. The services are stored in the Complex Data Exchange repository.

You can run a service in Complex Data Exchange Engine by using the command-line interface, by API programming, by using the CGI interface, or by using integration tools.