Javlin Whitepaper Series

Designing Data Applications for Large Enterprises
The power of the data integration layer

Author: Branislav Repcek, Javlin

Abstract

This paper presents a design methodology suitable for applications which process large amounts of data in an enterprise environment. For a reliable enterprise approach, it is very important that applications be robust, scalable and extensible. To ensure such requirements are met, we propose an improved architecture based on a multi-tier application design commonly used for large scale applications. The key to this approach is an additional application layer which separates business logic from the presentation and storage layers. The resulting application is then more flexible and allows us to react quickly to typical data management issues caused by fast-changing requirements for many business rules in growing and unstructured environments.

By using standardized tools and frameworks it is possible to develop large applications very quickly and efficiently. Modern ETL frameworks allow us to design the business logic in a visual language which is easier to work with than logic hard-coded in the application's backend in more traditional languages (e.g. Java, PL/SQL, or PHP).

The validity of this design approach is demonstrated on a real-world application which has been designed and implemented according to the presented architecture.

Javlin Whitepaper Series | www.javlininc.com | www.cloveretl.com | Copyright Javlin 2011

Data Management Challenges

Today, every company and enterprise is compelled to manage huge amounts of data that impact its business on a daily basis: data that comes from different systems and is stored in a multitude of formats. Many data processing applications start as a simple spreadsheet with embedded macros or formulas. This might be adequate for small companies and their simple business logic. However, as the data volumes increase, managing both the data and the business logic this way quickly becomes hard and expensive due to significant overhead and manual work. Surprisingly, such practices are quite common, even in medium or large companies. Further, sometimes the business logic or understanding itself is dependent on the tool, not the other way around.

Over time, not only do the data formats and volumes grow, but the complexity of the business logic increases dramatically. This is often true for companies which integrate many different customers with just slightly different data formats (e.g. different banks or insurance companies which store very similar data). Using unsuitable tools for data management increases the administrative effort required to keep data up-to-date and correct.

Usually, after the data management becomes unbearable, companies seek solutions with better integration, higher automation and easier maintenance and extensibility. A very common approach is to build a custom application (often web-based) using a database as data storage. Even though some companies experience a measure of success this way, the resulting applications are often hard to maintain and extend since they are unique and represent a non-standard solution. Further, the business logic is often hard-coded, which means increased life cycle cost: the processes will surely change over time and the applications will require re-writes. What this demands is an in-house development team (or often, the one person who knows the code) to protect the data architecture. Here is where we begin to scope this major issue in data management. What is proposed then is a more flexible and elegant solution

to the issue of managing data architectures for the future: a specific way of designing such applications with extensibility and standardization in mind.

The Data Integration Layer

The data integration layer separates data and business logic from the actual application, which is used as a user interface to display the data. This is accomplished by adding an additional layer to the architecture of the application: an independent data integration layer. The basic architecture of such an application consists of three distinct layers:

• Thin client: either a stand-alone or a web-based application.
• Data integration (DI) layer.
• Data storage (any platform, often a database, file system or even an enterprise service bus).

In this context, the DI layer is the only layer which contains the implementation of the business logic. Its main purpose is to host all ETL components of the system and provide clear separation between these ETL parts and the rest of the application. This separation means that all the data has to pass through the data integration layer during processing. Since all the data operations pass through the data integration layer, you can use all the advantages the DI software offers, most notably database independence, rapid development, the ability to access multiple different data sources, and data exports in different formats, to name a few.

DI Layer Implementation

Since the data integration layer becomes the core of the application, it is important to carefully design how it interacts with the rest of the application

and which software package is used. There are several key questions which need to be answered before the DI layer software is selected and implementation can start:

• What is the complexity of the business logic behind the application?
• Is the application interactive or processing data based on a schedule?
• Which communication protocols are used by data sources and data sinks?
• What level of fault tolerance and scalability is required?

The complexity of the business logic has a significant impact on the software package which is used as the DI layer core. Many data management applications can be very simple: they basically convert one data format to another and may perform some light processing. This makes the DI layer a perfect place for an off-the-shelf ETL (Extract-Transform-Load) package to handle all the heavy data processing. However, it is often required to validate the data against other systems (especially for manually entered data) or to run various processes depending on the data. Data management often involves integration of multiple systems to validate the data and can contain business logic which directly operates on the processed data. These requirements place additional constraints on the ETL package, which must support all the intended business logic features so that there is a minimum amount of code outside of the ETL core.

Figure 1: Basic application architecture with data integration layer (thin client, DI layer, data storage on any platform).

Interactive applications usually require a more complex design as the users expect the application to behave well even if an error occurs. Applications running in a batch scheduled mode can simply send an email to the operator or to a ticketing system, but such behavior is probably not acceptable to users who

expect to see direct feedback in their client. Interactivity also requires different business logic design.

The communication with other components in the system might place additional constraints on the ETL tool. However, with modern tools this becomes less of an issue as they typically support many different protocols. It is often quite easy to access databases, FTP or HTTP servers or use more complex communication protocols like MQ.

Fault tolerance and scalability requirements might not directly impact the overall design, but they certainly impact the cost of the whole solution. Most ETL packages offer higher-tier versions with support for clustering, load balancing and so on, but these versions often cost significantly more than simple versions. It is often very easy to migrate to a higher version if required, and therefore it might be possible to start small and expand the application as the need arises.

Advantages

Using the DI layer in an application offers multiple advantages, with business logic separation being the most prominent.

Business Logic Separation

In many applications, business operations can become very complex due to many different data formats (e.g. customer database, payments, orders, etc.) and multiple validation and data cleansing steps. Separating the layer which runs business logic is a must: it offers a clear advantage in better maintainability and extensibility. Depending on the tool chain used, the deployment of updated or new business rules can be very simple, even on multiple servers. This is where the efficiency of the DI layer saves time and money in the development life cycle of enterprise business operations.
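To make the separation concrete, the following is a minimal Python sketch of the three layers, not the implementation used in any particular product. All names in it (`Payment`, `InMemoryStorage`, `DataIntegrationLayer`, `render_totals`) and the business rule itself are hypothetical and chosen only for illustration. The point is that the rule lives in exactly one place, the DI layer, while the storage and the thin client stay free of business logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class Payment:
    customer_id: str
    amount: float


class Storage(Protocol):
    """Storage layer: persists records and knows nothing about business rules."""

    def load(self) -> List[Payment]: ...


class InMemoryStorage:
    """Hypothetical stand-in for a database or a file system."""

    def __init__(self, payments: List[Payment]) -> None:
        self._payments = list(payments)

    def load(self) -> List[Payment]:
        return list(self._payments)


class DataIntegrationLayer:
    """DI layer: the only place where business rules are implemented."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def totals_per_customer(self) -> Dict[str, float]:
        # Illustrative business rule: ignore non-positive amounts, sum the rest.
        totals: Dict[str, float] = {}
        for payment in self._storage.load():
            if payment.amount > 0:
                totals[payment.customer_id] = (
                    totals.get(payment.customer_id, 0.0) + payment.amount
                )
        return totals


def render_totals(di_layer: DataIntegrationLayer) -> List[str]:
    """Thin client: only formats what the DI layer returns."""
    totals = di_layer.totals_per_customer()
    return [f"{customer}: {total:.2f}" for customer, total in sorted(totals.items())]
```

With this shape, changing the rule (for example, how negative amounts are treated) touches only `DataIntegrationLayer`, and swapping `InMemoryStorage` for another storage class touches neither the rule nor the client.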

Rapid Development

State-of-the-art data integration platforms offer a user-friendly client application with a visual interface for authoring data transformations. It is often much easier to prepare transformations in such an application than it is to write them in programming languages like SQL, C#, Python etc. Since the main job of data integration software is to transform data into a usable form that supports business operations, the expressive power of built-in languages or (visual) components tends to be well suited for designing business rules. Such separation from low-level coding often allows business analysts to work directly with the business logic. Seeing the transformation in action is often just a click of a button, allowing for rapid prototyping of future rules and enhancements with a very quick result turnaround. This greatly assists the rapid development needs of the enterprise.

Data Access

Data integration platforms such as CloverETL offer several connectors or components which allow reading and writing the data in different formats. This often includes connectors to different database engines (e.g. Oracle, MySQL, MSSQL), file formatters (simple CSV, MS Excel, XML) and different file transfer protocols (HTTP, FTP, or local file system). In addition, a single company often uses dozens of different file formats or data structures within a single application. The ability to write the core transformations independently from the data input or output method is crucial as it saves considerable development resources and allows for cleaner design and easier integration within an existing pipeline. Finally, hiding platform-specific details from the business logic itself allows for easier future-proofing of an application: migration, data format changes or platform updates are all often just a configuration change.
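The idea of writing the core transformation independently from the input and output method can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how CloverETL components are implemented: the reader and writer functions, the record layout (`customer_id`, `name`) and the normalization rule are all hypothetical. The transformation sees only plain records, so the same rule runs unchanged whether the data arrives as CSV or as JSON lines.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]


def read_csv(text: str) -> Iterator[Record]:
    """Reader: parse CSV text with a header row into records."""
    yield from csv.DictReader(io.StringIO(text))


def read_json_lines(text: str) -> Iterator[Record]:
    """Reader: parse one JSON object per line into records."""
    for line in text.splitlines():
        if line.strip():
            yield {key: str(value) for key, value in json.loads(line).items()}


def normalize_customer(record: Record) -> Record:
    """Core transformation (illustrative rule): trim fields, upper-case the id."""
    return {
        "customer_id": record["customer_id"].strip().upper(),
        "name": record["name"].strip(),
    }


def write_csv(records: Iterable[Record]) -> str:
    """Writer: serialize records as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["customer_id", "name"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


def run(
    reader: Callable[[str], Iterable[Record]],
    text: str,
    writer: Callable[[Iterable[Record]], str],
) -> str:
    """Wire a reader, the core transformation and a writer together."""
    return writer(normalize_customer(record) for record in reader(text))
```

Switching the input format is then a matter of passing a different reader to `run`; `normalize_customer` does not change.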

Data Quality

Data quality and consistency is one of the most important things for any business-critical data. With the increasing complexity of the business rules and of the systems which process the data, any errors in the input can have a significant impact on revenue and operation costs. Modern data integration platforms offer a wide range of data validation and cleansing tools. The offerings range from trivial validation (date or number format), through referential integrity validation, to complex tools offering, for example, address validation or WebService integration.

Enterprise Features

Data integration systems often offer a full set of enterprise features such as automation, workflow, logging, or monitoring. For high availability and high performance, many tools integrate with fail-over systems and are able to run in clusters with load balancing. Some of the tools offer even more advanced features like distributed processing or auto-scaling. Through different connectors or components it is possible to connect to an ESB (Enterprise Service Bus) and integrate the whole application into a bigger enterprise infrastructure.

Considerations

As with other solutions, there are some considerations to weigh before deciding whether this architecture is suitable for a given enterprise. The first consideration is the cost of the data integration tools. Historically, larger enterprise solutions cost hundreds of thousands of dollars, both in license and maintenance fees. With the advent of lower-cost and robust solutions, this has been mitigated, but life cycle cost is clearly a consideration.

The price needs to be specifically considered when thinking about the future of the application. Most applications will slowly grow with their data volumes. Data integration tools often offer several tiers, with lower tiers suitable for smaller projects and higher tiers with more features for large-scale projects. This allows every customer to select what is best and upgrade when necessary.

The second consideration is the requirement to support new tools which may not be deployed elsewhere in the company. This means new support personnel must be trained and new guidelines for tool deployments will have to be designed. This may translate to higher initial project investment. One key point is that these one-time costs of the initial phase of the project may be offset by reduced development time for new business logic, improved data quality and stability. A higher initial cost will usually pay off quickly as the run costs are quite stable even though data volumes increase. This is especially true for enterprise data-driven applications which require very complex business logic that can change quickly with market needs. A return on investment (ROI) calculation should be completed to support the decision.

Real-life Example

We will demonstrate the usability of the proposed architecture on a data management application built for a financial services customer. The application was designed to replace an MS Excel based workflow: different work groups in multiple companies communicated by using worksheets with a pre-defined structure. The tasks performed by the employees handling these documents ranged from simple data pairing (customer and transactions) to more complicated financial calculations and estimates. Due to the number of customers and financial transactions, this was proving to be very cumbersome and generated quite a large number of errors (simple typos or copy-paste errors between different documents). Moreover, some of the documents had quite complicated layouts (more than 200 columns) and had thousands of rows.

Requirements and Architecture

Several key requirements have been set by our customer. The application required a user interface, since manual updates of the master data were required. All changes made by the employees had to be audited to make sure that any errors can be quickly traced back and corrected. Apart from allowing the users to edit the data, the resulting application needed to apply various business rules, for example create the calendar for insurance payments, update various invoices and so on. The resulting business logic was quite complex and therefore we needed a tool which would be able to support such operations.

An additional request was that the application had to support loading and exporting the data in MS Excel format. This was required since not all companies had been able to switch to a different transport format. Therefore it was required that the business logic would use the database internally and that import and export would be allowed in MS Excel formats as well.

The above requirements are a perfect fit for CloverETL Server, which has been selected as the core of the DI layer of this application. Since CloverETL runs on all reasonable implementations of Java, it is able to support IBM AS/400, which was the platform preferred by our customer. The resulting high-level architecture diagram of the whole application is presented in Figure 2.

Figure 2: Overall application architecture (components: Users, Web GUI, CloverETL Server, IBM DB2, Directory). Most components run on AS/400, with users accessing only the web server directly and never accessing the database. Some of the outputs are sent to an NFS share on a different server where users can directly see the files via a shared Windows directory. Red arrows indicate flow of the business data; blue arrows indicate flow of the data related to the application itself (e.g. log-in verification, inter-component communication etc.).

Platform Independence

During the development it proved beneficial to have the ability to quickly deploy and test new versions of the application. Since we had a few free virtual machines running Windows and Linux guest systems, we decided to use some of them for the development. The flexibility of CloverETL Server allowed us to quickly deploy onto any platform without any substantial configuration. This allowed us to start the development even before the customer was able to provide a dedicated AS/400 test environment for this project. When the server was ready, migration of the graphs was as simple as copying files over and adjusting the configuration for the new environment.

Business Logic

In keeping with the philosophy of a separate DI layer, the whole business logic is implemented as a set of CloverETL transformations. The requirements set by our customer have resulted in very complex business logic with more than 40 transformation graphs and quite a bit of code in custom transformations. Most of the graphs work together in pairs: one input and one output graph (see Figure 3). Two or more such pairs are usually grouped together into bigger groups, which usually contain a graph pair for historical data processing and a pair for change processing.

Figure 3: Example input (top) and output (bottom) transformation graphs which together define part of the business logic for one data type. Green boxes are reader components (e.g. file or database input), blue boxes are writer components, and dark yellow boxes are various transformations (e.g. sort, join, custom transformations). Data flow is represented by lines connecting various graph components and by convention runs from left to right.
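The grouping of graph pairs described above can be pictured as a small lookup structure. The Python sketch below is only an illustration: the data type name, the graph file names and the helper `select_graph` are hypothetical and do not reflect the actual configuration format of CloverETL Server or of the customer application.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GraphPair:
    """One input graph and one output graph that work together."""

    input_graph: str
    output_graph: str


# Hypothetical configuration: data type -> processing mode -> graph pair.
GRAPH_GROUPS: Dict[str, Dict[str, GraphPair]] = {
    "invoices": {
        "historical": GraphPair("invoices_hist_in.grf", "invoices_hist_out.grf"),
        "change": GraphPair("invoices_change_in.grf", "invoices_change_out.grf"),
    },
}


def select_graph(data_type: str, mode: str, direction: str) -> str:
    """Return the graph configured for a data type, mode and direction."""
    try:
        pair = GRAPH_GROUPS[data_type][mode]
    except KeyError:
        raise ValueError(
            f"no graph configured for {data_type!r} / {mode!r}"
        ) from None
    if direction == "input":
        return pair.input_graph
    if direction == "output":
        return pair.output_graph
    raise ValueError(f"unknown direction: {direction!r}")
```

In a layout like this, supporting a new data type means adding one more entry to the mapping; the code that selects and launches graphs stays the same.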

Since everything is implemented as a graph, it was trivial to change various parts of the business logic while the application was in development. The input and output components even allowed us to quickly switch input and output data formats during the development. This was especially useful since the complete database layout for data input had not been finalized before the development started, but was prepared by our customer over the course of the development. Therefore, we tested the first versions of the graphs by using MS Excel files as input and output. When the database was ready, we simply switched input/output from MS Excel to database components and tweaked the graphs for better performance (since it is possible to run complex queries in a database while the same is not possible in MS Excel worksheets).

The only configuration required for the business logic of the application consists of the mapping between graphs and data types. This makes it easy to deploy new data types since only a configuration update in the database and in the CloverETL Server is required.

Communication between Components

As can be seen from Figure 2, there are multiple interfaces between different components in the application. Communication between users and the web server is done through the HTTP protocol. The web front-end itself is written in PHP with AJAX parts for some additional functionality. The output which is sent from CloverETL Server to the directory with XLS files uses simple disk-based access: the target directory is simply mounted on the AS/400 server. Since CloverETL Server is a Java application, it is natural to use JDBC for communication between the database and CloverETL. This choice allows for relatively painless migration to a different database provider: a simple driver change in the configuration is usually enough.
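The "driver change in the configuration" idea can be illustrated outside of Java as well. JDBC itself is a Java API; the sketch below uses Python's DB-API only as an analogy and is not how CloverETL is configured. The configuration keys (`driver_module`, `connect_args`) and the `customers` table are hypothetical, and real drivers differ in the arguments their `connect` function accepts.

```python
import importlib
from typing import Any, Dict


def open_connection(config: Dict[str, Any]):
    """Open a DB-API connection using the driver module named in the config."""
    # The driver is chosen by name at run time, so switching databases
    # means editing the configuration rather than the calling code.
    driver = importlib.import_module(config["driver_module"])
    return driver.connect(*config.get("connect_args", ()))


def count_customers(connection) -> int:
    """Code that depends only on the common cursor interface."""
    cursor = connection.cursor()
    cursor.execute("SELECT COUNT(*) FROM customers")
    return cursor.fetchone()[0]


# Hypothetical configuration; sqlite3 is used here only because it ships
# with Python and needs no server.
CONFIG = {"driver_module": "sqlite3", "connect_args": [":memory:"]}
```

`count_customers` never names a specific database; only `CONFIG` does, which mirrors the role the JDBC driver setting plays in the application described here.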

The most interesting interface is the one between the web server and CloverETL Server. There are a few requirements such an interface must meet before it can be used:

• It needs to be easily accessible from web applications.
• It needs to allow sending of parameters to the graphs to control the data transformations.
• It needs to be asynchronous, so that the user can send a request for data processing without having to actively wait for it to finish.
• It should provide a way of sending back the information about the status of the transformation (whether it succeeded or failed).
• It should provide a way of sending back the output of the transformation.

To fulfill the above requirements, CloverETL Server implements the Launch Service, which provides an easy-to-integrate interface designed specifically for use in web-based applications. The communication protocol is based on standard HTTP, which is the norm for communication between web applications. Each request, including its parameters, can be encoded as a URL which is decoded by the CloverETL Server. The response is sent back again via the HTTP protocol and can contain any data: simple CSV, XML or even an XLS document if required. Since the data is sent via HTTP, the calling web application can decide what to do with the response: it can be processed further or a file save dialog can be shown to the user.

Web Interface

The basic design of the web interface for our application was quite straightforward. Since its main use is maintenance of the master data, the main part of the user interface is a table which represents the data in the database. Of course, to make the life of the end user easier, the application used AJAX and extensive scripting to make it as comfortable as possible. Since our application required interactive data processing (for importing and exporting of the data), it had to be able to talk to CloverETL Server as well as to

the database which stored the configuration, user accounts and the data it was displaying. To connect to the database we used the functions provided directly by PHP. Since the application did not need to execute any complex queries, this did not prove to be much of a problem. The only problematic part was caused by incompatibilities between the SQL syntax of DB2 on AS/400 and that of the Linux version. This has been resolved by using a library which allowed us to switch the target database via a simple configuration change. To connect to CloverETL Server we used the Launch Service as described above. All the graphs which run in interactive mode need only a few parameters, which were easy to pass in the URL.

Performance

For any application which allows users to work directly with the data and run transformations in interactive mode, it is very important that these transformations run as fast as possible to minimize the time users have to wait for the operation to finish. Since the data sets we worked with were relatively large, it was important to keep the graphs as fast as possible. This has been accomplished by designing the graphs so that they do not perform any unnecessary operations, e.g. only sort when necessary, keep queries into the database to a minimum to reduce latency, and so on.

In the end, data transfers between the database and CloverETL as well as the browsers turned out to be a bottleneck for the application. For users, the most visible is the performance of the client application. Since the data layouts are quite complex (up to 250 columns), the resulting page with the data can be several megabytes even when only a hundred or so lines are displayed. This could be mitigated a bit by switching to a faster browser and with improvements in browser technology. We expect this to become less of a problem in the future, when our clients migrate to newer versions.
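As a small illustration of the "only sort when necessary" guideline from the Performance section, the Python sketch below checks in a single linear pass whether the rows are already ordered and falls back to a full sort only if they are not. It is a generic sketch, not code taken from the application's graphs; the function names are hypothetical.

```python
from typing import Any, Callable, List, Sequence, TypeVar

T = TypeVar("T")


def is_sorted(rows: Sequence[T], key: Callable[[T], Any]) -> bool:
    """Check in one linear pass whether rows are already ordered by key."""
    return all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))


def sort_if_needed(rows: Sequence[T], key: Callable[[T], Any]) -> List[T]:
    """Return rows ordered by key, sorting only when they are not already."""
    if is_sorted(rows, key):
        # Already ordered (e.g. the database returned them via ORDER BY),
        # so the O(n log n) sort is skipped.
        return list(rows)
    return sorted(rows, key=key)
```

The check costs one extra pass over the data, so it pays off mainly when the input is frequently pre-sorted, for instance when an upstream query already orders the rows.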

Conclusion

We have presented an improved multi-tier application architecture with a data integration layer. Such an architecture is suitable for applications which process large amounts of data in different formats. The separation of business logic from the presentation layer and from the data storage provides a flexibility which would be hard to achieve if the logic was coded directly in the front-end or back-end, as is usual for many similar applications. Even if an ETL tool is not used to implement the business logic, such separation is beneficial, since it allows for faster development, testing and deployment of the application as more teams can work in parallel.

We demonstrated the practicality of such an architecture by implementing and deploying a data management application for our customer. We have shown that the architecture allows for rapid application development and that, with the right core software choices, the whole application can be easily ported to a different platform.

Javlin

Javlin is a premier provider of data integration software and solutions. Its leading software platform, CloverETL, provides users the ability to manage data solutions such as integration, migration, cleansing, synchronization, consolidation, audit, Master Data Management and Data Warehousing. In addition to development of data integration products, Javlin offers software solutions, custom software development and data integration consulting services.

www.javlininc.com
www.javlin.eu

CloverETL

CloverETL software is platform independent and scalable with a smooth upgrade path. It is also easily embeddable thanks to its small footprint. The CloverETL OEM foundation program also provides a way to embed ETL in applications.

www.cloveretl.com