Knightsbridge Solutions

WHITE PAPER

Managing Big Data:
Building the Foundation for a Scalable ETL Environment

Citicorp Center, 500 West Madison, Suite 3100, Chicago, IL 60661 USA
Phone: 800.669.1555  Fax: 312.577.0228
www.knightsbridge.com

Managing Big Data: Building the Foundation for a Scalable ETL Environment

Can you, at this moment in time, imagine managing 500 terabytes of data? Or integrating billions of clicks from your web site with data from multiple channels and business units every day, maybe more than once a day? It seems like an extreme scenario, but it’s one that industry analysts uniformly predict organizations will be confronting within the next three to four years. The exact figures may vary slightly, but the consensus is solid: enterprises are going to be swamped with data, and they’re going to have to figure out how to manage it or risk being left behind.

A rapid increase in data volume is not the only challenge enterprises will face. End users want more information at more granular levels of detail, and they want flexible, integrated, timely access. The number of users and the number of queries are also growing dramatically. Additionally, organizations are placing more emphasis on high-value, data-intensive applications such as CRM. All of these developments pose problems for enterprise data management.

Fortunately, there is an effective answer to these problems: scalable data solutions, and more specifically, scalable ETL environments. Scalability is defined as the retention or improvement of an application’s performance, availability, and maintainability as data volumes increase. This paper will explore the three dimensions of scalability as they relate to ETL environments, and will suggest some techniques that IT organizations can use to ensure scalability in their own systems.
SCALABLE PERFORMANCE

In the case of performance, scalability implies the ability to increase performance practically without limit. Performance scalability as a concept is not new, but actually achieving it is becoming much more challenging because of the dramatic increases in data volume and complexity that enterprises are experiencing on every front. How long will users tolerate growing latency in the reporting provided to them? How long can enterprises keep adding hardware, installing new software, and tweaking or building new applications from scratch? The fact is, yesterday’s scalable solutions aren’t working in the new environment.

The extract, transform and load (ETL) environment poses some especially difficult scalability challenges because it is constantly changing to meet new requirements. Enterprises must tackle the scalability problem in their ETL environments in order to successfully confront increasing refresh frequencies, shrinking batch windows, increasing processing complexity, and growing data volumes. Without scalability in the ETL environment, scalability in the hardware or database layer becomes less effective.

In terms of implementing scalable data solutions, enterprises should adopt a "build it once and build it right" attitude. If an ETL environment is designed to be scalable from the start, the organization can avoid headaches later. Let’s consider a situation in which this is not the case, and the ETL environment is architected without consideration for scalability. The first generation of this solution will be fine until data volumes exceed capacity. At that point, the organization will be able to make the fairly easy move to a second-generation environment by upgrading hardware and purchasing additional software licenses. Once this solution is no longer sufficient, however, the enterprise will find it more difficult and costly to evolve to a third-generation solution, which usually involves custom programming and buying point solutions. Finally, once the third-generation solution has reached its limits, the enterprise will need to rebuild its ETL environment entirely, this time using scalable and parallel technologies. Clearly, enterprises can save time and money by implementing a scalable ETL environment from the very beginning.
Although a thorough discussion of techniques to ensure performance scalability is beyond the scope of this paper, following are some basic considerations for improving the performance of the ETL environment.

Knightsbridge Solutions LLC January 2002


Reduce data flow

Making reductions to the data flow for improved scalability is common sense. However, this technique is often overlooked in ETL applications. Some ways to reduce data flow:

Avoid scanning unnecessary data. Are there data sets or partitions that do not need to be scanned? Could the extra CPU/memory cost of passing unused rows or columns be avoided? Could the data be structured in columnar data sets so that only necessary columns would be read?

Apply data reduction steps of the ETL job flow as soon as possible. Could data be filtered during the extract processing instead of during the transformation processing? Could data be aggregated sooner in the processing stream, reducing the rows sent for subsequent processing?

Use tools to facilitate data reduction. Could a "change data capture" feature on a source database be used?
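The filter-early idea can be sketched in Python; the record layout, filter predicate, and data here are invented for illustration:

```python
# Sketch of "reduce data flow early": push the filter into the extract step
# rather than filtering after transformation. Layout and predicate assumed.

def extract(rows):
    # Hypothetical source: each row is (customer_id, region, amount).
    for row in rows:
        yield row

def extract_filtered(rows, region):
    # Same extract, but the filter runs during extraction, so downstream
    # stages never see out-of-scope rows.
    for row in rows:
        if row[1] == region:
            yield row

def transform(rows):
    # Placeholder transformation: normalize the region code.
    for cid, region, amount in rows:
        yield (cid, region.upper(), round(amount, 2))

source = [(1, "east", 10.5), (2, "west", 3.0), (3, "east", 7.25)]

# Naive: transform every row, then filter.
late = [r for r in transform(extract(source)) if r[1] == "EAST"]

# Better: filter first; transform touches only the rows that matter.
early = list(transform(extract_filtered(source, "east")))
```

Both routes produce the same result, but with the early filter the transformation stage processes fewer rows.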

Eliminate I/O

Whether implemented within the ETL application, or more likely by the ETL tool, these techniques avoid I/Os and reduce run times. The first three in-memory techniques are generally implemented within the database and often come "free" with the ETL tool that processes in the database. Pipelining is generally supported through an ETL tool.

In-memory caching. Will the I/Os generated from caching a lookup file/table be less than a join? Will the lookup file/table fit in memory–either OS buffers, tool buffers or database buffers?

Hash join versus sort-merge join. Will the join result fit in memory?

In-memory aggregates. Will the aggregation result fit in memory?

Pipelining. Can data be passed from one stage of ETL processing to the next? Does the data need to be landed to disk for reliability purposes?
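The in-memory caching technique above can be illustrated with a hypothetical example: a small dimension table loaded once into a dictionary replaces a per-row join (the table contents are invented):

```python
# Sketch of in-memory lookup caching: enrich fact rows via a hash lookup
# against a small dimension table instead of a sort-merge join.

# Small lookup table (e.g. a product dimension) loaded once into memory.
product_names = {101: "widget", 102: "gadget"}

facts = [(101, 3), (102, 1), (101, 2)]  # (product_id, quantity)

# Each fact row is enriched with a constant-time dictionary lookup,
# avoiding the I/O of sorting and merging both inputs.
enriched = [(pid, qty, product_names.get(pid, "unknown"))
            for pid, qty in facts]
```

The approach only pays off when the lookup table actually fits in memory, which is exactly the question posed above.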

Optimize common ETL processing

Certain ETL tools provide optimized implementations of common and processing-intensive operations.


Change data capture. Will processing only the records that have changed (been inserted or updated) be sufficient for all the ETL logic required before populating the target system?

Surrogate key optimization. Will the surrogate key logic supported within the tool allow the warehouse dimensions to change in the manner users need? Or does custom logic need to be written?

Incremental aggregates. Does the tool support optimizations for applying only deltas/changed data to aggregates?

Bulk loading. Does the bulk loading support the unit-of-work features required for recoverability?
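An incremental aggregate of the kind described can be sketched as follows. The totals and delta rows are invented, and a real implementation would persist the aggregate between cycles rather than hold it in a dictionary:

```python
# Sketch of an incremental aggregate: apply only the cycle's deltas to a
# persisted aggregate instead of recomputing from all history.
from collections import defaultdict

def apply_deltas(aggregate, deltas):
    """Merge (key, amount) delta rows into an existing aggregate."""
    out = defaultdict(float, aggregate)
    for key, amount in deltas:
        out[key] += amount
    return dict(out)

running = {"east": 100.0, "west": 40.0}        # prior cycle's totals
todays_deltas = [("east", 5.0), ("north", 2.5)]  # changed data only

running = apply_deltas(running, todays_deltas)
```

The work done is proportional to the size of the deltas, not the size of the history, which is the point of the optimization.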

Eliminate bottlenecks and balance resources

Hardware performance tuning is essential and is heavily dependent on the ETL application workload being run through the system or systems. There is no such thing as a perfectly tuned system, but there are several tuning techniques that get the system running close to optimal.

Eliminating configuration bottlenecks. Is the disk farm (I/Os) a bottleneck? Is the memory buffer allocation a bottleneck? Is the network a bottleneck? Is the CPU or operating system configuration a bottleneck? Is the ETL tool configuration causing a bottleneck?

Balancing resource utilization within applications. Does the hardware environment have the right mix of CPU/disk/memory/network for the ETL environment? Should hardware be added or subtracted to balance the application needs?

Provide rules to determine environment scalability

Without defined methods for measuring scalability, an IT organization will have difficulty quantifying the impact of improvements. Following are two common measurements of scalability within an ETL environment.


Linear speed-up. Linear speed-up is achieved when the run time of an ETL application drops by a factor of N given N times more resources.

Linear scale-up. Linear scale-up is achieved when run time holds constant as data volumes increase by a factor of N given N times more resources.
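The speed-up measurement can be reduced to a simple ratio; the run times below are invented for illustration:

```python
# Quantifying speed-up: how close did adding resources come to linear?

def speedup_efficiency(base_runtime, new_runtime, resource_factor):
    # Linear speed-up means run time falls by the same factor as resources
    # grow, giving an efficiency of 1.0; lower values mean sub-linear scaling.
    return (base_runtime / new_runtime) / resource_factor

# Hypothetical measurement: 8 hours on one node, 2.2 hours on four nodes.
eff = speedup_efficiency(8.0, 2.2, 4)   # ~0.91, i.e. 91% of linear
```

Tracking this ratio over successive hardware additions tells an IT organization whether its ETL environment is actually scaling or merely growing.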

Use data parallelism

Parallel processing is only slowly being adopted into ETL tools, although many enterprises are already making use of it without the help of tools. Following are some of the considerations for using data parallelism.

Application data parallelism. Is the ETL job stream too small to be partitioned to use parallelism? How will application logic such as joins and aggregates need to be changed to accommodate partitioned data? How many different partitioning schemes will be needed? How can the data be efficiently repartitioned? How will the data be gathered back into serial files?

Control and hardware parallelism. How will jobs running on different servers (a distributed environment) be controlled and managed? How will data be moved between different servers? How will failure recovery, monitoring and debugging be implemented?

Again, an exhaustive discussion of techniques to improve ETL performance is beyond the scope of this paper, but hopefully these suggestions will encourage IT organizations to explore opportunities for performance improvement in their own ETL environments.
SCALABLE AVAILABILITY

Availability hardly needs defining. An application is available if its end users can use it when they expect to be able to do so. End users and IT staff generally agree beforehand on the level of availability that is needed. IT must then design and operate applications so that the expected service level is actually delivered.

Let’s think of availability in different terms for a moment. Let’s assume that the end user defines availability as one helium balloon staying no less than 10 feet above the ground. If left alone, the balloon eventually descends as a result of helium molecules leaking
through the balloon’s skin or through the seal. To retain the contracted availability, someone periodically adds more helium to the balloon. The refilling process requires some time, however modest. The refilling must occur outside the contracted availability window. Therefore, it is probably unrealistic to expect 24-hour availability every day of the week and every week of the year. Most helium balloons fall to the floor within three or four days after having been filled. That means the availability contract can’t be met for more than about three days before being broken to make time for a refill.

What if the end user really needs 24x7x365 helium balloon availability? One way to provide it would be to add a second balloon into the system. While one is floating, the other could be pressurized. Because pressurizing requires substantially less time than the leakage that ultimately brings the balloon below the 10-foot threshold, complete 24x7x365 availability could relatively easily be met by two balloons. This system utilizes redundancy to provide availability. The waiting, pressurized balloon is the "hot spare."

Consider the consequences of redefining availability to a 20-foot threshold. We now pressurize the balloons with additional helium, which stretches the skin more and stresses the balloons more. The probability that a balloon will eventually fail (probably with a loud BANG) after a few refills increases. We must now track the number of refills a balloon has experienced to gauge its life against its expected age at the higher pressure and retire it long before it actually explodes. This process requires us to increase the redundancy by retaining at least one "cold spare" to add into the cycle of redundancy needed for full availability.

As the availability threshold rises, the stress on the system increases as well; it becomes more difficult and more expensive to maintain the expected availability. The difficulty and expense are closely correlated.
As the system becomes more stressed, additional vigilance is required to avoid unexpected outages. Additionally, as the stress goes up, the number of balloons in the system goes up, and the frequency of maintenance operations (pressurizing and replacing balloons) rises.


The helium-balloon metaphor models modern information technology systems pretty well. If left alone, balloons leak their helium and IT systems eventually fail. Technology systems can fail for any number of reasons, including power outages, disk failures, memory failures, software errors, and so on. Balloons require refills; IT systems require maintenance and administration. For example, upgrading or patching the operating system may require an outage. Reorganizing database storage structures may also require an outage.

Big-data systems are especially at risk of outage because they inherently involve large numbers of storage devices. Although the mean time between failures (MTBF) for a single device might be thousands of hours, the system-wide MTBF decreases dramatically when these MTBF probabilities are multiplied across hundreds or thousands of devices. Weekly disk drive failures are the norm rather than the exception in multiterabyte repositories. As with the helium balloons, redundancy can effectively diminish the ill effects of these failures. Redundancy is a frequently deployed tool in IT’s arsenal.

Particularly apropos is the concept that as scale increases, stress increases and, at certain stress points, special considerations are needed to deliver contracted availability. Therefore, even with a static availability requirement, if system scale increases, the cost of delivering that same availability increases. Availability scaling differs from performance scaling in that performance scaling can often be achieved with a linear or better-than-linear relationship between system scale and cost. Availability scales linearly within ranges; additional expense often accompanies the jump from one scale range to the next, eliminating the linearity of the scale/cost relationship at these transitions. Now that we’ve established that special considerations are called for in scalable big-data applications, let’s look specifically at availability in ETL systems.
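A back-of-the-envelope model shows why device failures become routine at scale. The MTBF figure and device count below are assumptions for illustration, not vendor data:

```python
# Model of the MTBF point above: with many devices, the chance that at
# least one fails in a given window grows quickly.

def p_any_failure(device_mtbf_hours, n_devices, window_hours):
    # Approximate each device's failure probability in the window as
    # window / MTBF (valid when the window is much smaller than the MTBF),
    # then compute 1 - P(no device fails).
    p_one = window_hours / device_mtbf_hours
    return 1 - (1 - p_one) ** n_devices

# Assumed 500,000-hour MTBF drives, one-week (168-hour) window.
single = p_any_failure(500_000, 1, 168)    # a small fraction of a percent
farm = p_any_failure(500_000, 2_000, 168)  # roughly even odds of a failure
```

Under these assumptions a single drive almost never fails in a given week, yet a 2,000-drive repository sees a failure in roughly half of all weeks, which is why redundancy becomes non-negotiable.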
End users generally do not interact directly with ETL applications. Despite this, ETL systems often must perform their functions within narrow batch windows. Availability of the ETL systems during these windows is critical to maintain contracted service levels for downstream systems, generally data warehouses. If the ETL process completes at 9 a.m. instead of 7 a.m., end users who report against the database at 8 a.m. may retrieve stale data and may further be confused by that same report showing conflicting information at 10 a.m.

What can be done to attain and retain contracted availability levels in big-data ETL applications? For ETL applications, availability means extracting data from source systems on time, transforming it on time and loading it on time. Some tips on attaining availability in ETL applications include:
Implement higher redundancy for persistent data stores

Most sophisticated ETL processes produce and maintain data stores during the transformation process. These stores fall into two categories: persistent and temporary. Persistent stores live from ETL cycle to cycle. Frequently, these stores contain information critical to computing aggregates or otherwise producing load-ready data in the next ETL cycle. Recomputing this persistent store from scratch can be extremely slow and expensive, if not impossible. If several months or years of transactional data are summarized in this persistent store, it may well be impossible to recreate the summary. Even if it were technically possible to resummarize the transactional data, it can often be very difficult to arrive at the same result. Therefore, it’s best just not to lose the persistent data store in the first place. Backups are clearly important, but the backup/restore cycle is generally too slow to accommodate regular disk outages. Disk redundancy (RAID 1 or RAID 5) is a far better solution. RAID 1 (full mirroring) can be superior to RAID 5 (parity-based redundancy) or its derivatives in terms of greatly reducing the probability of an outage, but RAID 1 comes at significant capital cost.
Implement less redundancy for temporary data repositories

ETL processes often produce transient or temporary result sets. These generally don’t have to be protected if they can be easily recreated. If the time to recreate one of these temporary results would be high enough to prevent the ETL process from fitting within its allotted batch window, RAID 5 storage should be utilized.
Separate transient data from stateful/persistent data

Because persistent stores require different service levels than transient stores, separating the two is essential to achieving high availability affordably.

Utilize cold spare servers

ETL applications rarely justify hot spare servers. The cost of operating a hot spare server in all but the largest distributed architectures is prohibitive. If an ETL system consists of eight or more servers, having an extra as a hot spare may be justifiable in that it represents about 11 percent of the total cost. On the other hand, if the ETL system consists of a single server, the hot spare represents 50 percent of the total server cost and would probably be less than one percent utilized. In large data centers with standardized server platforms, a cold spare with multiple potential personalities could act as a spare for more than one ETL application, spreading the cost across multiple systems. Thus, a cold spare is more easily justified but does require some degree of platform and configuration standardization.
Utilize off-line backups

Backups are hardly conceptually novel. Nevertheless, in a big-data setting, special consideration must be given to the amount of time required to perform a full or partial restore. A conceptually simple restore can prove quite difficult and time-consuming in practice if full backups are taken rarely and incremental backups taken frequently. A good archive management package is critical. Furthermore, when a surgical restore is called for, many partial backups can lead to a quicker restore than a single monolithic backup. Design the backup process primarily to allow smooth and quick restores. Don’t back up for backup’s sake; back up for restore’s sake.
Checkpoint at the right intervals

A complex ETL application should not occupy the full batch window. Some part of the window must remain in reserve to allow for recovery in case of fault. All portions of the ETL flow must then be designed to be recoverable within this reserve window. Such a strategy allows the ETL process to fit within the batch window even if a single fault occurs. Checkpointing is an effective technique for engineering intermediate recovery points in a complex and long-running application.
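A minimal checkpoint/restart sketch follows. A real ETL tool's checkpointing is considerably more involved; the stage names and state-file format here are invented:

```python
# Sketch of stage-level checkpointing: record completed stages so a restart
# resumes from the last checkpoint instead of rerunning the whole flow.
import json
import os
import tempfile

STAGES = ["extract", "transform", "load"]

def run_flow(state_path, do_stage):
    """Run the ETL stages, skipping any already checkpointed as complete."""
    done = []
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)  # stages completed by a prior, failed run
    for stage in STAGES:
        if stage in done:
            continue
        do_stage(stage)
        done.append(stage)
        with open(state_path, "w") as f:
            json.dump(done, f)   # checkpoint after each completed stage
    return done

# Demo: the first run fails in "transform"; the restart skips "extract".
calls = []
state = os.path.join(tempfile.mkdtemp(), "etl_state.json")

def do_stage(stage):
    calls.append(stage)
    if stage == "transform" and calls.count("transform") == 1:
        raise RuntimeError("simulated mid-run fault")

try:
    run_flow(state, do_stage)
except RuntimeError:
    pass

run_flow(state, do_stage)  # restart resumes from the checkpoint
```

Only the failed stage and those after it rerun, which is what lets the flow recover within a reserve window instead of consuming a second full batch window.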


Clean transient data at the end of each cycle

Outages caused by exhausting free storage space for transient data are easily avoided by religiously removing transient data left behind by each ETL cycle.
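A cleanup step of this kind might look like the following sketch, assuming a working directory whose transient files share a naming pattern (both assumptions for illustration):

```python
# Sketch of end-of-cycle cleanup: remove transient files so they cannot
# accumulate and exhaust free storage space.
import glob
import os
import tempfile

def clean_transient(work_dir, pattern="tmp_*"):
    """Remove transient files left behind by the completed ETL cycle."""
    removed = []
    for path in glob.glob(os.path.join(work_dir, pattern)):
        os.remove(path)
        removed.append(os.path.basename(path))
    return sorted(removed)

# Demo with a throwaway directory: transient files go, persistent data stays.
work = tempfile.mkdtemp()
for name in ["tmp_stage1.dat", "tmp_stage2.dat", "summary.dat"]:
    open(os.path.join(work, name), "w").close()

removed = clean_transient(work)
```

Running such a step as the final job of every cycle, rather than ad hoc, is what makes the discipline "religious."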
Design the system such that no server is special

In a distributed server configuration, all servers should be as similarly configured as possible. If any is treated as the master or is, in some way, special, it will quickly become the weak link. Heterogeneous configurations can require multiple spares where a single cold spare would otherwise do.
Publish the freshness of the data in the database

In the unfortunate circumstance that an ETL cycle exceeds its allotted batch window, end users may access the database being loaded by the ETL cycle and assume the cycle completed successfully and then proceed to incorrectly interpret the data retrieved from the database. Publishing the freshness of the data in the database, preferably on every report or in the OLAP tool, avoids confusion.
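One way to publish freshness is a small metadata table that reports and OLAP tools can read. The table name and schema below are invented, with SQLite standing in for the warehouse database:

```python
# Sketch of publishing data freshness: a metadata table records the business
# date of the most recent successful load for each subject area.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE load_status (subject TEXT PRIMARY KEY, as_of TEXT)")

def publish_freshness(conn, subject, as_of):
    # Upsert the business date of the data the ETL cycle just loaded.
    conn.execute(
        "INSERT OR REPLACE INTO load_status (subject, as_of) VALUES (?, ?)",
        (subject, as_of),
    )
    conn.commit()

# After a successful load, record how fresh the published data is.
publish_freshness(conn, "sales", "2002-01-14")
as_of, = conn.execute(
    "SELECT as_of FROM load_status WHERE subject = 'sales'").fetchone()
```

A report footer that prints this `as_of` value tells users at a glance whether they are looking at stale data.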
Make data loads atomic

The load phase of an ETL process should be atomic. Either all the data is published or none is. This avoids end users reporting against a partial refresh in situations when the ETL cycle exceeds its allotted batch window.
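An atomic publish can be sketched as a staging-table swap: load into a staging table, then swap it into place in one transaction, so readers see either the old data or all of the new data. SQLite stands in for the warehouse database, and the table names are invented:

```python
# Sketch of an atomic load via table swap: the new data becomes visible
# all at once, never partially.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('east', 100.0)")

# Load the full refresh into a staging table while readers still see
# the old data.
conn.execute("CREATE TABLE sales_staging (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_staging VALUES (?, ?)",
                 [("east", 105.0), ("west", 40.0)])

# The swap is a single transaction: rename out the old table, rename in
# the new one. If the ETL cycle fails, the swap never runs and readers
# keep the old, consistent data.
with conn:
    conn.execute("ALTER TABLE sales RENAME TO sales_old")
    conn.execute("ALTER TABLE sales_staging RENAME TO sales")
    conn.execute("DROP TABLE sales_old")

rows = conn.execute(
    "SELECT region, amount FROM sales ORDER BY region").fetchall()
```

Commercial databases offer their own mechanisms (partition exchange, synonym switching), but the principle is the same: publication is a single, indivisible step.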
Monitor everything in the chain between E and L

An effective technique for restoring some of the linearity between system scale and cost of delivering a certain level of availability is to monitor the entire ETL chain thoroughly. Automate exception reporting and alert operations staff to impending problems. Scaling availability in a big-data ETL application ultimately comes down to good planning and design and to tremendous discipline. In a large-scale ETL system, an ounce of prevention can be far cheaper (read tens of thousands of dollars cheaper) than a pound of cure.


SCALABLE MAINTAINABILITY

In point of fact, “maintainability” is perhaps not the best word for the third scalability dimension, as it involves more than just software maintainability. This third dimension is primarily about low cost of ownership, and about keeping cost of ownership predictable over time. In short, a scalable ETL application’s cost scales predictably over time.

Over time, two aspects of ETL applications increase cost: increasing data volumes (and, consequently, workloads) and feature-set enhancement. Cost-of-ownership predictability in each of these areas implies different cost growth curves. As data volumes increase, a predictable cost curve is one where cost increases linearly or sub-linearly. In other words, if an ETL environment costs $300,000 per year to operate when the base data volume being processed is one terabyte, the cost should be no more than $600,000 per year when the volume increases to two terabytes. Cost predictability with regard to feature-set enhancement means two things. First, the cost of the enhancement itself should be commensurate with the enhancement. Second, once enhanced, the software’s cost of ownership should change primarily due to workload increases and not software maintenance needs.

Before detailing the primary cost drivers and how best to manage them into "predictability," we need to address the issue of measuring cost. Cost-of-ownership measurement is more art than science. Ownership cost includes several components: software development, testing and maintenance; capital expenditure; hardware expense; software licensing expense; data center operations; and others. Furthermore, cost depends on service level commitments on various quality-of-service (QoS) metrics. These include availability, performance, timeliness, accuracy, completeness, disaster recovery time and unplanned event rates.
This paper assumes that anyone trying to manage cost of ownership first implements consistent, systematic processes for measuring ownership cost and, second, adjusts these costs fairly with respect to service-level commitments. For example, increasing an application’s availability metric while expecting cost of ownership to remain unchanged is folly.

Following are various drivers that determine whether cost of ownership remains predictable over time. These drivers should be taken into consideration when attempting to improve cost-of-ownership scalability.
Application cost scalability depends on the hardware platform and hardware architecture

Assume that an ETL application costs $100,000 per year to own and is processing one terabyte of data per day. When the volume demand increases to two terabytes per day, the current platform is unable to scale to support such a workload. A new platform may have to be chosen, and the software may have to be rearchitected to take advantage of the more scalable hardware. These activities add dramatically to the ETL application’s cost of ownership for the current year.
Cost scalability depends on frequency of software redesign

If scaling an application requires the software to be redesigned, cost scalability suffers. Software redesign may be called for to better take advantage of hardware resources. For example, if an ETL application is written to run on a single server, and increased data volumes call for a clustered architecture, software redesign may be necessary to take advantage of the clustered hardware. Such a redesign effort would undoubtedly defeat the application’s cost scaling.
QoS degradation affects cost scaling

Certain QoS measurements naturally degrade as data volumes grow over time. For example, if an application’s data volume doubles, the number of disks required may also double. Twice the disks means twice the frequency of disk outages. Unless service levels are specifically maintained as volumes increase, the application ownership cost will increase. The hardware and software architecture must be designed to retain consistent service levels as the data volume increases.
Data center operational costs must scale

Ideally, an application’s operational costs stay constant as data volume increases.


Unfortunately, operational costs often increase with data volume and workload. For example, as an application grows and is scaled up on more servers and uses more storage devices, the operations staff often increases to support the extra hardware. The staffing level and hardware quantity can be decoupled if the proper hardware architecture is matched to a well-designed ETL software environment.
Software maintenance workloads must scale

As with data center operational costs, software maintenance cost should be decoupled from application workload and data volume. Often this decoupling fails as the application reaches stressful workloads or data volumes, putting the application back into the software development cycle and adding dramatically to its ownership cost.
Enterprises must take care not to let software enhancements deteriorate the software’s maintainability

Complex, scalable ETL software contains features that address more than just functionality. ETL software specifically addresses performance, scalability and other quality of service features. These myriad features lend significant complexity to the ETL software. This complexity calls for rigor in the original design and discipline during enhancements. Enhancements that diminish the ETL software’s ability to meet or exceed service levels result in a double whammy with respect to cost of ownership degradation. First, the software maintenance cost goes directly up. Second, the operational costs go up from having to implement artificial measures to restore service levels (for example, hiring additional operational staff and deploying additional hardware).
Operability must be designed into scalable ETL software

Data center operational costs are substantially affected by the operability of the software being managed. For example, ETL software that leaves behind few artifacts that might clutter storage or confuse operators is easier to manage than software that does otherwise. Operable software provides simple interfaces, has no side effects, integrates well with schedulers and operations consoles, and behaves predictably under stress.

Cost-of-ownership scalability depends on a variety of factors. Maintaining predictable cost over a long period requires that enterprises address all of the aforementioned issues.

CONCLUSION

Developing a scalable ETL environment is well worth the extra time and effort it requires. Enterprises can’t afford to ignore the massive data growth industry analysts are predicting for the near future. Those who take steps now to make their infrastructures more scalable will gain a competitive edge by having access to data that is detailed, timely, and highly available.

This paper was authored by:
Knightsbridge Solutions LLC
500 West Madison, Suite 3100
Chicago, IL 60661
Phone: (800) 669-1555
Fax: (312) 577-0228

For further information or additional copies, contact Knightsbridge at (312) 577-5258.

