
Business white paper

Transitioning to a new era of Human Information

Table of contents

Introduction
Why is Human Information different?
Structured data methods are not built to solve the Unstructured challenge
Big Data: it’s big, but is it clever?
Architecture of unstructured information
Using a single platform to understand 100% of Information
Human information in the cloud
Conclusion
About HP Autonomy
About HP

Introduction

As a business leader, you depend on vast amounts of human information to keep your business competitive, compliant, and working to its full potential. It drives almost every process, from sales to marketing to HR to analytics, and more. And your ability to leverage it represents the future of your business.

In the earlier days, the focus was on changing the “T” in IT, with the introduction of mainframes, client servers, IP, cloud computing, mobile computing, and social media. It did not, however, change how computers can interact with rich, human information.

All of the content generated by people, called “human information”, is made up of emails, videos, web conversations, call center conversations, social content, and more. This unstructured content makes up 90 percent of all the information we use every day and it is growing at a staggering 62 percent CAGR, roughly three times faster than structured data. And this is only becoming more complicated with the explosion of social media activity and the growing use of text-based interactions. Yet, it does represent the future of information computing and a fundamental shift in the way you and your business interact with information today.

The old method of finding answers in the rows and columns of a database takes too long, leaves gaps in knowledge, and often yields an incomplete result. For instance, when detectives try to uncover a crime, they look for incriminating emails. When marketers want to know more about their customers, they look to Twitter. Understanding human information is a fundamentally new approach that uses technology to deliver insight, ideas, and intuition into the rapidly growing and diverse data that we engage with every day.

Today’s expanding digital universe

Today, we operate in an expanding digital universe. Data from popular websites shows that the rate of data creation and consumption is increasing exponentially. And the growth of video content provides a vivid example. Since YouTube was founded in early 2005, the amount of data on the site, and in the digital universe in general, has grown extremely rapidly. Consider the following statistics:
• Users upload 60 hours of video every minute, or one hour of video every second.
• Over 4 billion videos are viewed a day
• In 2011, YouTube had more than 1 trillion views or almost 140 views for every person on Earth
• In 2010, the world exceeded a zettabyte of information (1 trillion gigabytes)
• The digital universe doubles in size every two years

Beyond its sheer size, unstructured information is where all the interesting, differentiating, and vital things happen.
• Unstructured data represents 90 percent of all information and is growing at 62 percent CAGR
• Structured information is growing at 22 percent CAGR

Yesterday’s information processing

Databases in the 1960s were maintained on computers that were not powerful enough to understand real-world, rich information the way humans could. To handle this, we had to make information machine-friendly, which contributed to a structured data world.

[Figure: example of a structured record]
Name: Smith
Address: 46 Aracia Avenue
Age: 34
Code: 00403034
Reference: FFRRFF55434

When processing power caught up, businesses started using the location of data to give the information more meaning. For instance, if a column in the inventory database represented the number of red bicycles in the warehouse, when the total of that column went to 0, the computer would automatically order more. This leap allowed us to derive more benefit from what computers had to offer, more than simply locating or retrieving information. And this changed how technology worked behind your information.

As we enter the new era of human information, we realize that data has characterized the information age from the start.

Why is Human Information different?

When we consider human information and its dominance in today’s enterprise, it is natural to first wonder how we can effectively search and find the information we need. Since computers have historically used databases to increase search efficiency, finding human information raises a number of new questions regarding how we can organize, process, and search it.

Human Information comes in two categories:
1. Unstructured Text Data. Unstructured text data includes content posted to blogs, news feeds, documents, IMs, SMS messages, and social media interactions such as those occurring via Twitter and Facebook.
2. Unstructured Rich Media. Unstructured rich media includes photos, videos, sound files, and other forms of information that by default do not have any text information on their subject beyond simple metadata.

Typically, people search or analyze data using an attribute such as the date a video was taken, who is in a photo, or whether a blog gives a positive view of a product. Using simple search methods, such as keywords, computers can send back every instance of a particular word or combination of words, but you will still have to sift through the results to find what you want. More recent methods of using rules, XML, federation, popularity ranking, and other basic functions are now used to improve the search process. But these methods still have limitations. Because these methods do not understand the meaning of the word “DOG,” you get results that contain the word.

To recognize the importance of understanding the concepts contained in information, you first need to understand the unique challenges posed by human information.
• Information is diverse. Human information is not limited to one file type or source. It represents all types of information and does not fit neatly into a structured database. It includes text in the form of emails, documents, social media, audio in the form of speech and sounds, video, and images.
• Information does not match exactly. When a user poses a query, information never matches exactly the way structured information would. There is a definitional problem when dealing with human information. The question “Is Snoopy a dog?” does not have a simple answer, as there are many ways to define Snoopy. You must take into account why he would or would not be considered a dog. For instance, if the answer is “No,” then Snoopy is not a dog, he is a cartoon character. The answer to a question is dependent on other pieces of information.
• Ideas do not match, they have a distance. No two ideas are exactly the same, but they have degrees of similarity based on how conceptually close they are to each other. Consider the description “low-drag wing design expert” versus “high-efficiency aerofoil designer.” These words do not match, but the ideas are conceptually “close.” In turn, a very different idea, such as safari animals, would be conceptually “distant.”
• Context is important. When analyzing human information, the context must be understood to grasp the meaning of the information. Even within the same phrase, a single word can have multiple meanings based on its intent, depending on where it appears. Distances between ideas change with the context around them.
• Meaning is relative. What something means is closely related to your own perspective, such as your social or cultural viewpoint. Two opposing cultural groups will view a set of results very differently. When the story “Clinton Arrives by Car to Meet the Chinese Premier, Drives Up in Black Lincoln” appears, the main point changes based on who reads it. For most people, the news is that Clinton has met with the Chinese Premier. For the subscribers to Limousine, Charter & Tour Magazine, the real news is that Clinton arrived in a black Lincoln. This demonstrates the relative nature of information.
• Meaning is multi-layered. Within the same set of phrases or words, there can be multiple levels or layers of meaning. We can see this principle best in poetry, where complex metaphors can run through a set of text, building on each other and adding depth.
• Meaning is dynamic. In the age of social media, new slang terms are continually emerging, and meaning changes over time and is subject to historical perspective. Take the tweet “Saw Red Riding Hood, the wicked wolf got boiled, it was really wicked.” The word “wicked” meant both bad and good in the same post. The ever-changing nature of a word’s meaning makes it especially difficult to understand and process human information without the ability to understand context.
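The “ideas do not match” point can be made concrete with a small sketch. It measures only literal word overlap between the two descriptions quoted above, which is what a plain keyword comparison sees. The function names and the choice of Jaccard overlap are illustrative assumptions, not a method described in this paper.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(a, b):
    """Jaccard overlap of the two token sets; 0.0 means no shared words."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# The two descriptions from the text: conceptually close, lexically disjoint.
first = "low-drag wing design expert"
second = "high-efficiency aerofoil designer"

overlap = keyword_overlap(first, second)
print(overlap)
```

A score of zero here says only that the two phrases share no words. It says nothing about how close the ideas are, which is the gap a conceptual approach is meant to close.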

For the first time in the history of the technology industry, it is the “I” in IT that is changing.

The ability to understand concepts

The leap forward in the ability to understand human information comes with conceptual search. When a computer can understand that the letters “D-O-G” mean a dog, man’s best friend, a Labrador, or an animal that likes to go for walks, the process becomes more human and the computer can do more of the work for us.

Yet the lack of structure in human information still makes the search process challenging for the simple reason that people search or analyze data using an attribute of the data, such as the date a recording was captured, the people pictured in a photo, or whether a website gives a positive review of a product. This requires some form of metadata (data about data) to be tagged to the item or generated on the fly as the item is saved. If no such metadata exists, you will have difficulty finding it, or may not be able to find it at all. For instance, how can software tell if a picture is of a yellow rose, a yellow Labrador named Rose, or a girl named Rose in a yellow dress? This issue is not an easy one to solve without human involvement.

Compounding these challenges, human information is often more difficult to manage than structured or semi-structured information in terms of size, organization, and availability.

Meaning Based Computing

The ability to derive meaning, spot patterns, “connect the dots,” and automate business processes is now possible using the technology developed by Autonomy and the power of Meaning-Based Computing (MBC). Autonomy, an HP Company, and a pioneer in the area of MBC, provides technology that allows you to derive insight, sentiment, and concepts from structured and unstructured human information to drive better enterprise decisions.

Autonomy’s core technology, its Intelligent Data Operating Layer (IDOL 10), combined with the latest advances in hardware and software, enables massive amounts of constantly created and updated unstructured, structured, and rich media to be analyzed in real time. Autonomy creates a framework for extracting the concepts from content to determine the meaning of information. Autonomy’s core technology, Autonomy IDOL, allows text to be searched and processed from databases, audio, video, text files, or click streams. It understands any type of unstructured information, including text, voice, audio and video, as well as structured application data, to give you the power to perform automatic operations such as hyperlinking, summarization, taxonomy generation, clustering, eduction, profiling, alerting, agents, and retrieval.

Revealing strong concepts in weak information

Autonomy takes a unique approach to leveraging the power of weak information. We use a theory that says the less frequently a unit of communication occurs, the more information it conveys. By using a larger amount of conceptually-related weak information to drive a search, you can yield more relevant results than a smaller amount of seemingly strong keywords.

For example, when you search for the word “penguin”, there is about an 85 percent chance of bringing back a document about the flightless bird. But your search may also return information on the Batman villain, the publishing house, and the hockey team, along with the bird. On the other hand, a group of weak terms like “a black and white flash jumped into the sea and appeared with a fish in its beak” paints a much more accurate picture. Although each word is much weaker, and does not even include the word “penguin,” together they offer much clearer information. In this case, the probability that this document is about the flightless bird is about 98 percent. But to understand what these words are describing, you have to understand their context.

[Figure: “A black and white flash jumped into the sea and appeared with a fish in its beak.” Is it a penguin?]

This approach is similar to understanding a conversation in a noisy room, where you can grasp the context of the discussion even when some of the words cannot be heard, or grasping the essence of a news article simply by skimming over the text.
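The statement that less frequent units of communication convey more information has a well-known formalization in Shannon's self-information, where an event with probability p carries -log2(p) bits. The sketch below illustrates that relationship only. The paper does not say which formula IDOL uses, so treat the link to Shannon as an assumption, and the probabilities as hypothetical values chosen for illustration.

```python
import math


def self_information_bits(probability):
    """Shannon self-information in bits: rarer events carry more information."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical relative frequencies, chosen only for illustration.
common_term = self_information_bits(0.5)    # a term seen in half of all documents
rare_term = self_information_bits(0.001)    # a term seen in one document in a thousand

print(common_term, rare_term)
```

Under this measure the rare term scores far higher than the common one, which matches the direction of the claim in the text even though the actual weighting scheme is not specified there.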

Structured data methods are not built to solve the unstructured challenge

The first attempts to process human information involved traditional structured data methods. Businesses have tried to apply these traditional approaches to a new generation of information, and the result has proven to have many limitations.

One of the key challenges of understanding human information is not only grasping its meaning, but doing so within a time frame that leaves the results actionable for decision making. Taking weeks to perform analytics can leave you managing via the “rear view mirror” instead of helping you to be agile enough to shift your course based on the latest incoming data.

Matching keywords

Using a structured database and keyword search approach to find information would require you to put all of the words in a document into a database, and then search within that database. This method falls short for a number of reasons. For example, consider a compliance officer at a large investment bank at Ford Motor Company that wants to receive an alert every time the keyword “ford” appears in a communication or document. Using a keyword approach would return every matching instance of the word, including a host of irrelevant data such as “When you get to the airport, look for the blue Ford.” Not only does this tactic result in valuable time lost for the compliance officer, it illustrates why matching keywords is not the same as understanding meaning and context.

Metadata: if only it were consistent

Metadata is essentially data about data that typically includes a list of keywords, title, author, language, and other attributes of the file. Before computers were able to understand the content itself, metadata offered a way to put data into defined categories, in an attempt to apply structure to unstructured content. Like many manual approaches, however, metadata falls short for a number of reasons.

A key issue with a metadata approach is that it is time-consuming. Who is going to create it? If you could create a wealth of reliable, perfectly applied metadata, this approach might be viable. As an example, for a half-hour television show, you might attach as few as ten pieces of metadata. Do these ten snippets of information provide an adequate description of the entire 30 minutes of rich content? Unfortunately, they do not, which makes metadata an ineffective method for dealing with unstructured information, particularly audio and video.

Implied metadata based on the properties of a file can be helpful. We often want to find content that people have searched for the most. For example, the number of links pointing at an internet page is a good measure of how many people find it useful, and confirms that it is sought-after content, though this does not help when you are searching for content related to your particular topic because it is still a metadata-based approach.
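The compliance-officer example in the keyword-matching discussion can be sketched in a few lines. The function below is a hypothetical alerting rule, not any product's API: it flags every message containing the keyword as a whole word. The first message is the sentence quoted in the text; the other two are invented for illustration.

```python
import re


def keyword_alerts(messages, keyword):
    """Return every message containing the keyword as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [message for message in messages if pattern.search(message)]


messages = [
    "When you get to the airport, look for the blue Ford.",  # quoted in the text
    "Quarterly filing for Ford Motor Company is attached.",  # hypothetical
    "Lunch at noon?",                                        # hypothetical
]

alerts = keyword_alerts(messages, "ford")
for message in alerts:
    print(message)
```

Both messages that mention “Ford” are flagged, including the one about a rental car. The rule cannot tell them apart because it matches characters, not meaning.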

“Autonomy’s unique Meaning-Based platform enables organizations to seamlessly incorporate untapped resources, such as phone recordings and emails, into their corporate strategy and benefit from a single point of access to all of their information.”
Keith Dawson, Research Analyst

Is applying metadata a priority?

While there are entire industries and careers dedicated to creating and managing information, the reality is that most people do not dedicate the time required to apply proper metadata. Correctly labelling and filing is a time-consuming, manual process that is often not a priority for today’s busy professionals. If the question is whether to take the time to create accurate metadata, or finish it quickly and spend time creating a new marketing program, which would you choose?

How should I tag this file?

Another shortfall of metadata is the high instance of spelling, grammar, and punctuation errors. The reality is that the majority of users do not check for basic errors, which means metadata does not conform to your company taxonomy. The result: files are improperly categorized.

Which word should I use?

When presented with the same piece of information, everyone forms their own understanding of what is most important, which means that when you look to describe an object, each person will start with, and place more value on, the feature that matters most to them. In reality, it is unlikely that two people would create a description using the exact words in the exact order. It would be impossible to insist that everyone use the same vocabulary to describe a file, because it would create an entirely homogeneous environment where it would be nearly impossible to distinguish between and find files.

Does everyone see it this way?

The creation of reliable metadata relies on the assumption that there is one correct way to categorize information and that everyone will agree on that approach. This approach is, however, highly subjective and inherently random, even within a corporation. For example, a document on tolerance of religious symbols in France would be tagged differently by a religious studies expert than by a cultural historian.

How then can an audio file be analyzed? How can Tweets be understood without the ability to understand 140 characters of abbreviations and slang written by teenagers, if you can only analyze the hash tags? And what can you derive from YouTube videos, if you can only understand their metadata?

MapReduce and Big Data analytics

When it comes to Big Data analytics, Hadoop is one option that has received an incredible amount of attention regarding its scalable architecture based on commodity hardware, flexible programming language support, and strong open source community support committed to its ongoing development.

To perform Big Data analytics, MapReduce provides a way to distribute a heavy processing load across a number of computers. MapReduce is a software framework that allows relatively fine-grained control of how calculations can be performed on large data sets via a large number of computers. MapReduce segments the problem into parallel components, and then processes the data in parallel, allowing you to spread the work of the map and reduction operations across multiple machines. The “Map” step takes the input, divides it into smaller sub-problems, and distributes them to numerous “worker” computers, which in turn process the smaller problems and return the results to the master computer. The “Reduce” step collects the answers to all the problems and combines them to provide the answer to the original problem. For example, you could layer analytics on top of MapReduce to detect fraud patterns, perform sentiment analysis in social media, perform large-scale text analytics on help desk calls, and improve product ranking by analyzing web logs.

Because of its highly technical and low-level programming interface, Hadoop is extremely flexible and friendly to developers but is not optimal for business analysts. In an enterprise business intelligence (BI) environment, Hadoop offers limited integration with existing BI tools, which makes it less user friendly in an everyday business environment. In addition, due to its batch-oriented nature, Hadoop cannot be deployed on its own as a real-time analytics solution.

Processing human information with MapReduce and Hadoop

MapReduce, however, does offer one way to process human information. MapReduce and Hadoop provide a way to process unstructured data by taking documents, breaking them down into words, and placing the words into a database. To compare this with basic keyword search, you can match words and count occurrences. However, this approach lacks the sophistication to perform advanced keyword search where more weight would be given to a word that appears in the title of an article.

Once again, the process removes the context of the words, and the meaning that was available when the words were organized in sentences. For example, if you read the sentence, “She’s a star,” it does not mean that “she” is a cosmic gas ball. However, processing words as standalone items removes their context and could imply this. In addition, audio and video files are not handled by MapReduce and Hadoop.
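The Map and Reduce steps described above can be sketched as a word count. This is a single-process illustration under stated assumptions: it does not use Hadoop or any real MapReduce framework API, the function names are hypothetical, and in a real cluster each call to the map function would run on a separate worker. The first document is the sentence from the text; the second is invented.

```python
from collections import Counter
from functools import reduce


def map_step(document):
    """Map: break one document into lowercase words and count them locally."""
    return Counter(document.lower().split())


def reduce_step(left, right):
    """Reduce: merge two partial word counts into one."""
    return left + right


documents = [
    "She's a star",                 # sentence used in the text
    "A star is a cosmic gas ball",  # hypothetical second document
]

# Each map_step call is independent, so a framework could run them in parallel.
partial_counts = [map_step(document) for document in documents]
totals = reduce(reduce_step, partial_counts, Counter())

print(totals["star"])
```

The combined count treats “star” in both documents as the same item. Once the words have been split apart, nothing in the counts records that one sentence was about a person and the other about astronomy, which is the loss of context the section describes.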

Big Data: it’s big, but is it clever?

What is Big Data?

Defining Big Data is a matter of perspective. Big Data comes up in a number of disciplines, and datasets are constantly growing in size because they are gathered more frequently. The most objective definition describes big data as: Datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to spot business trends, prevent disease, combat crime. Though a moving target, current limits are in the order of terabytes, exabytes and zettabytes of data.

In addition, pinpointing the size of Big Data is wholly dependent on the capabilities of your organization. If your organization is already set up to manage data in a scalable way, hundreds of terabytes is not an issue. For others, even a few hundred gigabytes may require an entirely new data management strategy. To truly define Big Data, one must not only consider technical details, but also the impact of Big Data on the technology landscape of our daily lives.

Big Data becomes truly powerful when all data can be analyzed, including the 90 percent of unstructured data. Ignoring unstructured information means only 10 percent of the problem is being addressed. Big Data only delivers on its promise if rich meaning can be extracted. Methods such as SQL, Hadoop, and In-Memory do not provide new answers. Applying the ability to understand all Big Data makes it possible to abstract the meaning of it. The ability to understand meaning changes the game and provides new insights into Big Data.

Structure, or lack thereof

Big Data is often discussed in the same breath as unstructured data, which is most of what is considered Big Data. The large volumes of structured and unstructured data that you create would require massive investments in hardware if implemented with traditional database systems.

Data volume

It is useful to think about the issue of structure separately from the issue of volume. In many ways, the data volume problem is an easier issue to deal with: to solve it, one can throw a lot of hardware and software (and therefore a lot of money) at it, realizing that the extent of the solution varies with the amount of the investment. There are three main issues with data volume:
• There Are a Lot of Individual Data Items. Many data sets today contain billions of data items, with some data sets containing over a trillion items. When you apply traditional methods to processing these large datasets, such as retrieving records from more than one table, it is virtually impossible to apply typical relational database techniques. For example, if one were to retrieve millions of rows of data from table A and millions of rows of data from table B and analyze the combined data set, the database would have to potentially sift through trillions of possible combinations of records. While analysis of this much data may seem daunting, it requires massive parallel software operations running on tens, hundreds, or thousands of servers.
• Each Data Item Can Be Big. For example, a single picture is much larger than the numbers and words stored in a typical database transaction. Unlike records for structured data, which are typically sized in the kilobytes, unstructured data ranges from the tens of kilobytes for web pages and email to megabytes for song files to gigabytes for high definition video. These requirement increases are in orders of magnitude relative to most structured data records and therefore require much more in terms of resources. There is a significant need to retrieve and serve fewer bits of non-text unstructured data, such as videos or images, which is often large in its own right, which is a separate challenge from aggregating and analyzing Big Data-sized amounts of information. For example, throughput in traditional disk-based systems may not be high enough to serve image and video, leading many sites that serve this type of data to turn to alternative technologies that typically leverage main or solid state memory.
• Data Items are Being Created Rapidly. The high rate at which data is created can also put high demands on system resources, particularly if transaction integrity is required. A traditional data store would typically have to retain an additional copy of data or only allow one process at a time to update a specific set of data in order to ensure the validity of the data update even if the software system fails. The former places high requirements on storage in the system as duplicate data is needed, while the latter slows down the system as processes that update can get delayed while they wait on each other for a particular piece of data.
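The “trillions of possible combinations” figure in the table A and table B example follows from simple multiplication: with no index or filter, every row of one table can pair with every row of the other. The row counts below are hypothetical stand-ins for “millions of rows”.

```python
def candidate_pairs(rows_a, rows_b):
    """Number of row pairs an unindexed join could have to consider."""
    return rows_a * rows_b


# Hypothetical sizes standing in for "millions of rows" in each table.
rows_in_table_a = 2_000_000
rows_in_table_b = 3_000_000

pairs = candidate_pairs(rows_in_table_a, rows_in_table_b)
print(f"{pairs:,}")
```

Real databases use indexes and join strategies to avoid examining every pair, so this is an upper bound on the work rather than a prediction of it.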

Architecture of unstructured information

Understanding the architectures used to manage unstructured information requires a look back at the origins of IT architecture. When faced with how to access and leverage structured information, most of the IT industry still takes a stovepipe approach. In this approach, each structured application has a set of accompanying information. For example, this DATA 1 connects to APP 1, this DATA 2 connects to APP 2, and so on. This information must be paired with the application because its meaning is conferred by its location within the application, such as column 3, row 4.

With structured information, information is not interchangeable, it only goes with the context it is found in. Every data set has its own connection to the application. This approach results in a segmented world with little to no overlaps across information or applications. Business Intelligence is cut across a few repositories, but not in a meaningful way. In a world of very structured information this provides some value, but to deal with the influx of unstructured information and derive meaning across separate silos, this approach falls short.

[Figure: Structured information (10% structured, 90% unstructured)]

The unstructured data enterprise

Methods of dealing with unstructured information began in the same way as structured. Emails would go with the email server, documents went with Documentum, and this tactic very rapidly became a whole lot more complex than the structured world.

In the unstructured world, a piece of information does not exist without any relationship to other pieces of information. When looking at one piece of data, it is necessary to look at lots of others. They may have varying distances or degrees of similarity, but are still linked. For instance, customer information may come in the form of call center calls, emails, tweets, or website comments. Data relevant to discovery may include voicemails, emails, documents, SMS messages or more.

Each application has a custom connector to access the information. This very quickly becomes a rat’s nest, as every application has a separate connection to every data type. As soon as any data type or source is changed, all the connections must also be changed. This is the same problem with operating systems: when one chip is changed, all of the software must be re-written.

[Figure: Single Access Layer (10% structured, 90% unstructured)]
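The “rat's nest” described above can be quantified with a rough count of connectors. When every application has a separate connection to every data type, the number of connectors is the product of the two. The single-layer count assumes each application and each data source connects once to a shared layer, which is a reading of the Single Access Layer figure rather than something the text states. The counts of applications and data types are hypothetical.

```python
def point_to_point_connectors(num_apps, num_data_types):
    """Every application keeps its own connector to every data type."""
    return num_apps * num_data_types


def single_layer_connectors(num_apps, num_data_types):
    """Assumed model: each application and each data type connects once to a shared layer."""
    return num_apps + num_data_types


# Hypothetical enterprise: 10 applications and 20 data types or sources.
apps, data_types = 10, 20

print(point_to_point_connectors(apps, data_types))
print(single_layer_connectors(apps, data_types))
```

Under these assumptions, changing one data type touches one connector per application in the point-to-point model, but only a single connector when a shared layer sits in between.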

Desktops. eliminating copying requirements. • The new platform combines Autonomy’s infrastructure software for automatically processing and understanding unstructured data with HP’s high-performance real-time analytics capabilities for extreme structured data. This new platform promises dramatic business impact. The IDOL 10 nextgeneration information platform is designed to understand and act on 100 percent of enterprise information in real time. and hand-off risks. IDOL 10 offers you a range of capabilities: • IDOL 10 provides a single processing layer for forming a conceptual. and join filtering. data re-segmentation. • You benefit from enhanced scalability and contraction of clusters greater than 20x faster in cloud. and physical deployments. images. there is more to the story. When you can take a meaning based approach to cloud computing. database statistics. The ability to automatically understand the meaning within all forms of data is the ingredient that provides a significant advantage to your organization. having the ability to manage it all in a secure private cloud gives you an unparalleled opportunity to leverage this information. messaging. emails. • Performance enhancements for the HP Analytics Platform include: sub-queries. life cycle management. storage costs. query optimization. as organizations can develop new applications that leverage the diversity and richness of human information combined with extreme structured data. and all sorts of information can all be continually analyzed together.Using a single platform to understand 100% of information Today’s approach: single layer access The solution to accessing and processing all structured and unstructured information is a single layer approach that goes across your entire enterprise—one system that is able to process both structured and unstructured information together. you can add a layer of intelligence that lets you answer all sorts of questions. phone calls. 
without requiring you to integrate any data. • The NoSQL interface provides a single processing layer for crosschannel analytics of structured and unstructured data. • Manage-in-Place technology indexes all data where it resides. Not Only SQL Extreme Structured Data Unstructured Data Connectors 12 . virtual. both inside and outside the enterprise. and real-time understanding of all forms of data. such as: • How many people are working on the same problem? • Who works with whom? • How many times has this been done before? • What segment of our customer base does this? Autonomy’s private cloud enables you to automatically recognize concepts and patterns in the billions of files that your cloud deployment ingests and indexes every day. corporate videos. Although most technology providers consider cloud computing nothing more than making data and applications accessible via the Internet. Human information in the cloud Realizing that unstructured information is the lifeblood of most companies. contextual.

Human information: The next evolution of IT.

Conclusion

This shift towards human information represents the biggest change in the IT industry, and is a once-in-a-generation opportunity. For the first time, it is the “I” in “IT” that is changing, not the “T.” While the majority of the IT industry has been built on only 10 percent of the digital universe, today’s world requires solutions that can handle the 10 percent and the 90 percent in equally efficient ways.

People do not live in rows and columns. Customers do not send you databases, but instead they call, email, and tweet. Policies, regulations, and governance practices are built on human meaning and intent, not on programming languages like SQL.

We can no longer rely on systems that can answer only the questions you already know to ask. Instead, we need systems that can help us uncover the “unknown unknowns.” While analytics on historical data can produce pretty charts, they do not give us the timely insights we need to run our businesses competitively. Now, it is possible to answer the questions you didn’t even know to ask. Today, you can link a customer call to their website activity, and to their entry in the database, and to their purchase history, all in real time. You can implement policies across emails, documents, voicemails, SMS messages, social media, and transaction histories, to not only flag non-compliant materials, but stop non-compliant posts or even transactions before they occur.

The day has arrived where humans no longer have to fit the machine. We have reversed this position, and put computers to work for us.

About HP Autonomy

HP Autonomy is a global leader in software that processes human information, or unstructured data, including social media, email, video, audio, text and web pages, etc. Autonomy’s powerful management and analytic tools for structured information together with its ability to extract meaning in real time from all forms of information, regardless of format, is a powerful tool for companies seeking to get the most out of their data. Autonomy’s product portfolio helps power companies through enterprise search analytics, business process management and OEM operations. Autonomy also offers information governance solutions in areas such as eDiscovery, content management and compliance, as well as marketing solutions that help companies grow revenue, such as web content management, online marketing optimization and rich media management. Please visit autonomy.com to find out more.

About HP

HP creates new possibilities for technology to have a meaningful impact on people, businesses, governments and society. The world’s largest technology company, HP brings together a portfolio that spans printing, personal computing, software, services and IT infrastructure to solve customer problems. More information about HP (NYSE: HPQ) is available at hp.com.

Get connected
hp.com/go/getconnected
Get the insider view on tech trends, alerts, and HP solutions for better business outcomes

autonomy.com
Copyright © 2013 Autonomy Inc., an HP Company. All rights reserved. Other trademarks are registered trademarks and the properties of their respective owners.
20130124_PI_WP_HP_Human_inFormation