Running head: INFORMATION OVERLOAD A.K.A INFOGLUT

Information Overload a.k.a Infoglut

Julie L. Gill, Luisa M. Morales, Igor D. Pokryshevskiy


Abstract

The problem of infoglut is growing, and computer scientists and engineers need to find a solution. Deciding which information is good and which is bad, or unnecessary, is a difficult and complex problem. Our intelligent embedded system is a web portal that allows a user to access all of his or her social media, email, news, and blogs in one place from any computer. Using a system based on the concept of genetic algorithms, the portal learns and continuously refines favorite keywords and the relationships between the user’s preferences and actions, and displays only the most relevant information.

The Internet has granted us entry into a vast repository and exchange of information. The amount of information available is too large to measure. The problem is that we often cannot separate the information that may increase our knowledge and improve our lives from that which has no value and leads us astray from productive goals. This dilemma is known as infoglut: the mixing of quality information with irrelevant and time-wasting data. “It's not so much about technology as it is about deciding which information is of value and to whom, and then configuring the technology accordingly. It's bringing a human dimension back to an inhumane consequence of information technology” (Denning, 2006).

Infoglut increases the time one must spend sorting through extraneous information to find what one needs. Employees may spend enormous amounts of time searching through, deleting, or organizing emails, a mundane task made complicated by the sheer volume of data. According to an article in IEEE Spectrum, “Information workers typically receive 50 to 200 work-related e-mails daily” (Zeldez, 2009). While much of the information in these emails may be non-critical, typically all emails must be processed and answered, which imposes a heavy time cost for going through information that may not be critical or important to the employee’s job. Often, humans have neither the time nor the capability to read and process all of the information that is available, so a system is needed that reduces the amount of information to be processed. To avoid losing data while still surfacing what is important, an intelligent embedded software system is needed that filters data and holds irrelevant data aside. This system needs an intelligent algorithm that ensures that all relevant data reaches the user and that irrelevant information remains unseen.
The definition of relevant depends on the user, and the solution is to gather data and determine what the particular user considers relevant (each instance of the algorithm must be dedicated to, and accountable for, one user at a time, as all data the user seeks is relative). The system will be able to learn about the user and make future decisions based on the user’s preferences. Over time, it will hone its definition of relevant and “solicited” information, resulting in a highly personal and focused set of information for each user.

Google and other search engines attempted to improve the relevance of results to what a user seeks through optimized ranking of search results. Although such optimization has proven helpful, how "good" a user is at searching is often more important. Users often need to be intelligent and knowledgeable searchers if they seek information from respected sources rather than from those who paid for a top spot or have, over time, generated more clicks, leading to a higher position in the search results. In comparison, RSS feeds allow users who are not sophisticated web searchers to swiftly access data from reliable sources. RSS feeds are a step up from optimized search results because they sort the data you seek, the data comes from sources you have chosen, and they eliminate the hassle of pop-ups and frustrating advertisements served through search engines. However, keeping up with numerous online postings is difficult and overwhelming when each feed you subscribe to may produce 30 or more postings in a given day.

We believe neural networks and genetic algorithms will aid in the processing and sorting of data. “Neural networks come with panoply of various learning algorithms, both supervised and unsupervised, that help implement a nonlinear mapping between input and output variables” (Pedrycz, 2002).
A system designed to sift through the massive amounts of data put on the Internet and into the world every minute must be automated, to an extent, so that it learns what the user does and does not want to see. The system will be able to decide whether to display or hide the latest information based on the user’s previous viewing preferences. Given the volume of data that must be processed, artificial intelligence, in addition to computational intelligence, will aid in the labor-intensive process of making decisions about data. The system must learn as it runs in order to develop a figurative sense of what decisions it should make without the engineer (or user) needing to spell out each case, which saves time and improves functionality and reliability.

Genetic algorithms randomly generate "an initial set of solutions called a population. A set of genetic operators such as selection, crossover, and mutation, are...applied to the solutions [in order to] generate a new set of solutions for the next iteration called a generation" (Fang et al., 2008). Through the use of tree structures and population-based incremental learning, the genetic algorithm can produce a more accurate generation (or population) while continuously learning from previous generations and from the user as time passes and user interests change. Through the use of building blocks and arrays of strings, the genetic algorithm is able to select the "fittest" individuals within a generation of data and reproduce the information in a more compact and direct manner. Population-based incremental learning leads to greater efficiency and a more accurate representation of the data sought by the user. Genetic algorithms tend to lack a condensed way of sorting data, but through population-based incremental learning the system will continuously learn from its encounters with the user. Through arrays of strings, the continuous data flow will be represented in trees, with the information most important to each individual user pushed to the top while the rest trickles down the hierarchy.
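As a minimal sketch of the selection, crossover, and mutation loop described above (an illustration under stated assumptions, not the portal's actual implementation), the following Python evolves a bit string over a hypothetical keyword list, where each bit says whether a keyword is kept in the user's filter. The keyword list, the set of keywords the user engages with, the fitness function, and all parameter values are assumptions made for the example. It shows a plain genetic algorithm with elitism rather than population-based incremental learning.

```python
import random

# Hypothetical data: candidate keywords and the ones this user engages with.
KEYWORDS = ["software", "engineering", "sports", "recipes", "genetics", "celebrity"]
LIKED = {"software", "engineering", "genetics"}


def fitness(individual):
    """Count positions where the keep/drop bit agrees with the user's interests."""
    return sum(
        1
        for bit, word in zip(individual, KEYWORDS)
        if (bit == 1) == (word in LIKED)
    )


def select(population, rng):
    """Tournament selection: return the fitter of two random individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def crossover(parent_a, parent_b, rng):
    """Single-point crossover of two parents."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(individual, rng, rate=0.05):
    """Flip each bit with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def evolve(generations=40, size=20, seed=0):
    """Run the loop and return the fittest individual found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in KEYWORDS] for _ in range(size)]
    for _ in range(generations):
        best = max(population, key=fitness)  # elitism: carry the best forward
        children = [
            mutate(crossover(select(population, rng), select(population, rng), rng), rng)
            for _ in range(size - 1)
        ]
        population = [best] + children
    return max(population, key=fitness)
```

Calling `evolve()` returns a list of 0/1 flags aligned with `KEYWORDS`. Because the best individual is always carried into the next generation, the best fitness never decreases from one generation to the next.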

Our System Design (iWingNow Portal):

This section describes the outcome of creating an intelligent embedded system that classifies, analyzes, and prioritizes content under a scalable framework while addressing the challenges of extreme computing in order to efficiently counter the hampering effects of infoglut. Our proposed system functions as a portal connecting all of a given user’s service providers, including email, news, and social media, each of which is contained within a module on the main screen. The portal functions as an embedded system because all information generated is stored on a server in the cloud and accessed through any browser. In addition to being a platform-independent embedded system, it is scalable: its functionality can be accessed from any computer, or from a mobile device via dedicated mobile applications.

For the purposes of the remaining sections, “keyword” and “filter”, including all their respective derivations, are redefined as follows. “Keyword” represents the result of the semantic analysis of a given piece of content. That result is, in its totality, a simplified meaning of the analyzed content and NOT a stand-alone word. “Filter” is the process by which new content is compared to previously analyzed content in order to maintain a continuous, real-time hierarchy of displayed information, which is then pushed to each individual end user.

Using the framework of genetic algorithms and population-based incremental learning, the system develops an understanding of each user and of which data will appeal to him or her. Each module will have its own hierarchy of most popular keywords, which can be moved up or down in the tree structure depending on factors relevant to the particular module. The modules will analyze keywords against each other in order to classify and prioritize the relevant and desired information and, as the user interacts with
the filtered data, the system will develop an understanding of user preferences based on behavior, resulting in the generation or updating of a module’s hierarchy.

Extreme computing faces known issues with the speed, storage space, and immense energy that programs consume in order to run; this system will limit the space used, optimize speed, and make more efficient use of energy. The biggest load the system will experience is the analysis of new content against previously classified content. To decrease the energy used by the system and increase performance, the number of keywords for each module is limited, which keeps classification efficient.

The Algorithms Behind Each Module:

This section contains a detailed description of the learning module (algorithm) constructed from user behavior relative to content. Although not specifically cited, our system uses Google searches entered by the user to complement the generated hierarchies in an effort to maintain real-time, user-specific content. Content is only analyzed and classified when the frequency of a search and the time spent on it reach the threshold that allows for the generation of a new keyword.

Recently Viewed Pages:

Unlike a generic history display, the “Recently Viewed Pages” section keeps track of the pages where a user finds information and that he or she uses on a regular basis. To save and display the recently viewed sites, the system records the time that a user spends on a given webpage, the number of related links clicked, and how the site compares to the keywords of its linked modules. With this data, the system decides whether a given page is important and relevant enough to the user to display. For example, if a user spends fifteen minutes on PageX.com and twenty minutes on PageY.com, PageY.com will move up on the display module. However, if, during the fifteen minutes on
PageX.com, the user clicks on several related links and spends time on those pages, PageX.com will move up in the hierarchy. If a user spends ten minutes on PageZ.com several times in a day, PageZ.com will move higher than PageX.com and PageY.com, which were each visited only once for a longer period of time. In addition, the titles and keywords on a given page are compared to the keywords of its linked modules and, if there are more matches, the page will be given a higher priority.

Social Media:

Facebook, Twitter, and other social media services are also available in the portal. These social media sites can be linked to either personal or work emails as well as to blogs, news sites, and Google searches. Based on the keyword hierarchies of the linked modules, postings that meet a certain threshold of keyword matches will be displayed to the user. In addition, if the user interacts with a Facebook friend twenty times in a workday, that friend’s posts will get priority in the display module over posts that only meet the threshold of keyword matches. For example, if a user often searches for “software engineering” in Google and on Twitter, and person A posts a blurb about his latest software engineering endeavors, the post will move up in the hierarchy because of the keywords. However, if the user re-tweets the posts of person B five times in one day, person B’s post will move ahead of person A’s in the hierarchy and be displayed to the user first.

News Feeds:

The portal will contain a section designated for News, which will include three subsections: BLOGS, SOCIAL, and NEWS. The BLOGS portion will consist of blogs the user currently has, blogs the user follows, and, as time progresses, blogs the user may be interested in following. Personal blogs will fall under a separate hierarchy; the hierarchy will update based on blogs most
frequented. Blogs the user follows will fall under a hierarchy of their own, based on the frequency of the user’s visits to each blog, the time spent on each blog, and the frequency of updates made to each blog. For example, suppose the user follows yourblog.com, hisblog.com, and herblog.com. On a given day the user spends five, three, and ten minutes on each blog, respectively; however, the user visits each blog four, two, and one time per day, respectively. Based on this, yourblog.com would be at the top of the hierarchy, followed by herblog.com and hisblog.com; the blogs would then alternate in the hierarchy based on which blog is most up to date. Blogs that might be of interest to the user will be placed in the hierarchy based on keywords within the blogs currently being viewed and from Google searches.

The SOCIAL portion will consist of sites the user frequents and of which he or she is an active member. Its hierarchy will work the same way as the one for blogs followed by the user. Finally, the NEWS portion will consist of a section for sites frequented by the user (or of which the user is a member), which will use the algorithm in place for blogs followed by the user, and a section for sites the user may want to visit, which will be ranked the same way as blogs that might be of interest. NEWS is strictly limited to news sites such as CNN.com, Economist.com, wsj.com, BBC.com, and the like.

Email:
[Figure: email module flow. Labels: Emails (IMAP), Keyword heaps, Test (Pass/Fail), recommended pile, glut pile, User actions (Browsing, Manipulating, Sending via SMTP and IMAP), keyword heap changes, Submit.]

Our web portal also adds email functionality. This component is implemented by means of a delivery and filtering system. There will be an interface with two profiles, personal and work. Each profile can be associated with multiple email accounts (the user may have more than one personal or work email account). An email account can be linked with another module (social networking, news, Internet activity/searches) and the module’s respective keyword heap. The heaps linked to a specific email account are used as the basis for filtering its emails. The emails may be pushed to the portal using the IMAP protocol, so that they stay on the user’s mail server and can be manipulated without saving a copy outside of that server. As the emails arrive, they are filtered against the keyword heaps. Each email must pass one of two tests:

• the subject line matches at least one keyword
• the body of the email matches a certain number of keywords (determined by the changeable threshold variable)

If it passes either test, it will be delivered into the preferred pile. If it fails both tests, it will be delivered to a glut pile. These piles will appear as expandable folders or headings, with the preferred section being darker and more visible to the user. It is important to note that, in order to protect data from imperfect filtering, the filter must never delete unimportant emails on delivery, but rather allow the user to move emails to the preferred pile if he or she feels that the filter failed to detect an important email.

The user may browse and manipulate emails from the component on the website. All changes are submitted to the mail server regularly, just as with any mail client.

End User Experience:

The following section describes what a user will experience when first interacting with our system, followed by a detailed example based on the email module. To start, the user connects all accounts to the system and can then choose to modify a set of basic settings. The standard settings include links from his or her work email accounts to the news module, from personal email to the social media module, and from Google searches to news articles. Base keywords can be added by the user so that the system takes less time to learn; however, the system can also begin learning by picking out commonly used words from the modules, excluding simple connective words such as “the,” “is,” and “of,” and then refining the filters as the user interacts with the system. The user has the ability to veto, change, and add to the keywords and preferences that the system acquires and continuously learns.

[Figure: portal main screen. Modules: Twitter, Facebook, News Articles, Blogs, Recently Viewed Pages, Search the Internet, Work Email, Personal Email. Controls: Settings, Manage links to this module.]
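The cold-start step described in this section, picking out commonly used words while skipping simple connectives and then letting the user veto or add keywords, might be sketched as follows. This is a plain word-frequency illustration only: the stop-word list, the function names `bootstrap_keywords` and `apply_user_edits`, and the sample texts are assumptions, and the system's "keywords" are meant to come from semantic analysis rather than raw word counts.

```python
import re
from collections import Counter

# Assumed stop-word list; a real system would use a much longer one.
STOP_WORDS = {"the", "is", "of", "a", "an", "and", "to", "in", "for", "on"}


def bootstrap_keywords(texts, top_n=3):
    """Return the top_n most frequent non-stop-words across the given texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


def apply_user_edits(keywords, added=(), vetoed=()):
    """Drop keywords the user vetoed, then append any the user added."""
    vetoed_set = set(vetoed)
    kept = [k for k in keywords if k not in vetoed_set]
    return kept + [k for k in added if k not in kept]


if __name__ == "__main__":
    sample = [
        "The budget review is on Monday",
        "Budget review notes for the team",
        "Send the budget to finance",
    ]
    seed_keywords = bootstrap_keywords(sample, top_n=2)
    print(apply_user_edits(seed_keywords, added=["audit"], vetoed=["review"]))
```

Keeping the user's edits as a separate step means a veto always overrides whatever the frequency count suggested.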

For example, when a user receives an email, it must pass through a filter of linked keyword hierarchies. If the email account is linked to Facebook, then the words in the email will be compared to the structure of preferred keywords in the Facebook module as well as in the email module. If the email contains a certain number of the keywords, it is sent through to the user. Once the email reaches the user, the system monitors the user’s actions, which confirm or deny the correct placement of each keyword in the hierarchy. If the user immediately deletes the email and the keyword was near the top, the system will bump it down to a lower priority. Likewise, if a keyword was in the middle of the hierarchy and the user spends a lot of time on the particular piece of information, replies to the email, or stars it, the keyword will be given a higher priority.
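The two delivery tests from the Email section and the feedback step in this example could be sketched roughly as below. This is a sketch under assumptions rather than the system's implementation: keywords are treated as plain strings matched by substring, each keyword carries a numeric priority, and the class name `KeywordHierarchy`, the step size, and the default body threshold are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordHierarchy:
    """Hypothetical keyword store: higher priority means nearer the top."""
    priority: dict = field(default_factory=dict)

    def ranked(self):
        """Keywords ordered from highest to lowest priority."""
        return sorted(self.priority, key=self.priority.get, reverse=True)

    def adjust(self, keyword, delta):
        """Raise or lower a keyword's priority, never below zero."""
        self.priority[keyword] = max(0, self.priority.get(keyword, 0) + delta)


def matched_keywords(text, hierarchy):
    """Keywords from the hierarchy that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [k for k in hierarchy.priority if k.lower() in lowered]


def classify(subject, body, hierarchy, body_threshold=2):
    """Apply the two tests: any subject match, or enough body matches."""
    if matched_keywords(subject, hierarchy):
        return "preferred"
    if len(matched_keywords(body, hierarchy)) >= body_threshold:
        return "preferred"
    return "glut"  # held aside for the user, never deleted


def record_action(subject, body, hierarchy, action, step=1):
    """Demote matched keywords on delete; promote them on reply or star."""
    delta = {"delete": -step, "reply": step, "star": step}.get(action, 0)
    for keyword in matched_keywords(subject + " " + body, hierarchy):
        hierarchy.adjust(keyword, delta)
```

For instance, with a hierarchy built from `{"budget": 3, "deadline": 2, "lunch": 1}`, an email whose subject mentions "Budget" lands in the preferred pile, and deleting it immediately lowers the priority of "budget" by one step.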

References

Barstow, D. (1987). Artificial Intelligence and Software Engineering. Proceedings of the 9th International Conference on Software Engineering, 201-211. Accessed from http://delivery.acm.org/10.1145/50000/41786/p200-barstow.pdf?key1=41786&key2=4189929621&coll=ACM&dl=ACM&CFID=83071988&CFTOKEN=93933442

Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6, 182–197.

Denning, P. J. (2006). The profession of IT: Infoglut. Communications of the ACM, 15-19. Accessed from http://delivery.acm.org/10.1145/1140000/1139936/p15-denning.html?key1=1139936&key2=4889968621&coll=ACM&dl=ACM&CFID=80237519&CFTOKEN=33449546

Fang, H., Wang, Q., Tu, Y. C., & Horstemeyer, M. (2008). An Efficient Nondominated Sorting Method for Evolutionary Algorithms. Massachusetts Institute of Technology. Accessed from http://www.mitpressjournals.org/doi/abs/10.1162/evco.2008.16.3.355?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%3dncbi.nlm.nih.gov

Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Boston, MA: Addison Wesley.

Lueg, C., & Sam M. (2007). Users Dealing with Spam and Spam Filters: Some Observations and Recommendations. Proceedings of the 7th ACM SIGCHI New Zealand Chapter's International Conference on Computer-Human Interaction: Design Centered HCI, 67-72. Accessed from http://delivery.acm.org/10.1145/1280000/1278970/p67-lueg.pdf?key1=1278970&key2=6460039621&coll=ACM&dl=ACM&CFID=83073213&CFTOKEN=20697899

Pedrycz, W. (2002). Computational intelligence as an emerging paradigm of software engineering. SEKE, 7-14. Accessed from http://delivery.acm.org/10.1145/570000/568763/p7-pedrycz.pdf?key1=568763&key2=1660078621&coll=ACM&dl=ACM&CFID=80238591&CFTOKEN=93428931

Stergiou, C., & Siganos, D. Neural Networks. Imperial College London. Accessed from http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

Zeldez, N. (2009). How to Beat Information Overload. IEEE Spectrum. Accessed from http://spectrum.ieee.org/computing/it/how-to-beat-information-overload/0
