Wikimaps: Dynamic Maps of Knowledge

Reto Kleeb, Northeastern University & MIT Center for Collective Intelligence (rkleeb@mit.edu)
Peter Gloor, MIT Center for Collective Intelligence (pgloor@mit.edu)
Keiichi Nemoto, Fuji Xerox (nemoto@mit.edu)

Wikipedia not only provides the digital world with a vast amount of high-quality information, it also opens new opportunities to investigate the processes behind the creation of its content and the relations between knowledge domains. The goal of our project is to create a dynamic Map of Knowledge that visualizes the evolution of links between articles in chosen subject areas. The project consists of three parts: (1) Wikimap, creating a visually appealing temporal map of the changes in content over the lifetime of Wikipedia; (2) Wikisearch, identifying the most prominent pages and links to be displayed for a chosen topic; and (3) Wikipulse, finding the most recent and most relevant changes and updates about a topic. Wikipedia holds a great amount of hidden information in the metadata surrounding the content pages: what (the edits made to the pages over time), who (which authors edited the pages), and how (which links to other Wikipedia pages and to outside Web pages are embedded in the content). Including this metadata in the visualization will contribute greatly to the knowledge experience provided by Wikipedia by offering networks of ideas, concept maps, and “Current Hot Topics” by intellectual field, societal issue, culture, etc. Moreover, studying and comparing the idea networks over time and across Wikipedias in different languages will contribute to a better understanding of different cultures through what they are passionate about. In addition, by constructing co-authorship networks we will be able to locate domain experts as well as trusted arbitrators, which gives us yet another dimension for analyzing Wikipedia content and weighing the importance of edits and articles. In the first phase of the project, we focus on creating dynamic visualizations of the article-link network.
To achieve a reasonable animation of the data, we addressed two sub-tasks: fetching the data needed to generate graphs of Wikipedia articles and their relations as close to real time as possible, and visualizing these graphs over chosen time periods. Wikipedia articles and their connections have been analyzed extensively in the past. Most of these projects, however, worked on static data sets based on the database dumps provided by the Wikimedia Foundation. Such a dump, which usually contains only the most recent revision of each article, is approximately 29.5 GB (uncompressed, English Wikipedia, 3.6 million articles) and can be handled with modest hardware. This dataset, however, does not include any historical data and does not allow the study of changes in the structure of articles over time. The Wikimedia Foundation also provides database files that include the historical data, but these files are currently around 5 TB, which makes them rather unwieldy to handle. Another factor that limits the usefulness of these dumps for our requirement to display data as close to real time as possible is that they are only provided about once a month. This makes it impossible to closely monitor the development of events that are currently in progress.
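To illustrate what studying changes in article structure over time involves, the following is a minimal sketch, not the project's actual pipeline. It assumes the revision history of an article is already available as (timestamp, wikitext) pairs, however it was fetched, and derives the links added and removed by each revision. The names extract_links and link_changes are hypothetical, and the regular expression is a simplification that ignores templates and skips every title containing a colon.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# only the target page title is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return the set of article titles linked from one revision's wikitext."""
    links = set()
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        # Assumption: any title with a colon is a namespaced link
        # (File:, Category:, ...) and is skipped. This also drops
        # legitimate article titles that contain a colon.
        if title and ":" not in title:
            links.add(title)
    return links


def link_changes(revisions):
    """Yield (timestamp, added, removed) for each revision, oldest first.

    `revisions` is an iterable of (timestamp, wikitext) pairs whose
    timestamps sort chronologically (e.g. ISO 8601 strings).
    """
    previous = set()
    for timestamp, wikitext in sorted(revisions):
        current = extract_links(wikitext)
        yield timestamp, current - previous, previous - current
        previous = current
```

Each yielded triple corresponds to one frame of a possible animation: the links that appear and the links that disappear at that point in the article's history.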

One of the goals of our project therefore was to find efficient ways to gather the data required for a historical analysis while keeping the data size reasonable and minimizing the number of requests to the Wikipedia API. The current version of the visualization only considers the article pages and the links between them; future versions of our visualization software will include other factors, such as information about the editors or the PageRank of the articles in the network. A second research question is how to find the most prominent pages about a chosen subject. While the search function provided by Wikipedia offers a good starting point for finding the most relevant pages, it does not return a more fine-grained semantic network of relevant articles. As a first approximation, we collected all pages linked from the returned search results, as well as the links pointing back to the search results from other pages. In the language of networks, this amounts to in- and out-degree centrality, which is a local metric. We are currently extending this with global filtering functions such as PageRank, betweenness centrality, and other metrics to construct a semantic network of the most relevant Wikipedia pages about a search topic. While this project is at an early stage, it builds on three years of research in our group studying Wikipedia co-authorship and edit networks, as well as on a vast body of research on Wikipedia authorship and content by a vibrant global research community. We have already been able to show that Wikipedians form long-lasting collaboration networks that result in high-quality output. We are convinced that including these and other results will help us build a new lens into the knowledge of mankind captured in Wikipedia, providing, we hope, yet another stepping stone towards more creativity and innovation.
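The first approximation described above, collecting the pages linked from and to the search results and comparing them by in- and out-degree, can be sketched as follows. This is an illustrative example under stated assumptions, not the project's implementation: the function name degree_ranking is hypothetical, the links are assumed to be already available as (source, target) title pairs, and ranking by the sum of in- and out-degree is one possible choice among several.

```python
from collections import defaultdict


def degree_ranking(seed_pages, links):
    """Rank pages around a set of search results by local degree.

    seed_pages: titles returned by the search (assumed input).
    links: iterable of (source, target) article-title pairs.
    Returns a list of (title, in_degree, out_degree), highest total first.
    """
    seeds = set(seed_pages)
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)

    # Deduplicate links, then keep only those that start or end at a seed:
    # links out of the search results and links pointing back to them.
    for source, target in set(links):
        if source in seeds or target in seeds:
            out_degree[source] += 1
            in_degree[target] += 1

    pages = set(in_degree) | set(out_degree) | seeds
    # Sort by total degree (descending), breaking ties alphabetically.
    ranked = sorted(pages, key=lambda p: (-(in_degree[p] + out_degree[p]), p))
    return [(p, in_degree[p], out_degree[p]) for p in ranked]
```

Because only links touching a search result are counted, the score is purely local; global measures such as PageRank or betweenness centrality would need the wider link graph.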
