MIT Sloan School of Management

MIT Sloan School Working Paper 4956-11

Semantic distances
for technology landscape visualization

Wei Lee Woon, Stuart Madnick

All rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission, provided that full credit including © notice is given to the source.

The electronic copy of this paper is available for download without charge from the
Social Science Research Network Electronic Paper Collection at:
http://ssrn.com/abstract=2338582

Working Paper CISL# 2011-10

November 2011

Composite Information Systems Laboratory (CISL)


Sloan School of Management, Room E62-422
Massachusetts Institute of Technology
Cambridge, MA 02142

Wei Lee Woon, Stuart Madnick

Masdar Institute of Science and Technology, Information Technology Program,
P.O. Box 54224, Abu Dhabi, U.A.E.

Sloan School of Management, M.I.T.,
E62-422, Cambridge MA, 02139, U.S.A.

wwoon@masdar.ac.ae, smadnick@mit.edu

Abstract

This paper presents a novel approach to the visualization of research domains in science and technology. The proposed methodology is based on the use of bibliometrics; i.e., analysis is conducted using information regarding trends and patterns of publication rather than the actual content. In particular, we explore the use of term co-occurrence frequencies as an indicator of semantic closeness between pairs of terms. To demonstrate the utility of this approach, a number of visualizations are generated for a collection of renewable energy related keywords. As these keywords are regarded as manifestations of the associated research topics, we contend that the proposed visualizations can be interpreted as representations of the underlying technology landscape.

Index Terms

Data Mining, Technology Forecasting, Clustering, Semantic Distance

I. INTRODUCTION

A. Technology mining

The planning and management of research and development activities is a challenging task that is further compounded by the large amounts of information which researchers and decision-makers are required to sift through. One difficult problem is the need to gain a broad understanding of the current state of research and of future scenarios, and to identify technologies with potential for growth which hence need to be emphasized. Information regarding past and current research is available from a variety of channels (examples of which include publication and patent databases). The task of extracting usable information from these sources, known as tech-mining [Porter, 2005], presents both a difficult challenge and a rich source of possibilities: on the one hand, sifting through these databases is time consuming and subjective, while on the other, they provide a rich source of data with which a well-informed and comprehensive research strategy may be formed.

There is already a significant body of research addressing this problem (for a good review, the reader is referred to [Porter, 2005, Porter, 2007, Losiewicz et al., 2000, Martino, 1993]). Topics studied include the identification of important researchers or research groups [Kostoff, 2001, Losiewicz et al., 2000], the study of research performance by country [de Miranda et al., 2006, Kim and Mee-Jean, 2007, Woon et al., 2011], the study of collaboration patterns [Anuradha et al., 2007, Chiu and Ho, 2007, Braun et al., 2000] and the prediction of future trends and developments [Smalheiser, 2001, Daim et al., 2005, Daim et al., 2006, Small, 2006].

July 19, 2011 DRAFT

B. Visualization of technology

Of particular relevance in the current context are studies which are centered on the visualization or clustering of research topics. There are many such studies in the literature and it would be impractical to list them all. However, we have identified the following general approaches:

1) Co-citation patterns [Small, 2006, Saka and Igami, 2007, Igami, 2008, Porter and Rafols, 2009, Upham and Small, 2010], in which individual articles are visualized, and the similarity between a pair of articles is quantified by the number of times they are cited in the same paper.
2) Citation networks [Kajikawa et al., 2007, Takeda et al., 2009, Takeda and Kajikawa, 2009, Kajikawa and Takeda, 2008], where inter-citation between any pair of documents is used to create a link. This frequently results in very densely interconnected networks of articles.
3) Term co-occurrence statistics [Ding et al., 2001, Porter, 2005, Janssens et al., 2006, Woon and Madnick, 2009]: the use of term co-occurrence in abstracts and/or full text to generate visualizations.
4) Co-authorship [Borner et al., 2005, Glanzel and Schubert, 2005, Morel et al., 2009]: this is yet another type of bibliometrics-based visualization technique. However, in general it has a slightly different purpose, which is to study collaboration patterns or to identify highly productive authors and groups.

While some of these techniques, in particular co-citation analysis, are known to be very effective at capturing the flow of information, they suffer from a number of disadvantages. Firstly, bibliometric records, which may include abstracts, author names and bibliographies for all of the articles to be visualized, have to be downloaded and processed locally (sometimes even the full texts of articles are required, for example in the case of [Janssens et al., 2006], in which the authors used optical character recognition techniques). This can be a time-consuming and expensive process if large collections of records need to be retrieved. For example, in [Kajikawa and Takeda, 2008], the bibliographic records of 79,705 articles were collected and analyzed (in the end, the authors only used a subset consisting of 62,745 maximally-connected articles). An alternative would be to use smaller document collections (e.g., 2,012 articles were analyzed in [Ding et al., 2001], while 938 full-text articles were used in [Janssens et al., 2006]). This helps to relieve some problems but means that, ultimately, less information is taken into account. In addition, using locally or internally compiled records might also lead to non-standard results because of variations in the way in which the records are parsed and analyzed.

Secondly, for methods that process individual articles (such as co-citation and inter-citation based methods), the resulting maps can contain a very large number of nodes, which results in issues of computational complexity as well as difficulty in interpreting the maps. For example, in some cases, a further stage is required where general topics or themes still need to be assigned to the clusters after they have been constructed. There are different ways of doing this: for example, in [Saka and Igami, 2007] experts are consulted, while in [Takeda et al., 2009, Kajikawa and Takeda, 2008] the cluster names are manually assigned based on the abstracts and titles of the articles contained in each cluster. Each of these issues means that these techniques can be computationally expensive, and scale very poorly to larger document collections.

While this brief overview may not be comprehensive, we can already make a number of pertinent observations. One is that the visualization and clustering of technological domains is certainly not a new topic; in fact it has been attracting considerable attention for some time, which is indicative of the importance of the problem. At the same time, it would appear that conventional methods each have their respective advantages and disadvantages and that, while it is difficult to conceive of a single perfect solution, there is still scope for further development in this area. We believe that the method proposed in this paper will help to address some of the issues which still exist in this domain.

C. Contributions and scope

An important motivation for attempting technology-mining is the possibility of gaining a better understanding of future developments and trends in a given field of research. This is a complex task that is composed of a number of closely inter-related components or activities. While there is no single authoritative classification, we present the following scheme, proposed in [Porter et al., 1991], to help focus our discussion:

Monitoring - Observing and keeping up with developments occurring in the environment, and which are relevant to the field of study [Kim and Mee-Jean, 2007, King, 2004].
Expert opinion - An important method for forecasting technological development is via intensive consultation with subject matter experts [Van Der Heijden, 2000].
Trend extrapolation - This involves the extrapolation of quantitative historical data into the future, often by fitting appropriate mathematical functions [Bengisu and Nekhili, 2006].
Modeling - It is sometimes possible to build causal models which not only allow future developments to be predicted, but also allow the interactions between these forecasts and the underlying variables or determinants to be better understood [Daim et al., 2005, Daim et al., 2006].
Scenarios - Forecasting via scenarios involves the identification of key events or occurrences which may determine the future evolution of technology [Mcdowall and Eames, 2006, Van Der Heijden, 2000].

In this context, the emphasis of the current study is on the first item, viz. technology monitoring, as the primary objective is to devise methods for monitoring, understanding and mapping the current state of technology. In particular, our aim is to develop novel approaches to visualize and understand the relationships between connected areas of science and technology. Towards this end, this paper will address the following objectives:

1) To devise a method for quantifying the degree of similarity between research areas.
2) To use the resulting distance measure to study the structure of the research landscape of the target domain. We are also interested in detecting and exploiting clusters of closely related topics.
3) To conduct a preliminary case study in renewable energy as a demonstration of the proposed approach.

D. Case study

To provide a suitable example on which to conduct our experiments and to anchor our discussions, a preliminary case study was conducted in the field of renewable energy.

The importance of energy to the continued well-being of society cannot be overstated, yet 87% (see footnote 1) of the world's energy requirements are fulfilled via the unsustainable burning of fossil fuels. A combination of environmental, supply and security problems compounds the problem further, making renewable energies such as wind power and solar energy one of the most important topics of research today.

An additional consideration was the incredible diversity of renewable energy research, which promises to be a rich and challenging problem domain on which to test our methods. Besides high-profile topics like solar cells and nuclear energy, renewable energy related research is also conducted in fields like molecular genetics and nanotechnology. It was this valuable combination of social importance and technical richness that motivated the choice of renewable energy as the subject of our case study.

II. METHODS AND DATA

In the following subsections, the methods used for both data collection and analysis will be discussed in some detail. The overall process is based on the following two stages:

1) Identification of an appropriate indicator of closeness (or distance) between terms which can be used to characterize the relationships between areas of research.
2) Use of this indicator to perform feature extraction on the data; the output could be in the form of intuitive visualizations or clusters.

A. Keyword distances

The key requirement for stage one is a method of evaluating the similarity or distance between two areas of research, represented by appropriate keyword pairs. Existing studies have used methods such as citation analysis [Saka and Igami, 2007, Small, 2006] and author/affiliation-based collaboration patterns [Zhu and Porter, 2002, Anuradha et al., 2007] to extract the relationships between researchers and research topics. However, these approaches only utilize information from a limited number of publications at a time, and often require that the text of relevant publications be stored locally (see [Zhu and Porter, 2002], for example). As such, extending their use to massive collections of hundreds of thousands or millions of documents would be computationally infeasible.

Instead, we choose to explore an alternative approach, which is to define the relationship between research areas in terms of correlations between the occurrences of related keywords in the academic literature. Simply stated, the appearance of a particular keyword pair in a large number of scientific publications implies a close relationship between the two keywords. Accordingly, by utilizing the co-occurrence frequencies between a representative collection of keywords, we seek to demonstrate that it is possible to infer the overall research landscape for a particular domain of research.

The Normalized Google Distance

In practice, exploiting this intuition is more complicated than might be expected, as it is not clear what the exact expression for this distance should be. Rather than screen a number of alternatives on an ad-hoc basis, can this distance be derived using a rigorous theoretical framework such as probability or information theory?

Footnote 1: Year 2005. Source: Energy Information Administration, DOE, US Government.

As it turns out, there is already a method which provides this solid theoretical foundation, and which exploits the same intuition. This method is known as the Google Distance [Cilibrasi and Vitanyi, 2006, Cilibrasi and Vitanyi, 2007], and is defined as:

$$\mathrm{NGD}(t_x, t_y) = \frac{\max\{\log n_x, \log n_y\} - \log n_{x,y}}{\log N - \min\{\log n_x, \log n_y\}}, \tag{1}$$

where NGD stands for the Normalized Google Distance, $t_x$ and $t_y$ are the two terms to be compared, $n_x$ and $n_y$ are the number of results returned by a Google search for each of the terms individually and $n_{x,y}$ is the number of results returned by a Google search for both of the terms. A detailed discussion of the theoretical underpinnings of this method is beyond the present scope but the general reasoning behind eq. (1) is quite intuitive, and is based on the normalized information distance:

$$\mathrm{NID}(x, y) = \frac{K(x, y) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}}, \tag{2}$$

where $x$ and $y$ are the two strings (or other data objects such as sequences, program source code, etc.) which are to be compared. $K(x)$ and $K(y)$ are the Kolmogorov complexities of the two strings individually, while $K(x, y)$ is the complexity of the combination of the two strings. The distance is hence a measure of the additional information which would be required to encode both strings $x$ and $y$ given an encoding of the shorter of the strings. The division by $\max\{K(x), K(y)\}$ serves as a normalization term which ensures that the final distance lies in the interval [0,1].

In the present context, the Kolmogorov complexity is substituted with the prefix code length, which is given by:

$$K(x, y) \rightarrow G(x, y) = \log\left(\frac{N}{n_{x,y}}\right), \tag{3}$$

$$K(x) \rightarrow G(x) = G(x, x). \tag{4}$$

In the above, $N$ is the size of the sample space for the Google distribution, and can be approximated by the total number of documents indexed by the search engine being used. Substituting (3) and (4) into (2) then leads to eq. (1).

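
The substitution can be spelled out step by step. The following is a worked sketch rather than part of the original derivation; it assumes a fixed logarithm base and writes $n_x$ for $n_{x,x}$, the hit count of a single term:

```latex
\begin{align*}
G(x) &= \log\frac{N}{n_x} = \log N - \log n_x, \qquad
G(x, y) = \log N - \log n_{x,y}, \\
\min\{G(x), G(y)\} &= \log N - \max\{\log n_x, \log n_y\}, \\
\max\{G(x), G(y)\} &= \log N - \min\{\log n_x, \log n_y\}, \\
\frac{G(x, y) - \min\{G(x), G(y)\}}{\max\{G(x), G(y)\}}
  &= \frac{\max\{\log n_x, \log n_y\} - \log n_{x,y}}
          {\log N - \min\{\log n_x, \log n_y\}}.
\end{align*}
```

The $\log N$ terms cancel in the numerator, which is why $N$ only appears in the denominator of eq. (1).
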
To adapt the framework above for use in technology mapping and visualization, we introduce these simple modifications:

1) Instead of a general Web search engine, the prefix code length will be measured using hit counts obtained from a scientific database such as Google Scholar or Web of Science.
2) $N$ is set to the number of hits returned in response to a search for renewable+energy, as this represents the size of the body of literature dealing with renewable energy technologies.
3) We are only interested in term co-occurrences which are within the context of renewable energy; as such, to calculate the co-occurrence frequency $n_{x,y}$ between terms $t_x$ and $t_y$, the search term renewable+energy+$t_x$+$t_y$ was submitted to the search engine.

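
Once the hit counts are available, eq. (1) reduces to a few logarithms. The following is a minimal sketch in Python, not the authors' implementation; the function name `ngd`, the treatment of zero counts and the hit counts in the example are all illustrative assumptions.

```python
import math

def ngd(n_x: int, n_y: int, n_xy: int, n_total: int) -> float:
    """Normalized Google Distance (eq. 1) computed from raw hit counts."""
    # Assumption: a pair that never co-occurs (or a term with no hits)
    # is treated as infinitely distant, since log(0) is undefined.
    if min(n_x, n_y, n_xy) <= 0:
        return math.inf
    log_x, log_y = math.log(n_x), math.log(n_y)
    numerator = max(log_x, log_y) - math.log(n_xy)
    denominator = math.log(n_total) - min(log_x, log_y)
    return numerator / denominator

# Hypothetical hit counts, for illustration only.
N = 1_000_000  # hits for the context query, e.g. renewable+energy
print(ngd(n_x=20_000, n_y=5_000, n_xy=2_500, n_total=N))
```

Because only ratios of logarithms are involved, the choice of logarithm base does not affect the result.
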
As explained in [Cilibrasi and Vitanyi, 2007], the motivation for the Google distance was to create an index which quantifies the semantic similarity between objects (words or phrases) in a way that reflects their usage patterns in society at large. By following the same line of reasoning, we can assume that term co-occurrence patterns in the academic literature would characterize the similarity between technology related keywords in terms of their usage patterns in the scientific and technical community.

This distance measure can now be used to calculate the distances between all pairs of keywords in the corpus, resulting in the following distance matrix $D$:

$$D = \begin{bmatrix} d_{1,1} & \cdots & d_{1,n} \\ \vdots & \ddots & \vdots \\ d_{n,1} & \cdots & d_{n,n} \end{bmatrix}, \tag{5}$$

where $d_{i,j}$ denotes the distance between keywords or terms $t_i$ and $t_j$.
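
As a small illustration of how such a matrix might be assembled, the sketch below fills a symmetric matrix from any pairwise distance function. The helper name `distance_matrix` and the toy distances are hypothetical; in practice the callback would wrap the hit-count queries described above.

```python
from itertools import combinations
from typing import Callable, List, Sequence

def distance_matrix(terms: Sequence[str],
                    pair_distance: Callable[[str, str], float]) -> List[List[float]]:
    """Build the symmetric matrix D of eq. (5); the diagonal is left at zero."""
    n = len(terms)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = pair_distance(terms[i], terms[j])
    return D

# Hypothetical pairwise distances standing in for search-engine queries.
toy = {frozenset(("coal", "petroleum")): 0.2,
       frozenset(("coal", "solar cell")): 0.7,
       frozenset(("petroleum", "solar cell")): 0.8}
terms = ["coal", "petroleum", "solar cell"]
D = distance_matrix(terms, lambda a, b: toy[frozenset((a, b))])
for row in D:
    print(row)
```

Only the upper triangle is computed, so each keyword pair is queried once.
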

Jaccard distance

While the NGD certainly seemed to be a good choice for the purposes of this research, it is only one particular interpretation of distance. Given the somewhat abstract nature of distances between keywords, and to try to reduce the arbitrariness of this choice, it was decided to conduct some brief experiments on one other distance measure.

The method which was chosen was the Jaccard distance. This is basically a version of the well-known cosine similarity measure, which is commonly used in information retrieval [Manning et al., 2008], specialized to deal with distances between sets of objects. It is defined as:

$$J(X, Y) = 1 - \frac{|X \cap Y|}{|X \cup Y|}, \tag{6}$$

where $X$ and $Y$ are the two sets to be compared. In the context of search results, this is easily adapted by setting:

$$|X \cap Y| = n_{x,y}$$

$$|X \cup Y| = n_x + n_y - n_{x,y}$$

where $n_x$, $n_y$ and $n_{x,y}$ are the hit counts obtained when searching for terms $t_x$, $t_y$, and ($t_x$ AND $t_y$) respectively.
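
In code, the adaptation of eq. (6) to hit counts is a one-line formula. The sketch below is illustrative only; the function name and the example counts are assumptions, not values from the study.

```python
def jaccard_distance(n_x: int, n_y: int, n_xy: int) -> float:
    """Jaccard distance (eq. 6) with set sizes replaced by hit counts."""
    union = n_x + n_y - n_xy          # |X union Y| by inclusion-exclusion
    if union <= 0:
        raise ValueError("at least one term must return hits")
    return 1.0 - n_xy / union

# Hypothetical hit counts, for illustration only.
print(jaccard_distance(n_x=20_000, n_y=5_000, n_xy=2_500))
```

Unlike the NGD, this measure needs no estimate of the total collection size $N$.
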

Having determined appropriate measures for inter-keyword distance, the next challenge is to investigate methods for converting matrix $D$ into useful representations of the data. This can be done in a variety of ways but for now the focus will be on clustering and visualization; these will be described briefly in the following section.


B. Data representations

Visualization

When dealing with high-dimensional or complex datasets, algorithms for visualizing the data in an intuitive way are extremely useful, serving as a source of valuable insight into the general structure of the data.

[Fig. 1, panels (A) and (B): Creation of hierarchical tree using the Neighbor-Joining method]

For our experiments, we used the popular hierarchical visualization algorithm proposed in [Saitou and Nei, 1987]. The algorithm produces the keyword hierarchy which provides the simplest explanation for the distances observed between the keywords. "Simplest" here means the tree with the smallest total branch length. Briefly, the algorithm proceeds in iterative fashion as follows:

Algorithm Neighbor-Join($T$, $D$)
Input: A term-set $T$ with the elements $t_1, \ldots, t_N$; a matrix $D$ with elements $d_{i,j}$ representing the distances between terms $t_i, t_j \in T$
Output: An unrooted tree visualization
1. Initialize the tree in a star topology as illustrated in fig. 1(a) (example depicted is of a five-keyword collection)
2. repeat
3.   for all pairs $t_i, t_j \in T$
4.     do evaluate $S_{i,j}$ and identify the pair $i, j = \arg\min_{i,j}(S_{i,j})$, where:

$$S_{i,j} = \frac{1}{2(|T|-2)} \sum_{k \neq i,j} (d_{i,k} + d_{j,k}) + \frac{1}{2} d_{i,j} + \frac{1}{|T|-2} \sum_{\substack{k<l \\ k,l \neq i,j}} d_{k,l}. \tag{7}$$

5.   Combine nodes $t_i$ and $t_j$ as shown in fig. 1.
6. until no node has more than three branches emanating from it.
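
The procedure can be sketched compactly in Python. This is an illustrative sketch, not the authors' code: instead of evaluating eq. (7) directly it uses the Q-criterion commonly found in neighbor-joining implementations, which selects the same pair as minimizing $S_{i,j}$. It records only the tree topology (branch lengths are omitted), and the function name and the five-leaf toy matrix are assumptions.

```python
def neighbor_join(names, D):
    """Return an unrooted tree as a tuple of subtrees (nested tuples of names)."""
    trees = dict(enumerate(names))            # active node id -> subtree
    d = {(i, j): D[i][j] for i in trees for j in trees if i != j}
    next_id = len(names)
    while len(trees) > 3:
        ids = list(trees)
        n = len(ids)
        # Row sums over the currently active nodes.
        r = {i: sum(d[i, k] for k in ids if k != i) for i in ids}
        # Q-criterion: its minimizer is also the pair that minimizes S_ij.
        i, j = min(((a, b) for a in ids for b in ids if a < b),
                   key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        # Distances from the new internal node to every other active node.
        for k in ids:
            if k not in (i, j):
                d[next_id, k] = d[k, next_id] = 0.5 * (d[i, k] + d[j, k] - d[i, j])
        trees[next_id] = (trees.pop(i), trees.pop(j))
        next_id += 1
    return tuple(trees.values())

# Hypothetical additive distances: A,B and D,E are close pairs, C sits between.
D_toy = [[0, 2, 3, 4, 4],
         [2, 0, 3, 4, 4],
         [3, 3, 0, 3, 3],
         [4, 4, 3, 0, 2],
         [4, 4, 3, 2, 0]]
tree = neighbor_join(list("ABCDE"), D_toy)
print(tree)
```

The loop stops once three subtrees remain, which corresponds to the condition that no node has more than three branches.
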

As an example, we consider the following collection of ten keywords which were highlighted as being high-growth areas in renewable energy [Kajikawa et al., 2007]: combustion, coal, battery, petroleum, fuel cell, wastewater, heat pump, engine, solar cell, power system.

Distance matrices generated using the Google Scholar (footnote 2) search engine were used to create a hierarchical visualization tree as described above. This is shown in fig. 2. For comparison, the visualization tree generated using the Scirus search engine (footnote 3) has also been included in fig. 3. Though only intended as a preliminary demonstration, we already see some interesting patterns:

1) Broadly speaking, the structures of the keyword trees seem logical in that keywords which seem related to similar areas of research have been placed in related branches.
2) Also, it can be seen that the two trees have almost identical structures. In both cases there are three main clusters; the first consists of {combustion, coal, petroleum}, the second {wastewater, heat pump}, while the third cluster consists of {battery, solar cell, power system, fuel cell}. The only real difference is that heat pump and wastewater are paired up in fig. 2 while in fig. 3 heat pump is an immediate ancestor of wastewater.
3) This is an important observation, as it supports the notion that the distance measure proposed has at least a certain degree of independence from the databases which were used to calculate it. This is not a given, as our observations have been that the results returned by these two search engines can vary a lot: in general Google Scholar returns a very much larger number of hits, and also includes patents in its searches. Manual inspection of the actual publications returned by the two search engines also indicated that the techniques used to index and sort these publications are likely to be very different, though detailed information about the ranking and selection procedures used is not available.
4) All three of these clusters appear to consist of topics which are closely related: clusters 1 and 3 are somewhat self-evident, while cluster 2 also makes sense as there is a significant amount of research in the use of heat pumps to reclaim heat from wastewater [Baek et al., 2005, Elnekave, 2008].
5) The keyword {engine} is seen to be somewhat isolated from the rest of the group.

[Fig. 2: Visualization tree for Kajikawa data (the three clusters referenced in the text are clearly labelled). Cluster 1: combustion, petroleum, coal. Cluster 2: heat pump, wastewater. Cluster 3: battery, solar cell, power system, fuel cell. The keyword engine lies outside the labelled clusters.]

[Fig. 3: Visualization tree for Kajikawa data, generated using the Scirus database (clusters are labelled). Leaf order: heat pump, combustion, coal, petroleum, wastewater, battery, solar cell, power system, fuel cell, engine.]

Footnote 2: http://scholar.google.com
Footnote 3: http://www.scirus.org

Clustering

Clustering is the process of dividing large sets of objects (in this case keywords) into smaller groups containing closely related terms; this is useful as these groupings could then be used to construct enriched keyword queries, organize the objects into topical hierarchies and perform various classification tasks.

This is an important operation in data mining and can be attempted in a number of ways; one of the most common methods is the k-means algorithm [Bishop, 2006]. This works by dividing the data into $k$ clusters, each anchored by a centroid vector representing the mean position of the cluster. A (locally) optimal clustering is found iteratively by alternating between:

1) Re-estimating the position of the centroids (by calculating the mean of the assigned vectors),
2) Revising the groupings by re-assigning data points to the clusters with the closest centroids.

In the present context there is a slight complication in that instead of data vectors, only the distance matrices are available. As such, instead of the regular k-means algorithm, the following modified algorithm, Matrix-k-means, is proposed:

Algorithm Matrix-k-means($T$, $D$, $k$)
Input: A term-set $T$; a matrix $D$ with elements $d_{i,j}$ representing the distances between terms $t_i, t_j \in T$; $k$, the number of clusters
Output: A clustering $c = [C_1, C_2, \ldots, C_k]$, where $\bigcup_{i=1}^{k} C_i = T$ and $\bigcap_{i=1}^{k} C_i = \emptyset$
1. Select random centroids $\hat{t}_1, \ldots, \hat{t}_k \in T$
2. $\hat{t} \leftarrow [\hat{t}_1, \ldots, \hat{t}_k]$
3. repeat
4.   $c \leftarrow [\{\hat{t}_1\}, \ldots, \{\hat{t}_k\}]$ $(= [C_1, C_2, \ldots, C_k])$
5.   $T' \leftarrow T \setminus \{\hat{t}\}$
6.   for $t_i \in T'$
7.     do $l \leftarrow \arg\min_j \{d_{i,j} : t_j \in \{\hat{t}\}\}$
8.        $C_l \leftarrow C_l + \{t_i\}$
9.   for $i \in [1, k]$
10.    do $j \leftarrow \arg\min_{j : t_j \in C_i} \sum_{t_l \in C_i} d_{j,l}$
11.       $\hat{t}_i \leftarrow t_j$
12. until termination criterion met
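
A runnable sketch of these steps is given below. It is an interpretation rather than the authors' code: clusters are rebuilt from the current centroids in every pass, each new centroid is taken to be the cluster member with the smallest summed distance to the other members, and the loop stops when the centroids no longer change. The termination rule, the `max_iter` cap, the `seed` parameter and the toy data are all assumptions.

```python
import random

def matrix_k_means(D, k, max_iter=100, seed=None):
    """Cluster item indices 0..n-1 using only the distance matrix D."""
    rng = random.Random(seed)
    n = len(D)
    centroids = rng.sample(range(n), k)       # step 1: random centroid terms
    for _ in range(max_iter):
        clusters = [[c] for c in centroids]   # each cluster starts at its centroid
        for i in range(n):
            if i in centroids:
                continue
            nearest = min(range(k), key=lambda j: D[i][centroids[j]])
            clusters[nearest].append(i)
        # New centroid: the member with the smallest summed distance to its cluster.
        updated = [min(C, key=lambda j: sum(D[j][m] for m in C)) for C in clusters]
        if updated == centroids:              # assumed termination criterion
            break
        centroids = updated
    return clusters

# Hypothetical data: four points on a line forming two obvious groups.
positions = [0, 1, 10, 11]
D_line = [[abs(p - q) for q in positions] for p in positions]
print(matrix_k_means(D_line, k=2, seed=0))
```

Because the centroids are always members of the term-set, every distance needed is a lookup in $D$; no coordinates are ever required.
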

The complexity of the algorithm is easy to determine as it is a simplified form of the standard k-means algorithm, consisting essentially of the assignment phase (no re-calculation of distances is required since the distances between all points are pre-calculated and only need to be looked up during execution of the algorithm). As such, the overall complexity is $O(ikn)$, where $i$ is the number of iterations, $k$ is the number of clusters, and $n$ is the number of points in the collection [Manning et al., 2008]. This is a modest requirement, as was borne out during our experiments.

However, we note that this is a greedy algorithm and hence there is a dependence on the initial choice of cluster centroids which, for larger collections, can make a significant difference to the final outcome of the iterations. As such, in practice, the algorithm above was run a number of times, then Dunn's validity index was used to select the best clustering. This is defined as:

$$D = \min_{\{i,j \,:\, i,j \in c,\, i \neq j\}} \left\{ \frac{d_{i,j}}{\max_{1 \leq k \leq n} \delta_k} \right\}, \tag{8}$$

where $d_{i,j}$ is the inter-cluster distance, defined as the mean distance between elements in clusters $i$ and $j$, $\delta_k$ is the intra-cluster distance, defined as the mean distance between all elements within cluster $k$, and $n$ is the number of clusters.
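
The index can be computed directly from the distance matrix. The sketch below follows the mean-distance definitions given in the text (rather than the minimum/maximum distances used in some other formulations of Dunn's index); the function name, the convention for singleton clusters and the toy data are assumptions.

```python
from itertools import combinations

def dunn_index(D, clusters):
    """Dunn-style validity index using mean inter- and intra-cluster distances."""
    def mean_between(A, B):
        return sum(D[a][b] for a in A for b in B) / (len(A) * len(B))

    def mean_within(C):
        pairs = list(combinations(C, 2))
        # Assumption: a singleton cluster has zero intra-cluster distance.
        return sum(D[a][b] for a, b in pairs) / len(pairs) if pairs else 0.0

    widest = max(mean_within(C) for C in clusters)
    closest = min(mean_between(A, B) for A, B in combinations(clusters, 2))
    return closest / widest if widest > 0 else float("inf")

# Hypothetical data: four points on a line forming two obvious groups.
positions = [0, 1, 10, 11]
D_line = [[abs(p - q) for q in positions] for p in positions]
good = dunn_index(D_line, [[0, 1], [2, 3]])
poor = dunn_index(D_line, [[0, 2], [1, 3]])
print(good, poor)
```

Higher values indicate compact, well-separated clusters, so among repeated runs the clustering with the largest index would be retained.
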

As in the previous section, the modified k-means algorithm described above was applied to the ten keywords extracted from [Kajikawa et al., 2007]. Again, the Google and Scirus distances were generated as explained in section II-A and used to decompose the keywords into a number of smaller sets. The procedure was repeated 10 times and the best clustering was selected based on the Dunn index. The same clusters were obtained in both cases, and were as follows:

cluster 1: battery, fuel cell, solar cell, power system
cluster 2: heat pump
cluster 3: engine, combustion, petroleum, coal, wastewater

Comparing the results obtained here with the clusters labelled in figures 2 and 3, we see that the divisions of the keywords into categories are very similar. The only exceptions are that engine and wastewater have now been moved into the same cluster as combustion, petroleum and coal, while heat pump is now in its own cluster.

Topographic Mapping

As discussed in the preceding two subsections, the main visualization and clustering techniques used in this paper are the hierarchical clustering technique proposed in [Saitou and Nei, 1987], and the k-means algorithm. However, it is important to note that these are not the only methods which can be used for this purpose, and it is likely that other visualization techniques, for example, could be fruitfully deployed to analyze or emphasize other aspects of the data.

It is not within the scope of this paper to review all such techniques, but one class of methods which thus far has not been mentioned is topographic mapping techniques, which aim to find low-dimensional representations of the data which preserve the topographic ordering of objects (i.e. a distance-preserving transform).

One technique which falls into this category is Sammon Mapping [Sammon, 1969], which solves this by minimizing the differences between the actual inter-object distances, and the corresponding distances in the visualization space. This quantity is represented by a nonlinear cost function known as the Sammon Stress, defined as:

$$E_{ss} = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{(d_{ij} - \hat{d}_{ij})^2}{d_{ij}}, \tag{9}$$

where $E_{ss}$ is the Sammon stress, $d_{ij}$ is the distance between objects $i$ and $j$, and $\hat{d}_{ij}$ is the distance between the two corresponding points in the visualization space. The derivative of $E_{ss}$ w.r.t. the positions of the points in the visualization space can be calculated, allowing the stress function to be optimized via any non-linear optimization algorithm (examples presented here were generated using the Scaled Conjugate Gradients algorithm; see [Bishop, 1995] for more details).

[Fig. 4: Topographic mapping of Kajikawa data, using both Google and Scirus distances; panel (a) Google distance, panel (b) Scirus distance, with Clusters 1 to 3 marked in each panel. As this is a visualization space, the x and y axes have no real physical interpretations, which is why they are not labeled.]
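
To make the cost function concrete, the sketch below evaluates the Sammon stress of a candidate layout, using Sammon's original definition in which each squared error is divided by the corresponding target distance. It only scores a layout and does not perform the optimization; the function name and the example points are assumptions.

```python
import math
from itertools import combinations

def sammon_stress(D, points):
    """Sammon stress of a layout `points` against target distances D."""
    weighted_error = 0.0
    total = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = D[i][j]                                  # target distance (must be > 0)
        d_hat = math.dist(points[i], points[j])      # distance in the layout
        weighted_error += (d - d_hat) ** 2 / d
        total += d
    return weighted_error / total

# Hypothetical example: a 3-4-5 triangle can be embedded exactly in the plane.
D_tri = [[0, 3, 4],
         [3, 0, 5],
         [4, 5, 0]]
print(sammon_stress(D_tri, [(0, 0), (3, 0), (0, 4)]))   # exact layout
print(sammon_stress(D_tri, [(0, 0), (1, 0), (0, 1)]))   # distorted layout
```

An optimizer would adjust the coordinates in `points` to drive this value down; a layout that reproduces every distance exactly has zero stress.
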

To demonstrate how this could be used, we used Sammon Mapping to generate low-dimensional representations of the ten keywords from [Kajikawa et al., 2007], and the resulting maps are presented in Figures 4 (a) and (b) for the Google and Scirus distances respectively. Clusters 1 to 3 from the previous sections have been clearly marked on these figures and it can be seen that the inter-relationships between these keywords have been preserved in this alternative representation.

Finally, to facilitate comparisons and to provide some degree of validation of the hierarchical tree figures, annotated Sammon Maps of all the datasets have been included in Appendix B.

C. Data collection

As mentioned in section I-D, a more extensive case study on renewable energy technologies was conducted to evaluate the proposed techniques. The main data requirement was for a set of energy related keywords and a populated distance matrix containing the inter-keyword distances.

Energy related keywords were extracted using ISI's Web of Science database: a search for renewable+energy was submitted, and the matching publications were sorted according to citation frequency. The top 30 records were retained, then two separate groups of keywords were collected for use in our experiments: the first collection was obtained using the Author Keywords feature and the second collection was obtained using the Keyword Plus feature. The former is composed of keywords specified by the authors, while the latter consists of keywords extracted from the titles of linked publications (the complete lists of keywords are provided in Appendix A of this paper). In total, 59 author keywords were extracted while 133 terms were extracted using the Keyword Plus feature.


1
Once the keywords were collected, the distances discussed in section II-A could be calculated. Only hit counts from Google Scholar were used this time; the Scirus search engine was not used as there were many specialized terms in the collections for which Scirus returned no hits at all. A number of other alternatives were also considered, including the Web of Science, Inspec, Ingenta, Springer and IEEE databases; again, a preliminary survey indicated that very low numbers of hits, or none at all, were returned for a large proportion of the keyword pairs. There appeared to be two main reasons for this observation. Firstly, most of these search engines simply did not index a large enough collection to provide ample coverage of the more specialized keywords in the list. Secondly, not all of the search engines allowed full-text searches (the Web of Science database, for example, only allows searching by keywords or topics); while sufficient for literature searches and reviews, keyword searches simply did not provide sufficient data for our purposes.
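To make the hit-count calculation concrete, the sketch below evaluates the Normalized Google Distance in the form published by Cilibrasi and Vitanyi, which is assumed here to correspond to the definition in section II-A. The function name, the argument names and the treatment of zero counts are illustrative choices, and the size of the index (`total_docs`) has to be estimated by the user.

```python
import math


def normalized_google_distance(hits_x, hits_y, hits_xy, total_docs):
    """Distance between two terms computed from search hit counts.

    hits_x, hits_y: hits for each term on its own.
    hits_xy: hits for both terms together.
    total_docs: (estimated) number of documents indexed by the search engine.
    Smaller values indicate more closely related terms.
    """
    if hits_x == 0 or hits_y == 0 or hits_xy == 0:
        return math.inf  # no evidence that the terms ever co-occur
    log_x, log_y = math.log(hits_x), math.log(hits_y)
    numerator = max(log_x, log_y) - math.log(hits_xy)
    denominator = math.log(total_docs) - min(log_x, log_y)
    return numerator / denominator
```

Two terms that always appear together give a distance of zero, and the distance grows as the joint hit count shrinks relative to the individual counts. Returning infinity for zero counts is one way of handling the sparse coverage described above; how such pairs were actually treated is not specified here.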
III. RESULTS

The experiments described in the previous sections were performed on the two keyword collections. Some overall observations were:

1) As expected, an informal inspection of the search results confirmed that terms which were closely related had a large number of joint hits, while distantly related terms only appeared together in a small number of papers. For example, 14000 papers were found to contain the terms "natural gas" and "power generation", while only 484 hits were returned when a search for "natural gas" and "genomics" was conducted.
2) However, one problem which was encountered was the large number of highly generic keywords, such as "review", "chemicals" and "fuels" in the case of the author-defined keywords, and "liquid", "mechanisms", "metals", "cells" and "products" in the collection of plus keywords. Problems might arise as these terms tended to have a high degree of intersection with almost all other terms; for example, searching for "Review" and "natural gas" resulted in 21000 joint hits, and "Review" and "genomics" yielded 1610 joint hits. Depending on the type of data analysis technique used, these results could erroneously imply a high degree of similarity between "genomics" and "natural gas".
3) There were also some problems with data quality and consistency. As the data in the Google Scholar database is constantly evolving, it is not possible to ensure consistency of all the hit counts. In one specific case, we noticed that the number of publications which contained both "Trichoderma Reesei QM-9414" and "System" was actually greater than the hit count returned when a search for only "Trichoderma Reesei QM-9414" was conducted. It later turned out that this was due to the two searches being conducted on different days, and that in the intervening time additional publications containing the two terms had been indexed.
   Another example is the fact that the hit counts returned by Google Scholar are known to be approximations of the total number of relevant publications (as the user clicks through the results pages, the number reported gradually converges to the actual value). For instance, it was observed that the hit counts from searches over a range of years, conducted individually, did not add up to the total number of hits returned when the entire range of years was searched in a single query. Problems such as these arise because of the novel ways in which these databases are being used. It is hoped that because we are using aggregate data over a range of search terms, inconsistencies such as these will be averaged out.
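The first inconsistency described in item 3 can be detected mechanically, since a joint hit count should never exceed the hit count of either term searched on its own. The sketch below flags pairs that violate this bound so that the affected searches can be repeated; the function name, the dictionary layout and any counts passed to it are hypothetical.

```python
def find_inconsistent_pairs(single_hits, joint_hits):
    """Return pairs whose joint hit count exceeds a single-term hit count.

    single_hits: dict mapping term -> hit count for the term alone.
    joint_hits: dict mapping (term_a, term_b) -> hit count for both terms.
    Each result is (term_a, term_b, joint_count, upper_bound).
    """
    flagged = []
    for (term_a, term_b), joint in sorted(joint_hits.items()):
        upper_bound = min(single_hits[term_a], single_hits[term_b])
        if joint > upper_bound:
            flagged.append((term_a, term_b, joint, upper_bound))
    return flagged
```

A check of this kind only catches violations of the upper bound; it says nothing about the approximation error in the counts themselves, which the text hopes will average out.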
In the following subsections, the results obtained from carrying out the proposed analysis on the two sets of keywords are described in greater detail.

A. Author keywords

As mentioned previously, these are the keywords specified manually by the authors of publications (a full list of the 59 keywords in this collection is provided in Appendix A).

As in section II-B, we start by using the hierarchical visualization to obtain an overall view of the keyword inter-relationships. This is shown in fig. 5.

From the tree diagram, we can see that there is a definite clustered structure in the data. In some cases it is difficult to judge the validity of the clusterings, in particular in the case of general terms like "chemicals", "review" and "electricity". However, based on fig. 5, we can identify at least five major clusters. These have been clearly labelled in the figure and are:

C1: This is composed of the terms {thermal processing, thermal conversion, co-firing, alternative fuels, transesterification, sunflower oil, biodiesels, bio-fuels}. These terms are closely linked, and are representative of research efforts related to biodiesel processing.
C2: Consisting of the keywords {sugars, model plant, enzymatic digestion, populus, genome sequence, QTL, Arabidopsis, genomics, poplar, corn stover, pretreatment, hydrolysis}, this second cluster spans a selection of biotechnology applications relevant to renewable energy, in particular the production of biomass.
C3: This cluster contains the terms {CdTe, thin films, carbon nanotubes, CdS, adsorption, high efficiency}, all of which are associated with the manufacture of thin film solar cells.
C4: This cluster consists of {gasification, GASIFICATION, energy economy and management, fast pyrolysis, pyrolysis}, which are broadly related to the topic of gasification. The exception seems to be the node "energy economy and management", which seems a little out of place (however, it is a very generic term and could be related in a number of indirect ways).
Note also the occurrence of the terms "gasification" and "GASIFICATION"; both terms were present in the automatically scraped keyword lists and were retained as a useful example of dirty data, which illustrates the usefulness of grouping semantically similar words together as a means of removing redundancies.
C5: The final cluster consists of the keywords {review, investment, emissions, electricity, fuels, energy sources, energy efficiency, global warming, sustainable farming, least cost energy policies, landfill, energy policy}, and is a collection of policy-related research keywords.

Outside of these five clusters, the remaining terms also form a number of micro-clusters consisting of keyword pairs or triplets. The pairs {biomass, BIOMASS} and {renewable energy, RENEWABLE ENERGY} are further examples of the semantic matching phenomenon observed in cluster C4 earlier. Other keyword groupings which also appear reasonable include {natural gas, coal}, {gas engines, gas storage} and {review, investment, emissions}.
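The case-only duplicates noted above, such as {gasification, GASIFICATION} and {biomass, BIOMASS}, could alternatively be merged before any distances are computed. The sketch below groups keywords under a case-folded, whitespace-normalized key and reports every group with more than one spelling. The function name and the normalization rule are assumptions for illustration; in the experiments reported here the duplicates were deliberately left in the data.

```python
from collections import defaultdict


def group_case_variants(keywords):
    """Map each normalized keyword to its spellings, keeping only duplicates."""
    groups = defaultdict(list)
    for keyword in keywords:
        # Case-fold and collapse runs of whitespace to build the grouping key.
        key = " ".join(keyword.casefold().split())
        groups[key].append(keyword)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}
```

This only catches differences in case and spacing. Variants such as "biodiesel" and "biodiesels" would need stemming or the distance-based grouping discussed in the text.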
Finally, it must also be noted that there are some observations which cannot be explained immediately or in a straightforward manner. For example, there is no clear explanation for the positions of the keywords "biomass-fired power" and "inorganic material". It is still too early to speculate on the nature of these relationships, except to note that even as we proceed with guarded optimism, some degree of caution must be exercised when dealing with data that is automatically extracted from sources over which we have no control.

Fig. 5: Visualization tree for Author keyword data

Cluster#  Keywords
K1        energy conversion, Cdte, adsorption, high efficiency, Cds, Thin Films
K2        energy economy And management
K3        sugars, populus, pretreatment, Arabidopsis, QTL, co-firing, genomics, corn stover, poplar, hydrolysis
K4        model plant, enzymatic digestion, genome sequence
K5        energy balance
K6        ash deposits, inorganic material, biomass-fired power boilers
K7        transesterification, Gas Engines, bio-fuels, thermal conversion, thermal processing, carbon nanotubes, sunflower oil, Pyrolysis, fast pyrolysis
K8        natural gas, renewable energy, review, energy efficiency, investment, electricity, global warming, renewables, fuels, energy sources, energy policy, power generation, coal, emissions, renewable energy
K9        alternative fuel, biomass, gasification, biodiesel, gas storage, chemicals, GASIFICATION, BIOMASS
K10       sustainable farming and forestry, least-cost energy policies, landfill

TABLE I: Clusters generated automatically by applying the k-means algorithm to the author keywords data
Next, we study the keyword clusters generated using the k-means algorithm. The matrix-k-means algorithm (page 8) was used to automatically partition the author keyword collection into 10 categories. As described in section II-B, the clustering operation has an element of randomness; to reduce its effect, the operation was repeated a total of 60 times and the best clustering in terms of the Dunn index was selected. The clusters thus generated are presented in table I. In general we observed the following:

1) Broadly, the clusters generated in this way exhibited a structure that was similar to the groupings observed in the hierarchical tree visualization (to facilitate the following discussion, we have labelled the clusters derived using k-means as K1 to K10, to help distinguish the two sets of clusters).
2) Cluster K1 closely matches cluster C3; the only differences are that K1 includes "energy conversion" and lacks "carbon nanotubes", which appears in K7.
3) The combination of clusters K3 and K4 (biomass-related terms) was practically identical to cluster C2, with the only exception being the term "co-firing", which only appeared in K3; however, it is an ancestor of C2, which explains its appearance in this group.
4) It appears that a number of keywords relevant to biodiesel, biomass and gasification have become somewhat intermingled in clusters K7 and K9, though the emphasis in K7 seems to be on biodiesel, and K9 seems more focussed on gasification. This is not surprising given the broad overlaps between these three topics.
5) Finally, the combination of clusters K8 and K10 contains many policy-related issues, and closely matches the keywords found in C5.
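The restart-and-select step described above (60 runs, with the clustering that scores best on the Dunn index retained) can be sketched as follows. The Dunn index is taken here in its common form, the smallest distance between two items in different clusters divided by the largest distance between two items in the same cluster. The exact variant used in the experiments is not stated, so this form, the function names and the handling of degenerate clusterings are assumptions.

```python
def dunn_index(dist, labels):
    """Dunn index of a clustering, computed directly from a distance matrix.

    dist: symmetric matrix of pairwise distances.
    labels: labels[i] is the cluster assigned to item i.
    Higher values indicate compact, well-separated clusters.
    """
    min_between = float("inf")  # closest pair lying in different clusters
    max_within = 0.0            # widest pair lying in the same cluster
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                max_within = max(max_within, dist[i][j])
            else:
                min_between = min(min_between, dist[i][j])
    if min_between == float("inf"):
        return 0.0  # everything in one cluster: no separation to measure
    if max_within == 0.0:
        return float("inf")  # only singletons or zero-diameter clusters
    return min_between / max_within


def best_clustering(dist, candidate_labelings):
    """Pick the labeling with the highest Dunn index among repeated runs."""
    return max(candidate_labelings, key=lambda labels: dunn_index(dist, labels))
```

Here `candidate_labelings` would hold the outputs of the repeated k-means runs. Because the index depends on a single minimum and a single maximum, it is sensitive to outliers, which is one reason the selected clustering can still vary between experiments.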
As mentioned previously, alternative visualizations were also generated using the Sammon map and the Jaccard distance. To help maintain the readability and flow of this paper, these results have been included in Appendix B rather than in the main text.

As can be seen, the Sammon maps reveal structure which is similar to that obtained with the hierarchical visualizations (though it is quite difficult to see this clearly due to the large number of keywords). Similarly, for the k-means clusters generated using the Jaccard distances, similar themes were picked up, though the exact cluster memberships and orderings are different. However, there were some closely matching clusters; in particular, cluster K6 (biomass related) is identical to cluster JK5 in table III. Other similar clusters are K7 and JK7 (related to gasification), K1 and JK4 (related to thin film PV) and K3 and JK10 (biomass/waste to energy related).

It is important to note that the aim of this exercise is not to prove that the results are identical, but to show that they are not completely arbitrary and do not change completely when, for instance, a different distance measure is used.
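For reference, a Jaccard distance can be derived from the same three hit counts used for the main distance measure. The sketch below uses the textbook definition, one minus the ratio of joint hits to hits for either term; it is assumed, and not confirmed by this section, that this matches the variant used to produce the results in Appendix B. The function name is illustrative.

```python
def jaccard_distance(hits_x, hits_y, hits_xy):
    """Jaccard distance between two terms, computed from search hit counts.

    hits_x, hits_y: hits for each term on its own.
    hits_xy: hits for both terms together.
    """
    union = hits_x + hits_y - hits_xy  # documents containing either term
    if union <= 0:
        return 1.0  # no documents at all: treat the terms as unrelated
    return 1.0 - hits_xy / union
```

Unlike a logarithmic measure, this distance is linear in the counts, so a very common term paired with a rare one is pushed towards a distance of 1 even when the rare term almost always co-occurs with the common one.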
B. Keyword plus

Next, the set of key terms extracted using the Keyword Plus feature of the ISI Web of Science database was studied in the same way. For the hierarchical visualizations, it is not possible to present the entire tree diagram due to the large number of keywords (133 in this collection). Instead, it has been broken into two sub-trees, shown in figures 6 and 7 respectively. As in the previous section, the keyword tree indicated a clear clustered structure with a number of prominent, identifiable clusters, labelled as CP1 to CP7 (in the interest of brevity, we have been a little more selective this time around due to the larger number of keywords):

CP1: This cluster contained the following terms: {SP Strain ATCC-29133, Bidirectional Hydrogenase, Anabaena Variabilis, Anacystis Nidulans, Nitrogen Fixation}; these keywords are associated with bio-production of hydrogen using cyanobacterial strains.
CP2: Consisting of the following keywords: {Transgenic Poplar, Genetic Linkage Maps, RAPD Markers, Agrobacterium mediated transformation, Hybrid Poplar, Molecular Genetics, FIMI, Trichoderma Reesei Q, Corn stover, Wood, Fuels}, this second cluster contained terms related to research on the production of biomass.
CP3: This next collection of terms included the following: {Ruthenium Polypyridyl Complex, Sensitized Nanocrystalline TiO2, Metal Complexes, Differentiation, Nanocrystalline semiconductor films, water oxidation, CDS, Recombination, Sputtering deposition, Electrodes, Films, Grain Morphology, Adsorption}, all of which are relevant to solar cell production.
CP4: The fourth cluster comprised the following terms: {Herbaceous biomass, Lignin removal, Biomass conversion processes, Waste paper}, and is also linked to research on biomass.
CP5: This cluster consisted of {Synthesis gas, Devolatilization, Pulverized coal, Fluidized bed, Pyrolysis}, all of which are keywords related to gasification.
CP6: This was a very large cluster consisting of the following terms: {Fermi level equilibration, Charge-transfer dynamics, Gel electrolyte, Photoelectrochemical properties, Photoelectrochemical cells, Photoinduced electron transfer, TiO2 thin films, TiO2 films, Titanium dioxide films, Sensitizers, Chalcopyrite, CdTe, Solar-Cells, Dye}. All of these keywords are related to research in the field of solar cells.
CP7: Finally, the last cluster, which was focused on the area of biomass crops, contained the following keywords: {Photosystem II, Light interception, Open-top chambers, Canopy structure, Short rotation}.

Fig. 6: Visualization tree for the keyword plus data (set 1)

Fig. 7: Visualization tree for the keyword plus data (set 2)
Again, as in the previous set of keywords, the structure of the hierarchy grouped terms which were relevant to particular research issues in renewable energy. Also, there is a good correspondence between the clusters observed here and the clusters created from the author keywords collection. This is to be expected since these keywords were obtained from the same corpus of documents. That said, there were two notable exceptions:

1) C1 contains biodiesel-related terms, which do not seem to occur in the present clustering. On closer inspection, we see that this is because all of the biodiesel terms originated from one publication ([Antolín et al., 2002]), and the Web of Science entry for this paper does not have any keyword plus terms.
2) Cluster CP1 is related to hydrogen production using cyanobacteria, a subject which was not encountered when studying the author keywords. Again, it was discovered that these terms mostly originated from a single document ([Hansel and Lindblad, 1998]); this time, there were no author-defined keywords in the Web of Science record for this document.

Next, the k-means algorithm was used to cluster these keywords, and the resulting clusters are listed in table II.

In general, the results obtained with this second keyword collection have been less conclusive, in that it has been harder to find direct mappings between the k-means generated clusters and the clusters derived from the tree diagrams.

This was partly because the keyword plus collection was a lot larger. One result of this was that there was invariably more than one cluster devoted to each research topic. Having more keywords also meant that there were more degrees of freedom in the clustering process, making the final result a lot more variable. A further complication was that the keyword plus collection had been divided into two sets of terms to allow the visualization trees to fit onto a single page.

Nevertheless, the results still contained a great number of very informative clusters:

1) KP14 is nearly identical to CP1, which is associated with the production of hydrogen; the only term missing is "Anacystis Nidulans", which was assigned to KP12.
2) Clusters KP13 and KP16 are both related to solar cells and match the contents of CP3 and CP6 very closely.
3) In addition, the terms in KP16 are drawn from the field of nanotechnology, a field with a great many applications in renewable energy.
4) KP20 contains a collection of closely related keywords which are primarily related to biomass production using cellulosic materials (e.g. poplar); when compared with the hierarchical mappings, the same keywords appear to have been split between clusters CP2 and CP7, which unfortunately appear in separate trees.
5) Besides KP20, there were also a number of other clusters which were devoted to biomass. These included clusters KP4, KP6 and KP19.
Cluster#  Keywords
KP1       Elemental Sulfur, Chalcopyrite
KP2       Grain Morphology
KP3       Values, Products, Cds, Families, Energy, System, Fuel
KP4       Fimi, Trichoderma-Reesei Qm-9414, Diesel-Power-Plant, Active-Site, Cellulases, Synergism, Glycosyl Hydrolases
KP5       Hydrogen, Nickel, Electrodes, Fuel-Cell Vehicles, Hydrogen-Production
KP6       Chemical Heat Pipe, Lime Pretreatment, Corn Stover, Pressure Cooking, Aqueous Ammonia, Lignocellulosic Materials, Recycled Percolation Process, Enzymatic-Hydrolysis, Herbaceous Biomass, Hydrogen-Peroxide, Lignin Removal
KP7       Cocombustion, Pulverized Coal
KP8       Coupled Electron-Transfer, Metal-Complexes, Water-Oxidation, Electrocatalytic Hydrogen Evolution, Photosystem-Ii, Photoproduction, Proton Reduction, Biomass Conversion Processes, Homogeneous Catalysis, Electron-Transfer, Photoinduced Electron-Transfer, Solar Furnace, Electrochemical Reduction
KP9       Composites, Infrastructure, Transport
KP10      Efficiency, Cells, Adsorption, Mechanisms, Films, Metals, Gasoline, Solar-Cells
KP11      Kinetics, Differentiation, Step Gene Replacement, Activation
KP12      16S Ribosomal-Rna, Anacystis-Nidulans, Agrobacterium-Mediated Transformation, Molecular-Genetics, Gene-Transfer, Mutagenesis
KP13      Oxidative Addition, Ruthenium Polypyridyl Complex, Nanocrystalline Semiconductor-Films, Sensitized Nanocrystalline Tio2, Fermi-Level Equilibration, Photoelectrochemical Cells, Tio2 Thin-Films, Tio2 Films, Photoelectrochemical Properties, Gel Electrolyte, Sensitizers, Titanium-Dioxide Films, Charge-Transfer Dynamics
KP14      Sp Strain Atcc-29133, Anabaena-Variabilis, Nitrogen-Fixation, Bidirectional Hydrogenase
KP15      Recombination, Cdte, Dye
KP16      Photonic Crystals, Functionalized Gold Nanoparticles, Place-Exchange-Reactions, Surface-Plasmon Resonance, Graphite Nanofibers, Walled Carbon Nanotubes
KP17      Physisorption, Monte-Carlo Simulations
KP18      Sediment, Sputtering Deposition Method, Seawater, Particles, Pores
KP19      Flash Pyrolysis, Gasification, Partial Oxidation, Ch4, Co2, Liquefaction, Fuels, Ignition, Liquid, Wheat-Straw Mixtures, Wood, Briquettes, Synthesis Gas, Devolatilization, Waste Paper, Hydrolysis, Liquids, Pyrolysis, Conversion, Combustion, Cellulose, Short-Rotation, Catalysts, Fluidized-Bed
KP20      Transgenic Poplar, Genetic-Linkage Maps, Hybrid Poplar, Rapd Markers, Light Interception, Open-Top Chambers, Canopy Structure

TABLE II: Clusters generated using K-means: keyword plus data

Finally, as in the case of the Author Keywords data, alternative visualizations have been generated, and these are included in Appendix B as a comparison.

IV. DISCUSSIONS

This paper presented a novel use of bibliometric techniques in the visualization of technology. It seems clear that methods such as the ones demonstrated here will be very useful to researchers seeking a better understanding of the key patterns and trends in research and technology. On the other hand, there are still many problems which will have to be solved before such techniques can be developed into tools usable by end-users in need of black-box technology visualization solutions. These problems include:

1) Inconsistent quality of data: data obtained from publicly available sources are often unregulated and noisy, which underscores the need for appropriate filtering and data cleaning mechanisms.
2) Non-uniform coverage: the number of hits returned for very general or high-profile keywords such as "energy" or "efficiency" was a lot greater than for more specialized topics. This is unfortunate as it is often the more specialized topics which are of greater interest to researchers. One way in which we hope to overcome this problem is by aggregating information from a larger variety of sources, examples of which include technical reports, patent databases and even mainstream media and blogs.
3) Inadequacy of existing data analysis tools: while we have tried, through the research presented here, to push the envelope on this front, the problems encountered are common to many application domains which deal with complex, high-dimensional data, and are the subject of much ongoing research besides our own. Problems related to the over-fitting of data, non-unique solutions and information loss resulting from dimensionality reduction are all symptoms of the inherent difficulty of this problem.

Another important consideration is the scalability of the method. Here, we consider two main aspects, the first of which is scalability with respect to the size of the source database. We believe that our method addresses this aspect very neatly, since all that is required is that the database provides a search interface. In most scenarios, such databases are curated by large organizations with very substantial resources (e.g. Google). It is reasonable to expect that these organizations would have put in place appropriate indexing mechanisms to allow the searches to be conducted efficiently. Using only search results to generate the distances shifts a large portion of the computational load to the providers of the database, which in turn allows us to leverage the optimizations or economy-of-scale advantages enjoyed by these providers. Even if locally stored databases are used, the proposed methodology could make use of the efficient search functionality built into these databases, rather than having to manually parse the contents of the entire database.

The second aspect is scalability with respect to the number of keywords to be analyzed. In the current methodology, a search needs to be performed for each pair of terms in the collection to be analyzed, as well as for each term individually. As such, for a collection with $n$ terms, $n(n-1)/2$ pairwise searches and $n$ single-term searches, i.e. $n(n+1)/2$ searches in total, will need to be conducted. Depending on the level of availability of the database, this can be time-consuming (also, some databases may only permit a certain number of searches a day). However, as demonstrated here, reasonably sized collections containing over a hundred keywords are quite easily analyzed.
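The search budget implied by this pairwise scheme can be tallied directly. The sketch below enumerates the queries needed for a keyword list, one per term plus one per unordered pair of terms. The function names and the quoted-phrase query format are hypothetical, since the actual query syntax submitted to the search engine is not given here.

```python
from itertools import combinations


def build_queries(keywords):
    """List every search needed to fill the distance matrix for `keywords`."""
    singles = [f'"{term}"' for term in keywords]
    pairs = [f'"{a}" "{b}"' for a, b in combinations(keywords, 2)]
    return singles + pairs


def search_count(n):
    """Number of searches for n terms: n singles plus n(n-1)/2 pairs."""
    return n * (n + 1) // 2
```

With this count, the 59 author keywords require 1770 searches and the 133 Keyword Plus terms require 8911, which gives a sense of how quickly a daily query limit would be reached.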
A related issue is the difficulty of updating the distance matrix dynamically when new terms need to be added. With the proposed method, adding each new term to the database will require $n + 1$ searches to be made; i.e. the term will need to be compared to all $n$ existing terms, and a search for the term in isolation will also need to be made. For moderately sized collections this may be acceptable, but if much larger collections are to be analyzed, the basic methodology will need to be extended to incorporate a more efficient updating procedure. This has not been investigated yet, but one possibility is to group the keywords into clusters. New keywords could then be compared only to the cluster centroids and not to all existing keywords. Direct comparison with keywords within the most similar cluster could also be conducted to determine the local structure. Assuming that there is a much smaller number of clusters than keywords, this would result in a substantial gain in efficiency. However, such a method will need careful tuning to achieve a good trade-off between efficiency and accuracy of the approximated distances. Nevertheless, this would be an interesting direction for future research.
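One way the cluster-based update suggested above might look is sketched below. Because keywords have no natural centroid, each cluster is represented here by one designated member keyword, which is an assumption, as are all of the names used. The `distance` argument stands for any hit-count based distance obtained by querying the search engine. The scheme has not been evaluated on real data.

```python
def place_new_keyword(new_keyword, clusters, representatives, distance):
    """Assign a new keyword to a cluster without comparing it to every term.

    clusters: dict mapping cluster id -> list of member keywords.
    representatives: dict mapping cluster id -> one member keyword that
        stands in for the whole cluster (assumed to belong to that cluster).
    distance: callable (keyword_a, keyword_b) -> float.
    Returns (cluster_id, local_distances, number_of_comparisons).
    """
    # Stage 1: compare the new keyword with one representative per cluster.
    rep_dist = {cid: distance(new_keyword, rep)
                for cid, rep in representatives.items()}
    best = min(rep_dist, key=rep_dist.get)

    # Stage 2: compare with the other members of the closest cluster only,
    # reusing the distance already obtained for its representative.
    local = {}
    for member in clusters[best]:
        if member == representatives[best]:
            local[member] = rep_dist[best]
        else:
            local[member] = distance(new_keyword, member)

    comparisons = len(representatives) + len(clusters[best]) - 1
    return best, local, comparisons
```

With k clusters, the number of pairwise searches drops from n to roughly k plus the size of the selected cluster, at the cost of leaving the distances to all other keywords unmeasured; those would have to be approximated, for example by the distance to their cluster representative.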
15
16 Finally, a number of additional visualizations were generated using alternative mapping techniques and
17
distance measures. Specifically, the Sammon Map was used as an example of a topographic mapping technique,
18
19 and the Jaccard distance was presented as an alternative measure of publication similarity. While these tech-
20
21 niques were not comprehensively investigated, the reason for their inclusion was mainly to demonstrate that the
keyword relationships observed were not merely artifacts that resulted from using a particular set of visualization techniques, but reflected the true, underlying structure of the research landscape. Here, our observation was that the overall clustering structure and inter-relationships between the keywords were preserved, but that the scaling and distribution of the keywords in the respective visualization spaces experienced some distortion. In particular, we note that the proposed combination of the NGD with the hierarchical visualization or the k-means algorithm seemed to provide results that were somewhat clearer than the alternatives mentioned above.

That said, the methods described in this paper were intended only as an early demonstration of the proposed approach and, in spite of the above-mentioned problems, we believe that the results described here already demonstrate the potential usefulness of the methodology. A note of caution, however: it is still not known whether these methods can be fully automated. While very interesting results were obtained, distinguishing them from the background noise was still largely a manual process.

It must also be conceded that, while the results are promising, many observations were difficult to explain. These may be viewed from two perspectives. On the one hand, they could be manifestations of hitherto unknown relationships or underlying correlations, and may only be understood after further analysis of these results. On the other hand, it should be realized that the Google distance is a numerical index derived from the term co-occurrence frequencies, nothing more and nothing less. Under the correct circumstances, and provided that our assumptions are adequately met, it serves as a useful indicator of the similarity between keywords. Certainly, the results obtained so far suggest that these requirements are satisfied a reasonable proportion of the time. However, under less favorable conditions, these numbers can be misleading and yield artifactual results, as has also been observed in some of the examples presented in this paper.
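
To make concrete the sense in which the Google distance is "a numerical index derived from the term co-occurrence frequencies", the following is a minimal sketch in Python. It assumes the standard definition of the normalized Google distance given by Cilibrasi and Vitanyi (2007); the function name, argument names and the hit counts in the usage note are illustrative assumptions and are not taken from the experiments reported in this paper.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google distance computed from hit counts.

    fx, fy -- number of documents containing each term on its own
    fxy    -- number of documents containing both terms
    n      -- total number of documents indexed (assumed known or estimated)
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if fxy <= 0:
        # Terms that never co-occur are treated here as infinitely distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

With hypothetical counts fx = 1000, fy = 100, fxy = 10 and n = 1000000, the function returns 0.5; two terms that always appear together (fx = fy = fxy) are at distance 0. The value depends only on these four counts, which is why sparse or unrepresentative counts can produce the artifactual results mentioned above.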

ACKNOWLEDGEMENT

We would like to thank the Masdar Institute of Science and Technology (MIST) and the Masdar Initiative for their support of this work.

APPENDIX A
RENEWABLE ENERGY RELATED KEYWORDS

A. Keywords from Kajikawa et al.

combustion, coal, battery, petroleum, fuel cell, wastewater, heat pump, engine, solar cell, power system

B. Author keywords

biomass, CDS, CDTE, energy efficiency, gasification, global warming, least-cost energy policies, power generation, populus, qtl, renewable energy, review, sustainable farming and forestry, adsorption, alternative fuel, arabidopsis, ash deposits, bio-fuels, biodiesel, biomass, biomass-fired power boilers, carbon nanotubes, chemicals, co-firing, coal, corn stover, electricity, emissions, energy balance, energy conversion, energy economy and management, energy policy, energy sources, enzymatic digestion, fast pyrolysis, fuels, gas engines, gas storage, gasification, genome sequence, genomics, high efficiency, hydrolysis, inorganic material, investment, landfill, model plant, natural gas, poplar, pretreatment, pyrolysis, renewable energy, renewables, sugars, sunflower oil, thermal conversion, thermal processing, thin films, transesterification.

C. Keyword Plus

16s ribosomal-rna, activation, active-site, adsorption, agrobacterium-mediated transformation, anabaena-variabilis, anacystis-nidulans, aqueous ammonia, bidirectional hydrogenase, biomass conversion processes, briquettes, canopy structure, catalysts, cds, cdte, cells, cellulases, cellulose, ch4, chalcopyrite, charge-transfer dynamics, chemical heat pipe, co2, cocombustion, combustion, composites, conversion, corn stover, coupled electron-transfer, devolatilization, diesel-power-plant, differentiation, dye, efficiency, electrocatalytic hydrogen evolution, electrochemical reduction, electrodes, electron-transfer, elemental sulfur, energy, enzymatic-hydrolysis, families, fermi-level equilibration, films, fimi, flash pyrolysis, fluidized-bed, fuel, fuel-cell vehicles, fuels, functionalized gold nanoparticles, gasification, gasoline, gel electrolyte, gene-transfer, genetic-linkage maps, glycosyl hydrolases, grain morphology, graphite nanofibers, herbaceous biomass, homogeneous catalysis, hybrid poplar, hydrogen, hydrogen-peroxide, hydrogen-production, hydrolysis, ignition, infrastructure, kinetics, light interception, lignin removal, lignocellulosic materials, lime pretreatment, liquefaction, liquid, liquids, mechanisms, metal-complexes, metals, molecular-genetics, monte-carlo simulations, mutagenesis, nanocrystalline semiconductor-films, nickel, nitrogen-fixation, open-top chambers, oxidative addition, partial oxidation, particles, photoelectrochemical cells, photoelectrochemical properties, photoinduced electron-transfer, photonic crystals, photoproduction, photosystem-ii, physisorption, place-exchange-reactions, pores, pressure cooking, products, proton reduction, pulverized coal, pyrolysis, rapd markers, recombination, recycled percolation process, ruthenium polypyridyl complex, seawater, sediment, sensitized nanocrystalline TiO2, sensitizers, short-rotation, solar furnace, solar-cells, sp strain atcc-29133, sputtering deposition method, step gene replacement, surface-plasmon resonance, synergism, synthesis gas, system, TiO2 films, TiO2 thin-films, titanium-dioxide films, transgenic poplar, transport, trichoderma-reesei qm-9414, values, walled carbon nanotubes, waste paper, water-oxidation, wheat-straw mixtures, wood.

APPENDIX B
ADDITIONAL VISUALIZATIONS

A. Author keywords

In this appendix, alternative visualizations and clusterings generated using the author keywords are presented. As mentioned in section III-A, two alternative forms are presented here. The first, shown in fig. 8, uses Sammon Mapping to generate a topographic representation of the inter-keyword distances. The second consists of k-means clusters generated using the Jaccard distance, which was described in section II-A; these are presented in table III. We note that:

1) As discussed in section III-A, and as indicated by the labeling of highly similar clusters in fig. 8, the different visualizations were broadly in agreement with the earlier results as to the overall structure of the research landscape.
2) At the same time, the representations are not identical. This is not surprising, since most of these methods are non-linear and would result in distortion and stretching of the visualization space.
3) While the results were consistent with the earlier results, we felt that the visualizations obtained using the Google distance followed by hierarchical clustering or k-means tended to be clearer.
4) For example, the Sammon map shown in fig. 8 is quite difficult to read except under very high magnification. This is a result of the Sammon mapping technique which, by definition, places similar nodes very close to each other in the visualization space; in many cases this resulted in overlapping terms, while large sections of the visualization space remained relatively sparse.
5) Also, when using the Jaccard distance, the clusters tended to be more unbalanced. For example, in table III we see that many of the keywords have been lumped together in cluster JK1. This implies that the mapping induced by the Jaccard distance was not able to achieve a good, uniform spread of the keywords.
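
The behaviour noted in point 4 can be related to the error criterion that Sammon mapping minimizes. The sketch below computes that criterion (Sammon's stress, as defined in Sammon (1969)) for a given layout; it is an illustration only, not the code used to produce fig. 8, and the function and argument names are assumptions. Each squared error is divided by the original distance, so mismatches between close (similar) keywords are penalized most heavily.

```python
import math


def sammon_stress(target, points):
    """Sammon's stress between original distances and a low-dimensional layout.

    target -- symmetric matrix (list of lists) of original inter-keyword distances
    points -- list of coordinate tuples, one per keyword, in the map space
    """
    n = len(points)
    scale = 0.0  # sum of the original distances (normalizing constant)
    error = 0.0  # squared errors, each weighted by 1 / original distance
    for i in range(n):
        for j in range(i + 1, n):
            d_star = target[i][j]
            if d_star <= 0:
                continue  # coincident items are skipped to avoid dividing by zero
            d_map = math.dist(points[i], points[j])
            scale += d_star
            error += (d_star - d_map) ** 2 / d_star
    return error / scale if scale else 0.0
```

A layout that reproduces every distance exactly has a stress of 0, while a layout that collapses all keywords onto a single point has a stress of 1.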

[Figure: Sammon map, panel title "Author keywords". Highlighted cluster labels: Biofuels (C1 & C2), Thin film (C3), Energy Policy (C5).]

Fig. 8: Sammon Map of author keywords data. Thematic clusters have been highlighted and, where possible, linked to clusters found in the hierarchical maps.

Cluster#  Keywords
JK1       Alternative Fuel, Natural Gas, Renewable Energy, Review, Energy Balance, Energy Efficiency, Investment, Electricity, Global Warming, Renewables, Fuels, Energy Sources, Biodiesel, Sustainable Farming And Forestry, Energy Policy, Power Generation, Least-Cost Energy Policies, Biomass, Landfill, Coal, Emissions, Renewable Energy
JK2       Energy Economy And Management
JK3       Qtl
JK4       Energy Conversion, Cdte, Adsorption, High Efficiency, Chemicals, Carbon Nanotubes, Cds, Thin Films
JK5       Ash Deposits, Inorganic Material, Biomass-Fired Power Boilers
JK6       Model Plant, Genome Sequence, Arabidopsis, Genomics
JK7       Transesterification, Gas Engines, Pretreatment, Gasification, Thermal Conversion, Co-Firing, Thermal Processing, Gasification, Sunflower Oil, Pyrolysis, Fast Pyrolysis, Hydrolysis
JK8       Gas Storage
JK9       Bio-Fuels
JK10      Sugars, Enzymatic Digestion, Populus, Corn Stover, Poplar

TABLE III: Clusters generated automatically by applying the k-means algorithm to the author keywords data (Jaccard distances)
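
For reference, a Jaccard distance can be computed from the same hit counts that underlie the Google distance. The sketch below assumes the usual set-based form (one minus the ratio of co-occurrences to the number of documents containing either term); the exact expression used in section II-A is not reproduced here, and the names and counts are illustrative assumptions.

```python
def jaccard_distance(fx, fy, fxy):
    """Jaccard distance between two terms, computed from hit counts.

    fx, fy -- number of documents containing each term on its own
    fxy    -- number of documents containing both terms
    """
    union = fx + fy - fxy  # documents containing at least one of the two terms
    if union <= 0:
        raise ValueError("at least one term must occur in the collection")
    return 1.0 - fxy / union
```

With hypothetical counts fx = 30, fy = 20 and fxy = 10, the distance is 0.75. The value approaches 1 whenever the co-occurrence count is small relative to either term's individual count.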

B. Keyword plus

Trends similar to those in the previous subsection were observed here. Again, the broad structure of the research landscape seems to have been preserved. As before, Sammon Mapping (shown in figs. 9 and 10) resulted in a somewhat cluttered representation of the landscape, while application of the k-means algorithm to the Jaccard distances again resulted in a less uniform distribution of keywords amongst clusters: a significant number of clusters contain only one keyword or phrase (9 such clusters were found in table IV, as compared to only 1 such cluster in table II).
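
The comparison of singleton clusters made above can be reproduced mechanically from any cluster listing. The following sketch is one hypothetical way of doing so; the function name and the summary fields are assumptions, and the example data are a subset of the rows of table III, used only for illustration.

```python
def cluster_balance(clusters):
    """Summarize how evenly keywords are spread over clusters.

    clusters -- dict mapping a cluster label to its list of keywords
    """
    sizes = {name: len(members) for name, members in clusters.items()}
    total = sum(sizes.values())
    return {
        "sizes": sizes,
        # clusters containing a single keyword or phrase
        "singletons": sorted(name for name, size in sizes.items() if size == 1),
        # fraction of all keywords that fall in the largest cluster
        "largest_share": max(sizes.values()) / total,
    }


# Subset of table III (not the full table).
example = {
    "JK2": ["Energy Economy And Management"],
    "JK3": ["Qtl"],
    "JK5": ["Ash Deposits", "Inorganic Material", "Biomass-Fired Power Boilers"],
    "JK6": ["Model Plant", "Genome Sequence", "Arabidopsis", "Genomics"],
    "JK8": ["Gas Storage"],
    "JK9": ["Bio-Fuels"],
    "JK10": ["Sugars", "Enzymatic Digestion", "Populus", "Corn Stover", "Poplar"],
}
summary = cluster_balance(example)
```

For this subset, `summary["singletons"]` lists JK2, JK3, JK8 and JK9. Applied to a complete table, the same count gives the number of single-keyword clusters quoted in the text.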

Cluster#  Keywords
JKP1      Elemental Sulfur
JKP2      Ruthenium Polypyridyl Complex
JKP3      Oxidative Addition, Coupled Electron-Transfer, Metal-Complexes, Water-Oxidation, Photosystem-Ii, Proton Reduction, Electron-Transfer, Photoinduced Electron-Transfer, Charge-Transfer Dynamics
JKP4      Sensitized Nanocrystalline Tio2
JKP5      Titanium-Dioxide Films
JKP6      Photoelectrochemical Cells, Dye
JKP7      Recombination, Cells, Sputtering Deposition Method, Films, Cdte, Chalcopyrite, Solar-Cells
JKP8      Cocombustion
JKP9      Photonic Crystals, Nickel, Electrodes, Electrocatalytic Hydrogen Evolution, Fuel-Cell Vehicles, Hydrogen-Production, Graphite Nanofibers, Walled Carbon Nanotubes, Electrochemical Reduction
JKP10     Fimi, Lime Pretreatment, Trichoderma-Reesei Qm-9414, Active-Site, Pressure Cooking, Aqueous Ammonia, Recycled Percolation Process, Enzymatic-Hydrolysis, Cellulases, Step Gene Replacement, Mutagenesis, Lignin Removal, Glycosyl Hydrolases
JKP11     Flash Pyrolysis, Transgenic Poplar, Agrobacterium-Mediated Transformation, Molecular-Genetics, Corn Stover, Genetic-Linkage Maps, Hybrid Poplar, Rapd Markers, Lignocellulosic Materials, Herbaceous Biomass, Waste Paper, Light Interception, Hydrolysis, Cellulose, Short-Rotation, Biomass Conversion Processes
JKP12     Sp Strain Atcc-29133, Anacystis-Nidulans, Anabaena-Variabilis, Nitrogen-Fixation, Bidirectional Hydrogenase, Open-Top Chambers, Photoproduction, Gene-Transfer
JKP13     Chemical Heat Pipe, Solar Furnace
JKP14     16S Ribosomal-Rna
JKP15     Efficiency, Gasification, Partial Oxidation, Ch4, Sediment, Physisorption, Co2, Liquefaction, Adsorption, Values, Mechanisms, Diesel-Power-Plant, Hydrogen, Fuels, Kinetics, Products, Ignition, Differentiation, Composites, Functionalized Gold Nanoparticles, Grain Morphology, Cds, Seawater, Metals, Liquid, Wheat-Straw Mixtures, Gasoline, Wood, Briquettes, Synthesis Gas, Families, Infrastructure, Devolatilization, Pulverized Coal, Liquids, Synergism, Pyrolysis, Conversion, Combustion, Monte-Carlo Simulations, Transport, Particles, Homogeneous Catalysis, Energy, Activation, Canopy Structure, Catalysts, Pores, Fluidized-Bed, Hydrogen-Peroxide, System, Fuel
JKP16     Surface-Plasmon Resonance
JKP17     Fermi-Level Equilibration
JKP18     Nanocrystalline Semiconductor-Films, Photoelectrochemical Properties, Gel Electrolyte, Sensitizers
JKP19     Tio2 Thin-Films, Tio2 Films
JKP20     Place-Exchange-Reactions

TABLE IV: Clusters generated using k-means: keyword plus data (Jaccard distances)

[Figure: Sammon map, panel title "Keyword plus (set 1)". Highlighted cluster labels: Cyanobacteria (CP1), Biomass (CP2), Solar cell production (CP3).]

Fig. 9: Sammon Map of keyword plus terms, set 1. Thematic clusters have been highlighted and, where possible, linked to clusters found in the hierarchical maps.

[Figure: Sammon map, panel title "Keyword plus (set 2)". Highlighted cluster labels: Biomass (CP4), Gasification (CP5), Solar cell (CP6), Biomass crops (CP7).]

Fig. 10: Sammon Map of keyword plus terms, set 2. Thematic clusters have been highlighted and, where possible, linked to clusters found in the hierarchical maps.

REFERENCES

[Antolin et al., 2002] Antolin, G., Tinaut, F. V., Briceno, Y., Castano, V., Perez, C., and Ramirez, A. I. (2002). Optimisation of biodiesel production by sunflower oil transesterification. Bioresource Technology, 83(2):111-114.
[Anuradha et al., 2007] Anuradha, K., Urs, and Shalini (2007). Bibliometric indicators of Indian research collaboration patterns: A correspondence analysis. Scientometrics, 71(2):179-189.
[Baek et al., 2005] Baek, N. C., Shin, U. C., and Yoon, J. H. (2005). A study on the design and analysis of a heat pump heating system using wastewater as a heat source. Solar Energy, 78(3):427-440.
[Bengisu and Nekhili, 2006] Bengisu, M. and Nekhili, R. (2006). Forecasting emerging technologies with the aid of science and technology databases. Technological Forecasting and Social Change, 73(7):835-844.
[Bishop, 1995] Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
[Bishop, 2006] Bishop, C. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, Singapore.
[Borner et al., 2005] Borner, K., Dall'Asta, L., Ke, W., and Vespignani, A. (2005). Studying the emerging global brain: Analyzing and visualizing the impact of co-authorship teams. Complexity, 10(4):57-67.
[Braun et al., 2000] Braun, T., Schubert, A. P., and Kostoff, R. N. (2000). Growth and trends of fullerene research as reflected in its journal literature. Chemical Reviews, 100(1):23-38.
[Chiu and Ho, 2007] Chiu, W.-T. and Ho, Y.-S. (2007). Bibliometric analysis of tsunami research. Scientometrics, 73(1):3-17.
[Cilibrasi and Vitanyi, 2007] Cilibrasi, R. L. and Vitanyi, P. M. B. (2007). The Google similarity distance. Knowledge and Data Engineering, IEEE Transactions on, 19(3):370-383.
[Cilibrasi and Vitanyi, 2006] Cilibrasi, R. and Vitanyi, P. (2006). Automatic extraction of meaning from the web. In IEEE International Symp. Information Theory.
[Daim et al., 2005] Daim, T. U., Rueda, G. R., and Martin, H. T. (2005). Technology forecasting using bibliometric analysis and system dynamics. In Technology Management: A Unifying Discipline for Melting the Boundaries, pages 112-122.
[Daim et al., 2006] Daim, T. U., Rueda, G., Martin, H., and Gerdsri, P. (2006). Forecasting emerging technologies: Use of bibliometrics and patent analysis. Technological Forecasting and Social Change, 73(8):981-1012.
[de Miranda et al., 2006] de Miranda, Coelho, G. M., Dos, and Filho, L. F. (2006). Text mining as a valuable tool in foresight exercises: A study on nanotechnology. Technological Forecasting and Social Change, 73(8):1013-1027.
[Ding et al., 2001] Ding, Y., Chowdhury, G., and Foo, S. (2001). Bibliometric cartography of information retrieval research by using co-word analysis. Information processing & management, 37(6):817-842.
[Elnekave, 2008] Elnekave, M. (2008). Adsorption heat pumps for providing coupled heating and cooling effects in olive oil mills. International Journal of Energy Research, 32(6):559-568.
[Glanzel and Schubert, 2005] Glanzel, W. and Schubert, A. (2005). Analysing scientific networks through co-authorship. Handbook of quantitative science and technology research, pages 257-276.
[Hansel and Lindblad, 1998] Hansel, A. and Lindblad, P. (1998). Towards optimization of cyanobacteria as biotechnologically relevant producers of molecular hydrogen, a clean and renewable energy source. Applied Microbiology and Biotechnology, 50(2):153-160.
[Igami, 2008] Igami, M. (2008). Exploration of the evolution of nanotechnology via mapping of patent applications. Scientometrics, 77(2):289-308.
[Janssens et al., 2006] Janssens, F., Leta, J., Glanzel, W., and De Moor, B. (2006). Towards mapping library and information science. Information processing & management, 42(6):1614-1642.
[Kajikawa and Takeda, 2008] Kajikawa, Y. and Takeda, Y. (2008). Structure of research on biomass and bio-fuels: A citation-based approach. Technological Forecasting and Social Change, 75(9):1349-1359.
[Kajikawa et al., 2007] Kajikawa, Y., Yoshikawa, J., Takeda, Y., and Matsushima, K. (2007). Tracking emerging technologies in energy research: Toward a roadmap for sustainable energy. Technological Forecasting and Social Change, In Press, Corrected Proof.
[Kim and Mee-Jean, 2007] Kim and Mee-Jean (2007). A bibliometric analysis of the effectiveness of Korea's biotechnology stimulation plans, with a comparison with four other Asian nations. Scientometrics, 72(3):371-388.
[King, 2004] King, D. A. (2004). The scientific impact of nations. Nature, 430(6997):311-316.
[Kostoff, 2001] Kostoff, R. N. (2001). Text mining using database tomography and bibliometrics: A review. 68:223-253.
[Losiewicz et al., 2000] Losiewicz, P., Oard, D., and Kostoff, R. (2000). Textual data mining to support science and technology management. Journal of Intelligent Information Systems, 15(2):99-119.
[Manning et al., 2008] Manning, C., Raghavan, P., Schutze, H., and Corporation, E. (2008). Introduction to information retrieval, volume 1. Cambridge University Press, Cambridge, UK.
[Martino, 1993] Martino, J. (1993). Technological Forecasting for Decision Making. McGraw-Hill Engineering and Technology Management Series.
[Mcdowall and Eames, 2006] Mcdowall, W. and Eames, M. (2006). Forecasts, scenarios, visions, backcasts and roadmaps to the hydrogen economy: A review of the hydrogen futures literature. Energy Policy, 34(11):1236-1250.
[Morel et al., 2009] Morel, C., Serruya, S., Penna, G., and Guimaraes, R. (2009). Co-authorship network analysis: A powerful tool for strategic planning of research, development and capacity building programs on neglected diseases. PLoS neglected tropical diseases, 3(8):e501.
[Porter and Rafols, 2009] Porter, A. and Rafols, I. (2009). Is science becoming more interdisciplinary? Measuring and mapping six research fields over time. Scientometrics, 81(3):719-745.
[Porter et al., 1991] Porter, A., Roper, A., Mason, T., Rossini, F., and Banks, J. (1991). Forecasting and Management of Technology. Wiley-Interscience, New York.
[Porter, 2005] Porter, A. (2005). Tech mining. Competitive Intelligence Magazine, 8(1):30-36.
[Porter, 2007] Porter, A. (2007). How tech mining can enhance R&D management. Research Technology Management, 50(2):15-20.
[Saitou and Nei, 1987] Saitou, N. and Nei, M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4(4):406-425.
[Saka and Igami, 2007] Saka, A. and Igami, M. (2007). Mapping modern science using co-citation analysis. In IV '07: Proceedings of the 11th International Conference Information Visualization, pages 453-458, Washington, DC, USA. IEEE Computer Society.
[Sammon, 1969] Sammon, J. (1969). A nonlinear mapping for data structure analysis. IEEE transactions on Computers, 100(18):401-409.
[Smalheiser, 2001] Smalheiser, N. R. (2001). Predicting emerging technologies with the aid of text-based data mining: the micro approach. Technovation, 21(10):689-693.
[Small, 2006] Small, H. (2006). Tracking and predicting growth areas in science. Scientometrics, 68(3):595-610.
[Takeda and Kajikawa, 2009] Takeda, Y. and Kajikawa, Y. (2009). Optics: a bibliometric approach to detect emerging research domains and intellectual bases. Scientometrics, 78(3):543-558.
[Takeda et al., 2009] Takeda, Y., Mae, S., Kajikawa, Y., and Matsushima, K. (2009). Nanobiotechnology as an emerging research domain from nanotechnology: A bibliometric approach. Scientometrics, 80(1):23-38.
[Upham and Small, 2010] Upham, S. and Small, H. (2010). Emerging research fronts in science and technology: patterns of new knowledge development. Scientometrics, 83(1):15-38.
[Van Der Heijden, 2000] Van Der Heijden, K. (2000). Scenarios and forecasting - two perspectives. Technological forecasting and social change, 65:31-36.
[Woon et al., 2011] Woon, W. L., Zeineldin, H., and Madnick, S. (2011). Bibliometric analysis of distributed generation. Technological Forecasting and Social Change, 78(3):408-420.
[Woon and Madnick, 2009] Woon, W. and Madnick, S. (2009). Asymmetric information distances for automated taxonomy construction. Knowledge and Information Systems, 21:91-111. 10.1007/s10115-009-0203-5.
[Zhu and Porter, 2002] Zhu, D. and Porter, A. (2002). Automated extraction and visualization of information for technological intelligence and forecasting. Technological Forecasting and Social Change, 69(5).