 
2 JANUARY, 2009
RESEARCH STATEMENT
 
Christopher M. Collins
Textual data is at the forefront of information management problems today. Thousands of pages of text data, in many languages, are produced daily: emails, news reports, blog posts, product reviews, discussion forums, academic articles, and business reports. Computational linguistics interventions have also increased, as we rely more and more on automated language translation, summarization, enhanced information retrieval, and opinion mining. Managing, exploring, and analysing the flow of linguistic data is becoming both an individual and a societal problem. Data scale issues are not only a challenge for end users of technology, but are equally challenging for natural language engineers as they develop the systems that allow computers to manipulate and analyze text.

My research focus in human-computer interaction uses techniques from information visualization and my background in computational linguistics to create interactive visualizations of language. These visualizations help both computational linguistics researchers and everyday computer users to manage and analyze linguistic data. I approach visual linguistic analysis as a problem of computer-aided cognition: visualizations produced by the high speed and high volume of data analysis achieved by computers complement human linguistic sophistication and decision-making ability. I use a holistic set of methods, including ethnography and user-centered design, as well as drawing on cognitive science findings about human perception and results of laboratory-based studies. My dissertation research focuses on visualizing the processes and outcomes of natural language processing, supporting large scale text analytics, and providing novel information visualization techniques for multi-dimensional data, including linguistic data. I have identified several exciting next steps in each of these areas. Additionally, I see an opportunity to use the synergy between analysis of language and analysis of other types of data to enable new discoveries about language change, language use, and how language affects society. While my dissertation results and mid-term research agenda focus on linguistic data and problems, outcomes of my work have included, and will continue to include, creative and generalizable human-computer interaction and visualization contributions.
DISSERTATION RESEARCH

VISUALIZING NATURAL LANGUAGE PROCESSING
 
Traditional natural language processing (NLP) produces a single output, such as a translated sentence. However, these outputs are often associated with well-defined measures of uncertainty, both inherent in the data and introduced by statistical NLP. To open the black box of NLP, we presented the lattice uncertainty visualization [C.3]¹ at EuroVis 2007. This visualization provides a quick and intuitive view of several translation hypotheses and their estimated uncertainty. Embedded in an instant messaging chat client, the visualization enables cross-language conversations, helping readers judge how much to trust the translation accuracy. The technique was also applied to the outputs of the speech recognition system developed in my M.Sc. research [C.1].

¹ See CV for self-references.

[Figure: Translation uncertainty lattice visualization with an un-translatable word shown as a photo node.]

Understanding the outcomes of natural language processes is not only important for end users of natural language systems, but also for the researchers who develop them. In yet-unpublished work, I undertook an ethnographic study of natural language engineers at work. During this study, I observed ad hoc use of visualization (both sketches and computer-based static information graphics) to analyze the quality of their translation system. I discovered that the team of researchers had become overwhelmed with the task of evaluating their system and diagnosing translation quality issues. In response, I created a visualization of their data structures (tree structures overlaid with set relations) which supports annotation and collaboration for analysis and sharing of results. I am in the process of deploying and evaluating this visualization, which is general enough to apply to other sorts of data such as social network graphs and scatter plots with associated set relations.

[Figure: Chinese-to-English translation parse tree with overlaid set relations as concave 2D iso-surfaces.]

In related work, our investigation of word similarity measures led to the contribution of VisLink, a new method for revealing relationships amongst 2D visualizations by linking them in a 3D space. VisLink was applied to reveal patterns of similarity and difference amongst various lexical similarity measures of interest to NLP researchers, but can be used to connect any 2D visualizations. The system, presented in our InfoVis 2007 paper [J.3], has been praised as an early example of the future of integrative visualization, including by Georges Grinstein in his IV 2008 keynote address [1].

[Figure: VisLink relating various lexical visualizations (top) and heterogeneous data about US elections (bottom).]

Future Directions
There is a lot of potential for improving the effectiveness of the outputs of NLP systems through application of process visualization. Areas of importance include investigating real-time visualization of automata used in tasks such as dialogue system construction and visualizing the process of chart pruning and beam search for parsing and other applications. At ACL in 2008, we enumerated several of these possibilities as we engaged the research community in a tutorial on the potential for interactive visualization to aid NLP research [TW.1]. The exciting discussion raised several possible future research areas. Interactive visualization is well-suited to exploratory data analysis, a type of task which is increasingly common as NLP moves towards greater reliance on statistical techniques. Exploratory tasks include ensuring the quality and balanced coverage of the corpora used to train NLP systems. Interactive visual exploration of parameter spaces, e.g., “What changes when I adjust this parameter?” could build on our work with machine translation researchers to provide an even tighter feedback loop in NLP research.

While the VisLink system holds promise for additional linguistic analyses, it also provides a test bed for investigation into long-held visualization challenges such as 3D interaction, data ordering, and visual occlusion. We have been invited to apply VisLink in collaborations by researchers in aerospace (Boeing), chemistry (Dow Chemical), and forensic analysis (Royal Canadian Mounted Police).

VISUAL TEXT ANALYTICS

One of the most fundamental analytic tasks for text is to provide an overview of a long document: the electronic version of thumbing through a book. Existing visualizations, such as tag clouds, provide an unstructured overview, and do not make use of the wealth of expert-created knowledge stored in ontologies such as WordNet. We designed DocuBurst, the first visualization of document content which combines word frequency with the human-created structure in lexical databases, to create a simple visualization that reflects semantic content [TR.1, O.1, P.1-3, full paper in submission]. DocuBurst is a radial, space-filling layout of hyponymy (the IS-A relation), overlaid with occurrence counts of words in a document to provide visual summaries at varying levels of granularity. Interactive analysis is supported with geometric and semantic zoom, selection of individual words, and linked access to the source text. Our radial space-filling layout and interaction code has been downloaded 75 times and is in active use in several research labs. DocuBurst has been featured in mainstream newspapers, television, and radio. Deployment inquiries include patent databases (Dow Chemical) and governmental repositories (NATO). In addition, many educators have seen its potential as a teaching aid, and are eager to use it in the classroom. I am investigating options for wide distribution.

[Figure: DocuBurst of terms under ‘physics’ in a general science textbook.]

Moving beyond a single text, while working with Martin Wattenberg and Fernanda Viégas in the summer of 2008, I developed a system which supports exploratory real-time visual analytics on very large numbers of long documents. Specifically, our visualization enables detection and display of changing themes amongst the 14 upper courts of the USA, over a period of more than 200 years, covering a total of over 600,000 written court decisions. Linked access to the source text allows for drill down by investigators interested in discovering more about a term or time period. Trials with legal scholars resulted in interesting discoveries about the geographically distributed themes (e.g. miner’s lung disease in the 4th Circuit) and fascinating legal jargon and language use differences across states and even judges (e.g. the 7th Circuit uses the “ostrich instruction” to jurors abnormally often).
Future Directions
While advanced computational techniques such as latent semantic analysis are powerful tools for data experts, they produce results that can be unintuitive for the average reader. Simple analysis techniques such as word-counting and lexical relationship mapping require little training to understand; however, we have only scratched the surface of how to present the results of these analyses. A lot of potential remains for better visualization, increased capacity for large and changing datasets, support for different viewing platforms, and large-scale collaboration. The Web offers a programming and visualization platform, exciting data sources, and an eager user base for continued research. Imagine a visualization platform for collaborative visual analytics of government documents to increase transparency, or a method for tracking and understanding how linguistic ‘memes’ propagate online.

The long documents we have worked with are luxury items, each providing a wealth of data. An exciting text analytics challenge is to develop data analysis and visualization techniques for short texts. ‘Tweets’ on Twitter, Facebook status updates, and consumer-submitted product reviews comprise a significant portion of computer-mediated communication, especially amongst younger generations. Visualizations of these data sources generally focus on keyword spotting and word counting. Closer integration of analysis techniques such as sentiment analysis and word likelihood scoring could have a dramatic impact on how we understand the flurry of short texts produced all day, every day, online.
THE NEXT STEP: LEVERAGING LINGUISTIC DATA FOR NON-LINGUISTIC INSIGHT
 
Languages are always changing: over time new words emerge, spelling and pronunciation shift, fashionable slang explodes and fizzles, and older words lose favour. Sociolinguists and lexical etymologists have been studying how written language is influenced by social structures for many years [e.g., 2]. The massive scale of digital data available on the Internet introduces new computer science research challenges for these types of analyses. How do we analyse ever-changing linguistic style over time? How do we link computational linguistics to the closely intertwined social structures (friends, colleagues, geography) backing the use of language online? And, given two or more heterogeneous and mutually influential types of data that are changing through time, what types of visual metaphors and graphical techniques will best support teasing apart the networks of influence? Can we then use these models to predict linguistic influence? Through developing new linguistic algorithms and visualization techniques to address these computational challenges, we can also gain insight into the sociological aspects of language change in ways not previously possible.

Explicit online connections such as “friending” patterns and bookmark sharing activities have been investigated using methods such as information visualization [3] and ethnography [4]. These
[Figure: Visualization of lexical differences amongst 14 US courts (top). Enlargement of several words with connections (bottom).]