Sikuli: Using GUI Screenshots for Search and Automation
Tom Yeh, Tsung-Hsiang Chang, Robert C. Miller
EECS MIT & CSAIL, Cambridge, MA, USA 02139
{tomyeh,vgod,rcm}@csail.mit.edu
ABSTRACT
We present Sikuli, a visual approach to search and automation of graphical user interfaces using screenshots. Sikuli allows users to take a screenshot of a GUI element (such as a toolbar button, icon, or dialog box) and query a help system using the screenshot instead of the element's name. Sikuli also provides a visual scripting API for automating GUI interactions, using screenshot patterns to direct mouse and keyboard events. We report a web-based user study showing that searching by screenshot is easy to learn and faster to specify than keywords. We also demonstrate several automation tasks suitable for visual scripting, such as map navigation and bus tracking, and show how visual scripting can improve interactive help systems previously proposed in the literature.
ACM Classification: H5.2 [Information interfaces and presentation]: User Interfaces - Graphical user interfaces.
General terms: Design, Human Factors, Languages
Keywords: online help, image search, automation
INTRODUCTION
In human-to-human communication, asking for information about tangible objects can be naturally accomplished by making direct visual references to them. For example, to ask a tour guide to explain more about a painting, we would say "Tell me more about this" while pointing to the painting. Giving verbal commands involving tangible objects can also be naturally accomplished by making similar visual references. For example, to instruct a mover to put a lamp on top of a nightstand, we would say "Put this over there" while pointing to the lamp and the nightstand, respectively.

Likewise, in human-to-computer communication, finding information or issuing commands involving GUI elements can be accomplished naturally by making direct visual reference to them. For example, when asking the computer to "find information about this" while pointing to a toolbar icon, we would like the computer to tell us about the Lasso tool for Photoshop and hopefully even give us links to web pages explaining this tool in detail. Asking the computer to "move all these" while pointing to a set of document icons and the recycle bin, respectively, means we would like the computer to move all the Word documents to the recycle bin.

However, some interfaces do not interact with us visually and force us to rely on non-visual alternatives. One example is search. With the explosion of information on the web, search engines are increasingly useful as a last resort for help with a GUI application, because the web may have fresher, more accurate, more abundant information than the application's built-in help. Searching the web currently requires coming up with the right keywords to describe an application's GUI elements, which can be challenging. Another example is automation. Scripts or macros that control GUI elements either refer to an element by name, which may be unfamiliar or even unavailable to the user, or by screen location, which may change.

This paper presents Sikuli, a visual approach to searching and automating GUI elements (Figure 1). Sikuli allows users or programmers to make direct visual reference to GUI elements. To search a documentation database about a GUI element, a user can draw a rectangle around it and take
(In the Huichol Indian language, sikuli means "God's Eye," symbolic of the power of seeing and understanding things unknown.)
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
UIST'09, October 4-7, 2009, Victoria, British Columbia, Canada. Copyright 2009 ACM 978-1-60558-745-5/09/10...$10.00.
Figure 1: Sikuli Search allows users to search documentation and save custom annotations for a GUI element using its screenshot (captured by stretching a rectangle around it). Sikuli Script allows users to automate GUI interactions also using screenshots.
a screenshot as a query. Similarly, to automate interactions with a GUI element, a programmer can insert the element's screenshot directly into a script statement and specify what keyboard or mouse actions to invoke when this element is seen on the screen. Compared to the non-visual alternatives, taking screenshots is an intuitive way to specify a variety of GUI elements. Also, screenshots are universally accessible for all applications on all GUI platforms, since it is always possible to take a screenshot of a GUI element. We make the following contributions in this paper:
- Sikuli Search, a system that enables users to search a large collection of online documentation about GUI elements using screenshots;
- an empirical demonstration of the system's ability to retrieve relevant information about a wide variety of dialog boxes, plus a user study showing that screenshots are faster than keywords for formulating queries about GUI elements;
- Sikuli Script, a scripting system that enables programmers to use screenshots of GUI elements to control them programmatically. The system incorporates a full-featured scripting language (Python) and an editor interface specifically designed for writing screenshot-based automation scripts;
- two examples of how screenshot-based interactive techniques can improve other innovative interactive help systems (Stencils [8] and Graphstract [7]).

This paper is divided into two parts. First we describe and evaluate Sikuli Search. Then we describe Sikuli Script and present several example scripts. Finally we review related work, discuss limitations of our approach, and conclude.
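To make the scripting idea concrete, the sketch below mimics the shape of a screenshot-driven script. The `click` and `type_text` functions and the .png file names are hypothetical stand-ins that merely record intended actions; in the real system they would locate the pattern image on screen and synthesize mouse and keyboard events.

```python
# Sketch of a screenshot-driven automation script (hypothetical stubs).
# Real screenshot matching is replaced by logging for illustration.

actions = []

def click(pattern_png):
    # Stand-in: find the pattern image on screen and click its center.
    actions.append(("click", pattern_png))

def type_text(text):
    # Stand-in: type keyboard input into the focused element.
    actions.append(("type", text))

# A script is an ordinary Python program whose statements reference
# GUI elements by screenshot instead of by name or screen coordinates.
click("search_box.png")        # hypothetical screenshot of a search box
type_text("bus tracker")
click("search_button.png")     # hypothetical screenshot of a button

print(actions)
```

The point of the shape is that the script stays valid when a button moves on screen, because the element is identified by appearance rather than position.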
SIKULI SEARCH
This section presents Sikuli Search, a system for searching GUI documentation by screenshots. We describe motivation, system architecture, prototype implementation, the user study, and performance evaluation.
Motivation
The development of our screenshot search system is motivated by the lack of an efficient and intuitive mechanism to search for documentation about a GUI element, such as a toolbar button, icon, dialog box, or error message. The ability to search for documentation about an arbitrary GUI element is crucial when users have trouble interacting with the element and the application's built-in help features are inadequate. Users may want to search not only the official documentation, but also computer books, blogs, forums, or online tutorials to find more help about the element. Current approaches require users to enter keywords for the GUI elements in order to find information about them, but suitable keywords may not be immediately obvious.

Instead, we propose to use a screenshot of the element as a query. Given their graphical nature, GUI elements can be most directly represented by screenshots. In addition, screenshots are accessible across all applications and platforms by all users, in contrast to other mechanisms, like tooltips and help hotkeys (F1), that may or may not be implemented by the application.
System Architecture
Our screenshot search system, Sikuli Search, consists of three components: a screenshot search engine, a user interface for querying the search engine, and a user interface for adding screenshots with custom annotations to the index.
Screenshot Search Engine
Our prototype system indexes screenshots extracted from a wide variety of resources such as online tutorials, official documentation, and computer books. The system represents each screenshot using three different types of features (Figure 2). First, we use the text surrounding it in the source document, which is a typical approach taken by current keyword-based image search engines.
Second, we use visual features. Recent advances in computer vision have demonstrated the effectiveness of representing an image as a set of "visual words" [18]. A visual word is a vector of values computed to describe the visual properties of a small patch in an image. Patches are typically sampled from salient image locations, such as corners, that can be reliably detected despite variations in scale, translation, brightness, and rotation. We use the SIFT feature descriptor [11] to compute visual words from salient elliptical patches (Figure 2.3) detected by the MSER detector [12].
Figure 2: Screenshots can be indexed by surrounding text, visual features, and embedded text (via OCR).
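To make the visual-word representation concrete, here is a minimal sketch of descriptor quantization, assuming a pre-trained codebook of centroid vectors; the 2-D descriptors and codebook below are toy values for clarity, not real 128-D SIFT output.

```python
import math

# Toy codebook: each centroid stands for one visual word. Real systems
# cluster large numbers of SIFT descriptors to build the codebook.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def quantize(descriptor):
    """Map a descriptor to the ID of its nearest codebook centroid."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(descriptor, codebook[i]))

# An image then becomes a bag of visual-word IDs.
descriptors = [(0.1, 0.1), (0.9, 0.2), (0.05, 0.95)]
visual_words = [quantize(d) for d in descriptors]
print(visual_words)  # → [0, 1, 2]
```

Once every patch is reduced to a small integer ID, standard text-retrieval machinery (inverted indexes, term voting) applies directly to images.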
Screenshot images represented as visual words can be indexed and searched efficiently using an inverted index that contains an entry for each distinct visual word. To index an image, we extract visual words and for each word add the image ID to the corresponding entry. To query with another image, we also extract visual words and for each word retrieve from the corresponding entry the IDs of the images previously indexed under this word. Then, we find the IDs retrieved the most number of times and return the corresponding images as the top matches.
Third, since GUI elements often contain text, we can index their screenshots based on embedded text extracted by optical character recognition (OCR). To improve robustness to OCR errors, instead of using raw strings extracted by OCR, we compute 3-grams from the characters in these strings. For example, the word "system" might be incorrectly recognized as "systen". But when represented as sets of 3-grams over characters, these two terms are {sys, yst, ste, tem} and {sys, yst, ste, ten} respectively, which results in a 75% match, rather than a complete mismatch. We consider only letters, numbers, and common punctuation, which together define a space of 50,000 unique 3-grams. We treat each unique 3-gram as a visual word and include it in the same index structure used for visual features.
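The 3-gram matching and inverted-index voting described above can be sketched in a few lines; the image IDs and strings below are toy values, and the match score is computed as the fraction of the query's 3-grams found in the indexed term.

```python
from collections import Counter, defaultdict

def trigrams(s):
    """Set of character 3-grams of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

# "system" vs. OCR-garbled "systen": 3 of 4 grams survive -> 75% match.
a = trigrams("system")   # {sys, yst, ste, tem}
b = trigrams("systen")   # {sys, yst, ste, ten}
match = len(a & b) / len(a)

# Inverted index: one entry per distinct word, listing image IDs.
index = defaultdict(set)

def index_image(image_id, words):
    for w in words:
        index[w].add(image_id)

def query(words, k=1):
    # Vote: count how many of the query's words retrieve each image.
    votes = Counter()
    for w in words:
        for image_id in index[w]:
            votes[image_id] += 1
    return [img for img, _ in votes.most_common(k)]

index_image("dialog_a", trigrams("system"))
index_image("dialog_b", trigrams("network"))
top = query(trigrams("systen"))  # garbled query still finds dialog_a
```

Because retrieval is a vote over many small features, a few OCR errors (or a few mismatched visual words) degrade the score gracefully instead of causing a complete miss.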
User Interface for Searching Screenshots
Sikuli Search allows a user to select a region of interest on the screen, submit the image in the region as a query to the search engine, and browse the search results. To specify the region of interest, a user presses a hot-key to switch to Sikuli Search mode and begins to drag out a rubber-band rectangle around it (Figure 1). Users do not need to fit the rectangle perfectly around a GUI element since our screenshot representation scheme allows inexact match. After the rectangle is drawn, a search button appears next to it, which submits the image in the rectangle as a query to the search engine and opens a web browser to display the results.
User Interface for Annotating Screenshots
We have also explored using screenshots as hooks for annotation. Annotation systems are common on the web (e.g., WebNotes and Shiftspace), where URLs and HTML page structure provide robust attachment points, but similar systems for the desktop have previously required application support (e.g., Stencils [8]). Using screenshots as queries, we can provide general-purpose GUI element annotation for the desktop, which may be useful for both personal and community contexts. For example, consider a dialog box for opening up a remote desktop connection. A user may want to attach a personal note listing the IP addresses of the remote machines accessible by the user, whereas a community expert may want to create a tutorial document and link the document to this dialog box.
The Sikuli Search annotation interface allows a user to save screenshots with custom annotations that can be looked up using screenshots. To save a screenshot of a GUI element, the user draws a rectangle around it to capture its screenshot to save in the visual index. The user then enters the annotation to be linked to the screenshot. Optionally, the user can mark a specific part of the GUI element (e.g., a button in a dialog box) to which the annotation is directed.
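The save-and-lookup flow can be modeled as a small store keyed by screenshot features. In this sketch, character 3-grams of a label stand in for a screenshot's visual words, and all names and data (the dialog label, note, and target) are hypothetical; the threshold models the inexact matching described above.

```python
# Sketch of an annotation store keyed by screenshot features.

def features(screenshot_text):
    """Stand-in for visual-word extraction from a captured screenshot."""
    return {screenshot_text[i:i + 3] for i in range(len(screenshot_text) - 2)}

store = []  # list of (feature_set, annotation, optional target part)

def save_annotation(screenshot_text, note, target=None):
    store.append((features(screenshot_text), note, target))

def lookup(screenshot_text, threshold=0.5):
    q = features(screenshot_text)
    hits = []
    for feats, note, target in store:
        score = len(q & feats) / max(len(q), 1)
        if score >= threshold:   # inexact match tolerated
            hits.append((note, target))
    return hits

save_annotation("Remote Desktop Connection",
                "IPs: 10.0.0.5, 10.0.0.7",   # personal note
                target="Connect button")

# A later, slightly different capture of the same dialog still matches.
hits = lookup("Remote Desktop Connectian")
```

The same tolerance that absorbs OCR noise here would absorb small rendering differences (theme, font smoothing) between the capture used to save the note and the capture used to look it up.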
Prototype Implementation
The Sikuli Search prototype has a database of 102 popular computer books covering various operating systems (e.g., Windows XP, MacOS) and applications (e.g., Photoshop, Office), all represented in PDF. This database contains more than 50k screenshots. The three-feature indexing scheme is written in C++ to index these screenshots, using SIFT [11] to extract visual features, Tesseract for OCR, and Ferret for indexing the text surrounding the screenshots. All other server-side functionality, such as accepting queries and formatting search results, is implemented in Ruby on Rails with a SQL database. On the client side, the interfaces for searching and annotating screenshots are implemented in Java.
User Study
We have argued that a screenshot search system can simplify query formulation without sacrificing the quality of the results. To support these claims, we carried out a user study to test two hypotheses: (1) screenshot queries are faster to specify than keyword queries, and (2) results of screenshot and keyword search have roughly the same relevance as judged by users. We also used a questionnaire to shed light on users' subjective experience of both search methods.
The study was a within-subject design and took place online. Subjects were recruited from Craigslist and compensated with $10 gift certificates. Each subject was asked to perform two sets of five search tasks (1 practice + 4 actual tasks). Each set of tasks corresponds to one of the two conditions (i.e., image or keyword) that are randomly ordered. The details of a task are as follows. First, the
Figure 3: User study task, presenting a desktop image containing a dialog box (left) from which to formulate a query, and search results (right) to judge for relevance to the dialog box.
