A Study of Fallback Procedures in a Keystroke Biometric System

Michael Friedman, Birendra Gurung, Derwin Lugo, Murat Ocak, Mark Ritzmann, Lars Weinrich
Ivan G. Seidenberg School of CSIS, Pace University
1 Martine Ave, White Plains, NY, 10606, USA
{mf22990n, bg77633w, dl99837n, mo73153p, lw50479n}@pace.edu, marksritzmann@yahoo.com

Abstract
The Keystroke Biometric System developed at Pace University over the 2005/2006 academic year identifies subjects based on long text samples. The system used a grammar-based feature extraction method in which incomplete or insufficient data are replaced with more general grammatical data. Fed into a classification program based on Euclidean distances, the resulting data consistently identified subjects with accuracy exceeding 96%. In an effort to improve on the results of this “Linguistic” method of feature extraction, a new “Touch-Type fallback method” was developed. This fallback method is based on the geography of a standard computer keyboard rather than on grammar. Further improvements to the overall system include a user interface that makes running the various types of tests considerably more efficient, and a “trace mechanism” that allows the user to analyze and track feature extraction events.

1. Introduction
We are exploring keystroke biometric systems, which measure typing characteristics believed to be unique to an individual and difficult to duplicate. A Keystroke Biometric Identification System (one-of-n response) has been under development at Pace University since the 2004/2005 academic year. This project is a continuation of the Pace University CS616, 2005-2006 Keystroke Biometric System created by Gary Giang Ngo, Justin Simeone and Huguens St. Fort for Mary Villani, Dr. Tappert and Dr. Cha. The system employs a Java applet to collect raw keystroke data over the Internet. Data from these long-text-input entries is then extracted, and a pattern classifier is used to identify the author of the text. This program can be used by a researcher attempting to identify keystroke patterns in long-text passages. Potential users include professors of online courses who need to validate the work submitted by students [1]. A paper by Villani et al. [2] explores the results when a subject trains on one style of keyboard (e.g., desktop keyboard, laptop keyboard) and/or one style of data entry (e.g., free text, copy task) and is tested against the same and/or a different style of keyboard and/or data entry style.

2. Keystroke Biometric System
The keystroke biometric system consists of five components: capturing demographic data; identification of keyboard type and selection of data entry task; data entry through the preexisting Java applet; keystroke feature extraction; and classification.

2.1 Capturing Demographic Data
To begin capturing demographic data, the user accesses a Web site hosted by a server capable of serving HTML and PHP files and running a MySQL database [1]. To ensure that all users have entered the demographic data, they are required to enter their first and last name (the researcher-approved composite primary key of the demographics table) and submit this information. The name is then looked up in the MySQL database. If the first and last name combination is not found, the demographic data is captured through a Web form, which verifies that all demographic questions have been answered and that the potential subject has agreed to the terms and conditions of this experiment. The demographic data is then written to the demographics table and the user is thanked for their participation.

Figure 1: Registration Activity.

To allow users to leave and return to the Web site, four counters are also initialized to track the entry number at which the user will resume upon returning. The four experimental categories are copy task on a desktop, copy task on a laptop, free-text entry on a desktop and free-text entry on a laptop [1].

Figure 2: Four Experimental Categories.

At the completion of registration or upon returning to the site, the user is redirected to the activity selection PHP page. This page receives the user’s first and last name from the referring page and queries the database to obtain the values of the counter fields [1] (see Figure 3).

Figure 3: Activity Selection Page [1].

Figure 4: Once you click go, it takes you to the appropriate Java applet.

Clicking go redirects the user to the appropriate Java applet based on his/her selections (see Figure 5). Six pieces of information are sent to, and required by, the Java applet: first name; last name; experiment style (e.g., free text, copy task); sequence number for the selected experiment style (the respective counter field value); keyboard style; and awareness [1]. Awareness refers to whether the user knows he/she is working with a keystroke biometric system. If the Java applet does not receive these six values, or if the user does not have a Java Runtime Environment (JRE) of version 1.4 or later, the applet will not launch. Lastly, the user must use Microsoft’s Internet Explorer for the applet to function properly [1].

Figure 5: Java applet before any keystrokes have been entered [1].

Analysis of previous raw data files showed that typos or inconsistencies in a participant’s name cause problems in the feature extractor. Requiring the user to register once and to use the same first and last name to access the system eliminates the problem. The same principle applies to the activity sequence number: should the user enter a number already used, the user would overwrite his/her existing raw data file. This is prevented by counters in the database managed through PHP scripts [1].

Depending on the sample being collected, the system checks for a minimum number of keystrokes [1]. In the study by Villani et al. [2], copy task entries must be at least 635 keystrokes and free-text samples at least 677 keystrokes; otherwise the user is prompted to continue typing (see Figure 6).
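The minimum-length check on a sample can be sketched as follows. This is a minimal illustration in Python, not the applet's actual Java code; the two thresholds are those reported for the study by Villani et al. [2], while the function and dictionary names are hypothetical.

```python
# Minimum keystroke counts reported for the study by Villani et al. [2].
# The names below are illustrative and do not come from the applet source.
MIN_KEYSTROKES = {"copy": 635, "free": 677}


def keystrokes_remaining(task: str, keystroke_count: int) -> int:
    """Return how many more keystrokes are needed before a sample is accepted."""
    if task not in MIN_KEYSTROKES:
        raise ValueError(f"unknown task: {task!r}")
    return max(0, MIN_KEYSTROKES[task] - keystroke_count)


def can_submit(task: str, keystroke_count: int) -> bool:
    """True once the sample meets the minimum length for its task."""
    return keystrokes_remaining(task, keystroke_count) == 0
```

A caller would show the warning and keep the sample open while `can_submit` is false.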

Figure 6: Warning if user clicks submit before meeting the minimum number of keystrokes [1].

When the user correctly completes the task and clicks submit, a PHP file is called, which writes the raw data to a text file and (transparently to the user) increments the user’s counter field in the database. The user sees the Java applet in a state nearly identical to that pictured in Figure 5, except that the sequence number has been incremented. The user can enter another sample or click the back button to return to the activity selection page [1].

For ease of locating the raw data files, each experimental style/keyboard combination is given its own directory on the server. Before progressing to the feature extraction process, the researcher must FTP the raw data files to a directory on his/her local disk [1].

2.2 Feature Extraction
The software developers used Borland’s JBuilder as the IDE. The feature extraction program reads all of the raw data text files from a directory on the researcher’s local disk. One string of data is created from each file and stored in a vector. The vector is read in ascending order from index zero to index N-1, where N is the number of raw data files. A second vector is instantiated to track the frequency of each feature detected in the raw data. At the lowest level these features are simply the keys pressed. The higher-level features depend on the fallback method used in the analysis and come into play when the frequency of the lower-level features is insignificant. The “Linguistic” fallback method developed by Villani et al. [2] contains duration as well as transition features, which have been implemented in the feature extraction program. Fallback is used to minimize “bad” data caused by a less-than-optimal number of occurrences of a feature. Fallback is implemented by assigning each node on the tree a numeric pair consisting of that feature’s unique numeric identifier. This allows the programmer to easily change the pairs and thereby change the structure of the tree [1].

2.3 Feature Classifier
After all features have been extracted into a data or “features” file, the data is ready to be classified in an attempt to identify an author. The identification is a measure of the sum of the Euclidean distances of all the collected features. This analysis is done in one of two ways. In the “train-on-one” or “leave-one-out” method, one features file is used, and classification proceeds by pulling out each data entry and comparing it to all the other data in the features file. Classification is successful if the smallest Euclidean distance is to another data entry by the same author. The second method is to train the classifier on one features file and then attempt to match the data from a second features file to those in the training file. Again, a match is successful when the smallest Euclidean distance is between the data being tested and data in the training file by the same author.
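The two classification modes described in Section 2.3 can be illustrated with a nearest-neighbor sketch. It is written in Python rather than the system's Java, and it makes several assumptions not confirmed by the source: each sample is a plain list of numeric feature values, a single Euclidean distance is taken over the whole vector, and no feature normalization is applied. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[str, Sequence[float]]  # (author, feature vector)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_author(test: Sequence[float], training: List[Sample]) -> str:
    """Author of the training sample closest to the test vector."""
    return min(training, key=lambda s: euclidean(test, s[1]))[0]


def leave_one_out_accuracy(samples: List[Sample]) -> float:
    """Classify each sample against all the others in the same features file."""
    hits = 0
    for i, (author, vector) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        hits += nearest_author(vector, others) == author
    return hits / len(samples)


def train_test_accuracy(training: List[Sample], testing: List[Sample]) -> float:
    """Classify each sample of a second features file against a training file."""
    hits = sum(nearest_author(v, training) == a for a, v in testing)
    return hits / len(testing)
```

In both functions a sample counts as correctly identified when its nearest neighbor was typed by the same author, which mirrors the success criterion stated in Section 2.3.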

3. Methodology
We used an agile development methodology, specifically Extreme Programming (XP), which involves small releases and fast turnarounds in roughly two-week iterations. We held regular meetings with the client at which an updated system was delivered and critiqued and a new deliverable was set for the following week.

3.1 Object-Oriented Approach
The object-oriented approach to programming was used in both the feature extraction and pattern classification programs [1].

4. Logic of Touch-Type Duration and Transition Model
In an effort to further improve results, a different fallback strategy was developed and tested. This method is based on the touch-typing method of keyboarding, first introduced by Frank Edgar McGurrin in the late 1800s. The method is still taught today and is, more than likely, the one most readers of this article employ. It calls for the four fingers of each hand to press the keys while the thumbs press only the space bar. The logic behind the Touch-Type fallback duration model is that fingers and hands will act in a similar manner regardless of which assigned letter is being pressed. Therefore, the keys assigned to a specific finger form a natural cluster suitable for substitution in the event of insufficient sample sizes. The logic behind the transition dimension of the Touch-Type model is, again, that fingers and hands will perform in a consistent manner and, therefore, the finger and hand assignments associated with each letter lead to natural groupings.
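The substitution idea can be sketched as a walk up a small hierarchy. The Python sketch below is illustrative only: the finger clusters are the ones listed for the Touch-Type duration model in Section 4.1, but the two upper levels (hand, then all keys), the `MIN_SAMPLES` threshold and every function name are assumptions, not details taken from the system.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Finger clusters as listed for the Touch-Type duration model (Section 4.1).
FINGER_KEYS = {
    "left little": "aqz",
    "left ring": "swx",
    "left middle": "dce",
    "left index": "fgrtvb",
    "right index": "hjyunm",
    "right middle": "ki",
    "right ring": "lo",
    "right little": "p",
}
KEY_TO_FINGER = {k: f for f, keys in FINGER_KEYS.items() for k in keys}

MIN_SAMPLES = 5  # hypothetical threshold, not taken from the paper


def fallback_path(key: str) -> List[Tuple[str, str]]:
    """Levels tried in order: key, finger, hand, all keys (upper levels assumed)."""
    finger = KEY_TO_FINGER[key]
    hand = finger.split()[0]
    hand_keys = "".join(ks for f, ks in FINGER_KEYS.items() if f.split()[0] == hand)
    return [
        ("key", key),
        ("finger", FINGER_KEYS[finger]),
        ("hand", hand_keys),
        ("all", "".join(FINGER_KEYS.values())),
    ]


def mean_duration(key: str, durations: Dict[str, List[float]]) -> Tuple[float, str]:
    """Mean duration for `key`, widening the cluster while data are too scarce.

    Returns the mean and the name of the level that supplied it.
    """
    for level, keys in fallback_path(key):
        pooled = [d for k in keys for d in durations.get(k, [])]
        # The last level accepts any data at all, since nothing wider exists.
        if len(pooled) >= MIN_SAMPLES or (level == "all" and pooled):
            return mean(pooled), level
    raise ValueError("no duration samples available")
```

Returning the level alongside the value records where fallback occurred, which is the kind of event the Trace Mechanism of Section 5 is meant to report.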

4.1 Duration Dimension of each Model
Upon inspection, the Touch-Type model differs significantly from the Linguistic model. For duration, both models have 4 levels, but that is where the similarities end. While the Linguistic model has 4 levels, there are 6 cases where leaves appear on the 3rd level; the Touch-Type model has only 1 case that terminates on the 3rd level. An examination of the 4th level of each model best illustrates how the models differ, starting with the Linguistic model (see Figure 7):

Figure 7: Linguistic Fallback: Duration

• The 5 vowels all roll up to a node. These letters are distributed among 5 nodes in the TT model.
• The “frequent consonants” (t, n, s, r, h) roll up to a node. These letters are distributed among 3 nodes in the TT model.
• The “next most frequent consonants” (l, d, c, p, f) roll up to a node. These letters are distributed among 4 nodes in the TT model.
• The “least frequent consonants” (all others) roll up to a node. These letters are distributed among 3 nodes in the TT model.

Examining the 4th level again, this time starting with the Touch-Type model (see Figure 8):

Figure 8: Touch-Type Fallback: Duration

• There are 3 letters that roll up to “left little” (a, q, z). These letters are distributed among 2 nodes in the Linguistic model.
• There are 3 letters that roll up to “left ring” (s, w, x). These letters are distributed among 3 nodes in the Linguistic model.
• There are 3 letters that roll up to “left middle” (d, c, e). These letters are distributed among 2 nodes in the Linguistic model.
• There are 6 letters that roll up to “left index” (f, g, r, t, v, b). These letters are distributed among 4 nodes in the Linguistic model.
• There are 6 letters that roll up to “right index” (h, j, y, u, n, m). These letters are distributed among 4 nodes in the Linguistic model.
• There are 2 letters that roll up to “right middle” (k, i). These letters are distributed among 2 nodes in the Linguistic model.
• There are 2 letters that roll up to “right ring” (l, o). These letters are distributed among 2 nodes in the Linguistic model.
• There is 1 letter that rolls up to “right little” (p). It, obviously, rolls up to one node in the Linguistic model.

4.2 Transition Dimension of each Model
In the transition dimension, each model again has 4 levels. The fourth level (the leaf level) is identical in the two models, in that the same 15 frequently occurring transitions were captured for this study. Accordingly, these 15 leaves are found in both models. However, the Linguistic model also features 11 leaves on the 3rd level, while the Touch-Type model has only 4 on that level. On the 2nd level of the Touch-Type model, there are 4 nodes that are found on the 3rd level of the Linguistic model. Examining the 4th level of the models, starting with the Linguistic model (see Figure 9):

Figure 9: Linguistic Fallback: Transition

• There are 3 letter pairs that roll up to the “consonant/consonant” node (th, st, nd). These pairs are distributed among 3 nodes in the TT model.
• There are 8 letter pairs that roll up to the “vowel/consonant” node (an, in, er, es, on, en, at, or). These pairs are distributed among 5 nodes in the TT model.
• There are 3 letter pairs that roll up to the “consonant/vowel” node (he, re, ti). These pairs are distributed among 3 nodes in the TT model.
• There is 1 letter pair that rolls up to “vowel/vowel” (ea). This pair also rolls up to 1 node in the TT model.

Examining the 4th level of the models, starting with the Touch-Type model (note: “neighbor” keys are those which share an edge on the keyboard; “non-neighbors” are ones that do not) (see Figure 10):

Figure 10: Touch-Type Fallback: Transition

• There are 3 letter pairs that roll up to the “left/left neighbor” node (er, es, re). These pairs are distributed among 2 nodes in the Linguistic model.
• There are 3 letter pairs that roll up to the “left/left non-neighbor” node (st, at, ea). These pairs roll up to 2 nodes in the Linguistic model.
• There are 2 letter pairs that roll up to the “right/right non-neighbor” node (in, on). These are also found rolling up to 1 node in the Linguistic model.
• There is one letter pair that rolls up to the “left/right index-index” node (th).
• There is one letter pair that rolls up to the “left/right index-other” node (ti).
• There are 2 letter pairs that roll up to the “left/right other-other” node (an, en). These are also found rolling up to 1 node in the Linguistic model.
• There are 2 letter pairs that roll up to the “right/left index-other” node (nd, he). These pairs roll up to 2 nodes in the Linguistic model.
• There is one letter pair that rolls up to the “right/left other-index” node.
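How a letter pair might be assigned to a Touch-Type transition group can be sketched from the finger assignments alone. The Python sketch below is a simplified illustration, not the system's implementation: the staggered three-row letter layout used to decide which keys share an edge is an assumption, as are all names. A purely finger-based rule does not reproduce every grouping listed for Figure 10 (it labels "an" and "en" as other-index, for instance), so the figure remains the authority.

```python
# Finger assignments from the Touch-Type duration model (Section 4.1).
FINGER_KEYS = {
    "left little": "aqz", "left ring": "swx", "left middle": "dce",
    "left index": "fgrtvb", "right index": "hjyunm", "right middle": "ki",
    "right ring": "lo", "right little": "p",
}
KEY_TO_FINGER = {k: f for f, keys in FINGER_KEYS.items() for k in keys}

# Simplified staggered QWERTY letter rows; this geometry is an assumption.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POSITION = {k: (r, c) for r, row in enumerate(ROWS) for c, k in enumerate(row)}


def are_neighbors(a: str, b: str) -> bool:
    """True if two letter keys share an edge under the assumed layout."""
    (ra, ca), (rb, cb) = POSITION[a], POSITION[b]
    if ra == rb:
        return abs(ca - cb) == 1
    if abs(ra - rb) != 1:
        return False
    if ra > rb:  # reorder so that (ra, ca) is the key in the upper row
        (ra, ca), (rb, cb) = (rb, cb), (ra, ca)
    # A key in column c rests on columns c-1 and c of the row below it.
    return cb in (ca - 1, ca)


def transition_group(pair: str) -> str:
    """Name a two-letter transition by hand order plus neighbor or finger type."""
    first, second = pair
    f1, f2 = KEY_TO_FINGER[first], KEY_TO_FINGER[second]
    h1, h2 = f1.split()[0], f2.split()[0]
    if h1 == h2:
        kind = "neighbor" if are_neighbors(first, second) else "non-neighbor"
    else:
        kind = "-".join("index" if f.endswith("index") else "other" for f in (f1, f2))
    return f"{h1}/{h2} {kind}"
```

For same-hand pairs the group depends on keyboard adjacency, and for cross-hand pairs on whether each key belongs to an index finger.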

5. Trace Mechanism
While we are certain that fallback procedures do improve overall performance and result in higher match percentages, with the current version of the application we are somewhat in the dark as to why. The current version of the code has no mechanism that reports when and how fallback occurred; in some respects, we take its invocation on faith. To produce a more granular explanation of results, a Trace Mechanism was developed. This functionality allows for the identification of insufficient data (i.e., which letters were not used frequently enough to form a complete sample) and of the path (percentages and weights) taken along the hierarchy of the Touch-Type model. This information is extremely valuable in examining results, fine-tuning the model by adjusting parameters and weights, and improving results.

6. Results
Contrary to our expectations, a comparison of the results obtained from running the Keystroke Biometric System was not clear-cut. Our hypothesis was that a fallback method designed to reflect the geography of a keyboard (the Touch-Type method) would achieve greater accuracy than a fallback method based on grammar (the Linguistic method). What follows are some preliminary results from running the system in Train-On-One & Test-On-Another mode. 36 subjects were used in this test. Each had performed all tasks collected by the Java Applet Data Collector (‘Copy on Laptop’, ‘Copy on Desktop’, ‘Free on Laptop’ and ‘Free on Desktop’) 4 to 8 times.

Train Data            Test Data             Linguistic Success Rate   TouchType Success Rate
Desktop               Laptop                98.9%                     97.3%
Laptop                Desktop               98.9%                     96.8%
Combined Keyboards    Combined Keyboards    98.9%                     98.1%
Desktop               Laptop                56.9%                     61.7%
Laptop                Desktop               61.2%                     68.3%

Table 1: Copy Task Identification Success Rates

Train Data            Test Data             Linguistic Success Rate   TouchType Success Rate
Desktop               Laptop                98.3%                     95.5%
Laptop                Desktop               99.5%                     98.4%
Combined Keyboards    Combined Keyboards    98.6%                     97.8%
Desktop               Laptop                58.5%                     61.8%
Laptop                Desktop               55.1%                     57.4%

Table 2: Free-Text Task Identification Success Rates

7. Conclusion and Recommendations
Upon completion of this project, two fallback models will be in place in the system: the Linguistic model and the Touch-Type model. The improvements made to the current system were the implementation of the Touch-Type model, the development of a user interface for the Feature Extractor as well as the Feature Classifier, and a Trace Mechanism to help the researcher detect insufficient data. For future improvement, exploring more fallback models of this type should help in minimizing the error rate and achieving higher success rates. One such fallback model is the “Statistical Model”, which has already been developed and is based on statistical analysis of the data. The foremost priority for the future project team should be implementing this model, since our client believes that the Statistical Model will be the most accurate and that its results can be used to explain the performance of the other two models (Linguistic and Touch-Type).

References:
[1] G. Ngo, J. Simone and H. St. Fort, “Developing a Java-Based Keystroke Biometric System for Long-Text Input,” New York, USA; May 2006.
[2] M. Villani, C. Tappert, G. Ngo, J. Simone, H. St. Fort and S. Cha, “Keystroke Biometric Recognition Under Ideal and Application-Oriented Conditions,” Proc.- IBC, IBS, Montreal, Canada; July 2006.
