Search Systems for Math Information Retrieval: A Survey
A. Muhammad, M. S. H. Khiyal
Abstract—This paper covers modern developments in the field of math information retrieval, including math search systems, search engines and digital library systems. The use of different retrieval techniques is covered, including similarity search and search using K-Means, self organizing maps and agglomerative hierarchical clustering. Information access mechanisms utilized by different search systems, such as keyword search, expression search and hybrid search modes, are also surveyed. Moreover, user studies conducted by researchers for math information retrieval are surveyed.

Index Terms—Information Technology and Systems, Systems, Query Processing, Knowledge Management Applications, Mathematical Knowledge Management Applications, Modeling structured, textual and multimedia data.

• A. Muhammad is with the Department of Computer Science, Faculty of Basic and Applied Sciences, International Islamic University, Islamabad, Pakistan.
• M.S.H. Khiyal is with the Department of Computer Science and Software Engineering, Fatimah Jinnah Women University, The Mall, Rawalpindi, Pakistan.

——————————  ——————————



Recently, efforts have started to enable online mathematical information retrieval. Notable examples include the Digital Library of Mathematical Functions [1][3], Math Go [2] and Math Web Search [6][7]. There are multiple ways of building math search systems. A common approach is to augment a typical text based search system with math-aware search capabilities. Another approach is radically different: it uses emerging XML based technologies and markup languages such as XML Schema¹, the Resource Description Framework² and the OWL Web Ontology Language³. Finally, a third approach is to utilize tree based data structures for indexing and retrieving math content. This paper surveys systems implementing these approaches one by one.


Miller and Youssef [3] developed the first version of the Digital Library of Mathematical Functions (DLMF) at NIST. The DLMF attempts to provide an online source of mathematical content, such as formulas and graphs, with support for search and retrieval of that content [3]. The mathematical content of the DLMF, originally in LaTeX, is converted to HTML and XHTML using LaTeXML, a markup language and software tool developed at NIST [4]. The developers of the DLMF search system [3] added math-awareness to existing text search techniques. As a result, the query language syntax is almost identical to text search syntax, with the added power of recognizing mathematical symbols and structures to a great extent [4]. Even though text search technology has reached a high level of maturity, it is unable to fully capture all of the characteristics inherent in mathematical content [4]. Consequently, the query language developed for the DLMF project has limited expressive power. For example, searching for a sub-expression within a denominator, or for a symbol in the third row of a matrix, is not possible using the DLMF search system.

Presently, the DLMF project provides a simple web based interface that supports both searching for math content and browsing to it through hyperlinks. A plain text input field allows simple text based querying, while links to specific topics allow users to reach a document of interest through browsing.

A performance evaluation of the DLMF search system was conducted by Youssef [5]. The evaluation used a collection of 300 mathematical documents containing about 2000 equations and a set of 50 queries, and measured precision and recall. The system delivered 100% recall on all the queries. Precision ranged from 60% to 100%, averaging around 80%. The lack of a standard math query benchmark was noted in [5].
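The precision and recall figures reported above follow the usual retrieval definitions: precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that are retrieved. The following is a minimal sketch of that computation for a single query; the function name and the equation identifiers are hypothetical and are not taken from the evaluation in [5].

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: five equations returned, four of them relevant,
# and no relevant equation missed.
retrieved = ["eq1", "eq2", "eq3", "eq4", "eq5"]
relevant = ["eq1", "eq2", "eq3", "eq4"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

A query like this one would show full recall with precision of 80%, which is the kind of trade-off the reported averages describe.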




¹ World Wide Web Consortium, 2004, "XML Schema Part 0: Primer Second Edition".
² World Wide Web Consortium, 2004, "Resource Description Framework (RDF): Concepts and Abstract Syntax".
³ World Wide Web Consortium, 2004, "OWL Web Ontology Language Overview".

© 2011 Journal of Computing Press, NY, USA, ISSN 2151-9617

Wolfram Research offers a huge online collection of mathematical functions, formulas and graphics at the Wolfram Functions Site. Formulas are encoded in two formats: Mathematica's standard notation and traditional notation. The library offers a formula search that allows users to construct a query using drop down menus. Users can specify options to filter the search results by function category. More complex queries can be created using the boolean AND and OR operators. Fine grained search is possible by specifying constants, numbers and operations in different places of the formula. It is also possible to search for mathematical expressions using Mathematica patterns. For instance, consider the query "Find all integrals that contain powers of cos and sin under the integrand and that can be expressed using only elementary functions". The corresponding Mathematica language query is:
Integrate[Sin[_]^(_.)*Cos[_]^(_.), _]


This query form models the user's information need unambiguously, but its expressive power is limited. The search tool used on the Wolfram Functions Site restricts input to specific Mathematica patterns. The query language does not support search expressions containing functions that are not built in. In addition, searching is not supported for non built-in variables that do not appear within a pattern construct. For example, the simple query $x^{2} + y^{2}$ is not supported. The user must therefore learn a new language, which limits the ability to search freely.
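To illustrate how a pattern query of this kind selects formulas, the following is a minimal sketch of structural matching with wildcards over expression trees. It is not the Wolfram implementation. Expressions are modelled as hypothetical `(head, arg1, ...)` tuples, `BLANK` stands in for Mathematica's `_`, and the optional-argument form `_.` and reordering of commutative operands are deliberately not modelled.

```python
BLANK = object()  # plays the role of Mathematica's `_` (matches any sub-expression)


def matches(pattern, expr):
    """Return True if `expr` has the shape described by `pattern`."""
    if pattern is BLANK:
        return True
    if isinstance(pattern, tuple):
        # Heads and all arguments must match position by position.
        return (
            isinstance(expr, tuple)
            and len(pattern) == len(expr)
            and all(matches(p, e) for p, e in zip(pattern, expr))
        )
    return pattern == expr


# Simplified analogue of Integrate[Sin[_]^_ * Cos[_]^_, _]
PATTERN = (
    "Integrate",
    ("Times", ("Power", ("Sin", BLANK), BLANK), ("Power", ("Cos", BLANK), BLANK)),
    BLANK,
)

# Hypothetical library entries.
sin2_cos3 = ("Integrate", ("Times", ("Power", ("Sin", "x"), 2), ("Power", ("Cos", "x"), 3)), "x")
exp_only = ("Integrate", ("Exp", "x"), "x")

print(matches(PATTERN, sin2_cos3))  # True
print(matches(PATTERN, exp_only))   # False
```

The sketch also shows why such a language is unambiguous but restrictive: a formula is found only if its tree has exactly the shape that the pattern spells out.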

In its recent version [7], the system deploys a formula editor that serves as the query interface to Math Web Search. The Math Web Search system has thereby introduced an interesting input mechanism. The formula editor is a typical linear/palette input hybrid, with the input query representation loosely coupled to the input and output notations [7]. A table lists mathematical symbols and expressions divided into categories such as Calculus, Sets, Functions and Linear Algebra, and a selection from the table is inserted into the input field. In formula editor mode, multiple input syntaxes are supported: QMath:en, Maxima style, Yacas style, Mathematica style and Maple style. For instance, the query a + bi is written as complex(a,b), complex(a,b), Complex(a,b), Complex[a,b] and complex(a,b) respectively. The syntaxes differ only slightly from each other and cater for a wide range of users. The other mode is the raw mode, in which plain MathML can be entered⁶ (checked at the time of writing). The above query for the complex number a + bi in MathML is shown below:
<m:math xmlns:m="">
  <om:OMA xmlns:om="">
    <om:OMS cd="complex1" name="complex_cartesian" />
    <m:ci>a</m:ci><m:ci>b</m:ci>
  </om:OMA>
</m:math>

Math Go-I [2] is a search tool that provides formula based search. It clusters the formulas and retrieves results from the closest matching clusters. Math Go-I deploys the WebEQ editor⁵ to accept the user query; because the formula is produced by the WebEQ editor, it is valid presentation MathML. Math Go-I translates math expressions into text and converts them into tokens. It then ranks formulas using the vector space model.
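The cluster-then-rank idea can be sketched as follows. This is an illustration under stated assumptions and not the Math Go-I code: the clusters are assumed to have been computed beforehand, formulas are represented as hypothetical sparse term-weight dictionaries, and cosine similarity is used both to choose the closest cluster (via its centroid) and to rank the formulas inside it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def search(query, clusters):
    """Pick the cluster whose centroid is closest to the query, then rank its members."""
    best = max(
        clusters,
        key=lambda name: cosine(query, centroid([vec for _, vec in clusters[name]])),
    )
    ranked = sorted(clusters[best], key=lambda item: -cosine(query, item[1]))
    return best, [formula_id for formula_id, _ in ranked]


# Hypothetical, pre-computed clusters of tokenized formulas.
clusters = {
    "trig": [("f1", {"sin": 1, "power": 1}), ("f2", {"sin": 1, "cos": 1})],
    "calculus": [("f3", {"integrate": 1, "power": 1}), ("f4", {"differentiate": 1})],
}
query = {"sin": 1, "cos": 1}
print(search(query, clusters))
```

Only the members of the selected cluster are scored, which is what makes the approach cheaper than ranking the whole collection, at the risk of missing relevant formulas that were placed in another cluster.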

Kohlhase and Şucan [6] developed the first generation of MathWebSearch, a semantic search engine for mathematical content retrieval. The latest version of the MathWebSearch system harvests the web for content representations of formulas and indexes them with substitution tree indexing [7]. Substitution tree indexing is a form of tree-based indexing. A substitution tree is a tree whose nodes are substitutions [6]. A term is constructed by successively applying substitutions along a path in the tree; the leaves represent the terms stored in the index [6]. Internal nodes of the tree are generic terms and represent similarities between terms [6]. The main advantage of substitution tree indexing is that only substitutions are stored and not the actual terms [6], which leads to a small memory footprint. However, the technique does not support sub-term search elegantly and thus needs a workaround.
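The way a stored term is recovered from a path of substitutions can be sketched in a few lines. This is a simplified illustration and not the MathWebSearch implementation: terms are hypothetical nested tuples, variables are strings starting with `@`, and insertion into the tree and unification-based retrieval are left out.

```python
def apply(term, subst):
    """Replace every variable in `term` that `subst` binds."""
    if isinstance(term, tuple):
        return tuple(apply(t, subst) for t in term)
    return subst.get(term, term)


def term_at(path):
    """Rebuild a term by applying the substitutions found on a root-to-node path."""
    term = "@0"  # the root variable every path starts from
    for subst in path:
        term = apply(term, subst)
    return term


# A tiny hypothetical tree: each node stores only a substitution.
root = {"@0": ("plus", "@1", "@2")}
inner = {"@1": ("sin", "@3")}
leaf_a = {"@2": "y", "@3": "x"}            # leaf for plus(sin(x), y)
leaf_b = {"@2": ("cos", "x"), "@3": "x"}   # leaf for plus(sin(x), cos(x))

print(term_at([root, inner]))          # generic term shared by both leaves
print(term_at([root, inner, leaf_a]))  # first stored term
print(term_at([root, inner, leaf_b]))  # second stored term
```

The internal node yields a generic term that still contains variables, while each leaf yields a fully instantiated term, and the shared prefix of the two paths is stored only once.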

The Math Web Search system is a search engine for mathematical formulas. It represents the query internally in either OpenMath or Content MathML format [7]. Indexed documents are stored in a MySQL database. Recently, the formula search facility has been augmented with support for text based search [7]. This has been realized by combining math search with the Nutch system. Nutch is a text based search engine built on top of the open source Lucene architecture; it supplies a crawler, a link-graph database, and parsers for HTML and other document formats. The system maintains an additional Lucene based text index [7]. The formula/expression and text result sets are intersected using a ranking heuristic and presented as output [7]. Because the system has to perform dual matching of substitution trees as well as text content, ranking remains an open issue: Kohlhase et al. [7] note the need for further research on ranking results for combined math+text search.
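The source states only that the formula and text result sets are intersected by a ranking heuristic [7]; the heuristic itself is not described. The sketch below therefore assumes, purely for illustration, that each result set maps document identifiers to a score and that the combined score is the product of the two. All names and scores are hypothetical.

```python
def combine(formula_hits, text_hits):
    """Intersect two scored result sets and rank the common documents.

    The product of the two scores is an assumed heuristic, not the one in [7].
    """
    common = formula_hits.keys() & text_hits.keys()
    scored = {doc: formula_hits[doc] * text_hits[doc] for doc in common}
    # Highest combined score first; ties broken by document id for stability.
    return sorted(scored, key=lambda doc: (-scored[doc], doc))


formula_hits = {"d1": 0.9, "d2": 0.5, "d3": 0.7}  # from the formula index
text_hits = {"d2": 0.8, "d3": 0.2, "d4": 1.0}     # from the text index
print(combine(formula_hits, text_hits))
```

Documents that match only one of the two indexes (`d1` and `d4` here) are dropped by the intersection, which is why the choice of ranking for the remaining documents matters.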

⁵ WebEQ editor
⁶ Math Web Search
⁷ Design Science, Mathdex Introduction, activities/Miner-Robert/page1a.html

Math Dex is an evolutionary web-based search engine developed by Design Science⁷. Math Dex crawls the web for different math content and converts it into XHTML + MathML for indexing⁸. Math fragments combined with text content serve as input to the text search engine [8]. Extensive normalization of MathML is used to correct common conversion artifacts, markup problems and irrelevant differences. The basic concept utilized by Math Dex is to linearize mathematical notation as a sequence of text tokens [9]. Users enter mathematical query expressions via a graphical equation editor applet [9]. The math query is internally represented in MathML [8]. The math-processing layer converts the MathML query into a sequence of text-encoded math query terms, which form the basis of a text query performed by the underlying text search engine [8]. Additional text query terms are entered via a standard HTML text box [9]. The Math Dex system attempts to tolerate poor markup, accept documents with different math encodings, and allow searching on both math and text [10]. Math Dex uses an index of mathematical n-grams to quantify similarity to a mathematical expression [9]. This is analogous to the character based n-gram technique used in text retrieval systems. Mathematical n-grams are identified and indexed as text tokens in specific categories using the Apache Lucene engine⁹. The search engine produces results based on structural and syntactic similarity with the search terms.
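The n-gram idea can be sketched as follows, assuming the formula has already been linearized into text tokens. The token names are hypothetical, and the Jaccard overlap used to score similarity is an assumption made for this example; it is not the scoring that Math Dex or Lucene applies.

```python
def ngrams(tokens, n_max=3):
    """All contiguous token n-grams for n = 1 .. n_max, joined with spaces."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def similarity(query_tokens, formula_tokens, n_max=3):
    """Jaccard overlap of the two n-gram sets (an assumed scoring choice)."""
    q = set(ngrams(query_tokens, n_max))
    f = set(ngrams(formula_tokens, n_max))
    union = q | f
    return len(q & f) / len(union) if union else 0.0


# Hypothetical linearizations of x^2 + y^2 and x^2 - y^2.
query = ["x", "sup", "2", "plus", "y", "sup", "2"]
other = ["x", "sup", "2", "minus", "y", "sup", "2"]

print(ngrams(["a", "b", "c"], n_max=2))
print(similarity(query, query))
print(similarity(query, other))
```

Because the two expressions share most of their short token sequences, they score as similar even though they are not identical, which is the behaviour that distinguishes n-gram matching from exact matching.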

Ontologies are useful tools in domain specific systems. For general search, the list of topics is too broad to impose any kind of hierarchy or classification. In domain specific search such as mathematics, a natural ontology of topics exists: most documents clearly belong to a specific topic such as Algebra, Calculus or Sets. The World Wide Web Consortium recommends a number of semantic markup languages as part of the semantic web standards. These include XML¹⁰, XML Schema¹¹, the Resource Description Framework¹² and the Web Ontology Language¹³. These technologies can be utilized to develop ontology driven digital libraries but are not yet mature enough. To the best of our knowledge, no ontology driven mathematical system with search and computation facilities exists. The Math Go-II system (discussed next) uses a limited controlled ontology to provide browsing and targeted search.

⁹ Apache Foundation: Lucene Project
¹⁰ World Wide Web Consortium: 2008, Extensible Markup Language (XML) 1.0 (Fifth Edition)
¹¹ World Wide Web Consortium: 2004, XML Schema Part 0: Primer Second Edition
¹² World Wide Web Consortium: 2004, Resource Description Framework (RDF): Concepts and Abstract Syntax
¹³ World Wide Web Consortium: 2004, OWL Web Ontology Language Overview

Math Go-II is an augmented search system with math-aware search capabilities added on top of a typical search facility [12]. The system is targeted at facilitating math information retrieval for average math users. Math Go-II provides better query support, content indexing, relevance search and organized storage of data in a database [12]. The system utilizes clustering to improve retrieval. Three clustering techniques are used: K-Means clustering, Kohonen Self Organizing Maps (KSOM) and traditional Agglomerative Hierarchical Clustering (AHC). Documents are clustered before retrieval in the hope of grouping them into relevant categories. The search process retrieves only the closest matching clusters; the results are combined, sorted and presented. The underlying idea is that if the clusters are relevant, the results should be relevant too.

A math query language has been proposed in [12] for math information retrieval. The proposed query language tries to cover the notation commonly used by the majority of mathematics users. For example, a typical query for expressions that include the trigonometric function sin is simply sin. Support for major mathematical areas such as Algebra, Trigonometry and Calculus has been implemented. The query language enables search for math content using plain text and math expressions. In plain text search, a user types keywords to search the mathematical documents, for instance "Differentiate the following". This is a text enhanced search which retrieves documents containing the query keywords. The keywords can be text or math expressions; math expressions are translated to text keywords and then matched. For instance, $dy/dx$ is translated to the keyword differentiate.

When searching for math expressions, the user types an expression using the graphical formula editor provided. The system utilizes the Design Science WebEQ editor for math formula input. The WebEQ editor is specifically designed for working with equations encoded in MathML. A unified search interface is provided for carrying out the search. The MathML obtained from the equation editor is preprocessed: the system utilizes regular expression based templates to match patterns or key symbols in the MathML to the appropriate keywords. The formula query is thereby processed into tokens that form the query vector.
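The template step can be sketched as follows. The regular expressions and keyword names below are hypothetical examples written for this illustration and are not the templates used in Math Go-II [12]; they only show how presentation MathML patterns could be mapped to keywords and counted.

```python
import re
from collections import Counter

# Hypothetical templates: (pattern over presentation MathML, keyword).
TEMPLATES = [
    (
        re.compile(
            r"<mfrac>\s*<mrow>\s*<mi>d</mi>\s*<mi>\w+</mi>\s*</mrow>"
            r"\s*<mrow>\s*<mi>d</mi>\s*<mi>\w+</mi>\s*</mrow>\s*</mfrac>"
        ),
        "differentiate",
    ),
    (re.compile(r"<mo>\s*(?:&int;|∫)\s*</mo>"), "integrate"),
    (re.compile(r"<mi>\s*sin\s*</mi>"), "sin"),
    (re.compile(r"<msup>"), "power"),
]


def mathml_to_keywords(mathml):
    """Count how often each template fires in a MathML string."""
    counts = Counter()
    for pattern, keyword in TEMPLATES:
        n = len(pattern.findall(mathml))
        if n:
            counts[keyword] = n
    return counts


dy_dx = (
    "<math><mfrac><mrow><mi>d</mi><mi>y</mi></mrow>"
    "<mrow><mi>d</mi><mi>x</mi></mrow></mfrac></math>"
)
sin_squared = "<math><msup><mi>sin</mi><mn>2</mn></msup><mi>x</mi></math>"

print(mathml_to_keywords(dy_dx))
print(mathml_to_keywords(sin_squared))
```

The resulting keyword counts can be treated like term frequencies of an ordinary text document, which is what allows math and text to share one index.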
The motivation behind MathML preprocessing is to discover important patterns in the equation and assign meaning to them. Thereafter, the final document vector is constructed by counting the number of occurrences of each pattern. The vector space model with cosine similarity is employed for better weighting. The normalized query vector is used to retrieve the document vectors of all the equations. For large scale user authoring, well written templates can detect and transform different MathML forms into one standard intermediate form. Carefully written templates can recognize multiple forms of presentation markup as well as content markup. Presently, the Math Go-II digital library has its math content encoded in presentation MathML, so the templates are written to identify concepts from presentation MathML. The Math Go-II system translates math expressions into text tokens. The vector thus formed is a hybrid vector with support for math in addition to text. With the hybrid vector formed for math documents, the documents are ranked and displayed in the same way as their text counterparts, using the popular vector space model with the term frequency - inverse document frequency (TF-IDF) method.
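The ranking step can be sketched as follows. This is a minimal illustration of the vector space model with TF-IDF weights and cosine similarity, assuming an unsmoothed inverse document frequency log(N/df) and hypothetical token lists; the exact weighting used in Math Go-II may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per document, plus the document frequencies."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors, df


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(query, docs):
    """Return document indices ordered by decreasing similarity to the query."""
    vectors, df = tfidf_vectors(docs)
    n = len(docs)
    q_tf = Counter(query)
    # Query terms that never occur in the collection carry no weight.
    q = {t: q_tf[t] * math.log(n / df[t]) for t in q_tf if t in df}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical hybrid documents: math keywords and text tokens side by side.
docs = [
    ["differentiate", "sin", "power"],
    ["integrate", "cos", "power"],
    ["differentiate", "differentiate", "fraction"],
]
print(rank(["differentiate", "sin"], docs))
```

Because math keywords and ordinary words are both just terms in the vector, the same scoring function ranks formula queries and text queries.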



An initial user study for math information retrieval was conducted by Zhao et al. [11]. The findings from the study indicate two areas which a math search engine should address. The first is meta search, that is, being able to search through multiple math collections. The second is resource categorization, which is needed to classify math resources into different groups such as definitions, theorems, graphs and so on [11]. Muhammad and Khiyal [15] presented another user study, conducted to understand the thought process of an average user searching for math content. The conclusions from the study are as follows:
1. Searching by keyword for math content is fast but inaccurate and disorganized.
2. Domain specific digital directories/libraries are useful but difficult to find, maintain, subscribe to etc.
3. The web is very rich and resourceful but lacking in organization (content grouping) for a specific domain such as Mathematics.
These conclusions point to the need for a digital library which facilitates mathematical knowledge management.


This paper surveys the developments in the area of math information retrieval. Math search systems, search engines and digital library systems are covered. Different search modes, such as free form text and math expression queries, are investigated. User studies in math information retrieval are also surveyed.

MathML is the math markup language standard of the World Wide Web. The primary step in the creation of a mathematics digital library is a good mechanism for content creation, verification and interchange. Without such a facility, digital content cannot be properly created, verified or interchanged between software systems. Math content has two aspects: a presentational aspect and a conceptual aspect. The presentational aspect is the visible arrangement of math symbols. The conceptual aspect deals with the underlying organization of ideas. The content creation facility should capture the presentational as well as the conceptual aspect of math expressions. In this way, math content can be easily displayed as well as processed in software systems.

LaTeX, MathML and Open Math offer good solutions for authoring math content. LaTeX is a set of extensions on top of TeX (a popular formatting and typesetting language) for content authoring. It captures the presentational aspects of math expressions but is not able to capture their conceptual aspect. For example, it is not possible to specify in LaTeX whether a symbol e is a mathematical variable or a mathematical constant. The Open Math project¹⁴ provides a semantic encoding of mathematical objects. Open Math aims to provide a standard way of communication between mathematical applications, so it is solely concerned with the content representation of mathematical expressions and not their presentation. Both LaTeX and Open Math thus fall short of capturing the full characteristics of math expressions.

MathML is an XML application for describing mathematical notation. It offers a good solution to math encoding problems, specifying a concise and clear way to encode math content. MathML, with its two separate encodings for the presentation and conceptual aspects of mathematics, allows mathematics not only to be displayed, but also to be exchanged and processed on the Web [13]. MathML elements can be included in XHTML documents with namespaces, and links can be associated with any mathematical expression through XLink [14]. Moreover, it is a World Wide Web Consortium recommendation¹⁵. Presentation MathML is used to encode the 2-dimensional layout and formatting of the notation that visually illustrates the expression, while Content MathML is used to encode the underlying structure of the mathematical concepts represented by that notation [13]. Naturally, presentation MathML deals mostly with the presentational aspect of the mathematics. This highlights the importance of content MathML as a vehicle for authoring and processing math content. However, it is noted in [6] that most of the available MathML on the World Wide Web is Presentation MathML.

¹⁴ Open Math Standard 2.0
¹⁵ Mathematical Markup Language Version 2.0

[1] Daniel Lozier. NIST digital library of mathematical functions. In Annals of Mathematics and Artificial Intelligence, pages 1–3. Kluwer Academic Publishers, ISSN, 2003.
[2] Muhammad Adeel, Hui Siu Cheung, and Malik Sikandar Hayat Khiyal. Math go! prototype of a content based mathematical formula search engine. Journal of Theoretical and Applied Information Technology, 4(10):11, Oct. 2008.
[3] Bruce R. Miller and Abdou Youssef. Technical aspects of the digital library of mathematical functions. Annals of Mathematics and Artificial Intelligence, 38:121–136, 2003.
[4] Moody Ebrahem Altamimi and Abdou S. Youssef. A math query language with an expanded set of wildcards. Mathematics in Computer Science, 2(2):305–331, 2008.
[5] Abdou Youssef. Search of mathematical contents: Issues and methods. In IASSE, pages 100–105, 2005.
[6] Michael Kohlhase and Ioan A. Şucan. A search engine for mathematical formulae. In Proc. of Artificial Intelligence and Symbolic Computation, number 4120 in LNAI, pages 241–253. Springer, 2006.
[7] Michael Kohlhase, Stefan Anca, Constantin Jucovschi, Alberto González Palomo, and Ioan A. Şucan. MathWebSearch 0.4, a semantic search engine for mathematics. Unpublished, 2008.
[8] Rajesh Munavalli and Robert Miner. Mathfind: a math-aware search engine. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 735–735, New York, NY, USA, 2006. ACM.
[9] Robert Miner and Rajesh Munavalli. An approach to mathematical search through query formulation and data normalization. In Calculemus/MKM, pages 342–355, 2007.
[10] Mathdex introduction, 2009.
[11] Jin Zhao, Min-Yen Kan, and Yin Leng Theng. Math information retrieval: user requirements and prototype implementation. In JCDL, pages 187–196, 2008.
[12] A. Muhammad and M. S. H. Khiyal. Engineering a domain specific niche search system for mathematical document retrieval. J. Chinese Institute of Engineers, submitted for publication.
[13] Moody Al-Tamimi and Abdou S. Youssef. A more canonical form of content MathML to facilitate math search. In The 2007 Extreme Markup Languages conference, Montréal, Canada, August 2007.
[14] Mustapha Eddahibi, Azzeddine Lazrek, and Khalid Sami. Arabic mathematical e-documents. In TeX, XML, and Digital Typography, pages 158–168, 2004.
[15] A. Muhammad and M. S. H. Khiyal. MathGO: Digital library for math content storage and retrieval. Unpublished manuscript.