Experiences using F# for developing analysis scripts and tools over search engine query log data

Abstract. We describe our experience using the programming language F# for the analysis of text query logs from the Bing search engine. The goals of the project were to develop a set of scripts enabling ad-hoc query analysis, clustering and feature extraction, as well as to provide a subset of these capabilities within a data exploration tool developed for non-programmers. Where appropriate we describe programming patterns for the text analysis domain that we used in our project. Our motivation for using F# was to explore the benefits and weaknesses of F# and functional programming in such an application environment. Our investigations showed that, for the target application, common use cases involve the creation of ad-hoc pipelines of data transformations. F#'s language- and library-level support for such pipelined data manipulation and for lazy sequence evaluation made these aspects of the work both simple and succinct to express. We also found that when operating at the extremes of data scale, the ability of F# to natively interoperate with C# provided noticeable programmer efficiency: at these limits, reusing existing C# applications may be desirable.

1 Introduction

Query log analysis is one of the techniques that can drive improvement of the quality of search engine results. Typically, analyses of query logs require both extensive hardware resources and complex modeling techniques. Our goal in this research is to explore the suitability of the F# programming language [17] for ad-hoc analysis of query log data. We focus mainly on the support of F# for investigative programming and tool building. While the scripts and the application that we developed work on multi-core machines and in out-of-main-memory scenarios, efficiency considerations were secondary to ease of use and the ability to quickly evaluate models over the data. To carry out our evaluation of F#'s suitability to the domain of query analysis, we implemented a few clustering and feature selection algorithms. We incorporated the techniques that work best on our data in an application with a graphical frontend. The application allows browsing, filtering and grouping of the data, and display of aggregated patterns.

The field of large-scale query log data analysis is driven by the need to improve search engine quality. The dominant systems approaches in the field are MapReduce [7], SCOPE [4] and DryadLINQ [10], where the focus is on scalability. Due to the high latency caused by the large amounts of data and the sharing of cluster resources between many users, those approaches may not offer quick turnaround times. An alternative for many information retrieval researchers and practitioners is to analyze a sample of the data using interactive tools like Matlab or external commands. Those usually require that the data fits in main memory. The approach that we took with F# offers a middle ground by not requiring the data to fit in main memory while still allowing for quick composition of primitives in scripts.

The majority of open source tools in our field are written in Java. Various data processing scripts are typically written in Python or Perl. In our project we reap both the benefits of scripting and the high-level programming abstractions that have proven useful in our field, using the same programming language. Additionally, ideas from functional programming underpin the design of MapReduce and DryadLINQ, which have proven effective solutions to data analysis problems similar to ours. Even though we consider F# a good match to data analysis tasks such as ours, reports detailing similar experiences are rare [6]. We describe the project in the next section. We discuss related work in Section 3. In Section 4 we detail our experience with F# and show typical use cases.

2 Project Description

Our project consisted of two stages: a) investigation of clustering algorithms applied to query logs (Table 1) and b) implementation of a graphical exploration tool (Figure 1). In the first stage we implemented a few standard clustering algorithms with custom modifications. We implemented model-based K-means [21] and LDA [16]. Even though clustering is a well-studied topic in the statistics, data-mining and machine learning literature, finding a clustering method that works well on our dataset was a challenging task. The main reason is that most of the queries are very short. This makes it hard to estimate distances between the queries, which in turn prevents off-the-shelf clustering tools from working well. Additionally, text clustering algorithms are quite sensitive to initialization and data preprocessing. While the programming language used for implementation does not guarantee success for data analysis, the use of F# did help us in the evaluation of multiple possible algorithms, parameters and data inputs. We consider that F# was a good match for this problem because of its ability to create small and easily modifiable programs. Higher-order functions, succinct syntax and the possibility to define custom abstractions (e.g. pipelines of data transformations) were important for this project. We also took advantage of F#'s model for parallel computations and of its capabilities for lazy evaluation over sequences that may not fit in memory. On the other hand, while some of the algorithms we implemented were a good match to functional programming, we also found that some data mining algorithms that involve mutation of state did not fit the functional programming paradigm. We were still able to express them in F# using imperative programming. In the second stage we incorporated the most successful clustering algorithms in a query browser application with a graphical user interface (Figure 1). We were able to build both research-style scripts and seamlessly integrate them into a tool.

We organized the application around datasets derived from the query log. We used three different kinds of datasets: 1) datasets containing queries, 2) datasets containing urls and 3) datasets containing textual features derived from the query or url datasets. Due to the large size of the query log, multiple levels of caching take place before we can import the data into our application. We import the data from an existing distributed application written in C# via a limited number of filtering commands. We store the datasets in binary files on disk, so that the user can return to an existing dataset later. This approach is chosen both for convenience and because the data may not fit in memory.

We implemented the following groups of operations for query and url datasets: 1) filtering, 2) clustering, 3) selection of top data points (words, queries, urls) by various criteria, 4) grouping, 5) transformations, 6) merging, 7) extraction and 8) comparison. All operations except extraction, comparison and feature analysis return new datasets as results. The operations on datasets always return a new dataset without modifying an existing one in place. Given that many machine learning algorithms take a dataset and produce a new one (e.g. by filtering or weighting the data points), the selected abstraction seems natural. It is further reinforced by the F# library design, which encourages that data transformations be obtained by chaining a few primitives. Thus F#'s standard library gave us a good design pattern to follow. Based on insights from executing various operations, the user can feed new datasets derived from observed patterns back into the system and proceed iteratively. We carried out all computations on a high-end desktop, using parallel programming techniques for the core operations.

Table 1. Example of discriminative words extracted after clustering queries containing the word F#. The clusters separate the programming-language sense of F# (e.g. type, seq, module, pattern, matching, discriminated, union, workflow, computation, monad, async, lazy, immutable, recursion, fold, generic, interface, member, visual, studio, msdn, compiler, powerpack, mono, ctp, beta, release, download, tutorial, samples, 2008, 2009, 2010) from the musical-key sense (e.g. piano, ukulele, violin, clarinet, sax, chords, scale, key, major, minor, harmonic, melodic, pentatonic, phrygian, dorian, octaves, fingering).
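To give a flavor of this dataset abstraction, the following is a minimal sketch of ours, not the application's actual code; the Dataset type and the operation names (filterQueries, topWords) are hypothetical stand-ins for the operation groups listed above.

// Hypothetical dataset API: every operation returns a new dataset
// and leaves its input untouched, as described above.
type Dataset<'a> = { Name: string; Items: seq<'a> }

let filterQueries pred (ds: Dataset<string>) =
    { Name = ds.Name + "/filtered"; Items = Seq.filter pred ds.Items }

let topWords n (ds: Dataset<string>) =
    { Name = ds.Name + "/topWords"
      Items =
        ds.Items
        |> Seq.collect (fun q -> q.Split ' ')
        |> Seq.countBy id
        |> Seq.sortByDescending snd
        |> Seq.truncate n
        |> Seq.map fst }

// Chaining a few primitives, in the style encouraged by the F# library:
let recipeWords queries =
    queries
    |> filterQueries (fun q -> q.Contains "recipe")
    |> topWords 50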

We demonstrate the usefulness of our tool by an example. The task in this example is to extract from the query log a list of cooking recipes. To gain an intuition about the properties of the data, the user can proceed in the following order: 1) search for queries containing the phrase "cooking recipes"; 2) explore various analysis views of top features and clusters; 3) notice that the pattern "/Recipe/X/", where X is the name of a recipe, appears more often than expected by chance in the urls; 4) observe that this pattern is associated with websites which have a large number of hits on cooking recipes.

Fig. 1. Our application main screen.

3 Related Work

A popular approach for large scale data processing is MapReduce [7]. MapReduce was inspired by the functions map and reduce from the Lisp programming language. MapReduce, however, is somewhat inappropriately named, because it combines the functions map and reduce with a third operation: group by. A similar monolithic function was proposed in [19], before MapReduce became popular, in order to solve elegantly the problem of histogram creation in functional languages.
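To make this observation concrete, here is a small illustration of ours (not code from the project): word count, the canonical MapReduce example, written as the three operations that MapReduce actually combines.

let wordCount (docs: seq<string>) =
    docs
    |> Seq.collect (fun doc ->                 // "map": emit (key, value) pairs
        doc.Split ' ' |> Seq.map (fun w -> (w, 1)))
    |> Seq.groupBy fst                         // the implicit "group by" step
    |> Seq.map (fun (word, pairs) ->           // "reduce": aggregate within each group
        (word, pairs |> Seq.sumBy snd))

// wordCount ["the cat"; "the dog"] yields seq [("the", 2); ("cat", 1); ("dog", 1)]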

At the core of the MapReduce approach is the use of associative operators to split the problem, so that intermediate results can be aggregated. Sawzall [13] is a special purpose scripting language that is a generalization of MapReduce. The generalizations in Sawzall are in two directions: a) nested grouping and b) application of multiple associative operations within groups, computed by a single pass over the data. In our application we also use associative operators (counts, random samples, groups and within-group statistics), and in that respect we are similar to MapReduce and Sawzall.

Another large scale data processing system in use is SCOPE [4]. SCOPE is modeled after SQL but allows users to plug in arbitrary extractors and reducers written in any .Net language. A SCOPE program is very much like a skeleton or framework used to piece together user-defined C# functions and orchestrate the distributed execution. Some of our code follows a similar style. DryadLINQ [10] is a more general approach, based on a much larger number of combinators from functional programming than MapReduce. DryadLINQ offers distributed execution of LINQ queries by dynamic generation of query execution plans. As part of the query execution, DryadLINQ also avoids out-of-memory conditions by data partitioning and serialization via reflection. While some of our code in F# uses similar combinators as DryadLINQ, we do not compute query plans but manually point intermediate results to disk when necessary.

Our code for file input and data transformations looks similar to the Haskell code in [8]. We also use a combinator-based approach similar to Haskell's Data.Binary for specifying serializers. However, the F# code is simpler because F#'s type system does not track side-effects with monads. F# was used for a different type of data-driven research task in the TrueSkill system [6]. TrueSkill models the skill level of players in a Bayesian framework; the data experiments in TrueSkill were carried out in F#. A basic research-style search engine in F# is described in [14]. Its implementation uses F#'s standard library and a library for external memory algorithms.

4 Experiences

Our experience with F# is very positive. It allowed us to immediately focus on the domain problem of exploring algorithms on query log datasets without presenting software engineering challenges. We wanted to explore multiple clustering and feature selection algorithms, as well as multiple parameterizations, data inputs and initializations. F# was useful for this research problem in the following ways: 1) it helped us implement well-understood baseline methods quickly and gain intuition about properties of the data; 2) it allowed us to easily achieve flexible parameterization via higher-order functions and thus explore multiple inputs to the algorithms. We found it beneficial to always start with the smallest program that could possibly solve the problem and then work, if necessary, towards making it more robust and efficient. The only bottleneck related to F# that we encountered had to do with sorting tuples of strings; we were able to work around it effectively. Other bottlenecks were related to multiple disk writes or were of an algorithmic nature.
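As an illustration of point 2, consider the following sketch of ours (not the project's code): the distance function is an ordinary higher-order parameter, so exploring a different variant amounts to passing a different function.

// Assigning a document to its nearest center, with the distance
// function passed as a parameter.
let nearestCenter distance centers doc =
    centers |> Array.minBy (fun c -> distance c doc)

// Euclidean distance over dense feature vectors
let euclidean (c: float []) (d: float []) =
    Array.fold2 (fun acc a b -> acc + (a - b) * (a - b)) 0.0 c d |> sqrt

// Cosine-based distance as an alternative parameterization
let cosineDist (c: float []) (d: float []) =
    let dot = Array.fold2 (fun acc a b -> acc + a * b) 0.0 c d
    let norm (x: float []) = x |> Array.sumBy (fun v -> v * v) |> sqrt
    1.0 - dot / (norm c * norm d + 1e-12)

// Exploring variants is a call-site change:
//   nearestCenter euclidean centers doc
//   nearestCenter cosineDist centers doc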

4.1 Experiences with Algorithm Implementation

We found that the functional programming style fitted well the implementation of model-based K-means via standard combinators like map and group by. This is an example of a data mining algorithm that can be expressed in functional style with the standard F# library. We show a simple implementation in Figure 2 to illustrate the functional programming style we tend to use. This script gives results comparable to the ones described in the literature for this algorithm [21].

open System.Collections.Generic

// Dict.* below are small in-house helpers over Dictionary.
type Counts = Counts of int * Dictionary<string,int>
type Document = Document of (string*int) []

let documentToWordsAndTf (Document wordsAndTf) = wordsAndTf

let countsFromStream (strm: (string*int) seq) =
    let d =
        strm
        |> Seq.groupBy fst
        |> Seq.map (fun (word, pairs) -> (word, pairs |> Seq.sumBy snd))
        |> Dict.fromSeq
    Counts (d |> Dict.getValues |> Seq.sum, d)

let kmeans (numClusters: int) (documents: Document []) : int [] =
    let numUniqueWords =
        documents
        |> Seq.collect documentToWordsAndTf
        |> Seq.distinctBy fst
        |> Seq.length |> float
    let negLogProb (Document wordsAndTf) (Counts (total, counts)) =
        wordsAndTf
        |> Seq.sumBy (fun (word, tf) ->
            let wordProb =
                float ((Dict.getDefault 0 counts word) + 1)
                / ((float total) + numUniqueWords)
            -(float tf) * log wordProb)
    let docsToAssignments (clusters: (int*Counts) []) =
        let docToCluster (d: Document) =
            clusters |> Seq.minBy (snd >> negLogProb d) |> fst
        documents |> Array.map docToCluster
    let assignmentsToClusters (assignments: int []) =
        Seq.map2 (fun clusterId doc -> (clusterId, doc)) assignments documents
        |> Seq.groupBy fst
        |> Seq.map (fun (clusterId, docs) ->
            (clusterId, docs |> Seq.collect (snd >> documentToWordsAndTf)
                             |> countsFromStream))
        |> Array.ofSeq
    let rec loop times assignments =
        if times <= 0 then assignments
        else assignments |> assignmentsToClusters |> docsToAssignments
                         |> loop (times - 1)
    let rand = System.Random 1234
    Array.init (Array.length documents) (fun _ -> rand.Next numClusters)
    |> loop 10

Fig. 2. Basic implementation of model-based K-means for document clustering in F# (after [21]).

We used this program as a starting point. We found that on published datasets modifications to the K-means initialization improved results. We extended the program with other smoothing methods and used it for the related bisecting K-means algorithm [15]. The high-level implementation can easily be made parallel by changing Array.map to Array.Parallel.map in the docsToAssignments function.

Unfortunately, we could not use the functional style for the LDA clustering algorithm. The reason is that an efficient implementation of this algorithm involves mutation of state. We did an equally simple implementation in C# by imperatively updating matrix values and explicitly manipulating indexes; the resulting implementation was around 200 lines, vs. 50 in F#. For this algorithm we lose the benefits of functional programming, but we still retain some of the benefits of F#, including the code conciseness resulting from the type inference system.

We found that the kinds of data mining algorithms that benefit from rapid prototyping similar to K-means include grouping, filtering and applications of statistical functions over a data collection. Proponents of functional programming have often emphasized that small changes in the specification should be reflected as small changes in the implementation. We found that this was often the case when we used the functional programming style in F#. Thus, the ability to easily modify programs is valuable for data processing tasks.

Scripts like the one in Figure 3 are usually used only once; their purpose is usually to change the format of datasets or to join them.

// create a parser by composing parsing combinators
let lineParser =
    tuple2 (parseTill "\t")
           (restOfLine |>>= fun s -> s.Split([|'\x15'|]))

File.readLines @"input.txt"                     // read a file into a stream of lines
|> Seq.parseWith lineParser                     // parse each line using the parser
|> Seq.collect (fun (query, relatedQueries) ->  // create (query, relatedQuery) pairs
    relatedQueries |> Seq.map (fun relatedQuery -> (query, relatedQuery)))
|> Query.objectsToQueryNodes snd                // in-house library: expand the query
                                                // string to a query object with urls
|> Seq.collect (fun ((query, relatedQuery), queryNode) ->
    queryNode
    |> Query.getEdges                           // extract urls from the query object
    |> Seq.map (fun (url, count) ->
        [query; relatedQuery; url; count.ToString()]
        |> String.concat " "))                  // format the result
|> File.writeLines "outputFile.txt"

Fig. 3. An example of a typical query log processing task in F#.
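File.readLines and File.writeLines in Figure 3 come from the in-house library; minimal stand-ins (our sketch, under the assumption that they stream lazily) could look as follows.

module File =
    // Lazily enumerate the lines of a file; nothing is read until consumed.
    let readLines (path: string) : seq<string> =
        seq { use r = new System.IO.StreamReader(path)
              while not r.EndOfStream do yield r.ReadLine() }

    // Write a sequence of lines without materializing it in memory.
    let writeLines (path: string) (lines: seq<string>) =
        use w = new System.IO.StreamWriter(path)
        for line in lines do w.WriteLine line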

4.2 Experience with Applicative Style

A large amount of the code we wrote uses the applicative style with F#'s function application operator, as in:

input |> step1 param1 |> step2 |> ... |> output

This code is equivalent to:

output (... (step2 (step1 param1 input)))

It is a convention to use the former style in F# code. We found this style preferable because one can easily add and remove processing stages. In our use cases the ability to easily modify the code is highly desirable. A common idiom is to first run the program on a small sample of the data (Figure 4); as the figure shows, the corresponding change is very small. Had the logic been fused into a single loop, such a small change would be more difficult.

Run on a sample first:

input |> sample 5000 |> step1 |> ... |> output

Run on the complete dataset (the sample step removed):

input |> step1 |> ... |> output

Fig. 4. An example from our practice showing that the applicative style is easily amenable to modifications.

A different canonical example from Information Retrieval is a document processing pipeline. In F# we use:

document |> tokenize |> stopword |> stem

Some object-oriented implementations of Information Retrieval systems simulate this pattern by a special Pipeline interface which applies only to a stream of words. As shown in [14] and [12], this pattern applies not only to processing words but also to the index construction and query matching components of search engines, which themselves may use nested pipelines. It is usually the case that many of the applied steps are quite simple when considered in isolation. Typically each of the steps does not use global mutable state except for I/O; the steps may use local mutable variables, as opposed to a more mathematical formulation via recursion. However, when one attempts to construct directly the code which corresponds to a long pipeline, one is forced to write loops and nested loops, and to mix indices from different conceptual stages together. Such code quickly becomes incomprehensible and unmodifiable.

4.3 Experience with Streams

Typical query log datasets do not fit in memory. For this reason, the ability to stream over the data is very important. F#'s library implements imperative streams as enumerators. F# supports a good syntax for generating sequences and a useful library of common operations over sequences. However, the internal implementations of most of the standard library functions and of our extensions are imperative, for efficiency.
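The sequence syntax works particularly well when the iteration resembles a loop over the data; a small sketch of ours (not the project's code):

// Lazily enumerate all lines of all log files in a directory.
// The seq workflow generates items on demand; nothing is materialized.
let allLogLines (dir: string) = seq {
    for file in System.IO.Directory.EnumerateFiles(dir, "*.log") do
        yield! System.IO.File.ReadLines file }

// Consume lazily: count matching lines while streaming over the data.
let recipeHits dir =
    allLogLines dir
    |> Seq.filter (fun line -> line.Contains "recipe")
    |> Seq.length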

There are multiple possible designs for streams, each with different trade-offs. Functional streams are a more flexible alternative to imperative streams; they could be implemented via recursion [2] or iteration [5]. In "push-based" streams the producer pushes items into the pipeline. We have experimented with lazy functional streams, "push-based" streams, and streams based on message passing.

We experimented with stream implementation via lazy lists as described in [20]. While we could create a good syntax thanks to F#'s workflow support, experiments revealed that lazy lists were not acceptable for our data loads. The main reason seems to be that a large number of thunks were created. While pipelines of sequence operations (maps, filters, etc.) are slower than direct loops, we did not find this to be a bottleneck in practice. F# does not have a sophisticated code rewrite system for library writers to implement Haskell's optimized streams; F# relies instead on the virtual machine to optimize pipelines of sequences. In general, we found that even an optimizing compiler such as MLton may fail in certain complex cases (even when using the iterative implementation from [5]). While one could blame the .Net virtual machine for not optimizing such patterns, one should select carefully the trade-offs between arrays and sequences: frequent materialization of sequences into arrays might end up slower than plain sequences because of memory accesses, and can cause out-of-memory exceptions on high data loads.

Programs that are expressed as a pipeline composed from various stages can be considered declarative; the internal implementation of the combinators matters for efficiency only. The default combinators are in F#'s Seq module, but one could also use LINQ in the same way, or user-defined implementations. The Unix command line offers a similar but more restricted programming model, the major practical difference in our use cases being that one cannot nest other commands within a command. The F# workflow syntax for sequences works well when the iteration resembles a for-loop, but it is harder to use in other more complex cases, for example when merging or joining two sorted sequences.

We learned to be careful when using lazy evaluation of streams together with external mutable state. A common example of the latter is a random number generator. One has to be careful not to unintentionally reuse a sequence twice, because this might repeat long computations or produce a different result, since the result depends on the order of evaluation. There is no protection against such problems in F#, and in some cases the obtained result might be intended (e.g. inserting print statements while debugging).
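A minimal illustration of ours of this interaction between lazy sequences and external mutable state:

// A lazy sequence that consults external mutable state.
let rand = System.Random ()
let noisy = Seq.init 3 (fun _ -> rand.Next 100)

// Each enumeration re-runs the generator, so reusing the sequence
// can yield different values each time:
let a = List.ofSeq noisy   // e.g. [14; 67; 2]
let b = List.ofSeq noisy   // e.g. [91; 5; 38], not the same as a

// Materializing once and reusing the resulting list gives stable values:
let stable = noisy |> List.ofSeq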

4.4 Experience with Scripting

We give an example of a possible use of F# for scripting in our domain: the script in Figure 3. The goal in this example is to expand each pair of a query and a related query to a list of corresponding urls and click counts. We use an in-house service to fill in the required data. In this example we read a file whose lines are formatted as "headQuery \t relatedQuery1 0x15 relatedQuery2 ...". The final result is a list of (headQuery, relatedQuery, url, count) tuples written to a file. We use the collect function twice to emit multiple values in the resulting stream. We found it very useful that simple tasks such as this one can be expressed in short and readable code.

In this example we use parsing combinators from FParsec [18] as a way to declare the input format. FParsec is an implementation for F# of the parsing combinators described in [11]. Our experience is that it is possible to use parsing combinators for many of the ad-hoc formats which arise in our practice. This is a very successful example of the power of functional programming combinators, because it gives gains in both programmer and program efficiency.

While FParsec may be quite useful, its use can be avoided if one has control over the formatting specification. For those cases we describe the format using a simple library of serialization combinators, the most common of which are int, string, list, tuple2, etc. Similar combinators are found in Haskell's Data.Binary library. Custom code for input/output formatting is typical for many text processing tasks. The type of the record for output can in principle be obtained using reflection, but the input type needs to be specified somehow because of the static typing in F#. Here is some typical code:

// specification of the format
let schema = string @ (list string)  // the @ operator denotes a tuple of two elements

// output code
someSequence |> Seq.outputRecords schema "filename"

// input code
Seq.inputRecords schema "filename" |> processSequence

We found the approach of building a schema with functions convenient. It is also very efficient, because there is no string parsing. This pattern avoids the need to deal with parsing and output formatting, and with error-prone issues such as string escaping and byte manipulation. An important characteristic of our use cases, and of many other Information Retrieval applications, is that the input is usually large and read from a file; in some cases the generated output is also large.
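The int, string, list and tuple2 combinators belong to the in-house library. One possible realization, a sketch of ours and not the paper's implementation, pairs a writer with a reader for each format; the names are suffixed with S here to avoid shadowing F#'s built-in string function and list type, and the (@) operator defined below shadows F#'s list-append operator.

// A schema couples a binary writer and a binary reader for one format.
type Schema<'a> =
    { Write : System.IO.BinaryWriter -> 'a -> unit
      Read  : System.IO.BinaryReader -> 'a }

let stringS =
    { Write = fun w (s: string) -> w.Write s
      Read  = fun r -> r.ReadString () }

let listS (item: Schema<'a>) =
    { Write = fun w (xs: 'a list) ->
                  w.Write xs.Length
                  xs |> List.iter (item.Write w)
      Read  = fun r -> List.init (r.ReadInt32 ()) (fun _ -> item.Read r) }

// (@) builds the schema of a pair from the schemas of its components.
let (@) (a: Schema<'a>) (b: Schema<'b>) =
    { Write = fun w (x, y) -> a.Write w x; b.Write w y
      Read  = fun r ->
                  let x = a.Read r
                  (x, b.Read r) }

// Usage mirroring the text: a string paired with a list of strings.
let schema = stringS @ listS stringS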

4.5 Experience with Parallel Computations

Common cases of parallel computations in our application involve filters, transformations, counts, random samples and groupings. In many cases the operations we encountered were associative or could be derived from associative operations. Examples of associative operations are sums and counts; textbook examples of operations that can be decomposed into associative operations are the average and the standard deviation. Typically, for each group we compute a number of statistics. In the field of Information Retrieval, the major paradigms for grouping and reductions within groups are MapReduce [7] and its generalization Sawzall [13], SCOPE [4] and DryadLINQ [10]. Programs written for those systems require dedicated clusters and specialized software.

F#'s Seq module contains a groupBy function. This function, however, materializes the values that fall within each group. If the input is large, usage of this function will cause the program to run out of memory. In many cases the values within the groups are not required, but only some statistics that can be computed in limited memory. We use "push-based" streams to avoid storing lists of values within groups: instead, only a few numbers are stored within each group. Figure 5 contrasts the two approaches.

// group by key; compute the number of values and their sum for each group
// input is a sequence of (key, value) tuples

// Solution using the standard F# Seq module
input
|> Seq.groupBy fst   // materializes the values within groups
|> Seq.map (fun (key, values) ->
    (key, (values |> Seq.length, values |> Seq.sumBy snd)))
// Result is seq<'key * (int*int)>

// Solution using our "push-based" stream implementation, avoiding
// unnecessary materialization of intermediate results
let spec = by (fst, split2 (len (), sumBy snd))
input |> toParallelStream |> run spec
// Result is Dictionary<'key, (int*int)>

Fig. 5. A comparison between F#'s groupBy and our push-based grouping operator "by".

By using our combinators we create a specification which is passed to a "run" function. The run function takes a parallel stream, which can be split into chunks. The simplest strategy we used was to split the input file into a number of chunks equal to the number of CPUs, run the computations for each chunk and aggregate the results. The run function applies the specification to each chunk to produce intermediate results; when all intermediate results are computed, the run function aggregates them. For computing groups and the results within the groups we utilized a more complex strategy: an object-oriented implementation which allows chaining transformers and consumers. Our implementation avoids storing intermediate results. To produce multiple results from the same stream we use the split combinator. We can handle nested groups, and we support a few more combinators in addition to "by", "len", "sum" and "split": collect (for emitting multiple values), map (for transformation of values), values (for collecting intermediate results), distinct (for obtaining distinct values) and topBy (for obtaining top values by a criterion). Essentially, our specifications allow for composable and more general "MapReduce"-style programs.

An important point is that even with a push-based stream implementation one can write programs that run out of memory. One option for us would have been to extend our combinators to handle out-of-memory conditions, but we found that multiple disk writes are detrimental to the speed-ups that can be achieved on a multicore desktop. Therefore we took care to avoid materialization of large intermediate results by decomposing a program into multiple programs.
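One way such push-based consumers could be realized is sketched below; this is our guess at the shape of such a library, not the paper's implementation (which is object-oriented and builds specifications that run instantiates). A factory function is passed to the grouping combinator so that each group receives fresh accumulators.

open System.Collections.Generic

// A push-based consumer: items are pushed one at a time and only a
// small accumulator is kept, never the items themselves.
type Consumer<'item, 'res> =
    abstract Push   : 'item -> unit
    abstract Result : unit -> 'res

let len () =
    let n = ref 0
    { new Consumer<'item, int> with
        member _.Push _ = n.Value <- n.Value + 1
        member _.Result () = n.Value }

let sumBy (f: 'item -> int) =
    let s = ref 0
    { new Consumer<'item, int> with
        member _.Push x = s.Value <- s.Value + f x
        member _.Result () = s.Value }

// Feed one stream into two consumers at once.
let split2 (a: Consumer<'item, 'r1>) (b: Consumer<'item, 'r2>) =
    { new Consumer<'item, 'r1 * 'r2> with
        member _.Push x = a.Push x; b.Push x
        member _.Result () = (a.Result (), b.Result ()) }

// Group by key; a fresh consumer is created per group on demand.
let by (key: 'item -> 'k) (mkGroup: unit -> Consumer<'item, 'r>) =
    let groups = Dictionary<'k, Consumer<'item, 'r>> ()
    { new Consumer<'item, Dictionary<'k, 'r>> with
        member _.Push x =
            let k = key x
            let g =
                match groups.TryGetValue k with
                | true, g -> g
                | _ ->
                    let g = mkGroup ()
                    groups.[k] <- g
                    g
            g.Push x
        member _.Result () =
            let results = Dictionary<'k, 'r> ()
            for kv in groups do results.[kv.Key] <- kv.Value.Result ()
            results }

// Analogue of Figure 5's specification:
//   by fst (fun () -> split2 (len ()) (sumBy snd))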

The following example shows a case we encountered. The input data is in the form of a stream of queries, each query pointing to a list of urls and click counts:

seq { (query1, [(url11, clickCount11); (url12, clickCount12); ...]);
      (query2, [(url21, clickCount21); ...]) }

We would like to: a) extract features from each url; b) compute the top url features by the number of queries generating the feature; and c) compute two views for the top features, a query view and a url view. The features we have in mind are strings like "city=?" and are useful because they allow for extraction of values from a category. An alternative feature is the website extracted from the url; therefore this program is useful for organizing queries by website as well.

The solution to this task is given in Figure 6. Instead of writing the program as a single expression, we split the program into two phases: 1) extraction of the top features; 2) given the top features, computation of both the query and url views simultaneously (using the split combinator). In this way we read the input data twice but do not write to disk. An alternative that is shorter to write, but slower to execute because of disk writes, would be to compute the results for each feature and its "score" in the same pass and then select the best features. This example illustrates the need to split a program into multiple programs to avoid materialization of unnecessary results.

let emitFeatures cont =
    collect (fun (query, urlAndClicks) ->
        urlAndClicks |> Seq.collect (fun (url, clickCount) ->
            url |> featurize |> Seq.map (fun feature ->
                (feature, (query, url, clickCount))))) cont

// Phase 1: top features by the number of queries generating them
let features = parallelStream |> run (emitFeatures (by (fst, len ())))
let topFeatures =
    features |> Dict.toSeq |> Seq.topBy 50 (snd >> desc) |> Dict.fromSeq

// Accessors for the (query, url, clickCount) tuples
let query (query, _, _) = query
let url (_, url, _) = url
let clickCount (_, _, clickCount) = clickCount

// Phase 2: for the top features, compute both views in one pass
let spec =
    emitFeatures                                   // generate features and query-url data
        (filter (fst >> Dict.hasKey topFeatures)   // keep only the top features
            (by (fst,                              // group by feature
                map snd                            // take the (query, url, clickCount) data
                    (split2 (by (query, sumBy clickCount),    // compute the query view
                             by (url, sumBy clickCount))))))  // compute the url view

parallelStream |> run spec

Fig. 6. A realistic example of the use of grouping and stream splitting combinators which operate in parallel on a stream read from disk.

While we investigated possible solutions to the simultaneous processing of split streams, we observed two styles for processing sequences: "pull-style", corresponding to F# sequences and lazy evaluation, and "push-style", corresponding to message passing and reactive programming with events. "Pull-style" can easily handle merging or joining of sequences, while it fails if a sequence needs to be split.

"Push-style" fails on merging two sorted sequences. The duality between the two styles has also been observed for event processing in user interfaces [1]. In addition to the combinators described above, we implemented parallel filters, histograms, transposition of a sparse matrix of (query, url) pairs in external memory, and random sampling.

4.6 Experience with Tool Building

F# was also very useful for developing the user interface of the application. We used the C# user interface designer but implemented the behaviors for the graphical elements in F#. Since both C# and F# share a common representation of types, no foreign-function interface is necessary. This is in contrast to other functional languages such as OCaml or Haskell, where C is the foreign-function language; when interfacing with C, polymorphism cannot be easily achieved without explicitly passing a dictionary of conversion functions. Thus, compared to other functional languages, F# has the advantage of running on the same virtual machine as a lower-level language: C# can be used as a lower-level imperative language when needed.

The basic design that we followed when connecting behaviors to the user interface was to use closures as callbacks. The benefit for us was that parameters such as the datasets used in a user operation are automatically captured. Our application is parameterization rich, i.e. the user can input parameters from multiple graphical elements. To translate the user input to executable code, we mapped the input from each selected user-interface element to a higher-order function. We composed all selections to obtain a function that is passed as a parameter to an operation over a dataset. This programming pattern may cause resource leaks, since references to datasets are kept in closures. To solve this problem, any callback that we create returns an IDisposable object representing the assigned closure. We gather all objects corresponding to a view, such as a tab, and assign them to a field in the tab. Thus, when a tab is closed we release the captured resources deterministically.
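A minimal sketch of ours of this callback pattern, assuming a Windows Forms Button (the application's actual wiring is richer): the closure captures the dataset, and disposing the returned object unhooks the handler so the captured reference can be reclaimed.

open System
open System.Windows.Forms

// Wire a button to an operation over a dataset. Disposing the result
// removes the handler and releases the captured dataset reference.
let connect (button: Button) (dataset: 'a) (operation: 'a -> unit) : IDisposable =
    let handler = EventHandler (fun _ _ -> operation dataset)
    button.Click.AddHandler handler
    { new IDisposable with
        member _.Dispose () = button.Click.RemoveHandler handler }

// A view (e.g. a tab) gathers its subscriptions and disposes them
// deterministically when it is closed.
type AnalysisTab () =
    let subscriptions = ResizeArray<IDisposable> ()
    member _.Register (d: IDisposable) = subscriptions.Add d
    member _.Close () =
        for d in subscriptions do d.Dispose ()
        subscriptions.Clear ()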

4.7 Efficiency

Query mining can be quite CPU and I/O intensive, depending on the task. During the early stages of development efficiency is not a requirement, but it may become one later on. In a few cases we had to modify our code to gain efficiency but lost some readability.

Of the CPU intensive operations, string processing and especially string sorting are the most notorious. String sorting using the generic .Net sorting functions was unacceptably slow, because .Net does not implement a specialized string sorting algorithm as in [3]. An investigation of the standard library implementations of sorting in Java, Haskell and OCaml revealed a lack of efficient implementations of string sorting in those libraries as well. Our implementation of string sorting was based on the reference C implementation from [3]; it was easier to translate this implementation to C# than to F#.

We also hit a performance issue by using comparisons of tuples. Those issues were due to F#'s new powerful structural equality and comparison constructs: boxing and unboxing penalties are usually incurred in those cases, i.e. when hashing or sorting tuples of objects we found that those constructs did not perform well. In such circumstances it is best practice to switch from unnamed tuples to named types as hash or sorting keys.

Another example of inefficiency is transposing a sparse matrix of queries and urls. Instead of using the obvious algorithm with Seq.collect and Seq.groupBy, which works well in main memory, we had to resolve the efficiency issue by manually hashing the strings to integers and grouping by integer ids to avoid string sorting. To make this common task easier, we implemented a function with the following signature:

val withConvertToIds: (('a -> int) -> 'b) -> ('a Ids * 'b)

where

type 'a Ids =
    class
        member idToObject : int -> 'a
        member objectToId : 'a -> int
    end

We can run a computation within this function that remaps the objects to ids; the function returns the actual result together with an object which can perform the reverse map. Except for the issue caused by tuple comparisons, all of those pitfalls would have occurred independently of the programming language: they are either representation or algorithm dependent. For this reason, the possibility for variable mutation and C# interfacing is a strong point of F# in our use cases.

5 Conclusion

We described our experience using the programming language F# for ad-hoc analysis of query logs. Our usage of F# focused on key text analysis tasks and resulted in a graphical application for browsing logs. We focused on algorithm implementation as well as on ad-hoc scripts. In some cases the algorithms could be easily expressed in functional programming style, while in others we had to resort to imperative programming. The F# language allowed us to easily formulate scripts to carry out typical tasks and, we believe, offered productivity gains in implementation. Some of the programming abstractions we used have appeared in various domain-specific languages for the analysis of search engine data. Those abstractions have roots in functional programming, and their use is encouraging because they bring elegant and useful techniques from functional programming into mainstream practice. We believe we are the first to apply those abstractions to log analysis tasks using a language with heritage from the functional programming community. There are not many programming languages which can combine the conciseness of popular scripting languages like Python or Ruby with the speed of C# for typical use cases, and at the same time encourage a functional programming style. In our view F# is a very practical language which proved to be a good match for our usage.

6 Acknowledgements

We are thankful to Don Syme for his useful feedback on our paper.

References

1. Reactive extensions for .Net (Rx).
2. Harold Abelson and Gerald J. Sussman. Structure and Interpretation of Computer Programs. MIT Press, Cambridge, MA, USA, 1996.
3. Jon L. Bentley and Robert Sedgewick. Fast algorithms for sorting and searching strings. In SODA '97: Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, pages 360-369, Philadelphia, PA, USA, 1997. Society for Industrial and Applied Mathematics.
4. Ronnie Chaiken, Bob Jenkins, P.-A. Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. SCOPE: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1(2):1265-1276, 2008.
5. Duncan Coutts, Roman Leshchinskiy, and Don Stewart. Stream fusion: From lists to streams to nothing at all. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2007, April 2007.
6. Pierre Dangauthier, Ralf Herbrich, Tom Minka, and Thore Graepel. TrueSkill through time: Revisiting the history of chess. In NIPS, 2007.
7. Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, pages 10-10, Berkeley, CA, USA, 2004. USENIX Association.
8. J. R. Heard. Beautiful code, compelling evidence: functional programming for information visualization and visual analytics. 2008.
9. John Hughes. Generalising monads to arrows. Science of Computer Programming, 37:67-111, May 2000.
10. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pages 59-72, New York, NY, USA, 2007. ACM.
11. Daan Leijen and Erik Meijer. Parsec: A practical parser library. 2001.
12. Marc Najork, Stephen Robertson, Nick Craswell, Dennis Fetterly, and Emine Yilmaz. Microsoft Research at TREC 2009: Web and relevance feedback tracks. In 18th Text Retrieval Conference (TREC), 2009.
13. Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277-298, 2005.
14. Stefan Savev. A search engine in a few lines: yes, we can! In SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 772-773, New York, NY, USA, 2009. ACM.
15. M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. Technical Report 00-034, University of Minnesota, 2000.
16. M. Steyvers and T. Griffiths. Probabilistic topic models. In Handbook of Latent Semantic Analysis, 2007.
17. Don Syme, Adam Granicz, and Antonio Cisternino. Expert F#. Apress, 2007.
18. Stephan Tolksdorf. FParsec. http://www.quanttec.com/fparsec/.
19. Philip Wadler. A new array operation. In Proc. of a workshop on Graph reduction, pages 328-335, London, UK, 1987. Springer-Verlag.
20. Philip Wadler, Walid Taha, and David MacQueen. How to add laziness to a strict language without even being odd. In Workshop on Standard ML, 1998.
21. Shi Zhong and Joydeep Ghosh. Generative model-based document clustering: a comparative study. Knowledge and Information Systems, 8(3):374-384, 2005.
