Introducing Cascalog - A Clojure-Based Query Language For Hadoop - Thoughts From The Red Planet - Thoughts From The Red Planet

Introducing Cascalog: a Clojure-based query language for Hadoop - tho...
http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query...
thoughts from the red planet

Copyright 2012, Nathan Marz. All rights reserved.
Introducing Cascalog: a Clojure-based query language for Hadoop

WEDNESDAY, APRIL 14, 2010 CASCADING, CASCALOG, CLOJURE, HADOOP
I'm very excited to be releasing Cascalog as open-source today. Cascalog is a Clojure-based query language for Hadoop inspired by Datalog.
Highlights
Simple - Functions, filters, and aggregators all use the same syntax. Joins are implicit and natural.
Expressive - Logical composition is very powerful, and you can run arbitrary Clojure code in your query with little effort.
Interactive - Run queries from the Clojure REPL.
Scalable - Cascalog queries run as a series of MapReduce jobs.
Query anything - Query HDFS data, database data, and/or local data by making use of Cascading's "Tap" abstraction.
Careful handling of null values - Null values can make life difficult. Cascalog has a feature called "non-nullable variables" that makes dealing with nulls painless.
First class interoperability with Cascading - Operations defined for Cascalog can be used in a Cascading flow and vice-versa.
First class interoperability with Clojure - Can use regular Clojure functions as operations or filters, and since Cascalog is a Clojure DSL, you can use it in other Clojure code.
OK, let's jump into Cascalog and see what it's all about! I'm going to walk us through Cascalog with a series of examples. These examples all make use of the "playground" that comes with the project. I recommend that you download Cascalog and follow along in your REPL (only takes a few minutes to get up and running - instructions are in the README).
Basic queries
First, let's start the REPL and load the playground:
lein repl
user=> (use 'cascalog.playground) (bootstrap)
This will import everything we need to run the examples. You can view the datasets we're going to be querying by looking at the playground.clj file. Let's run our first query and find the people in our dataset who are 25 years old:
user=> (?<- (stdout) [?person] (age ?person 25))
This query can be read as "Find all ?person for which ?person has an age that is equal to 25". You'll see logging from Hadoop as the job runs and after a few seconds the results of the query will print. OK, let's try something more involved. Let's do a range query and find all the people in our dataset who are younger than 30:
user=> (?<- (stdout) [?person] (age ?person ?age) (< ?age 30))
That's pretty simple too. This time we bound the age of the person to the variable ?age and then added the constraint that ?age is less than 30. Let's run that query again but this time include the ages of the people in the results:
user=> (?<- (stdout) [?person ?age] (age ?person ?age) (< ?age 30))
All we had to do was add the ?age variable into the vector within the query. Let's do another query and find all the male people that Emily follows:
user=> (?<- (stdout) [?person] (follows "emily" ?person) (gender ?person "m"))
You may not have noticed, but there's actually a join happening in this query. The value of ?person must be the same wherever it is used, and since "follows" and "gender" are separate sources of data, Cascalog will use a join to resolve the query.
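To make the implicit join concrete, here is a rough sketch in plain Clojure (not Cascalog) of what the query above computes. The follows and gender sets are made-up stand-ins for the playground data, and clojure.set/join only illustrates the idea; it says nothing about how Cascalog actually plans the join on Hadoop.

```clojure
(require '[clojure.set :as set])

;; Hypothetical stand-ins for the playground's "follows" and "gender" datasets.
(def follows #{{:follower "emily" :person "bob"}
               {:follower "emily" :person "alice"}
               {:follower "bob"   :person "gary"}})

(def gender #{{:person "bob"   :gender "m"}
              {:person "alice" :gender "f"}
              {:person "gary"  :gender "m"}})

;; clojure.set/join is a natural join on the shared :person key, mirroring
;; how a shared ?person variable forces both sources to agree on its value.
(def males-emily-follows
  (->> (set/join follows gender)
       (filter #(and (= "emily" (:follower %))
                     (= "m" (:gender %))))
       (map :person)
       set))
```

With this sample data, only bob is both followed by emily and male, so he is the only result.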
Structure of a query
Let's look at the structure of a query in more detail. Let's deconstruct the following query:
user=> (?<- (stdout) [?person ?a2] (age ?person ?age) (< ?age 30) (* 2 ?age :> ?a2))
The query operator we've been using is ?<-, which both defines and runs a query. ?<- wraps around <-, the query creation operator, and ?-, the query execution operator. We'll see how to use those later on to create more complex queries.
First, we tell the query where we want to emit the results. In this case, we say "(stdout)". "(stdout)" creates a Cascading tap which writes its contents to standard output after the query finishes. Any Cascading tap can be used for the output. This means you can output data in any file format you want (e.g. sequence files, text format, etc.) and anywhere you want (locally, HDFS, database, etc.).
After we define our sink, we define the result variables of the query in a Clojure vector. In this case, we are interested in the variables ?person and ?a2.
Next, we specify one or more "predicates" that define and constrain the result variables. There are three categories of predicates:
1. Generators: A generator is a source of data. Two kinds:
   Cascading Tap - for example, the data on HDFS at a certain path
   An existing query defined using <-
2. Operations: Implicit relations that take in input variables defined elsewhere and either act as a function that binds new variables or a filter
3. Aggregators: Count, sum, min, max, etc.
A predicate has a name, a list of input variables, and a list of output variables. The predicates in our query above are:
(age ?person ?age)
(< ?age 30)
(* 2 ?age :> ?a2)
The :> keyword is used to separate input variables from output variables. If no :> keyword is specified, the variables are considered input variables for operations and output variables for generators and aggregators.
The "age" predicate refers to a tap defined in playground.clj, so it's a generator. That means that the "age" predicate emits variables "?person" and "?age".
The "<" predicate is a Clojure function. Since we didn't specify any output variables, the predicate will act as a filter and keep only the records where ?age is less than 30. If we had instead specified:
(< ?age 30 :> ?young)
then "<" would act as a function and bind a new variable ?young, a boolean representing whether the person's age is less than 30.
The ordering of predicates doesn't matter. Cascalog is purely declarative.
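The difference between "<" as a filter and "<" as a function can be sketched in plain Clojure. This is only a model of the semantics described above, not Cascalog code, and the age-tuples data is invented for illustration.

```clojure
;; Hypothetical (person, age) tuples standing in for the "age" generator.
(def age-tuples [["alice" 28] ["bob" 33] ["chris" 40] ["david" 25]])

;; (< ?age 30) with no output variable: the predicate filters tuples,
;; keeping only those for which it returns true.
(def as-filter
  (filter (fn [[_ age]] (< age 30)) age-tuples))

;; (< ?age 30 :> ?young): no tuple is dropped; the boolean result is
;; bound to a new field on every tuple instead.
(def as-function
  (map (fn [[person age]] [person age (< age 30)]) age-tuples))
```

The filter version shrinks the set of tuples, while the function version keeps every tuple and widens it by one field.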
Variables and constant substitution

Variables are symbols that begin with either ? or !. Sometimes you don't care about the value of an output variable and can use the symbol "_" to ignore the variable. Anything else will be evaluated and inserted as a constant within the query. This feature is called "constant substitution" and we've already been making heavy use of it so far. Using a constant as an output variable acts as a filter on the results of the function. For example:
(* 4 ?v2 :> 100)
There are two constants being used here: 4 and 100. 4 substitutes for an input variable, while 100 acts as a filter only keeping the values of ?v2 that equal 100 when multiplied by 4. Strings, numbers, other primitives, and any objects that have Hadoop serializers registered can be used as constants. Let's get back to the examples. Let's find all follow relationships where someone is following a younger person:
user=> (?<- (stdout) [?person1 ?person2] (age ?person1 ?age1) (follows ?person1 ?person2) (age ?person2 ?age2) (< ?age2 ?age1))
Let's do that query again and emit the age difference as well:
user=> (?<- (stdout) [?person1 ?person2 ?delta] (age ?person1 ?age1) (follows ?person1 ?person2) (age ?person2 ?age2) (- ?age2 ?age1 :> ?delta) (< ?delta 0))
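As a rough illustration of what the age-difference query computes, here is the same logic in plain Clojure rather than Cascalog. The age map and follows pairs are made-up sample data, and the comments only point at the predicate each step is meant to mirror.

```clojure
;; Hypothetical sample data: person -> age, and [follower followed] pairs.
(def age {"alice" 28 "bob" 33 "emily" 25 "gary" 28})
(def follows [["alice" "bob"] ["bob" "emily"] ["emily" "gary"] ["gary" "alice"]])

(def younger-follows
  (for [[p1 p2] follows
        :let  [delta (- (age p2) (age p1))] ; mirrors (- ?age2 ?age1 :> ?delta)
        :when (< delta 0)]                  ; mirrors (< ?delta 0)
    [p1 p2 delta]))
```

Only bob following emily survives here, with a delta of -8; gary following alice has a delta of 0 and is dropped because the comparison is strict.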
Aggregators
Now let's check out our first aggregator. Let's find the number of people less than 30 years old:
user=> (?<- (stdout) [?count] (age _ ?a) (< ?a 30) (c/count ?count))
This computes a single value about all of our records. We can also aggregate over partitions of records. For example, let's find the number of people each person follows:
user=> (?<- (stdout) [?person ?count] (follows ?person _) (c/count ?count))
Since we declared ?person as a result variable of the query, Cascalog will partition the records by ?person and apply the c/count aggregator within each partition. You can use multiple aggregators within a single query. They will run on the exact same partitions of records. For example, let's get the average age of people living in a country by combining a count and a sum:
user=> (?<- (stdout) [?country ?avg] (location ?person ?country _ _) (age ?person ?age) (c/count ?count) (c/sum ?age :> ?sum) (div ?sum ?count :> ?avg))
Notice that we applied the "div" operation to the results of the aggregators for our final result. Any operations that are dependent on aggregator output variables will execute after the aggregators run.
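The partition-then-aggregate behavior can be modeled in plain Clojure. This sketch is not Cascalog: the people tuples are invented (and assumed to be already joined), and "/" stands in for the playground's div operation.

```clojure
;; Hypothetical (person, country, age) tuples, already joined.
(def people [["alice" "usa" 28] ["bob" "usa" 32] ["chris" "uk" 40]])

(def avg-age-by-country
  (into {}
        (for [[country rows] (group-by second people)      ; partition by ?country
              :let [cnt (count rows)                       ; like (c/count ?count)
                    sum (reduce + (map #(nth % 2) rows))]] ; like (c/sum ?age :> ?sum)
          ;; The division runs after both aggregates exist for the partition.
          [country (/ sum cnt)])))
```

Both aggregates see exactly the same partition of rows, and the final division is applied once per partition rather than once per input tuple.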
Custom operations
Next, let's write a query to count the number of times each word appears in a set of sentences. To do this, we are going to define a custom operation to use within the query:
user=> (defmapcatop split [sentence] (seq (.split sentence "\\s+")))
user=> (?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))
"defmapcatop split" defines an operation that takes a single field "sentence" as input and outputs 0 or more tuples as output. deffilterop defines filter operations that return a boolean indicating whether
or not to filter a tuple. defmapop defines functions that return a single tuple. defaggregateop defines an aggregator. These operations can also be used directly with Cascalog's workflow API - but that's for another blog post.
Our word count query has a problem: the same word will be counted separately if it appears with different combinations of uppercase and lowercase letters. We can fix our query as follows:
user=> (defn lowercase [w] (.toLowerCase w))
user=> (?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word1) (lowercase ?word1 :> ?word) (c/count ?count))
As you can see, regular Clojure functions can also be used as operations. A Clojure function is treated as a filter when not given any output variables. When given output variables, it is a map operation. Operations that emit 0 or more tuples must be defined using defmapcatop. Here's a query that will return counts of people bucketed by age group and gender:
user=> (defn agebucket [age] (find-first (partial <= age) [17 25 35 45 55 65 ...
user=> (?<- (stdout) [?bucket ?gender ?count] (age ?person ?age) (gender ?person ?gender) (agebucket ?age :> ?bucket) (c/count ?count))
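For readers who want to see the "0 or more tuples" idea outside of Hadoop, the word-count queries from earlier in this section correspond roughly to the following plain Clojure pipeline. This is a sketch, not Cascalog: the sentences vector is made-up data and split-words is a hypothetical name for an ordinary function with the same body as the split operation.

```clojure
(require '[clojure.string :as str])

;; Hypothetical input standing in for the playground's "sentence" dataset.
(def sentences ["Hello world" "hello Hadoop world"])

;; Same body as the split operation above, as an ordinary function.
(defn split-words [sentence]
  (seq (.split sentence "\\s+")))

(def word-counts
  (->> sentences
       (mapcat split-words)  ; each sentence yields 0 or more words
       (map str/lower-case)  ; one output per input, like a map operation
       frequencies))         ; group by word and count, like c/count
```

mapcat is what lets one input produce several outputs, which is why an operation like split cannot be expressed as a plain one-in, one-out function.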
Non-nullable variables
Cascalog has a feature called "non-nullable variables" that allows you to handle null values gracefully. We've actually been using non-nullable variables this whole time. Variables prefixed with a "?" are non-nullable variables, and variables prefixed with a "!" are nullable variables. Cascalog inserts null checks to filter out any records in which a non-nullable variable is bound to null. To see the effect of non-nullable variables, let's compare the following two queries:
user=> (?<- (stdout) [?person ?city] (location ?person ...
user=> (?<- (stdout) [?person !city] (location ?person ...
The second query includes some null values in the result set.
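The effect of the inserted null checks can be modeled in plain Clojure. This is only an illustration with invented data, not how Cascalog implements the check.

```clojure
;; Hypothetical (person, city) tuples; nil marks a missing city.
(def locations [["alice" "san francisco"] ["bob" nil] ["chris" "london"]])

;; ?city (non-nullable): tuples where the variable would be bound to nil
;; are dropped.
(def with-non-nullable
  (remove (fn [[_ city]] (nil? city)) locations))

;; !city (nullable): every tuple is kept, nils included.
(def with-nullable locations)
```

In this sample bob has no city, so he appears only in the nullable result.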
Subqueries
Finally, let's look at some more complex queries that make use of subqueries. Let's determine all the follow relationships in which both people follow more than 2 people:
user=> (let [many-follows (<- [?person] (follows ?perso...
         (c/count ?c) (> ?c 2))]
         (?<- (stdout) [?person1 ?person2] (many-follows ...
         (many-follows ?person2) (follows ?person1 ?per...
Here, we use a let form to define a subquery "many-follows". The subquery is defined using <-, the query definition operator. We can then make use of many-follows within the query we execute in the body of the let form. We can also run queries that have multiple outputs. If we also want the result of many-follows in the query above, we can write:
user=> (let [many-follows (<- [?person] (follows ?perso...
         (c/count ?c) (> ?c 2))
         active-follows (<- [?p1 ?p2] (many-follows ?p1)
         (many-follows ?p2) (follows ?p1 ...
         (?- (stdout) many-follows (stdout) active-follows))
Here we define both of our queries without executing them. We then use the query execution operator ?- to bind each query to a tap. ?- executes both queries in tandem.
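The shape of these two queries, one feeding the other, can be sketched in plain Clojure. This is not Cascalog and does not show how the queries are executed together; the follows pairs are made-up sample data, and the names simply reuse those from the queries above.

```clojure
;; Hypothetical [follower followed] pairs.
(def follows [["alice" "bob"] ["alice" "chris"] ["alice" "david"]
              ["bob" "alice"] ["bob" "chris"] ["bob" "david"]
              ["chris" "alice"]])

;; many-follows: the people who follow more than 2 people.
(def many-follows
  (->> follows
       (map first)                    ; the follower in each pair
       frequencies                    ; follower -> number of people followed
       (filter (fn [[_ c]] (> c 2)))
       (map first)
       set))

;; active-follows: follow relationships where both sides are in many-follows.
(def active-follows
  (filter (fn [[p1 p2]] (and (many-follows p1) (many-follows p2)))
          follows))
```

Because many-follows is an ordinary value, it can be used twice in the second definition, just as the subquery is used twice in the outer query.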
Conclusion
Cascalog is being actively improved. You can expect more features to allow for richer queries and query planner improvements to be added over time. I'd love to hear your feedback on Cascalog. If you have any comments, questions, or concerns please leave a comment below, contact me on Twitter, send me an email at nathan.marz@gmail.com, or chat with me in the #cascading room on freenode. See the next article to learn about more Cascalog features such as outer joins, sorting, and combiners. You should follow me on Twitter here.
20 Comments
Cool stuff. I'd be curious to hear about how you evaluate it. How do you implement recursion? I'm guessing you only allow stratified queries?
No recursion or negation yet. The Cascading flow is constructed bottom up. Recursion doesn't make quite as much sense in a MapReduce setting, b/c the dominant use case is to run queries over the entire dataset such as "give me all people who are male and younger than 30 years old". This is as opposed to "Give me all of Sally's ancestors" - a recursive query about one particular entity in the system. A recursive query over a full dataset would require a dramatic amount of space for the output. That said, the possibility to support recursion in the future is there if I find there are good use cases for it. Negation is something I want to implement at some point. Queries like "give me all people who are interested in basketball and not interested in football" are interesting.
Negation is just "count(...) == 0", so it doesn't seem that adding it would be too hard if you've already got aggregation right. Not sure I buy that recursion is uninteresting just because you're evaluating the language using MapReduce, but I can see how it might not be relevant for your use cases. It would certainly vastly expand the set of things you can compute with the language, though.
Negation is a bit more complex than "count(...) == 0". "count" counts the number of matching tuples - so if no tuples match a group, there's nothing to count. This means that every tuple emitted via a count will have count >= 1. A negation basically requires an outer join followed by a special aggregator to determine what tuples to keep.
mixing lisp parentheses and datalog ?variables sure makes for an impressive syntax.
Beautiful, elegant syntax--well done. You should know it was all I could do to avoid bursting out laughing at the :> operator. :)
Great timing, I was just starting to look into Hadoop and this library is just awesome! Looking forward to your next article.
Can you show an example of a few joins?
Here's a query to get everyone's age and gender, it joins together the "age" and "gender" datasets:
(?<- (stdout) [?p ?a ?g] (age ?p ?a) (gender ?p ?g))
Since the variable "?p" is used in both the age and gender predicates, Cascalog uses a join to resolve the query. Joins are always caused by using the same variable name across multiple predicates. The next post, http://nathanmarz.com/blog/new..., shows examples of mixed and outer joins.
It would be nice if more people would start to consider non-ASCII character sets in language design, especially for DSLs
Hi Nathan, First of all I would like to thank you for your effort to bring such cool, promising tooling. Our company is a traditional "fat elephant", paying huge amounts of money to the big-name brands for "proven" solutions. You know this story. So my team and I are closely interested in the alternatives, especially Hadoop and the Bigtable/Dynamo environment. I also liked the Cascalog approach. I have a quick question if you have time to look: We have a huge database (around 4 billion records, 30 TB) storing video watch information, i.e. view count, comments, favorites, etc. I want to produce a daily report of view counts for all videos. That means I need to look at 2 days, today and yesterday, and subtract yesterday's view count from today's view count to find the daily impressions. Our fat DB team does this with a few complex queries. I would like to ask you: is this possible with Cascalog or Hive/Pig? I want to give a demonstration to the management to show the dynamics of this new NoSQL and open-source world. Thanks
Hadoop is the perfect tool for doing the kinds of analytics you described. In fact, you can do much, much more complex things with Hadoop than the count metrics you described. The main tradeoff is that computations on Hadoop will have high latency (i.e., not real-time), but doing something like producing daily reports is right up Hadoop's alley.
Hi Nathan, Thank you for your reply. I am new to Hadoop and am spreading it in my company. I like Cascalog's simplicity. Can you give me a quick code example that calculates daily views from the DB, so I can show the guys here? There is one table with fields like:
date_of_stat (the date we update total views)
view_count (like YouTube, it's the total views)
video_id
From this table I want to calculate DAILY_views, meaning today's view count minus yesterday's view count. I couldn't figure out how it could be written in Cascalog. Thanks.
You wouldn't store data in that form in Hadoop. More likely, you would store each page view with its timestamp as a separate record. Then you would compute daily stats by running a simple MapReduce job to compute page view counts by day. A Cascalog query to do this might look like: (<- [?day ?count] (pageview ?video-id ?timestamp) (to-day ?timestamp :> ?day) (c/count ?count)) You would then update the counts in a random-access database with the results of this computation. You generally want to store unaggregated, raw data on Hadoop since you're not limited by computations you can do on the data. This will give you great flexibility with what you can do with that data. For example, by storing individual views as individual records, you could then compute view counts by hour, by video category, etc.
Hi Nathan, This is the DB dump of the current system, so we have only video_id, its total view count, and the stats_taken_date. We need to extract the daily view count from this. It is the DB export. :) P.S. What would be the best dev IDE for Clojure and Cascalog? I saw you are using Aquamacs and I will give it a try. I am mostly a NetBeans guy :)
I see. Well Cascalog can certainly do that query as well. Send an email to the cascalog-user Google group if you're having trouble formulating the query. I use GNU Emacs with a bunch of Clojure extensions. Check out this article for info on how to set it up: http://technomancy.us/126
Thanks Nathan, I wrote the request in the cascalog-user group. By the way, the Emacs link is very useful :) .. Emacs is like an old-fashioned girl.. needs much attention but she is a true lover :)
I replied :) http://groups.google.com/group...
Having trouble getting your example to run (the word count with lowercasing). The error I'm getting is:
user=> (?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (lowercase ?word :> ?word1) (c/count ?count))
AssertionError Assert failed: Vanilla functions must have vars associated with them.
opvar cascalog.predicate/eval1062/fn--1063 (predicate.clj:348)
user=>
Using [cascalog "1.8.7"]
This was a regression in 1.8.7. If you downgrade to 1.8.6 your code will work fine. Here's the relevant issue on Github: https://github.com/nathanmarz/...