
1. INTRODUCTION

Joins are one of the most important operations performed by a relational database system. An RDBMS uses joins to match rows from one table with rows from another table. A join combines records from two or more tables by using values common to each table, and the resulting set can be saved as a table. The most common scenario is that a primary key in one table matches a foreign key in a second table. The SQL JOIN clause lets a single SQL statement retrieve data from two or more tables, based on a relationship between certain columns in those tables.

For example, we can use joins to match sales with customers or books with authors. Without joins, we might have a list of sales and customers or books and authors, but we would have no way to determine which customers bought which items or which authors wrote which books.

We can join two tables explicitly by writing a query that lists both tables in the FROM clause. We can also join two tables by using a variety of different sub-queries. Finally, SQL Server may introduce joins for a variety of purposes into a query plan during optimization.

2. NEED OF JOINS

Curt Monash has a reasonable post where he points out that we need joins because we normalize. Furthermore, he offers reasons for normalization:

To simplify the programming of updates. Simply put, if the string "Montreal" appears once in your database and the city changes its name, the update is trivial. This applies mostly when you have complex schemas.

For faster updates. Updating a single entry in a database is much faster than searching for and updating every occurrence of the value "Montreal". This applies mostly when you have large update volumes.

However, the case against joins is even stronger than Curt suggests:

Normalization is good if you have to maintain a complex schema. But how complex would your schema be if you stopped over-normalizing your data? I have seen university databases made of hundreds of tables. The average query is well over 256 characters and involves dozens of joins. It is simply impossible to make sense of the content of any one table. Building new applications on top of this mess is expensive and bug-prone. Complexity is bad for your health.

Database engines can physically normalize the data automagically. Indeed, many database compression techniques are types of normalization. Have you ever noticed how sluggish your enterprise database is? Complex schemas rarely scale well, no matter what your database textbook says.

The dogma of normalization too often leads to over-engineering. We are so afraid that a programming error could leave the database in a wrongful state that we invest massively in inflexible schemas. In turn, this over-engineering comes back to haunt us when we need to be more agile, or to scale out.

Example: Suppose you want to design a database of research papers. Let us simplify the problem by omitting the paper identifiers, the dates, and so on. Let us also assume that there is only one author per paper. Maybe your main table looks like this:

authorID    author name        publisher   title
----------- ------------------ ----------- ------------------------------------------
smith01     John Smith         Springer    Databases are bad
lampron01   Nathalie Lampron   IEEE        The other guy is wrong, databases are good

Being helpful, your friendly database expert points out that your database schema is not even in the second normal form. Clearly, you are an amateur. Being helpful, he creates a secondary table which maps the authorID field to an author name. And voilà! You have saved storage, and you won't ever get someone's name wrong. Updates to someone's name will be much faster in the future.

But wait?!? What if Nathalie gets married and changes her name? Indeed, people have their names changed all the time. Yet, we never retroactively change the names of the authors on a paper. Maybe you never thought about it, but many ladies hold two or more names in their lifetime. Did the bunch of guys in IT know about this? (As an aside, are the digital librarians worried at all about researchers changing their names and seeing their publication lists cut in half? Yes: See update below.)

My point is that normalization effectively enforces dependencies decided upon when you created the schema. These envisioned dependencies break down all the time. Life is complicated. I could come up with hundreds of examples. Strict normalization makes as much sense as the waterfall model.

What about the physical layer? Because normalization has removed entire fields from the main table, you might think that normalization will save storage! That may well be true on the database engine you are using. However, other database engines will automatically detect the dependencies and compress the data accordingly. In this case, it is trivial to discover that there is a bijective (1-to-1) mapping between author ID and author name. And if the bijectivity breaks down, the database engine will simply have to work a bit harder to compress the data.

3. TYPES OF JOINS

Inner join
Outer join
Cross join
Cross apply
Semi-join
Anti-semi-join

Here is a simple schema and data set that I will use to illustrate each join type:

create table Customers (Cust_Id int, Cust_Name varchar(10))
insert Customers values (1, 'Craig')
insert Customers values (2, 'John Doe')
insert Customers values (3, 'Jane Doe')

create table Sales (Cust_Id int, Item varchar(10))
insert Sales values (2, 'Camera')
insert Sales values (3, 'Computer')
insert Sales values (3, 'Monitor')
insert Sales values (4, 'Printer')

3.1 Inner joins

Inner joins are the most common join type. An inner join simply looks for two rows that, put together, satisfy a join predicate. For example, this query uses the join predicate S.Cust_Id = C.Cust_Id to find all Sales and Customer rows with the same Cust_Id:

select *
from Sales S inner join Customers C
on S.Cust_Id = C.Cust_Id

Cust_Id     Item       Cust_Id     Cust_Name
----------- ---------- ----------- ----------
2           Camera     2           John Doe
3           Computer   3           Jane Doe
3           Monitor    3           Jane Doe

Notes:

Cust_Id 3 bought two items, so this customer row appears twice in the result.
Cust_Id 1 did not purchase anything and so does not appear in the result.
We sold a Printer to Cust_Id 4. There is no such customer, so this sale does not appear in the result.
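The inner join above can be tried without a SQL Server instance. The following is a minimal sketch that assumes SQLite (through Python's standard sqlite3 module) is an acceptable stand-in for this simple query; the helper name make_db is hypothetical, and the insert syntax is adapted to SQLite. It also checks that swapping the two inputs of the inner join yields the same rows.

```python
import sqlite3


def make_db():
    """Hypothetical helper: load the sample Customers and Sales data into in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table Customers (Cust_Id int, Cust_Name varchar(10));
        insert into Customers values (1, 'Craig'), (2, 'John Doe'), (3, 'Jane Doe');
        create table Sales (Cust_Id int, Item varchar(10));
        insert into Sales values (2, 'Camera'), (3, 'Computer'), (3, 'Monitor'), (4, 'Printer');
    """)
    return conn


conn = make_db()

# Sales joined to Customers on the shared Cust_Id column.
inner_rows = conn.execute("""
    select S.Cust_Id, S.Item, C.Cust_Name
    from Sales S inner join Customers C on S.Cust_Id = C.Cust_Id
    order by S.Cust_Id, S.Item
""").fetchall()

# Same join with the inputs swapped; the column list is kept identical
# so the two results can be compared directly.
swapped_rows = conn.execute("""
    select S.Cust_Id, S.Item, C.Cust_Name
    from Customers C inner join Sales S on S.Cust_Id = C.Cust_Id
    order by S.Cust_Id, S.Item
""").fetchall()

for row in inner_rows:
    print(row)
print("same rows after swapping inputs:", inner_rows == swapped_rows)
```

The explicit column list replaces select * only so that the comparison does not depend on column order.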

Inner joins are fully commutative. A inner join B and B inner join A are equivalent.

3.2 Outer joins

Suppose that we would like to see a list of all sales, even those that do not have a matching customer. We can write this query using an outer join. An outer join preserves all rows in one or both of the input tables even if we cannot find a matching row per the join predicate. For example:

select *
from Sales S left outer join Customers C
on S.Cust_Id = C.Cust_Id

Cust_Id     Item       Cust_Id     Cust_Name
----------- ---------- ----------- ----------
2           Camera     2           John Doe
3           Computer   3           Jane Doe
3           Monitor    3           Jane Doe
4           Printer    NULL        NULL

Note that the server returns NULLs for the customer data associated with the Printer sale since there is no matching customer. We refer to this row as NULL extended.

Using a full outer join, we can find all customers regardless of whether they purchased anything and all sales regardless of whether they have a valid customer:

select *
from Sales S full outer join Customers C
on S.Cust_Id = C.Cust_Id

Cust_Id     Item       Cust_Id     Cust_Name
----------- ---------- ----------- ----------
2           Camera     2           John Doe
3           Computer   3           Jane Doe
3           Monitor    3           Jane Doe
4           Printer    NULL        NULL
NULL        NULL       1           Craig

The following table shows which rows will be preserved or NULL extended for each outer join variation:

Join                   Preserve
---------------------- ------------------
A left outer join B    all A rows
A right outer join B   all B rows
A full outer join B    all A and B rows
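The outer join queries in this section can be sketched in the same way. This example assumes SQLite through Python's sqlite3 module rather than SQL Server, and the helper name make_db is hypothetical. Because older SQLite builds do not accept full outer join, the full outer join is emulated here as a left outer join plus the unmatched Customers rows; that emulation is a workaround chosen for this sketch, not how SQL Server evaluates the query.

```python
import sqlite3


def make_db():
    """Hypothetical helper: load the sample Customers and Sales data into in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table Customers (Cust_Id int, Cust_Name varchar(10));
        insert into Customers values (1, 'Craig'), (2, 'John Doe'), (3, 'Jane Doe');
        create table Sales (Cust_Id int, Item varchar(10));
        insert into Sales values (2, 'Camera'), (3, 'Computer'), (3, 'Monitor'), (4, 'Printer');
    """)
    return conn


conn = make_db()

# Left outer join: every Sales row is preserved; the Printer sale is NULL extended.
left_rows = conn.execute("""
    select S.Cust_Id, S.Item, C.Cust_Id, C.Cust_Name
    from Sales S left outer join Customers C on S.Cust_Id = C.Cust_Id
    order by S.Cust_Id, S.Item
""").fetchall()

# Emulated full outer join: the left outer join rows, plus each customer
# that has no sale, NULL extended on the Sales side.
full_rows = conn.execute("""
    select S.Cust_Id, S.Item, C.Cust_Id, C.Cust_Name
    from Sales S left outer join Customers C on S.Cust_Id = C.Cust_Id
    union all
    select null, null, C.Cust_Id, C.Cust_Name
    from Customers C
    where not exists (select 1 from Sales S where S.Cust_Id = C.Cust_Id)
""").fetchall()

print("left outer join:")
for row in left_rows:
    print(" ", row)
print("full outer join (emulated):")
for row in full_rows:
    print(" ", row)
```

SQL NULL values come back as Python None, so the NULL extended rows are easy to spot in the printed tuples.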

Full outer joins are commutative. In addition, A left outer join B and B right outer join A are equivalent.

3.3 Cross joins

A cross join performs a full Cartesian product of two tables. That is, it matches every row of one table with every row of another table. You cannot specify a join predicate for a cross join using the ON clause, though you can use a WHERE clause to achieve essentially the same result as an inner join. Cross joins are fairly uncommon. Two large tables should never be cross joined, as this will result in a very expensive operation and a very large result set.

select *
from Sales S cross join Customers C

Cust_Id     Item       Cust_Id     Cust_Name
----------- ---------- ----------- ----------
2           Camera     1           Craig
3           Computer   1           Craig
3           Monitor    1           Craig
4           Printer    1           Craig
2           Camera     2           John Doe
3           Computer   2           John Doe
3           Monitor    2           John Doe
4           Printer    2           John Doe
2           Camera     3           Jane Doe
3           Computer   3           Jane Doe
3           Monitor    3           Jane Doe
4           Printer    3           Jane Doe
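As a rough check of the cross join behaviour described above, the sketch below counts the rows of the Cartesian product and then filters it with a WHERE clause to recover the inner join result. It assumes SQLite through Python's sqlite3 module as a stand-in for SQL Server; the helper name make_db is hypothetical.

```python
import sqlite3


def make_db():
    """Hypothetical helper: load the sample Customers and Sales data into in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table Customers (Cust_Id int, Cust_Name varchar(10));
        insert into Customers values (1, 'Craig'), (2, 'John Doe'), (3, 'Jane Doe');
        create table Sales (Cust_Id int, Item varchar(10));
        insert into Sales values (2, 'Camera'), (3, 'Computer'), (3, 'Monitor'), (4, 'Printer');
    """)
    return conn


conn = make_db()

# Full Cartesian product: every Sales row paired with every Customers row.
cross_rows = conn.execute("""
    select S.Cust_Id, S.Item, C.Cust_Id, C.Cust_Name
    from Sales S cross join Customers C
""").fetchall()

# Cross join restricted by a WHERE clause instead of an ON clause.
filtered_rows = conn.execute("""
    select S.Cust_Id, S.Item, C.Cust_Id, C.Cust_Name
    from Sales S cross join Customers C
    where S.Cust_Id = C.Cust_Id
""").fetchall()

# The equivalent inner join, for comparison.
inner_rows = conn.execute("""
    select S.Cust_Id, S.Item, C.Cust_Id, C.Cust_Name
    from Sales S inner join Customers C on S.Cust_Id = C.Cust_Id
""").fetchall()

print("cross join rows:", len(cross_rows))  # 4 sales x 3 customers
print("filtered cross join equals inner join:",
      sorted(filtered_rows) == sorted(inner_rows))
```

The row count grows as the product of the two table sizes, which is why cross joining two large tables is so expensive.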

3.4 Cross apply

Cross apply was introduced in SQL Server 2005 to enable joins with a table valued function (TVF) where the TVF has a parameter that changes for each execution. For example, the following query returns the same result as the above inner join using a TVF and cross apply:

create function dbo.fn_Sales(@Cust_Id int)
returns @Sales table (Item varchar(10))
as
begin
  insert @Sales select Item from Sales where Cust_Id = @Cust_Id
  return
end

select *
from Customers cross apply dbo.fn_Sales(Cust_Id)

Cust_Id     Cust_Name  Item
----------- ---------- ----------
2           John Doe   Camera
3           Jane Doe   Computer
3           Jane Doe   Monitor

We can also use outer apply to find all Customers regardless of whether they purchased anything. This is similar to an outer join.

select *
from Customers outer apply dbo.fn_Sales(Cust_Id)

Cust_Id     Cust_Name  Item
----------- ---------- ----------
1           Craig      NULL
2           John Doe   Camera
3           Jane Doe   Computer
3           Jane Doe   Monitor
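SQLite has no apply operator, so the following sketch only imitates the semantics in Python: for each Customers row it calls a function that plays the role of the TVF and combines the outer row with whatever the function returns. The names make_db, fn_sales, cross_apply, and outer_apply are hypothetical, and this is an illustration of the row-by-row idea under those assumptions, not of how SQL Server executes apply.

```python
import sqlite3


def make_db():
    """Hypothetical helper: load the sample Customers and Sales data into in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table Customers (Cust_Id int, Cust_Name varchar(10));
        insert into Customers values (1, 'Craig'), (2, 'John Doe'), (3, 'Jane Doe');
        create table Sales (Cust_Id int, Item varchar(10));
        insert into Sales values (2, 'Camera'), (3, 'Computer'), (3, 'Monitor'), (4, 'Printer');
    """)
    return conn


def fn_sales(conn, cust_id):
    """Stand-in for dbo.fn_Sales: the items sold to one customer."""
    return conn.execute(
        "select Item from Sales where Cust_Id = ? order by Item", (cust_id,)
    ).fetchall()


def cross_apply(conn):
    """Keep a customer only when the function returns at least one row."""
    customers = conn.execute(
        "select Cust_Id, Cust_Name from Customers order by Cust_Id").fetchall()
    result = []
    for cust_id, name in customers:
        for (item,) in fn_sales(conn, cust_id):
            result.append((cust_id, name, item))
    return result


def outer_apply(conn):
    """Like cross_apply, but NULL extend customers for which the function returns nothing."""
    customers = conn.execute(
        "select Cust_Id, Cust_Name from Customers order by Cust_Id").fetchall()
    result = []
    for cust_id, name in customers:
        items = fn_sales(conn, cust_id)
        if not items:
            result.append((cust_id, name, None))
        for (item,) in items:
            result.append((cust_id, name, item))
    return result


conn = make_db()
cross_apply_rows = cross_apply(conn)
outer_apply_rows = outer_apply(conn)
print(cross_apply_rows)
print(outer_apply_rows)
```

The function parameter changes on every iteration of the outer loop, which is the property that distinguishes apply from an ordinary join against a fixed table.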

3.5 Semi-join and Anti-semi-join

A semi-join returns rows from one table that would join with another table without performing a complete join. An anti-semi-join returns rows from one table that would not join with another table; these are the rows that would be NULL extended if we performed an outer join. Unlike the other join operators, there is no explicit syntax to write a semi-join, but SQL Server uses semi-joins in a variety of circumstances. For example, we may use a semi-join to evaluate an EXISTS sub-query:

select *
from Customers C
where exists (
  select *
  from Sales S
  where S.Cust_Id = C.Cust_Id
)

Cust_Id     Cust_Name
----------- ----------
2           John Doe
3           Jane Doe

Unlike the previous examples, the semi-join only returns each customer one time. The query plan shows that SQL Server indeed uses a semi-join:

|--Nested Loops(Left Semi Join, WHERE:([S].[Cust_Id]=[C].[Cust_Id]))
     |--Table Scan(OBJECT:([Customers] AS [C]))
     |--Table Scan(OBJECT:([Sales] AS [S]))

There are left and right semi-joins. A left semi-join returns rows from the left (first) input that match rows from the right (second) input, while a right semi-join returns rows from the right input that match rows from the left input.
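Both query shapes from this section can be sketched with EXISTS and NOT EXISTS sub-queries. The example assumes SQLite through Python's sqlite3 module, with a hypothetical make_db helper. It shows only the results; SQLite's planner is not SQL Server's, so no Left Semi Join operator should be expected in its plans.

```python
import sqlite3


def make_db():
    """Hypothetical helper: load the sample Customers and Sales data into in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table Customers (Cust_Id int, Cust_Name varchar(10));
        insert into Customers values (1, 'Craig'), (2, 'John Doe'), (3, 'Jane Doe');
        create table Sales (Cust_Id int, Item varchar(10));
        insert into Sales values (2, 'Camera'), (3, 'Computer'), (3, 'Monitor'), (4, 'Printer');
    """)
    return conn


conn = make_db()

# Semi-join shape: customers with at least one sale, each returned once.
semi_rows = conn.execute("""
    select C.Cust_Id, C.Cust_Name
    from Customers C
    where exists (select * from Sales S where S.Cust_Id = C.Cust_Id)
    order by C.Cust_Id
""").fetchall()

# Anti-semi-join shape: customers with no sale at all.
anti_rows = conn.execute("""
    select C.Cust_Id, C.Cust_Name
    from Customers C
    where not exists (select * from Sales S where S.Cust_Id = C.Cust_Id)
    order by C.Cust_Id
""").fetchall()

print("semi-join:", semi_rows)
print("anti-semi-join:", anti_rows)
```

Jane Doe has two sales but appears once in the semi-join result, and Craig, who would be NULL extended in an outer join, is the only anti-semi-join row.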

4. REFERENCES

http://lemire.me/blog/archives/2010/11/29/why-do-we-need-database-joins/
http://blogs.msdn.com/b/craigfr/archive/2006/07/19/671712.aspx
