Just another WordPress.com site

Posts tagged ‘Join’

How to understand SQL joins

SQL is a functional language. Try to think of a SELECT statement as a function. That is, a mathematical function, or mapping, which — this is important — maps an input to an output. When you select data from a table, think of the table as a source. Data streams out of the table. If it helps you, think of a little grinding cog icon. Then it streams out of the cog onto your screen as a familiar tabular result set. The cog is the SELECT statement, the function. It transforms the data. Maybe it just passes it straight through, but it really is a mapping of input to output. (By the way, if you take this approach when programming in XSLT or LISP, you will grasp things much more easily.)

A join is a SELECT statement with multiple data sources. The data streams from those sources into your cog icon, and a single stream flows out again. A SELECT statement always has one and only one output. (Why? Think of a function… think back to your math classes). Joins are functions that perform matching between data streams. The matching is necessary to merge the multiple input streams into a single output.

Let’s look at two tables of data, apples and oranges.

apples
Variety Price
Fuji 5.00
Gala 6.00
oranges
Variety Price
Valencia 4.00
Navel 5.00

Here is an example SELECT statement:

select apples.Variety, oranges.Price from apples 
 inner join oranges on apples.Price = oranges.Price

Here is (conceptually) what happens when we join these tables:

  1. Choose a left-hand table (the first table in the SELECT statement).
  2. For each row in the right-hand table, take the entire left-hand table and stack its rows next to the row in the right-hand table.
    apples and oranges
    Variety Price Variety Price
    Fuji 5.00 Valencia 4.00
    Gala 6.00
    Fuji 5.00 Navel 5.00
    Gala 6.00
  3. Fill in the missing rows in the right-hand table by duplicating them into the empty spaces.
    apples and oranges
    Variety Price Variety Price
    Fuji 5.00 Valencia 4.00
    Gala 6.00 Valencia 4.00
    Fuji 5.00 Navel 5.00
    Gala 6.00 Navel 5.00
  4. The result is a large table containing the cross-product or Cartesian productof the two data sets. Now satisfy the matching criteria by applying them as a predicate to each row in this new data set. If the predicate is true for the row, include it, otherwise exclude it. The result contains a single row:
    apples and oranges
    Variety Price Variety Price
    Fuji 5.00 Navel 5.00
  5. Now choose only the desired columns from the result:
    apples and oranges
    Variety Price
    Fuji 5.00

This may not be what a given query optimizer really does to execute a join, but the result is the same regardless of the algorithm. If a query optimizer does something different, it is for efficiency, not correctness. Every join always involves a cross product followed by choosing the desired data from the result.

Advertisements
%d bloggers like this: