Monday, August 3, 2015

Notes on Sorting Data

When is order guaranteed?

If we run a simple query such as the one below, how do we know the order of the result when it comes back? 

SELECT empid, firstname, lastname, city, MONTH(birthdate) AS birthmonth
FROM HR.Employees
WHERE country = N'USA' AND region = N'WA';


Before, I always thought the results were ordered by the primary key by default. Boy, was I wrong. But after finding out that I wasn't the only one, I didn't feel as bad. The common misconception was the rows will always be returned in insertion order; for some, by primary key order (I belonged to that group); some assume by clustered index order. The correct assumption however is that a query has no guaranteed order unless we explicitly instruct the query by adding the ORDER BY clause.

The example query above returned the following results like this after I ran it in my machine:>

empidfirstnamelastnamecitybirthmonth
1SaraDavisSeattle12
2DonFunkTacoma2
3JudyLewKirkland8
4YaelPeledRedmond9
8MariaCameronSeattle1

Although this result exactly matched the one listed in the Querying Microsoft SQL Server 2012 Training Kit, it does not mean it will be the same if we execute this in another machine. Actually, we are not guaranteed that the result will display like this if we execute this query again later at a different time in the same machine.


Factors that determine the default ordering

I tried executing the query again several times but it always returned the same result sorted like in the above table. However, according to the Training Kit, this is not guaranteed. The database engine can return the data in any order because there is no explicit instruction to return the data using a specific order. One of the factors might cause the query to return the result in this order is optimization. Perhaps, the SQL Server decided that it will return the query in this order for this period of time and there is a good chance that the result will be returned this way unless some circumstances change. Some examples of changes that may affect the database engine are:
  •  changes in data distribution, 
  • availability of physical structures such as indexes, 
  • physical resources such as CPU and memory. 
  • A change in the database version can also affect these like for example, if the database has been upgraded with service packs and patches.

The ORDER BY clause

The only way we can guarantee an ordering is by adding the ORDER BY clause to the query instruction.

SELECT empid, firstname, lastname, city, MONTH(birthdate) AS birthmonth
FROM HR.Employees
WHERE country = N'USA' AND region = N'WA'
ORDER BY city;

After ordering it by 'city', the output of the query becomes:

empidfirstnamelastnamecitybirthmonth
3JudyLewKirkland8
4YaelPeledRedmond9
8MariaCameronSeattle1
1SaraDavisSeattle12
2DonFunkTacoma2

The default direction is always ASC (ascending order). We can add that to the end of the ORDER BY clause but it is optional. The DESC (descending) however needs to be explicitly specified we want the result to be sorted in descending order.

It is important to note that the result of a query with an ORDER BY clause is considered non-relational (from an ordering perspective).  This is result is also called a cursor.

Deterministic/Non-deterministic

The result above is sorted by city, but if we look closely, empid 8 and 1 have the same city (Seattle). So in this case, how did the database engine know that empid 8 will be displayed first before empid 1? In such as case, the order is considered to be non-deterministic. If the city is not unique, the query does not guarantee order among the rows with the same city. Just like before, when it did not have an ORDER BY clause, it has no guarantee on the ordering of the rows returned. 

To make the query above deterministic, we need to add more criteria to the order by clause. Here we are adding a tiebreaker by adding empid so that if the city is the same, the empid will determine the ordering.

SELECT empid, firstname, lastname, city, MONTH(birthdate) AS birthmonth
FROM HR.Employees
WHERE country = N'USA' AND region = N'WA'
ORDER BY city, empid;


Here's the output of the modified query: (deterministic)
empidfirstnamelastnamecitybirthmonth
3JudyLewKirkland8
4YaelPeledRedmond9
1SaraDavisSeattle12
8MariaCameronSeattle1
2DonFunkTacoma2



Bad practice

In T-SQL the ORDER BY clause allows sorting by using the ordinal positions of the columns in the SELECT statement.

SELECT empid, firstname, lastname, city, MONTH(birthdate) AS birthmonth
FROM HR.Employees
WHERE country = N'USA' AND region = N'WA'
ORDER BY 4, 1;


However, this is considered bad practice. This has been mentioned several times in the Training Kit so this practice is not recommended and probably an exam item. It's considered best practice to refer to column names and not the ordinal positions.

Another cool bit that I learned is that you can also order the resulting rows by using column elements that we're not returning in the SELECT clause. For example:

SELECT empid, city,
FROM HR.Employees
WHERE country = N'USA' AND region = N'WA'
ORDER BY birthdate;


empidcity
4Redmond
1Seattle
2Tacoma
8Seattle
3Kirkland

However, the result would be more meaningful if we include the birthdate as part of the result.

The DISTINCT clause

If the DISTINCT clause is added to the SELECT clause, then the previous query in the above section becomes invalid. This is because the resulting rows will not be mapped to the source rows in a one-to-one relationship but rather in a one-to-many relationship. The DISTINCT eliminates  duplicates in the result set. In the example above, we were ordering the result by birthdate, but if there are employees that have distinct birthdates that belong to the same city (for example, Seattle), the query will not pick which employee to return and will fail. So if the DISTINCT case is used, we are limited in the ORDER BY list to only use elements that appear in the SELECT list.

Alias in the ORDER BY clause

Last week, I wrote a post about Logical Query Processing Order. In that post, we know that the ORDER BY gets conceptually evaluated after the SELECT statement. This means that aliases are allowed in the ORDER BY clause.

Treatment of NULLs

How are NULLs sorted? In standard SQL, NULLs are supposed sort together, however it does not dictate whether they should be sorted before non-NULLs or after. It lets the implementation decide. In SQL Server, NULL values are sorted before non-NULLs in the default ascending direction.

SELECT orderid, shippeddate
FROM Sales.Orders
WHERE custid = 20
ORDER BY shippeddate;


After I executed the example query above from the Training Kit, I got the following result
orderidshippeddate
11008NULL
11072NULL
102582006-07-23 00:00:00.000
102632006-07-31 00:00:00.000
103512006-11-20 00:00:00.000
103682006-12-02 00:00:00.000

If we want to make the NULLs sort after non-NULL values in T-SQL. We need to change the query to the one below:

SELECT orderid, shippeddate
FROM Sales.Orders
WHERE custid = 20
ORDER BY CASE
WHEN shippeddate IS NULL THEN 1
ELSE 0
END, shippeddate;

Since ORDER BY allows the use of expressions, we can use a CASE statement to assign to assign a value of "1" if the shippeddate IS NULL and "0" if it is non-NULL. This will put the non-NULLs before the NULL values and sort it using the shippedate (after the CASE statement). If I did not put the shippeddate after the END, it will still put the NULL values after, but the shippeddate will not be sorted.

In stadard SQL, there are options called NULL FIRST and NULLS LAST to conveniently control how NULLs are sorted. T-SQL does not support these functions however.

Summary

It's important to remember that a query that does not have an ORDER BY clause returns a relational result.  This does not guarantee any order and the only way to guarantee an order is to explicity  specify it using the ORDER BY clause. Therefore, a query with an ORDER BY clause returns a result set that is non-relational and it is called a cursor.

If ordering is not important in your result, it is recommended that you avoid using the ORDER BY clause as the sorting can be a very expensive process.

No comments:

Post a Comment