The Data Shizzle: August 2015

Monday, August 10, 2015

A Video Tutorial on How To Add Full-Text Search To SQL Server 2012

I'm on Chapter 6 of the Querying Microsoft SQL Server 2012 Training Kit. In order for me to learn this chapter I have to install/add Full-Text Search into my SQL Server 2012 instance. After looking around on the internet I found this nice video by DBA Presents. It shows you step by step on how to add Full-Text Search and then some.

Friday, August 7, 2015

CTEs

CTE stands for common table expression. It involves three parts:

1. Inner query
2. The name of the query and its columns
3. Outer query

Format

The structure of a CTE goes like this:


WITH <name>

AS (

       <inner query>

   )

<outer query>

No Nesting

CTEs don't nest unlike derived tables. To handle multiple CTEs, you can separate them using commas:


WITH C1 AS

(

  SELECT...

  FROM T1

  WHERE...

),

C2 AS

(

  SELECT...

  FROM C1

  WHERE...

)

SELECT ...

FROM C2

WHERE ...;

Recursion support

You can do recursion with CTEs. The recursive query can have 2 or more queries usually separated by UNION ALL operator. The requirement is one of the queries in the CTE body needs to return a valid relational result. This is known as the anchor member. This gets executed once. Then one of the queries in the CTE body will need to reference the CTE name. This is the query that will be invoked repeatedly until it returns an empty set. The outer query will then represent the unified results of the anchor member along with the results of the recursive member.


WITH EmpsCTE AS

(-- This is the anchor member
 
  SELECT empid, mgrid, firstname, lastname, 0 AS distance

  FROM HR.Employees

  WHERE empid = 9



  UNION ALL  -- separator
 

-- This is the recursive member
 
  SELECT M.empid, M.mgrid, M.firstname, M.lastname, S.distance + 1 AS distance

  FROM EmpsCTE AS S

    JOIN HR.Employees AS M

      ON S.mgrid = M.empid

)-- Outer query

SELECT empid, mgrid, firstname, lastname, distance

FROM EmpsCTE;

It returns the following result:

empid	mgrid	firstname	lastname	distance
9	5	Zoya	Dolgopyatova	0
5	2	Sven	Buck	1
2	1	Don	Funk	2
1	NULL	Sara	Davis	3

Thursday, August 6, 2015

The EXISTS predicate

Today, I've been learning about correlated subqueries. One of the cool new things that I learned today was the EXISTS predicate. I haven't really used it before. But it is similar to the IN predicate. However, after doing some research, IN and EXISTS are quite different from each other especially when it comes from an optimization perspective.

Basically the way EXISTS works is that it accepts a subquery as input and returns true if the subquery returns at least one row, otherwise it returns false. For example:


SELECT custid, companyname

FROM Sales.Customers AS C 

WHERE EXISTS

  (SELECT * 

   FROM Sales.Orders AS O

   WHERE O.custid = C.custid 

     AND O.orderdate = '20070212');

The query returns customers and their company that had orders placed on February 12, 2007.

custid	companyname
5	Customer HGVLZ
66	Customer LHANT

If the subquery did not return anything, then the outer query will not return any customer.

Can be negated

The EXISTS predicate can also be negated. For example:


SELECT custid, companyname

FROM Sales.Customers AS C 

WHERE NOT EXISTS

  (SELECT * 

   FROM Sales.Orders AS O

   WHERE O.custid = C.custid 

     AND O.orderdate = '20070212');

The result will return customers who did not order on February 7, 2012.

custid	companyname
72	Customer AHPOP
58	Customer AHXHT
25	Customer AZJED
18	Customer BSVAR

EXISTS vs IN

As I mentioned earlier, the first question that came to mind when I was looking at the EXISTS, was how is it different from IN? Someone posted the same question at Stack Overflow which produced some interesting discussion http://stackoverflow.com/questions/24929/difference-between-exists-and-in-in-sql. EXISTS predicate is supposed to be faster than the IN. The main purpose of the EXISTS is to well, check if a row does exists. Thus, avoiding the counting stuff such as:

SELECT COUNT(*) FROM Table

I've actually seen a lot of the query above used at work in places where they could have used EXISTS instead. Maybe I can start introducing more EXISTS since it is supposed to be faster and simpler to write. Also the example above has to check the entire table wherein if we use EXISTS, the statement will return true as soon as a row is found.

Wednesday, August 5, 2015

SQL Server 2012 and SQL Server Management Studio (SSMS) after Windows 10 update

I just finished upgrading my machine to the new Windows 10 OS and I just wanted to share to all the others who are wondering if their SSMS or SQL Server 2012 install won't work, that it should work fine. One of the worries I had before deciding to upgrade was that I might have to re-install my SSMS or I might have to do some reconfiguration because I saw a post somewhere that they were having trouble connecting to their DB after they upgraded. The upgrade was pretty seamless and smooth. It took me about 30 mins. I upgraded from version 8.1 and I'm loving 10 so far. I'm rocking my SSMS and SQL Server 2012 that I installed on my Windows 8.1. So come on and jump. The water is just fine.

Monday, August 3, 2015

Notes on Sorting Data

When is order guaranteed?

If we run a simple query such as the one below, how do we know the order of the result when it comes back?


SELECT empid, firstname, lastname, city, MONTH(birthdate) AS birthmonth

FROM HR.Employees

WHERE country = N'USA' AND region = N'WA';

Before, I always thought the results were ordered by the primary key by default. Boy, was I wrong. But after finding out that I wasn't the only one, I didn't feel as bad. The common misconception was the rows will always be returned in insertion order; for some, by primary key order (I belonged to that group); some assume by clustered index order. The correct assumption however is that a query has no guaranteed order unless we explicitly instruct the query by adding the ORDER BY clause.

The example query above returned the following results like this after I ran it in my machine:>

empid	firstname	lastname	city	birthmonth
1	Sara	Davis	Seattle	12
2	Don	Funk	Tacoma	2
3	Judy	Lew	Kirkland	8
4	Yael	Peled	Redmond	9
8	Maria	Cameron	Seattle	1

Although this result exactly matched the one listed in the Querying Microsoft SQL Server 2012 Training Kit, it does not mean it will be the same if we execute this in another machine. Actually, we are not guaranteed that the result will display like this if we execute this query again later at a different time in the same machine.

Factors that determine the default ordering

I tried executing the query again several times but it always returned the same result sorted like in the above table. However, according to the Training Kit, this is not guaranteed. The database engine can return the data in any order because there is no explicit instruction to return the data using a specific order. One of the factors might cause the query to return the result in this order is optimization. Perhaps, the SQL Server decided that it will return the query in this order for this period of time and there is a good chance that the result will be returned this way unless some circumstances change. Some examples of changes that may affect the database engine are:

changes in data distribution,
availability of physical structures such as indexes,
physical resources such as CPU and memory.
A change in the database version can also affect these like for example, if the database has been upgraded with service packs and patches.

The ORDER BY clause

The only way we can guarantee an ordering is by adding the ORDER BY clause to the query instruction.


SELECT empid, firstname, lastname, city, MONTH(birthdate) AS birthmonth

FROM HR.Employees

WHERE country = N'USA' AND region = N'WA'

ORDER BY city;

After ordering it by 'city', the output of the query becomes:

empid	firstname	lastname	city	birthmonth
3	Judy	Lew	Kirkland	8
4	Yael	Peled	Redmond	9
8	Maria	Cameron	Seattle	1
1	Sara	Davis	Seattle	12
2	Don	Funk	Tacoma	2

The default direction is always ASC (ascending order). We can add that to the end of the ORDER BY clause but it is optional. The DESC (descending) however needs to be explicitly specified we want the result to be sorted in descending order.

It is important to note that the result of a query with an ORDER BY clause is considered non-relational (from an ordering perspective). This is result is also called a cursor.

Deterministic/Non-deterministic

The result above is sorted by city, but if we look closely, empid 8 and 1 have the same city (Seattle). So in this case, how did the database engine know that empid 8 will be displayed first before empid 1? In such as case, the order is considered to be non-deterministic. If the city is not unique, the query does not guarantee order among the rows with the same city. Just like before, when it did not have an ORDER BY clause, it has no guarantee on the ordering of the rows returned.

To make the query above deterministic, we need to add more criteria to the order by clause. Here we are adding a tiebreaker by adding empid so that if the city is the same, the empid will determine the ordering.


SELECT empid, firstname, lastname, city, MONTH(birthdate) AS birthmonth

FROM HR.Employees

WHERE country = N'USA' AND region = N'WA'

ORDER BY city, empid;

Here's the output of the modified query: (deterministic)

empid	firstname	lastname	city	birthmonth
3	Judy	Lew	Kirkland	8
4	Yael	Peled	Redmond	9
1	Sara	Davis	Seattle	12
8	Maria	Cameron	Seattle	1
2	Don	Funk	Tacoma	2

Bad practice

In T-SQL the ORDER BY clause allows sorting by using the ordinal positions of the columns in the SELECT statement.

SELECT empid, firstname, lastname, city, MONTH(birthdate) AS birthmonth

FROM HR.Employees

WHERE country = N'USA' AND region = N'WA'

ORDER BY 4, 1;

However, this is considered bad practice. This has been mentioned several times in the Training Kit so this practice is not recommended and probably an exam item. It's considered best practice to refer to column names and not the ordinal positions.

Another cool bit that I learned is that you can also order the resulting rows by using column elements that we're not returning in the SELECT clause. For example:

SELECT empid, city,

FROM HR.Employees

WHERE country = N'USA' AND region = N'WA'

ORDER BY birthdate;

empid	city
4	Redmond
1	Seattle
2	Tacoma
8	Seattle
3	Kirkland

However, the result would be more meaningful if we include the birthdate as part of the result.

The DISTINCT clause

If the DISTINCT clause is added to the SELECT clause, then the previous query in the above section becomes invalid. This is because the resulting rows will not be mapped to the source rows in a one-to-one relationship but rather in a one-to-many relationship. The DISTINCT eliminates duplicates in the result set. In the example above, we were ordering the result by birthdate, but if there are employees that have distinct birthdates that belong to the same city (for example, Seattle), the query will not pick which employee to return and will fail. So if the DISTINCT case is used, we are limited in the ORDER BY list to only use elements that appear in the SELECT list.

Alias in the ORDER BY clause

Last week, I wrote a post about Logical Query Processing Order. In that post, we know that the ORDER BY gets conceptually evaluated after the SELECT statement. This means that aliases are allowed in the ORDER BY clause.

Treatment of NULLs

How are NULLs sorted? In standard SQL, NULLs are supposed sort together, however it does not dictate whether they should be sorted before non-NULLs or after. It lets the implementation decide. In SQL Server, NULL values are sorted before non-NULLs in the default ascending direction.


SELECT orderid, shippeddate

FROM Sales.Orders

WHERE custid = 20

ORDER BY shippeddate;

After I executed the example query above from the Training Kit, I got the following result

orderid	shippeddate
11008	NULL
11072	NULL
10258	2006-07-23 00:00:00.000
10263	2006-07-31 00:00:00.000
10351	2006-11-20 00:00:00.000
10368	2006-12-02 00:00:00.000

If we want to make the NULLs sort after non-NULL values in T-SQL. We need to change the query to the one below:


SELECT orderid, shippeddate

FROM Sales.Orders

WHERE custid = 20

ORDER BY CASE
 
WHEN shippeddate IS NULL THEN 1

ELSE 0

END, shippeddate;

Since ORDER BY allows the use of expressions, we can use a CASE statement to assign to assign a value of "1" if the shippeddate IS NULL and "0" if it is non-NULL. This will put the non-NULLs before the NULL values and sort it using the shippedate (after the CASE statement). If I did not put the shippeddate after the END, it will still put the NULL values after, but the shippeddate will not be sorted.

In stadard SQL, there are options called NULL FIRST and NULLS LAST to conveniently control how NULLs are sorted. T-SQL does not support these functions however.

Summary

It's important to remember that a query that does not have an ORDER BY clause returns a relational result. This does not guarantee any order and the only way to guarantee an order is to explicity specify it using the ORDER BY clause. Therefore, a query with an ORDER BY clause returns a result set that is non-relational and it is called a cursor.

If ordering is not important in your result, it is recommended that you avoid using the ORDER BY clause as the sorting can be a very expensive process.