Summarizing data in a SELECT statement using a GROUP BY clause is a very common area of difficulty for beginning SQL programmers. In Part I of this two-part series, we'll use a simple schema and a typical report request to cover the effect of JOINs on grouping and aggregate calculations, and how to use COUNT(Distinct) to overcome this. In Part II, we'll finish up our report while examining the problem with SUM(Distinct) and discussing how useful derived tables can be when grouping complicated data. (This article has been updated through SQL Server 2005.)
Here's the schema we'll be using along with some sample data:
    create table Orders
    (
        OrderID int primary key,
        Customer varchar(10),
        OrderDate datetime,
        ShippingCost money
    )

    create table OrderDetails
    (
        DetailID int primary key,
        OrderID int references Orders(OrderID),
        Item varchar(10),
        Amount money
    )

    go

    insert into Orders
    select 1,'ABC', '2007-01-01', 40 union all
    select 2,'ABC', '2007-01-02', 30 union all
    select 3,'ABC', '2007-01-03', 25 union all
    select 4,'DEF', '2007-01-02', 10

    insert into OrderDetails
    select 1, 1, 'Item A', 100 union all
    select 2, 1, 'Item B', 150 union all
    select 3, 2, 'Item C', 125 union all
    select 4, 2, 'Item B', 50 union all
    select 5, 2, 'Item H', 200 union all
    select 6, 3, 'Item X', 100 union all
    select 7, 4, 'Item Y', 50 union all
    select 8, 4, 'Item Z', 300
Determining the Virtual Primary Key of a Result Set
Let's examine our sample data by joining these tables together to return Orders along with the OrderDetails:
    select o.orderID, o.customer, o.orderdate, o.shippingCost,
           od.DetailID, od.Item, od.Amount
    from orders o
    inner join OrderDetails od on o.OrderID = od.OrderID

    orderID     customer   orderdate   shippingCost  DetailID  Item    Amount
    ----------- ---------- ----------- ------------- --------- ------- ---------
    1           ABC        2007-01-01  40.0000       1         Item A  100.0000
    1           ABC        2007-01-01  40.0000       2         Item B  150.0000
    2           ABC        2007-01-02  30.0000       3         Item C  125.0000
    2           ABC        2007-01-02  30.0000       4         Item B  50.0000
    2           ABC        2007-01-02  30.0000       5         Item H  200.0000
    3           ABC        2007-01-03  25.0000       6         Item X  100.0000
    4           DEF        2007-01-02  10.0000       7         Item Y  50.0000
    4           DEF        2007-01-02  10.0000       8         Item Z  300.0000

    (8 row(s) affected)
Remember that Orders has a one-to-many or parent/child relationship with OrderDetails: thus, one order can have many details. When we join them together, the Orders columns are repeated for each OrderDetail. This is normal, standard SQL behavior; it is what happens when you join tables in a one-to-many relation. Our result contains one row per OrderDetail, and those OrderDetail rows are never repeated, since OrderDetails is not being joined to any further tables that would produce duplicate rows.
Thus, we could say that our result set has a virtual primary key of DetailID; there will never be a duplicate OrderDetail row in the data. We can count and add up all OrderDetail columns and never worry about double counting values. However, we cannot do the same for the Orders table, since its rows are duplicated in the results. Remember this as we move forward.
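If you would rather check the virtual primary key than take it on faith, a rough sketch like the following can help. It assumes only the sample tables and rows shown above; the column alias TimesReturned is made up for illustration.

```sql
-- Any DetailID returned more than once by the join would be listed here.
-- With the sample data above, this should return no rows.
select od.DetailID, count(*) as TimesReturned
from Orders o
inner join OrderDetails od on o.OrderID = od.OrderID
group by od.DetailID
having count(*) > 1

-- The same check against OrderID lists every order that has
-- more than one detail row, i.e. every duplicated Orders row.
select o.OrderID, count(*) as TimesReturned
from Orders o
inner join OrderDetails od on o.OrderID = od.OrderID
group by o.OrderID
having count(*) > 1
```

With the four sample orders, the second query should list OrderIDs 1, 2 and 4, which are exactly the orders whose rows repeat in the joined result.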
A Typical Summary Report Example
Here is a good example using Orders and OrderDetails that demonstrates various aggregate functions and typical things to look for:
For each Customer, we want to return the total number of orders, the number of items ordered, the total order amount, and the total shipping cost.
All of this information lives in the Orders and the OrderDetails tables, and we know how to join them together; we just need to summarize those results now. We don't want to return all 8 rows and see all of the details; we just want to return 2 rows: one per customer, with the corresponding summary calculations.
Grouping and Summarizing Primary Rows
Since we want to return 1 row for each Customer, we can simply add GROUP BY Customer to the end of the SELECT. We can return the Customer column since we are grouping on it, and we can add any other columns as long as they are summarized in an aggregate function. We can also use the COUNT(*) aggregate function to return the number of rows per group. Since each row in our data corresponds to an order item, grouping by Customer and selecting Customer and COUNT(*) will return the number of OrderDetails per Customer:
    select o.Customer, count(*) as ItemCount
    from Orders o
    inner join OrderDetails od on o.OrderID = od.OrderID
    group by o.Customer

    Customer   ItemCount
    ---------- -----------
    ABC        6
    DEF        2

    (2 row(s) affected)
Next, let's add the total order amount per customer. Since our join resulted in one row per item detail (remember: when joining Orders to OrderDetails, the result set has a virtual primary key of DetailID) we know that no rows in the OrderDetails table were duplicated in our results. Therefore, a simple SUM() of the Amount column from the OrderDetails table is all we need to return the total order amount per customer:
    select o.Customer, count(*) as ItemCount, sum(od.Amount) as OrderAmount
    from Orders o
    inner join OrderDetails od on o.OrderID = od.OrderID
    group by o.Customer

    Customer   ItemCount   OrderAmount
    ---------- ----------- ---------------------
    ABC        6           725.0000
    DEF        2           350.0000

    (2 row(s) affected)
So far, so good! Only two more calculations to go: total orders per customer, and total shipping cost. Both of these values come from the Orders table.
If you recall, our join resulted in one row per OrderDetail, and that meant that the Orders table had duplicate rows. We need to return two calculations from our Orders table -- total number of Orders, and total Shipping cost. We already know that COUNT(*) returns the number of OrderDetails, so that won't work for us. Perhaps COUNT(OrderID) will return the total number of orders? Let's try it:
    select o.Customer, count(*) as ItemCount, sum(od.Amount) as OrderAmount,
           count(o.OrderID) as OrderCount
    from Orders o
    inner join OrderDetails od on o.OrderID = od.OrderID
    group by o.Customer

    Customer   ItemCount   OrderAmount           OrderCount
    ---------- ----------- --------------------- -----------
    ABC        6           725.0000              6
    DEF        2           350.0000              2

    (2 row(s) affected)
Notice that the OrderCount column returned is the same as the ItemCount, and definitely not the number of orders per customer. That is because, by definition in SQL, COUNT(expression) just returns the total number of rows in which that expression is not null. Since OrderID is never null, it returns the total row count per Customer, and since each row corresponds with an OrderDetail item, we get the number of OrderDetail items per customer, not the number of Orders per customer.
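To see the "not null" rule in isolation, here is a tiny sketch that does not touch our sample tables at all. The inline table t and its column x are invented purely for this illustration.

```sql
-- Three rows, one of which has a NULL in column x.
select count(*) as AllRows,   -- counts every row: 3
       count(x) as NonNullX   -- counts only rows where x is not null: 2
from
(
    select 1 as x union all
    select null   union all
    select 1
) t
```

Notice that the two non-null values are both 1, and COUNT(x) still counts each of them. COUNT(expression) cares only about whether a value is null, not about whether it repeats, which is why COUNT(o.OrderID) gave us the detail row count.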
Using COUNT(Distinct)
Luckily, there is a DISTINCT feature that we can use in our COUNT() aggregate function that will help us here. COUNT(Distinct expression) means: "return the total number of distinct values for the specified expression." So, if we write COUNT(Distinct OrderID), it will return the number of distinct OrderID values per Customer. Since each OrderID value corresponds to an Order (it is the primary key of the Orders table), we can use this to calculate our Order count:
    select o.Customer, count(*) as ItemCount, sum(od.Amount) as OrderAmount,
           count(distinct o.OrderID) as OrderCount
    from Orders o
    inner join OrderDetails od on o.OrderID = od.OrderID
    group by o.Customer

    Customer   ItemCount   OrderAmount           OrderCount
    ---------- ----------- --------------------- -----------
    ABC        6           725.0000              3
    DEF        2           350.0000              1

    (2 row(s) affected)
Great! Looks good, makes sense, now we are getting somewhere.
Beware of SUMMING Duplicate Values
Moving on, let's add an expression to calculate total shipping cost per customer. Well, we know we have a ShippingCost column in our data from the Orders table, so let's try just adding SUM(ShippingCost) to our SELECT:
    select o.Customer, count(*) as ItemCount, sum(od.Amount) as OrderAmount,
           count(distinct o.OrderID) as OrderCount,
           sum(o.ShippingCost) as TotalShipping
    from Orders o
    inner join OrderDetails od on o.OrderID = od.OrderID
    group by o.Customer

    Customer   ItemCount   OrderAmount           OrderCount  TotalShipping
    ---------- ----------- --------------------- ----------- ---------------------
    ABC        6           725.0000              3           195.0000
    DEF        2           350.0000              1           20.0000

    (2 row(s) affected)
Looks like we are good to go, right? Well, not really: look at our TotalShipping column. Customer ABC has a total shipping of 195. Yet, in our Orders table, we see that the customer has 3 orders, with shipping values of 40, 30 and 25. 40+30+25=95, not 195! What is going on here? Well, like the COUNT(*) expression, remember that SUM() is acting not upon our tables themselves, but upon the result of the JOIN that we expressed in our SELECT. The JOIN from Orders to OrderDetails meant that rows from the Orders table were duplicated, remember? So, if we SUM() a column in our Orders table when it is joined to the details, we are summing up duplicate values and our result will be too high. You can verify this by reviewing the results returned by the join from Orders to OrderDetails and adding up the ShippingCost values by hand.
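One way to make the over-counting visible is to summarize the join by order instead of by customer. The following is only a sketch against the sample data above; the aliases TimesRepeated and ShippingAsSummed are made up for illustration.

```sql
-- For each order: how many times its row is repeated in the join,
-- and what its ShippingCost contributes to a plain SUM().
select o.OrderID, o.Customer, o.ShippingCost,
       count(*) as TimesRepeated,
       sum(o.ShippingCost) as ShippingAsSummed
from Orders o
inner join OrderDetails od on o.OrderID = od.OrderID
group by o.OrderID, o.Customer, o.ShippingCost
```

For customer ABC this should show contributions of 80 (40 twice), 90 (30 three times) and 25 (once), which is where the 195 comes from.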
It is very important to understand this; when writing any JOINS, you need to identify which tables can and cannot have duplicate rows returned. The basic rule is very simple:
If Table A has a one-to-many relation with Table B, and Table A is JOINED to Table B, then Table A will have duplicate rows in the result. The final result will have a virtual primary key equal to that of Table B.
So, we know we can SUM() and COUNT() columns from Table B with no problem, but we cannot do that with columns from Table A because the duplicate values will skew our results. How do we handle this situation? We'll cover that and a whole lot more in Part II.
Part II
In Part I, we worked on a simple report request and covered the basics of GROUP BY and the issue of duplicate rows caused by JOINs. Today we'll finish up that report while examining SUM(Distinct), and see just how crucial derived tables are when summarizing data from multiple tables.
The Problem with SUM(Distinct)
We previously learned that we can use COUNT(Distinct) to count columns from the duplicated table, so what about SUM(Distinct)? It seems like that should do the trick, since we only want to sum distinct shipping cost values, not all the duplicates. Let's give it a try:
    select o.Customer, count(*) as ItemCount, sum(od.Amount) as OrderAmount,
           count(distinct o.OrderID) as OrderCount,
           sum(distinct o.ShippingCost) as TotalShipping
    from Orders o
    inner join OrderDetails od on o.OrderID = od.OrderID
    group by o.Customer

    Customer   ItemCount   OrderAmount           OrderCount  TotalShipping
    ---------- ----------- --------------------- ----------- ---------------------
    ABC        6           725.0000              3           95.0000
    DEF        2           350.0000              1           10.0000

    (2 row(s) affected)
And there it is! We seem to have solved our problem: looking back to our Orders table, we can see that the TotalShipping cost per Customer now looks correct.
But wait ... It is actually wrong!
This is where many people have problems. Yes, the data looks correct. And, for this small sample, it just randomly happens to be correct. But SUM(DISTINCT) works exactly the same as COUNT(DISTINCT): It simply gets all of the values eligible to be summed, eliminates all duplicate values, and then adds up the results. But it is eliminating duplicate values, not duplicate rows based on some primary key column! It doesn't care that shipping cost 40 belonged to orderID #1 and that shipping cost 30 belonged to OrderID #2; it simply doesn't separate them that way.
The expression SUM(Distinct ShippingCost) is basically evaluated like this:
- After joining from Orders to OrderDetails, each group has the following shipping cost values:
    Customer ABC: 40, 40, 30, 30, 30, 25
    Customer DEF: 10, 10
- Since DISTINCT was asked for, it eliminates duplicate values from those lists:
    Customer ABC: 40, 30, 25
    Customer DEF: 10
- And now it can evaluate the SUM() by adding up the remaining values:
    Customer ABC: 40+30+25 = 95
    Customer DEF: 10 = 10
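The list of values that survives the DISTINCT step can be produced directly. This is just a sketch of what SUM(Distinct ShippingCost) effectively works from, using the sample tables above:

```sql
-- One row per distinct (Customer, ShippingCost) pair in the joined data.
-- Note that OrderID plays no part in deciding what counts as a duplicate.
select distinct o.Customer, o.ShippingCost
from Orders o
inner join OrderDetails od on o.OrderID = od.OrderID
order by o.Customer, o.ShippingCost desc
```

Adding up the ShippingCost values in that output per customer gives the same totals as SUM(Distinct ShippingCost). Nothing in it says which order a value came from.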
If you aren't getting the concept, you still might not see the problem. In fact, at this point, many people never do. They see that SUM(x) returns huge numbers that cannot be right, so they tweak it and try SUM(DISTINCT x), and the values look much more reasonable, and they might even initially tie out perfectly, so off to production it goes. Yet, the SQL is incorrect; it is relying on the fact that currently no two orders for a customer have the same shipping cost.
Let's demonstrate by adding another order:
    insert into Orders values (5, 'DEF', '2007-01-04', 10)
    insert into OrderDetails values (9, 5, 'Item J', 125)
Running that simply adds another Order for Customer DEF, shipping cost of $10, with one OrderDetail item for $125. Now, let's execute that same SELECT again to see how this new Order affected our results:
    select o.Customer, count(*) as ItemCount, sum(od.Amount) as OrderAmount,
           count(distinct o.OrderID) as OrderCount,
           sum(distinct o.ShippingCost) as TotalShipping
    from Orders o
    inner join OrderDetails od on o.OrderID = od.OrderID
    group by o.Customer

    Customer   ItemCount   OrderAmount           OrderCount  TotalShipping
    ---------- ----------- --------------------- ----------- ---------------------
    ABC        6           725.0000              3           95.0000
    DEF        3           475.0000              2           10.0000

    (2 row(s) affected)
The ItemCount, OrderAmount and OrderCount columns look great. But the TotalShipping cost for DEF still shows $10! What happened!?
Can you figure it out? Remember how SUM(Distinct) works! It just takes the values passed to the function and eliminates duplicates. Both orders for DEF had a shipping cost of $10, and SUM(Distinct ShippingCost) doesn't care that the two $10 values are for different Orders; it just knows that the 10 is duplicated for the Customer, so it only uses the 10 once to calculate the SUM. Thus, it returns a value of 10 as the total shipping cost for those two orders, even though it should be 10+10=20. Our result is now wrong. The long and short of it is this: Never use SUM(Distinct)! It rarely makes logical sense; there may be a time and place for it, but it is definitely not here.
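If you have inherited a report that already relies on SUM(Distinct), a quick sketch like this one can show whether your data contains the case that breaks it. It assumes the Orders table above, including the new order we just inserted; the alias OrdersSharingCost is made up for illustration.

```sql
-- Customers with two or more orders that share the same shipping cost.
-- Every row returned here is a case where SUM(Distinct ShippingCost)
-- would under-report that customer's total shipping.
select Customer, ShippingCost, count(*) as OrdersSharingCost
from Orders
group by Customer, ShippingCost
having count(*) > 1
```

With our sample data this should return a single row: customer DEF, shipping cost 10, shared by 2 orders. An empty result only means the data happens to be safe today; it does not make the SUM(Distinct) query correct.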
Summarizing Derived Tables
So, how do we fix this? Well, like many SQL problems, the answer is simple: do it one step at a time. Don't try to join all of the tables together and just add SUM() and GROUP BY and DISTINCT almost randomly until things work; break it down logically, step by step.
So, before worrying about totals per Customer, let's step back and focus on returning totals per Order. If we can return totals per Order first, then we can simply summarize those Order totals by Customer and we'll have the results we need. Let's summarize the OrderDetails table to return 1 row per Order, with the ItemCount and the total Order Amount:
    select orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
    from orderDetails
    group by orderID

    orderID     ItemCount   OrderAmount
    ----------- ----------- ---------------------
    1           2           250.0000
    2           3           375.0000
    3           1           100.0000
    4           2           350.0000
    5           1           125.0000

    (5 row(s) affected)
Nice and simple, easy to verify, things look good. Because we are grouping on OrderID, we can say that these results have a virtual primary key of OrderID -- that is, there will never be duplicate rows for the same Order. In fact, here's another basic rule to always remember:
The virtual primary key of a SELECT with a GROUP BY clause will always be the expressions stated in the GROUP BY.
We can now take that SQL statement and those results and encapsulate them in their own derived table. If we join from the Orders table to the previous SELECT as a derived table, we get:
    select o.orderID, o.Customer, o.ShippingCost, d.ItemCount, d.OrderAmount
    from orders o
    inner join
    (
        select orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
        from orderDetails
        group by orderID
    ) d
        on o.orderID = d.orderID

    orderID     Customer   ShippingCost          ItemCount   OrderAmount
    ----------- ---------- --------------------- ----------- ---------------------
    1           ABC        40.0000               2           250.0000
    2           ABC        30.0000               3           375.0000
    3           ABC        25.0000               1           100.0000
    4           DEF        10.0000               2           350.0000
    5           DEF        10.0000               1           125.0000

    (5 row(s) affected)
Let's examine those results. There are no duplicate rows or values anywhere; there is exactly one row per Order. This is because our derived table has a virtual primary key of OrderID, so joining from Orders to our derived table will never produce duplicates. This is a very useful and simple technique to avoid duplicates when relating a parent table to a child table: summarize the child table by the parent's primary key first in a derived table, and then join it to the parent table. The parent table's rows will then never be duplicated and can be summarized accurately.
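The claim that no Orders row is duplicated can be tested the same way as any other virtual primary key. Here is a sketch, assuming the same tables and the derived table shown above; TimesReturned is a made-up alias.

```sql
-- If joining Orders to the per-order summary ever duplicated an order,
-- that OrderID would be listed here. It should return no rows.
select o.OrderID, count(*) as TimesReturned
from Orders o
inner join
(
    select OrderID, count(*) as ItemCount, sum(Amount) as OrderAmount
    from OrderDetails
    group by OrderID
) d
    on o.OrderID = d.OrderID
group by o.OrderID
having count(*) > 1
```

Because the derived table is grouped by OrderID, it can contain at most one row per order, so this check should stay empty no matter how many detail rows are added.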
Now we have our total ItemCount per order, as well as our total OrderAmount per order. And we can see that if we sum these results up, our ShippingCost column will be fine, since it is never duplicated. No need for distinct. In fact, we can even use a regular COUNT(*) expression to get the total number of orders per customer!
So, we can simply add "GROUP BY Customer" to the previous SQL, calculate what we need with aggregate functions, and remove any columns (like OrderID) that we will not be summarizing. You might also notice that at this point, the total ItemCount per Customer is no longer a COUNT(*) expression; it is a simple SUM() of the ItemCount value returned from our derived table.
Here's the result:
    select o.Customer, count(*) as OrderCount,
           sum(o.ShippingCost) as ShippingTotal,
           sum(d.ItemCount) as ItemCount,
           sum(d.OrderAmount) as OrderAmount
    from orders o
    inner join
    (
        select orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
        from orderDetails
        group by orderID
    ) d
        on o.orderID = d.orderID
    group by o.customer

    Customer   OrderCount  ShippingTotal         ItemCount   OrderAmount
    ---------- ----------- --------------------- ----------- ---------------------
    ABC        3           95.0000               6           725.0000
    DEF        2           20.0000               3           475.0000

    (2 row(s) affected)
And there you have it! We examined our data, logically considered the implications of our JOINS, broke the problem down into smaller parts, and ended up with a fairly simple solution that we know will be quick, efficient and accurate.
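If you are on SQL Server 2005 or later, the same query can also be written with a common table expression instead of an inline derived table. This is only an alternative sketch of the same technique, not a different solution; the name OrderTotals is made up for illustration.

```sql
-- Step 1: summarize the child table by the parent's primary key.
-- (If this is not the first statement in the batch, the statement
--  before it must end with a semicolon.)
with OrderTotals as
(
    select OrderID, count(*) as ItemCount, sum(Amount) as OrderAmount
    from OrderDetails
    group by OrderID
)
-- Step 2: join one-to-one to Orders, then summarize by Customer.
select o.Customer, count(*) as OrderCount,
       sum(o.ShippingCost) as ShippingTotal,
       sum(t.ItemCount) as ItemCount,
       sum(t.OrderAmount) as OrderAmount
from Orders o
inner join OrderTotals t on o.OrderID = t.OrderID
group by o.Customer
```

The logic is identical to the derived table version; the CTE just lets you read the steps top to bottom in the order you reasoned about them.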
Adding More Tables to a Summarized SELECT
To finish things up, suppose our schema also has a table of Customers:
    create table Customers
    (
        Customer varchar(10) primary key,
        CustomerName varchar(100) not null,
        City varchar(100) not null,
        State varchar(2) not null
    )

    insert into Customers
    select 'ABC','ABC Corporation','Boston','MA' union all
    select 'DEF','The DEF Foundation','New York City','NY'
... and we wish to also return each customer's name, city and state in our previous results. One way to do this is to simply add the Customers table to our existing join, and then add the customer columns to the SELECT clause. However, unless you add all of the customer columns to the GROUP BY as well, you will get an error message indicating that you need to either group or summarize all columns you wish to display. We aren't trying to calculate a COUNT() or a SUM() of Name, City and State, so it doesn't make sense to wrap those columns in an aggregate expression. So, it appears that we must add them all to our GROUP BY clause to get the results we need:
    select o.Customer, c.customerName, c.City, c.State,
           count(*) as OrderCount,
           sum(o.ShippingCost) as ShippingTotal,
           sum(d.ItemCount) as ItemCount,
           sum(d.OrderAmount) as OrderAmount
    from orders o
    inner join
    (
        select orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
        from orderDetails
        group by orderID
    ) d
        on o.orderID = d.orderID
    inner join customers c
        on o.customer = c.customer
    group by o.customer, c.customerName, c.City, c.State

    Customer   customerName         City            State OrderCount  ShippingTotal ItemCount OrderAmount
    ---------- -------------------- --------------- ----- ----------- ------------- --------- -----------
    ABC        ABC Corporation      Boston          MA    3           95.0000       6         725.0000
    DEF        The DEF Foundation   New York City   NY    2           20.0000       3         475.0000

    (2 row(s) affected)
Technically, that works, but it seems silly to list all of those customer columns in the GROUP BY ... After all, we are just grouping on Customer, not on each of the customer's attributes, right?
What's interesting is that the solution is something we already talked about and the same technique applies: Since Customers has a one-to-many relation with Orders, we know that joining Customers to Orders will result in duplicate rows per Customer, and thus all columns in the Customers table are duplicated in the results. You might notice that this is exactly the same scenario that applies when joining Orders to OrderDetails. So, we handle this situation the same way! We simply summarize our Orders by Customer first, in a derived table, and then we join those results to the Customers table. This means that no columns from the Customers table will be duplicated at all, and there is no need to add them all to our GROUP BY expression. This keeps our SQL clean, organized, and logically sound.
So, our final results now look like this:
    select c.Customer, c.customerName, c.City, c.State,
           o.OrderCount, o.ShippingTotal, o.ItemCount, o.OrderAmount
    from
    (
        select o.customer, count(*) as OrderCount,
               sum(o.ShippingCost) as ShippingTotal,
               sum(d.ItemCount) as ItemCount,
               sum(d.OrderAmount) as OrderAmount
        from orders o
        inner join
        (
            select orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
            from orderDetails
            group by orderID
        ) d
            on o.orderID = d.orderID
        group by o.customer
    ) o
    inner join customers c
        on o.customer = c.customer

    Customer   customerName         City            State OrderCount  ShippingTotal ItemCount OrderAmount
    ---------- -------------------- --------------- ----- ----------- ------------- --------- -----------
    ABC        ABC Corporation      Boston          MA    3           95.0000       6         725.0000
    DEF        The DEF Foundation   New York City   NY    2           20.0000       3         475.0000

    (2 row(s) affected)
Conclusion
I hope this two part series helps a little bit with your understanding of GROUP BY queries. It is vital to identify and understand what the virtual primary key of a result set is when you join multiple tables, and to recognize which rows are duplicated or not. In addition, remember that COUNT(Distinct) can be useful, but SUM(Distinct) should very rarely, if ever, be used.
In general, if you find that values you need to SUM() have been duplicated, summarize the table causing those duplicates separately and join it in as a derived table. This will also allow you to break down your problem into smaller steps and test and validate the results of each step as you go.
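As one example of validating a step, the parent-level totals in the final report can be compared against the Orders table on its own, with no joins involved. This is a sketch assuming the sample Orders table from this article:

```sql
-- Totals taken straight from the parent table, with nothing to duplicate rows.
-- OrderCount and ShippingTotal in the final report should match these.
select Customer, count(*) as OrderCount, sum(ShippingCost) as ShippingTotal
from Orders
group by Customer
```

With the sample data this should show 3 orders and 95 in shipping for ABC, and 2 orders and 20 for DEF. Keep in mind that the two only agree as long as every order has at least one detail row, since the report uses an inner join to the summarized details.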
GROUP BY is a very powerful feature, but is also misunderstood and abused, and the easiest way to leverage it is to carefully build your SQL from smaller, simpler parts into larger, more complicated solutions.