SQL Server Data Partitioning - Groups Within Groups

November 21, 2024#Software Development

Occasionally you’ll find yourself wanting to query data from a database. I know I do! Sometimes that involves grouping your data.

Sometimes you want to group your groups and pull specific data from those groups.

Lots of grouping! Today we’ll look at how we’d tackle this in SQL Server, without having to resort to manually querying for and iterating over whole collections just to get at what we want.

The Problem to Solve

Let’s say we want to query for Orders placed in our system, to see what the highest-rated Item is for each Order. This means that we will have to group our sets of Order Items together by Order, and then from there collect a subset of Item data from each Order.

Database Setup

A Simple Schema

First, let’s establish our database schema that we’ll use to demonstrate the tools we have available to solve our problem.

CREATE TABLE Orders (
  ID INT IDENTITY(1, 1),
  Date DATETIMEOFFSET NOT NULL DEFAULT GETDATE(),
  PRIMARY KEY(ID)
);
CREATE TABLE Items (
  ID INT IDENTITY(1, 1),
  Name VARCHAR(100),
  Price DECIMAL(18, 2) NOT NULL,
  Rating INT NOT NULL,
  PRIMARY KEY(ID)
);
CREATE TABLE OrderItems (
  OrderID INT NOT NULL,
  ItemID INT NOT NULL,
  PRIMARY KEY(OrderID, ItemID),
  FOREIGN KEY(OrderID) REFERENCES Orders(ID),
  FOREIGN KEY(ItemID) REFERENCES Items(ID)
);

We have three tables in our extremely simplified Orders database. Obviously we are omitting a lot of information usually present in these tables, but we’re keeping things clean for demonstrative purposes.

The Orders table, housing the date the Order was placed
The Items table, which has a Name, Price, and Rating
The OrderItems table, keeping track of what Items are in each Order

Sample Data

Now let’s create some Items and Orders!

INSERT INTO Items (Name, Price, Rating) VALUES
('Foo', 10.99, 3),
('Bar', 50.01, 5),
('Fizz', 4.50, 5),
('Buzz', 1.99, 1);

-- Create Orders with auto-generated values
INSERT INTO Orders DEFAULT VALUES; 
INSERT INTO Orders DEFAULT VALUES;
INSERT INTO Orders DEFAULT VALUES;

INSERT INTO OrderItems (OrderID, ItemID) VALUES
(1, 1), (1, 2), (1, 4), -- Foo, Bar, Buzz (Distinct ratings)
(2, 2), -- Bar (Single item)
(3, 1), (3, 2), (3, 3); -- For, Bar, Fizz (Rating tie)

Here we’ve created the Items Foo, Bar, Fizz, and Buzz. Both Bar and Fizz have a rating of 5, Foo has 3, and Buzz has 1. We’ve added three Orders (which in a fresh database, will create IDs 1, 2, and 3). Finally, we populated these created Orders with various items, annotated above.

All tables are hydrated with sample Order data, with three Orders created.

Let’s Group Some Data!

Let’s get the highest-rated Item in each Order in our database. In a real system, this may be limited to a given user, date range, or other factors. For simplicity, we’ll use our entire dataset.

First Attempt: One By One

To get the highest-rated Item per Order, we must get groups of them, then pick the highest one. We obviously don’t want to pull every OrderItem record into memory, so we only select the highest item. To accomplish this in this first attempt we must run a query for each Order.

DECLARE @OrderID INT = 1; -- Repeat for each existing Order.

SELECT TOP 1 i.Name
FROM Items i
INNER JOIN OrderItems oi ON i.ID = oi.ItemID
WHERE oi.OrderID = @OrderID -- Query must be run for each Order, passing in 1, 2, and 3 separately
ORDER BY i.Rating DESC

This gets us what we want: The name of the highest-rated Item in the Order we specify.

The highest-rated Item for Order 1 is the “Bar” Item.

Naturally this is inefficient, and we’d probably just say “pull all the data back and group/sort it in memory.”

No need! SQL gives us the tools to partition data within our queries.

Second Attempt: Partitioning

Instead of having to either pull all the data back and manually group/sort, or having to send N queries to the database, we can rely on the PARTITION BY operation in SQL. With this in conjunction with either the RANK() or ROW_NUMBER() functions, we can now effectively sort then filter by sub-groups within a single query.

SELECT * FROM
(
  SELECT 
    oi.OrderID,
    i.Name,
    RANK() OVER (PARTITION BY oi.OrderID ORDER BY i.Rating DESC) 'Rank',
    ROW_NUMBER() OVER (PARTITION BY oi.OrderID ORDER BY i.Rating DESC) 'Row'
  FROM OrderItems oi
  INNER JOIN Items i ON oi.ItemID = i.ID
) RankedItemsInOrders
WHERE Row = 1

Using partitioning abilities within SQL, we can get all the highest-rated Items for all Orders at once.

The inner query builds out rankings for the Items per Order. The outer query’s WHERE clause then limits the results to just fetching the highest-ranked item, thus giving us the highest-rated item per Order. The secret sauce is within the PARTITION BY portion, where we dictate how the partition or ranking is done via the ORDER BY portion. In this case, we’re ranking by i.Rating DESC, and the ranks are divided by (one might say partitioned by) oi.OrderID.

To filter or otherwise use Rank or Row, you have to first create the projection, then select against that projection. This is why we have to wrap the partitioning query around SELECT * FROM and then add the WHERE after the fact.

You only need to use RANK() OR ROW_NUMBER() to accomplish this. If you use Rank you would replace WHERE Row = 1 with WHERE Rank = 1. Both are selected here so we can find out how these two functions differ.

RANK() vs. ROW_NUMBER()

If we remove WHERE Row = 1, we can see the full results of the inner query, which will also reveal the difference between the two functions. Let’s also select each Item’s Rating, because that’s what we’re ranking Items with.

SELECT * FROM
(
  SELECT 
    oi.OrderID,
    i.Name,
    i.Rating,
    RANK() OVER (PARTITION BY oi.OrderID ORDER BY i.Rating DESC) 'Rank',
    ROW_NUMBER() OVER (PARTITION BY oi.OrderID ORDER BY i.Rating DESC) 'Row'
  FROM OrderItems oi
  INNER JOIN Items i ON oi.ItemID = i.ID
) RankedItemsInOrders

Removing the outer filter reveals the full list of ranked Items

For OrderID 1, things are simple: The Items are rated 5, 3, and 1, so both Rank and Row are 1, 2, and 3 respectively. OrderID 2 has a single Item in the whole Order, so there’s nothing notable here.

Take a look at OrderID 3’s Items, however. The Items Bar and Fizz both have a rating of 5. It’s a tie! Here’s where we see the difference between RANK() and ROW_NUMBER().

RANK() will give us the accurate ranking of each item based on the partitioned criteria. These two items are tied, therefore they both receive a ranking of 1.
ROW_NUMBER() will give us a sequential ordering regardless of ties, and the tiebreaking it does is unstable, meaning you can potentially get different ROW_NUMBER() results for the same tied Items for each query run.

In our original query, this means WHERE Row = 1 will sometimes return Bar, sometimes return Fizz, for OrderID 3. This also means WHERE Rank = 1 will return both Bar and Fizz (which one comes first is still unstable).

So, Which is Better?

It Depends™! Do you only want a single “top item” and don’t care which one is returned? Do you want to get every top item, even if it means multiple per set? Depending on what your goals are, you can now satisfy either need, using either RANK() or ROW_NUMBER() in concert with PARTITION BY! The important thing is you are now keeping this operation where it belongs: Inside your database.

SQL Server Data Partitioning - Groups Within Groups

The Problem to Solve

Database Setup

A Simple Schema

Sample Data

Let’s Group Some Data!

First Attempt: One By One

Second Attempt: Partitioning

RANK() vs. ROW_NUMBER()

So, Which is Better?

Authors

Categories

Series

The Problem to Solve

Database Setup

A Simple Schema

Sample Data

Let’s Group Some Data!

First Attempt: One By One

Second Attempt: Partitioning

RANK() vs. ROW_NUMBER()

So, Which is Better?

Get these blog posts delivered to your email!

Authors

Categories

Series