Using Grouping instead of Distinct in Entity Framework to Optimize Performance

On a number of pages on this web site I display a list of articles. For example, the home page shows a list of all articles from all categories. The FAQs page shows articles from the FAQs category only. Internally, the content items in the database (sorted in the Content table) are linked through a junction table to the Roles table. This enables me to publish content targeting specific roles. Anonymous users see content for the Anonymous role only, while, for example, I as the site administrator can see content that is assigned to the Administrators role only. I use this in order to add content to my site which is not public yet.

Using Distinct Can Kill Performance

To query the content I use a JOIN between the content and the roles, like this:

var result = from c in _repositoryWrapper.ContentItems.GetActiveForUser(roles)
             from r in c.Roles
             where roles.Contains(r.Description) || r.Description == "Visitors"
             select c;

This works fine, except when a content item is assigned to multiple roles. This happens for example when an item is assigned to the Administrators role and Visitors roles at the same time, which is the case when I put an article live. When this happens, the article is listed twice. This leads to a few problems. First of all, the Count is off, returning more items that really exist in the database which screws up the paging system. Secondly, it would display the items multiple times in the content list.

My first take at solving this problem was with the Distinct method which is an extension on IEnumerable. The good thing and the bad thing with the parameterless version of Distinct() is that it carries out the distinct operation at the database level. That's a good thing as it improves performance, minimizing the number of records that are being brought into the application. It's a bad thing for me in this case as SQL Server doesn't support distinct on text or next columns, which happens to be the datatype for some of my columns (the Summary and Body of a ContentItem to be precise). If you try it, you get an error such as the following:

{System.Data.SqlClient.SqlException (0x80131904): The text data type cannot be selected as DISTINCT because it is not comparable.

My next attempt was using an overload of Distinct that accepts an IEqualityComparer<T>. Creating the comparer class is pretty simple:

public class ContentComparer : IEqualityComparer<Content>
{
  public bool Equals(Content x, Content y)
  {
    return x.Id == y.Id;
  }

  public int GetHashCode(Content obj)
  {
    return obj.Id.GetHashCode();
  }
}

Simple and effective. Or so I thought. I used the comparer as follows:

var result = (from c in _repositoryWrapper.ContentItems.GetActiveForUser(roles)
              from r in c.Roles
              where roles.Contains(r.Description) || r.Description == "Visitors"
              select c).Distinct(new ContentComparer());

However, when you run this, you're greeted with the following exception:

"LINQ to Entities does not recognize the method 'System.Linq.IQueryable`1[Spaanjaars.Imar.Model.Content] Distinct[Content](System.Linq.IQueryable`1[Spaanjaars.Imar.Model.Content], System.Collections.Generic.IEqualityComparer`1[Spaanjaars.Imar.Model.Content..."

Ouch. Simple: yes. Effective: no. Distinct with the comparer can't be translated to SQL by EF. The fix seems easy. Simply call ToList on the EF result, and call Distinct on the returned list:

var result = (from c in _repositoryWrapper.ContentItems.GetActiveForUser(roles)
              from r in c.Roles
              where roles.Contains(r.Description) || r.Description == "Visitors"
              select c).ToList().Distinct(new ContentComparer());

When you now run this code, you get a nice, unique list of content items in the database for the specified roles.

However, this code has introduced one major problem: it has moved the Distinct logic from the database to .NET. In order to successfully create a distinct list of all records, EF has to pull all records from the database and create the distinct list within your application. This is not that bad when you need to get all content anyway and there aren't that many duplicates, but it will surely kill performance when you have lots of records and want to query just a few; using Skip() and Take() for example. When I added Take(10) to my query:

var result = (from c in _repositoryWrapper.ContentItems.GetActiveForUser(roles)
              from r in c.Roles
              where roles.Contains(r.Description) || r.Description == "Visitors"
              select c).ToList().Distinct(new ContentComparer()).Take(10);

I got exactly the 10 items I was expecting. However an inspection of the SQL code that gets executed using the SQL Server Profiler revealed that I am still pulling all records from the database; hundreds in my case. Clearly, this is undesired behavior. With just a few hundred records, this isn't too bad (but still completely unnecessary) but it will bring down your application if you're querying tables with thousands or millions of records this way.

The fix is pretty simple though: use grouping. Grouping is another LINQ and EF supported concept and fully translates to SQL. In my case, it was as easy as this:

var distinctResult = from c in result
             group c by c.Id into uniqueIds
             select uniqueIds.FirstOrDefault();

This groups each content item in a unique group for the ID of the content item. The FirstOrDefault method then ensures I am only getting a unique record per ID. The result variable is the result from the very first LINQ query with the JOIN from the beginning of this article.

The cool thing is that the result variable is still completely composable and results in proper and optimized SQL being sent to the database. With this code, getting only 10 records using Take(10) now only retrieves those 10 items from the database.

Thanks to Andreas Niedermair and Jon Skeet for originally bringing up and answering this question on stackoverflow.com.


Where to Next?

Wonder where to go next? You can read existing comments below or you can post a comment yourself on this article .


Consider making a donation
Please consider making a donation using PayPal. Your donation helps me to pay the bills so I can keep running Imar.Spaanjaars.Com, providing fresh content as often as possible.



Feedback by Other Visitors of Imar.Spaanjaars.Com

On Wednesday, January 19, 2011 5:36:04 AM Michael said:
Hello Imar ,

Good stuff. I have used your N-Layered design on some solutions in the past and I bought your 3.5 ASP.NET N-Layered pdf.

However, how do you see the Entity Framework fit into the N-Layered Design?

Thanks.
On Wednesday, January 19, 2011 7:50:16 AM Imar Spaanjaars said:
Hi Michael,

EF certainly has its use in N-Layer design. One of the new compelling options is EF COde First, where you define your business objects (as POCO classes) and let EF handle database interaction for you. This is something I am working on for the N-Layer Series vNext.

Cheers,

Imar
On Wednesday, February 09, 2011 3:04:40 PM Jonathan said:
How would this be affected if you placed the take(10) before the tolist method?
On Wednesday, February 09, 2011 4:13:14 PM Imar Spaanjaars said:
Hi Jonathan,

This woulsd take 10 records from the non-distinctresult set. Then after you distinct them, you may end up with fewer than 10 records.

Cheers,

Imar
On Tuesday, February 15, 2011 3:47:41 PM Mike said:
Tried this technique using:
- queryA (non join using "include" to include navigation property)
- queryB referenced queryA and group into uniqueIds
  and select uniqueIds.FirstOrDefault()

And received this error when trying to access the attributes in queryA.

"The ObjectContext instance has been disposed and can no longer be used for operations that require a connection."

When I debug and review returned results from list i see the above statement when I try and review the Navigation Property.

Any suggestions?

thanks in advance.
On Tuesday, February 15, 2011 6:05:23 PM Imar Spaanjaars said:
Hi Mike,

You typically get this error when you return an object from a function with a composable query that hasn't executed yet. Normally, calling something like ToList forces the query to execute, avoiding this issue. You can also have the same issue if you access the object outside a using block that defines the object context.

Is that the case in your code?

Imar
On Tuesday, February 15, 2011 6:41:51 PM Mike said:
First example simple query to include navigation property:

var queryA = from c in ctx.orders.include(order_detail) select c

Will return attibutes from orders and order_detail per order record.

I then apply your distinct syntax above to QueryB as follows:

var queryB = from c in queryA group c by c.CustomerID into uniqueIds
select uniqueIds.FirstOrDefault();

followed by:

var myQuery = queryB.ToList();

The order_detail has no values and when I  try and access a property from order_detail the message "The ObjectContext instance has been disposed and can no longer be used for operations that require a connection."

Thx.
Mike
On Tuesday, February 15, 2011 6:56:16 PM Imar Spaanjaars said:
Hi Mike,

Hard to tell from this code alone; order_detail to me looks like a string to determine what to include; not an actual object.

Try posting this on a forum such as the Wrox forum, and be sure to supply a lot more detail and code.

Cheers,

Imar
On Tuesday, November 01, 2011 5:24:36 AM vaseem said:
I was using ToList() and then distinct comparer, but that really hurt the performance and when i switch to the below code it was running better than before.

var Coupons = Entities.PS_Coupons.Where(p => p.PS_Merchants.IsActive == true).Select(p=>p.ValidAtCityID).Distinct();

Where ValidAtCityID is a nvarchar(max) not ntext, and i was able to run distinct on database level.

EF Output.

SELECT
[Distinct1].[ValidAtCityID] AS [ValidAtCityID]
FROM ( SELECT DISTINCT
[Extent1].[ValidAtCityID] AS [ValidAtCityID]
FROM  [dbo].[PS_Coupons] AS [Extent1]
INNER JOIN [dbo].[PS_Merchants] AS [Extent2] ON [Extent1].[MerchantID] = [Extent2].[ID]
WHERE 1 = [Extent2].[IsActive]
)  AS [Distinct1]
On Saturday, April 19, 2014 4:41:35 AM sophia said:
Hello Imar,

If you use Take(10) after ToList() then is it not also performance killer which will return first thousands of distinct records first from db to .net and then just select top 10 records.

However you are also right in your case that if we use Take() before ToList() in case of Grouping query then we may have fewer records then expected.
But please think once about what I have said above (select only 10 out of thousands records at .net level and not the DB level).

Please suggest/guide a more performance booster solution.

Thanks for sharing this article, really helpful :)
On Saturday, April 19, 2014 10:43:05 AM Imar Spaanjaars said:
Hi sophia,

Not sure I understand what you're asking. Isn't that exactly what this article is about?

Imar

Talk Back! Comment on Imar.Spaanjaars.Com

I am interested in what you have to say about this article. Feel free to post any comments, remarks or questions you may have about this article. The Talk Back feature is not meant for technical questions that are not directly related to this article. So, a post like "Hey, can you tell me how I can upload files to a MySQL database in PHP?" is likely to be removed. Also spam and unrealistic job offers will be deleted immediately.

When you post a comment, you have to provide your name and the comment. Your e-mail address is optional and you only need to provide it if you want me to contact you. It will not be displayed along with your comment. I got sick and tired of the comment spam I was receiving, so I have protected this page with a simple calculation exercise. This means that if you want to leave a comment, you'll need to complete the calculation before you hit the Post Comment button.

If you want to object to a comment made by another visitor, be sure to contact me and I'll look into it ASAP. Don't forget to mention the page link, or the QuickDocId of the document.

For more information about the Talk Back feature, check out this news item.