Using Grouping instead of Distinct in Entity Framework to Optimize Performance

On a number of pages on this web site I display a list of articles. For example, the home page shows a list of all articles from all categories. The FAQs page shows articles from the FAQs category only. Internally, the content items in the database (sorted in the Content table) are linked through a junction table to the Roles table. This enables me to publish content targeting specific roles. Anonymous users see content for the Anonymous role only, while, for example, I as the site administrator can see content that is assigned to the Administrators role only. I use this in order to add content to my site which is not public yet.

Using Distinct Can Kill Performance

To query the content I use a JOIN between the content and the roles, like this:

var result = from c in _repositoryWrapper.ContentItems.GetActiveForUser(roles)
             from r in c.Roles
             where roles.Contains(r.Description) || r.Description == "Visitors"
             select c;

This works fine, except when a content item is assigned to multiple roles. This happens for example when an item is assigned to the Administrators role and Visitors roles at the same time, which is the case when I put an article live. When this happens, the article is listed twice. This leads to a few problems. First of all, the Count is off, returning more items that really exist in the database which screws up the paging system. Secondly, it would display the items multiple times in the content list.

My first take at solving this problem was with the Distinct method which is an extension on IEnumerable. The good thing and the bad thing with the parameterless version of Distinct() is that it carries out the distinct operation at the database level. That's a good thing as it improves performance, minimizing the number of records that are being brought into the application. It's a bad thing for me in this case as SQL Server doesn't support distinct on text or next columns, which happens to be the datatype for some of my columns (the Summary and Body of a ContentItem to be precise). If you try it, you get an error such as the following:

{System.Data.SqlClient.SqlException (0x80131904): The text data type cannot be selected as DISTINCT because it is not comparable.

My next attempt was using an overload of Distinct that accepts an IEqualityComparer<T>. Creating the comparer class is pretty simple:

public class ContentComparer : IEqualityComparer<Content>
{
  public bool Equals(Content x, Content y)
  {
    return x.Id == y.Id;
  }

  public int GetHashCode(Content obj)
  {
    return obj.Id.GetHashCode();
  }
}

Simple and effective. Or so I thought. I used the comparer as follows:

var result = (from c in _repositoryWrapper.ContentItems.GetActiveForUser(roles)
              from r in c.Roles
              where roles.Contains(r.Description) || r.Description == "Visitors"
              select c).Distinct(new ContentComparer());

However, when you run this, you're greeted with the following exception:

"LINQ to Entities does not recognize the method 'System.Linq.IQueryable`1[Spaanjaars.Imar.Model.Content] Distinct[Content](System.Linq.IQueryable`1[Spaanjaars.Imar.Model.Content], System.Collections.Generic.IEqualityComparer`1[Spaanjaars.Imar.Model.Content..."

Ouch. Simple: yes. Effective: no. Distinct with the comparer can't be translated to SQL by EF. The fix seems easy. Simply call ToList on the EF result, and call Distinct on the returned list:

var result = (from c in _repositoryWrapper.ContentItems.GetActiveForUser(roles)
              from r in c.Roles
              where roles.Contains(r.Description) || r.Description == "Visitors"
              select c).ToList().Distinct(new ContentComparer());

When you now run this code, you get a nice, unique list of content items in the database for the specified roles.

However, this code has introduced one major problem: it has moved the Distinct logic from the database to .NET. In order to successfully create a distinct list of all records, EF has to pull all records from the database and create the distinct list within your application. This is not that bad when you need to get all content anyway and there aren't that many duplicates, but it will surely kill performance when you have lots of records and want to query just a few; using Skip() and Take() for example. When I added Take(10) to my query:

var result = (from c in _repositoryWrapper.ContentItems.GetActiveForUser(roles)
              from r in c.Roles
              where roles.Contains(r.Description) || r.Description == "Visitors"
              select c).ToList().Distinct(new ContentComparer()).Take(10);

I got exactly the 10 items I was expecting. However an inspection of the SQL code that gets executed using the SQL Server Profiler revealed that I am still pulling all records from the database; hundreds in my case. Clearly, this is undesired behavior. With just a few hundred records, this isn't too bad (but still completely unnecessary) but it will bring down your application if you're querying tables with thousands or millions of records this way.

The fix is pretty simple though: use grouping. Grouping is another LINQ and EF supported concept and fully translates to SQL. In my case, it was as easy as this:

var distinctResult = from c in result
             group c by c.Id into uniqueIds
             select uniqueIds.FirstOrDefault();

This groups each content item in a unique group for the ID of the content item. The FirstOrDefault method then ensures I am only getting a unique record per ID. The result variable is the result from the very first LINQ query with the JOIN from the beginning of this article.

The cool thing is that the result variable is still completely composable and results in proper and optimized SQL being sent to the database. With this code, getting only 10 records using Take(10) now only retrieves those 10 items from the database.

Thanks to Andreas Niedermair and Jon Skeet for originally bringing up and answering this question on stackoverflow.com.


Where to Next?

Wonder where to go next? You can post a comment on this article.

Doc ID 546
Full URL https://imar.spaanjaars.com/546/using-grouping-instead-of-distinct-in-entity-framework-to-optimize-performance
Short cut https://imar.spaanjaars.com/546/
Written by Imar Spaanjaars
Date Posted 06/28/2010 13:10

Comments

Talk Back! Comment on Imar.Spaanjaars.Com

I am interested in what you have to say about this article. Feel free to post any comments, remarks or questions you may have about this article. The Talk Back feature is not meant for technical questions that are not directly related to this article. So, a post like "Hey, can you tell me how I can upload files to a MySQL database in PHP?" is likely to be removed. Also spam and unrealistic job offers will be deleted immediately.

When you post a comment, you have to provide your name and the comment. Your e-mail address is optional and you only need to provide it if you want me to contact you. It will not be displayed along with your comment. I got sick and tired of the comment spam I was receiving, so I have protected this page with a simple calculation exercise. This means that if you want to leave a comment, you'll need to complete the calculation before you hit the Post Comment button.

If you want to object to a comment made by another visitor, be sure to contact me and I'll look into it ASAP. Don't forget to mention the page link, or the Doc ID of the document.

(Plain text only; no HTML or code that looks like HTML or XML. In other words, don't use < and >. Also no links allowed.