Cartesian Explosion in Entity Framework

Entity Framework is an Object-Relational Mapping (ORM) framework that allows developers to interact with databases using C# or other .NET languages. It allows developers to create complex data models, however, when working with these models, there is a chance of facing what is known as the „Cartesian Explosion“. When loading related entities using Joins and one-to-many relations, the amount of data returned could increase rapidly, leading to the „Cartesian Explosion“. Especially with big data sets the data returned „explodes“.

For example, consider a data model that has entities for a Blogwith multiple Posts. Just like in the Entity Framework documentation, the following code shows two models with one-to-many relationship between them:

public class Blog
{
    public int BlogId { get; set; }
    public string Url { get; set; }

    public List<Post> Posts { get; set; }
}

public class Post
{
    public int PostId { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }

    public int BlogId { get; set; }
    public Blog Blog { get; set; }
}

If a Blog has 4 Posts, the following query will return redundant data for the Blog:

select b.*, p.*
from dbo.Blogs b
left join dbo.Posts p on p.BlogId = b.Id;

The more data is loaded, using one-to-many relations, the more the amount of duplicate data starts to grow. In Entity Framework this quickly gets a problem when using Include and ThenInclude to join related entities with a LINQ query…

var posts = await DbContext.Blogs
    .Include(p => p.Posts)
    .Include(p => p.Contributors)
    .Where(p => p.Id == BlogId)
    .ToListAsync();

Now, if we include Contributorsfor Blogs the query will return 12 rows and even more duplicate data.

The columns of the Blog and Post are returned multiple times. The goal would be to narrow it down to 3 + 4 = 7 rows instead of 3 * 4 = 12 rows. If we add more Includes the result set (the cartesian product) explodes.

The explosion happens only íf the relations are on the same level. If, for example,Contributorsare part of Posts the explosion only happens for thePosts.

But it can get even worse….

The performance problem adds up, because EF internally appends an order byand this causes further performance problems on such large amounts of data. The following SQL statement is generated by EF:

SELECT *
FROM [Blogs] AS [b]
LEFT JOIN [Posts] AS [p] ON [b].[Id] = [p].[BlogId]
LEFT JOIN [Contributors] AS [c] ON [b].[Id] = [c].[BlogId]
ORDER BY [b].[Id], [p].[PostId]

This leads to serious performance issues. There are several approaches to prevent this problemes.

Prevent Data Duplication

If the Blogs table has a big column like long texts or binary data it is returned multiple times. When querying data one should always lay focus on the minimum of data needed.

Simply said: „If you don’t need columns don’t query for it.“

Separate the queries

One solution is to load the data in separate queries. You can load the related entities manually by using Load()

var books = await DbContext.Blogs
    .Where(p => p.Id == BlogId)
    .ToListAsync();

DbContext.Posts.Where(p => p.Id == bookId).Load();
DbContext.Contributors.Where(p => p.Id == bookId).Load();

Using Split Query

Entity Framework allows you to specify that a query can be split in multiple SQL statements, using. AsSplitQuery() for a single LINQ query

var posts = await DbContext.Blogs
    .Include(p => p.Posts)
    .Include(p => p.Contributors)
    .Where(p => p.Id == BlogId)
    .AsSplitQuery()
    .ToListAsync();

or to enable Split Query globally as shown below

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder
        .UseSqlServer(
            @"Server=(localdb)\mssqllocaldb;Database=EFQuerying;Trusted_Connection=True",
           o => o.UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery));
}

When split queries are configured as the default, it’s still possible to configure specific queries to execute as single queries:

using (var context = new SplitQueriesBloggingContext())
{
    var blogs = context.Blogs
        .Include(blog => blog.Posts)
        .AsSingleQuery()
        .ToList();
}

Using split query the query of above results in the following three SQL statements

SELECT [b].[Id], [b].[Rating], [b].[Url]
FROM [Blogs] AS [b]
ORDER BY [b].[Id]

SELECT [p].[PostId], [p].[BlogId], [p].[Content], [p].[Title], [b].[Id]
FROM [Blogs] AS [b]
INNER JOIN [Posts] AS [p] ON [b].[Id] = [p].[BlogId]
ORDER BY [b].[Id]

SELECT [c].[ContributorId], [c].[BlogId], [c].[Title], [b].[Id]
FROM [Blogs] AS [b]
INNER JOIN [Contributors] AS [c] ON [b].[Id] = [c].[BlogId]
ORDER BY [b].[Id]

In conclusion, split queries are great and do avoid the Cartesian Explosion. Next to all the performance improvments, splitting the query can result in negative side effects.