Bulk Delete

Maybe it looks like I am having a personal vendetta against OR mappers but this is not the case. In the last week I have seen a otherwise “good” applications performance go overboard because of a misuse of an OR mapper. So I just want to share my findings and maybe save someone, somewhere, some time.

My poison today was the process of deleting lot of records from a simple table. A simple task if you care to get your fingers dirty in SQL, regardless of the flavor.

Let’s say that you have a table of users and you developed a strong hate for the letter ‘H’ (or any other letter in the alphabet). So you woke up with a burning hate of the letter ‘H’ and you can not but delete all users which name starts with the letter ‘H’.

All you need to do is to execute this simple query:

DELETE FROM Users
WHERE UserName LIKE 'H%'

And done…

I agree that this scenario is a little…simple, but even if the structure of the database schema would be a little more complicated you could still do it with very few queries.

Now lets take a look at how this happens with a OR mapper.

With an OR mapper we first have to get a hold of the users we want to delete. The easiest (and most common) way to do this is to load them into memory. That meas that a sql query will be fired to read all users that we are going to delete.

SELECT * FROM Users WHERE UserName LIKE 'H%'

Note: Please take into consideration that this query does not happen if the users are already loaded into memory.

So now you have them in memory. Now you can call Delete on every one of them. Te default behavior of any OR mapper is to fire of individual delete statement, one for each user you want to delete.

DELETE FROM Users WHERE UserId = 1
DELETE FROM Users WHERE UserId = 2
...
DELETE FROM Users WHERE UserId = 5000

There is no arguing that this is less efficient that doing all of it in one big delete statement. The situation gets even worse if there are triggers or a cascade delete 😦

But before you start fainting stop think. I found a solution, sadly only for Entity Framework. For the rest I just bypass the OR mapper and use the underlying connection to directly fire SQL statements (how this is done is specific with the OR mapper used).

This breaks the “database agnostic” advantage that OR mapper give you. But to be honest   I think that this advantage is a myth, so I lose nothing here. And besides in all the SQL flavors the vanilla delete statement is the same.

So this is another problem I have with OR mappers and my solution to the same problem. Please keep in mind that this becomes a problem with massive data quantities. What “massive” means for your application is a question that can only be individually answered.

But before I go I want to share my latest story. There was an operation in an application I work on that fired 80k to 100k requests of those 80% to 90% were delete statement. Before I started the operation took 50 minutes and would often fail because of timeouts. After the “optimization” the same operation working with the same data finished in 4 to 6 seconds.

User.Deleted

Interesting title, is it not?

But let me explain. I have noticed, lately, that people have stopped using their brains and have given their data storage over to heuristics. What I mean with that? I mean that more and more people are using their OR mapper to generate their DB schema. And I think that this is not the greatest idea in the world. Maybe this is just me but I hope not.

Point in case is the title.

In the databases “Users” table there may be a column that is labeled Deleted. What this column is doing is telling you that the users is actually deleted. You might be asking yourself that if the user is deleted then why just not delete the user record. Well there are many reasons but the top one is data-integrity. There are ways to work around this but if not pressed by the law we tend to not do it.
For those still in the dark here is a quick recap of why this is done. When the user entry is created we tend to bind other data in the database to that user. When the user then decides to leave our system we tend to keep some of the data he has provided or data that accumulated as a result of him using the system. To keep the database integrity existent we eighter have to reassign all the data to another “bogus” user or just mark the user as nonexistent. We obviously go for the later 🙂

So the Users.Deleted makes sense in the context of a database but how about in the context of your domain model? In the data domain this makes little sense because if the user is deleted (non-existed for the application) the user should not be loaded into the data domain.

This is just one small example where the data domain and the database schema are not equal. There are countless others and their number will increase with the complexity of the application.

I love nHibernate but…

There are just to many assemblies to include to get it running. In my little fantasy idealistic world the main features of nHibernate would be contained in one assembly that would not clog up the references if the project and make it dependent upon to many things.

Before you will burn me on the stake for taking a swing on your favorite OR-mapper hear me out. In the old days where people/software were happy if they could replace their 50+ lines of ADO.NET code with 5 XML files nHibernate was the bomb. All that you needed was one assembly and you were done. Today the situation is somewhat different. I personally expect an OR mapper to deliver:

  • Basic OR mapping functionality
  • Fluent configuration
  • A Linq provider
The situation now is that each of this is a separate assembly. And I do not like this. This means that I must have three assemblies where only one should be, making my projects “bloated” and my bin folder looking like a dump.
I do not see a reason why this can not be included in the main distribution of nHibernate and my hopes are high that this will be the case in the near future.