Subject [firebird-support] Re: performance of subselect with group by
Author Svein Erling Tysvær
>Hello,
>
> thanks Set, makes my test unnecessary!
>
> But I'm thinking a little bit about my original problem.
>
> Would it make sense to add a tracker entry for optimization of
> subselects without reference to outer query?
>
> I think that they should get evaluated and transformed to something
> like a "or" connected list of simple compares.
>
> So
> delete from test where Id in (
> select min(t.Id) FROM test t
> group by t.reference, t.key
> having count(*) > 1
> )
>
> could be transformed by the engine to
>
> delete from test where Id = 5 or Id = 6 or Id = 89
>
> in case of
>
> select min(t.Id) FROM test t
> group by t.reference, t.key
> having count(*) > 1
>
> would return 5,6,89
>
> I've found several similar questions in my mail archive so I think
> I'm not the only one with such a problem.
>
> A common use case would be to remove duplicate entries for
> example to create a unique index.
>
> Any comments?

My guess is that it is too early to add this to Firebird. Sure, there are situations where a non-correlated subselect would be benefitial, but how would Firebird know when? In your case (since you're deleting from the same table as you're selecting from), I agree that this could be simple. However, normally the select would be against a different table and then I think things are not as simple. Assuming we were talking about two different tables: I don't know how Firebird translates a query like yours into EXISTS, but theoretically it could be something like:

delete from DeleteFromTable t
where exists(select * from test t2
left join test t3
on t2.reference = t3.reference
and t2.key = t3.key
and t2.id > t3.id
where t.id = t2.id
and t3.id is null)

If DeleteFromTable was a smallish table, and test was large and sensible indexes were in place, then my rewritten query would be quicker if your 'min subquery' returned a lot of rows. On the other hand, if your subquery returned few rows, then executing the subselect like you suggest would indeed be quicker. If you add a WHERE clause to your subselect, it is even thinkable that, say, WHERE MyField = 1 would be quicker using your solution, whereas WHERE MyField = 2 would be quicker using mine. And the only way for Firebird to know, would be to have histograms. Histograms was removed from things to be implemented in version 3 (I think it was announced during the conference, at least I'm certain that I've heard this), I hope they will appear in Firebird 4. I would think that once histograms are in place, cases like yours are amongst those that can be considered for optimization (although I don't know anything about the source of Firebird).

You are definitely not the only one with this kind of problem, I would appreciate it myself if it was possible for deletes and updates to use indexes similar to what is possible with selects (e.g. allowing for JOIN). However, my knowledge of SQL and Firebird is not good enough to formulate how to do this in a sensible way. Also, I think that as users of Firebird it is possible to rewrite our slow queries to more performant ones in most cases, so to me this problem is not all too big.

I have never needed expression indexes myself, but have you looked into writing one to see if this can make your query execute (considerably) faster?

Sorry for not agreeing with you in that this is a useful optimization just yet, if you're lucky people knowing something about the source code of Firebird will disagree with me.

Set