firebird-support - Re: Questions about SELECT DISTINCT *

Subject	Re: Questions about SELECT DISTINCT *
Author	Svein Erling Tysvær
Post date	2006-03-09T07:32:18Z

--- In firebird-support@yahoogroups.com, "robert_difalco" wrote:

> Thanks Svein, then finally. Do you prefer these bottom two
> variations over the original DISTINCT query?

Well, Robert, it depends on the purpose. I never use IN (<subselect>),
but the other two alternatives each have some interesting aspects
associated with them.

SELECT DISTINCT ... JOIN ... can choose a very good plan for returning
the entire result set. It can freely decide to do something like
PLAN SORT(JOIN(JOIN(
P INDEX(IDX_ANCESTORID), L INDEX(IDX_PARENTID)), G INDEX(IDX_OID))
(there will be errors in this, I haven't looked at a real plan this week)

At the same time, it can take time to return the first few rows, and
it is not naturally updateable if you use IBO (although I think it can
be set to be updateable).

SELECT ... WHERE EXISTS ... clearly shows that you're interested in
fields belonging to G and nothing else (without having to look through
all fields in the select list to ascertain P or L not being
referenced). The first few records could also appear quicker, and the
user issuing the query is less likely to believe the program has
stopped responding. If there is lots of matches, then I would assume
it to be considerably faster as well. E.g. if you wanted to know the
name of your employees in China that lived in a village or town where
one or more people had black hair, then

SELECT e.name FROM employee e
WHERE EXISTS(SELECT * FROM Chinas_Population p where e.village =
p.village AND p.hair_colour = 'Black')

is far better than

SELECT DISTINCT e.name FROM employee e
JOIN Chinas_Population p on e.village = p.village
WHERE p.hair_colour = 'Black'

Back to your case: Using EXISTS does require G to be the first table
in the plan (well, not neccessarily in how it is written, I think
subselects normally appear above the main select), and restricts the
optimizer quite a bit.

Practically speaking, when I'm interested in limiting the result set
depending on tables that I don't want output fields from, I normally
go for [NOT] EXISTS, but if the main select is a large table and I
have no other limiting criteria in the WHERE clause, then using SELECT
DISTINCT ... JOIN may be that much quicker that I prefer using this.
Though I think it is fair to say that I normally choose one without
giving the other option too much of a thought, possibly because I
write more ad-hoc queries than queries to be executed thousands of times.

Set

> > > SELECT DISTINCT G.oid, G.name, G.description, G.type [, ...]
> > > FROM Groupable G
> > > JOIN Link ON G.oid = Link.childId
> > > JOIN Path ON Link.parentId = Path.descendantId
> > > WHERE (G.type = LEAF AND Path.ancestorId = 12345)
> > >
> > SELECT G.oid, G.name, G.description, G.type [, ...]
> > FROM Groupable G
> > WHERE G.type = LEAF AND
> > G.oid IN (
> > SELECT L.childId
> > FROM Link L
> > JOIN Path P ON L.parentId = P.descendantId
> > WHERE P.ancestorId = 12345)
> >
> >
> > SELECT G.oid, G.name, G.description, G.type [, ...]
> > FROM Groupable G
> > WHERE G.type = LEAF AND
> > EXISTS(SELECT *
> > FROM Link L
> > JOIN Path P ON L.parentId = P.descendantId
> > WHERE P.ancestorId = 12345
> > AND G.oid = L.childId)