Subject More on Graphs, Trees, and SELECT DISTINCT
Author robert_difalco
Consider the query we have now:

>
> SELECT G.oid, G.name, G.description, G.type [, ...]
> FROM Groupable G
> WHERE G.type = LEAF AND
> G.oid IN (
> SELECT L.childId
> FROM Link L
> JOIN Path P ON L.parentId = P.descendantId
> WHERE P.ancestorId = 12345)
>

Let's assume that this represents something like a logical network
hierarchy using groups and nodes (i.e. hosts). Groupable is just a
normalized table that acts as a base for both groups and nodes.
Furthermore, a groupable (i.e. a node or group) can be a child of any
other group. So, you might use this to arrange your hosts by region,
application, or platform.

Now lets say that a node is a container for elements (e.g. files).
Elements are always unique to the node they belong to. Furthermore,
each element has a version history. So there is a one to many
relationship between nodes and element and between elements and
versions. Finally, in the element table I have a denormalization where
one of the element columns indicates the "current version".

Ok. So if, for example, I want to collect all of the current versions
for all nodes that descend from 12345, I currently write something
like this:

SELECT DISTINCT <Version fields>
FROM Version V
JOIN Element E ON V.oid = E.currentVersionId
JOIN Groupable G ON E.nodeId = G.oid
JOIN Link L ON G.oid = L.childId
JOIN Path P ON L.parentId = P.decsendantId
WHERE P.ancestorId = 12345 AND G.type = LEAF;

Actually, instead of the Groupable table I use the No

de table. The Node table is just a subset of the Groupable table that
contains only leafs along with some extra fields that a Node has over
and on top of a Groupable. So the query really looks like this:

SELECT DISTINCT <Version fields>
FROM Version V
JOIN Element E ON V.oid = E.currentVersionId
JOIN Node N ON E.nodeId = N.oid
JOIN Link L ON G.oid = L.childId
JOIN Path P ON L.parentId = P.decsendantId
WHERE P.ancestorId = 12345; // no longer need: AND G.type = LEAF;

So given the advice you gave me for the last distinct query, how would
you re-write the above query to get rid of the distinct? (or how would
you re-write it just in general). The distinct can also be very slow
on this query because there are many records and the Version table has
many fields including long comments and such.

Is there a solution better than this?

SELECT <Version fields>
FROM Version V
WHERE V.oid IN (
SELECT E.currentVersionID
FROM Element E
JOIN Node N ON E.nodeId = N.oid
JOIN Link L ON G.oid = L.childId
JOIN Path P ON L.parentId = P.decsendantId
WHERE P.ancestorId = 12345 );

I guess for sure you would change the subquery to an EXISTS. I'm just
leaving that out right now as an optimizer chore but would probably
put the EXISTS in. But is there another overall approach I should be
taking?

TIA,

R.