Subject | Re[6]: [firebird-support] Find duplicate multi-row entries |
---|---|
Author | Daniel Miller |
Post date | 2017-09-28T09:00:44Z |
This appears no different than a simple:
select * from routes
So...not there yet. I think what I'm fighting is the lack of pivot or crosstab functionality - and I may have no choice (though it offends me deeply) but to setup my archive table for this comparison. Either via the hated structure of column names ROUTE1, ROUTE2, ROUTE8...or the only slightly less offensive VARCHAR concatenation.
A concatenated LIST seems like it would be a elegant solution - and while operating over the whole table is time-consuming a single node isn't too bad. And if I can get the history filtered and old duplicate records deleted it would be fine. But the inconsistency is killing me.
There MUST be a Firebird-friendly, normalized, compact storage structure that will work. I'm just not seeing it.
--
Daniel
On 9/28/2017 1:19:18 AM, "Svein Erling Tysvær setysvar@... [firebird-support]" <firebird-support@yahoogroups.com> wrote:Sorry again, Daniel, fourth query required (I forgot that things may vary depending on time, this is something I've never done before):with tmp(NODE, ROUTE_UPDATED) as(select distinct rh.NODE, rh.ROUTE_UPDATEDfrom route_history rhjoin routes r on rh.NODE = r.NODE), //*This join is only for performance reasons in case ROUTE_HISTORY contains a lot more records than ROUTES and may be removed*/new_routes(NODE) as(select distinct t.NODEfrom tmp twhere not exists(select *from routes rleft join route_history rh on r.NODE = rh.NODEand r.ROUTE_INDEX = rh.ROUTE_INDEXand r.LINK_NODE = rh.LINK_NODEand r.QUALITY is not distinct from rh.QUALITYand t.ROUTE_UPDATED = rh.ROUTE_UPDATEDwhere t.NODE = r.NODEand rh.NODE is null)and not exists(select *from route_history rhleft join routes r on r.NODE = rh.NODEand r.ROUTE_INDEX = rh.ROUTE_INDEXand r.LINK_NODE = rh.LINK_NODEand r.QUALITY is not distinct from rh.QUALITYwhere t.NODE = rh.NODEand t.ROUTE_UPDATED = rh.ROUTE_UPDATEDand r.NODE is null))select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITYfrom new_routes nrjoin routes r on nr.NODE = r.NODEHope I finally got it right,Set2017-09-28 10:01 GMT+02:00 Svein Erling Tysvær <setysvar@...>:Sorry, I forgot that should probably use 'IS NOT DISTINCT' and not '=' for QUALITY.Note that this checks only checks if there has ever been anything identical in ROUTE_HISTORY, it doesn't restrain itself to the latest entry only (i.e. the query I've written is slightly similar to the second query I wrote, it has to be extended if you want something slightly similar to the first query). Also, my third query doesn't consider duplicate rows (since the fields seem to be the primary key in ROUTE, it isn't theoretically possible in your example).Set2017-09-28 9:52 GMT+02:00 Svein Erling Tysvær <setysvar@...>:OK, that's completely different, but I've done something similar once before. It is not possible to directly compare sets for equality, but it is possible to check that set A doesn't contain anything that isn't in set B and that set B doesn't contain anything that isn't in set A. I assume NODE is the one common denominator:with tmp (NODE) as(select distinct NODE from routes),new_routes(NODE) as(select t.NODEfrom tmp twhere not exists(select *from routes rleft join route_history rh on r.NODE = rh.NODEand r.ROUTE_INDEX = rh.ROUTE_INDEXand r.LINK_NODE = rh.LINK_NODEand r.QUALITY = rh.QUALITYwhere t.NODE = r.NODEand rh.NODE is null)and not exists(select *from route_history rhleft join routes r on r.NODE = rh.NODEand r.ROUTE_INDEX = rh.ROUTE_INDEXand r.LINK_NODE = rh.LINK_NODEand r.QUALITY = rh.QUALITYwhere t.NODE = rh.NODEand r.NODE is null))select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITYfrom new_routes nrjoin routes r on nr.NODE = r.NODEDoes this get you the result you want?Set2017-09-28 9:19 GMT+02:00 'Daniel Miller' dmiller@... [firebird-support] <firebird-support@yahoogroups. com >:
Thanks - but that still doesn't work. I get a huge return set with a ton of duplicates. I think something's being lost in translation. A simple match of one row isn't sufficient - I need to match the "set".With the following six records:5557 1 1116 0
5557 2 1115 0
5557 3 5104 2
5557 4 5073 2
5557 5 5222 2
5557 6 5110 2That is the list of routes a given node has at a specific point in time. So it's perfectly possible at another time to have fewer, or more, or different routes. And I need to consider ALL the routes in play for a given timestamp to be a single "set". So if today I have the above list of routes, and yesterday route index 4 was looking at a different node - that would constitute a different and unique set even though the other 5 records match.5557 1 1116 0
5557 2 1115 0
5557 3 5104 2
5557 4 9999 2
5557 5 5222 2
5557 6 5110 2So it's quite possible I will have lots of duplicates for an individual row in ROUTES when compared in ROUTE_HISTORY - what I'm trying to filter is the pattern of rows. Only if ALL the rows of a current entries in ROUTES exist in ROUTE_HISTORY should it be considered a duplicate condition.--DanielOn 9/27/2017 11:53:48 PM, "Svein Erling Tysvær setysvar@... [firebird-support]" <firebird-support@yahoogroups. com > wrote:Sure it is possible to write such a query:select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITYfrom routes rleft join route_history rh1 on r.node = rh1.nodeand r.route_index = rh1.route_indexand r.link_node = rh1.link_nodeand r.quality is not distinct from rh1.qualityleft join route_history rh2 on rh1.node = rh2.nodeand rh1.route_index = rh2.route_indexand rh1.route_updated < rh2.route_updatedwhere rh1.node is nulland rh2.node is nullI use two LEFT JOINS because I assume you want things to be returned if things are changed back, e.g. if you have5557 1 1116 0
5557 1 1116 0then you only want one row in route_history, but if you have5557 1 1116 0
5557 1 1116 15557 1 1116 0then you don't want the last record to be considered a duplicate, but know when it changed back. If ROUTE_INDEX is just a running number, you consider5557 1 1116 0
5557 2 1116 0to be duplicates, and there are no gaps between the ROUTE_INDEX for each NODE, then the query can be simplified:select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITYfrom routes rleft join route_history rh1 on r.node = rh1.nodeand r.route_index+1 = rh1.route_indexand r.link_node = rh1.link_nodeand r.quality is not distinct from rh1.qualitywhere rh1.node is nullNote that I use IS NOT DISTINCT so that NULLs are considered equal. If you prefer, the left joins can be replaced by (nested) NOT EXISTS.HTH,Set2017-09-28 7:22 GMT+02:00 'Daniel Miller' dmiller@... [firebird-support] <firebird-support@yahoogroups. com >:
On 9/27/2017 9:20:54 PM, "liviuslivius liviuslivius@... [firebird-support]" <firebird-support@yahoogroups. com > wrote:group by + havingis your friend.E.g.Select field1, field2, count(*)From tablexGroup by field1, field2Having count(*)>1But in your scenario i do not know if this is the solution. Question is if record was changed and in next update "restored" and once again changed is this duplicate or no?If yes above query is ok if no then procedure is your friend.And for the future change your audit trigger and check before if there was any changeThank you, but I this isn't quite what I need. I need to match against all the records of a set (up to 8 rows per set) - not just individual rows.--Daniel