How to list all duplicated rows which may include NULL columns?

Posted by Yousui on Stack Overflow See other posts from Stack Overflow or by Yousui
Published on 2010-06-10T05:55:39Z Indexed on 2010/06/10 6:03 UTC
Read the original article Hit count: 232

Hi guys,

I have a problem of listing duplicated rows that include NULL columns. Lemme show my problem first.

USE [tempdb];
GO

IF OBJECT_ID(N'dbo.t') IS NOT NULL
BEGIN
    DROP TABLE dbo.t
END
GO

CREATE TABLE dbo.t
(
    a NVARCHAR(8),
    b NVARCHAR(8)
);
GO

INSERT t VALUES ('a', 'b');
INSERT t VALUES ('a', 'b');
INSERT t VALUES ('a', 'b');
INSERT t VALUES ('c', 'd');
INSERT t VALUES ('c', 'd');
INSERT t VALUES ('c', 'd');
INSERT t VALUES ('c', 'd');
INSERT t VALUES ('e', NULL);
INSERT t VALUES (NULL, NULL);
INSERT t VALUES (NULL, NULL);
INSERT t VALUES (NULL, NULL);
INSERT t VALUES (NULL, NULL);
GO

Now I want to show all rows that have other rows duplicated with them, I use the following query.

SELECT a, b
FROM   dbo.t
GROUP
    BY a, b
HAVING count(*) > 1

which will give us the result:

a        b
-------- --------
NULL     NULL
a        b
c        d

Now if I want to list all rows that make contribution to duplication, I use this query:

WITH
duplicate (a, b) AS
(
    SELECT a, b
    FROM   dbo.t
    GROUP
        BY a, b
    HAVING count(*) > 1
)
SELECT dbo.t.a, dbo.t.b
FROM   dbo.t
       INNER JOIN duplicate
           ON (dbo.t.a = duplicate.a
           AND dbo.t.b = duplicate.b)

Which will give me the result:

a        b
-------- --------
a        b
a        b
a        b
c        d
c        d
c        d
c        d

As you can see, all rows include NULLs are filtered. The reason I thought is that I use equal sign to test the condition(dbo.t.a = duplicate.a AND dbo.t.b = duplicate.b), and NULLs cannot be compared use equal sign. So, in order to include rows that include NULLs in it in the last result, I have change the aforementioned query to

WITH
duplicate (a, b) AS
(
    SELECT a, b
    FROM   dbo.t
    GROUP
        BY a, b
    HAVING count(*) > 1
)
SELECT dbo.t.a, dbo.t.b
FROM   dbo.t
       INNER JOIN duplicate
           ON (dbo.t.a = duplicate.a
               AND dbo.t.b = duplicate.b)
           OR
           (dbo.t.a IS NULL
               AND duplicate.a IS NULL
               AND dbo.t.b = duplicate.b)
           OR
           (dbo.t.b IS NULL
               AND duplicate.b IS NULL
               AND dbo.t.a = duplicate.a)
           OR
           (dbo.t.a IS NULL
               AND duplicate.a IS NULL
               AND dbo.t.b IS NULL
               AND duplicate.b IS NULL)

And this query will give me the answer as I wanted:

a        b
-------- --------
NULL     NULL
NULL     NULL
NULL     NULL
NULL     NULL
a        b
a        b
a        b
c        d
c        d
c        d
c        d

Now my question is, as you can see, this query just include two columns, in order to include NULLs in the last result, you have to use many condition testing statements in the query. As the column number increasing, the condition testing statements you need in your query is increasing astonishingly. How can I solve this problem?

Great thanks.

© Stack Overflow or respective owner

Related posts about sql-server

Related posts about sql-server-2005