Are you seeing duplicate records in your data? Finding duplicates in SQL shouldn’t be a problem that plagues you. This is a common problem that happens for many reasons, but usually boils down to one of the following:
- A bad SQL query (primary culprit)
- Duplicate data in the source database/file
It’s likely somewhere along the way that a bad query was written and caused the data to duplicate. If the duplication is really bad it’s likely caused by a cartesian join. Also known as a cartesian product, a cartesian join is a join that causes data duplication often when no join is specified between two tables.
SELECT <COLUMNS> FROM Table1, Table2;
Cartesian Join (aka Cartesian Product)
If the above example had 100 rows in each table then the output would return 100*100 rows totaling 10,000 rows since there is no join to explicitly tell the tables how to filter the data. It will result in a combination of every row between both tables. These can be harder to located in larger queries that involve several joins with several tables.
Fixing Duplicates – The Quick Way
There is a quick way to eliminate duplicates in a SQL query – and that’s with the DISTINCT clause. This will remove any and all duplicate rows in a given query output – although not the recommended approach.
SELECT DISTINCT <COLUMNS> FROM Table1, Table2;
Writing a distinct query as shown above will potentially remove duplication, but you still may or may not get the results you’re looking for. This query is very inefficient since we have not resolved the missing join issue, which means the database compiler is going to still multiply all the records before then filtering the data back down using the DISTINCT keyword.
Below, we can see an example that shows the count returned from a cartesian product.
Had we included the proper join, our record count would have been much lower.
Fixing Duplicates – The Right Way
If the data did not start with duplicate values, I’d take a guess that it’s not what you’re expecting. Thankfully, finding duplicates in SQL doesn’t need to be over-complicated. When you’ve been able to determine the query that’s causing the issue, it’s important to break down the SQL statement into pieces especially if the query is very large.
Here is a step by step:
- Strip out all fields after the SELECT and before the FROM clause.
- Replace fields with COUNT(*) to get a general count of the query based on the current tables and joins.
- Comment out all tables and joins besides the first table in the FROM clause.
- Rejoin each table one by one checking the count each time a table is added.
- If the table count has increased during one of the table joins and it was not expected, you’ve found your culprit.
Some queries sometimes have 20-30 table joins and the above steps can be tedious, but finding duplicates in SQL is not an impossible task. Likely culprits are usually joining incorrect aliases or missing a join condition, but you’ll be glad that you took the time and the duplication headache is over!