BigQuery Deduplication

Mark Scannell
Google Cloud - Community
3 min read · Sep 10, 2019


Do you ever get duplicate rows in BigQuery?

I’m going to explore some techniques for deduplication in BigQuery, both for the whole table and for a single partition. The examples assume you have a dataset named transactions.
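
If you don’t have that dataset yet, you can create it with a line of DDL. Something like this works, assuming you want it in the US multi-region (adjust the location option otherwise):

CREATE SCHEMA IF NOT EXISTS transactions
OPTIONS (location = 'US');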

Create Your Test Data

This will create a table containing two columns (date, v). There will be 21 days of data (partitioned by date), and each day will have a million rows. The value v will be mostly unique within a day, but a small fraction of rows will collide, producing the duplicates we want to remove.

Create the Data

CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT
  date,
  -- v is roughly 2*i plus a random offset in {-1, 0, 1, 2}, so values
  -- from neighbouring i occasionally collide and create duplicate rows
  CAST((i*2)+(2-FLOOR(RAND()*4.0)) AS INT64) AS v
FROM
  UNNEST(GENERATE_DATE_ARRAY(
    CURRENT_DATE(),
    DATE_ADD(CURRENT_DATE(), INTERVAL 20 DAY)
  )) AS date
CROSS JOIN
  UNNEST(GENERATE_ARRAY(1, 1000000, 1)) AS i;

Examine your Data

SELECT
  date,
  COUNT(DISTINCT v) AS unique_values,
  COUNT(*) AS total_rows
FROM
  `transactions.testdata`
GROUP BY
  date
ORDER BY
  date;

These are the first ten rows of output.

You can see that (depending on RAND()) each day has a large number of unique_values, but always fewer than the full million rows. The difference is the duplicates we want to remove.
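
If you want to count the duplicates directly, something like the following works: it compares the total row count against the distinct count for each day.

SELECT
  date,
  COUNT(*) - COUNT(DISTINCT v) AS duplicate_rows
FROM
  `transactions.testdata`
GROUP BY
  date
ORDER BY
  date;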

Deduplicate Everything — Easy Way

The easiest way is to re-create the whole table in place using SELECT DISTINCT. You need to specify the same table options (such as PARTITION BY), or they will be lost when the table is recreated.

Recreate the Table

CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT DISTINCT * FROM `transactions.testdata`;

This processed about 320MB and consumed 2 minutes 15 seconds of slot time.
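
If you want to verify numbers like these yourself, BigQuery exposes job statistics through INFORMATION_SCHEMA. A minimal sketch, assuming your jobs run in the US multi-region (change the region qualifier otherwise):

SELECT
  job_id,
  total_bytes_processed,
  total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY
  creation_time DESC
LIMIT 5;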

Examine your Data

Running the query to examine the data again shows that unique_values now matches total_rows for every day: the table is deduplicated.

However, if you have partitions and you’re appending to your table, you don’t want to deduplicate the entire table every time. Rather, you want to deduplicate just the partition you loaded. But how?

Deduplicate Some of the Table — Merge

Rerun the CREATE OR REPLACE TABLE query from above so that the table is full of duplicates again. This way you can see the partial deduplication at work.

Deduplicate using MERGE

MERGE `transactions.testdata` t
USING (
  -- The deduplicated rows for today's partition only
  SELECT DISTINCT *
  FROM `transactions.testdata`
  WHERE date = CURRENT_DATE()
)
-- Never match, so every target row is "not matched by source"
-- and every source row is "not matched by target"
ON FALSE
-- Delete the existing rows in today's partition...
WHEN NOT MATCHED BY SOURCE AND date = CURRENT_DATE() THEN DELETE
-- ...and insert the deduplicated rows back in
WHEN NOT MATCHED BY TARGET THEN INSERT ROW;

This processed 15.3MB and consumed 18 seconds of slot time. As there are 21 days of data and the MERGE only reads and rewrites a single partition, this makes sense: roughly 320MB / 21 is about 15MB.

Examine your Data

One of the days (the current date as of writing) is now deduplicated, but none of the other days are. This is exactly as expected!
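
If you want to confirm this with a query rather than by eyeballing the output, something along these lines flags which days still contain duplicates:

SELECT
  date,
  COUNT(*) = COUNT(DISTINCT v) AS is_deduplicated
FROM
  `transactions.testdata`
GROUP BY
  date
ORDER BY
  date;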

Conclusion

It is very easy to deduplicate rows in BigQuery across the entire table or on a subset of the table, including a partitioned subset.
