Merging changes synced from https://github.com/MicrosoftDocs/dataexplorer-docs-pr (branch live)
Learn Build Service GitHub App authored and Learn Build Service GitHub App committed Sep 23, 2024
2 parents 56b542a + 9eac5bf commit eeb7c86
Showing 7 changed files with 21 additions and 34 deletions.
7 changes: 1 addition & 6 deletions data-explorer/.openpublishing.redirection.json
@@ -425,15 +425,10 @@
"redirect_url": "/kusto/set-timeout-limits?view=azure-data-explorer&preserve-view=true",
"redirect_document_id": false
},
{
"source_path": "dealing-with-duplicates.md",
"redirect_url": "/kusto/concepts/dealing-with-duplicates?view=azure-data-explorer&preserve-view=true",
"redirect_document_id": false
},
{
"source_path": "excel-connector.md",
"redirect_url": "/azure/data-explorer/excel",
"redirect_document_id": false
}
]
}
}
@@ -1,12 +1,15 @@
---
title: Handle duplicate data
description: Learn about the different approaches to handle duplicate data effectively and efficiently.
title: Handle duplicate data in Azure Data Explorer
description: This topic will show you various approaches to deal with duplicate data when using Azure Data Explorer.
ms.reviewer: mblythe
ms.topic: how-to
ms.date: 09/22/2024
ms.date: 12/19/2018

#Customer intent: I want to learn how to deal with duplicate data.
---
# Handle duplicate data
# Handle duplicate data in Azure Data Explorer

<!-- //TODO: Remove this and redirect to KQL repo in concepts folder-->

Devices sending data to the cloud maintain a local cache of the data. Depending on the data size, the local cache could be storing data for days or even months. You want to safeguard your analytical databases from malfunctioning devices that resend the cached data and cause data duplication in the analytical database. Duplicates can affect the number of records returned by a query. This is relevant when you need a precise count of records, such as counting events. This topic outlines best practices for handling duplicate data for these types of scenarios.

@@ -39,7 +42,7 @@ Understand your business requirements and tolerance of duplicate data. Some data

### Solution #2: Handle duplicate rows during query

Another option is to filter out the duplicate rows in the data during query. The [`arg_max()`](../query/arg-max-aggregation-function.md) aggregated function can be used to filter out the duplicate records and return the last record based on the timestamp (or another column). The advantage of using this method is faster ingestion since de-duplication occurs during query time. In addition, all records (including duplicates) are available for auditing and troubleshooting. The disadvantage of using the `arg_max` function is the additional query time and load on the CPU every time the data is queried. Depending on the amount of the data being queried, this solution may become non-functional or memory-consuming and will require switching to other options.
Another option is to filter out the duplicate rows in the data during query. The [`arg_max()`](/kusto/query/arg-max-aggregation-function?view=azure-data-explorer&preserve-view=true) aggregated function can be used to filter out the duplicate records and return the last record based on the timestamp (or another column). The advantage of using this method is faster ingestion since de-duplication occurs during query time. In addition, all records (including duplicates) are available for auditing and troubleshooting. The disadvantage of using the `arg_max` function is the additional query time and load on the CPU every time the data is queried. Depending on the amount of the data being queried, this solution may become non-functional or memory-consuming and will require switching to other options.

In the following example, we query the last record ingested for a set of columns that determine the unique records:

@@ -62,14 +65,14 @@ This query can also be placed inside a function instead of directly querying the

### Solution #3: Use materialized views to deduplicate

[Materialized views](../management/materialized-views/materialized-view-overview.md) can be used for deduplication, by using the [take_any()](../query/take-any-aggregation-function.md)/[arg_min()](../query/arg-min-aggregation-function.md)/[arg_max()](../query/arg-max-aggregation-function.md) aggregation functions (see example #4 in [materialized view create command](../management/materialized-views/materialized-view-create.md#examples)).
[Materialized views](/kusto/management/materialized-views/materialized-view-overview) can be used for deduplication, by using the [take_any()](/kusto/query/take-any-aggregation-function)/[arg_min()](/kusto/query/arg-min-aggregation-function)/[arg_max()](/kusto/query/arg-max-aggregation-function) aggregation functions (see example ?view=azure-data-explorer&preserve-view=true#4 in [materialized view create command](/kusto/management/materialized-views/materialized-view-create#examples)).

> [!NOTE]
> Materialized views come with a cost of consuming cluster's resources, which may not be negligible. For more information, see materialized views [performance considerations](../management/materialized-views/materialized-view-overview.md#performance-considerations).
> Materialized views come with a cost of consuming cluster's resources, which may not be negligible. For more information, see materialized views [performance considerations](/kusto/management/materialized-views/materialized-view-overview?view=azure-data-explorer&preserve-view=true#performance-considerations).
### Solution #4: Use soft delete to remove duplicates

[Soft delete](data-soft-delete.md) supports the ability to delete individual records, and can therefore be used to delete duplicates. This option is recommended only for infrequent deletes, and not if you constantly need to deduplicate all incoming records.
[Soft delete](/kusto/concepts/data-soft-delete?view=azure-data-explorer&preserve-view=true) supports the ability to delete individual records, and can therefore be used to delete duplicates. This option is recommended only for infrequent deletes, and not if you constantly need to deduplicate all incoming records.

#### Choose between materialized views and soft delete for data deduplication

@@ -81,12 +84,12 @@ There are several considerations that can help you choose between using material

### Solution #5: `ingest-by` extent tags

['ingest-by:' extent tags](../management/extent-tags.md) can be used to prevent duplicates during ingestion. This is relevant only in use cases where each ingestion batch is guaranteed to have no duplicates, and duplicates are only expected if the same ingestion batch is ingested more than once.
['ingest-by:' extent tags](/kusto/management/extent-tags?view=azure-data-explorer&preserve-view=true) can be used to prevent duplicates during ingestion. This is relevant only in use cases where each ingestion batch is guaranteed to have no duplicates, and duplicates are only expected if the same ingestion batch is ingested more than once.

## Summary

Data duplication can be handled in multiple ways. Evaluate the options carefully, taking into account price and performance, to determine the correct method for your business.

## Related content

* [Write queries](../query/tutorials/learn-common-operators.md)
* [Write queries for Azure Data Explorer](/azure/data-explorer/kusto/query/tutorials/learn-common-operators)
9 changes: 3 additions & 6 deletions data-explorer/kusto-tocs/management/toc.yml
@@ -54,14 +54,13 @@ items:
href: /kusto/management/databases?view=azure-data-explorer&preserve-view=true
items:
- name: .show databases command
displayName: .show databases command, .show cluster databases command
href: /kusto/management/show-databases?view=azure-data-explorer&preserve-view=true
- name: .show database command
href: /kusto/management/show-database?view=azure-data-explorer&preserve-view=true
- name: .show cluster databases command
href: /kusto/management/show-cluster-database?view=azure-data-explorer&preserve-view=true
- name: .alter database prettyname command
href: /kusto/management/alter-database-prettyname?view=azure-data-explorer&preserve-view=true
- name: .drop database prettyname command
href: /kusto/management/drop-database-prettyname?view=azure-data-explorer&preserve-view=true
href: /kusto/management/alter-database?view=azure-data-explorer&preserve-view=true
- name: .show database schema command
displayName: .show databases schema
href: /kusto/management/show-schema-database?view=azure-data-explorer&preserve-view=true
@@ -914,8 +913,6 @@ items:
href: /kusto/concepts/data-soft-delete?view=azure-data-explorer&preserve-view=true
- name: Soft delete command
href: /kusto/management/soft-delete-command?view=azure-data-explorer&preserve-view=true
- name: Deal with duplicate data
href: /kusto/concepts/dealing-with-duplicates?view=azure-data-explorer&preserve-view=true
- name: Extents (data shards)
items:
- name: Extents overview
4 changes: 2 additions & 2 deletions data-explorer/kusto/.openpublishing.redirection.json
@@ -21,8 +21,8 @@
"redirect_document_id": false
},
{
"source_path": "query/concepts/dealing-with-duplicates.md",
"redirect_url": "/kusto/concepts/dealing-with-duplicates",
"source_path": "concepts/dealing-with-duplicates.md",
"redirect_url": "/azure/data-explorer/dealing-with-duplicates",
"redirect_document_id": false
},
{
2 changes: 1 addition & 1 deletion data-explorer/kusto/concepts/data-soft-delete.md
@@ -21,7 +21,7 @@ For information on how to use the command, see [Syntax](../management/soft-delet

This deletion method should only be used for the unplanned deletion of individual records. For example, if you discover that an IoT device is reporting corrupt telemetry for some time, you should consider using this method to delete the corrupt data.

If you need to frequently delete records for deduplication or updates, we recommend using [materialized views](../management/materialized-views/materialized-view-overview.md). See [choose between materialized views and soft delete for data deduplication](dealing-with-duplicates.md#choose-between-materialized-views-and-soft-delete-for-data-deduplication).
If you need to frequently delete records for deduplication or updates, we recommend using [materialized views](../management/materialized-views/materialized-view-overview.md). See [choose between materialized views and soft delete for data deduplication](/azure/data-explorer/dealing-with-duplicates#choose-between-materialized-views-and-soft-delete-for-data-deduplication).

## Deletion process

2 changes: 0 additions & 2 deletions data-explorer/kusto/management/toc.yml
@@ -914,8 +914,6 @@ items:
href: ../concepts/data-soft-delete.md
- name: Soft delete command
href: soft-delete-command.md
- name: Deal with duplicate data
href: ../concepts/dealing-with-duplicates.md
- name: Extents (data shards)
items:
- name: Extents overview
8 changes: 1 addition & 7 deletions data-explorer/toc.yml
@@ -506,14 +506,8 @@ items:
href: data-purge-portal.md
- name: Delete data
href: delete-data.md
- name: Data soft delete
items:
- name: Soft delete overview
href: /kusto/concepts/data-soft-delete?view=azure-data-explorer&preserve-view=true
- name: Soft delete command
href: /kusto/management/soft-delete-command?view=azure-data-explorer&preserve-view=true
- name: Deal with duplicate data
href: /kusto/concepts/dealing-with-duplicates?view=azure-data-explorer&preserve-view=true
href: /kusto/query/concepts/dealing-with-duplicates?view=azure-data-explorer&preserve-view=true
- name: Monitor
items:
- name: Monitor Azure Data Explorer with metrics