diff --git a/data-explorer/kusto/functions-library/log-reduce-fl.md b/data-explorer/kusto/functions-library/log-reduce-fl.md index 601fb8ca9d..b23db87e4c 100644 --- a/data-explorer/kusto/functions-library/log-reduce-fl.md +++ b/data-explorer/kusto/functions-library/log-reduce-fl.md @@ -48,7 +48,7 @@ The function runs multiple passes over the rows to be reduced to common patterns The Logram algorithm considers 3-tuples and 2-tuples of tokens. If a 3-tuple of tokens is common in the log lines (it appears more than *trigram_th* times), then it's likely that all three tokens are part of the pattern. If the 3-tuple is rare, then it's likely that it contains a variable that should be replaced by a wildcard. For rare 3-tuples, we consider the frequency with which 2-tuples contained in the 3-tuple appear. If a 2-tuple is common (it appears more than *bigram_th* times), then the remaining token is likely to be a parameter, and not part of the pattern.\ The Logram algorithm is easy to parallelize. It requires two passes on the log corpus: the first one to count the frequency of each 3-tuple and 2-tuple, and the second one to apply the logic previously described to each entry. To parallelize the algorithm, we only need to partition the log entries, and unify the frequency counts of different workers. -* **Apply [Drain algorithm](https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf)**: this pass is optional, pending *use_drain* is true. Drain is a log parsing algorithm based on a truncated depth prefix tree. Log messages are split according to their length, and for each length the first *tree_depth* tokens of the log message are used to build a prefix tree. If no match for the prefix tokens was found, a new branch is created. If a match for the prefix was found, we search for the most similar pattern among the patterns contained in the tree leaf. Pattern similarity is measured by the ratio of matched nonwildcard tokens out of all tokens. If the similarity of the most similar pattern is above the similarity threshold (the parameter *similarity_th*), then the log entry is matched to the pattern. For that pattern, the function replaces all nonmatching tokens by wildcards. If the similarity of the most similar pattern is below the similarity threshold, a new pattern containing the log entry is created.\ +* **Apply [Drain algorithm](https://pinjiahe.github.io/files/pdf/research/ICWS17.pdf)**: this pass is optional and runs only if *use_drain* is true. Drain is a log parsing algorithm based on a truncated-depth prefix tree. Log messages are split according to their length, and for each length the first *tree_depth* tokens of the log message are used to build a prefix tree. If no match for the prefix tokens is found, a new branch is created. If a match for the prefix is found, we search for the most similar pattern among the patterns contained in the tree leaf. Pattern similarity is measured by the ratio of matched nonwildcard tokens out of all tokens. If the similarity of the most similar pattern is above the similarity threshold (the parameter *similarity_th*), then the log entry is matched to the pattern. For that pattern, the function replaces all nonmatching tokens by wildcards. If the similarity of the most similar pattern is below the similarity threshold, a new pattern containing the log entry is created.\ We set the default *tree_depth* to 4 based on testing various logs.
Increasing this depth can improve runtime but might degrade pattern accuracy; decreasing it is more accurate but slower, as each node performs many more similarity tests.\ Usually, Drain efficiently generalizes and reduces patterns (though it's hard to parallelize). However, as it relies on a prefix tree, it might not be optimal for log entries containing parameters in the first tokens. This can be resolved in most cases by applying Logram first. diff --git a/data-explorer/kusto/management/cache-policy.md b/data-explorer/kusto/management/cache-policy.md index 211d939b39..f5796ee23f 100644 --- a/data-explorer/kusto/management/cache-policy.md +++ b/data-explorer/kusto/management/cache-policy.md @@ -3,7 +3,7 @@ title: Caching policy (hot and cold cache) description: This article describes caching policy (hot and cold cache). ms.reviewer: orspodek ms.topic: reference -ms.date: 08/11/2024 +ms.date: 11/11/2024 --- # Caching policy (hot and cold cache) @@ -21,6 +21,9 @@ The best query performance is achieved when all ingested data is cached. However Use management commands to alter the caching policy at the [database](alter-database-cache-policy-command.md), [table](alter-table-cache-policy-command.md), or [materialized view](alter-materialized-view-cache-policy-command.md) level. +> [!NOTE] +> For information about the consumption rate, see [Eventhouse and KQL database consumption](/fabric/real-time-intelligence/real-time-intelligence-consumption). + ::: moniker-end ::: moniker range="azure-data-explorer" @@ -35,13 +38,13 @@ Use management commands to alter the caching policy at the [cluster](alter-clust ## How caching policy is applied -When data is ingested, the system keeps track of the date and time of the ingestion, and of the extent that was created. The extent's ingestion date and time value (or maximum value, if an extent was built from multiple pre-existing extents), is used to evaluate the caching policy. +When data is ingested, the system keeps track of the date and time of the ingestion, and of the extent that was created. The extent's ingestion date and time value (or the maximum value, if an extent was built from multiple preexisting extents) is used to evaluate the caching policy. > [!NOTE] > You can specify a value for the ingestion date and time by using the ingestion property `creationTime`. > When doing so, make sure the `Lookback` property in the table's effective [Extents merge policy](merge-policy.md) is aligned with the values you set for `creationTime`. -By default, the effective policy is `null`, which means that all the data is considered **hot**. A `null` policy at the table level means that the policy will be inherited from the database. A non-`null` table-level policy overrides a database-level policy. +By default, the effective policy is `null`, which means that all the data is considered **hot**. A `null` policy at the table level means that the policy is inherited from the database. A non-`null` table-level policy overrides a database-level policy. ## Scoping queries to hot cache @@ -64,7 +67,7 @@ The `default` value indicates use of the default settings, which determine that If there's a discrepancy between the different methods, then `set` takes precedence over the client request property. Specifying a value for a table reference takes precedence over both.
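To make the `set` statement method concrete, here's a minimal sketch (the table name `MyTable` is assumed for illustration; the client request property variant is configured in the client application rather than in the query text):

```kusto
// Scope this query to data in the hot cache only. The set statement
// applies to this query and, per the precedence rules above, overrides
// the query_datascope client request property if both are specified.
set query_datascope="hotcache";
MyTable
| summarize count()
```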
-For example, in the following query, all table references use hot cache data only, except for the second reference to "T", that is scoped to all the data: +For example, in the following query, all table references use hot cache data only, except for the second reference to "T", which is scoped to all the data: ```kusto set query_datascope="hotcache"; @@ -85,7 +88,7 @@ Example: * `SoftDeletePeriod` = 56d * `hot cache policy` = 28d -In the example, the last 28 days of data will be on the SSD and the additional 28 days of data will be stored in Azure blob storage. You can run queries on the full 56 days of data. +In the example, the last 28 days of data is stored on the SSD and the remaining 28 days of data is stored in Azure blob storage. You can run queries on the full 56 days of data. ## Related content diff --git a/data-explorer/kusto/query/best-practices.md b/data-explorer/kusto/query/best-practices.md index fc664c283b..a80b200acb 100644 --- a/data-explorer/kusto/query/best-practices.md +++ b/data-explorer/kusto/query/best-practices.md @@ -3,81 +3,47 @@ title: Best practices for Kusto Query Language queries description: This article describes query best practices. ms.reviewer: alexans ms.topic: reference -ms.date: 08/11/2024 +ms.date: 11/11/2024 adobe-target: true --- # Best practices for Kusto Query Language queries -> [!INCLUDE [applies](../includes/applies-to-version/applies.md)] [!INCLUDE [fabric](../includes/applies-to-version/fabric.md)] [!INCLUDE [azure-data-explorer](../includes/applies-to-version/azure-data-explorer.md)] [!INCLUDE [monitor](../includes/applies-to-version/monitor.md)] [!INCLUDE [sentinel](../includes/applies-to-version/sentinel.md)] +> [!INCLUDE [applies](../includes/applies-to-version/applies.md)] [!INCLUDE [fabric](../includes/applies-to-version/fabric.md)] [!INCLUDE [azure-data-explorer](../includes/applies-to-version/azure-data-explorer.md)] [!INCLUDE [monitor](../includes/applies-to-version/monitor.md)] [!INCLUDE [sentinel](../includes/applies-to-version/sentinel.md)] Here are several best practices to follow to make your query run faster. ## In short -:::moniker range="azure-data-explorer" -| Action | Use | Don't use | Notes | -|--|--|--|--| -| **Reduce the amount of data being queried** | Use mechanisms such as the `where` operator to reduce the amount of data being processed. | | See below for efficient ways to reduce the amount of data being processed. | -| **Avoid using redundant qualified references** | When referencing local entities, use the unqualified name. | | See below for more on the subject. | -| **`datetime` columns** | Use the `datetime` data type. | Don't use the `long` data type. | In queries, don't use unix time conversion functions, such as `unixtime_milliseconds_todatetime()`. Instead, use update policies to convert unix time to the `datetime` data type during ingestion. | -| **String operators** | Use the `has` operator | Don't use `contains` | When looking for full tokens, `has` works better, since it doesn't look for substrings. | -| **Case-sensitive operators** | Use `==` | Don't use `=~` | Use case-sensitive operators when possible. | -| | Use `in` | Don't use `in~` | -| | Use `contains_cs` | Don't use `contains` | If you can use `has`/`has_cs` and not use `contains`/`contains_cs`, that's even better. | -| **Searching text** | Look in a specific column | Don't use `*` | `*` does a full text search across all columns. 
| -| **Extract fields from [dynamic objects](scalar-data-types/dynamic.md) across millions of rows** | Materialize your column at ingestion time if most of your queries extract fields from dynamic objects across millions of rows. | | This way, you'll only pay once for column extraction. | -| **Lookup for rare keys/values in [dynamic objects](scalar-data-types/dynamic.md)** | Use `MyTable | where DynamicColumn has "Rare value" | where DynamicColumn.SomeKey == "Rare value"` | Don't use `MyTable | where DynamicColumn.SomeKey == "Rare value"` | This way, you filter out most records, and do JSON parsing only of the rest. | -| **`let` statement with a value that you use more than once** | Use the [materialize() function](materialize-function.md) | | For more information on how to use `materialize()`, see [materialize()](materialize-function.md). For more information, see [Optimize queries that use named expressions](named-expressions.md).| -| **Apply conversions on more than 1 billion records** | Reshape your query to reduce the amount of data fed into the conversion. | Don't convert large amounts of data if it can be avoided. | | -| **New queries** | Use `limit [small number]` or `count` at the end. | | Running unbound queries over unknown datasets may yield GBs of results to be returned to the client, resulting in a slow response and a busy cluster. | -| **Case-insensitive comparisons** | Use `Col =~ "lowercasestring"` | Don't use `tolower(Col) == "lowercasestring"` | -| **Compare data already in lowercase (or uppercase)** | `Col == "lowercasestring"` (or `Col == "UPPERCASESTRING"`) | Avoid using case insensitive comparisons. | | -| **Filtering on columns** | Filter on a table column. | Don't filter on a calculated column. | | -| | Use `T | where predicate(*Expression*)` | Don't use `T | extend _value = *Expression* | where predicate(_value)` | | -| **summarize operator** | Use the [hint.shufflekey=\](shuffle-query.md) when the `group by keys` of the summarize operator are with high cardinality. | | High cardinality is ideally above 1 million. | -| **[join operator](join-operator.md)** | Select the table with the fewer rows to be the first one (left-most in query). | | -| | Use `in` instead of left semi `join` for filtering by a single column. | | -| Join across clusters | Across clusters, run the query on the "right" side of the join, where most of the data is located. | | -| Join when left side is small and right side is large | Use [hint.strategy=broadcast](broadcast-join.md) | | Small refers to up to 100MB of data. | -| Join when right side is small and left side is large | Use the [lookup operator](lookup-operator.md) instead of the `join` operator | | If the right side of the lookup is larger than several tens of MBs, the query will fail. | -| Join when both sides are too large | Use [hint.shufflekey=\](shuffle-query.md) | | Use when the join key has high cardinality. | -| **Extract values on column with strings sharing the same format or pattern** | Use the [parse operator](parse-operator.md) | Don't use several `extract()` statements. | For example, values like `"Time =