
2026-01-09 // Dylan Evans

Designing Detections for Scale, Speed, and Human Beings

Effective KQL detection engineering goes beyond writing queries — it demands an understanding of attacker behaviour, analyst workflows, and production environments. This article covers three core principles: building with entity mapping in mind, optimising queries for scale, and writing detections that others can understand and maintain under pressure.

Detection Engineering · KQL · Defender XDR · Microsoft Sentinel · SOC · SIEM

Detection engineering with KQL (Kusto Query Language) often looks simple on paper: write a query, reference a table, join some data, and generate alerts. In practice, it's rarely that straightforward. KQL detections sit at the forefront of Defender XDR, meaning even small changes and errors can have ripple effects far beyond a single alert. Noisy or poorly designed detections quickly lead to analyst fatigue, increased false positives, and an erosion of trust within the SOC.

Building effective detections requires more than knowing KQL. It demands an understanding of attacker behaviours and attack paths, the analyst's point of view, and what legitimate user and system activity actually looks like in production.

This article shares some KQL engineering tips and tricks drawn from real-world experience designing, tuning, and supporting detections in live SOC environments. The focus is not just on cleaner queries, but on revisiting tried-and-tested techniques that deliver visible improvements:

  • Developing detections with triage in mind.
  • Writing and optimising your KQL for big and small log tables.
  • Making your queries readable and maintainable under pressure.

1. Always Keep Entity Mapping Front of Mind

Just as triage is central to the incident flow, entities are a key consideration when writing detection rules. Entities represent the subjects of interest in an alert and provide the context an analyst needs to frame their investigation. Mapping them also enables automation to act whenever specific entity types are identified.

For example, an alert for an overseas sign-in that has mapped the IP address and the user involved allows an analyst to immediately understand the scope of the alert activity and derive the most logical next steps:

  1. Check the user's sign-in logs, particularly from that IP address.
  2. Check the IP in threat intelligence sources.
  3. Check the app activity logs for this user, particularly from that IP address.
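As a rough sketch, a detection like this only needs to surface its entity columns cleanly. The country list and lookback window below are illustrative assumptions, not recommendations; the columns are from Sentinel's SigninLogs schema:

```kql
// Illustrative overseas sign-in detection (country list and window are assumptions)
SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType == "0"                      // successful sign-ins only
| where Location !in ("GB", "IE")              // outside the expected countries
| project TimeGenerated, UserPrincipalName, IPAddress, Location, AppDisplayName
```

In the analytics rule, UserPrincipalName would be mapped to the Account entity and IPAddress to the IP entity, which is what lets Sentinel pivot and automate on them.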

Because these entities are explicitly mapped, the alert also lends itself naturally to automation. For example:

  • If the activity is safe, alerts can be auto-closed as benign or false positives.
  • If there are repeatable actions, like disabling a user account, an automation rule can act on the accounts involved in a specific incident.

This is only possible because Sentinel knows these values exist. If Sentinel knows what these values are and how they relate, it can take further action more reliably and efficiently. This is where engineering around entity mapping pays dividends. Detections that clearly articulate the who, what, where, and how not only speed up time to respond, but also scale better for mature SOC teams where automation is no longer a nicety but a necessity.

2. Optimisation IS Important

When first starting out writing detections, it's easy not to pay much attention to query runtime or compute requirements, and in many customer environments that's a fair position: the data stored in the Log Analytics workspace is simply not large enough to warrant an optimisation effort that shaves 0.01 seconds off the runtime. Working across multiple tenants at an MSSP quickly changes that perspective. A query that runs "fast enough" in one tenant can behave very differently in a larger one.

For instance, running a query against the SecurityIncidents table will usually complete quickly even with an unoptimised query. However, running an unoptimised query over 60 days of non-interactive sign-in logs could mean failures, delays, and timeouts. The takeaway is not that every query must be aggressively optimised from day one, but that detection engineers need to understand where performance matters and why certain KQL statements do not scale.

Understanding the == / has / contains Hierarchy

One of the more common optimisation mistakes is overuse of contains. It is slow, but intuitive and forgiving, making it an easy choice for newer engineers. The has and has_any operators cover most of the same ground far more efficiently, at the cost of stricter, token-based matching rules.

Consider searching a column for the string "PowerShell.exe -enc":

The == operator — fastest, but requires an exact case-sensitive match:

| where Text == "PowerShell.exe -enc"  // TRUE
| where Text == "powershell.exe"        // FALSE — case mismatch, missing flag

The has operator — fast, case-insensitive, but token-based (full words only):

| where Text has "powershell.exe"   // TRUE — case-insensitive, ignores '-enc'
| where Text has "PowerShell"       // TRUE — '.exe' is not part of the token
| where Text has "shell"            // FALSE — 'shell' is not a full token
| where Text has ".exe -enc"        // FALSE — multi-token searches are not supported

The has_any operator — same as has, but matches any of multiple terms:

| where Text has_any ("powershell", "google")  // TRUE — matches 'powershell'

The contains operator — slow, but not bound by token rules:

| where Text contains "shell"       // TRUE — but also matches "Shellcode detected"
| where Text contains ".exe -enc"   // TRUE — works, but at a performance cost

To summarise: == is fastest but least forgiving; has is fast and reasonably forgiving; contains is slow but extremely forgiving.

Using Watchlists Instead of Inline in() Lists

Watchlists are a powerful tool within Sentinel that act as mini databases referenceable from your KQL. They reduce query complexity and centralise information that may be used across multiple queries, such as a risky user list or a list of suspicious command-line entries.

Instead of:

| where CommandLine in ("e1", "e2", "e3", ... tens or hundreds more)

Use a watchlist:

| lookup kind=inner (_GetWatchlist("SusCommandLine")) on $left.CommandLine == $right.SusCommandLine  // kind=inner keeps only matching rows

This makes your query easier to maintain and read, while also allowing other queries to reference the same centralised data.
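When you only need the filter and none of the watchlist's other columns, the same watchlist can also feed an in() directly. The table and column names here extend the article's illustrative SusCommandLine example; DeviceProcessEvents and ProcessCommandLine are assumptions about where the command lines live:

```kql
// Filter against the centralised watchlist without pulling in its other columns
DeviceProcessEvents
| where ProcessCommandLine in ((_GetWatchlist("SusCommandLine") | project SusCommandLine))
```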

The Golden Rules of Optimisation: Touch as Little Data as Possible

Filter early. If you are looking for failed sign-ins, filter out successful ones at the start. KQL executes operators in sequence, so earlier filters reduce the data processed by everything that follows.

Summarise where possible. summarize can reduce thousands of rows down to exactly what you need. For example, to detect when failed sign-ins exceed 1000 in a given period:

SigninLogs
| where ResultType != "0"                 // drop successful sign-ins early
| summarize count() by ResultType
| where ResultType == "50058" and count_ > 1000

When summarising a time field, use bin to aggregate logs into defined time blocks:

| summarize count() by bin(TimeGenerated, 5m)

Use project early. If you only care about specific columns, project them at the start to drop the rest and reduce the scope of data being processed:

SigninLogs
| project TimeGenerated, ResultType
| summarize count() by bin(TimeGenerated, 5m), ResultType

3. Write Your Query as if You're Reading It at 3am

This step is arguably the most important. A brilliantly written query with flawless logic is of little value if no one else understands what it's doing. Clear, readable KQL allows analysts triaging an alert at 3am to quickly understand the detection, know where to look, and identify what questions to ask next.

Use comments. Comments are an underrated and underused tool in detection engineering, despite being commonplace in software development for decades. Use // to start a single-line comment, Ctrl+/ to comment out an entire line, or select a section and use Ctrl+L to comment out the block in bulk.

Avoid complex variable names. If using variables via let, make sure they are easy to understand and do not overlap with common column names like UserPrincipalName. A variable name you need to go searching for to understand is not a good variable name.
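A minimal sketch of what descriptive, non-clashing let names look like in practice; the thresholds and window are illustrative, and the columns follow Sentinel's SigninLogs schema:

```kql
let LookbackWindow = 1h;              // how far back the rule looks
let FailureThreshold = 20;            // tune per environment
SigninLogs
| where TimeGenerated > ago(LookbackWindow)
| where ResultType != "0"             // failed sign-ins only
| summarize Failures = count() by UserPrincipalName
| where Failures > FailureThreshold
```

Names like LookbackWindow read at a glance and cannot be confused with a column, which is exactly what you want at 3am.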

Write documentation. Good documentation explains the why behind your detection, not just the what. It tells the next analyst what problem the query is solving, what assumptions it relies on, and what "normal" is supposed to look like. Without that context, even a well-written query becomes risky to touch and harder to fix under pressure.

Nobody enjoys writing documentation. But every undocumented detection eventually becomes technical debt — usually discovered at the worst possible time, by someone who didn't write it, during a major incident. Documentation is one of the most important force-multipliers in detection engineering, and skipping it is always a bad idea.

Conclusion

None of this is revolutionary. Variations of these practices already exist in SOCs everywhere. The point isn't novelty — it's discipline. These ideas exist to reinforce the detection engineering habits that continue to work as data volumes grow and complexity increases.

Map – Optimise – Explain. When those three principles are kept front of mind, even a flawed query is rarely a problem. Someone can understand it, fix it, and deploy an improved version quickly. That alone is a major operational win.

If it's slow or misunderstood, it won't be trusted. And if it isn't trusted, it might as well not exist.