A real intrusion does not stay in one place. It touches the operating system, crosses the network, and often runs through the browser, and the only way to see the whole attack is to correlate events across all three. That is the premise of a dataset released on arXiv on June 16, 2026, by Abir Ashab Niloy and colleagues — and the reason it exists is a gap that connects three otherwise separate strands of security research: the data we have, the labels we need, and the models small enough to actually deploy.
Start with the data problem, because it is the one the authors lead with. Detecting a multi-stage attack "requires correlating events across all three sources," they write, but the public datasets researchers train on are each missing something. Network-focused sets like CICIDS and UNSW-NB15 "miss host and browser activity." Host-focused sets like LMDG and CICAPT-IIoT "lack browser telemetry." One dataset, ATLAS, has all three sources but only labels events as malicious or benign — no detail on which technique an attacker was using. Connect those gaps and the conclusion is stark.
"No public dataset combines all three sources with per-entry ATT&CK technique labels."— arXiv:2606.18190, source
That single sentence is the whole motivation. The MITRE ATT&CK framework is the industry's shared vocabulary for describing what attackers do: a catalog of tactics and the techniques that implement them. A detector that only says "malicious" tells an analyst far less than one that says "this is credential dumping" or "this is command-and-control over a tunnel." But you cannot train a model to produce ATT&CK-level labels if no dataset provides them at per-event granularity, let alone a model that spans the host, the network, and the browser at once. So the team built it.
What's actually in the dataset
The numbers give a sense of scale. The release contains "a multi-source log dataset of 870 sessions (70 attack, 800 benign) and approximately 2.3 million events," with system, network, and browser activity "captured simultaneously on Windows endpoints." Capturing all three at the same time is the part that matters — it is what lets a model learn how a single attack manifests across sources, rather than learning each source in isolation. The malicious events are labeled with ATT&CK technique IDs "covering 12 tactics and 53 techniques," which is the granularity the field has been missing.
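To make "captured simultaneously" concrete, here is a minimal sketch of how time-aligned events from the three sources might be merged into one per-session timeline. The record layout (`ts`, `source`, `detail`, `technique`) and the sample events are hypothetical illustrations, not the dataset's actual schema.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Event:
    ts: float                        # seconds since session start (hypothetical field)
    source: str                      # "system", "network", or "browser"
    detail: str                      # free-text description of the event
    technique: Optional[str] = None  # ATT&CK technique ID, or None if benign


def merge_timeline(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge per-source streams, each already sorted by time, into one timeline."""
    return heapq.merge(*streams, key=lambda e: e.ts)


# Invented sample events, for illustration only.
system = [Event(2.0, "system", "powershell.exe spawned", "T1059.001")]
network = [
    Event(1.5, "network", "DNS lookup for file host"),
    Event(3.1, "network", "outbound TLS session", "T1071.001"),
]
browser = [Event(1.0, "browser", "download started")]

timeline = list(merge_timeline(system, network, browser))
for e in timeline:
    print(f"{e.ts:>4.1f}  {e.source:<8} {e.technique or '-':<10} {e.detail}")
```

Once the sources are interleaved this way, a browser download, a DNS lookup, and a process launch appear side by side in the order they happened. That ordering is what a model trained on separate, single-source logs never gets to see.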
Crucially, the attacks are not simulated abstractions. The authors generated the attack data "using real tools, including Remote Access Trojan (RAT), Command and Control (C2) tunnels, and cloud exfiltration." That choice trades some experimental tidiness for realism: the malicious traffic looks like the techniques defenders actually face. The heavy skew toward benign sessions — 800 to 70 — also mirrors reality, where genuine attacks are a tiny fraction of all activity, and it is a deliberate stress test for any model that has to find the needle.
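The 800-to-70 split also means plain accuracy can flatter a useless detector. The short check below uses only the session counts quoted above. The reported experiments score chunks of logs rather than whole sessions, so treat this as an illustration of the skew, not a reproduction of any published number.

```python
benign_sessions = 800
attack_sessions = 70
total = benign_sessions + attack_sessions  # 870 sessions, as reported

# A trivial detector that labels every session "benign".
always_benign_accuracy = benign_sessions / total
always_benign_recall = 0 / attack_sessions  # it never catches an attack

# The opposite trivial detector labels every session "attack".
always_attack_accuracy = attack_sessions / total

print(f"always-benign accuracy:      {always_benign_accuracy:.1%}")
print(f"always-benign attack recall: {always_benign_recall:.1%}")
print(f"always-attack accuracy:      {always_attack_accuracy:.1%}")
```

A detector that never raises an alert would score roughly 92% accuracy at the session level while catching nothing. For that reason, per-class measures such as recall on the attack class are worth reading alongside any accuracy figure on data this skewed.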
The small-model angle, and why it's the interesting one
The third strand is what the authors do with the data. Rather than reaching for a giant frontier model, they fine-tune three small language models — Qwen2.5-1.5B, Llama-3.2-3B, and Phi-4-Mini — using Low-Rank Adaptation, or LoRA, a technique that adapts a model cheaply without retraining all its weights. This is the practical bet: security telemetry is sensitive and high-volume, the kind of data organizations often want to process on their own hardware rather than ship to an external API, and small models are what make that feasible.
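To see why LoRA is cheap, it helps to count weights. Instead of updating a full weight matrix, LoRA trains two thin matrices whose product is added to the frozen original. The sketch below compares the two counts for a single square layer. The layer size and rank are hypothetical values chosen for illustration and are not taken from the paper.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Weights updated when a dense layer is fine-tuned in full."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights trained by LoRA: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 2048, 2048, 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} trainable weights")
print(f"LoRA (rank {rank}):    {lora:,} trainable weights")
print(f"ratio:            {lora / full:.2%}")
```

Under these assumed numbers the adapter trains well under 1% of the layer's weights. That kind of saving is what makes fine-tuning on an organization's own hardware plausible.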
The headline result is a dramatic before-and-after. On chunk classification (deciding whether a span of logs is malicious), fine-tuning "improved every model on every metric," with accuracy rising "from approximately 8% in the base variants to between 90% and 97% after fine-tuning." A baseline accuracy of about 8% tells you these small models could not usefully read raw security logs out of the box; the jump to the 90s tells you the dataset taught them something real. That every model improved on every metric is the kind of clean signal that suggests the gains come from the data, not a lucky configuration.
But the authors are refreshingly honest about where it breaks down. The harder task — naming the specific ATT&CK technique, not just flagging malice — "remained challenging, with the best exact-match accuracy at 42%." They note that "high partial-match scores show the models captured most of the underlying reasoning," meaning the models often get close to the right technique without nailing it exactly. That gap between "something is wrong here" and "here is precisely which technique it is" is the honest frontier of this work, and pretending otherwise would undersell how hard fine-grained attribution is.
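The difference between exact and partial matching is easy to show in code. The source quoted here does not spell out how partial-match scores are computed, so the sketch below assumes one common choice, the Jaccard overlap between predicted and labeled technique IDs. The function names and the sample prediction are hypothetical.

```python
def exact_match(predicted: set[str], gold: set[str]) -> bool:
    """True only when the predicted technique IDs equal the labels exactly."""
    return predicted == gold


def partial_match(predicted: set[str], gold: set[str]) -> float:
    """Jaccard overlap (an assumed metric): shared IDs over all IDs mentioned."""
    if not predicted and not gold:
        return 1.0  # nothing to predict and nothing predicted
    return len(predicted & gold) / len(predicted | gold)


# Invented example: the model finds two of the three labeled techniques.
gold = {"T1003.001", "T1071.001", "T1567.002"}
predicted = {"T1003.001", "T1071.001"}

print("exact match:  ", exact_match(predicted, gold))
print("partial match:", round(partial_match(predicted, gold), 2))
```

A prediction like this one counts as a miss under exact match yet still earns partial credit. That is one way a model can post a modest exact-match figure alongside high partial-match scores.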
Why these three things are one story
Follow the thread and the pieces lock together. The field could not build technique-aware detectors because no dataset had per-event ATT&CK labels across host, network, and browser. So the researchers built that dataset with real attack tooling. And once it existed, it became possible to ask whether modest, deployable small models could learn from it — with the answer being a confident yes for coarse detection and a qualified "not yet" for fine-grained attribution. The dataset is the enabling artifact; the SLM experiments are the proof that it teaches something learnable.
The caveats travel with the contribution. The data is Windows-only, the attack sessions are few in absolute terms, and the 42% ceiling on technique identification shows that attacks are easier to flag than to attribute to a specific technique. None of that diminishes the core move, which is the one worth remembering: when the right labeled data does not exist, the bottleneck is not the model. Build the dataset, and even small, on-premises models can start to do work that previously seemed to require something far larger.