NVIDIA Opens Nemotron 3 Ultra, a 550-Billion-Parameter Open Model with Hybrid Mamba-Attention Architecture
Open-weight release on June 4: 55 billion active parameters, a 1M-token context, and claimed throughput of up to 5.9x over rivals. On raw intelligence, however, it still trails China's Kimi K2.6.
On June 4, NVIDIA released Nemotron 3 Ultra, an open-weight model with 550 billion total parameters and 55 billion active (roughly 90% sparsity). It is distributed on Hugging Face under the open OpenMDW-1.1 license.
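The sparsity figure follows directly from the two parameter counts. The short sketch below reproduces it and adds a rough per-token compute estimate. The rule of thumb of about 2 FLOPs per active parameter per token is an assumption brought in for illustration, not a figure from NVIDIA.

```python
TOTAL_PARAMS = 550e9   # total parameters, as stated in the release
ACTIVE_PARAMS = 55e9   # parameters active per token, as stated in the release

def sparsity(total: float, active: float) -> float:
    """Fraction of parameters that are NOT used for a given token."""
    return 1.0 - active / total

def forward_flops_per_token(active: float) -> float:
    """Rule-of-thumb estimate (assumption): ~2 FLOPs per active parameter."""
    return 2.0 * active

print(f"sparsity: {sparsity(TOTAL_PARAMS, ACTIVE_PARAMS):.0%}")
print(f"MoE forward pass:    ~{forward_flops_per_token(ACTIVE_PARAMS) / 1e9:.0f} GFLOPs per token")
print(f"dense 550B forward:  ~{forward_flops_per_token(TOTAL_PARAMS) / 1e9:.0f} GFLOPs per token")
```

Under that assumption, per-token compute tracks the 55B active parameters rather than the 550B total, while memory still has to hold all of the weights.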
At its core is a hybrid Mamba-Attention MoE architecture: a Mamba-2 (state space) backbone with a few self-attention layers. This combination reduces attention cost and KV cache footprint in long contexts. These are two distinct quantities: the computational cost of attention scales quadratically with sequence length, while KV cache memory grows linearly with the number of tokens retained. By keeping attention to a few layers, the hybrid mitigates the quadratic component of prefill and curbs cache growth during decode. The result is a window of up to 1M tokens. NVIDIA also claims to beat other open LLMs on the RULER benchmark at 1M context. On the efficiency front, the model is pre-trained in NVFP4 (4-bit) on Blackwell hardware and adopts LatentMoE and multi-token prediction. It is not an architecture invented from scratch, but an engineered recombination of known techniques (Mamba-2, attention, MoE, 4-bit quantization) optimized for long-horizon agents.
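The difference between the two quantities can be made concrete with a back-of-the-envelope sketch. The layer counts, head counts, and head dimension below are hypothetical placeholders, not Nemotron 3 Ultra's published configuration; only the scaling behavior is the point.

```python
def kv_cache_bytes(seq_len, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size: one key and one value vector per token, per attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value

def attention_prefill_flops(seq_len, attn_layers, heads, head_dim):
    """FLOPs for the QK^T scores and the weighted sum over V (causal masking ignored)."""
    return 4 * attn_layers * heads * head_dim * seq_len ** 2

# Hypothetical shapes, chosen only to compare an all-attention stack
# with a hybrid that keeps attention in a few layers.
KV_HEADS, HEADS, HEAD_DIM = 8, 64, 128

for name, layers in [("all-attention stack", 64), ("hybrid, few attention layers", 8)]:
    for seq_len in (128_000, 1_000_000):
        cache_gb = kv_cache_bytes(seq_len, layers, KV_HEADS, HEAD_DIM) / 1e9
        pflops = attention_prefill_flops(seq_len, layers, HEADS, HEAD_DIM) / 1e15
        print(f"{name:30s} {seq_len:>9,d} tokens  "
              f"KV cache {cache_gb:8.1f} GB  attention {pflops:10.1f} PFLOPs")
```

Doubling the sequence length doubles the cache but quadruples the attention FLOPs, and both terms shrink in proportion to the number of attention layers kept. The Mamba-2 layers carry their own fixed-size state, which this sketch leaves out.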
The real competitive lever is speed: over 300 tokens/second on pre-release endpoints, versus 50-100 for Chinese peers. NVIDIA claims throughput 5.9x / 4.8x / 1.6x higher than GLM-5.1, Kimi-K2.6, and Qwen-3.5 in the 8k input / 64k output scenario, but these are in-house figures, measured against rivals chosen by NVIDIA. The technical report obtains them at max-throughput in NVFP4 on GB200, with Nemotron served via TRT-LLM and the competitors via vLLM. For each model, the best result was chosen, with or without speculative decoding. On raw intelligence, Artificial Analysis gives it 48 points on its Intelligence Index: first among US open models (Gemma 4 at 39, gpt-oss-120b at 33), but six points behind China's Kimi K2.6 (54). Independent analyses add two caveats. First: the score is on pre-release BF16 weights, not on the final NVFP4 version. Second: despite the open license, CUDA dependencies still steer developers toward NVIDIA hardware.
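To give the per-stream speeds some scale, the sketch below applies them to a long agentic output. It takes "64k" as 64,000 tokens and assumes a constant decode rate, both simplifications; it does not reproduce NVIDIA's max-throughput benchmark, which measures aggregate serving throughput rather than a single stream.

```python
OUTPUT_TOKENS = 64_000  # "64k output", taken here as 64,000 tokens (assumption)

def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock decode time at a constant rate (prefill time ignored)."""
    return tokens / tokens_per_second

RATES = [
    ("Nemotron 3 Ultra, pre-release endpoints", 300),  # "over 300" in the text
    ("Chinese peers, low end", 50),
    ("Chinese peers, high end", 100),
]

for label, rate in RATES:
    minutes = generation_seconds(OUTPUT_TOKENS, rate) / 60
    print(f"{label:42s} {rate:>4d} tok/s  ~{minutes:5.1f} min")
```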
Why it matters
- Frontier research: A frontier-scale Mamba-Attention hybrid with open weights shows that the quadratic cost of attention and the linear footprint of the KV cache in long contexts can be mitigated by a Mamba-2 backbone with only a few attention layers. At 550B scale, it gives researchers material for studying 1M-token contexts and state-space + MoE mechanisms. But it also confirms that the US open frontier still trails the Chinese one (Kimi K2.6 at 54 versus 48).
- LLM builders / devs: An open license, a 1M-token context, and over 300 tokens/second make the model deployable for long-horizon agents, with contained inference costs thanks to the 55B active parameters. But deployment remains data-center grade: the model card indicates a minimum of 8x GB200/B200/GB300/B300, 16x H100, or 8x H200. Also worth weighing before adoption: the independent benchmarks are on pre-release BF16 weights (not the final NVFP4 version), and the ecosystem pushes toward NVIDIA hardware via CUDA.
Trump Signs AI Executive Order: Voluntary Model Reviews and No Mandatory Licensing
On 2 June 2026 the White House opts for voluntary self-regulation of frontier models: 30 days of early government access and no licensing requirements. But the voluntary nature of the mechanism is already being contested.
On 2 June 2026 President Donald Trump signed the executive order "Promoting Advanced Artificial Intelligence Innovation and Security", which puts oversight of frontier models on a voluntary rather than mandatory footing. The order directs federal agencies to define a voluntary framework within 60 days. Under it, developers grant the government access to "covered frontier models" up to 30 days before release to other parties, so that Treasury, NSA and CISA can assess their cyber capabilities (Section 3(b)(ii)). Which models fall within the perimeter has not yet been settled. According to David Sacks, the White House AI czar, the framework is intended for models with a "meaningful step-change" in cyber capabilities. The precise criterion will be defined by the classified benchmarking process run by the NSA. In short, the review is not meant to cover every update.
The text is explicit about what it does NOT do: Section 3(c) prohibits any licensing requirement, pre-clearance or permit to develop, publish or distribute new models. According to the White House fact sheet, the order also establishes, within 30 days, an "AI cybersecurity clearinghouse" led by the Treasury together with the National Cyber Director, NSA and CISA. The body will coordinate vulnerability scanning, validation and patch distribution.
This is a softened version. The earlier draft, shelved on 21 May over concerns about competitiveness with China, went as far as 90 days of review, later cut to 30 as a concession to industry. The voluntary nature, however, remains contested. Experts at the Council on Foreign Relations warn that effectiveness will hinge on genuine collaboration more than on the text, and that patching remains unresolved. On the security front, as Roll Call reports, Senator Josh Hawley and several organizations are calling on Congress to make the review mandatory.
Why it matters
- Entrepreneurs: The choice of voluntary self-regulation — with an explicit ban on mandatory licensing and pre-clearance — reduces regulatory uncertainty for those developing or selling AI in the US. It may also favor 'trusted partner' programs and indirect benefits for cyber/IT contractors, even though the text does not create a formal contractual status. But the voluntary nature is already being contested (Hawley and several organizations are pushing to make it mandatory via Congress): the framework could tighten.
- ICT engineers / IT managers: The vulnerability clearinghouse and the 30-day early access aim to give defenders an edge over the risks of frontier models. Those running infrastructure and security should, however, weigh the experts' caveats: classified benchmarking and NDAs can delay the arrival of models to defenders, and patching remains unresolved for less structured operators.
Gemma 4 12B: Google Drops the Encoder and Brings Native Multimodality to the Laptop
Google DeepMind releases an open 12B model that projects images and raw audio directly into token space, approaching a 26B MoE with less than half the memory. The idea, however, builds on already-known early fusion work, and the '16 GB' claim comes with real caveats.
On June 3, 2026, Google DeepMind released Gemma 4 12B, an open multimodal model under the Apache 2.0 license. The novelty is architectural: it eliminates the dedicated encoders for images and audio. In place of the roughly 550M-parameter vision encoder, a lightweight ~35M-parameter linear projection maps image patches (48×48 px tiles) directly into the model's embedding space. Raw 16 kHz audio, split into 40 ms frames, is projected into the same space as the text tokens, entirely removing the conformer audio encoder of the previous models. It is the first mid-size Gemma with native audio input. The Hugging Face model card lists 11.95 billion parameters, a 256K-token context, and support for images, audio, and video. The claimed gain is performance close to the 26B MoE with less than half the memory, runnable on a 16 GB laptop.
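The framing numbers above translate into simple arithmetic. The sketch below assumes non-overlapping audio frames, padding of partial image tiles, and one embedding per frame or tile; none of these details is stated in the launch materials, so the counts are illustrative only.

```python
import math

SAMPLE_RATE_HZ = 16_000  # raw audio sample rate, from the article
FRAME_MS = 40            # audio frame length, from the article
TILE_PX = 48             # image tile edge, from the article

def samples_per_frame() -> int:
    """Raw samples contained in one 40 ms frame at 16 kHz."""
    return SAMPLE_RATE_HZ * FRAME_MS // 1000

def audio_frames(duration_s: float) -> int:
    """Frames needed for a clip, assuming non-overlapping frames (assumption)."""
    return math.ceil(duration_s * 1000 / FRAME_MS)

def image_tiles(width_px: int, height_px: int) -> int:
    """48x48 tiles needed to cover an image, padding partial tiles (assumption)."""
    return math.ceil(width_px / TILE_PX) * math.ceil(height_px / TILE_PX)

print(samples_per_frame(), "samples per frame")
print(audio_frames(10), "frames for 10 s of audio")
print(image_tiles(768, 768), "tiles for a 768x768 image")
```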
The approach, however, is not unprecedented: as several developers point out, encoder-free early fusion had already been explored by Meta FAIR's Chameleon and the EVE series. Gemma 4's real novelty is applying it to images and audio together at this scale. Material caveats remain: MarkTechPost's independent analysis notes that removing the encoder shifts much of the visual understanding onto the LLM backbone (a quality/efficiency trade-off) and that the launch materials did not include complete benchmarks. The technical analysis by Maarten Grootendorst points the same way, though he is affiliated with Google DeepMind and so is not an independent voice. On accessibility, the community clarifies that 16 GB of VRAM (not system RAM) is required, and that this threshold assumes quantized versions, whereas the benchmarks run in BF16 (~24 GB).
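The gap between the 16 GB claim and the ~24 GB BF16 figure can be checked with a weights-only estimate. This sketch counts parameter storage alone and ignores the KV cache, activations, and any quantization scale factors, so real usage will be higher; the 8-bit and 4-bit rows are generic examples, not specific Gemma 4 releases.

```python
PARAMS = 11.95e9  # parameter count from the Hugging Face model card

def weight_gb(params: float, bits_per_param: int) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9

for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = weight_gb(PARAMS, bits)
    verdict = "fits" if size < 16 else "does not fit"
    print(f"{label:6s} ~{size:5.1f} GB of weights, {verdict} in 16 GB before overhead")
```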
Why it matters
- Frontier research: The verifiable contribution is not an invention from scratch but an engineering milestone: it shows that encoder-free early fusion can approach 26B-class quality at halved memory costs and at the edge, applying the idea to images and audio together. Open weights and Apache 2.0 make it a reproducible testbed for end-to-end multimodal fine-tuning, but the lineage (Chameleon, EVE) and the caveats on benchmarks and real-world memory must be kept in mind when assessing its significance.
Microsoft Unveils Project Solara, the Chip-to-Cloud Platform for Devices That Run Agents Instead of Apps
At Build 2026, Microsoft revealed Project Solara, an AOSP-based system for 'agent-first' devices. It was shown with a wearable badge and a desktop companion. For now these are prototypes, not products on sale.
At Build 2026, Microsoft announced Project Solara, a "chip-to-cloud" platform designed from the ground up for agent-first devices. The idea is to shift "from software you open to intelligence you invoke" — that is, to invoke an agent instead of opening apps. The operating system is called Microsoft Device Ecosystem Platform (MDEP) and is built on AOSP (Android Open Source Project), not on Windows. The devices effectively work as interfaces to agents hosted in the cloud (Microsoft Command Line announcement).
Microsoft showed two reference designs. The first is a wearable badge with Qualcomm silicon: a touchscreen, a far-field microphone array, a fingerprint sensor for Windows Hello for Business, and a side camera. Connectivity includes 5G, Wi-Fi, Bluetooth, and GNSS. The second is a desktop companion with MediaTek silicon: a touchscreen, facial authentication, a dual microphone array, a UWB presence sensor, and two USB-C ports. Connected to an external monitor, it can act as a Windows 365 client. The platform supports multiple agents without a "dominant" agent and adapts the interface with a just-in-time UI. On the enterprise side there are Intune, Entra ID, and a physical button to mute the microphone (Engadget).
The caveat is substantial: these are prototypes, not products you can buy. Microsoft will not build the final devices but will supply the reference designs to OEMs. To be certified, devices will have to use "approved chipsets" (Tom's Hardware). The external pilot — with AccuWeather, Best Buy, CVS Health, Levi's, and Target — will start "in the coming months"; the company itself is holding back: "We are still early. I don't want to over-promise."
Why it matters
- End users: It's a concrete preview of the post-smartphone era: devices you use by talking to an agent instead of tapping apps. But it's an enterprise vision still at the prototype stage — there's nothing to buy today, and the real usefulness will depend on how reliable and non-intrusive the agents become.
Apica/Omdia Study: Agentic AI Could Make Telemetry Explode by Up to 9.5x, and Enterprises Aren't Ready
An Omdia study commissioned by Apica projects an average 9.5x increase in the telemetry generated by agentic AI within two years; 54% of enterprises have already seen their data triple in 12 months, and observability costs are stalling projects. The figures, however, should be treated as vendor data.
A new study conducted by Omdia (Informa TechTarget group) on behalf of Apica argues that the adoption of agentic AI will trigger an explosion of telemetry data (Apica press release). Across more than 300 IT decision-makers in North America and Western Europe, respondents expect on average a 9.5-fold increase in the volume of telemetry produced by agentic workloads within two years. The UK, Switzerland, Germany, and Austria account for roughly half the sample. This is a projection, not something that has already happened: 44% expect growth of between 6x and 100x (UK Tech News).
The figure already observed is more modest, but significant: 54% of enterprises have seen their telemetry volume triple over the past 12 months, on average 3.7x year over year. AI/ML workloads now account for roughly 43% of that growth. 83% place AI observability among their 2026 priorities.
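The projected and observed figures use different time bases, so it helps to put them on the same footing. The sketch below annualizes the two-year projection and compounds the observed yearly rate, assuming constant growth in both cases. The two numbers describe different things (agentic workloads versus overall telemetry), so the comparison is indicative at best.

```python
PROJECTED_TWO_YEAR_MULTIPLE = 9.5  # expected growth in agentic telemetry over two years
OBSERVED_YEARLY_MULTIPLE = 3.7     # average year-over-year growth already reported

def annualized(multiple: float, years: float) -> float:
    """Constant yearly factor that compounds to `multiple` over `years`."""
    return multiple ** (1 / years)

def compounded(yearly_multiple: float, years: float) -> float:
    """Total growth if `yearly_multiple` is sustained for `years`."""
    return yearly_multiple ** years

print(f"9.5x over two years is ~{annualized(PROJECTED_TWO_YEAR_MULTIPLE, 2):.2f}x per year")
print(f"3.7x per year sustained for two years is ~{compounded(OBSERVED_YEARLY_MULTIPLE, 2):.1f}x")
```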
The crux is cost. In 69% of agentic projects, observability spending exceeds compute and infrastructure combined, with a stated average annual spend of $3.17 million. What's more, 59% say they have already canceled or postponed an agentic deployment because monitoring it cost too much. “Scalability is the main reason agentic projects fail to take off,” sums up Omdia analyst Torsten Volk.
The research, however, should be read with caution: it is PR-driven, commissioned by Apica, which sells precisely the kind of low-cost telemetry pipelines in question (it claims up to 40% lower TCO). The direction nonetheless finds independent corroboration: OneUptime, unconnected to the study, estimates that a single AI pipeline generates 10-50x more telemetry than a traditional call. Growth on that scale throws the volume-based pricing of Datadog, New Relic, and Splunk into crisis.
Why it matters
- ICT engineers / IT managers: Anyone managing infrastructure and observability budgets should plan for telemetry growth now — pipelines, sampling, data governance — before scaling up agents: with the volume-based pricing of today's tools, monitoring can cost more than the compute itself and even bring projects to a halt. Even treating the 9.5x as a vendor estimate, the direction is confirmed by independent sources, so the capacity and cost risk is real and should be sized up in good time.
GitHub Copilot: New Tabbed UI in the CLI, Air-Gapped BYOK on the VS Code Side
GitHub is updating on two distinct fronts. The Copilot CLI gains a tabbed interface, rubber duck, prompt scheduling, and voice input. Air-gapped BYOK with custom endpoints, by contrast, arrives in the May releases of Copilot in VS Code.
GitHub has updated two distinct fronts of the Copilot experience. The Copilot CLI changelog of June 2, 2026 introduces an experimental terminal interface, enabled with /experimental on. It offers Session, Issues, Pull Requests, and Gists tabs, themed semantic colors, and screen reader support enabled by default. Going generally available are the "rubber duck" — an internal critic-agent that reviews plans, design, implementation, and tests via /rubber-duck — and voice input, which runs locally, keeping the audio on the machine. Prompt scheduling, by contrast, remains experimental, with /every (repeats at intervals, e.g. /every 30m run the frontend tests) and /after (single delayed execution).
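The difference between /every and /after can be illustrated with a small parser. This is a hypothetical sketch, not the Copilot CLI's implementation: the function name, the returned fields, and the "s" and "h" units are assumptions (only "30m" appears in the changelog example), and /after is assumed to take the same duration-then-prompt form.

```python
import re

# Only "m" appears in the changelog example; "s" and "h" are assumed here.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_PATTERN = re.compile(r"^/(every|after)\s+(\d+)([smh])\s+(.+)$")

def parse_schedule(command: str) -> dict:
    """Split a scheduling command into repeat flag, delay, and prompt (hypothetical)."""
    match = _PATTERN.match(command.strip())
    if match is None:
        raise ValueError(f"not a recognized scheduling command: {command!r}")
    kind, amount, unit, prompt = match.groups()
    return {
        "repeat": kind == "every",  # /every repeats, /after runs once
        "delay_seconds": int(amount) * _UNIT_SECONDS[unit],
        "prompt": prompt,
    }

print(parse_schedule("/every 30m run the frontend tests"))
```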
Air-gapped BYOK, which the briefing attributes to the CLI, actually belongs to the Copilot in VS Code changelog (May releases, v1.120–v1.123). There you find the "bring-your-own-key" models that can run in isolated environments without GitHub authentication, a "Custom Endpoint" provider compatible with chat completions, responses, or messages, the "configurable utility models" (for titles, summaries, commit messages, and intent detection), and the reasoning effort controls. The scope must be delimited, however: according to the VS Code 1.122 release notes, BYOK without sign-in covers chat, tools, and MCP server, while inline suggestions and Next Edit Suggestions (NES) still require GitHub authentication — a decisive detail precisely for anyone planning an air-gapped deployment.
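For planning purposes, the scope described above can be captured as a simple lookup. The sketch below is a planning aid built only from the feature split reported here; the set names and the function are hypothetical, and it does not query VS Code or GitHub in any way.

```python
# Feature split as reported for BYOK without sign-in in VS Code.
WORKS_WITHOUT_SIGN_IN = {"chat", "tools", "mcp server"}
REQUIRES_GITHUB_AUTH = {"inline suggestions", "next edit suggestions"}

def blocked_in_air_gap(required_features) -> list:
    """Return the requested features that still need GitHub authentication."""
    requested = {feature.lower() for feature in required_features}
    return sorted(requested & REQUIRES_GITHUB_AUTH)

plan = ["Chat", "MCP server", "Inline suggestions"]
print("blocked without sign-in:", blocked_in_air_gap(plan))
```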
The clarification matters because the CLI had already gained BYOK and local models in its April 7, 2026 update — with an offline mode (COPILOT_OFFLINE=true), providers such as Azure OpenAI, Anthropic, Ollama, vLLM, and Foundry Local, and optional GitHub authentication. Air-gapped, in short, is not a novelty of the June CLI: in this round it concerns VS Code.
Why it matters
- LLM builders / devs: The leverage operates on two levels: the CLI gains ergonomics and automation (tabbed UI, recurring prompt scheduling, automatic review via rubber duck), while control over privacy and model routing (BYOK in isolated environments, custom endpoints pointing to your own or local models) in this round lives on the VS Code side (and, since April 7, in the CLI with offline mode). Knowing where each capability sits avoids planning an air-gapped deployment on the wrong tool, and even on the VS Code side the isolation is not total: without sign-in, BYOK applies to chat, tools, and MCP server, but inline suggestions and NES remain tied to GitHub authentication. It is also worth remembering that the tabbed UI and scheduling are still experimental (opt-in via /experimental on).