EDITOR’S NOTE

The week's stories share a single uncomfortable subtext: capability is accelerating, but trust is fracturing.

  • GPT-4.5 Pro just posted a 30% jump on a research physics benchmark. That's not a rounding error.

  • A new training method called POET-X cuts memory costs for LLMs without touching performance. Quiet paper, loud implications.

  • Qwen's smallest models, three generations in, are now genuinely embarrassing models that cost ten times as much to run.

  • The U.S. military kept its Claude contract. Its contractors are quietly shopping elsewhere.

More capability, more options, more defections. The models are getting better faster than anyone's getting comfortable.

SIGNAL DROP

1. GPT-4.5 Pro Hits 30% on Physics Research Benchmark

Artificial Analysis scored GPT-4.5 Pro at 30% on CritPT, a benchmark designed to test reasoning on unsolved scientific problems, and called it the largest single-release gain they've recorded. Expensive to run, per community discussion on r/singularity. Frontier labs chasing this benchmark should worry: cost-per-point is climbing fast, and the gap between benchmark scores and real-world utility remains unproven.

2. POET-X Cuts LLM Training Memory Overhead

Researchers published POET-X on arXiv, a spectrum-preserving training method that reduces the memory and compute costs of orthogonal equivalence training. Leaner training pipelines matter more than ever as model sizes grow. Teams running fine-tuning at scale on constrained hardware should pay attention here.
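
The blurb doesn't spell out where POET-X's savings come from, but the underlying idea of orthogonal equivalence training is easy to sketch: freeze a base weight matrix and learn orthogonal rotations around it, so the singular-value spectrum of the weights never changes during training. A minimal PyTorch illustration under that assumption (class and variable names are ours, not the paper's):

  import torch
  import torch.nn as nn
  from torch.nn.utils.parametrizations import orthogonal

  class OrthogonalEquivalenceLinear(nn.Module):
      # Illustrative sketch only. A frozen base matrix W0 is wrapped in two
      # learned orthogonal maps R (output side) and P (input side). Because
      # R and P are orthogonal, R @ W0 @ P has the same singular values as W0,
      # so the weight spectrum is preserved while the layer still adapts.
      def __init__(self, in_features, out_features):
          super().__init__()
          w0 = torch.randn(out_features, in_features) / in_features ** 0.5
          self.register_buffer("w0", w0)  # frozen base weights, never updated
          self.R = orthogonal(nn.Linear(out_features, out_features, bias=False))
          self.P = orthogonal(nn.Linear(in_features, in_features, bias=False))

      def forward(self, x):
          w = self.R.weight @ self.w0 @ self.P.weight  # same spectrum as W0
          return x @ w.T

The only trainable state in this sketch is the two rotation matrices; how POET-X factors or approximates them to cut memory further is exactly what you'd need the paper for.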

3. Qwen's Smallest Models Keep Getting Sharper

Alibaba's Qwen line, from 2.5 through 3.5, shows consistent gains even at sub-1B parameter counts, according to r/LocalLLaMA analysis. The 0.8B Qwen 3.5 is competitive enough that on-device AI developers betting on bigger models winning at the edge are probably wrong.

DEEP DIVE

The Setup

Anthropic built its reputation on safety. Constitutional AI, responsible scaling policies, the whole careful-by-design ethos. That's the brand. So the revelation that Claude is being used in targeting decisions during active U.S. aerial operations against Iran is worth sitting with for a moment.

Not as abstraction. As reality.

What's Happening

According to TechCrunch, Anthropic's models are currently being used to assist with targeting decisions as the U.S. conducts aerial strikes on Iran. The U.S. military remains a Claude customer. And simultaneously, defense-tech clients (the contractors and startups building on top of Anthropic's API for military applications) are reportedly fleeing.

That's a strange split. The end customer stays. The middlemen leave.

My read: the defense-tech clients leaving probably reflects a different kind of concern. If you're a startup building military tooling, you need a reliable, committed AI partner. Anthropic's public positioning on safety and its historically cautious approach to weapons-adjacent applications creates uncertainty for companies that need to move fast and can't afford a supplier who might pull the plug citing ethical concerns. So they're hedging toward providers with fewer public scruples. OpenAI, after reversing its own military-use restrictions in early 2024, is the obvious beneficiary here.

The Technical Stakes

Using a large language model for targeting decisions isn't like using one to write emails. The failure modes are categorically different. An LLM that hallucinates a product recommendation costs you a customer. An LLM that misclassifies a target or surfaces a false positive in a strike chain costs something else entirely.

The core issue is that models like Claude are probabilistic systems trained on human-generated data. They're not deterministic weapons-grade software with formal verification. They produce outputs that are usually right. "Usually" is doing enormous work in that sentence when the application is kinetic military action.

I don't know (and the source doesn't specify) exactly where in the targeting pipeline Claude sits. There's a significant difference between "summarizes intelligence reports for human analysts" and "flags targets for strike approval." Both could technically be described as "used for targeting decisions." The distinction matters enormously, and the reporting leaves it frustratingly vague.

The Anthropic Paradox

Anthropic has always positioned itself as the safety-first lab. Its Acceptable Use Policy has historically been more restrictive than competitors. But the company also needs revenue, and U.S. government contracts are among the largest and most stable available in enterprise AI.

So here's the tension. Anthropic's safety positioning attracts the researchers, the policy community, the enterprise clients nervous about AI risk. But that same positioning creates friction with defense clients who want a vendor that won't publicly agonize about whether their use case is ethical. And now Anthropic appears to be serving both: the military customer who stayed, and the public reputation built on careful AI development.

That's a hard needle to thread. And it's getting harder as the applications get more visible.

The Broader Industry Problem

This isn't really about Anthropic specifically. Every frontier AI lab is navigating some version of this. The U.S. government is the biggest spender on AI infrastructure in the world (my analysis: federal AI spending is tracking toward tens of billions annually, though precise figures are hard to pin down given classification). Turning away that money is a genuine competitive disadvantage when your rivals are taking it.

But the defense-tech client exodus suggests the market is already segmenting. Labs that want government prime contracts will need to commit clearly to military use cases. Labs that want to maintain the safety brand will face pressure to draw explicit lines. And the middle ground, where Anthropic currently sits, gets smaller every time a conflict goes public.

My Take

I think Anthropic is making a calculated bet that institutional legitimacy with the U.S. government outweighs the reputational cost with the safety-focused community. And short-term, that might be right. But there's a version of this where the next high-profile AI-assisted military incident puts Claude's name in headlines in a way that's very hard to walk back. The defense-tech clients leaving aren't fleeing because they oppose military AI. They're leaving because Anthropic seems uncertain about its own commitments. That uncertainty is the real problem. Pick a lane.

- The AI finds the signal. We decide what it means.

PARTNER PICK

Synthesia turns text into video without a camera, actor, or green screen. You write a script, pick an avatar, and get a finished video in minutes. The output quality has gotten genuinely good. Lip sync is tight. Lighting looks natural.

Worth trying if you're drowning in async updates, need localized training content, or want to test video messaging without the production overhead. The avatars still feel slightly uncanny if you stare too long, and custom voices cost extra. But for 90% of internal comms and educational use cases, it's faster and cheaper than the alternative.

Some links are affiliate links. We earn a commission if you subscribe. We only feature tools we'd use ourselves.

TOOL RADAR

Runs your query through ChatGPT, Gemini, Claude, Grok, and up to 10 other models simultaneously, then surfaces all the responses side by side. The pitch: consensus across models means fewer hallucinations slipping through. Interesting idea, and honestly useful for high-stakes queries where you'd otherwise tab between chatbots manually. The real question is whether "more answers" actually means "better answers." Jury's still out. (If you want to test the idea yourself, a DIY fan-out sketch follows this entry.)

Worth it if: you need cross-model verification and hate browser tab chaos.
Skip if: you trust one model and want speed.
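
The core pattern behind this kind of tool is a plain fan-out: the same prompt goes to several providers concurrently and the answers come back side by side. A minimal Python sketch with the provider calls stubbed out as hypothetical functions (swap in whichever SDKs you actually use):

  from concurrent.futures import ThreadPoolExecutor

  # Hypothetical provider wrappers: each takes a prompt and returns a string.
  # Replace the stubs with real SDK calls (OpenAI, Anthropic, Google, xAI, ...).
  def ask_gpt(prompt):    return "stubbed GPT answer"
  def ask_claude(prompt): return "stubbed Claude answer"
  def ask_gemini(prompt): return "stubbed Gemini answer"

  PROVIDERS = {"gpt": ask_gpt, "claude": ask_claude, "gemini": ask_gemini}

  def fan_out(prompt):
      # Send one prompt to every provider at once; collect {name: answer}.
      with ThreadPoolExecutor(max_workers=len(PROVIDERS)) as pool:
          futures = {name: pool.submit(fn, prompt) for name, fn in PROVIDERS.items()}
          return {name: f.result() for name, f in futures.items()}

  if __name__ == "__main__":
      for name, answer in fan_out("Summarize the EU AI Act in two sentences.").items():
          print(f"--- {name} ---\n{answer}\n")

What the product adds is the comparison layer on top; the fan-out itself is an afternoon of work.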

A research framework from OpenAI designed to measure how AI tools affect student learning over time across different educational settings. This is infrastructure for researchers and institutions, not a classroom product. No pricing mentioned. Useful if you're designing AI education studies. Not useful if you want a tool students actually touch.

Worth it if: you're running formal AI education research.
Skip if: you want something teachers can use today.

Some links are affiliate links. We earn a commission if you subscribe. We only feature tools we'd use ourselves.

TECHNIQUE

PROMPT CORNER

The Persona Sandwich

Most prompts tell the model what to do. This one tells it who to be, then what to do, then who's asking. That third layer is the part people skip.

You are a senior backend engineer who has shipped production systems at scale and has strong opinions about reliability over cleverness.

Review this API design and identify failure modes:
[paste your design]

I'm a mid-level engineer preparing to present this to a skeptical infrastructure team. Be direct. Don't soften criticism.

The "who's asking" layer matters because it calibrates tone and depth simultaneously. Without it, you get generic feedback. With it, the model adjusts for your actual context: what you already know, what pressure you're under, and how blunt you need it to be.

Use this when you need expert critique, not cheerleading. Code reviews, strategy documents, pitch decks before the real pitch. Anywhere a yes-and response would waste your time.

The three layers take 30 extra seconds to write. The output quality difference is not subtle.

QUICK LINKS

RoboPocket: Improve Robot Policies Instantly with Your Phone
Use your phone's camera to collect robot training data with AR feedback, doubling data efficiency without needing physical robots.

Monitoring Emergent Reward Hacking During Generation via Internal Activations
Detect when LLMs game their reward signals mid-generation by monitoring internal activations, not just final outputs.

Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models
Text-to-image models with multiple encoders are vulnerable to backdoor attacks requiring minimal parameter tuning.

SimpliHuMoN: Simplifying Human Motion Prediction
Single transformer model handles both trajectory and pose prediction, beating specialized models on established benchmarks.

DQE-CIR: Distinctive Query Embeddings in Composed Image Retrieval
New approach fixes contrastive learning's tendency to suppress semantically related images in text-guided image search.

PICK OF THE WEEK

Tools gaining traction this week based on our source data.

  • Artificial Analysis Physics Benchmark — GPT-4.5 Pro hits 30% on research physics problems. Largest single-release jump we have seen.

  • POET-X — Memory-efficient LLM training via orthogonal transformations. Actually solves the overhead problem, not just in theory.

  • NotebookLM Cinematic Video Overviews — Google's NotebookLM now generates video summaries of research papers. Useful for quick literature scans.

Some links are affiliate links. We earn a commission if you subscribe. We only feature tools we'd use ourselves.

How was today's issue?

Still doing things manually that AI could handle?

Let's fix that.

The AI finds the signal. We decide what it means.

Research and drafting assisted by AI. All content reviewed, edited, and approved by a human editor before publication.

Keep Reading