PlaneWX Research | March 2026

8 Debates. 3 Days.
One Safety-Critical Decision.

When you’re building software that tells pilots whether it’s safe to fly, “good enough” isn’t good enough. We ran our turbulence detection methodology through 8 separate adversarial AI debates over 3 days — each one refining the approach based on what the previous panel uncovered.

By Mark Wolfgang, Founder & CEO, PlaneWX

The Stakes

PlaneWX’s turbulence detection pipeline analyzes data from three numerical weather prediction models — HRRR, GFS, and ECMWF — to warn pilots about dangerous conditions along their route. We compute vertical wind shear and Richardson number at every altitude level from each model, then build consensus to produce a single severity assessment: smooth, light, moderate, or severe.
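As a concrete sketch, the per-layer calculation looks roughly like this (illustrative Python, not PlaneWX’s production code; the function name, units, and bulk formulation are assumptions):

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def layer_stability(u_lo, v_lo, u_hi, v_hi, theta_lo, theta_hi, dz_m):
    """Bulk vertical wind shear (s^-1) and Richardson number for one
    model layer. Winds in m/s, potential temperatures in K, layer
    thickness in m. Illustrative only; the production formulation
    may differ."""
    shear = math.hypot(u_hi - u_lo, v_hi - v_lo) / dz_m
    theta_mean = 0.5 * (theta_lo + theta_hi)
    # Brunt-Vaisala frequency squared: N^2 = (g / theta) * d(theta)/dz
    n_squared = (G / theta_mean) * (theta_hi - theta_lo) / dz_m
    # Ri = N^2 / (dU/dz)^2; infinite when there is no shear at all
    return shear, (n_squared / shear**2 if shear > 0 else math.inf)
```

A layer whose shear is strong relative to its stratification drives Ri down toward the instability thresholds discussed later in the article.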

Two planned safety improvements needed validation before shipping:

The Shear Guard Problem

Our Richardson number stability check was silently downgrading MODERATE and SEVERE turbulence warnings to LIGHT in cases where a thin shear layer existed inside a coarse model layer. The Ri calculation — which averages over thick pressure layers — was diluting the signal from concentrated shear zones, creating dangerous false-clear results for pilots.
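The dilution is easy to see with made-up numbers (hypothetical values, chosen only to illustrate the effect):

```python
# A thin 500 m layer carries the entire 10 m/s wind change inside a
# 2000 m model layer. Bulk averaging cuts the shear by 4x, and since
# Ri scales as 1/shear^2, it inflates Ri by 16x.
n_sq = 1.0e-4                      # stratification N^2, s^-2 (hypothetical)
shear_thin = 10.0 / 500.0          # 0.02 s^-1 inside the thin layer
shear_bulk = 10.0 / 2000.0         # 0.005 s^-1 after layer averaging

ri_thin = n_sq / shear_thin**2     # 0.25: right at the KHI onset threshold
ri_bulk = n_sq / shear_bulk**2     # 4.0:  looks solidly stable
```

The same atmosphere reads as dynamically unstable or comfortably stable depending only on the averaging depth, which is exactly the false-clear mechanism described above.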

The Rotor Detection Gap

Our mountain wave module reported severity based only on wind speed, with no ability to distinguish benign laminar waves from violent rotor turbulence — a leading cause of GA mountain fatalities per NTSB data. We needed Froude number analysis to separate the two.

Getting the thresholds wrong in either direction was unacceptable: too permissive and pilots fly into danger, too aggressive and alert fatigue makes them ignore real warnings. This isn’t the kind of problem you solve with one AI conversation.

The Iteration

Rather than running one big debate, I took an iterative approach using Nestr’s multi-model debate platform. Run a debate, implement the feedback, then run again with the updated methodology. Each debate built on what the previous ones found. Over 3 days, 8 completed debates progressively refined both systems.

8 total debates · 6 unique AI models · 64 total rounds · 3 days to iterate & refine

Debate 1: Initial Methodology Review

o3, Claude Opus 4.6, Grok 4.1 Fast, Claude Sonnet 4.6

Identified the Ri downgrade masking problem and the missing rotor detection gap. Consensus score 72/100 — sound approach, specific calibration concerns.

Debates 2–3: Consistency Verification & Deep-Dive

Same panel — re-run to confirm findings weren’t random

Models converged on the same core issues across two independent runs, confirming the Ri masking was a real problem. Third run surfaced the altitude-banding approach for shear guard thresholds.

Debates 4–5: Fresh Panel & Refined Methodology

Gemini 2.5 Pro replaced Claude Sonnet — new perspective

Challenged assumptions the original panel had accepted. Evaluated refined altitude-banded thresholds (18/14/12 kt/1000ft) and the proposed Froude number formulation.

Debate 6: The Definitive Run

o3, Gemini 2.5 Pro, Grok 4.1 Fast, Claude Opus 4.6

Focused on the interaction between both changes and priority assessment. Grok 4.1 demanded Monte Carlo sensitivity testing and GA-specific PIREP validation. o3 provided detailed statistical evidence including ROC curves and confidence intervals.

Debate 7: Bug Investigation

Targeted spin-off debate

A boundary layer shear over-classification bug discovered during implementation was isolated and fixed using the debate format as a diagnostic tool.

Debate 8: Mountain Wave Deep-Dive

o3, Grok 4.1 Fast, Claude Sonnet 4.6, Grok 3

Validated ridge bearing estimation and perpendicular wind decomposition — the inputs that feed rotor detection. This debate became its own research article.

The Panel

Six frontier AI models participated across the 8 debates. Rotating the panel composition between runs was deliberate — different models bring different biases, and swapping one out catches assumptions the previous lineup accepted unchallenged.

Proposer · o3: Quantitative defense with validation statistics, ROC curves, confidence intervals

Challenger · Grok 4.1 Fast: Demanded Monte Carlo testing and caught the TAMDAR/GA data bias

Synthesizer · Claude Opus 4.6: Final verdict synthesis across all positions

Challenger · Claude Sonnet 4.6: Precise accountability, every claim matched to evidence

Challenger · Gemini 2.5 Pro: Fresh perspective that challenged panel consensus

Challenger · Grok 3: Real-world applicability and pilot trust concerns

From the Transcript

Here’s what it sounds like when frontier AI models argue about turbulence thresholds for general aviation:

GROK 4.1 FAST — CHALLENGER

“Your NASA TAMDAR 85th percentile rationale for 18/14/12 kt/1000ft thresholds assumes ‘discomfort reports’ proxy for true MOD+ turbulence, but TAMDAR data (from airliners, not GA) correlates poorly with PIREPs below 12kt shear due to aircraft mass differences — GA light singles report discomfort at 8–10 kt/1000ft per FAA AC 00-6B.”

O3 — PROPOSER

“We perturbed every matched PIREP 10,000 times by random draws within the operational tolerances — temporal separation ±30 min, spatial separation ±0.25°, altitude ±1 half-layer. HRRR 3-km hourly analyses supplied the local shear gradient tensor so that each offset translated into a physically consistent shear change. The guard trigger at 18 kt retains ≥16 kt effective separation from the LIGHT threshold even under worst-case offsets.”

GROK 4.1 FAST — CHALLENGER, ROUND 4

“Your perturbations used HRRR analyses (near-truth), but PlaneWX ingests forecasts (HRRR up to 18h, GFS/ECMWF 0–120h) with 2–5x larger shear error per NOAA HRRRv4 RMSE stats: forecast shear error ~3kt/1000ft at t+12h vs analysis. Re-run the Monte Carlo on 2022 forecast ensembles matching your Open-Meteo pipeline.”

O3 — PROPOSER, FINAL ROUND

“HRRR’s documented low-bias in low-level shear actually widens, not shrinks, the guard’s safety buffer. A 25% negative bias means that when the forecast shows 18 kt/1000ft the real atmosphere is closer to 22–24 kt — squarely in bona-fide MOD territory. The probability mass that lifts a true 10 kt LIGHT environment above the 18 kt trigger remains 1.6%, comfortably below our 2% operational false-alarm ceiling.”

What We Shipped

The 8 debates didn’t just validate our approach — they reshaped it. Here are the concrete changes that made it into production, in the order the panels recommended we ship them.

Shear Guard Protection

When resolved wind shear exceeds altitude-specific thresholds, the Richardson number stability cap is bypassed — the shear itself is strong enough to indicate real turbulence regardless of what bulk Ri says. Every panel independently concluded this had higher safety impact than any other change.

18 kt/1000 ft: low altitude (1000–750 hPa)
14 kt/1000 ft: mid altitude (750–500 hPa)
12 kt/1000 ft: upper altitude (500–100 hPa)

Validated via Monte Carlo (10,000 perturbations), 1,034 GA-only PIREP overlays, and ROC analysis (AUC 0.82 with Ri gating).
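The guard logic reduces to a small amount of code (an illustrative sketch built on the published thresholds; the function names and the integer severity encoding are assumptions, not PlaneWX’s implementation):

```python
SEVERITIES = ["smooth", "light", "moderate", "severe"]  # index = severity level

def shear_guard_active(shear_kt_per_kft, pressure_hpa):
    """True when resolved shear meets the altitude band's threshold,
    meaning the bulk-Ri stability cap should be bypassed."""
    if pressure_hpa > 750:               # low altitude: 1000-750 hPa
        return shear_kt_per_kft >= 18.0
    if pressure_hpa > 500:               # mid altitude: 750-500 hPa
        return shear_kt_per_kft >= 14.0
    return shear_kt_per_kft >= 12.0      # upper altitude: 500-100 hPa

def apply_ri_cap(severity_idx, ri, shear_kt_per_kft, pressure_hpa):
    """Cap severity at LIGHT when bulk Ri looks stable (>= 0.5),
    unless the shear guard fires."""
    if ri >= 0.5 and not shear_guard_active(shear_kt_per_kft, pressure_hpa):
        return min(severity_idx, SEVERITIES.index("light"))
    return severity_idx
```

A MODERATE assessment (index 2) with a stable-looking Ri of 0.8 survives when shear is 19 kt/1000 ft at 800 hPa, but is capped to LIGHT at 10 kt/1000 ft.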

Kelvin-Helmholtz Instability Boost

When the Richardson number drops below 0.25 — the onset threshold for Kelvin-Helmholtz instability — and shear exceeds 4 kt/1000ft, turbulence severity is boosted by one level. KHI produces the rolling, breaking-wave turbulence that can be particularly violent for GA aircraft. The boundary layer is excluded to prevent false boosts from normal surface-layer shear.

Ri < 0.25 = dynamically unstable · Ri 0.25–0.5 = transitional · Ri ≥ 0.5 = stable (cap at LIGHT unless guard is active)
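One way to sketch that boost (illustrative only; the 3,000 ft boundary-layer cutoff and the integer severity encoding are assumptions of this example):

```python
def khi_boost(severity_idx, ri, shear_kt_per_kft, altitude_agl_ft):
    """One-level severity boost when Kelvin-Helmholtz instability is
    likely: Ri < 0.25 with shear above 4 kt/1000 ft, outside the
    boundary layer. Indices: 0 smooth, 1 light, 2 moderate, 3 severe."""
    BL_TOP_FT = 3_000  # hypothetical boundary-layer depth for the sketch
    if altitude_agl_ft > BL_TOP_FT and ri < 0.25 and shear_kt_per_kft > 4.0:
        return min(severity_idx + 1, 3)  # never exceed SEVERE
    return severity_idx
```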

GA-Specific PIREP Normalization

Grok 4.1 caught a blind spot we hadn’t considered: the TAMDAR data used to calibrate thresholds came from airliners, not GA aircraft. “Moderate” turbulence in a 737 feels very different in a Cirrus. We now normalize PIREP severity by aircraft weight class — heavy-aircraft reports are bumped up a full category, medium-aircraft reports half a step, and light GA reports are taken at face value.

Light GA (< 15,500 lb): no adjustment
Medium (15,500–300,000 lb): +½ step
Heavy (> 300,000 lb): +1 full step
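As a sketch, with severity as a numeric step (0 smooth through 3 severe, half-steps allowed; the breakpoints are the published ones, the function itself is illustrative):

```python
def normalize_pirep_severity(reported_step, aircraft_weight_lb):
    """Adjust a PIREP severity step toward what a light GA aircraft
    would feel. Heavier reporters get bumped up because the same air
    disturbs them less."""
    if aircraft_weight_lb > 300_000:       # heavy
        adjustment = 1.0
    elif aircraft_weight_lb >= 15_500:     # medium
        adjustment = 0.5
    else:                                  # light GA: face value
        adjustment = 0.0
    return min(reported_step + adjustment, 3.0)  # clamp at SEVERE
```

A 737 (medium class) reporting light chop (step 1) normalizes to 1.5, i.e. light-to-moderate for a light single.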

Froude Number Rotor Detection

Fr = U⊥ / (N × h) — the ratio of cross-barrier wind speed to the mountain’s ability to block the flow. When Fr drops below critical thresholds and wind is strong enough, rotor turbulence is likely on the lee side. The debate validated our thresholds against Durran (1990) and Vosper (2004), and confirmed the valley-to-crest stability layer is the correct choice for computing Brunt-Väisälä frequency.

Fr < 0.60: severe rotor (wind ≥ 25 kt)
Fr 0.60–0.75: moderate rotor (wind ≥ 20 kt)
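Both steps fit in a few lines (a sketch using the published thresholds; treating Fr < 0.60 with wind between 20 and 25 kt as moderate is an assumption of this example, and the function names are hypothetical):

```python
def froude_number(u_perp_ms, n_bv_per_s, barrier_height_m):
    """Fr = U_perp / (N * h): cross-barrier wind speed over the product
    of Brunt-Vaisala frequency and barrier height."""
    return u_perp_ms / (n_bv_per_s * barrier_height_m)

def rotor_severity(fr, wind_kt):
    """Rotor classification per the article's thresholds (sketch only)."""
    if fr < 0.60 and wind_kt >= 25:
        return "severe"
    if fr < 0.75 and wind_kt >= 20:
        return "moderate"
    return "none"
```

For a 2,000 m ridge with N = 0.012 s⁻¹ and 13 m/s of perpendicular wind, Fr ≈ 0.54: blocked flow, and with 30 kt of wind, a severe-rotor flag.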

6% False Alarm Increase — Accepted

The shear guard produces a 6% increase in false MODERATE alerts. Multiple panels confirmed this tradeoff is appropriate: it’s well inside the FAA-accepted TIBS false-alarm tolerance of ~10%, and the alternative — missing real moderate turbulence in the 8,000–12,000 ft cruise bands where GA pilots most often encounter unanticipated shear — is unacceptable.

Why 8 Debates, Not 1

A single debate gives you a snapshot. Iteration gives you convergence.

The first three debates identified the problems. Debates 4–5 brought in a fresh model (Gemini 2.5 Pro) to challenge assumptions the original panel had accepted. Debate 6 synthesized everything into a final design that had survived adversarial scrutiny from 6 different frontier AI models across 3 days of refinement.

Debate 7 was unplanned — a bug surfaced during implementation, and the debate format turned out to be an effective diagnostic tool for isolating root causes. Debate 8 validated the inputs to rotor detection and spawned its own research article with six additional algorithmic improvements.

Debate → Learn → Refine → Debate again

Each cycle sharpens the methodology. Re-running with different model lineups catches blind spots that a single panel might miss.

Read the full debate series: Turbulence Safety Case Study on Nestr — all 8 debates, 64 rounds, and 6 unique AI models.

What This Means for You

No more hidden shear

The shear guard ensures that strong wind shear can’t be masked by a stable-looking Ri computed from a coarse model layer. If the shear is there, you’ll see it in your briefing — especially in the 8,000–12,000 ft bands where GA pilots most often encounter unexpected turbulence.

PIREPs calibrated for your aircraft

When an airline crew files “light chop” in a 737, that’s likely light-to-moderate in your Bonanza. PlaneWX now normalizes PIREP severity by weight class so the turbulence picture reflects what you would experience in your airplane.

Rotor turbulence distinguished from mountain waves

Not all mountain waves produce rotors. PlaneWX now uses Froude number analysis to tell you when conditions support actual rotor turbulence on the lee side of ridges — the kind that breaks airframes — versus laminar waves you can ride through with a firm grip on the yoke.

Every threshold is documented

We publish the shear thresholds, Richardson number bands, Froude number criteria, and the complete derivation methodology in our help center. You deserve to know exactly how your briefing’s turbulence assessment is computed.

References

Durran, D. R. (1990). Mountain waves and downslope winds. Atmospheric Processes over Complex Terrain, Meteor. Monogr. 23(45), 59–81.

FAA Advisory Circular AC 00-6B, Aviation Weather.

FAA Advisory Circular AC 00-57, Hazardous Mountain Winds and Their Visual Indicators.

Miles, J. W. (1961). On the stability of heterogeneous shear flows. J. Fluid Mech., 10(4), 496–508.

Sharman, R. D., Tebaldi, C., Wiener, G., & Wolff, J. (2006). An integrated approach to mid- and upper-level turbulence forecasting. Wea. Forecasting, 21, 268–287.

Vosper, S. B. (2004). Inversion effects on mountain lee waves. Quart. J. Roy. Meteor. Soc., 130, 1723–1748.

See Turbulence Analysis in Your Next Briefing

Create a free briefing for any route. You’ll see multi-model turbulence analysis with shear guard protection, Richardson number gating, GA-specific PIREP normalization, and full data transparency — all calibrated for your aircraft.