Mountain wave and rotor turbulence are among the most dangerous phenomena for general aviation. We needed to know if our detection system was good enough to stake a pilot’s safety on. So we didn’t ask one AI to review it — we put four frontier AI models in a room and let them argue about it for 20 rounds.
By Mark Wolfgang, Founder & CEO, PlaneWX
PlaneWX analyzes wind flow relative to terrain ridges along a pilot’s filed route. We decompose model sounding winds into perpendicular-to-ridge components, compute Froude numbers, estimate Brunt-Väisälä frequency, and classify mountain wave and rotor severity — all in real time, for every briefing.
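To make those diagnostics concrete, here is a minimal sketch of the three core quantities named above. All function names and the two-level finite-difference approach are illustrative assumptions, not PlaneWX's actual implementation:

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def cross_ridge_wind(speed_kt, wind_dir_deg, ridge_azimuth_deg):
    """Magnitude of the wind component perpendicular to a ridge axis, in knots."""
    angle = math.radians(wind_dir_deg - ridge_azimuth_deg)
    return abs(speed_kt * math.sin(angle))

def brunt_vaisala(theta_low_k, theta_high_k, dz_m):
    """Brunt-Vaisala frequency N (1/s) from potential temperature at two sounding levels."""
    theta_mean = 0.5 * (theta_low_k + theta_high_k)
    dtheta_dz = (theta_high_k - theta_low_k) / dz_m
    return math.sqrt(max(G / theta_mean * dtheta_dz, 0.0))

def froude_number(u_ms, n_per_s, barrier_height_m):
    """Nondimensional Froude number Fr = U / (N * h); Fr well below 1 implies blocked flow."""
    return u_ms / (n_per_s * barrier_height_m) if n_per_s > 0 else float("inf")
```

A 40 kt westerly hitting a north-south ridge yields a 40 kt cross-ridge component; the same wind blowing along the ridge axis yields zero.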
But mountain meteorology is hard. The literature spans decades of research by Durran, Vosper, Reinecke, Sharman, and others. The algorithms involve physical constants, threshold values, and edge cases that interact in non-obvious ways. Getting it wrong means either false alarms that erode pilot trust, or — far worse — missing a real hazard.
I needed a rigorous methodology review that would challenge every assumption, check every threshold against published science, and probe every edge case. Hiring a panel of mountain meteorology consultants would cost thousands of dollars and take weeks. I needed answers today.
I used Nestr, a platform that structures debates between multiple AI models. Instead of asking one AI “is this right?” (and getting a polite, agreeable answer), I configured a four-model adversarial debate: one proposer defending the methodology, three challengers probing for weaknesses, and a synthesizer delivering the final verdict.
| Role | Model | Focus |
|---|---|---|
| Proposer | o3 | Defended the methodology against FAA standards |
| Challenger | Grok 4.1 Fast | Probed edge cases and under-represented terrain |
| Challenger | Claude Sonnet 4.6 | Demanded evidence for every claim |
| Challenger | Grok 3 | Pressure-tested real-world applicability |
**4** AI models · **20** rounds of debate · **1.3M** tokens analyzed · **20 minutes** start to finish
The debate started as a validation exercise and evolved into a deep collaborative research session. By round 5, the models weren’t just finding problems — they were proposing solutions, debating implementation details, and holding each other accountable for unsupported claims. Six concrete improvements emerged.
Not all blocked-flow conditions produce rotors. Rotor turbulence requires trapped lee waves, which only form when the atmospheric profile supports wave trapping. The debate identified that our system was missing the Scorer parameter (l²) — a critical diagnostic for whether waves trap near the surface or propagate harmlessly upward. We now compute l² from model soundings between ridge crest and ~4 km above crest. When waves aren’t trapped (l² < 0.25), rotor alerts are suppressed — preventing false alarms in common winter jet-stream scenarios.
References: Scorer (1949), Durran (1990), Reinecke & Durran (2008)
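A sketch of the gate described above. The function names are hypothetical, and two simplifications are assumptions on my part: the Scorer parameter is evaluated as a single bulk value rather than checked for decrease with height (the full trapping condition), and the article's 0.25 threshold is assumed to be in km⁻²:

```python
import math

TRAPPING_THRESHOLD = 0.25  # assumed units: km^-2, per the threshold quoted above

def scorer_parameter_sq(n_per_s, u_ms, d2u_dz2_per_ms):
    """Scorer parameter l^2 = N^2/U^2 - (1/U) * d^2U/dz^2, converted from m^-2 to km^-2."""
    l2_m = (n_per_s / u_ms) ** 2 - d2u_dz2_per_ms / u_ms
    return l2_m * 1e6

def rotor_alert_allowed(l2_km2):
    """Suppress rotor alerts when the profile does not support trapped lee waves."""
    return l2_km2 >= TRAPPING_THRESHOLD
```

With N = 0.01 s⁻¹, U = 10 m/s, and negligible wind curvature, l² is about 1 km⁻², well above the trapping threshold, so a rotor alert would be allowed.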
Our original thresholds (25/30/40 kt for light/moderate/severe) were calibrated for major barriers like the Rockies and Sierra Nevada. Grok 4.1 immediately challenged: “How do they hold for lower-relief terrain like the Appalachians, where PIREPs often show waves at <25 kt perpendicular flow?” The answer: they don’t. We now scale thresholds dynamically based on terrain relief — lower-relief terrain gets lower thresholds, preventing under-detection in the Appalachians, Ozarks, and coastal ranges.
Formula: base = 18 + (relief_ft / 1,000) × 4 kt, clamped to [15, 40]
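The formula above is simple enough to state directly in code; the function name is illustrative:

```python
def base_threshold_kt(relief_ft):
    """Relief-scaled light-severity threshold: 18 + (relief/1000) * 4 kt, clamped to [15, 40]."""
    base = 18.0 + (relief_ft / 1000.0) * 4.0
    return min(max(base, 15.0), 40.0)
```

A 1,500 ft Appalachian ridge gets a 24 kt threshold, while 7,000 ft of Rocky Mountain relief hits the 40 kt ceiling, so lower-relief terrain alerts on weaker flow.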
Simple 2D Froude analysis can over-predict rotor severity in complex 3D terrain where multiple ridge orientations exist — sinuous valleys, overlapping ridges, convergent flows. Grok 3 pushed on Colorado Front Range scenarios where real conditions are less severe than 2D theory predicts. We now compute the circular variance of ridge orientations and inflate the Froude number by 25% when terrain is complex, except in deeply blocked flows (Fr < 0.35) where confined valleys can amplify hazards.
References: Vosper (2004), Smith (1989)
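One way to sketch the complexity modifier. Because ridge orientations are axial (a 10° ridge and a 190° ridge are the same ridge), angles are doubled before computing circular statistics, a standard trick for axial data. The complexity cutoff of 0.3 is a placeholder assumption, not PlaneWX's tuned value:

```python
import math

COMPLEXITY_CUTOFF = 0.3   # hypothetical cutoff for "complex" terrain
FR_INFLATION = 1.25       # +25% per the text above
BLOCKED_FR = 0.35         # deeply blocked flow: no inflation

def circular_variance(ridge_azimuths_deg):
    """Circular variance of ridge orientations, with angles doubled for axial data.
    Returns 0 for perfectly aligned ridges, 1 for no preferred orientation."""
    doubled = [math.radians(2.0 * a) for a in ridge_azimuths_deg]
    c = sum(math.cos(t) for t in doubled) / len(doubled)
    s = sum(math.sin(t) for t in doubled) / len(doubled)
    return 1.0 - math.hypot(c, s)

def effective_froude(fr, ridge_azimuths_deg):
    """Inflate Fr by 25% in complex terrain, except in deeply blocked flow (Fr < 0.35)."""
    if circular_variance(ridge_azimuths_deg) > COMPLEXITY_CUTOFF and fr >= BLOCKED_FR:
        return fr * FR_INFLATION
    return fr
```

Two perpendicular ridges give a circular variance of 1.0 (maximally complex), so Fr = 0.8 inflates to 1.0; a deeply blocked Fr = 0.3 is left untouched.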
Lower-relief terrain was being systematically under-detected. When the Froude number indicates blocked flow (Fr < 0.60) and cross-ridge winds are 25+ kt, severity is now promoted by one level — even over modest 1,500 ft ridges. A pilot crossing the Blue Ridge or Ozarks in strong flow deserves the same quality of hazard detection as someone crossing the Continental Divide.
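The promotion rule reduces to a small guard on the severity scale; the names and ordering here are assumptions for illustration:

```python
SEVERITIES = ["none", "light", "moderate", "severe"]

def promote_if_blocked(severity, fr, cross_ridge_kt):
    """Promote severity one level when flow is blocked (Fr < 0.60)
    and the cross-ridge wind is at least 25 kt."""
    if fr < 0.60 and cross_ridge_kt >= 25:
        idx = SEVERITIES.index(severity)
        return SEVERITIES[min(idx + 1, len(SEVERITIES) - 1)]
    return severity
```

So "light" over a blocked 1,500 ft Blue Ridge crossing in 30 kt cross-ridge flow is promoted to "moderate", while unblocked flow (Fr ≥ 0.60) is left alone.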
When terrain gradient is weak or ridge orientation can’t be determined with confidence, our v1 system used 100% of wind speed as a worst-case perpendicular estimate. This over-alerted constantly. The debate established that the RMS value of |sin(θ)| over all possible ridge orientations is ~0.707, so we now use 70% of total wind speed as a statistically grounded fallback — honest about uncertainty without crying wolf.
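The 0.707 figure is easy to verify numerically: for a ridge orientation uniformly distributed over all angles, the RMS of sin(θ) is exactly 1/√2. A sketch of the fallback plus that check (function name hypothetical):

```python
import math

def perpendicular_fallback(wind_speed_kt):
    """Fallback when ridge orientation is unknown: RMS(|sin(theta)|) = 1/sqrt(2) of the wind."""
    return wind_speed_kt / math.sqrt(2)

# Numerical check: RMS of sin(theta) over uniformly sampled orientations is ~0.7071.
rms = math.sqrt(sum(math.sin(math.radians(t)) ** 2 for t in range(360)) / 360)
```

A 30 kt wind over indeterminate terrain thus contributes about 21 kt of assumed perpendicular flow, rather than the full 30 kt that v1 used.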
The unit tests written to verify the Scorer parameter uncovered a real bug: the perpendicular wind projection formula was computing the along-ridge component instead of the cross-ridge component (sin/cos vs. cos/sin). For a N-S ridge with westerly flow, the formula returned zero instead of full wind speed. This wasn’t caught in manual testing because typical wind/ridge combinations still produced reasonable-looking numbers. Only the physics-aware test fixture — designed to verify that westerly wind is fully perpendicular to a N-S ridge — exposed it.
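A minimal reconstruction of that fixture, using u/v wind components and an illustrative function name (not PlaneWX's actual code). The comment marks where the sin/cos swap lived; the buggy variant projects onto the ridge axis instead of its normal:

```python
import math

def cross_ridge_component(u_east_kt, v_north_kt, ridge_azimuth_deg):
    """Wind component perpendicular to a ridge whose axis points along ridge_azimuth_deg."""
    az = math.radians(ridge_azimuth_deg)
    # Dot the wind with the unit vector normal to the ridge axis. The buggy version
    # swapped cos/sin here, projecting onto the axis itself (the along-ridge component).
    return abs(u_east_kt * math.cos(az) - v_north_kt * math.sin(az))

# Physics-aware fixture: a pure westerly (u = 40 kt east, v = 0) hitting a N-S ridge
# (azimuth 0) must come back fully perpendicular, not zero.
assert abs(cross_ridge_component(40.0, 0.0, 0.0) - 40.0) < 1e-9
```

The swapped version returns 0 for exactly this case, which is what the new test exposed.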
Ask a single AI to “validate my mountain wave detection methodology” and you’ll get a helpful but agreeable review. It might flag a few issues, but it won’t spend 20 rounds holding your feet to the fire.
The power of the adversarial format is that the models challenge each other, not just the input. When o3 cited the perpendicular wind thresholds from AC 00-57, Grok 4.1 immediately asked about Appalachian under-detection. When o3 proposed the Scorer parameter, Claude Sonnet challenged the ceiling limit and pushed it from crest+2km to crest+4km. When improvements were proposed, Claude Sonnet demanded end-to-end integrated validation before any could be released.
By round 19, all three challengers independently converged on the same conclusion: the methodology was sound in principle, but couldn’t be released without an integrated end-to-end validation run. That consensus — from three independent reasoning systems across two providers — carries more weight than any single model’s rubber stamp.
Read the full 20-round debate: Mountain Wave Detection Case Study on Nestr — all 80 messages and 1.3 million tokens of mountain meteorology debate, unedited.
The debate produced the improvements. But the debate isn’t the validation — the testing is. Every improvement was implemented, tested, and verified before deployment.
- **59 unit tests** covering the Scorer parameter, relief-scaled thresholds, terrain complexity, Froude number, Brunt-Väisälä frequency, severity promotion, and full integration tests via the analysis pipeline.
- **6 regression routes:** Rocky Mountain crossing (KSUS-KSLC), Sierra Nevada (KOAK-KRNO), Colorado Front Range (KDEN-KASE), Appalachians (KJFK-KLEX), Cascades (KBFI-KELN), and a flat negative control (KDFW-KIAH).
- **v1 ↔ v2 side-by-side comparison:** automated comparison of v1 fixed thresholds vs. v2 relief-scaled thresholds on every regression route, tracking escalations, de-escalations, and Scorer gate suppressions.
On March 16, 2026, we ran all 6 routes against a local dev server with live weather data:
| Route | Terrain | Wave | Hazards | v1→v2 Changes |
|---|---|---|---|---|
| KSUS→KSLC | high-mountain | moderate | 6 | 3 de-escalations |
| KDEN→KASE | high-mountain | moderate | 3 | — |
| KOAK→KRNO | mountainous | none | 0 | — |
| KJFK→KLEX | hilly | none | 0 | — |
| KBFI→KELN | mountainous | none | 0 | — |
| KDFW→KIAH | flat | none | 0 | control passed |
The 3 de-escalations on KSUS→KSLC are expected: v2's relief-scaled thresholds correctly raise the bar for high-relief Rocky Mountain terrain, where stronger winds are needed to reach the same severity level. Benign conditions on the March 16 test date meant the Scorer gate and complex-terrain modifier weren't exercised in this production run; both are fully covered by unit tests.
The Scorer parameter gate prevents rotor alerts when the atmosphere doesn’t actually support trapped lee waves. Strong upper-level winds in common winter jet-stream patterns will no longer trigger unnecessary rotor warnings.
If you fly the Appalachians, Ozarks, or coastal ranges, the relief-scaled thresholds and sub-2,000 ft promotion mean PlaneWX will flag mountain wave conditions that v1 would have missed. Your briefing reflects the terrain you’re actually flying over.
Every algorithm, every threshold, every physical constant is documented in our help center. We publish the science behind the system because you deserve to know how your safety decisions are being informed.
Durran, D. R. (1990). Mountain waves and downslope winds. Atmospheric Processes over Complex Terrain, Meteor. Monogr. 23(45), 59–81.
FAA Advisory Circular AC 00-57, Hazardous Mountain Winds and Their Visual Indicators.
Reinecke, P. A., & Durran, D. R. (2008). Estimating topographic blocking using a Froude number when the static stability is nonuniform. J. Atmos. Sci., 65(4), 1035–1048.
Scorer, R. S. (1949). Theory of waves in the lee of mountains. Quart. J. Roy. Meteor. Soc., 75, 41–56.
Sharman, R. D., Tebaldi, C., Wiener, G., & Wolff, J. (2006). An integrated approach to mid- and upper-level turbulence forecasting. Wea. Forecasting, 21, 268–287.
Smith, R. B. (1989). Hydrostatic airflow over mountains. Advances in Geophysics, 31, 1–41.
Vosper, S. B. (2004). Inversion effects on mountain lee waves. Quart. J. Roy. Meteor. Soc., 130, 1723–1748.
Create a free briefing for any mountain route. You’ll see terrain classification, cross-barrier wind analysis, Froude number rotor detection, and Scorer parameter gate diagnostics — all explained in plain English.