Kalshi Quantitative Trading
ML Ensemble System for Weather Prediction Markets
Problem
Weather prediction markets on Kalshi are systematically mispriced early in the trading window: prices are spread across temperature buckets without committing to a clear favorite. That gap creates a persistent edge for a model that is directionally correct, paired with a trading strategy that we do not disclose here. The specific market we target asks whether the daily high temperature at NYC Central Park falls within a given 1°F bucket.
Market Research
Before building anything, we studied the structure of Kalshi's weather markets: how prices move throughout the day, how they relate to publicly available forecasts, and where pricing tends to be weakest. We found that early in the trading window, markets frequently misprice the probability distribution across temperature buckets — not because the weather is unpredictable, but because market prices anchor to broad public consensus rather than refined meteorological data.
We also identified the exact metric that matters: Kalshi contracts settle on the daily high, which is largely determined by a single peak window in the afternoon. Any model that averages error across the full day is optimizing the wrong thing.
System
We built a three-model ensemble (a Temporal Fusion Transformer, an XGBoost model, and an N-BEATS model), each independently trained and tested on selected historical weather data alongside 17 real-time meteorological variables. Each model produces a probability distribution over possible temperatures rather than a single point estimate.
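To make the link between a predicted distribution and the contracts concrete, here is a minimal sketch of turning a distribution into per-bucket probabilities. It assumes a Gaussian predictive distribution and half-open 1°F buckets [t, t+1). Both are illustrative assumptions, as are the function names; the models' actual output format is not described here.

```python
import math
from typing import Dict


def normal_cdf(x: float, mean: float, std: float) -> float:
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def bucket_probabilities(mean: float, std: float, low: int, high: int) -> Dict[str, float]:
    """Probability mass per 1°F bucket [t, t+1) for t in low..high-1.

    Mass outside [low, high) goes into two tail buckets so the total is 1.
    The Gaussian shape is an assumption made for this sketch.
    """
    probs = {f"<{low}": normal_cdf(low, mean, std)}
    for t in range(low, high):
        probs[f"{t}-{t + 1}"] = normal_cdf(t + 1, mean, std) - normal_cdf(t, mean, std)
    probs[f">={high}"] = 1.0 - normal_cdf(high, mean, std)
    return probs


# Hypothetical forecast: mean 80.5°F with a 2°F standard deviation.
probs = bucket_probabilities(mean=80.5, std=2.0, low=75, high=86)
most_likely = max(probs, key=probs.get)
```

Even with a fairly tight 2°F spread, the most likely bucket holds only about a fifth of the probability mass, which is why a full distribution is more useful here than a point estimate.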
The three outputs are combined using dynamic weights, with several guards that restrict when trades may be placed. A live edge signal compares the ensemble's distribution against Kalshi's posted market price, and circuit breakers prevent trading when model confidence is low.
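The combination step can be sketched as follows. This is an illustration under stated assumptions, not the production logic: the inverse-error weighting rule, the confidence threshold, and every function name are hypothetical, and market prices are assumed to be quoted as probabilities between 0 and 1.

```python
from typing import Dict, Optional


def inverse_error_weights(recent_rmse: Dict[str, float]) -> Dict[str, float]:
    """Weight each model by the inverse of its recent error (assumed rule)."""
    inverse = {name: 1.0 / err for name, err in recent_rmse.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def combine(distributions: Dict[str, Dict[int, float]],
            weights: Dict[str, float]) -> Dict[int, float]:
    """Weighted mixture of per-bucket probabilities, renormalized to sum to 1."""
    buckets = set().union(*(d.keys() for d in distributions.values()))
    mixed = {
        b: sum(weights[m] * distributions[m].get(b, 0.0) for m in distributions)
        for b in sorted(buckets)
    }
    total = sum(mixed.values())
    return {b: p / total for b, p in mixed.items()}


def edge_signal(ensemble: Dict[int, float],
                market_prices: Dict[int, float],
                min_confidence: float = 0.35) -> Optional[Dict[int, float]]:
    """Per-bucket edge (model probability minus market price).

    Returns None when the circuit breaker trips, i.e. when no bucket
    reaches min_confidence (a hypothetical threshold).
    """
    if max(ensemble.values()) < min_confidence:
        return None
    return {b: ensemble.get(b, 0.0) - price for b, price in market_prices.items()}


# Hypothetical inputs keyed by the lower edge of each 1°F bucket.
weights = inverse_error_weights({"tft": 2.0, "xgboost": 1.0, "nbeats": 2.0})
ensemble = combine(
    {
        "tft": {80: 0.6, 81: 0.4},
        "xgboost": {80: 0.8, 81: 0.2},
        "nbeats": {80: 0.4, 81: 0.6},
    },
    weights,
)
edges = edge_signal(ensemble, market_prices={80: 0.50, 81: 0.40})
```

A positive edge means the ensemble assigns a bucket more probability than the market price implies. How an edge translates into an order is part of the trading strategy and is not shown.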
As integration lead, I designed the ensemble architecture and the standardized schema that lets the three models communicate cleanly. I also built the TFT component.
Development
The project evolved through three phases.
Phase 1 was foundational: market research, data pipeline construction, model selection, and establishing the backtesting infrastructure. We settled on NYC Central Park as our target market and defined the problem formally.
Phase 2 was our first full build. We trained hourly models, predicting temperature at each hour of the day, and optimized for RMSE across all 24 hours. This produced only 10.9% bucket accuracy in validation. The problem: averaging error over all 24 hours masked poor performance at the afternoon peak, the only window that determines the daily high and therefore the only one that matters for contract settlement.
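A small numerical illustration of how a 24-hour average can hide peak-window error. The error values and the choice of hours 14 to 16 as the peak window are invented for the example; they are not measurements from our models.

```python
import math
from typing import List


def rmse_of_errors(errors: List[float]) -> float:
    """Root mean squared error of a list of signed or absolute errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Made-up hourly errors in °F: 1.0 everywhere except 5.0 during the
# assumed afternoon peak window (hours 14, 15, 16).
hourly_errors = [5.0 if 14 <= hour <= 16 else 1.0 for hour in range(24)]

all_day_rmse = rmse_of_errors(hourly_errors)      # averaged over 24 hours
peak_rmse = rmse_of_errors(hourly_errors[14:17])  # peak window only
```

Here the all-day RMSE is 2.0°F while the peak-window RMSE is 5.0°F, so a model can look acceptable on the aggregate metric and still miss the daily high by several buckets.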
Phase 3 reframed the problem entirely. Rather than predicting hourly temperatures and deriving a daily high, we trained directly on the daily high as the target. This is the current system. During this phase I also discovered and fixed a lookahead bias in the TFT pipeline that would have silently inflated our backtest results.
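The reframed target can be sketched as a simple aggregation: one label per calendar day, equal to that day's maximum observed temperature. The input format, a list of (timestamp, temperature) pairs already in the station's local time, is an assumption for illustration, as is the function name.

```python
from datetime import date, datetime
from typing import Dict, List, Tuple


def daily_high_targets(observations: List[Tuple[datetime, float]]) -> Dict[date, float]:
    """Collapse hourly observations into one daily-high label per date."""
    highs: Dict[date, float] = {}
    for timestamp, temp_f in observations:
        day = timestamp.date()
        if day not in highs or temp_f > highs[day]:
            highs[day] = temp_f
    return highs


# Hypothetical readings in °F spanning two days.
observations = [
    (datetime(2024, 7, 1, 9), 74.0),
    (datetime(2024, 7, 1, 15), 86.0),
    (datetime(2024, 7, 1, 21), 78.0),
    (datetime(2024, 7, 2, 10), 77.0),
    (datetime(2024, 7, 2, 16), 89.0),
]
targets = daily_high_targets(observations)
```

Training against these labels means the loss is computed on the quantity the contract settles on, rather than on 24 hourly values of which only a few matter.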
Training & Backtesting
Each model was trained and backtested independently before being integrated into the ensemble. We used held-out validation sets to evaluate generalization, and the backtesting framework was designed to mirror live trading conditions as closely as possible — using only data that would have been available at prediction time.
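Two generic guards against using unavailable data can be sketched as follows: filtering inputs by availability time, and splitting chronologically instead of randomly. This is an illustration of the principle only. The record layout with an "observed_at" field and both function names are hypothetical, and the sketch does not reproduce the specific bug found in the TFT pipeline.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Record = Dict[str, object]


def available_at(records: List[Record], prediction_time: datetime) -> List[Record]:
    """Keep only records observed strictly before the prediction time."""
    return [r for r in records if r["observed_at"] < prediction_time]


def chronological_split(records: List[Record],
                        train_fraction: float = 0.8) -> Tuple[List[Record], List[Record]]:
    """Split by time so every validation record is later than every training record."""
    ordered = sorted(records, key=lambda r: r["observed_at"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical records: one reading per day at noon for ten days.
records = [
    {"observed_at": datetime(2024, 7, day, 12), "temp_f": 80.0 + day}
    for day in range(1, 11)
]
visible = available_at(records, prediction_time=datetime(2024, 7, 5, 8))
train, validation = chronological_split(records, train_fraction=0.8)
```

A random shuffle before splitting would let training rows come from dates after validation rows, which is one common way lookahead leaks into a backtest.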
The lookahead fix in Phase 3 was critical: an earlier version of the TFT inadvertently trained on future data, producing results that looked better than they were. Catching and correcting this before paper trading saved us from acting on a false signal. Paper trading is the final validation layer before committing real capital.
Outcome
The TFT component reached ~2.8°F validation RMSE. The full ensemble reached ~1.8°F RMSE. We estimate a 30% annual return once the system is in production. It is currently in paper trading to verify the live edge before committing real capital.
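RMSE and bucket accuracy measure different things, and a short sketch shows how both can be computed from the same predictions. The bucket definition (same integer floor in °F) and the sample values are assumptions for illustration, not our validation data.

```python
import math
from typing import List


def rmse(predictions: List[float], actuals: List[float]) -> float:
    """Root mean squared error between predicted and observed daily highs."""
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared) / len(squared))


def bucket_accuracy(predictions: List[float], actuals: List[float]) -> float:
    """Fraction of days where prediction and outcome share a 1°F bucket.

    A bucket is taken to be [t, t+1) for integer t (assumed definition).
    """
    hits = sum(
        1 for p, a in zip(predictions, actuals) if math.floor(p) == math.floor(a)
    )
    return hits / len(predictions)


# Invented daily highs in °F.
predictions = [80.2, 81.9, 79.5, 83.1]
actuals = [80.8, 82.1, 79.0, 85.1]

error = rmse(predictions, actuals)
accuracy = bucket_accuracy(predictions, actuals)
```

In this example the second prediction is off by only 0.2°F yet lands in the wrong bucket, so a low RMSE does not by itself guarantee high bucket accuracy.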