Kalshi Weather Derivatives

Overview

Prediction markets do not reward accurate forecasts in isolation — they reward identifying when market-implied probabilities are miscalibrated.

This project builds a quantitative trading system for Kalshi's daily high temperature markets (NYC Central Park, 1°F buckets). Instead of predicting a single temperature, the system estimates a full probability distribution over outcomes, compares it to market-implied probabilities, and generates trading decisions based on expected value and risk constraints.

The forecasting models are only one layer of the system. The core challenge is converting uncertain predictions into disciplined, automated trading decisions under real market conditions.

Research

Before building models, we analyzed how Kalshi weather markets behave across the trading window.

We found that early market prices tend to anchor to broad public forecast distributions while underweighting higher-resolution meteorological data that becomes available closer to settlement. This creates periods where the market's implied distribution is systematically less precise than what the data supports.

We also identified a structural mismatch in common forecasting approaches: Kalshi contracts settle on a single daily high temperature, yet many weather models optimize error across all 24 hourly temperatures. This means traditional loss functions often optimize the wrong target relative to the payoff structure.

Reframing the problem around the settlement event itself became one of the largest performance improvements in the system.
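The mismatch between hourly error and the settlement target can be illustrated with a toy comparison. This is a minimal sketch with made-up readings, and it treats the maximum of the hourly values as a stand-in for the official daily high. The function names are illustrative and are not taken from the project code.

```python
def hourly_mae(forecast, observed):
    """Mean absolute error across every hourly reading."""
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(observed)


def daily_high_error(forecast, observed):
    """Absolute error on the daily maximum, the quantity the contract settles on."""
    return abs(max(forecast) - max(observed))


# Made-up readings (degrees F) for four hours of one day; not real data.
observed = [60, 70, 75, 65]
forecast_a = [60, 70, 72, 65]  # exact in off-peak hours, misses the peak by 3
forecast_b = [62, 72, 75, 67]  # 2 degrees warm off-peak, exact on the peak

for name, fc in [("A", forecast_a), ("B", forecast_b)]:
    print(name, hourly_mae(fc, observed), daily_high_error(fc, observed))
```

Forecast A wins on hourly error while forecast B wins on the daily high, so a loss averaged over all hours can prefer the forecast that is worse for settlement.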

Edge Generation

The system decomposes edge into two independent components.

Point Forecast Edge — the ensemble improves estimation of the expected daily high by directly targeting the settlement outcome rather than full-day averages.

Distributional Edge — each model produces a calibrated probability distribution over outcomes (p10 / p50 / p90). These are compared against Kalshi's implied probabilities to identify mispricing in the shape and variance of the distribution, not just the mean.

A key insight is that profitability does not require a superior point forecast; a correctly calibrated distribution alone can generate edge if the market misprices uncertainty.
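One way to compare a quantile forecast with bucket prices is sketched below. It assumes the p10 / p50 / p90 summary can be approximated by a normal distribution, that a YES price in dollars can be read as the market-implied probability, and that fees are ignored. None of these assumptions, nor the function names and example numbers, come from the production system.

```python
import math

Z90 = 1.2815515655446004  # z-score of the standard normal 90th percentile


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bucket_probs(p10, p50, p90, buckets):
    """Probability of each 1-degree bucket under a normal fit to the quantiles.

    Assumption: the median is the mean, and sigma is recovered from the
    p10-p90 spread. Each bucket t covers [t - 0.5, t + 0.5).
    """
    mu = p50
    sigma = (p90 - p10) / (2.0 * Z90)
    return {
        t: normal_cdf(t + 0.5, mu, sigma) - normal_cdf(t - 0.5, mu, sigma)
        for t in buckets
    }


def edges(model_probs, market_prices):
    """Model probability minus market-implied probability, per bucket."""
    return {
        t: model_probs[t] - market_prices[t]
        for t in model_probs
        if t in market_prices
    }


# Hypothetical forecast and prices, for illustration only.
probs = bucket_probs(p10=70.0, p50=73.0, p90=76.0, buckets=range(68, 79))
market = {72: 0.20, 73: 0.10, 74: 0.20}
for bucket, edge in sorted(edges(probs, market).items()):
    print(bucket, round(edge, 3))
```

A positive edge means the model assigns more probability to a bucket than the price implies. Note that two forecasts with the same p50 but different p10-p90 spreads produce different edges, which is the distributional component described above.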

System Architecture

The system combines three independent forecasting models:

Temporal Fusion Transformer (TFT) — sequence modeling and temporal attention
XGBoost — nonlinear residual correction and feature interaction learning
N-BEATS — deep time-series basis expansion

Each model outputs a full probabilistic forecast distribution rather than a point estimate.

An ensemble layer combines these outputs using dynamically updated weights derived from rolling forecast error. Models that degrade in recent performance are automatically down-weighted without manual intervention.
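The weighting idea can be sketched as inverse rolling mean absolute error. This is an assumption about the general shape of such a scheme, not the project's actual formula: the window length, the use of MAE, the class name, and the reduction of each model to a single point forecast are all illustrative choices.

```python
from collections import deque


class RollingErrorWeights:
    """Weights each model by the inverse of its recent mean absolute error."""

    def __init__(self, model_names, window=30, eps=1e-6):
        # deque(maxlen=...) drops the oldest error once the window is full.
        self.errors = {m: deque(maxlen=window) for m in model_names}
        self.eps = eps  # avoids division by zero for a perfect recent record

    def update(self, forecasts, observed):
        """Record each model's absolute error for one settled day."""
        for name, value in forecasts.items():
            self.errors[name].append(abs(value - observed))

    def weights(self):
        """Return normalized weights; equal weights until every model has history."""
        if any(len(errs) == 0 for errs in self.errors.values()):
            n = len(self.errors)
            return {m: 1.0 / n for m in self.errors}
        inverse = {
            m: 1.0 / (sum(errs) / len(errs) + self.eps)
            for m, errs in self.errors.items()
        }
        total = sum(inverse.values())
        return {m: v / total for m, v in inverse.items()}


def combine(forecasts, weights):
    """Weighted average of the models' point forecasts."""
    return sum(weights[m] * forecasts[m] for m in forecasts)


tracker = RollingErrorWeights(["tft", "xgboost", "nbeats"], window=30)
tracker.update({"tft": 71.0, "xgboost": 72.0, "nbeats": 74.0}, observed=70.0)
print(tracker.weights())
```

Because old errors fall out of the window, a model whose recent performance degrades loses weight automatically, with no manual step.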

The resulting aggregated distribution feeds directly into a trading decision engine that evaluates expected value across Kalshi contracts.

Risk Management & Decision Layer

Forecasting and capital allocation are treated as separate problems. The trading system includes multiple independent risk controls:

Divergence filters — models that deviate significantly from both ensemble consensus and external baselines are temporarily excluded
Kelly-based sizing — position sizing derived from expected edge with strict capital caps
Market-implied volatility gate — trading is reduced when market prices already imply a sufficiently tight distribution
Cost and liquidity constraints — prevent execution in unfavorable market structures
Fallback sizing — reduced exposure when contract structure limits optimal positioning

Production thresholds and calibration constants are intentionally not disclosed.
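For illustration only, Kelly-based sizing with a capital cap might look like the sketch below. It uses the textbook Kelly fraction for a binary contract that costs `price` dollars and pays 1 dollar on a win, ignores fees, and the multiplier and cap values are placeholders rather than the undisclosed production constants.

```python
def kelly_fraction(p, price):
    """Kelly fraction of bankroll for buying YES at `price` with win probability p.

    Net odds are b = (1 - price) / price, so f* = (p*b - (1 - p)) / b,
    which simplifies to (p - price) / (1 - price). Negative edge sizes to zero.
    """
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1 dollars")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return max(0.0, (p - price) / (1.0 - price))


def position_size(bankroll, p, price, kelly_mult=0.25, max_frac=0.02):
    """Whole contracts to buy under fractional Kelly with a hard cap.

    kelly_mult and max_frac are hypothetical placeholder values.
    """
    frac = min(kelly_mult * kelly_fraction(p, price), max_frac)
    return int((bankroll * frac) // price)


# Hypothetical inputs: model says 60%, contract trades at 50 cents.
print(kelly_fraction(0.60, 0.50))
print(position_size(bankroll=10_000, p=0.60, price=0.50))
```

Scaling Kelly down and capping the fraction trades some expected growth for protection against errors in the estimated probability.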

Validation & Backtesting

A central engineering requirement was ensuring strict point-in-time correctness. All backtests are walk-forward: models only access data available before the trading cutoff for each simulated day.
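A minimal sketch of the point-in-time rule follows. It assumes every record carries an `available_at` timestamp (when the value became known, as opposed to the time it describes) and that each simulated day has a fixed cutoff time. The `Record` type, the field names, and the 09:00 cutoff are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta


@dataclass(frozen=True)
class Record:
    available_at: datetime  # when the value became known to the system
    value: float


def visible_records(records, cutoff):
    """Keep only records that were available strictly before the cutoff."""
    return [r for r in records if r.available_at < cutoff]


def walk_forward(records, first_day, last_day, cutoff_time=time(9, 0)):
    """Yield (day, usable_records) for each simulated trading day in order."""
    day = first_day
    while day <= last_day:
        cutoff = datetime.combine(day, cutoff_time)
        yield day, visible_records(records, cutoff)
        day += timedelta(days=1)


# Made-up records for illustration.
history = [
    Record(datetime(2024, 1, 1, 8, 0), 41.0),
    Record(datetime(2024, 1, 1, 10, 0), 44.0),  # arrives after the Jan 1 cutoff
    Record(datetime(2024, 1, 2, 8, 0), 39.0),
]
for day, usable in walk_forward(history, date(2024, 1, 1), date(2024, 1, 2)):
    print(day, len(usable))
```

Filtering on availability time rather than on the time a value describes is what catches revised or late-arriving forecast features.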

During development, I identified a feature leakage issue in the Temporal Fusion Transformer pipeline: forecast features inadvertently included post-cutoff information. Correcting it exposed a large gap between the inflated and corrected performance metrics, which confirmed the importance of strict temporal isolation.

The system also uses:

Temporally disjoint validation sets (no random splits)
Regime-aware error analysis across weather conditions
Continuous post-hoc performance monitoring
Live paper trading before capital deployment

Production Infrastructure (AWS)

The system is fully automated and runs as a production pipeline on AWS. It includes:

Automated ingestion of weather observations and forecast data
Scheduled model training and retraining workflows
Ensemble prediction generation
Trading signal evaluation
Execution layer for paper and live trading

Infrastructure is built on AWS services for orchestration, storage, secrets management, and compute, so the system runs continuously without manual intervention. Research code is kept separate from production infrastructure, which allows rapid iteration while keeping live environments stable.

Results

A key finding was that optimizing for hourly temperature prediction significantly underperformed relative to optimizing directly for the settlement target. Reframing the problem to predict the daily high directly produced the largest single improvement in system performance.

The system passed backtesting and paper-trading validation and is currently deployed with real capital. Live performance is withheld until a statistically meaningful sample has accumulated.

3 Independent Forecasting Models
Live: Real Capital Deployed
Walk-Forward: Point-in-Time Validation

My Contributions

As Integration Lead, I focused on turning independently developed models into a unified production system. Key contributions:

Designed the ensemble architecture and shared probabilistic forecast schema
Built the Temporal Fusion Transformer forecasting pipeline
Developed interfaces enabling three independent models to operate as a single system
Implemented rolling-error-based dynamic ensemble weighting
Discovered and fixed a critical feature leakage issue in the TFT pipeline
Helped transition the system from research prototype to automated production deployment

Technologies

Machine Learning

Temporal Fusion Transformer
XGBoost
N-BEATS
Probabilistic forecasting
Ensemble learning
Calibration and uncertainty estimation

Infrastructure

AWS EC2
AWS S3
AWS Lambda
AWS Secrets Manager
AWS DynamoDB
Docker
Automated scheduling / cron-based orchestration

Methods

Time-series forecasting
Walk-forward validation
Probability calibration
Quantitative risk management
Feature engineering for spatiotemporal data