
Building thoughtful software, writing notes, and shipping experiments across data, AI, and the web.

SERIES: KERALA ASSEMBLY ELECTIONS · PART 3 OF 3

Kerala Assembly Election 2026 Insights : The Model


/ METADATA
DATE:2026.4.9
AUTHOR:SARATH THARAYIL
READING TIME:13 MIN READ
CATEGORIES:
Side Project · Kerala · Explainer · Machine Learning
SERIES:
▸ P1: Overview Tab ▸ P2: Predictions Tab ▶ P3: The Model
/ ARTICLE

The first two posts covered the webapp itself — what it shows and how you interact with it. This post goes deeper into the model that powers the predictions. I want to explain all the design choices, what worked, what did not, and where the numbers should be taken with a grain of salt.

Problem setup

At the core, this is a multi-class classification problem. For each of Kerala's 140 assembly constituencies, the model predicts which alliance wins in 2026: LDF, UDF, NDA, or Others.

The training data covers 11 elections from 1977 to 2021: 1,298 rows in total, or roughly 9 data points per constituency. For a classification task this is extremely small. Deep learning is out of the question, and even standard ML methods need careful handling.

Why Kerala is unusually hard to model

Three things make this problem different from typical classification.

  1. Near-perfect alternation. Kerala has alternated between LDF and UDF at the state level in every election since 1982 — until 2021, when LDF became the first government re-elected in 40 years. Any model that learns the alternation pattern will get 2021 wrong. Any model that ignores it will overfit to one exception.
  2. Extreme class imbalance. LDF and UDF together account for 93.5% of all wins. NDA has exactly 1 win in the entire dataset (Nemom 2016, lost in 2021). Others mostly won seats in the 1977–1987 era. The model effectively cannot learn NDA or Others patterns from data alone.
  3. Constituency heterogeneity. Some seats are fortress-like strongholds that never change. Others are genuine toss-ups. A single global model cannot handle both — it either over-predicts the majority class in swing seats or under-predicts it in safe seats.

These three constraints shaped every design decision in the pipeline.


Feature engineering: 48 features across 9 groups

The features fall into 9 groups. Each group captures a different signal.

Group A — Win patterns and history (8 features)

Rolling LDF/UDF win rates (last 3, last 5, all-time), consecutive previous wins, constituency volatility, and long-term front lean. These are the backbone features. They encode how a constituency has voted historically without needing any vote share data.
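The rolling win-rate idea can be sketched with pandas. This is a minimal illustration, not the pipeline's actual code: column names and the toy data are assumptions, and the key detail is the `shift(1)` so each row only sees elections strictly before it.

```python
import pandas as pd

# Minimal sketch of the Group A rolling-win-rate features, assuming a long
# DataFrame with one row per (constituency, election year) and a binary
# ldf_win column. Names are illustrative, not the pipeline's real schema.
df = pd.DataFrame({
    "constituency": ["X"] * 5,
    "year": [2001, 2006, 2011, 2016, 2021],
    "ldf_win": [1, 0, 1, 1, 1],
})
df = df.sort_values(["constituency", "year"])

# shift(1) ensures each row only sees *previous* elections (no target leakage)
g = df.groupby("constituency")["ldf_win"]
df["ldf_rate_last3"] = g.transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
df["ldf_rate_alltime"] = g.transform(lambda s: s.shift(1).expanding().mean())

print(df[["year", "ldf_rate_last3", "ldf_rate_alltime"]])
```

The first election per constituency gets NaN for every history feature, which the NaN-native models described later handle directly.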

Group B — Vote shares and margins (10 features)

Previous margin percentage, turnout, signed margin (positive for LDF, negative for UDF), per-front vote shares, and margin momentum (are margins growing or shrinking?). These capture the "how much" dimension that Group A misses.

Group C — State-level context (10 features)

This group was rewritten from scratch during development. It now includes:

  • ruling_tenure: how many consecutive terms the current ruling alliance has served
  • anti_incumbency: scaled signal — starts at 1.0 for one term, increases by 0.3 per additional term
  • alternation_streak: how many consecutive alternations have occurred before this election
  • swing_direction and swing_strength: composite features combining tenure and alternation

The idea is that "anti-incumbency" is not a binary yes/no — a second consecutive term is very different from an unprecedented third term.
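As a concrete sketch of that scaling (the function name is mine; the 1.0 base and 0.3 increment are from the description above):

```python
# Hypothetical sketch of the anti_incumbency feature: 1.0 for a first term,
# +0.3 for each additional consecutive term. The function name and edge-case
# handling are assumptions, not the author's code.
def anti_incumbency(ruling_tenure: int) -> float:
    """Scaled anti-incumbency signal for an alliance with
    ruling_tenure consecutive terms in office."""
    if ruling_tenure < 1:
        return 0.0
    return 1.0 + 0.3 * (ruling_tenure - 1)

print(anti_incumbency(1))  # first term
print(anti_incumbency(3))  # unprecedented third term
```

So a second term scores 1.3 and an unprecedented third term 1.6, rather than a flat yes/no flag.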

Group D — Lok Sabha proxy (6 features)

2024 Lok Sabha results broken down by assembly constituency. Features include LDF/UDF/NDA vote shares, turnout, and the LDF–UDF margin.

Important caveat: These features are only used for the final 2026 prediction, not during cross-validation. Since the Lok Sabha election happened in 2024, using it to evaluate model accuracy on 2016 or 2011 would be data leakage. More on this in the QC section below.

Group E–F — Local body elections and Census demographics

Local body results and census data (literacy, urbanization, SC/ST percentage). These are time-invariant constituency characteristics. In practice, the data files for these groups were incomplete, so most of these features end up as NaN and get dropped.

Group G — Constituency identity (2 features)

Encoded constituency type (General/SC/ST) and a numeric constituency identifier. The constituency ID acts as a weak lookup — the model can learn that constituency #47 has a particular pattern. It is a form of memorisation, but with only 1,298 rows it provides useful signal.

Group H — BJP vote trends (3 features)

Previous BJP vote share, and how much of BJP's vote appears to come from LDF vs UDF voters. This matters for seats like Nemom where BJP's rise directly affected the LDF–UDF balance.

Group I — District and region aggregates (6 features)

District-level win rates (LDF, UDF, NDA), district margin trend, regional LDF win rate, and regional swing trend. These let the model see geographic context — how is the area around a constituency voting, not just the constituency itself.

NaN handling

Many features are NaN for early elections (no vote share data before 1996, no BJP data before 2001). Rather than filling with sentinel values like -1, the pipeline leaves NaN intact. XGBoost and scikit-learn's HistGradientBoosting handle NaN natively. RandomForest and ExtraTrees in scikit-learn 1.4+ also support NaN directly. This avoids creating spurious signal from fill values.


Model architecture

Instead of training one model and hoping it works everywhere, the pipeline uses a multi-level ensemble with per-constituency routing. There are three tiers.

Tier 1 — Constituency-level models (5 models)

These are the workhorses. During CV they are trained on 42 features (48 minus the 6 Lok Sabha features); the final models for the 2026 prediction use all 48.

  • Hybrid-XGB: XGBoost (400 trees, depth 5) + per-constituency residual correction with recency decay
  • Hybrid-RF: RandomForest version of the same architecture
  • Hybrid-ET: ExtraTrees version
  • Ensemble-Hybrid: Equal-weight blend of the three above
  • EmpiricalBayes: Dirichlet posterior updated with recency-weighted election observations, no tree model at all

The "Hybrid" architecture has two stages. First, the tree model produces global probabilities. Then, per-constituency residuals (weighted by how recently the observation happened) are added as a correction. This lets the model learn "constituency X tends to be 15% more LDF than the global model thinks."
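A minimal sketch of that two-stage idea, under stated assumptions: the function names, the five-year election spacing in the decay exponent, the `alpha` blend factor, and the clipping are all mine, not the pipeline's implementation.

```python
import numpy as np

# Sketch of the two-stage "Hybrid" correction: global tree-model
# probabilities plus a recency-weighted average of past per-constituency
# residuals (one-hot outcome minus predicted probabilities).
def residual_correction(past_probs, past_onehot, years, current_year, decay=0.85):
    # more recent elections get larger weights (assumed ~5-year spacing)
    w = np.array([decay ** ((current_year - y) / 5) for y in years])
    residuals = np.asarray(past_onehot) - np.asarray(past_probs)
    return (w[:, None] * residuals).sum(axis=0) / w.sum()

def hybrid_predict(global_probs, correction, alpha=0.5):
    p = np.clip(np.asarray(global_probs) + alpha * correction, 1e-6, None)
    return p / p.sum()  # renormalise after the additive correction

# a constituency that ran more LDF than the global model thought
corr = residual_correction([[0.6, 0.4]], [[1, 0]], [2021], 2026)
print(hybrid_predict([0.5, 0.5], corr))
```

The correction term is exactly where "constituency X tends to be 15% more LDF than the global model thinks" gets encoded.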

Tier 2 — Geographic models (3 models)

These use restricted feature subsets to force the model to think at coarser granularity.

  • StateLevelModel: Only 9 state-context features (anti-incumbency, swing, tenure etc.). This cannot distinguish between constituencies at all — it predicts the same thing for every seat. It exists to capture state-wide wave dynamics.
  • DistrictModel: 9 district + key features. Groups constituencies by their 14 districts and learns district-level patterns.
  • RegionModel: 10 region + broad features. Groups constituencies into 4 geographic regions (North, Central, South-Central, South).

The StateLevelModel is intentionally weak (43.9% accuracy alone). It only adds value when the macro swing overwhelms local factors — think wave elections like 2006.

Tier 3 — Per-constituency router

The ConstituencyRouter learns which blend of the 8 models works best for each constituency using temporal cross-validation performance.

For each constituency, the router computes a weight vector across all 8 models:

  1. Run 5-fold temporal CV (train on all data before year X, test on year X)
  2. For each constituency, record which models got it right in which folds
  3. Apply recency weighting (0.85 decay per fold — 2021 matters more than 2001)
  4. Apply minimum weight floor (0.05 — no model is ever zeroed out)
  5. Blend with global weights for constituencies with sparse CV data

At prediction time, each constituency gets a weighted average of all 8 model probabilities, using its learned weight vector.
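The prediction step can be sketched as a floored, renormalised weighted average. The weights and probabilities below are made up for illustration; only the 0.05 floor comes from the list above.

```python
import numpy as np

# Minimal sketch of the router's prediction step: each constituency carries
# a learned weight vector over the models; its final probabilities are the
# weighted average of the model outputs after the 0.05 floor and
# renormalisation.
def route(model_probs, weights, floor=0.05):
    w = np.maximum(np.asarray(weights, dtype=float), floor)  # no model zeroed out
    w /= w.sum()
    return w @ np.asarray(model_probs)  # (n_models,) @ (n_models, n_classes)

probs = np.array([[0.7, 0.3],   # e.g. a Hybrid model
                  [0.4, 0.6],   # e.g. StateLevel
                  [0.6, 0.4]])  # e.g. EmpiricalBayes
print(route(probs, [0.5, 0.3, 0.2]))
```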

Example routing outcomes:

  • Kuttiadi: StateLevel 32%, RegionLevel 38% — a highly volatile seat where macro and regional signals dominate
  • Nemom: Hybrid-XGB 22% — the only BJP win in history; the BJP-feature-aware XGBoost model gets the most weight
  • Kunnamangalam: Hybrid-RF/ET 27% each — a stable seat where the tree models are most reliable

Tier 4 — Probability calibration

After learning router weights, the pipeline learns a temperature scaling parameter from the CV predictions. Temperature T > 1 softens overconfident probabilities. The pipeline uses Brier-score minimisation (not NLL, which can favor sharpening) and constrains T ≥ 1.0.

In practice, the learned T = 1.000, meaning the router's blended probabilities are already well-calibrated. The individual sub-models may be overconfident, but the averaging effect of 8 models naturally prevents extreme probabilities.
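The calibration step can be sketched as below. The coarse grid search is my stand-in for whatever optimiser the pipeline actually uses; the Brier objective and the T ≥ 1 constraint are from the description above.

```python
import numpy as np

# Sketch of Brier-minimising temperature scaling with T constrained to >= 1.
def soften(probs, T):
    # rescale log-probabilities by 1/T, then renormalise; T > 1 flattens
    logits = np.log(np.clip(probs, 1e-12, None)) / T
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(probs, onehot, grid=np.linspace(1.0, 3.0, 41)):
    briers = [np.mean((soften(probs, T) - onehot) ** 2) for T in grid]
    return float(grid[int(np.argmin(briers))])

# already well-calibrated, mostly-correct predictions: the search stays at 1.0
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
onehot = np.array([[1, 0], [0, 1]])
print(fit_temperature(probs, onehot))
```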


Validation

Temporal cross-validation

The pipeline uses rolling temporal CV — for each test year, train only on data before that year. This mirrors the actual prediction task.

5-fold CV uses test years [2001, 2006, 2011, 2016, 2021]. 3-fold CV uses only post-delimitation years [2011, 2016, 2021] for a more realistic 2026 estimate.

A critical design decision: Lok Sabha 2024 features are excluded from CV entirely. Since those results happened in 2024, using them to predict 2011 or 2016 would be leaking future information into the model. They are only included in the final model that produces 2026 predictions.
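The rolling split above can be sketched in a few lines. The election years are the eleven from the dataset (1977–2021); the generator pattern is mine.

```python
import pandas as pd

# Sketch of rolling temporal CV: for each test year, train only on
# strictly earlier elections.
def temporal_folds(df, test_years):
    for year in test_years:
        yield year, df[df["year"] < year], df[df["year"] == year]

years = [1977, 1980, 1982, 1987, 1991, 1996, 2001, 2006, 2011, 2016, 2021]
df = pd.DataFrame({"year": years})
for year, train, test in temporal_folds(df, [2001, 2006, 2011, 2016, 2021]):
    print(year, len(train), len(test))
```

Each fold mirrors the real task: predicting an election using only what was known beforehand.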

Two accuracy metrics

I report two per-constituency accuracy numbers:

  1. Ceiling (best model per seat): For each constituency, pick whichever of the 8 models was most accurate across all folds. This is the theoretical upper bound if routing were perfect. It is also cherry-picked and optimistic.

  2. Honest (router-blended): Simulate what the router actually does — blend all 8 model outputs using the learned weights, take the argmax, and check if it matches the true winner. This is the real generalization estimate.

| Metric                 | 5-fold ceiling | 5-fold honest |
| ---------------------- | -------------- | ------------- |
| Mean per-seat accuracy | 0.778          | 0.667         |
| Seats at 100%          | 41             | 17            |
| Seats ≥ 80%            | 84             | 50            |
| Seats ≥ 60%            | 134            | 122           |
| Seats < 40%            | 2              | 7             |

The honest number — 67% per-constituency accuracy — is the one I trust. It means roughly 2 out of 3 constituency predictions are correct historically.

Per-model accuracy

| Model           | 5-fold mean |
| --------------- | ----------- |
| Hybrid-XGB      | 72.1%       |
| Ensemble-Hybrid | 71.9%       |
| Hybrid-ET       | 70.3%       |
| EmpiricalBayes  | 70.2%       |
| Hybrid-RF       | 69.4%       |
| RegionLevel     | 69.2%       |
| DistrictLevel   | 68.7%       |
| StateLevel      | 43.9%       |

No single model dominates. XGB is best overall but loses to RF and ET on specific constituencies. The router's value comes from per-constituency specialisation, not from picking the globally best model.


Where the model is overconfident

Even after calibration, the model outputs high confidence values for many seats (mean ~70%, with some above 85%). There are structural reasons for this.

Why the scores are high:

Kerala's two-front system means the model sees clear historical patterns in most constituencies. A seat that has gone LDF in 8 of 11 elections naturally gets a high LDF probability. The model is not wrong to have high confidence there — but it is missing any information about the current election dynamics.

What the model cannot see:

  • No candidate quality data (strong vs weak candidate, star power, local popularity)
  • No campaign spending or organizational strength
  • No caste arithmetic or community-level dynamics
  • No media sentiment, social media trends, or polling data
  • No party internal conflicts or local issues

All of these can override structural patterns. The model is essentially saying "based on history, this is a safe LDF seat" — but a strong UDF candidate or a local scandal could easily flip it.

The seat split reality:

The current output is LDF 102, UDF 36, NDA 0, Others 2. This is almost certainly too lopsided for 2026. In reality, Kerala elections tend to be much closer in seat counts (the actual margin in most elections is 20-30 seats, not 60+). The model's inability to predict a competitive race is its biggest limitation.

The reason for this skew: the model heavily weights 2021 results (where LDF won a historic majority), and the Lok Sabha 2024 data reinforces the same constituency leanings. Without any counter-signal (like polling showing UDF momentum), the model defaults to "more of the same."


Design choices: what worked and what did not

What worked

  • Per-constituency routing improved predictions across the board. Different constituencies genuinely respond to different model types.
  • Native NaN handling instead of sentinel fill values. The old approach of filling -1 was creating phantom signal.
  • Recency weighting everywhere — CV fold weights, Bayesian posterior updates, residual corrections. Recent elections matter more.
  • Multi-level geographic models add value for volatile seats where constituency-level history is noisy.
  • Dual CV reporting (5-fold + 3-fold) gives a more nuanced view of expected accuracy.

What did not work as well

  • Probability calibration (temperature scaling) ended up at T = 1.0 — the model's blended probabilities are already calibrated at the ensemble level. Individual models are overconfident, but averaging fixes it. The calibration infrastructure is not actively helping.
  • StateLevel model at 43.9% accuracy is barely useful. It only helps for a handful of wave-sensitive constituencies (Kuttiadi, Tanur). The cost-benefit is debatable.
  • Swing adjustment (transferring probability from leader to challenger based on margin and volatility) is a heuristic patch. It works for some tight seats but can also flip predictions incorrectly. Its parameters (MAX_SWING_BOOST = 0.12) were hand-tuned.
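A hypothetical sketch of that swing heuristic, to make the last bullet concrete: only the MAX_SWING_BOOST = 0.12 constant comes from the post; the transfer formula, the ~10-point margin scale, and the volatility range are assumptions of mine.

```python
# Hypothetical sketch of the swing-adjustment heuristic: transfer some
# probability from the leading front to the challenger when the previous
# margin was thin and the seat is volatile.
MAX_SWING_BOOST = 0.12  # hand-tuned cap mentioned in the post

def swing_adjust(p_leader, p_challenger, margin_pct, volatility):
    # thinner previous margin (below ~10 pts, assumed) and higher volatility
    # (assumed in [0, 1]) produce a larger transfer, capped at MAX_SWING_BOOST
    boost = MAX_SWING_BOOST * volatility * max(0.0, 1.0 - margin_pct / 10.0)
    boost = min(boost, MAX_SWING_BOOST)
    return p_leader - boost, p_challenger + boost

print(swing_adjust(0.60, 0.30, margin_pct=2.0, volatility=1.0))
```

Because the transfer depends only on margin and volatility, it can push a genuinely safe but thin-margin seat across the 0.5 line, which is exactly the "flip predictions incorrectly" failure mode noted above.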

Things I chose not to do

  • Nested CV for routing weights. Proper nested CV would use inner folds to learn weights and outer folds to evaluate — but with only 5 folds total, that would leave too little data per inner fold. The honest accuracy metric is a cheaper alternative.
  • Probability squashing. I considered artificially capping max probability at 0.75 or adding an entropy penalty, but this felt like hiding the problem rather than fixing it. The model's confidence reflects its information set — the real issue is the information set being incomplete.
  • Including 2026 poll or media data. This would make predictions more realistic but also more fragile and harder to validate. The model is intentionally a structural-only estimator.

The pipeline in practice

The full pipeline runs in about 70 seconds. It:

  1. Loads 1,298 rows of election data (1977–2021) plus external data sources
  2. Engineers 48 features across 9 groups
  3. Runs 5-fold temporal CV with 8 models (40 model fits) to learn constituency routing weights
  4. Runs 3-fold CV (24 model fits) for supplementary reporting
  5. Trains the final ConstituencyRouter on all data (8 model fits)
  6. Generates 140 constituency predictions for 2026
  7. Writes the output CSV that the webapp reads

The output CSV has 13 columns per constituency: predicted party and alliance, confidence, per-front probabilities, and second/third place predictions with their parties and confidences.


Honest assessment

If I had to summarise the model in one sentence: it is a structured probability aid that gets about 67% of constituency-level predictions right, based purely on historical patterns.

What it is good for:

  • Identifying historically safe seats with high reliability
  • Spotting genuinely competitive constituencies where the model is uncertain
  • Providing a starting point for discussion ("the model says X, but on the ground...")

What it is not good for:

  • Predicting actual seat counts for party strategists
  • Forecasting wave elections or momentum shifts
  • Capturing anything about 2026 specifically (candidates, campaign quality, local issues)

The predictions should be read as "if history repeats, this is what happens" — with the strong caveat that history does not always repeat.


What I would do differently

If I were to rebuild this from scratch with more time:

  1. Add polling data. Even noisy pre-election polls would provide signal about current momentum that historical data completely misses.
  2. Model margin, not just winner. Predicting the vote share margin would give more nuanced confidence estimates than classification.
  3. Use nested CV properly. Even with small data, a 3×3 nested CV would give less biased accuracy estimates.
  4. Add candidate features. Criminal record data, political family connections, and candidate education levels are publicly available and predictive.
  5. Reduce model count. 8 models is likely overkill for 1,298 rows. A simpler 3-model router (one tree model, one Bayesian, one geographic) would probably achieve 90% of the accuracy with better interpretability.

But for a hobby project, the current pipeline captures the structural signals well enough to be genuinely useful as a quick-reference tool. Just do not bet money on it.


/ THAT'S A WRAP

Have a great day.

Thanks for reading all the way to the end.