oracle-ai-developer-hub

Soccer Analytics Agent Workshop

This directory contains the runnable 90-minute workshop materials for the World Cup 2026 soccer analytics agent.

Final ML pipeline diagram

World Cup 2026 ML feature pipeline diagram

Editable draw.io source: ../docs/assets/ml_feature_pipeline.drawio

Detailed feature notes: ../docs/ML_FEATURE_PIPELINE.md

What the model uses

The prediction engine is not a toy prompt wrapper. The final predict_match path builds the same 92-feature row that trained the model:

Feature family Count Why it matters
Original baseline: Elo, form/goals, H2H, match context 40 Long-term strength, recent trajectory, pair history, venue/rest/tournament setup.
Goalscorer intelligence 12 Scoring depth, star dependency, penalties, late goals, first-half scoring.
Momentum / psychology 16 Streaks, unbeaten runs, clean sheets, comebacks, draw tendency, blowouts.
Poisson expected goals 8 Independent goal-rate model with lambdas, win/draw probabilities, variance, overperformance.
Venue / geography 5 Altitude, confederation strength, intercontinental matchup effects.
Tournament context 11 World Cup form, competitive form, big-game factor, World Cup experience.
Total 92 Full XGBoost input row for live and cached predictions.

Elo scoring note

The Elo system is replayed chronologically from the match dataset. Every team starts at 1500; non-neutral home teams receive a +100 expected-score bonus; and match importance is weighted by K-factor:

Match category K-factor
FIFA World Cup 60
Continental championship 50
Qualifier / Nations League / default competitive 40
Friendly 20

Expected score:

E_home = 1 / (1 + 10^(-(R_home - R_away + HA) / 400))

Rating update:

R_new = R_old + K * G * (S_actual - S_expected)

G is the goal-difference multiplier: 1.0 for one-goal-or-draw outcomes, 1.5 for two goals, 1.75 for three goals, and 1.75 + (GD - 3) / 8 for four or more.

Retrieval and observability proof points

After predictions are created, scripts/load_langchain_vectors.py converts cached prediction rows and football facts into LangChain OracleVS documents in SOCCER_LANGCHAIN_DOCS using langchain-oracledb. The final Grok 4 chat should use hybrid_retrieve as the primary evidence path, then contrast it with semantic-only vector_search when requested.

Every /chat turn also writes ordered execution steps into Oracle through langgraph-oracledb OracleStore. Inspect them after a demo turn with:

curl http://localhost:8000/observability/<session_id> | uv run python -m json.tool

Expected events include turn_start, grounding_retrieved, model_response, tool_call, tool_result, and final_response.

Workshop demo prompt:

Use hybrid retrieval to explain the evidence for Spain vs Brazil, and contrast it with semantic-only memory.