Training Pipeline
How the FeedRight DQN model is pretrained — network architecture, hyperparameters, exploration strategy, and evaluation.
The FeedRight agent is pretrained as a Deep Q-Network (DQN) using the Stable-Baselines3 (SB3) library. Training runs entirely inside the simulated FishFeedingEnv environment before the model is deployed to production cages.
Algorithm: Double DQN
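Double DQN decouples action selection from value estimation to reduce overestimation bias, a known issue in standard Q-learning. Note that the stock DQN class in Stable-Baselines3 implements vanilla DQN and does not apply this extension on its own, so the update below holds only if the trainer adds it, for example by overriding the target computation.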
The core Bellman update becomes:
Q(s, a) ← r + γ · Q_target(s', argmax_a' Q_online(s', a'))
where Q_online selects the best next action and Q_target evaluates it.
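The update above can be illustrated for a single transition with a minimal, library-independent sketch. The function name and the sample Q-values are hypothetical, and the terminal-state handling (no bootstrapping when the episode has ended) is a standard addition that the formula leaves implicit.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Return r + gamma * Q_target(s', argmax_a' Q_online(s', a'))."""
    # The online network chooses the greedy next action.
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # The target network supplies the value of that chosen action.
    next_value = q_target_next[best_action]
    # Terminal transitions do not bootstrap from the next state.
    return reward if done else reward + gamma * next_value


# Illustrative Q-values for three actions in the next state s'.
q_online_next = [0.2, 0.9, 0.1]   # online net prefers action 1
q_target_next = [0.5, 0.4, 0.7]   # target net would prefer action 2

target = double_dqn_target(1.0, False, q_online_next, q_target_next)
vanilla = 1.0 + 0.99 * max(q_target_next)  # vanilla DQN target, for comparison
print(target, vanilla)
```

In this example the Double DQN target is lower than the vanilla one, because the target network's own maximum is not used; this is the mechanism that damps overestimation.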
Network Architecture
The policy uses a fully-connected MLP with four hidden layers:
Input (44) → 512 → 256 → 128 → 64 → Output (6 Q-values)
| Layer | Neurons | Activation |
|---|---|---|
| Input | 44 | — |
| Hidden 1 | 512 | ReLU |
| Hidden 2 | 256 | ReLU |
| Hidden 3 | 128 | ReLU |
| Hidden 4 | 64 | ReLU |
| Output | 6 | Linear (Q-values) |
This produces approximately 196K trainable parameters (195,910 counting weights and biases), a ~4× increase from the earlier custom DQN implementation (which used [256, 128, 64] with ~52K params). The larger capacity is intended to help the network model complex interactions between the 44 input features.
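The parameter count can be checked by hand: each fully-connected layer contributes (inputs × outputs) weights plus one bias per output. A short sketch of that arithmetic for the layer sizes in the table, where `mlp_param_count` is a hypothetical helper name:

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully-connected MLP."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus bias vector
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


current = mlp_param_count([44, 512, 256, 128, 64, 6])
print(current)
```

Most of the capacity sits in the 512 → 256 layer, which alone accounts for roughly two thirds of the total.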
Hyperparameters
| Parameter | Value | Notes |
|---|---|---|
| learning_rate | 1e-4 | Adam optimiser |
| buffer_size | 50 000 | Replay buffer capacity (5× the original implementation) |
| learning_starts | 1 000 | Steps of random exploration before training begins |
| batch_size | 64 | Mini-batch size for gradient updates (2× the original) |
| gamma | 0.99 | Discount factor; the agent values future rewards highly |
| target_update_interval | 1 000 | Steps between target network syncs |
| exploration_fraction | 0.3 | Fraction of total training over which ε decays |
| exploration_initial_eps | 1.0 | Starting ε (fully random) |
| exploration_final_eps | 0.05 | Final ε (5% random actions) |
| total_timesteps | 100 000 | Default training length |
| seed | 42 | For reproducibility |
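As a rough sketch, the table maps onto SB3's `DQN` constructor as shown here. The keyword names are SB3's own; `build_model` and the `DQN_KWARGS` constant are hypothetical names, and the actual trainer may assemble its arguments differently. The `net_arch` entry assumes the four hidden layers described under Network Architecture.

```python
# Hyperparameters from the table, expressed as SB3 DQN keyword arguments.
DQN_KWARGS = dict(
    learning_rate=1e-4,
    buffer_size=50_000,
    learning_starts=1_000,
    batch_size=64,
    gamma=0.99,
    target_update_interval=1_000,
    exploration_fraction=0.3,
    exploration_initial_eps=1.0,
    exploration_final_eps=0.05,
    policy_kwargs=dict(net_arch=[512, 256, 128, 64]),
    seed=42,
)
TOTAL_TIMESTEPS = 100_000


def build_model(env):
    """Create the DQN model for a Gymnasium-compatible environment."""
    # Imported lazily so the configuration can be inspected without SB3 installed.
    from stable_baselines3 import DQN

    return DQN("MlpPolicy", env, **DQN_KWARGS)
```

Training would then be started with `build_model(env).learn(total_timesteps=TOTAL_TIMESTEPS)`.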
Exploration Strategy: ε-Greedy
The agent uses linear ε-greedy decay:
ε
1.0  ┤━━━━━━╲
     │       ╲
     │        ╲
0.05 ┤         ╲━━━━━━━━━━━━━━━━━━━━━
     └──┬──────┬──────────────────────
        0     30K                100K
                timesteps
- Steps 0 – 1 000 (learning_starts): Pure random actions, no gradient updates. The replay buffer fills with diverse initial experiences.
- Steps 1 000 – 30 000: Gradient updates begin and ε continues its linear decay, reaching 0.05 at step 30 000 (30% of 100K). SB3 measures the schedule from step 0, so ε is already about 0.97 when learning starts. The agent gradually shifts from exploration to exploitation.
- Steps 30 000 – 100 000: ε stays at 0.05. The agent mostly exploits the learned policy but retains 5% random exploration to avoid local optima.
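The schedule can be written as a small function. This is a sketch that mirrors a linear schedule measured from step 0 (which is how SB3 computes it, to the best of our understanding); `epsilon_at` is a hypothetical helper, not part of SB3 or the trainer.

```python
def epsilon_at(step, total_timesteps=100_000, fraction=0.3,
               eps_start=1.0, eps_end=0.05):
    """Linear decay from eps_start to eps_end over the first `fraction` of training."""
    progress = step / total_timesteps
    if progress >= fraction:
        return eps_end  # decay finished; hold the final value
    return eps_start + (eps_end - eps_start) * progress / fraction


for step in (0, 1_000, 15_000, 30_000, 100_000):
    print(step, round(epsilon_at(step), 3))
```

During the first 1 000 steps the computed ε is not what drives behaviour, since actions are sampled uniformly at random until learning starts.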
Training Loop
The training loop is managed by FeedingDQNTrainer.train():
- Environment wrapping: The FishFeedingEnv is wrapped in SB3's Monitor for automatic episode logging.
- Callback stack:
  - EvalCallback: Every 5 000 steps, evaluates the model over 10 episodes and saves the best-performing checkpoint.
  - CheckpointCallback: Saves a full checkpoint (model + replay buffer) every 10 000 steps.
  - TensorboardCallback: Custom callback logging per-episode rewards, lengths, and feed counts.
- Training: model.learn(total_timesteps=100000, progress_bar=True)
- Final save: The model is saved to ./models/final_model.
- Plotting: A training progress chart (episode rewards and lengths with moving average) is saved to ./models/training_progress_sb3.png.
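The two built-in callbacks could be assembled roughly as follows. This is a sketch, not the contents of FeedingDQNTrainer.train(): `build_callbacks` and the constants are hypothetical names, the save paths are inferred from the output locations listed under Reproducing a Training Run, and the project's custom TensorboardCallback is omitted because its code is not shown here.

```python
EVAL_FREQ = 5_000          # steps between evaluations
N_EVAL_EPISODES = 10       # episodes per evaluation
CHECKPOINT_FREQ = 10_000   # steps between full checkpoints


def build_callbacks(eval_env, model_dir="./models"):
    """Return the evaluation and checkpoint callbacks as a single CallbackList."""
    # Imported lazily so this module loads even where SB3 is not installed.
    from stable_baselines3.common.callbacks import (
        CallbackList,
        CheckpointCallback,
        EvalCallback,
    )

    eval_cb = EvalCallback(
        eval_env,
        eval_freq=EVAL_FREQ,
        n_eval_episodes=N_EVAL_EPISODES,
        best_model_save_path=f"{model_dir}/best_model",
        deterministic=True,
    )
    checkpoint_cb = CheckpointCallback(
        save_freq=CHECKPOINT_FREQ,
        save_path=f"{model_dir}/checkpoints",
        save_replay_buffer=True,  # keep the buffer so training can resume
    )
    return CallbackList([eval_cb, checkpoint_cb])
```

The result would be passed to training as `model.learn(total_timesteps=100_000, callback=build_callbacks(eval_env), progress_bar=True)`, where `eval_env` is a separate Monitor-wrapped environment instance.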
Device Selection
The trainer automatically detects the best available compute device:
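A minimal sketch of such a check, assuming PyTorch (the backend SB3 runs on) and a preference for MPS, then CUDA, then CPU. Both function names are hypothetical; the decision logic is kept separate from the hardware query so it can be tested on any machine.

```python
def pick_device(mps_available, cuda_available):
    """Apply the priority order to two availability flags."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device():
    """Query PyTorch for the available backends and pick one."""
    import torch  # imported lazily; only needed when actually detecting

    # Older PyTorch builds have no torch.backends.mps attribute.
    mps_ok = hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
    return pick_device(mps_ok, torch.cuda.is_available())
```

The returned string can be passed to SB3 through the `device` argument of the model constructor.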
Priority: MPS (Apple Silicon) → CUDA (NVIDIA GPU) → CPU
Evaluation
After training, FeedingDQNTrainer.evaluate() uses SB3's evaluate_policy to run n_eval_episodes (default 20) in deterministic mode and reports:
- Mean reward ± standard deviation
- Individual test episode replay with step-by-step action/reward trace
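The mean ± standard deviation figure can be reproduced with a sketch like the one below. `evaluate_policy` and its arguments are SB3's; `summarize` and `evaluate` are hypothetical helpers, and the real FeedingDQNTrainer.evaluate() may report additional details.

```python
from statistics import fmean, pstdev


def summarize(episode_rewards):
    """Return (mean, population standard deviation) of per-episode rewards."""
    return fmean(episode_rewards), pstdev(episode_rewards)


def evaluate(model, env, n_eval_episodes=20):
    """Run deterministic evaluation episodes and summarise their rewards."""
    # Imported lazily so summarize() can be used without SB3 installed.
    from stable_baselines3.common.evaluation import evaluate_policy

    rewards, _lengths = evaluate_policy(
        model,
        env,
        n_eval_episodes=n_eval_episodes,
        deterministic=True,            # greedy actions, no ε exploration
        return_episode_rewards=True,   # per-episode lists instead of a summary
    )
    return summarize(rewards)


mean_reward, std_reward = summarize([1.0, 2.0, 3.0])
print(f"{mean_reward:.2f} ± {std_reward:.2f}")
```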
Episode Structure
Each training episode follows this cycle:
- env.reset() generates a random initial state within realistic bounds.
  - Initial time_since_last_feed is randomised between 3–8 hours.
  - Initial feeds_today is set to 0.
- The agent selects actions until feeds_count >= max_feeds_per_episode (default 6).
- Each step:
  - Agent picks action (0–5) → mapped to feed amount via action_to_kg.
  - _transition_state() updates feeding history, hunger, activity, and adds stochastic noise.
  - _calculate_reward() produces the scalar reward signal.
- The episode terminates; SB3 logs the cumulative reward and length.
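The cycle above can be illustrated with a toy stand-in for the environment. Everything in this sketch is hypothetical: `ToyFeedingEnv` is not FishFeedingEnv, the feed amounts and reward are placeholders, and it assumes every decision counts toward the feed limit. Only the episode bookkeeping (randomised start, six-feed limit, cumulative reward) follows the description.

```python
import random


class ToyFeedingEnv:
    """Hypothetical stand-in for FishFeedingEnv; models episode bookkeeping only."""

    ACTION_TO_KG = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]  # illustrative amounts only

    def __init__(self, max_feeds_per_episode=6, seed=42):
        self.max_feeds = max_feeds_per_episode
        self.rng = random.Random(seed)

    def reset(self):
        self.feeds_count = 0
        # Initial time since last feed is drawn from 3 to 8 hours.
        self.time_since_last_feed = self.rng.uniform(3.0, 8.0)
        return self.time_since_last_feed

    def step(self, action):
        kg = self.ACTION_TO_KG[action]
        self.feeds_count += 1            # assumption: each decision counts as a feed
        reward = -abs(kg - 1.0)          # placeholder reward, not the real signal
        done = self.feeds_count >= self.max_feeds
        return self.time_since_last_feed, reward, done


env = ToyFeedingEnv()
obs = env.reset()
total_reward, steps, done = 0.0, 0, False
while not done:
    action = env.rng.randrange(6)        # random policy over actions 0 to 5
    obs, reward, done = env.step(action)
    total_reward += reward
    steps += 1
print(steps, total_reward)
```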
Pretraining vs Fine-Tuning
The initial model is pretrained entirely in simulation. The simulated environment models:
- Deterministic state transitions (hunger reduction proportional to feed, time-based hunger increase)
- Stochastic environmental noise (±0.02 DO, ±0.01 temperature, ±0.05 motion per step)
- Realistic initial state sampling from defined min/max ranges
After deployment, the Adaptive Learning pipeline collects real-world outcomes and retrains the model by pre-filling the replay buffer with actual experiences, bridging the sim-to-real gap.
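The idea of pre-filling a replay buffer can be sketched independently of SB3. The `Transition` tuple and `prefill_buffer` helper below are hypothetical and use a plain bounded deque; the actual Adaptive Learning pipeline presumably writes into the model's own replay buffer instead.

```python
from collections import deque, namedtuple

# One logged real-world experience: state, action taken, outcome, next state.
Transition = namedtuple("Transition", "obs action reward next_obs done")


def prefill_buffer(real_transitions, capacity=50_000):
    """Load logged transitions into a bounded FIFO buffer before retraining."""
    buffer = deque(maxlen=capacity)  # oldest entries are dropped when full
    buffer.extend(Transition(*t) for t in real_transitions)
    return buffer


# Three illustrative logged transitions, loaded into a buffer that holds two.
logged = [(0, 1, 0.5, 1, False), (1, 2, 0.2, 2, False), (2, 0, -0.1, 3, True)]
buffer = prefill_buffer(logged, capacity=2)
print(len(buffer), buffer[0].obs)
```

Because the buffer is bounded, the oldest experiences fall out first, so a retraining run sees the most recent real-world data.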
Reproducing a Training Run
cd lib/dqn_sb3
python train.py
Output:
- ./models/best_model/best_model.zip: best checkpoint by evaluation reward
- ./models/final_model.zip: final checkpoint
- ./models/checkpoints/: periodic snapshots
- ./tensorboard_logs/: TensorBoard event files
- ./models/training_progress_sb3.png: reward/length plot
To monitor live:
tensorboard --logdir=./tensorboard_logs