FeedRight Docs

Training Pipeline

How the FeedRight DQN model is pretrained — network architecture, hyperparameters, exploration strategy, and evaluation.

The FeedRight agent is pretrained with a Deep Q-Network (DQN) using the Stable-Baselines3 (SB3) library. Training runs entirely inside the simulated FishFeedingEnv environment before the model is deployed to production cages.

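Algorithm: DQN with a Target Network

Stable-Baselines3's DQN class implements vanilla DQN: an experience replay buffer plus a separate target network that is synced periodically. It does not include Double DQN; the SB3 documentation states that extensions such as Double DQN, Dueling DQN, and prioritized experience replay are not provided. Unless FeedRight overrides the training step, the bootstrap target is:

Q(s, a) ← r + γ · max_a' Q_target(s', a')

where Q_target both selects and evaluates the best next action. Because the same estimates serve both purposes, the max tends to overestimate action values, a known issue in standard Q-learning.

Double DQN reduces this bias by decoupling action selection from value estimation:

Q(s, a) ← r + γ · Q_target(s', argmax_a' Q_online(s', a'))

where Q_online selects the best next action and Q_target evaluates it. Using this update with SB3 requires a custom subclass that replaces the target computation in DQN.train().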
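To see how the Double DQN target differs from the plain max-based target, the sketch below computes both for a single transition. The Q-values are made-up numbers and the function names are illustrative; none of this is taken from FeedRight or Stable-Baselines3 code.

```python
def vanilla_dqn_target(reward, gamma, q_target_next, done=False):
    """r + gamma * max_a' Q_target(s', a')."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """r + gamma * Q_target(s', argmax_a' Q_online(s', a'))."""
    if done:
        return reward
    # The online network picks the action...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ...and the target network supplies its value.
    return reward + gamma * q_target_next[best_action]


# Made-up Q-values for the six actions in the next state s'.
q_online_next = [0.2, 0.9, 0.4, 0.1, 0.3, 0.0]
q_target_next = [0.1, 0.5, 1.2, 0.0, 0.2, 0.1]

vanilla = vanilla_dqn_target(1.0, 0.99, q_target_next)
double = double_dqn_target(1.0, 0.99, q_online_next, q_target_next)
print(f"vanilla target: {vanilla:.3f}, double target: {double:.3f}")
```

In this example the max-based target bootstraps from the largest target-network value (1.2), while the Double DQN target uses the target network's value for the action the online network prefers (0.5), which gives the smaller estimate.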

Network Architecture

The policy uses a fully-connected MLP with four hidden layers:

Input (44) → 512 → 256 → 128 → 64 → Output (6 Q-values)
| Layer | Neurons | Activation |
| --- | --- | --- |
| Input | 44 | |
| Hidden 1 | 512 | ReLU |
| Hidden 2 | 256 | ReLU |
| Hidden 3 | 128 | ReLU |
| Hidden 4 | 64 | ReLU |
| Output | 6 | Linear (Q-values) |

This gives about 196K trainable parameters (195,910 counting weights and biases), roughly 3.7× the earlier custom DQN implementation (which used [256, 128, 64] with ~52K params). The larger capacity is intended to help the network capture interactions between the 44 input features.
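The parameter count can be checked with a few lines of arithmetic. The helper below is an illustrative sketch, not part of the FeedRight codebase; the figure for the earlier network assumes it also had 44 inputs and 6 outputs, which this page does not state.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


current = mlp_param_count([44, 512, 256, 128, 64, 6])
# Assumption: the earlier [256, 128, 64] network used the same 44 inputs and 6 outputs.
earlier = mlp_param_count([44, 256, 128, 64, 6])

print(current)                       # 195910
print(earlier)                       # 53062
print(round(current / earlier, 2))   # 3.69
```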

Hyperparameters

| Parameter | Value | Notes |
| --- | --- | --- |
| learning_rate | 1e-4 | Adam optimiser |
| buffer_size | 50 000 | Replay buffer capacity (5× the original implementation) |
| learning_starts | 1 000 | Steps of random exploration before training begins |
| batch_size | 64 | Mini-batch size for gradient updates (2× the original) |
| gamma | 0.99 | Discount factor: the agent values future rewards highly |
| target_update_interval | 1 000 | Steps between target network syncs |
| exploration_fraction | 0.3 | Fraction of total training over which ε decays |
| exploration_initial_eps | 1.0 | Starting ε (fully random) |
| exploration_final_eps | 0.05 | Final ε (5% random actions) |
| total_timesteps | 100 000 | Default training length |
| seed | 42 | For reproducibility |
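As a rough guide, these values map onto SB3's DQN constructor as sketched below. The names DQN_KWARGS, TOTAL_TIMESTEPS, and build_model are illustrative assumptions, not FeedRight's actual code. The keyword arguments themselves are standard SB3 DQN parameters, and total_timesteps is passed to model.learn() rather than to the constructor.

```python
# Constructor arguments taken from the table above.
DQN_KWARGS = dict(
    learning_rate=1e-4,
    buffer_size=50_000,
    learning_starts=1_000,
    batch_size=64,
    gamma=0.99,
    target_update_interval=1_000,
    exploration_fraction=0.3,
    exploration_initial_eps=1.0,
    exploration_final_eps=0.05,
    # Four hidden layers; ReLU is the default activation of SB3's DQN policy.
    policy_kwargs=dict(net_arch=[512, 256, 128, 64]),
    seed=42,
)

# Passed to model.learn(), not to the constructor.
TOTAL_TIMESTEPS = 100_000


def build_model(env, device="auto"):
    """Create an untrained DQN model (hypothetical helper).

    SB3 is imported lazily so the configuration can be inspected without it.
    """
    from stable_baselines3 import DQN

    return DQN("MlpPolicy", env, device=device, verbose=1, **DQN_KWARGS)
```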

Exploration Strategy: ε-Greedy

The agent uses linear ε-greedy decay:

       ε
  1.0 ┤━━━━━━╲
       │       ╲
       │        ╲
  0.05 ┤         ╲━━━━━━━━━━━━━━━━━━━━━
       └──┬──────┬──────────────────────
          0    30K                  100K
                timesteps
  • Steps 0 – 1 000 (learning_starts): Pure random actions, no gradient updates. The replay buffer fills with diverse initial experiences.
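  • Steps 1 000 – 30 000: ε continues its linear decay towards 0.05. SB3 measures the decay over the first 30% of training (30 000 of 100K steps) starting at step 0, so ε is already about 0.97 at step 1 000. The agent gradually shifts from exploration to exploitation.
  • Steps 30 000 – 100 000: ε stays at 0.05. The agent mostly exploits the learned policy but keeps 5% random actions to avoid getting stuck in local optima.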
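The schedule can be written as a small function. This is a sketch of the linear decay described above, assuming the decay is measured from step 0 as SB3's schedule is; it is not SB3's internal code, and epsilon_at is an illustrative name. During the first learning_starts steps the actions are random regardless of ε.

```python
def epsilon_at(step, total_timesteps=100_000, fraction=0.3,
               eps_start=1.0, eps_end=0.05):
    """Linear decay from eps_start to eps_end over the first `fraction` of training."""
    decay_steps = fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)  # clamp once the decay has finished
    return eps_start + progress * (eps_end - eps_start)


for step in (0, 1_000, 15_000, 30_000, 100_000):
    print(f"step {step:>7}: epsilon = {epsilon_at(step):.3f}")
```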

Training Loop

The training loop is managed by FeedingDQNTrainer.train():

  1. Environment wrapping: The FishFeedingEnv is wrapped in SB3's Monitor for automatic episode logging.
  2. Callback stack:
    • EvalCallback: Every 5 000 steps, evaluates the model over 10 episodes and saves the best-performing checkpoint.
    • CheckpointCallback: Saves a full checkpoint (model + replay buffer) every 10 000 steps.
    • TensorboardCallback: Custom callback logging per-episode rewards, lengths, and feed counts.
  3. Training: model.learn(total_timesteps=100000, progress_bar=True)
  4. Final save: The model is saved to ./models/final_model.
  5. Plotting: A training progress chart (episode rewards and lengths with moving average) is saved to ./models/training_progress_sb3.png.
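The steps above can be sketched with SB3's built-in callbacks. This is a minimal illustration, not the FeedingDQNTrainer source: the function names, the name_prefix value, and the directory layout under model_dir are assumptions, and the custom TensorboardCallback and the plotting step are left out. The training environment is assumed to have been wrapped in Monitor before the model was built.

```python
def make_callbacks(eval_env, model_dir="./models"):
    """Evaluation and checkpoint callbacks using the intervals listed above."""
    from stable_baselines3.common.callbacks import (
        CallbackList,
        CheckpointCallback,
        EvalCallback,
    )
    from stable_baselines3.common.monitor import Monitor

    eval_callback = EvalCallback(
        Monitor(eval_env),
        n_eval_episodes=10,
        eval_freq=5_000,
        best_model_save_path=f"{model_dir}/best_model",
        deterministic=True,
    )
    checkpoint_callback = CheckpointCallback(
        save_freq=10_000,
        save_path=f"{model_dir}/checkpoints",
        name_prefix="dqn_feeding",   # hypothetical prefix
        save_replay_buffer=True,     # keep the buffer alongside the model
    )
    return CallbackList([eval_callback, checkpoint_callback])


def run_training(model, callbacks, total_timesteps=100_000, model_dir="./models"):
    """Train, then write the final model (SB3 appends .zip to the path)."""
    model.learn(
        total_timesteps=total_timesteps,
        callback=callbacks,
        progress_bar=True,  # needs the optional tqdm and rich packages
    )
    model.save(f"{model_dir}/final_model")
    return model
```

A run would then look like run_training(model, make_callbacks(eval_env)), where eval_env is a separate instance of the environment.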

Device Selection

The trainer automatically detects the best available compute device:

Priority: MPS (Apple Silicon) → CUDA (NVIDIA GPU) → CPU
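One way to implement this priority order with PyTorch is sketched below. The function names are illustrative and the trainer's actual detection code may differ.

```python
def pick_device(mps_available, cuda_available):
    """Apply the documented priority: MPS, then CUDA, then CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device():
    """Query PyTorch for the available backends (torch imported lazily)."""
    import torch

    return pick_device(
        torch.backends.mps.is_available(),
        torch.cuda.is_available(),
    )
```

The returned string can be passed as the device argument when the SB3 model is created.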

Evaluation

After training, FeedingDQNTrainer.evaluate() uses SB3's evaluate_policy to run n_eval_episodes (default 20) in deterministic mode and reports:

  • Mean reward ± standard deviation
  • Individual test episode replay with step-by-step action/reward trace
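A minimal sketch of both reports follows. It assumes a Gymnasium-style environment (reset() returns obs and info; step() returns five values), as used by SB3 2.x. The function names are illustrative, not the FeedingDQNTrainer source.

```python
def evaluate(model, env, n_eval_episodes=20):
    """Mean and standard deviation of episode reward under the greedy policy."""
    from stable_baselines3.common.evaluation import evaluate_policy

    mean_reward, std_reward = evaluate_policy(
        model, env, n_eval_episodes=n_eval_episodes, deterministic=True
    )
    print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
    return mean_reward, std_reward


def replay_episode(model, env):
    """Run one deterministic episode and print a step-by-step trace."""
    obs, _info = env.reset()
    trace, total_reward, done = [], 0.0, False
    while not done:
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _info = env.step(int(action))
        total_reward += float(reward)
        trace.append((len(trace) + 1, int(action), float(reward)))
        print(f"step {len(trace)}: action={int(action)} reward={float(reward):+.3f}")
        done = terminated or truncated
    return total_reward, trace
```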

Episode Structure

Each training episode follows this cycle:

  1. env.reset() generates a random initial state within realistic bounds.
    • Initial time_since_last_feed is randomised between 3–8 hours.
    • Initial feeds_today is set to 0.
  2. The agent selects actions until feeds_count >= max_feeds_per_episode (default 6).
  3. Each step:
    • Agent picks action (0–5) → mapped to feed amount via action_to_kg.
    • _transition_state() updates feeding history, hunger, activity, and adds stochastic noise.
    • _calculate_reward() produces the scalar reward signal.
  4. The episode terminates; SB3 logs the cumulative reward and length.
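The cycle above can be illustrated with a toy stand-in that keeps only the episode bookkeeping. Everything here is a simplification under stated assumptions: ToyFeedingEpisode is not FishFeedingEnv, the 0.5 kg-per-action mapping is hypothetical, every step is assumed to count towards feeds_count, and the reward is a placeholder for _calculate_reward().

```python
import random


class ToyFeedingEpisode:
    """Toy stand-in for FishFeedingEnv: reset, step, terminate after N feeds."""

    def __init__(self, max_feeds_per_episode=6, seed=None):
        self.max_feeds_per_episode = max_feeds_per_episode
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Initial time since last feed is randomised between 3 and 8 hours.
        self.time_since_last_feed = self.rng.uniform(3.0, 8.0)
        self.feeds_today = 0
        self.feeds_count = 0
        return (self.time_since_last_feed, self.feeds_today)

    def step(self, action):
        if action not in range(6):
            raise ValueError("action must be an integer in 0..5")
        feed_kg = 0.5 * action   # hypothetical action_to_kg mapping
        self.feeds_count += 1    # assumption: every step counts as a feed decision
        self.feeds_today = self.feeds_count
        if feed_kg > 0:
            self.time_since_last_feed = 0.0
        reward = 0.0             # placeholder for _calculate_reward()
        terminated = self.feeds_count >= self.max_feeds_per_episode
        return (self.time_since_last_feed, self.feeds_today), reward, terminated


def run_episode(env, policy):
    """Roll out one episode; `policy` maps a state to an action in 0..5."""
    state = env.reset()
    steps, total_reward, terminated = 0, 0.0, False
    while not terminated:
        state, reward, terminated = env.step(policy(state))
        steps += 1
        total_reward += reward
    return steps, total_reward
```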

Pretraining vs Fine-Tuning

The initial model is pretrained entirely in simulation. The simulated environment models:

  • Deterministic state transitions (hunger reduction proportional to feed, time-based hunger increase)
  • Stochastic environmental noise (±0.02 dissolved oxygen (DO), ±0.01 temperature, ±0.05 motion per step)
  • Realistic initial state sampling from defined min/max ranges

After deployment, the Adaptive Learning pipeline collects real-world outcomes and retrains the model by pre-filling the replay buffer with actual experiences, bridging the sim-to-real gap.
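Pre-filling the buffer could look roughly like the sketch below. It relies on the add() method of SB3's ReplayBuffer with a single environment; the function name and the transition record layout (obs, action, reward, next_obs, done keys) are assumptions, since the Adaptive Learning pipeline's data format is not described on this page.

```python
def prefill_replay_buffer(model, transitions):
    """Copy logged real-world transitions into an SB3 model's replay buffer.

    Each transition is assumed to be a dict with the hypothetical keys
    "obs", "action", "reward", "next_obs", and "done".
    """
    import numpy as np  # imported lazily; SB3 already depends on NumPy

    count = 0
    for t in transitions:
        model.replay_buffer.add(
            np.asarray(t["obs"], dtype=np.float32).reshape(1, -1),       # (n_envs, obs_dim)
            np.asarray(t["next_obs"], dtype=np.float32).reshape(1, -1),
            np.array([[t["action"]]]),                                   # (n_envs, action_dim)
            np.array([t["reward"]], dtype=np.float32),
            np.array([t["done"]]),
            [{}],                                                        # one info dict per env
        )
        count += 1
    return count
```

A retraining run might load the deployed checkpoint with DQN.load(path, env=env), call prefill_replay_buffer, and then continue with model.learn(..., reset_num_timesteps=False). How many steps to train and how to mix simulated and real data are choices this page does not specify.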

Reproducing a Training Run

cd lib/dqn_sb3
python train.py

Output:

  • ./models/best_model/best_model.zip — best checkpoint by evaluation reward
  • ./models/final_model.zip — final checkpoint
  • ./models/checkpoints/ — periodic snapshots
  • ./tensorboard_logs/ — TensorBoard event files
  • ./models/training_progress_sb3.png — reward/length plot

To monitor live:

tensorboard --logdir=./tensorboard_logs
