Training Pipeline

How the FeedRight DQN model is pretrained — network architecture, hyperparameters, exploration strategy, and evaluation.

The FeedRight agent is pretrained as a Deep Q-Network (DQN) using the Stable-Baselines3 (SB3) library. Training runs entirely inside the simulated FishFeedingEnv environment before the model is deployed to production cages.

Algorithm: Double DQN

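Double DQN decouples action selection from value estimation to reduce overestimation bias, a known issue in standard Q-learning. Note that the stock DQN class in Stable-Baselines3 implements vanilla DQN only; its documentation states that extensions such as Double DQN are not included. The Double DQN target shown here therefore applies only if the trainer overrides the target computation (for example in a DQN subclass); otherwise the target is r + γ · max_a' Q_target(s', a').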

The core Bellman update becomes:

Q(s, a) ← r + γ · Q_target(s', argmax_a' Q_online(s', a'))

where Q_online selects the best next action and Q_target evaluates it.
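To make the difference concrete, here is a minimal pure-Python sketch contrasting the vanilla DQN target with the Double DQN target described above. The function names and the toy Q-values are illustrative and not taken from the FeedRight code; dropping the bootstrap term on terminal transitions is the usual convention rather than something stated on this page.

```python
from typing import Sequence


def vanilla_dqn_target(reward: float, gamma: float,
                       q_target_next: Sequence[float], done: bool = False) -> float:
    """r + gamma * max_a' Q_target(s', a'): the target net both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward: float, gamma: float,
                      q_online_next: Sequence[float],
                      q_target_next: Sequence[float], done: bool = False) -> float:
    """r + gamma * Q_target(s', argmax_a' Q_online(s', a'))."""
    if done:
        return reward
    # The online network picks the action ...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ... and the target network evaluates that action.
    return reward + gamma * q_target_next[best_action]


# Toy next-state Q-values for the 6 actions (made-up numbers).
q_online_next = [0.2, 1.5, 0.7, 0.1, 0.9, 0.3]
q_target_next = [0.4, 1.0, 2.0, 0.2, 0.8, 0.5]

vanilla = vanilla_dqn_target(1.0, 0.99, q_target_next)
double = double_dqn_target(1.0, 0.99, q_online_next, q_target_next)
```

In this toy case the vanilla target bootstraps from the target network's largest value (action 2), while the Double DQN target uses the target network's value for the action the online network prefers (action 1), which gives a lower estimate.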

Network Architecture

The policy uses a fully-connected MLP with four hidden layers:

Input (44) → 512 → 256 → 128 → 64 → Output (6 Q-values)

Layer	Neurons	Activation
Input	44	—
Hidden 1	512	ReLU
Hidden 2	256	ReLU
Hidden 3	128	ReLU
Hidden 4	64	ReLU
Output	6	Linear (Q-values)

This produces approximately 196K trainable parameters (195 910 counting weights and biases), a roughly 3.7× increase over the earlier custom DQN implementation (which used [256, 128, 64] with ~52K params). The larger capacity helps the network model complex interactions between the 44 input features.
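The parameter count can be checked with a few lines of arithmetic. The sketch below counts weights and biases for a fully-connected network; the helper name is illustrative, and the figure for the earlier network assumes it also had 44 inputs and 6 outputs, which this page does not state.

```python
def mlp_param_count(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each dense layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Current architecture: 44 -> 512 -> 256 -> 128 -> 64 -> 6
current = mlp_param_count([44, 512, 256, 128, 64, 6])

# Earlier architecture, ASSUMING the same 44 inputs and 6 outputs.
earlier = mlp_param_count([44, 256, 128, 64, 6])

print(current, earlier, round(current / earlier, 2))
```

Under that assumption the earlier network comes out at about 53K parameters, close to the ~52K quoted above.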

Hyperparameters

Parameter	Value	Notes
`learning_rate`	1e-4	Adam optimiser
`buffer_size`	50 000	Replay buffer capacity (5× the original implementation)
`learning_starts`	1 000	Steps of random exploration before training begins
`batch_size`	64	Mini-batch size for gradient updates (2× the original)
`gamma`	0.99	Discount factor — agent values future rewards highly
`target_update_interval`	1 000	Steps between target network syncs
`exploration_fraction`	0.3	Fraction of total training over which ε decays
`exploration_initial_eps`	1.0	Starting ε (fully random)
`exploration_final_eps`	0.05	Final ε (5% random actions)
`total_timesteps`	100 000	Default training length
`seed`	42	For reproducibility
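As a rough illustration, the table maps onto SB3's DQN constructor as keyword arguments. This is a sketch, not the project's actual train.py: the names DQN_KWARGS, TOTAL_TIMESTEPS and build_model are hypothetical, and the import is deferred so the constants can be inspected without SB3 installed. Note that total_timesteps is passed to learn(), not to the constructor.

```python
# Hyperparameters from the table, expressed as SB3 DQN constructor kwargs.
DQN_KWARGS = dict(
    learning_rate=1e-4,            # Adam is the default optimiser in SB3
    buffer_size=50_000,
    learning_starts=1_000,
    batch_size=64,
    gamma=0.99,
    target_update_interval=1_000,
    exploration_fraction=0.3,
    exploration_initial_eps=1.0,
    exploration_final_eps=0.05,
    seed=42,
    # Four hidden layers; ReLU is the default activation for DQN policies.
    policy_kwargs=dict(net_arch=[512, 256, 128, 64]),
)

TOTAL_TIMESTEPS = 100_000  # passed to model.learn(), not to DQN(...)


def build_model(env):
    """Hypothetical helper: build an SB3 DQN model for a Gymnasium env."""
    from stable_baselines3 import DQN  # deferred import

    return DQN("MlpPolicy", env, **DQN_KWARGS)
```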

Exploration Strategy: ε-Greedy

The agent uses linear ε-greedy decay:

       ε
  1.0 ┤━━━━━━╲
       │       ╲
       │        ╲
  0.05 ┤         ╲━━━━━━━━━━━━━━━━━━━━━
       └──┬──────┬──────────────────────
          0    30K                  100K
                timesteps

Steps 0 – 1 000 (learning_starts): Pure random actions, no gradient updates. The replay buffer fills with diverse initial experiences.
Steps 1 000 – 30 000: Gradient updates begin and ε keeps decaying linearly toward 0.05. SB3 schedules ε from step 0 over the first 30% of the 100K timesteps, so ε is already about 0.97 at step 1 000. The agent gradually shifts from exploration to exploitation.
Steps 30 000 – 100 000: ε stays at 0.05. The agent mostly exploits the learned policy but retains 5% random exploration to avoid getting stuck in local optima.
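The schedule can be reproduced with a small function. This is a sketch of a linear schedule driven by training progress, which is how SB3 computes ε to the best of my knowledge; the function name epsilon_at is illustrative.

```python
def epsilon_at(step, total_timesteps=100_000, fraction=0.3,
               initial=1.0, final=0.05):
    """Linear decay from `initial` to `final` over the first `fraction` of training."""
    progress = step / total_timesteps
    if progress >= fraction:
        return final
    return initial + progress * (final - initial) / fraction


for step in (0, 1_000, 15_000, 30_000, 100_000):
    print(step, round(epsilon_at(step), 4))
```

With the defaults above, ε is 1.0 at step 0, about 0.968 at step 1 000, 0.525 at step 15 000, and 0.05 from step 30 000 onward.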

Training Loop

The training loop is managed by FeedingDQNTrainer.train():

Environment wrapping: The FishFeedingEnv is wrapped in SB3's Monitor for automatic episode logging.
Callback stack:
- EvalCallback: Every 5 000 steps, evaluates the model over 10 episodes and saves the best-performing checkpoint.
- CheckpointCallback: Saves a full checkpoint (model + replay buffer) every 10 000 steps.
- TensorboardCallback: Custom callback logging per-episode rewards, lengths, and feed counts.
Training: model.learn(total_timesteps=100000, progress_bar=True)
Final save: The model is saved to ./models/final_model.
Plotting: A training progress chart (episode rewards and lengths with moving average) is saved to ./models/training_progress_sb3.png.
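The steps above could be wired together roughly as follows. This is a sketch under assumptions, not the body of FeedingDQNTrainer.train(): the helper names, the name_prefix value and the directory layout are hypothetical, the custom TensorboardCallback and the plotting step are omitted, and imports are deferred so the module loads without SB3.

```python
EVAL_FREQ = 5_000         # evaluate every 5 000 steps
N_EVAL_EPISODES = 10      # episodes per evaluation
CHECKPOINT_FREQ = 10_000  # full checkpoint every 10 000 steps


def build_callbacks(eval_env, model_dir="./models"):
    """Hypothetical helper returning the evaluation and checkpoint callbacks."""
    from stable_baselines3.common.callbacks import (
        CallbackList, CheckpointCallback, EvalCallback)

    eval_cb = EvalCallback(
        eval_env,
        best_model_save_path=f"{model_dir}/best_model",
        eval_freq=EVAL_FREQ,
        n_eval_episodes=N_EVAL_EPISODES,
        deterministic=True,
    )
    checkpoint_cb = CheckpointCallback(
        save_freq=CHECKPOINT_FREQ,
        save_path=f"{model_dir}/checkpoints",
        name_prefix="dqn",         # hypothetical prefix
        save_replay_buffer=True,   # replay buffer is not saved by default
    )
    return CallbackList([eval_cb, checkpoint_cb])


def train(model, eval_env, total_timesteps=100_000):
    """Hypothetical training entry point mirroring the steps listed above."""
    from stable_baselines3.common.monitor import Monitor

    callbacks = build_callbacks(Monitor(eval_env))
    model.learn(total_timesteps=total_timesteps,
                callback=callbacks, progress_bar=True)
    model.save("./models/final_model")  # SB3 appends .zip
    return model
```

At these frequencies a 100 000-step run triggers 20 evaluations and 10 checkpoints.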

Device Selection

The trainer automatically detects the best available compute device:

Priority: MPS (Apple Silicon) → CUDA (NVIDIA GPU) → CPU
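A device check with this priority order might look like the sketch below. The function name pick_device is an assumption, and the fallback to "cpu" when PyTorch is not importable is added only so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "mps", "cuda" or "cpu", checking in that priority order."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available; nothing else to probe

    # torch.backends.mps only exists in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_device())
```

The returned string can be passed as the device argument when constructing the SB3 model.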

Evaluation

After training, FeedingDQNTrainer.evaluate() uses SB3's evaluate_policy to run n_eval_episodes (default 20) in deterministic mode and reports:

Mean reward ± standard deviation
Individual test episode replay with step-by-step action/reward trace
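The reported statistics can be sketched as follows. The helper names are hypothetical and this is not the actual FeedingDQNTrainer.evaluate() code; evaluate_policy is called with return_episode_rewards=True so the per-episode rewards are available, and the population standard deviation is used to match NumPy's default std.

```python
from statistics import fmean, pstdev


def summarise_rewards(episode_rewards):
    """Mean and population standard deviation of per-episode rewards."""
    return fmean(episode_rewards), pstdev(episode_rewards)


def evaluate(model, env, n_eval_episodes=20):
    """Hypothetical wrapper around SB3's evaluate_policy."""
    from stable_baselines3.common.evaluation import evaluate_policy  # deferred

    rewards, _lengths = evaluate_policy(
        model, env,
        n_eval_episodes=n_eval_episodes,
        deterministic=True,            # greedy actions, no epsilon exploration
        return_episode_rewards=True,   # per-episode lists instead of (mean, std)
    )
    return summarise_rewards(rewards)


mean, std = summarise_rewards([1.0, 2.0, 3.0])
print(f"{mean:.2f} ± {std:.2f}")
```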

Episode Structure

Each training episode follows this cycle:

env.reset() generates a random initial state within realistic bounds.
- Initial time_since_last_feed is randomised between 3–8 hours.
- Initial feeds_today is set to 0.
The agent selects actions until feeds_count >= max_feeds_per_episode (default 6).
Each step:
- Agent picks action (0–5) → mapped to feed amount via action_to_kg.
- _transition_state() updates feeding history, hunger, activity, and adds stochastic noise.
- _calculate_reward() produces the scalar reward signal.
The episode terminates; SB3 logs the cumulative reward and length.
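The cycle above can be illustrated with a toy stand-in for the environment. Everything here is a simplification: the class is not FishFeedingEnv, the observation has two fields instead of 44, the ACTION_TO_KG values and the zero reward are placeholders, and counting every step as one feed is an assumption.

```python
import random


class ToyFeedingEpisode:
    """Toy stand-in showing only the reset/step/terminate cycle."""

    ACTION_TO_KG = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]  # hypothetical mapping

    def __init__(self, max_feeds_per_episode=6, seed=None):
        self.max_feeds_per_episode = max_feeds_per_episode
        self.rng = random.Random(seed)

    def reset(self):
        self.time_since_last_feed = self.rng.uniform(3.0, 8.0)  # hours
        self.feeds_today = 0
        self.feeds_count = 0
        return (self.time_since_last_feed, self.feeds_today)

    def step(self, action):
        feed_kg = self.ACTION_TO_KG[action]  # action (0-5) -> feed amount
        self.feeds_count += 1                # ASSUMPTION: every step counts as a feed
        self.feeds_today += 1
        self.time_since_last_feed = 0.0
        reward = 0.0                         # placeholder for _calculate_reward()
        terminated = self.feeds_count >= self.max_feeds_per_episode
        obs = (self.time_since_last_feed, self.feeds_today)
        return obs, reward, terminated, feed_kg


def run_episode(env, policy):
    """Roll out one episode and return (cumulative reward, episode length)."""
    obs = env.reset()
    total, length, done = 0.0, 0, False
    while not done:
        obs, reward, done, _feed_kg = env.step(policy(obs))
        total += reward
        length += 1
    return total, length


env = ToyFeedingEpisode(seed=0)
total_reward, episode_length = run_episode(env, policy=lambda obs: 3)
```

With the default max_feeds_per_episode of 6, the rollout ends after six steps regardless of the actions chosen.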

Pretraining vs Fine-Tuning

The initial model is pretrained entirely in simulation. The simulated environment models:

Deterministic state transitions (hunger reduction proportional to feed, time-based hunger increase)
Stochastic environmental noise per step: ±0.02 dissolved oxygen (DO), ±0.01 temperature, ±0.05 motion
Realistic initial state sampling from defined min/max ranges
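The noise term might be applied along these lines. This sketch assumes uniform noise within the stated bounds and uses hypothetical state keys; the page does not say which distribution or field names _transition_state() actually uses.

```python
import random

# Per-step noise bounds from the list above; key names are hypothetical.
NOISE_BOUNDS = {
    "dissolved_oxygen": 0.02,
    "temperature": 0.01,
    "motion": 0.05,
}


def add_noise(state, rng):
    """Return a copy of `state` with bounded uniform noise added (ASSUMED uniform)."""
    noisy = dict(state)
    for key, bound in NOISE_BOUNDS.items():
        noisy[key] = state[key] + rng.uniform(-bound, bound)
    return noisy


rng = random.Random(42)
state = {"dissolved_oxygen": 0.70, "temperature": 0.50, "motion": 0.30}
noisy_state = add_noise(state, rng)
```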

After deployment, the Adaptive Learning pipeline collects real-world outcomes and retrains the model by pre-filling the replay buffer with actual experiences, bridging the sim-to-real gap.
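The pre-filling idea can be sketched with a plain bounded buffer. This is a generic illustration, not the Adaptive Learning pipeline's code: the Transition fields and the prefill helper are hypothetical, and the capacity simply reuses the buffer_size from the hyperparameter table.

```python
from collections import deque, namedtuple

Transition = namedtuple("Transition", "obs action reward next_obs done")


def prefill(buffer, real_experiences):
    """Append logged real-world transitions before fine-tuning starts."""
    for experience in real_experiences:
        buffer.append(Transition(*experience))
    return buffer


# Bounded FIFO buffer: once full, the oldest transitions are dropped.
replay_buffer = deque(maxlen=50_000)

# Made-up logged transitions: (obs, action, reward, next_obs, done)
logged = [
    ((0.1, 0.2), 3, 1.0, (0.2, 0.3), False),
    ((0.2, 0.3), 1, 0.5, (0.3, 0.1), False),
    ((0.3, 0.1), 0, -0.2, (0.0, 0.0), True),
]
prefill(replay_buffer, logged)
```

With SB3, the analogous step would presumably add each transition to the model's own replay buffer before calling learn() again, so that early gradient updates sample real experiences rather than simulated ones.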

Reproducing a Training Run

cd lib/dqn_sb3
python train.py

Output:

./models/best_model/best_model.zip — best checkpoint by evaluation reward
./models/final_model.zip — final checkpoint
./models/checkpoints/ — periodic snapshots
./tensorboard_logs/ — TensorBoard event files
./models/training_progress_sb3.png — reward/length plot

To monitor live:

tensorboard --logdir=./tensorboard_logs
