Feed Right Docs

DQN Agent Model

Deep Q-Network architecture, Markov Decision Process formulation, and how Stable-Baselines 3 powers the FeedRight feeding agent.

FeedRight's core decision-maker is a Deep Q-Network (DQN) agent built on top of Stable-Baselines 3 (SB3). This page explains the reinforcement learning formulation, the neural network that drives action selection, and how the agent operates at inference time inside CageFeedingAgent.

Why DQN?

The feeding problem is a sequential decision task with:

  • A discrete, small action space (6 feed amounts): DQN is built for discrete actions and outputs one Q-value per action. Actor-critic methods also handle discrete actions, but they add a separate policy network that a six-action problem does not need.
  • A rich, continuous observation (44 features): the MLP policy generalises across a high-dimensional continuous state, which tabular Q-learning could only handle by discretising the state space.
  • Delayed consequences — over-feeding now causes waste and reduces appetite later. The discount factor (γ = 0.99) ensures the agent accounts for long-horizon effects.
  • Safety-critical constraints — DQN's deterministic greedy policy at inference time is predictable and easy to override with hard safety rules, unlike stochastic policies.

MDP Formulation

The feeding problem is modelled as a finite-horizon Markov Decision Process ⟨S, A, T, R, γ⟩:

| Element | Definition |
|---|---|
| S (State) | 44-dimensional vector normalised to [0, 1]. Six feature groups: Environmental (13), Biomass (6), Video (5), Feeding History (9), Performance (5), Cage (6). See Input Features. |
| A (Action) | Discrete(6), mapped to concrete feed amounts via action_to_kg: 0 kg, 0.5 kg, 1.0 kg, 2.0 kg, 3.5 kg, 5.0 kg. |
| T (Transition) | Simulated during training by FishFeedingEnv._transition_state(). In production, the real environment provides the next state. |
| R (Reward) | Multi-factor shaped reward combining consumption efficiency, environmental safety, timing, and economic waste. See Reward System. |
| γ (Discount) | 0.99. The agent values future rewards almost as much as immediate ones, encouraging long-horizon planning. |

An episode corresponds to one feeding day, terminating after max_feeds_per_episode (default 6) feed actions.
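
To make the effect of γ concrete, the following sketch (illustrative only, not taken from the FeedRight codebase) computes the weight that each of the six feeds in a day receives in the discounted return. The helper names are hypothetical.

```python
GAMMA = 0.99
MAX_FEEDS_PER_EPISODE = 6


def discount_weights(gamma: float, horizon: int) -> list[float]:
    """Weight gamma**t applied to the reward at step t of an episode."""
    return [gamma ** t for t in range(horizon)]


def discounted_return(rewards: list[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one episode."""
    weights = discount_weights(gamma, len(rewards))
    return sum(w * r for w, r in zip(weights, rewards))


weights = discount_weights(GAMMA, MAX_FEEDS_PER_EPISODE)
print([round(w, 3) for w in weights])  # [1.0, 0.99, 0.98, 0.97, 0.961, 0.951]
```

With a six-step episode, the last feed of the day still carries about 95% of the weight of the first, so waste caused late in the day is barely discounted.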

Deep Q-Network: How It Works

A Q-function Q(s, a) estimates the expected cumulative discounted reward of taking action a in state s and then following the optimal policy thereafter. DQN approximates this function with a neural network.

Bellman Target (Double DQN)

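Standard DQN tends to overestimate Q-values because the same network both selects and evaluates the best next action. Note that the stock DQN class in SB3 implements vanilla DQN, whose target is r + γ · max_a' Q_target(s', a'); it does not enable Double DQN by default, so the target below applies only if the training step is overridden accordingly. Double DQN decouples the two roles: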

Q(s, a) ← r + γ · Q_target(s', argmax_a' Q_online(s', a'))
  • Q_online (the policy network) selects the greedy next action.
  • Q_target (a periodically-synced copy) evaluates that action's value.

The target network is hard-copied from the online network every target_update_interval = 1 000 steps. SB3 counts this interval in environment steps, not gradient steps.
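
The difference between the vanilla and Double DQN targets can be sketched for a single transition in plain Python. The Q-values below are made up for illustration, and the function names are hypothetical rather than part of SB3 or FeedRight.

```python
def vanilla_dqn_target(r, done, q_target_next, gamma=0.99):
    """r + gamma * max_a' Q_target(s', a'); no bootstrap on terminal steps."""
    if done:
        return r
    return r + gamma * max(q_target_next)


def double_dqn_target(r, done, q_online_next, q_target_next, gamma=0.99):
    """r + gamma * Q_target(s', argmax_a' Q_online(s', a'))."""
    if done:
        return r
    # The online network picks the action, the target network scores it.
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + gamma * q_target_next[best_action]


# Made-up Q-values for the six feed actions in the next state s'.
q_online_next = [0.2, 0.9, 1.4, 1.1, 0.3, -0.5]  # online net prefers action 2
q_target_next = [0.1, 0.8, 1.0, 1.6, 0.2, -0.4]  # target net prefers action 3

print(vanilla_dqn_target(1.0, False, q_target_next))
print(double_dqn_target(1.0, False, q_online_next, q_target_next))
```

In this example the vanilla target bootstraps from the target network's own maximum (action 3), while the Double DQN target uses the target network's value for the action the online network chose (action 2), which gives a lower estimate.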

Network Architecture

The policy is a fully-connected MLP (Multi-Layer Perceptron):

Input (44) → 512 → ReLU → 256 → ReLU → 128 → ReLU → 64 → ReLU → Output (6)

| Layer | Width | Activation | Purpose |
|---|---|---|---|
| Input | 44 | (none) | Normalised state vector |
| Hidden 1 | 512 | ReLU | High-capacity feature extraction |
| Hidden 2 | 256 | ReLU | Cross-feature interaction |
| Hidden 3 | 128 | ReLU | Compression toward decision |
| Hidden 4 | 64 | ReLU | Final representation |
| Output | 6 | Linear | One Q-value per action |

This yields roughly 196K trainable parameters in the online Q-network (195 910, counting weights and biases), about 3.7x the earlier custom implementation ([256, 128, 64], ~53K params for the same 44 inputs and 6 outputs). The extra capacity helps model the complex interactions between environmental, biomass, video, and feeding-history features.
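
The parameter count can be checked with a few lines of Python. This sketch assumes plain fully-connected layers with biases and counts only the online Q-network (the target network is a copy and is not trained directly); the helper name is hypothetical.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully-connected net with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


current = mlp_param_count([44, 512, 256, 128, 64, 6])
earlier = mlp_param_count([44, 256, 128, 64, 6])
print(current, earlier, round(current / earlier, 2))  # 195910 53062 3.69
```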

Exploration: ε-Greedy

During training the agent follows a linear ε-greedy schedule:

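| Phase | Steps | ε | Behaviour |
|---|---|---|---|
| Warm-up | 0 – 1 000 | 1.0 | Pure random actions; no gradient updates. Fills replay buffer with diverse experiences. |
| Decay | 1 000 – 30 000 | 1.0 → 0.05 | Linear decay. Agent gradually shifts from exploration to exploitation. |
| Exploitation | 30 000 – 100 000 | 0.05 | Mostly greedy with 5% random exploration to avoid local optima. |

If the stock SB3 schedule is used, ε is annealed linearly from step 0 (exploration_fraction = 0.3), so it has already fallen to about 0.97 when the warm-up ends. Actions during the warm-up are sampled uniformly at random regardless of ε.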

At inference time (production), the agent always acts greedily — model.predict(obs, deterministic=True) — so the policy is fully deterministic and predictable.
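
A minimal sketch of the schedule and of ε-greedy action selection, assuming the decay is measured from step 0 over the first 30% of the 100 000 training steps. The function names are hypothetical; SB3 handles this internally.

```python
import random


def linear_epsilon(step, total_steps=100_000, start=1.0, end=0.05, fraction=0.3):
    """Linearly anneal epsilon from start to end over the first `fraction` of training."""
    progress = step / total_steps
    if progress >= fraction:
        return end
    return start + progress * (end - start) / fraction


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
for step in (0, 1_000, 15_000, 30_000, 100_000):
    print(step, round(linear_epsilon(step), 3))

# With epsilon = 0 the choice is purely greedy, as at inference time.
print(epsilon_greedy([0.1, 0.4, 1.2, 0.9, 0.0, -0.3], 0.0, rng))  # 2
```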

Experience Replay

Every transition (s, a, r, s', done) is stored in a replay buffer of 50 000 transitions. Training samples uniformly-random mini-batches of size 64 to break temporal correlations and stabilise learning.
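
The idea behind uniform experience replay can be sketched with the standard library alone. This is an illustration of the technique, not SB3's ReplayBuffer class, which stores transitions in preallocated arrays.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity=50_000, seed=None):
        # deque(maxlen=...) silently drops the oldest transition when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        """Draw batch_size distinct transitions uniformly at random."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add([float(i)], 0, 0.0, [float(i + 1)], False)
print(len(buffer))  # 3
```

Because consecutive transitions from one feeding day are highly correlated, sampling across many stored days gives the optimiser batches that look closer to independent samples.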

Why Stable-Baselines 3?

| Concern | SB3 Advantage |
|---|---|
| Correctness | Battle-tested implementations of DQN, target networks, and replay buffers, avoiding subtle bugs common in hand-rolled RL code. |
| Serialisation | model.save() / DQN.load() handle weights, optimiser state, and hyperparameters in a single .zip file. The replay buffer is not included; it is saved and restored separately with save_replay_buffer() / load_replay_buffer(). |
| Logging | Native TensorBoard integration and Monitor wrapper for automatic episode statistics. |
| Evaluation | evaluate_policy() provides standardised mean ± std reward over N deterministic episodes. |
| Callbacks | EvalCallback, CheckpointCallback, and custom callbacks plug directly into the training loop for checkpointing and early stopping. |
| Device support | device="auto" selects CUDA (NVIDIA) when available and otherwise falls back to CPU. MPS (Apple Silicon) may need to be requested explicitly, depending on the SB3 version. |
| Gymnasium compliance | Enforces the standard gym.Env interface, making the environment reusable and testable independently. |

Gymnasium Environment: FishFeedingEnv

The environment (lib/dqn_sb3/env.py) wraps the feeding problem as a standard Gymnasium gym.Env:

observation_space = Box(low=0, high=1, shape=(44,), dtype=float32)
action_space      = Discrete(6)

Key methods:

| Method | Role |
|---|---|
| reset() | Samples a random initial state within realistic feature bounds. Randomises time_since_last_feed (3–8 h) and sets feeds_today = 0. |
| step(action) | Maps discrete action → kg via action_to_kg, calls _transition_state() then _calculate_reward(), and returns (obs, reward, terminated, truncated, info). |
| _normalize_state() | Min-max scales each of the 44 raw features to [0, 1] using pre-defined feature ranges. |
| _transition_state() | Updates feeding history, hunger, and motion; adds stochastic environmental noise (±0.02 DO, ±0.01 temp, ±0.05 motion). |
| _calculate_reward() | Multi-factor additive reward detailed in Reward System. |
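
The min-max scaling step can be illustrated as follows. The feature ranges and units are hypothetical placeholders, and clipping out-of-range values to [0, 1] is an assumption; the actual ranges live in the environment code.

```python
def min_max_normalize(raw, ranges):
    """Scale each raw feature into [0, 1] using its (low, high) range."""
    scaled = []
    for value, (low, high) in zip(raw, ranges):
        x = (value - low) / (high - low)
        scaled.append(min(1.0, max(0.0, x)))  # clip outliers (assumed behaviour)
    return scaled


# Hypothetical ranges for three of the 44 features.
ranges = [
    (0.0, 15.0),   # dissolved oxygen
    (10.0, 40.0),  # temperature
    (0.0, 100.0),  # motion intensity
]
print(min_max_normalize([7.2, 28.5, 75.0], ranges))
```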

Action Mapping

| Index | Amount | Label |
|---|---|---|
| 0 | 0.0 kg | No feed (wait) |
| 1 | 0.5 kg | Small feed |
| 2 | 1.0 kg | Light feed |
| 3 | 2.0 kg | Medium feed |
| 4 | 3.5 kg | Heavy feed |
| 5 | 5.0 kg | Maximum feed |

Production Agent: CageFeedingAgent

CageFeedingAgent (lib/dqn_sb3/agent.py) wraps the trained DQN model for real-time use. Each cage runs its own agent instance, identified by cage_id.

Inference Pipeline

                    ┌──────────────────────┐
   Dict or array ──►│  _dict_to_array()    │  Convert named features
                    │  _normalize_state()  │  to normalised 44-d vector
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  DQN.predict(obs,    │  Forward pass through
                    │    deterministic=    │  MLP, returns argmax
                    │    True)             │  action index (0–5)
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  action_to_kg[idx]   │  Map index → kg
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  apply_safety_       │  Hard-coded overrides
                    │  constraints()       │  may block or reduce
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  Return decision:    │
                    │  { feed_amount,      │
                    │    is_safe,          │
                    │    confidence,       │
                    │    raw_prediction,   │
                    │    action }          │
                    └──────────────────────┘
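
The pipeline above can be sketched end to end with a stub in place of the trained model. Everything here is illustrative: StubModel, decide(), the dissolved-oxygen rule and its 4.0 threshold, and the reading of is_safe as "not overridden" are assumptions, not the CageFeedingAgent implementation. The confidence field is omitted.

```python
ACTION_TO_KG = [0.0, 0.5, 1.0, 2.0, 3.5, 5.0]  # from the action mapping table


class StubModel:
    """Stand-in for the trained DQN; always returns a fixed action index."""

    def __init__(self, action):
        self.action = action

    def predict(self, obs, deterministic=True):
        # SB3's predict() also returns an (action, state) pair.
        return self.action, None


def apply_safety_constraints(amount_kg, state):
    """Hypothetical hard rule: block feeding when dissolved oxygen is low."""
    if state["dissolved_oxygen"] < 4.0:  # threshold is an assumption
        return 0.0
    return amount_kg


def decide(model, state, normalize):
    obs = normalize(state)                              # dict -> 44-d vector
    action, _ = model.predict(obs, deterministic=True)  # greedy action index
    raw_kg = ACTION_TO_KG[int(action)]                  # index -> kg
    safe_kg = apply_safety_constraints(raw_kg, state)   # hard overrides
    return {
        "feed_amount": safe_kg,
        "raw_prediction": raw_kg,
        "action": int(action),
        "is_safe": safe_kg == raw_kg,  # assumed meaning: no override applied
    }


def to_zeros(state):
    """Placeholder normaliser that ignores the input."""
    return [0.0] * 44


print(decide(StubModel(4), {"dissolved_oxygen": 3.0}, to_zeros))
```

Keeping the safety layer outside the model means a hard rule can veto any action without retraining, and the raw prediction stays available for auditing.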

Usage

from lib.dqn_sb3 import CageFeedingAgent

agent = CageFeedingAgent(
    cage_id="CAGE-001",
    model_path="./models/best_model/best_model.zip",
    use_safety_constraints=True,
    enable_adaptive_learning=True,
)

decision = agent.decide_feeding({
    "dissolved_oxygen": 7.2,
    "temperature": 28.5,
    "motion_intensity": 75.0,
    "feeding_frenzy_score": 0.8,
    "time_since_last_feed": 4.5,
    "feeds_today": 2,
    "wind_speed": 5.0,
    "feed_waste_rate": 0.12,
})

print(decision["feed_amount"])   # e.g. 1.0
print(decision["confidence"])    # e.g. 0.85
print(decision["is_safe"])       # True

Missing keys in the state dictionary are filled with sensible defaults, so the agent can run even when some sensors are temporarily offline.

Confidence Score

The confidence value reflects two factors:

  1. Model certainty — how close the predicted amount is to the maximum feed capacity (larger absolute predictions → higher confidence).
  2. Safety penalty — a 0.3 deduction if the safety layer overrode the model's original recommendation.

Adaptive Learning Loop

After each feeding event, record_outcome() stores the transition (s, a, r, s') in an SQLite ExperienceDatabase. The Adaptive Learning service periodically replays these real-world experiences back into the model's replay buffer and continues training, bridging the sim-to-real gap.
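
One way such an experience store could look is sketched below with Python's built-in sqlite3 module. The schema, column names, and function signatures are hypothetical; the real ExperienceDatabase is not shown on this page.

```python
import json
import sqlite3


def open_experience_db(path=":memory:"):
    """Open (or create) a database with a single experiences table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS experiences (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               cage_id TEXT NOT NULL,
               state TEXT NOT NULL,
               action INTEGER NOT NULL,
               reward REAL NOT NULL,
               next_state TEXT NOT NULL
           )"""
    )
    return conn


def record_outcome(conn, cage_id, state, action, reward, next_state):
    """Persist one (s, a, r, s') transition; vectors are stored as JSON."""
    conn.execute(
        "INSERT INTO experiences (cage_id, state, action, reward, next_state) "
        "VALUES (?, ?, ?, ?, ?)",
        (cage_id, json.dumps(state), action, reward, json.dumps(next_state)),
    )
    conn.commit()


def load_experiences(conn, cage_id):
    """Return all stored transitions for one cage, oldest first."""
    rows = conn.execute(
        "SELECT state, action, reward, next_state FROM experiences "
        "WHERE cage_id = ? ORDER BY id",
        (cage_id,),
    )
    return [(json.loads(s), a, r, json.loads(ns)) for s, a, r, ns in rows]


conn = open_experience_db()
record_outcome(conn, "CAGE-001", [0.48, 0.62], 2, 0.7, [0.47, 0.62])
print(load_experiences(conn, "CAGE-001"))
```

Transitions loaded this way could then be added to the model's replay buffer before a further training run.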

Key Hyperparameters at a Glance

| Parameter | Value |
|---|---|
| Algorithm | DQN (SB3; the stock class is vanilla DQN, Double DQN needs a custom training step) |
| Policy | MlpPolicy [512, 256, 128, 64] |
| Observation dim | 44 (continuous, normalised 0–1) |
| Action space | Discrete(6) |
| γ (discount) | 0.99 |
| Learning rate | 1e-4 (Adam) |
| Replay buffer | 50 000 transitions |
| Batch size | 64 |
| Target update | every 1 000 environment steps |
| ε schedule | 1.0 → 0.05 over 30% of training |
| Training length | 100 000 timesteps |
| Device | Auto (MPS / CUDA / CPU) |
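
As a rough guide, the values in this table map onto SB3's DQN constructor arguments as sketched below. learning_starts = 1 000 is taken from the warm-up phase described earlier; build_and_train() is a hypothetical helper, and the project's actual training script may differ.

```python
# Keyword arguments for stable_baselines3.DQN, taken from the table above.
DQN_KWARGS = dict(
    policy="MlpPolicy",
    learning_rate=1e-4,
    buffer_size=50_000,
    learning_starts=1_000,
    batch_size=64,
    gamma=0.99,
    target_update_interval=1_000,
    exploration_fraction=0.3,
    exploration_initial_eps=1.0,
    exploration_final_eps=0.05,
    policy_kwargs=dict(net_arch=[512, 256, 128, 64]),
    device="auto",
)
TOTAL_TIMESTEPS = 100_000


def build_and_train(env):
    """Create and train a DQN model on a Gymnasium environment."""
    # Imported lazily so this module loads without stable-baselines3 installed.
    from stable_baselines3 import DQN

    model = DQN(env=env, **DQN_KWARGS)
    model.learn(total_timesteps=TOTAL_TIMESTEPS)
    return model
```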

For full training pipeline details — callbacks, checkpointing, evaluation, and reproduction instructions — see Training Pipeline.
