DQN Agent Model

Deep Q-Network architecture, Markov Decision Process formulation, and how Stable-Baselines 3 powers the FeedRight feeding agent.

FeedRight's core decision-maker is a Deep Q-Network (DQN) agent built on top of Stable-Baselines 3 (SB3). This page explains the reinforcement learning formulation, the neural network that drives action selection, and how the agent operates at inference time inside CageFeedingAgent.

Why DQN?

The feeding problem is a sequential decision task with:

A discrete, small action space (6 feed amounts): DQN handles discrete actions natively, unlike continuous-control methods such as DDPG or SAC, which assume continuous action spaces.
A rich, continuous observation (44 features): the MLP policy generalises across a high-dimensional continuous state, which tabular Q-learning cannot represent without coarse discretisation.
Delayed consequences — over-feeding now causes waste and reduces appetite later. The discount factor (γ = 0.99) ensures the agent accounts for long-horizon effects.
Safety-critical constraints — DQN's deterministic greedy policy at inference time is predictable and easy to override with hard safety rules, unlike stochastic policies.

MDP Formulation

The feeding problem is modelled as a finite-horizon Markov Decision Process ⟨S, A, T, R, γ⟩:

Element	Definition
S (State)	44-dimensional vector normalised to [0, 1]. Six feature groups: Environmental (13), Biomass (6), Video (5), Feeding History (9), Performance (5), Cage (6). See Input Features.
A (Action)	`Discrete(6)` — mapped to concrete feed amounts via `action_to_kg`: 0 kg, 0.5 kg, 1.0 kg, 2.0 kg, 3.5 kg, 5.0 kg.
T (Transition)	Simulated during training by `FishFeedingEnv._transition_state()`. In production, the real environment provides the next state.
R (Reward)	Multi-factor shaped reward combining consumption efficiency, environmental safety, timing, and economic waste. See Reward System.
γ (Discount)	0.99 — the agent values future rewards almost as much as immediate ones, encouraging long-horizon planning.

An episode corresponds to one feeding day, terminating after max_feeds_per_episode (default 6) feed actions.
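To make the role of γ concrete, the following is a minimal sketch of the discounted return over one six-feed episode. The reward values are made up for illustration, and `discounted_return` is a hypothetical helper, not part of the FeedRight code.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one episode, accumulated back to front."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Illustrative rewards for the six feed actions of one day (made-up values).
episode_rewards = [1.0, 0.5, -0.2, 0.8, 0.0, 1.2]
print(discounted_return(episode_rewards))

# Discount weight applied to the sixth and final reward of the episode.
print(0.99 ** 5)
```

With γ = 0.99 the last reward of a six-step episode still carries about 95% of its face value, which is why late-day consequences of early over-feeding remain visible to the agent.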

Deep Q-Network: How It Works

A Q-function Q(s, a) estimates the expected cumulative discounted reward of taking action a in state s and then following the optimal policy thereafter. DQN approximates this function with a neural network.

Bellman Target (Double DQN)

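Standard DQN tends to overestimate Q-values because the same network both selects and evaluates the best next action. The stock `DQN` class in SB3 implements this vanilla form and does not enable Double DQN by default, so the target below assumes a customised training step. Double DQN decouples the two roles: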

Q(s, a) ← r + γ · Q_target(s', argmax_a' Q_online(s', a'))

Q_online (the policy network) selects the greedy next action.
Q_target (a periodically-synced copy) evaluates that action's value.

The target network is hard-copied from the online network every target_update_interval = 1 000 environment steps (SB3 counts this interval in environment steps, not gradient steps).
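The Double DQN target above can be sketched in plain Python for a single transition. The function names and the Q-values are illustrative assumptions, and the `done` handling (no bootstrapping from a terminal state) is the usual convention rather than something stated on this page.

```python
def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Return r + gamma * Q_target(s', argmax_a' Q_online(s', a')).

    next_q_online / next_q_target hold one Q-value per action for s'.
    When s' is terminal (done=True) nothing is bootstrapped.
    """
    if done:
        return reward
    # The online network selects the greedy next action ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... and the target network evaluates that action.
    return reward + gamma * next_q_target[best_action]


def vanilla_dqn_target(reward, next_q_target, done, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


# Made-up Q-values for the six feed actions in the next state.
q_online = [0.1, 0.4, 0.9, 0.3, 0.2, 0.0]
q_target = [0.2, 0.5, 0.6, 1.1, 0.1, 0.0]
print(double_dqn_target(1.0, q_online, q_target, done=False))
print(vanilla_dqn_target(1.0, q_target, done=False))
```

In this example the vanilla target is larger because it takes the maximum of the target network's own estimates, which is the overestimation that Double DQN is meant to dampen.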

Network Architecture

The policy is a fully-connected MLP (Multi-Layer Perceptron):

Input (44) → 512 → ReLU → 256 → ReLU → 128 → ReLU → 64 → ReLU → Output (6)

Layer	Width	Activation	Purpose
Input	44	—	Normalised state vector
Hidden 1	512	ReLU	High-capacity feature extraction
Hidden 2	256	ReLU	Cross-feature interaction
Hidden 3	128	ReLU	Compression toward decision
Hidden 4	64	ReLU	Final representation
Output	6	Linear	One Q-value per action

This yields approximately 196K trainable parameters per Q-network (195,910 counting weights and biases), close to a 4x increase over the earlier custom implementation ([256, 128, 64], ~52K params). The extra capacity helps model the complex interactions between environmental, biomass, video, and feeding-history features.
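The parameter count follows directly from the layer widths listed above. The sketch below counts weights and biases for a fully-connected network; `mlp_param_count` is a hypothetical helper used only for this check.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully-connected network.

    Each layer contributes n_in * n_out weights plus n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Input (44) -> 512 -> 256 -> 128 -> 64 -> Output (6)
print(mlp_param_count([44, 512, 256, 128, 64, 6]))
```

The same function can be applied to any other list of widths to compare candidate architectures. Note that the target network holds a second copy of these weights, which is not trained by gradient descent.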

Exploration: ε-Greedy

During training the agent follows a linear ε-greedy schedule:

Phase	Steps	ε	Behaviour
Warm-up	0 – 1 000	1.0	Pure random actions; no gradient updates. Fills replay buffer with diverse experiences.
Decay	1 000 – 30 000	1.0 → 0.05	Linear decay. Agent gradually shifts from exploration to exploitation.
Exploitation	30 000 – 100 000	0.05	Mostly greedy with 5% random exploration to avoid local optima.

At inference time (production), the agent always acts greedily — model.predict(obs, deterministic=True) — so the policy is fully deterministic and predictable.
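The three training phases tabulated above can be written as a single function of the step counter. This is a sketch of the schedule as described on this page; `epsilon_at` and its parameter names are assumptions. SB3's built-in linear schedule measures progress from step 0 rather than from the end of warm-up, so values early in a real run may differ slightly.

```python
def epsilon_at(step, warmup_steps=1_000, decay_end=30_000,
               eps_start=1.0, eps_final=0.05):
    """Exploration rate at a given training step."""
    if step < warmup_steps:
        return eps_start          # warm-up: pure random actions
    if step >= decay_end:
        return eps_final          # exploitation phase
    # Linear interpolation between the end of warm-up and the end of decay.
    progress = (step - warmup_steps) / (decay_end - warmup_steps)
    return eps_start + progress * (eps_final - eps_start)


for step in (0, 1_000, 15_500, 30_000, 100_000):
    print(step, round(epsilon_at(step), 3))
```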

Experience Replay

Every transition (s, a, r, s', done) is stored in a replay buffer of 50 000 transitions. Training samples uniformly-random mini-batches of size 64 to break temporal correlations and stabilise learning.
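As an illustration of the mechanism, here is a minimal replay buffer built on the standard library. It is a simplified stand-in for SB3's array-based buffer, not the code FeedRight runs; the class name and the small capacity in the demo are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO buffer with uniform random sampling."""

    def __init__(self, capacity=50_000, seed=None):
        # A bounded deque drops the oldest transition once capacity is reached.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Draw batch_size distinct transitions uniformly at random.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=100, seed=0)
for t in range(250):
    buffer.add(state=[t], action=t % 6, reward=0.0, next_state=[t + 1], done=False)

batch = buffer.sample(batch_size=64)
print(len(buffer), len(batch))
```

Because consecutive transitions from one feeding day are highly correlated, drawing the mini-batch from the whole buffer rather than from the latest steps gives the optimiser a more varied sample.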

Why Stable-Baselines 3?

Concern	SB3 Advantage
Correctness	Battle-tested implementations of DQN, target networks, and replay buffers, avoiding subtle bugs common in hand-rolled RL code.
Serialisation	`model.save()` / `DQN.load()` handles network weights, optimiser state, and hyperparameters in a single `.zip` file; the replay buffer is saved separately with `save_replay_buffer()` / `load_replay_buffer()`.
Logging	Native TensorBoard integration and `Monitor` wrapper for automatic episode statistics.
Evaluation	`evaluate_policy()` provides standardised mean ± std reward over N deterministic episodes.
Callbacks	`EvalCallback`, `CheckpointCallback`, and custom callbacks plug directly into the training loop for checkpointing and early stopping.
Device support	SB3's `device="auto"` selects CUDA (NVIDIA) when available and falls back to CPU; MPS (Apple Silicon) can be requested explicitly with `device="mps"`.
Gymnasium compliance	Enforces the standard `gym.Env` interface, making the environment reusable and testable independently.

Gymnasium Environment: `FishFeedingEnv`

The environment (lib/dqn_sb3/env.py) wraps the feeding problem as a standard Gymnasium gym.Env:

observation_space = Box(low=0, high=1, shape=(44,), dtype=float32)
action_space      = Discrete(6)

Key methods:

Method	Role
`reset()`	Samples a random initial state within realistic feature bounds. Randomises `time_since_last_feed` (3–8 h) and sets `feeds_today = 0`.
`step(action)`	Maps discrete action → kg via `action_to_kg`, calls `_transition_state()` then `_calculate_reward()`, and returns `(obs, reward, terminated, truncated, info)`.
`_normalize_state()`	Min-max scales each of the 44 raw features to [0, 1] using pre-defined feature ranges.
`_transition_state()`	Updates feeding history, hunger, and motion; adds stochastic environmental noise (±0.02 DO, ±0.01 temp, ±0.05 motion).
`_calculate_reward()`	Multi-factor additive reward detailed in Reward System.
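The min-max scaling performed by `_normalize_state()` can be sketched as follows. The feature ranges shown are hypothetical placeholders, and clipping out-of-range values to [0, 1] is an assumption made here so that the result always fits the declared observation space.

```python
def min_max_normalize(raw, ranges):
    """Scale each raw feature to [0, 1] using its (low, high) range."""
    normalised = []
    for value, (low, high) in zip(raw, ranges):
        scaled = (value - low) / (high - low)
        # Clip so sensor outliers cannot leave the Box(0, 1) bounds (assumption).
        normalised.append(min(1.0, max(0.0, scaled)))
    return normalised


# Hypothetical ranges for dissolved oxygen, temperature, and feeds today.
example_ranges = [(0.0, 15.0), (10.0, 35.0), (0.0, 6.0)]
print(min_max_normalize([7.2, 28.5, 2], example_ranges))
```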

Action Mapping

Index	Amount	Label
0	0.0 kg	No feed (wait)
1	0.5 kg	Small feed
2	1.0 kg	Light feed
3	2.0 kg	Medium feed
4	3.5 kg	Heavy feed
5	5.0 kg	Maximum feed

Production Agent: `CageFeedingAgent`

CageFeedingAgent (lib/dqn_sb3/agent.py) wraps the trained DQN model for real-time use. Each cage runs its own agent instance, identified by cage_id.

Inference Pipeline

                    ┌──────────────────────┐
   Dict or array ──►│  _dict_to_array()    │  Convert named features
                    │  _normalize_state()  │  to normalised 44-d vector
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  DQN.predict(obs,    │  Forward pass through
                    │    deterministic=    │  MLP, returns argmax
                    │    True)             │  action index (0–5)
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  action_to_kg[idx]   │  Map index → kg
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  apply_safety_       │  Hard-coded overrides
                    │  constraints()       │  may block or reduce
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │  Return decision:    │
                    │  { feed_amount,      │
                    │    is_safe,          │
                    │    confidence,       │
                    │    raw_prediction,   │
                    │    action }          │
                    └──────────────────────┘
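The flow in the diagram can be approximated with a small function that takes the policy and the safety layer as callables. Everything here is a sketch under stated assumptions: `decide`, `stub_predict`, and `cap_when_low` are hypothetical names, the safety rule is made up, `is_safe` is assumed to be False whenever the safety layer changes the amount, and the confidence field is left out.

```python
ACTION_TO_KG = [0.0, 0.5, 1.0, 2.0, 3.5, 5.0]  # index -> kg, from the action table


def decide(obs, predict, safety_check):
    """Run one decision: policy -> kg mapping -> safety override.

    predict:      callable returning an action index for a normalised observation
                  (stands in for the action part of DQN.predict(obs, deterministic=True)).
    safety_check: callable (kg, obs) -> kg that may reduce or block the feed.
    """
    action = int(predict(obs))
    raw_kg = ACTION_TO_KG[action]
    safe_kg = safety_check(raw_kg, obs)
    return {
        "action": action,
        "raw_prediction": raw_kg,
        "feed_amount": safe_kg,
        "is_safe": safe_kg == raw_kg,  # assumption: False when overridden
    }


def stub_predict(obs):
    """Stub policy that always proposes a heavy feed (index 4)."""
    return 4


def cap_when_low(kg, obs):
    """Made-up rule: cap the feed at 1.0 kg when feature 0 is below 0.3."""
    return min(kg, 1.0) if obs[0] < 0.3 else kg


print(decide([0.2] + [0.5] * 43, stub_predict, cap_when_low))
print(decide([0.6] + [0.5] * 43, stub_predict, cap_when_low))
```

Keeping the safety layer as a separate step after the policy means the hard rules apply regardless of what the network outputs, and the raw prediction stays available for auditing.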

Usage

from lib.dqn_sb3 import CageFeedingAgent

agent = CageFeedingAgent(
    cage_id="CAGE-001",
    model_path="./models/best_model/best_model.zip",
    use_safety_constraints=True,
    enable_adaptive_learning=True,
)

decision = agent.decide_feeding({
    "dissolved_oxygen": 7.2,
    "temperature": 28.5,
    "motion_intensity": 75.0,
    "feeding_frenzy_score": 0.8,
    "time_since_last_feed": 4.5,
    "feeds_today": 2,
    "wind_speed": 5.0,
    "feed_waste_rate": 0.12,
})

print(decision["feed_amount"])   # e.g. 1.0
print(decision["confidence"])    # e.g. 0.85
print(decision["is_safe"])       # True

Missing keys in the state dictionary are filled with sensible defaults, so the agent can run even when some sensors are temporarily offline.

Confidence Score

The confidence value reflects two factors:

Model certainty — how close the predicted amount is to the maximum feed capacity (larger absolute predictions → higher confidence).
Safety penalty — a 0.3 deduction if the safety layer overrode the model's original recommendation.

Adaptive Learning Loop

After each feeding event, record_outcome() stores the transition (s, a, r, s') in an SQLite ExperienceDatabase. The Adaptive Learning service periodically replays these real-world experiences back into the model's replay buffer and continues training, bridging the sim-to-real gap.
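A minimal sketch of such an experience store, using Python's built-in `sqlite3` module, is shown below. The class name `ExperienceStore`, the table schema, and the JSON encoding of states are all assumptions for illustration; the actual `ExperienceDatabase` may be organised differently.

```python
import json
import sqlite3


class ExperienceStore:
    """Minimal SQLite-backed log of (s, a, r, s') transitions."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS experiences ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " cage_id TEXT NOT NULL,"
            " state TEXT NOT NULL,"       # JSON-encoded state vector
            " action INTEGER NOT NULL,"
            " reward REAL NOT NULL,"
            " next_state TEXT NOT NULL)"
        )

    def record(self, cage_id, state, action, reward, next_state):
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO experiences (cage_id, state, action, reward, next_state)"
                " VALUES (?, ?, ?, ?, ?)",
                (cage_id, json.dumps(state), action, reward, json.dumps(next_state)),
            )

    def load_all(self, cage_id):
        rows = self._conn.execute(
            "SELECT state, action, reward, next_state FROM experiences"
            " WHERE cage_id = ? ORDER BY id",
            (cage_id,),
        ).fetchall()
        return [(json.loads(s), a, r, json.loads(s2)) for s, a, r, s2 in rows]


store = ExperienceStore()
store.record("CAGE-001", [0.1] * 44, 2, 0.7, [0.2] * 44)
print(store.load_all("CAGE-001")[0][1:3])
```

Transitions loaded this way could then be added to the model's replay buffer before further training; how that replay step is scheduled is outside the scope of this sketch.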

Key Hyperparameters at a Glance

Parameter	Value
Algorithm	Double DQN (SB3)
Policy	`MlpPolicy` [512, 256, 128, 64]
Observation dim	44 (continuous, normalised 0–1)
Action space	`Discrete(6)`
γ (discount)	0.99
Learning rate	1e-4 (Adam)
Replay buffer	50 000 transitions
Batch size	64
Target update	every 1 000 steps
ε schedule	1.0 → 0.05 over 30% of training
Training length	100 000 timesteps
Device	Auto (MPS / CUDA / CPU)
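For reference, the table maps onto SB3's `DQN` constructor arguments roughly as follows. This is a sketch rather than the project's training script: `DQN_KWARGS`, `TOTAL_TIMESTEPS`, and `build_model` are hypothetical names, and `learning_starts=1_000` is inferred from the warm-up phase rather than listed in the table. The SB3 import is deferred so the configuration can be inspected without the library installed.

```python
DQN_KWARGS = dict(
    learning_rate=1e-4,                 # Adam is SB3's default optimiser
    buffer_size=50_000,
    batch_size=64,
    gamma=0.99,
    target_update_interval=1_000,
    exploration_initial_eps=1.0,
    exploration_final_eps=0.05,
    exploration_fraction=0.3,           # 30% of 100 000 steps = 30 000
    learning_starts=1_000,              # assumption: matches the warm-up phase
    policy_kwargs=dict(net_arch=[512, 256, 128, 64]),
    device="auto",
)
TOTAL_TIMESTEPS = 100_000


def build_model(env):
    """Create an SB3 DQN model for a Gymnasium env with a Discrete action space."""
    from stable_baselines3 import DQN  # lazy import; requires stable-baselines3
    return DQN("MlpPolicy", env, **DQN_KWARGS)


# Typical use (not run here):
#   model = build_model(env)
#   model.learn(total_timesteps=TOTAL_TIMESTEPS)
```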

For full training pipeline details — callbacks, checkpointing, evaluation, and reproduction instructions — see Training Pipeline.
