Reward & Penalty System

How the FeedRight DQN agent is rewarded for good feeding decisions and penalised for poor ones.

The reward function is the core learning signal. It is implemented in FishFeedingEnv._calculate_reward() and is designed to drive the agent towards high feed-conversion efficiency, safe environmental conditions, appropriate timing, and economic waste minimisation.

Reward Summary Table

Category	Condition	Reward	Rationale
Appropriate wait	No feed, fish not hungry	+0.5	Conserves feed; avoids unnecessary dispensing
Missed hunger	No feed, but `frenzy > 0.8` and `time_since_feed > 4h`	−1.5	Fish are hungry and have been waiting — missed feeding window
Excellent efficiency	Feed action, consumption rate > 95%	+3.0	Near-perfect match between amount and appetite
Good efficiency	Feed action, consumption rate > 85%	+1.5	Strong efficiency
Acceptable efficiency	Feed action, consumption rate > 70%	+0.5	Adequate but room for improvement
Poor efficiency (waste)	Feed action, consumption rate ≤ 70%	−2.0	Significant feed waste
Active + hungry alignment	`motion > 70` and `frenzy > 0.7`	+1.5	Feeding at the right moment
Overfeeding idle fish	`motion < 40` and `feed > 1.0 kg`	−1.0	Dispensing large amounts to uninterested fish
Critical low O₂	`dissolved_oxygen < 5.0 mg/L`	−4.0	Feeding in dangerously low oxygen
Suboptimal O₂	`dissolved_oxygen < 5.5 mg/L`	−2.0	Feeding in marginal oxygen
Heat stress	`temperature > 30 °C`	−3.0	Fish are heat-stressed; metabolic risk
Warm conditions	`temperature > 29 °C`	−1.0	Elevated but not critical temperature
Too many feeds today	`feeds_today ≥ 5`	−3.0	Over-stimulating the cage
Moderately frequent	`feeds_today ≥ 4`	−1.0	Getting close to daily limit
Good interval	`2.5h < time_since_feed < 5.0h`	+1.5	Ideal inter-feed gap
Too frequent	`time_since_feed < 1.5h`	−2.0	Feeding too soon after last event
Long wait, OK to feed	`time_since_feed > 8h`	+0.5	Fish have been waiting a long time
Excessive amount	`feed > 4.0 kg`	−1.0	Very large single dispensation
Large amount	`feed > 3.0 kg`	−0.5	Above-average dispensation

Detailed Reward Logic

1. No-Feed Decision (action = 0 kg)

When the agent decides not to feed:

if feeding_frenzy_score > 0.8 AND time_since_last_feed > 4h:
    reward -= 1.5   # Penalty: hungry fish were ignored
else:
    reward += 0.5   # Reward: appropriate patience

The key insight is that waiting is not always bad — it is only penalised when the video analysis shows clear hunger signs and enough time has passed since the last feed.
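
As a minimal Python sketch of the rule above (the function name and the `_h` suffix are illustrative and not taken from `FishFeedingEnv`): both comparisons are strict, so a frenzy score of exactly 0.8 or a gap of exactly 4 h still earns the patience reward.

```python
def no_feed_reward(feeding_frenzy_score: float, time_since_last_feed_h: float) -> float:
    """Reward for a no-feed step (action = 0 kg), following the rule above."""
    ignored_hunger = feeding_frenzy_score > 0.8 and time_since_last_feed_h > 4.0
    return -1.5 if ignored_hunger else 0.5


print(no_feed_reward(0.9, 5.0))  # hungry and overdue: penalised
print(no_feed_reward(0.9, 2.0))  # hungry, but fed recently: patience rewarded
print(no_feed_reward(0.3, 6.0))  # overdue, but no hunger signs: patience rewarded
```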

2. Consumption Efficiency (Feed Decisions)

For any feed action ≥ 0.1 kg, the reward is primarily driven by how well the dispensed amount matches the fish's current appetite:

optimal_amount = feeding_frenzy_score × 1.5   (kg)
amount_diff    = |action - optimal_amount|
consumption_rate = max(0.5, 1.0 - amount_diff × 0.3)

The feeding_frenzy_score (from video analysis, range 0–1) is the primary appetite signal. The "optimal" amount scales linearly with hunger, reaching 1.5 kg at maximum frenzy. The further the actual feed amount is from optimal, the lower the estimated consumption rate (down to a floor of 0.5) and the lower the reward.

Consumption Rate	Reward
> 95%	+3.0 (excellent)
> 85%	+1.5 (good)
> 70%	+0.5 (acceptable)
≤ 70%	−2.0 (wasteful)
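
The formula and tiers can be sketched in Python as follows; the function names are hypothetical. Solving the rate formula for the tier boundaries shows that "excellent" requires the feed to be within about 0.17 kg of optimal, "good" within 0.5 kg, and "acceptable" within 1.0 kg. The 0.5 floor is reached once the feed is roughly 1.67 kg or more off target.

```python
def consumption_rate(action_kg: float, feeding_frenzy_score: float) -> float:
    """Estimated fraction of dispensed feed that is eaten (floored at 0.5)."""
    optimal_amount = feeding_frenzy_score * 1.5          # kg
    amount_diff = abs(action_kg - optimal_amount)        # kg away from optimal
    return max(0.5, 1.0 - amount_diff * 0.3)


def efficiency_reward(rate: float) -> float:
    """Map a consumption rate onto the four reward tiers in the table above."""
    if rate > 0.95:
        return 3.0    # excellent
    if rate > 0.85:
        return 1.5    # good
    if rate > 0.70:
        return 0.5    # acceptable
    return -2.0       # wasteful


# With frenzy = 0.8 the optimal amount is 1.2 kg.
for action_kg in (1.2, 1.5, 2.0, 3.5):
    rate = consumption_rate(action_kg, 0.8)
    print(f"{action_kg} kg -> rate {rate:.2f}, reward {efficiency_reward(rate):+.1f}")
```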

3. Activity & Hunger Alignment

The agent receives a bonus for feeding when fish are both active and hungry, and a penalty for overfeeding inactive fish:

if motion_intensity > 70 AND feeding_frenzy_score > 0.7:
    reward += 1.5   # Great timing
elif motion_intensity < 40 AND action > 1.0 kg:
    reward -= 1.0   # Overfeeding uninterested fish
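
A short sketch of this branch, with a hypothetical function name. The scale of motion_intensity is not stated here; the thresholds of 70 and 40 suggest something like 0–100, but that is an assumption. Because the second test is an `elif`, a step can earn the bonus or the penalty, never both, and anything in between contributes nothing.

```python
def alignment_reward(motion_intensity: float, feeding_frenzy_score: float,
                     action_kg: float) -> float:
    """Bonus for feeding active, hungry fish; penalty for overfeeding idle fish."""
    if motion_intensity > 70 and feeding_frenzy_score > 0.7:
        return 1.5     # great timing
    if motion_intensity < 40 and action_kg > 1.0:
        return -1.0    # overfeeding uninterested fish
    return 0.0         # neither condition met


print(alignment_reward(85, 0.9, 1.2))  # active and hungry
print(alignment_reward(30, 0.2, 2.0))  # idle fish, large feed
print(alignment_reward(55, 0.5, 0.8))  # middle ground
```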

4. Environmental Safety Penalties

Feeding during adverse environmental conditions is heavily penalised because uneaten feed degrades water quality further:

if dissolved_oxygen < 5.0:    reward -= 4.0   # Critical
elif dissolved_oxygen < 5.5:  reward -= 2.0   # Marginal

if temperature > 30:  reward -= 3.0   # Severe heat stress
elif temperature > 29: reward -= 1.0   # Elevated temp

These are among the largest single penalties in the system (the −4.0 critical-oxygen penalty is the largest of all), so the agent learns to weigh environmental safety heavily against any benefit from feeding.
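
A small sketch of these two checks, assuming oxygen in mg/L and temperature in °C as in the summary table; the function name is hypothetical. Oxygen and temperature are evaluated independently, so both penalties can apply in the same step.

```python
def environmental_penalty(dissolved_oxygen_mg_l: float, temperature_c: float) -> float:
    """Sum of the oxygen and temperature penalties for a feed action."""
    penalty = 0.0

    # Oxygen: only the most severe matching tier applies.
    if dissolved_oxygen_mg_l < 5.0:
        penalty -= 4.0    # critical
    elif dissolved_oxygen_mg_l < 5.5:
        penalty -= 2.0    # marginal

    # Temperature: checked separately, so it stacks with the oxygen penalty.
    if temperature_c > 30:
        penalty -= 3.0    # severe heat stress
    elif temperature_c > 29:
        penalty -= 1.0    # elevated

    return penalty


print(environmental_penalty(4.8, 31.0))  # critical O2 plus heat stress
print(environmental_penalty(5.2, 29.5))  # marginal O2 plus warm water
print(environmental_penalty(6.0, 28.0))  # safe conditions
```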

5. Feeding Frequency Penalties

To prevent over-stimulation and digestive stress:

if feeds_today >= 5:  reward -= 3.0   # Too many feeds
elif feeds_today >= 4: reward -= 1.0   # Approaching limit

6. Inter-Feed Timing Rewards

The agent is rewarded for maintaining an appropriate gap between consecutive feeds:

if 2.5h < time_since_last_feed < 5.0h:
    reward += 1.5   # Ideal interval (sweet spot)
elif time_since_last_feed < 1.5h:
    reward -= 2.0   # Too frequent
elif time_since_last_feed > 8h:
    reward += 0.5   # Long gap, OK to feed now
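
The same branches in Python, using an illustrative function name. As written, gaps of 1.5–2.5 h and 5.0–8.0 h match none of the three conditions, so they contribute nothing to the reward; the sketch makes that explicit with a final `return 0.0`.

```python
def interval_reward(time_since_last_feed_h: float) -> float:
    """Reward component for the gap (in hours) since the previous feed."""
    t = time_since_last_feed_h
    if 2.5 < t < 5.0:
        return 1.5     # ideal interval
    if t < 1.5:
        return -2.0    # too frequent
    if t > 8.0:
        return 0.5     # long gap, OK to feed now
    return 0.0         # 1.5-2.5 h and 5.0-8.0 h: no timing signal


for hours in (1.0, 2.0, 3.5, 6.0, 9.0):
    print(f"{hours} h -> {interval_reward(hours):+.1f}")
```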

7. Economic Waste Penalty

Large single dispensations are penalised to encourage smaller, more efficient feeds:

if action > 4.0 kg:  reward -= 1.0
elif action > 3.0 kg: reward -= 0.5

Reward Stacking

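All applicable reward components are additive. A single step can trigger multiple bonuses and penalties simultaneously. For example, feeding 3.5 kg to inactive fish during marginal-oxygen conditions could produce:

  -2.0  poor efficiency (waste)
  -1.0  overfeeding inactive fish
  -2.0  suboptimal O₂
  -0.5  large amount
  ─────
  -5.5  net reward

The efficiency term is −2.0 here because optimal_amount never exceeds 1.5 kg: a 3.5 kg feed is at least 2.0 kg off target, which puts the consumption rate at its 0.5 floor.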

This multi-factor stacking makes it hard for the agent to "game" a single metric while ignoring others.
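
To illustrate how the components combine, here is a hedged Python sketch that evaluates every rule from the sections above and returns the triggered components by name. It is not the actual `FishFeedingEnv._calculate_reward()` code: the `FeedState` container and `reward_breakdown` function are hypothetical, and it assumes that any action below 0.1 kg is a no-feed step that skips the feed-only rules.

```python
from dataclasses import dataclass


@dataclass
class FeedState:
    """Hypothetical container for the observations used by the reward rules."""
    feeding_frenzy_score: float  # 0-1, from video analysis
    motion_intensity: float
    dissolved_oxygen: float      # mg/L
    temperature: float           # degrees C
    feeds_today: int
    time_since_last_feed: float  # hours


def reward_breakdown(action_kg: float, s: FeedState) -> dict:
    """Return each triggered component by name; the step reward is their sum."""
    parts = {}

    # Assumption: below 0.1 kg counts as "no feed" and skips the feed rules.
    if action_kg < 0.1:
        ignored_hunger = s.feeding_frenzy_score > 0.8 and s.time_since_last_feed > 4.0
        parts["no_feed"] = -1.5 if ignored_hunger else 0.5
        return parts

    # Consumption efficiency
    amount_diff = abs(action_kg - s.feeding_frenzy_score * 1.5)
    rate = max(0.5, 1.0 - amount_diff * 0.3)
    if rate > 0.95:
        parts["efficiency"] = 3.0
    elif rate > 0.85:
        parts["efficiency"] = 1.5
    elif rate > 0.70:
        parts["efficiency"] = 0.5
    else:
        parts["efficiency"] = -2.0

    # Activity and hunger alignment
    if s.motion_intensity > 70 and s.feeding_frenzy_score > 0.7:
        parts["alignment"] = 1.5
    elif s.motion_intensity < 40 and action_kg > 1.0:
        parts["alignment"] = -1.0

    # Environmental safety (oxygen and temperature checked independently)
    if s.dissolved_oxygen < 5.0:
        parts["oxygen"] = -4.0
    elif s.dissolved_oxygen < 5.5:
        parts["oxygen"] = -2.0
    if s.temperature > 30:
        parts["temperature"] = -3.0
    elif s.temperature > 29:
        parts["temperature"] = -1.0

    # Daily feeding frequency
    if s.feeds_today >= 5:
        parts["frequency"] = -3.0
    elif s.feeds_today >= 4:
        parts["frequency"] = -1.0

    # Inter-feed interval
    t = s.time_since_last_feed
    if 2.5 < t < 5.0:
        parts["interval"] = 1.5
    elif t < 1.5:
        parts["interval"] = -2.0
    elif t > 8.0:
        parts["interval"] = 0.5

    # Amount dispensed
    if action_kg > 4.0:
        parts["amount"] = -1.0
    elif action_kg > 3.0:
        parts["amount"] = -0.5

    return parts


well_timed = FeedState(0.8, 80, 6.5, 27.0, 2, 3.5)
adverse = FeedState(0.8, 30, 4.8, 28.0, 4, 1.0)

for label, action_kg, state in [("well-timed", 1.2, well_timed), ("adverse", 2.0, adverse)]:
    parts = reward_breakdown(action_kg, state)
    print(label, parts, "net:", sum(parts.values()))
```

Under these assumptions, the well-timed 1.2 kg feed collects +3.0 (efficiency), +1.5 (alignment) and +1.5 (interval) for a net of +6.0. The 2.0 kg feed to idle fish in critical oxygen, on the fourth feed of the day and one hour after the last, nets −7.5.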

Design Philosophy

Safety-first: Environmental penalties are the largest single components (−4.0 for critical O₂, −3.0 for heat stress) and can stack with each other, strongly discouraging the agent from feeding in dangerous conditions.
Efficiency-driven: The largest positive reward (+3.0) goes to near-perfect consumption efficiency, making waste minimisation the primary optimisation target.
Timing-aware: Both the inter-feed interval and daily frequency are shaped, teaching the agent a natural feeding rhythm.
Economically conscious: Feed is the largest operational cost in aquaculture; waste penalties directly reduce cost.
