Reward & Penalty System
How the FeedRight DQN agent is rewarded for good feeding decisions and penalised for poor ones.
The reward function is the core learning signal. It is implemented in FishFeedingEnv._calculate_reward() and is designed to drive the agent towards high feed-conversion efficiency, safe environmental conditions, appropriate timing, and economic waste minimisation.
Reward Summary Table
| Category | Condition | Reward | Rationale |
|---|---|---|---|
| Appropriate wait | No feed, fish not hungry | +0.5 | Conserves feed; avoids unnecessary dispensing |
| Missed hunger | No feed, but frenzy > 0.8 and time_since_feed > 4h | −1.5 | Fish are hungry and have been waiting — missed feeding window |
| Excellent efficiency | Feed action, consumption rate > 95% | +3.0 | Near-perfect match between amount and appetite |
| Good efficiency | Feed action, consumption rate > 85% | +1.5 | Strong efficiency |
| Acceptable efficiency | Feed action, consumption rate > 70% | +0.5 | Adequate but room for improvement |
| Poor efficiency (waste) | Feed action, consumption rate ≤ 70% | −2.0 | Significant feed waste |
| Active + hungry alignment | motion > 70 and frenzy > 0.7 | +1.5 | Feeding at the right moment |
| Overfeeding idle fish | motion < 40 and feed > 1.0 kg | −1.0 | Dispensing large amounts to uninterested fish |
| Critical low O₂ | dissolved_oxygen < 5.0 mg/L | −4.0 | Feeding in dangerously low oxygen |
| Suboptimal O₂ | dissolved_oxygen < 5.5 mg/L | −2.0 | Feeding in marginal oxygen |
| Heat stress | temperature > 30 °C | −3.0 | Fish are heat-stressed; metabolic risk |
| Warm conditions | temperature > 29 °C | −1.0 | Elevated but not critical temperature |
| Too many feeds today | feeds_today ≥ 5 | −3.0 | Over-stimulating the cage |
| Moderately frequent | feeds_today ≥ 4 | −1.0 | Getting close to daily limit |
| Good interval | 2.5h < time_since_feed < 5.0h | +1.5 | Ideal inter-feed gap |
| Too frequent | time_since_feed < 1.5h | −2.0 | Feeding too soon after last event |
| Long wait, OK to feed | time_since_feed > 8h | +0.5 | Fish have been waiting a long time |
| Excessive amount | feed > 4.0 kg | −1.0 | Very large single dispensation |
| Large amount | feed > 3.0 kg | −0.5 | Above-average dispensation |
Detailed Reward Logic
1. No-Feed Decision (action = 0 kg)
When the agent decides not to feed:

    if feeding_frenzy_score > 0.8 and time_since_last_feed > 4.0:  # hours
        reward -= 1.5  # Penalty: hungry fish were ignored
    else:
        reward += 0.5  # Reward: appropriate patience

The key insight is that waiting is not always bad: it is penalised only when the video analysis shows clear hunger signs and more than 4 hours have passed since the last feed.
2. Consumption Efficiency (Feed Decisions)
For any feed action ≥ 0.1 kg, the reward is primarily driven by how well the dispensed amount matches the fish's current appetite:

    optimal_amount = feeding_frenzy_score * 1.5  # kg
    amount_diff = abs(action - optimal_amount)
    consumption_rate = max(0.5, 1.0 - amount_diff * 0.3)

The feeding_frenzy_score (from video analysis, range 0–1) is the primary appetite signal. The "optimal" amount scales linearly with hunger, reaching 1.5 kg at maximum frenzy. The further the dispensed amount is from this optimum, the lower the estimated consumption rate and the lower the reward; the estimate is floored at 50%.

| Consumption Rate | Reward |
|---|---|
| > 95% | +3.0 (excellent) |
| > 85% | +1.5 (good) |
| > 70% | +0.5 (acceptable) |
| ≤ 70% | −2.0 (wasteful) |
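
As a rough illustration of how the consumption-rate formula and the tiers above combine, the sketch below maps a feed amount and a frenzy score to the efficiency reward. The function names `estimated_consumption_rate` and `efficiency_reward` are hypothetical and are not taken from FishFeedingEnv; only the constants come from this section.

```python
def estimated_consumption_rate(action_kg: float, frenzy: float) -> float:
    """Estimated fraction of feed eaten, per the formula above (floored at 0.5)."""
    optimal_kg = frenzy * 1.5  # optimal amount scales linearly with frenzy
    amount_diff = abs(action_kg - optimal_kg)
    return max(0.5, 1.0 - amount_diff * 0.3)


def efficiency_reward(rate: float) -> float:
    """Tiered efficiency reward; the first matching tier wins."""
    if rate > 0.95:
        return 3.0
    if rate > 0.85:
        return 1.5
    if rate > 0.70:
        return 0.5
    return -2.0


if __name__ == "__main__":
    # Same hunger level (frenzy 0.8, optimal about 1.2 kg), different amounts.
    for action_kg in (1.2, 1.0, 0.5, 3.5):
        rate = estimated_consumption_rate(action_kg, frenzy=0.8)
        print(f"{action_kg:.1f} kg -> rate {rate:.2f}, reward {efficiency_reward(rate):+.1f}")
```

Under this formula the three positive tiers correspond to being within roughly 0.17 kg, 0.5 kg, and 1.0 kg of the optimal amount; anything further off is treated as wasteful.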
3. Activity & Hunger Alignment
The agent receives a bonus for feeding when fish are both active and hungry, and a penalty for overfeeding inactive fish:

    if motion_intensity > 70 and feeding_frenzy_score > 0.7:
        reward += 1.5  # Great timing
    elif motion_intensity < 40 and action > 1.0:  # kg
        reward -= 1.0  # Overfeeding uninterested fish

4. Environmental Safety Penalties
Feeding during adverse environmental conditions is heavily penalised because uneaten feed degrades water quality further:

    if dissolved_oxygen < 5.0:    reward -= 4.0  # Critical (mg/L)
    elif dissolved_oxygen < 5.5:  reward -= 2.0  # Marginal
    if temperature > 30:    reward -= 3.0  # Severe heat stress (°C)
    elif temperature > 29:  reward -= 1.0  # Elevated temp

The −4.0 critical-oxygen penalty is the largest single term in the system. The oxygen and temperature checks are independent, so their penalties can stack in the same step, pushing the agent to put environmental safety ahead of feeding.
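
A minimal sketch of the two checks, assuming dissolved oxygen is given in mg/L, temperature in °C, and that the penalties are applied only to feed actions. The function name `environmental_penalty` is hypothetical. It illustrates that the oxygen and temperature tiers are evaluated separately, so one step can incur a penalty from each.

```python
def environmental_penalty(dissolved_oxygen: float, temperature: float) -> float:
    """Sum of the oxygen and temperature penalties for a feed action."""
    penalty = 0.0

    # Oxygen tiers: only the most severe matching tier applies.
    if dissolved_oxygen < 5.0:
        penalty -= 4.0
    elif dissolved_oxygen < 5.5:
        penalty -= 2.0

    # Temperature tiers are checked separately, so they stack with oxygen.
    if temperature > 30:
        penalty -= 3.0
    elif temperature > 29:
        penalty -= 1.0

    return penalty


if __name__ == "__main__":
    print(environmental_penalty(6.5, 27.0))  # normal conditions
    print(environmental_penalty(5.2, 29.5))  # marginal O2 plus warm water
    print(environmental_penalty(4.8, 31.0))  # critical O2 plus heat stress
```

With these thresholds the worst combined environmental penalty is −7.0 (critical oxygen together with heat stress).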
5. Feeding Frequency Penalties
To prevent over-stimulation and digestive stress:

    if feeds_today >= 5:    reward -= 3.0  # Too many feeds
    elif feeds_today >= 4:  reward -= 1.0  # Approaching limit

6. Inter-Feed Timing Rewards
The agent is rewarded for maintaining an appropriate gap between consecutive feeds:

    if 2.5 < time_since_last_feed < 5.0:  # hours
        reward += 1.5  # Ideal interval (sweet spot)
    elif time_since_last_feed < 1.5:
        reward -= 2.0  # Too frequent
    elif time_since_last_feed > 8.0:
        reward += 0.5  # Long gap, OK to feed now

7. Economic Waste Penalty
Large single dispensations are penalised to encourage smaller, more frequent, more efficient feeds:

    if action > 4.0:    reward -= 1.0  # kg
    elif action > 3.0:  reward -= 0.5

Reward Stacking
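All applicable reward components are additive. A single step can trigger multiple bonuses and penalties simultaneously. For example, feeding 3.5 kg to inactive fish in marginal oxygen (between 5.0 and 5.5 mg/L), assuming no temperature, frequency, or timing term applies, produces:

    -2.0   poor efficiency
    -1.0   overfeeding inactive fish
    -2.0   suboptimal O₂
    -0.5   large amount
    ─────
    -5.5   net reward

The efficiency term is −2.0 because the optimal amount never exceeds 1.5 kg: a 3.5 kg feed is at least 2.0 kg off, which puts the estimated consumption rate at its 50% floor. This multi-factor stacking makes it hard for the agent to "game" a single metric while ignoring the others.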
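
The stacking behaviour can be sketched as a table of first-match tiers whose results are summed. This is an illustrative reconstruction from the rules in this document, not the actual body of FishFeedingEnv._calculate_reward(): the `State` class, the `reward_breakdown` function, and the assumption that every term other than the wait/missed-hunger pair applies only to feed actions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Hypothetical container for the signals named in this document."""
    frenzy: float            # feeding_frenzy_score, 0-1
    motion: float            # motion_intensity
    dissolved_oxygen: float  # mg/L
    temperature: float       # degrees C
    feeds_today: int
    hours_since_feed: float  # time_since_last_feed, in hours


def reward_breakdown(action_kg: float, s: State) -> list[tuple[str, float]]:
    """Return every (label, value) component that fires for one step."""
    if action_kg < 0.1:  # no-feed decision
        if s.frenzy > 0.8 and s.hours_since_feed > 4.0:
            return [("missed hunger", -1.5)]
        return [("appropriate wait", 0.5)]

    parts: list[tuple[str, float]] = []

    def first_match(tiers):
        """Append only the first tier whose condition holds."""
        for condition, label, value in tiers:
            if condition:
                parts.append((label, value))
                return

    rate = max(0.5, 1.0 - abs(action_kg - s.frenzy * 1.5) * 0.3)
    first_match([(rate > 0.95, "excellent efficiency", 3.0),
                 (rate > 0.85, "good efficiency", 1.5),
                 (rate > 0.70, "acceptable efficiency", 0.5),
                 (True, "poor efficiency", -2.0)])
    first_match([(s.motion > 70 and s.frenzy > 0.7, "active + hungry", 1.5),
                 (s.motion < 40 and action_kg > 1.0, "overfeeding idle fish", -1.0)])
    first_match([(s.dissolved_oxygen < 5.0, "critical O2", -4.0),
                 (s.dissolved_oxygen < 5.5, "suboptimal O2", -2.0)])
    first_match([(s.temperature > 30, "heat stress", -3.0),
                 (s.temperature > 29, "warm conditions", -1.0)])
    first_match([(s.feeds_today >= 5, "too many feeds", -3.0),
                 (s.feeds_today >= 4, "moderately frequent", -1.0)])
    first_match([(2.5 < s.hours_since_feed < 5.0, "good interval", 1.5),
                 (s.hours_since_feed < 1.5, "too frequent", -2.0),
                 (s.hours_since_feed > 8.0, "long wait", 0.5)])
    first_match([(action_kg > 4.0, "excessive amount", -1.0),
                 (action_kg > 3.0, "large amount", -0.5)])
    return parts


if __name__ == "__main__":
    # 2.0 kg to idle, barely hungry fish, 1 h after the last feed, marginal O2.
    state = State(frenzy=0.2, motion=30, dissolved_oxygen=5.2,
                  temperature=27.0, feeds_today=2, hours_since_feed=1.0)
    parts = reward_breakdown(2.0, state)
    for label, value in parts:
        print(f"{value:+.1f}  {label}")
    print(f"{sum(v for _, v in parts):+.1f}  net reward")
```

In this scenario four penalties fire (poor efficiency, overfeeding idle fish, suboptimal O2, too frequent) for a net reward of −7.0. Returning the labelled components rather than a single number is a design choice made here for readability; it also makes it easier to log which terms drove a given reward.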
Design Philosophy
- Safety-first: The critical-oxygen penalty (−4.0) is the largest single term, and it stacks with the heat-stress penalty (−3.0), so feeding in dangerous conditions is strongly discouraged.
- Efficiency-driven: The largest positive reward (+3.0) goes to near-perfect consumption efficiency, making waste minimisation the primary optimisation target.
- Timing-aware: Both the inter-feed interval and daily frequency are shaped, teaching the agent a natural feeding rhythm.
- Economically conscious: Feed is typically the largest operational cost in aquaculture, so the waste penalties steer the agent towards lower feed cost.