Feed Right Docs

Reward & Penalty System

How the FeedRight DQN agent is rewarded for good feeding decisions and penalised for poor ones.

The reward function is the core learning signal. It is implemented in FishFeedingEnv._calculate_reward() and is designed to drive the agent towards high feed-conversion efficiency, safe environmental conditions, appropriate timing, and economic waste minimisation.

Reward Summary Table

| Category | Condition | Reward | Rationale |
| --- | --- | --- | --- |
| Appropriate wait | No feed, fish not hungry | +0.5 | Conserves feed; avoids unnecessary dispensing |
| Missed hunger | No feed, but frenzy > 0.8 and time_since_feed > 4 h | −1.5 | Fish are hungry and have been waiting; missed feeding window |
| Excellent efficiency | Feed action, consumption rate > 95% | +3.0 | Near-perfect match between amount and appetite |
| Good efficiency | Feed action, consumption rate > 85% | +1.5 | Strong efficiency |
| Acceptable efficiency | Feed action, consumption rate > 70% | +0.5 | Adequate but room for improvement |
| Poor efficiency (waste) | Feed action, consumption rate ≤ 70% | −2.0 | Significant feed waste |
| Active + hungry alignment | motion > 70 and frenzy > 0.7 | +1.5 | Feeding at the right moment |
| Overfeeding idle fish | motion < 40 and feed > 1.0 kg | −1.0 | Dispensing large amounts to uninterested fish |
| Critical low O₂ | dissolved_oxygen < 5.0 mg/L | −4.0 | Feeding in dangerously low oxygen |
| Suboptimal O₂ | dissolved_oxygen < 5.5 mg/L | −2.0 | Feeding in marginal oxygen |
| Heat stress | temperature > 30 °C | −3.0 | Fish are heat-stressed; metabolic risk |
| Warm conditions | temperature > 29 °C | −1.0 | Elevated but not critical temperature |
| Too many feeds today | feeds_today ≥ 5 | −3.0 | Over-stimulating the cage |
| Moderately frequent | feeds_today ≥ 4 | −1.0 | Getting close to daily limit |
| Good interval | 2.5 h < time_since_feed < 5.0 h | +1.5 | Ideal inter-feed gap |
| Too frequent | time_since_feed < 1.5 h | −2.0 | Feeding too soon after last event |
| Long wait, OK to feed | time_since_feed > 8 h | +0.5 | Fish have been waiting a long time |
| Excessive amount | feed > 4.0 kg | −1.0 | Very large single dispensation |
| Large amount | feed > 3.0 kg | −0.5 | Above-average dispensation |

Detailed Reward Logic

1. No-Feed Decision (action = 0 kg)

When the agent decides not to feed:

if feeding_frenzy_score > 0.8 and time_since_last_feed > 4.0:  # hours
    reward -= 1.5   # Penalty: hungry fish were ignored
else:
    reward += 0.5   # Reward: appropriate patience

The key insight is that waiting is not always bad: it is penalised only when the video analysis reports a frenzy score above 0.8 and more than 4 hours have passed since the last feed.
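As a minimal sketch of this branch, the rule can be written as a pure function. The name `no_feed_reward` and its signature are illustrative and not taken from `FishFeedingEnv`. Both comparisons are strict, so a frenzy score of exactly 0.8 or a gap of exactly 4 hours still earns the patience bonus.

```python
def no_feed_reward(feeding_frenzy_score: float, time_since_last_feed: float) -> float:
    """Reward for a no-feed step; time_since_last_feed is in hours."""
    if feeding_frenzy_score > 0.8 and time_since_last_feed > 4.0:
        return -1.5  # hungry fish were ignored
    return 0.5       # appropriate patience


print(no_feed_reward(0.9, 5.0))  # -1.5: hungry and overdue
print(no_feed_reward(0.9, 2.0))  # 0.5: hungry, but fed recently
print(no_feed_reward(0.8, 6.0))  # 0.5: frenzy is not strictly above 0.8
```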

2. Consumption Efficiency (Feed Decisions)

For any feed action ≥ 0.1 kg, the reward is primarily driven by how well the dispensed amount matches the fish's current appetite:

optimal_amount   = feeding_frenzy_score * 1.5            # kg
amount_diff      = abs(action - optimal_amount)          # kg
consumption_rate = max(0.5, 1.0 - amount_diff * 0.3)     # never below 0.5

The feeding_frenzy_score (from video analysis, range 0–1) is the primary appetite signal. The "optimal" amount scales linearly with hunger, reaching 1.5 kg at maximum frenzy. The further the actual feed amount is from optimal, the lower the estimated consumption rate and the lower the reward; the estimate never drops below 0.5.

| Consumption Rate | Reward |
| --- | --- |
| > 95% | +3.0 (excellent) |
| > 85% | +1.5 (good) |
| > 70% | +0.5 (acceptable) |
| ≤ 70% | −2.0 (wasteful) |
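Because the estimated rate is 1.0 minus 0.3 per kilogram of deviation, the tiers translate into deviations from the optimal amount: under about 0.167 kg for excellent, under 0.5 kg for good, and under 1.0 kg for acceptable. The sketch below works through a few amounts for a frenzy score of 0.8 (optimal amount 1.2 kg). The function names are illustrative, not the project's actual API.

```python
def consumption_rate(action: float, feeding_frenzy_score: float) -> float:
    """Estimated fraction of feed consumed; action is in kg."""
    optimal_amount = feeding_frenzy_score * 1.5
    amount_diff = abs(action - optimal_amount)
    return max(0.5, 1.0 - amount_diff * 0.3)


def efficiency_reward(rate: float) -> float:
    """Map an estimated consumption rate to its reward tier."""
    if rate > 0.95:
        return 3.0   # excellent
    if rate > 0.85:
        return 1.5   # good
    if rate > 0.70:
        return 0.5   # acceptable
    return -2.0      # wasteful


for amount in (1.2, 1.5, 2.0, 3.5):
    rate = consumption_rate(amount, feeding_frenzy_score=0.8)
    print(f"{amount} kg -> {efficiency_reward(rate)}")
```

With these inputs the four amounts land in the excellent, good, acceptable, and wasteful tiers respectively.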

3. Activity & Hunger Alignment

The agent receives a bonus for feeding when fish are both active and hungry, and a penalty for overfeeding inactive fish:

if motion_intensity > 70 and feeding_frenzy_score > 0.7:
    reward += 1.5   # Great timing
elif motion_intensity < 40 and action > 1.0:   # action in kg
    reward -= 1.0   # Overfeeding uninterested fish
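The two branches leave a neutral zone: moderately active fish, active fish with a low frenzy score, and small feeds to inactive fish all contribute nothing. A minimal sketch, with a hypothetical function name, makes those cases explicit:

```python
def alignment_reward(motion_intensity: float,
                     feeding_frenzy_score: float,
                     action: float) -> float:
    """Bonus or penalty for feed timing; action is the feed amount in kg."""
    if motion_intensity > 70 and feeding_frenzy_score > 0.7:
        return 1.5   # active and hungry
    if motion_intensity < 40 and action > 1.0:
        return -1.0  # large feed to inactive fish
    return 0.0       # neither condition met


print(alignment_reward(80, 0.9, 1.2))  # 1.5
print(alignment_reward(30, 0.2, 2.0))  # -1.0
print(alignment_reward(30, 0.2, 0.5))  # 0.0: inactive, but the feed is small
print(alignment_reward(55, 0.9, 2.0))  # 0.0: motion in the neutral band
```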

4. Environmental Safety Penalties

Feeding during adverse environmental conditions is heavily penalised because uneaten feed degrades water quality further:

if dissolved_oxygen < 5.0:    reward -= 4.0   # Critical
elif dissolved_oxygen < 5.5:  reward -= 2.0   # Marginal

if temperature > 30:  reward -= 3.0   # Severe heat stress
elif temperature > 29: reward -= 1.0   # Elevated temp

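The −4.0 penalty for critically low oxygen is the largest single penalty in the system, and the −3.0 heat-stress penalty is matched only by the −3.0 penalty for too many daily feeds. Together they push the agent to put environmental safety ahead of feeding.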
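The oxygen and temperature checks are two independent if/elif chains, so only one tier applies per variable, but the two variables can stack. A short sketch (the function name is an assumption made for illustration; units follow the summary table):

```python
def environment_penalty(dissolved_oxygen: float, temperature: float) -> float:
    """Penalty for feeding in poor conditions (oxygen in mg/L, temperature in °C)."""
    penalty = 0.0
    if dissolved_oxygen < 5.0:
        penalty -= 4.0   # critical oxygen
    elif dissolved_oxygen < 5.5:
        penalty -= 2.0   # marginal oxygen
    if temperature > 30:
        penalty -= 3.0   # severe heat stress
    elif temperature > 29:
        penalty -= 1.0   # elevated temperature
    return penalty


print(environment_penalty(4.8, 31.0))  # -7.0: both critical tiers stack
print(environment_penalty(5.2, 29.5))  # -3.0: marginal oxygen plus warm water
print(environment_penalty(6.5, 27.0))  # 0.0: safe conditions
```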

5. Feeding Frequency Penalties

To prevent over-stimulation and digestive stress:

if feeds_today >= 5:  reward -= 3.0   # Too many feeds
elif feeds_today >= 4: reward -= 1.0   # Approaching limit

6. Inter-Feed Timing Rewards

The agent is rewarded for maintaining an appropriate gap between consecutive feeds:

if 2.5 < time_since_last_feed < 5.0:   # hours
    reward += 1.5   # Ideal interval (sweet spot)
elif time_since_last_feed < 1.5:
    reward -= 2.0   # Too frequent
elif time_since_last_feed > 8.0:
    reward += 0.5   # Long gap, OK to feed now
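Note that the thresholds leave two neutral bands, from 1.5 to 2.5 hours and from 5 to 8 hours, where the timing term contributes nothing. The following sketch (function name hypothetical) shows one value from each region:

```python
def timing_reward(time_since_last_feed: float) -> float:
    """Reward for the gap since the previous feed, in hours."""
    if 2.5 < time_since_last_feed < 5.0:
        return 1.5   # ideal interval
    if time_since_last_feed < 1.5:
        return -2.0  # too frequent
    if time_since_last_feed > 8.0:
        return 0.5   # long gap, fine to feed
    return 0.0       # neutral bands: 1.5-2.5 h and 5-8 h


for hours in (1.0, 2.0, 3.0, 6.0, 9.0):
    print(hours, timing_reward(hours))
```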

7. Economic Waste Penalty

Large single dispensations are penalised to encourage smaller, more efficient feeds, within the daily feed limits set in section 5:

if action > 4.0:    reward -= 1.0   # kg
elif action > 3.0:  reward -= 0.5   # kg

Reward Stacking

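All applicable reward components are additive. A single step can trigger multiple bonuses and penalties simultaneously. For example, feeding 3.5 kg to inactive fish (motion < 40) while dissolved oxygen is between 5.0 and 5.5 mg/L produces the following, leaving aside the timing and frequency terms:

  -2.0  poor efficiency (3.5 kg is at least 2.0 kg above the 1.5 kg maximum optimal amount, so the consumption rate sits at its 0.5 floor)
  -1.0  overfeeding inactive fish
  -2.0  suboptimal O₂
  -0.5  large amount
  ─────
  -5.5  net reward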

This multi-factor stacking makes it hard for the agent to "game" a single metric while ignoring others.
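The additive structure can be sketched as a single function that sums the seven components. This is an illustration assembled from the rules on this page, not the actual body of `FishFeedingEnv._calculate_reward()`. The `Observation` container is hypothetical, and the sketch assumes that amounts below 0.1 kg count as a no-feed decision and that a no-feed step is scored by section 1 alone.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical container for the inputs the reward terms read."""
    feeding_frenzy_score: float   # 0-1, from video analysis
    motion_intensity: float
    dissolved_oxygen: float       # mg/L
    temperature: float            # °C
    feeds_today: int
    time_since_last_feed: float   # hours


def calculate_reward(obs: Observation, action: float) -> float:
    """Sum every applicable reward component; action is the feed amount in kg."""
    # 1. No-feed decision (assumed to exclude all other components)
    if action < 0.1:
        ignored = obs.feeding_frenzy_score > 0.8 and obs.time_since_last_feed > 4.0
        return -1.5 if ignored else 0.5

    reward = 0.0

    # 2. Consumption efficiency
    amount_diff = abs(action - obs.feeding_frenzy_score * 1.5)
    rate = max(0.5, 1.0 - amount_diff * 0.3)
    if rate > 0.95:
        reward += 3.0
    elif rate > 0.85:
        reward += 1.5
    elif rate > 0.70:
        reward += 0.5
    else:
        reward -= 2.0

    # 3. Activity and hunger alignment
    if obs.motion_intensity > 70 and obs.feeding_frenzy_score > 0.7:
        reward += 1.5
    elif obs.motion_intensity < 40 and action > 1.0:
        reward -= 1.0

    # 4. Environmental safety (oxygen and temperature checked independently)
    if obs.dissolved_oxygen < 5.0:
        reward -= 4.0
    elif obs.dissolved_oxygen < 5.5:
        reward -= 2.0
    if obs.temperature > 30:
        reward -= 3.0
    elif obs.temperature > 29:
        reward -= 1.0

    # 5. Daily feeding frequency
    if obs.feeds_today >= 5:
        reward -= 3.0
    elif obs.feeds_today >= 4:
        reward -= 1.0

    # 6. Inter-feed timing
    if 2.5 < obs.time_since_last_feed < 5.0:
        reward += 1.5
    elif obs.time_since_last_feed < 1.5:
        reward -= 2.0
    elif obs.time_since_last_feed > 8.0:
        reward += 0.5

    # 7. Economic waste
    if action > 4.0:
        reward -= 1.0
    elif action > 3.0:
        reward -= 0.5

    return reward


risky = Observation(feeding_frenzy_score=0.2, motion_intensity=30,
                    dissolved_oxygen=5.2, temperature=27.0,
                    feeds_today=1, time_since_last_feed=2.0)
ideal = Observation(feeding_frenzy_score=0.8, motion_intensity=80,
                    dissolved_oxygen=6.5, temperature=27.0,
                    feeds_today=1, time_since_last_feed=3.0)
print(calculate_reward(risky, 3.5))  # -5.5
print(calculate_reward(ideal, 1.2))  # 6.0
```

The `risky` call stacks four penalties (efficiency, overfeeding, marginal oxygen, large amount), while the `ideal` call stacks three bonuses (efficiency, alignment, interval).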

Design Philosophy

  1. Safety-first: The critical low-O₂ penalty (−4.0) is the largest single term in the reward and exceeds any single bonus in magnitude, strongly discouraging the agent from feeding in dangerous conditions.
  2. Efficiency-driven: The largest positive reward (+3.0) goes to near-perfect consumption efficiency, making waste minimisation the primary optimisation target.
  3. Timing-aware: Both the inter-feed interval and daily frequency are shaped, teaching the agent a natural feeding rhythm.
  4. Economically conscious: Feed is the largest operational cost in aquaculture; waste penalties directly reduce cost.
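To put these priorities in numbers, the extremes of each component for a feed step can be summed. The ranges below are read off the summary table; the dictionary and its key names are illustrative only.

```python
# (lowest, highest) contribution of each component on a feed step
COMPONENT_RANGE = {
    "efficiency": (-2.0, 3.0),
    "alignment": (-1.0, 1.5),
    "oxygen": (-4.0, 0.0),
    "temperature": (-3.0, 0.0),
    "daily_frequency": (-3.0, 0.0),
    "inter_feed_timing": (-2.0, 1.5),
    "amount": (-1.0, 0.0),
}

worst_case = sum(low for low, _ in COMPONENT_RANGE.values())
best_case = sum(high for _, high in COMPONENT_RANGE.values())
largest_bonus = max(high for _, high in COMPONENT_RANGE.values())
largest_penalty = min(low for low, _ in COMPONENT_RANGE.values())

print(worst_case, best_case)           # -16.0 6.0
print(largest_penalty, largest_bonus)  # -4.0 3.0
```

The critical-oxygen penalty therefore outweighs any single bonus, though not the best-case sum of all three bonuses, so the safety-first ordering holds per component rather than as an absolute guarantee.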
