Video Analysis
How FeedRight extracts biomass estimates and feeding-behaviour features from underwater cage video.
The DQN agent's 44-dimensional observation vector includes 11 features derived from underwater video — 6 Biomass features (indices 13–18) and 5 Video features (indices 19–23). This page documents the methodology for computing each group.
Biomass Features (indices 13–18)
Biomass estimation uses a fine-tuned YOLO object-detection model to detect individual fish in sampled video frames and convert bounding-box geometry into morphometric measurements.
Pipeline
frame ──► YOLO (conf ≥ 0.25) ──► bounding boxes
                                      │
                          ┌───────────┴───────────┐
                          ▼                       ▼
                   major axis (px)         minor axis (px)
                   ÷ pixels_per_cm         ÷ pixels_per_cm
                          │                       │
                      length_cm              diameter_cm
                          │                   × π │
                          │                   girth_cm
                          │                       │
                          └───────┬───────────────┘
                                  ▼
                      W = a × L^b × G^c (grams)
                                  │
                                  ▼ average across all sampled frames
          avg_weight × stocking_count → total_biomass (kg)
Step-by-step
- Detection: A YOLO model (YOLOv8/v11/v12, fine-tuned on underwater cage footage) runs on each sampled frame and produces per-fish bounding boxes with confidence scores. Only detections at or above a configurable confidence threshold (default 0.25) are retained.
- Length estimation: The longer axis of each bounding box (in pixels) is treated as fork length and converted to centimetres using a calibrated pixels_per_cm factor. This factor must be determined per camera via a reference object of known size.
- Girth estimation: The shorter axis is treated as body diameter. Girth is approximated as π × diameter_cm under a cylindrical body assumption.
- Weight estimation: Individual weight follows an allometric length-girth relationship, W = a × L^b × G^c, where b ≈ 2.5 (length exponent) and c ≈ 0.5 (girth exponent). The coefficient a and both exponents must be calibrated per species and camera setup.
- Aggregation: Per-fish weights are collected across all sampled frames. The mean weight is multiplied by the externally known stocking count to yield total cage biomass in kilograms. The stocking count is used (rather than the detection count) because only a small fraction of fish are visible in any single frame.
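The steps above can be sketched in a few lines of Python. This is a minimal illustration, not FeedRight's implementation: the function names, the PIXELS_PER_CM value, and the coefficient a are hypothetical placeholders, while the exponents and the 0.25 confidence threshold follow the text.

```python
import math
from statistics import mean

# Placeholder calibration values (assumptions, not FeedRight defaults).
PIXELS_PER_CM = 12.0       # hypothetical per-camera calibration
A, B, C = 0.02, 2.5, 0.5   # `a` is made up; b and c follow the text


def fish_weight_g(box_w_px, box_h_px, pixels_per_cm=PIXELS_PER_CM, a=A, b=B, c=C):
    """Estimate one fish's weight in grams from its bounding-box size."""
    length_cm = max(box_w_px, box_h_px) / pixels_per_cm    # longer axis -> length
    diameter_cm = min(box_w_px, box_h_px) / pixels_per_cm  # shorter axis -> diameter
    girth_cm = math.pi * diameter_cm                       # cylindrical body assumption
    return a * length_cm ** b * girth_cm ** c


def total_biomass_kg(detections, stocking_count, conf_threshold=0.25):
    """detections: (width_px, height_px, confidence) tuples from all sampled frames."""
    weights = [fish_weight_g(w, h) for w, h, conf in detections if conf >= conf_threshold]
    if not weights:
        return None  # no usable detections in this window
    return mean(weights) * stocking_count / 1000.0  # grams -> kilograms
```

Taking the larger box side as length makes the estimate independent of whether the fish appears horizontal or vertical in the frame, but a diagonally oriented fish would still be mis-measured by an axis-aligned box.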
Output features
| Index | Feature | Units | Description |
|---|---|---|---|
| 13 | current_total_biomass | kg | stocking_count × average_fish_weight ÷ 1000 (g to kg), with weight taken from YOLO detections |
| 14 | estimated_fish_count | count | Externally supplied stocking count |
| 15 | average_fish_weight | g | Mean individual weight across all sampled detections |
| 16 | biomass_growth_rate_7d | g/week | Weekly growth rate per individual (requires historical data) |
| 17 | days_since_stocking | days | Days since fish were stocked into the cage |
| 18 | growth_stage | enum | 0 = Juvenile (< 50 g), 1 = Growing (50–150 g), 2 = Mature (> 150 g) |
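As a small illustration of the last table row, the growth_stage enum could be derived from the mean weight as follows. The function name is hypothetical, and treating both 50 g and 150 g as belonging to the Growing stage is an assumption read from the ranges in the table.

```python
def growth_stage(average_fish_weight_g):
    """Map mean individual weight (g) to the growth_stage enum from the table."""
    if average_fish_weight_g < 50:
        return 0  # Juvenile
    if average_fish_weight_g <= 150:
        return 1  # Growing (50-150 g, boundaries assumed inclusive)
    return 2      # Mature
```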
References
- FishKP-YOLOv11 (2025) — keypoint-based size estimation in turbid water
- Stereo-vision biomass estimation with YOLOv11n-pose (MAPE ≈ 8.6%)
- AquaYOLO (2025) — mAP@50 = 0.909 on DePondFi dataset
Video Features (indices 19–23)
The five video features capture feeding behaviour and feed waste in real time. The pipeline combines spatial motion analysis, temporal smoothing, and optional pellet detection.
Architecture overview
frame_t, frame_{t-1}
    │
    ├──► Dense optical flow (Farnebäck)
    │        │
    │        ├──► Global magnitude histogram ──────► motion_intensity
    │        ├──► Spatial attention map ───────────► feeding_frenzy_score
    │        └──► Vertical flow profile ───────────► surface_activity
    │
    └──► YOLO pellet detection (optional)
             │
             ├──► Pellet count + vertical spread ──► uneaten_pellet_count
             └──► Depth-weighted sinking estimate ─► pellet_sinking_time
Feature computation
motion_intensity (index 19, range 0–100)
Histogram-based percentile scoring with temporal smoothing.
- Compute Farnebäck dense optical flow between consecutive greyscale frames.
- Build a magnitude histogram across the full flow field.
- Extract the 75th-percentile magnitude (P75) — this captures the motion of the actively moving fish while rejecting the static background and low-amplitude noise.
- Apply an exponential moving average (EMA, α = 0.3) across consecutive frames to smooth transient spikes from individual fish darting through the frame.
- Map the smoothed P75 to 0–100 using a species-calibrated reference range.
Underwater scenes have large static regions (cage netting, substrate) that would dilute a simple mean. Percentile scoring isolates the motion of the fish population itself.
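A minimal sketch of the P75, EMA, and scaling steps is shown below. It assumes the per-pixel flow magnitudes have already been computed (for example with OpenCV's Farnebäck implementation) and flattened into a sequence. The class name and the ref_low/ref_high reference range are hypothetical stand-ins for the species-calibrated values; only α = 0.3 comes from the text.

```python
def percentile(values, q):
    """Linear-interpolated percentile (q in 0-100) of a non-empty sequence."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


class MotionIntensity:
    """P75 of flow magnitude -> EMA -> 0-100 score."""

    def __init__(self, alpha=0.3, ref_low=0.5, ref_high=6.0):
        # ref_low / ref_high (pixels per frame) are placeholder values.
        self.alpha = alpha
        self.ref_low = ref_low
        self.ref_high = ref_high
        self.smoothed = None

    def update(self, magnitudes):
        """magnitudes: flat sequence of flow magnitudes for one frame pair."""
        p75 = percentile(magnitudes, 75)
        if self.smoothed is None:
            self.smoothed = p75  # seed the EMA with the first observation
        else:
            self.smoothed = self.alpha * p75 + (1 - self.alpha) * self.smoothed
        scaled = (self.smoothed - self.ref_low) / (self.ref_high - self.ref_low)
        return 100.0 * min(1.0, max(0.0, scaled))
```

With 70% of pixels static and 30% moving at magnitude 5, the mean is 1.5 while the interpolated P75 is 3.75, which illustrates why the percentile is less diluted by static background than a simple mean.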
feeding_frenzy_score (index 20, range 0–1)
Spatial attention ratio with directional weighting.
- Divide the flow field into a 3×3 spatial grid.
- For each cell, compute the upward-component magnitude (vertical flow with negative v, i.e. fish swimming toward the surface) separately from the total magnitude.
- Compute the surface convergence ratio: the fraction of total upward-flow energy concentrated in the top row of the grid.
- Weight this by a directional coherence score — the mean cosine similarity of flow vectors in the top row. When many fish swim in the same upward direction simultaneously (coordinated rush to surface), coherence is high; random jittering produces low coherence.
- Final score: convergence_ratio × directional_coherence, clipped to [0, 1].
The directional coherence component ensures that only coordinated upward movement registers as frenzy. Camera shake, water turbulence from wave action, or random jittering all produce low coherence and are filtered out.
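The scoring steps can be sketched as follows. The text does not say what reference direction the cosine similarity uses, so this sketch assumes it is measured against the straight-up direction in image coordinates; that choice, the function name, and the input layout (a 2-D list of (u, v) vectors) are all assumptions for illustration.

```python
import math


def feeding_frenzy_score(flow, eps=1e-9):
    """flow: 2-D list (rows x cols) of (u, v) vectors; v < 0 means upward motion."""
    rows, cols = len(flow), len(flow[0])
    upward = [[0.0] * 3 for _ in range(3)]  # upward-flow energy per 3x3 grid cell
    cosines = []                             # alignment with "straight up", top row only
    for y, row in enumerate(flow):
        gy = y * 3 // rows
        for x, (u, v) in enumerate(row):
            gx = x * 3 // cols
            upward[gy][gx] += max(0.0, -v)   # keep only the upward component
            mag = math.hypot(u, v)
            if gy == 0 and mag > eps:
                cosines.append(-v / mag)     # cosine between (u, v) and (0, -1)
    total_up = sum(map(sum, upward))
    if total_up <= eps or not cosines:
        return 0.0
    convergence_ratio = sum(upward[0]) / total_up
    directional_coherence = sum(cosines) / len(cosines)
    return min(1.0, max(0.0, convergence_ratio * directional_coherence))
```

Under this assumption, uniform upward motion across the whole frame scores about 1/3 (the top row holds a third of the upward energy), whereas upward motion confined to the top row scores 1.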
surface_activity (index 21, range 0–1)
Vertical motion energy profile with adaptive thresholding.
- Divide the frame into horizontal bands (e.g. 8 bands of equal height).
- For each band, compute the total kinetic energy proxy: Σ(magnitude²) summed over all pixels in the band. Squaring emphasises fast-moving fish.
- Fit a monotonic energy profile from bottom to top. Under normal conditions, fish distribute evenly and the profile is flat. When fish congregate near the surface (hunger, low O₂, pellet arrival), the profile skews heavily toward the top bands.
- Compute the energy skew as the fraction of total squared-magnitude energy in the top two bands, minus the expected fraction under uniform distribution (0.25 for 2 of 8 bands).
- Normalise to [0, 1] using a calibrated maximum skew value.
This approach captures the redistribution of fish activity toward the surface rather than an absolute motion ratio. It self-calibrates against variable overall motion levels — if the entire frame is active but evenly distributed, surface_activity stays near zero.
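The band-energy skew can be sketched as below. This simplified version works directly on the raw band energies and skips the profile-fitting step; the function name and the max_skew default are placeholders for the calibrated value mentioned above.

```python
def surface_activity(magnitudes, n_bands=8, top_bands=2, max_skew=0.5):
    """magnitudes: 2-D list (rows x cols) of flow magnitudes, row 0 at the top."""
    rows = len(magnitudes)
    band_energy = [0.0] * n_bands
    for y, row in enumerate(magnitudes):
        band = y * n_bands // rows                  # horizontal band index, 0 = top
        band_energy[band] += sum(m * m for m in row)  # squared-magnitude energy
    total = sum(band_energy)
    if total == 0:
        return 0.0
    top_fraction = sum(band_energy[:top_bands]) / total
    expected = top_bands / n_bands                  # 0.25 for 2 of 8 bands
    skew = top_fraction - expected
    return min(1.0, max(0.0, skew / max_skew))      # max_skew is a placeholder
```

Because the skew is a fraction of total energy, scaling every magnitude by the same factor leaves the result unchanged, which is the self-calibrating property described above.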
pellet_sinking_time (index 22, range 0–30 seconds)
With pellet detection model: A YOLO model fine-tuned on feed pellets detects uneaten pellets in each sampled frame. Sinking time is estimated via depth-weighted tracking:
- For each detected pellet, record its vertical centre position as a fraction of frame height (0 = top, 1 = bottom).
- Across consecutive frames, match pellets by nearest-neighbour in position space to build short tracklets.
- For each tracklet, compute the vertical displacement rate (pixels/second downward).
- Convert to estimated real-world sinking speed using the camera's pixels_per_cm calibration.
- Report the median sinking time across all active tracklets: median(depth_fraction × cage_depth / sinking_speed).
Without pellet detection model: Falls back to an inverse-activity proxy — low global motion combined with high surface activity suggests pellets are sinking uneaten while fish are not pursuing them.
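For the tracked path, the final two steps can be sketched as below. The sketch assumes tracklets have already been built by the nearest-neighbour matching step and that cage depth is given in centimetres; the function name, argument names, and the choice to return None when no usable tracklet exists are illustrative assumptions. The 30-second cap mirrors the stated feature range.

```python
from statistics import median


def pellet_sinking_time(tracklets, frame_height_px, pixels_per_cm,
                        cage_depth_cm, max_time_s=30.0):
    """tracklets: list of [(t_seconds, y_centre_px), ...], y measured from the frame top."""
    times = []
    for track in tracklets:
        (t0, y0), (t1, y1) = track[0], track[-1]
        if t1 <= t0 or y1 <= y0:
            continue  # skip tracks that are too short or not moving downward
        speed_cm_s = (y1 - y0) / (t1 - t0) / pixels_per_cm
        depth_fraction = y1 / frame_height_px  # 0 = top, 1 = bottom
        times.append(depth_fraction * cage_depth_cm / speed_cm_s)
    if not times:
        return None  # caller would switch to the inverse-activity proxy
    return min(max_time_s, median(times))
```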
uneaten_pellet_count (index 23, range 0–500)
With pellet detection model: Direct count of pellets detected with confidence at or above the threshold (default 0.25). Applies non-maximum suppression (IoU threshold 0.5) and temporal consistency filtering: a pellet must appear in at least 2 of 3 consecutive sampled frames to be counted. This eliminates false positives from debris, bubbles, or specular reflections, which tend to appear transiently.
Without pellet detection model: Uses a motion deficit heuristic: compare the current motion_intensity against the trailing 60-second EMA. When current motion drops below the baseline by more than a calibrated threshold (suggesting fish have lost interest in feeding), the deficit magnitude is mapped to an estimated uneaten count via a species-calibrated lookup curve.
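The 2-of-3 temporal consistency rule for the detection-based count could look like the sketch below. It is one possible reading of the rule: a pellet in the current frame is counted only if a nearby detection exists in at least one of the two preceding sampled frames. The function name and match_radius_px are hypothetical, and the inputs are assumed to have already passed the confidence threshold and NMS.

```python
import math


def count_persistent_pellets(frames, match_radius_px=15.0, max_count=500):
    """frames: up to three sampled frames (oldest first), each a list of (x, y) centres."""
    *earlier, current = frames[-3:]

    def seen_in(frame, pellet):
        # A pellet "appears" in a frame if any detection lies within the match radius.
        return any(math.dist(pellet, other) <= match_radius_px for other in frame)

    count = sum(1 for p in current if any(seen_in(f, p) for f in earlier))
    return min(count, max_count)  # cap at the stated feature range
```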
Temporal aggregation
For the DQN observation vector, all five video features are aggregated across the analysis window (typically 1–2 minutes of sampled frames):
- motion_intensity: EMA-smoothed final value (most recent state)
- feeding_frenzy_score: Maximum over the window (captures peak hunger signal)
- surface_activity: Mean over the window (captures sustained redistribution)
- pellet_sinking_time: Median across all tracklets in the window
- uneaten_pellet_count: Maximum over the window (worst-case waste signal)
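The aggregation rules above amount to one reduction per feature. The sketch below uses a hypothetical function name and argument layout, and returning 0.0 for pellet_sinking_time when the window has no tracklets is an assumption rather than documented behaviour.

```python
from statistics import mean, median


def aggregate_window(motion_ema, frenzy, surface, sinking_times, pellet_counts):
    """Collapse per-frame values from one analysis window into indices 19-23."""
    return [
        motion_ema[-1],                                   # 19: most recent EMA value
        max(frenzy),                                      # 20: peak over the window
        mean(surface),                                    # 21: mean over the window
        median(sinking_times) if sinking_times else 0.0,  # 22: median across tracklets
        max(pellet_counts),                               # 23: worst case over the window
    ]
```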
Signal source summary
| Feature | Primary signal | Fallback (no pellet model) |
|---|---|---|
| motion_intensity | Optical flow P75 + EMA | Same |
| feeding_frenzy_score | Upward flow convergence × directional coherence | Same |
| surface_activity | Vertical energy profile skew | Same |
| pellet_sinking_time | YOLO pellet depth-weighted tracking | Inverse-activity proxy |
| uneaten_pellet_count | YOLO pellet count with temporal filtering | Motion deficit heuristic |
References
- Improved YOLO-V4 for real-time uneaten pellet detection (Computers and Electronics in Agriculture, 2021)
- YOLOv8-BaitScan (Fishes, 2025) — lightweight pellet detection and counting framework
- Frame-pair motion encoding with EfficientFeedingNet (Animals, 2026) — Farnebäck optical-flow images for feeding-state classification
- PM-YOLO (2025) — Parallelised Patch-Aware Attention for fish feeding behaviour detection at 8.1 ms/frame