
Tracking Concepts

Explanations of the core algorithms used by the world-coordinate tracker (app/tracking/world/tracker.py). These concepts underpin the Frame Processing Pipeline.

Bounding box and IoU

A bounding box is a rectangle that encloses a detected person. CTS uses axis-aligned boxes described by four pixel coordinates: (x_min, y_min, x_max, y_max).

Intersection over Union (IoU) measures how much two bounding boxes overlap:

IoU = area(A ∩ B) / area(A ∪ B)

+------------------+
|    Box A         |
|       +----------+--------+
|       |    ∩     |        |
+-------+----------+        |
        |    Box B          |
        +-------------------+

IoU = (area of overlap) / (area of A + area of B minus area of overlap)

IoU value | Meaning
1.0 | Boxes are identical
0.5 | Roughly half the combined area is shared
0.25 | Minimal but meaningful overlap
0.0 | No overlap

IoU is used in the detection dedup step: newly decoded bounding boxes whose IoU with an already-kept box exceeds detection_iou_dedup_threshold (default: 0.55) are suppressed before the tracker sees them. The world tracker's association uses floor-plane Mahalanobis distance, not IoU.
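
The overlap formula and the dedup step can be sketched in a few lines of Python. This is a minimal illustration, not the CTS implementation: the names iou and dedup_boxes are hypothetical, and the keep-first ordering (boxes assumed to arrive sorted by descending confidence) is an assumption. Only the threshold default of 0.55 comes from the text above.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dedup_boxes(boxes: List[Box], threshold: float = 0.55) -> List[Box]:
    """Keep a box only if it does not overlap an already-kept box too much.

    Assumption: `boxes` is ordered so that the preferred box comes first.
    """
    kept: List[Box] = []
    for box in boxes:
        if all(iou(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept
```

For example, two boxes shifted by one pixel on a 10-pixel-wide person have an IoU of about 0.82, so the second one is suppressed, while a box elsewhere in the frame survives.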

Kalman filter

A Kalman filter is a recursive state estimator. Given a noisy stream of floor-position observations, it maintains a smoothed estimate of a person's position and velocity and predicts where they will be in the next frame.

State vector

The world tracker uses a 4-dimensional floor-plane state (app/tracking/world/kalman.py):

x = [x_m, y_m, vx_m_s, vy_m_s]^T

Component | Meaning
x_m, y_m | Position on the shared floor plane in metres
vx_m_s, vy_m_s | Velocity in metres per second

The observation is z = [x_m, y_m]^T from the per-camera homography (the calibrated FloorPoint). Velocity is inferred from the sequence of floor positions, not measured from the frame directly.

Predict and update

Each frame, the filter runs two steps:

  • Predict: advances the position by velocity * dt and decays velocity toward zero over velocity_decay_s (default: 3.0 s). This is where the filter extrapolates a person's floor position when no observation arrives.
  • Update: blends the prediction with the observed floor point using the Kalman gain, weighted by observation_noise_m (default: 0.25 m, the calibration residual 95th percentile).

The result is a track that tolerates brief observation gaps and produces smooth floor positions even when the detector or homography is noisy.
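
The two steps can be sketched for a single floor axis. This is an illustration under stated assumptions, not the code in kalman.py: it tracks one axis with state [position, velocity] (treating noise on x and y as independent, so the 4-dimensional filter would apply the same algebra to each axis), it models velocity decay as an exponential with time constant velocity_decay_s, and it uses a white-acceleration process noise model. The names AxisState, predict, and update are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class AxisState:
    pos: float   # position along one floor axis, metres
    vel: float   # velocity along that axis, metres per second
    p_pp: float  # variance of position
    p_pv: float  # covariance of position and velocity
    p_vv: float  # variance of velocity


def predict(s: AxisState, dt: float, velocity_decay_s: float = 3.0,
            accel_noise: float = 0.5) -> AxisState:
    """Advance the state by dt seconds with no observation."""
    decay = math.exp(-dt / velocity_decay_s)  # assumption: exponential decay
    q = accel_noise ** 2  # assumption: white-acceleration process noise
    # Transition F = [[1, dt], [0, decay]]; covariance becomes F P F^T + Q.
    return AxisState(
        pos=s.pos + s.vel * dt,
        vel=s.vel * decay,
        p_pp=s.p_pp + 2 * dt * s.p_pv + dt * dt * s.p_vv + q * dt ** 4 / 4,
        p_pv=decay * (s.p_pv + dt * s.p_vv) + q * dt ** 3 / 2,
        p_vv=decay * decay * s.p_vv + q * dt * dt,
    )


def update(s: AxisState, z: float, observation_noise_m: float = 0.25) -> AxisState:
    """Blend the prediction with an observed position z (metres)."""
    innovation = z - s.pos
    innovation_var = s.p_pp + observation_noise_m ** 2
    k_pos = s.p_pp / innovation_var  # Kalman gain for position
    k_vel = s.p_pv / innovation_var  # Kalman gain for velocity
    return AxisState(
        pos=s.pos + k_pos * innovation,
        vel=s.vel + k_vel * innovation,
        p_pp=(1 - k_pos) * s.p_pp,
        p_pv=(1 - k_pos) * s.p_pv,
        p_vv=s.p_vv - k_vel * s.p_pv,
    )
```

Calling predict repeatedly without update is the coasting case: position keeps advancing while velocity shrinks and the position variance grows, which in turn makes the next observation count for more.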

Mahalanobis gating

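Before the Hungarian assignment, track-observation pairs whose squared Mahalanobis distance exceeds gate_chi2 = 9.21 (the 99% quantile of the chi-squared distribution with 2 degrees of freedom) are excluded. This prevents the solver from assigning a detection to a PersonHypothesis (PH, the tracker's per-person track, described under PersonHypothesis (PH)) on the opposite side of the room.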
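
The threshold can be reproduced in closed form, because the chi-squared distribution with 2 degrees of freedom has the CDF 1 - exp(-x / 2). The sketch below computes that quantile and the squared Mahalanobis distance for a 2x2 covariance. The function names are hypothetical, and which covariance CTS feeds in (assumed here to be the covariance of the position residual) is not stated in this page.

```python
import math


def chi2_quantile_2dof(p: float) -> float:
    """Quantile of the chi-squared distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)


def mahalanobis_sq(dx: float, dy: float,
                   s_xx: float, s_xy: float, s_yy: float) -> float:
    """Squared Mahalanobis distance of residual (dx, dy) under covariance S."""
    det = s_xx * s_yy - s_xy * s_xy
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # Inverse of a 2x2 matrix applied to the residual on both sides.
    return (s_yy * dx * dx - 2.0 * s_xy * dx * dy + s_xx * dy * dy) / det


def passes_gate(d_sq: float, gate_chi2: float = 9.21) -> bool:
    return d_sq <= gate_chi2
```

With a standard deviation of 0.5 m on each axis and no correlation, a 1 m residual gives a squared distance of 4.0 and passes, while a 2 m residual gives 16.0 and is excluded.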

Hungarian algorithm

The Hungarian algorithm (also called the Munkres algorithm) solves the assignment problem: given a cost matrix where cost[i][j] is the cost of assigning track i to detection j, find the globally optimal one-to-one assignment that minimizes total cost.

Tracks:       T1   T2   T3
            +---------------+
Detection D1| 0.1  0.8  0.9 | <- cheapest: T1
Detection D2| 0.7  0.2  0.8 | <- cheapest: T2
Detection D3| 0.9  0.7  0.3 | <- cheapest: T3
            +---------------+
Optimal assignment: D1->T1, D2->T2, D3->T3  (total cost: 0.6)

A greedy approach (each detection picks its cheapest track independently) would fail when two detections compete for the same track. The Hungarian algorithm solves this globally and runs in O(n^3) time, which is fast enough for typical frame sizes (fewer than 20 people per camera).

Any detection left unmatched after the assignment spawns a new track. Any track left unmatched increments its lost counter.
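
The difference between greedy and globally optimal assignment shows up on a small matrix. The sketch below is not the Hungarian algorithm: it finds the optimum by exhaustive search over permutations, which is only practical for tiny square matrices but returns the same answer. A production system would typically call a library solver such as SciPy's linear_sum_assignment; whether CTS does so is not stated here. The function names are hypothetical, and rows are detections and columns are tracks, as in the diagram above.

```python
from itertools import permutations
from typing import List


def greedy_assign(cost: List[List[float]]) -> List[int]:
    """Each row, in order, takes its cheapest column that is still free."""
    taken = set()
    result = []
    for row in cost:
        free = [j for j in range(len(row)) if j not in taken]
        best = min(free, key=lambda j: row[j])
        taken.add(best)
        result.append(best)
    return result


def optimal_assign(cost: List[List[float]]) -> List[int]:
    """Exhaustive search for the one-to-one assignment with minimum total cost."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)


def total_cost(cost: List[List[float]], assignment: List[int]) -> float:
    return sum(cost[i][j] for i, j in enumerate(assignment))


# Two detections both prefer track 0.
contested = [[0.10, 0.20],
             [0.15, 0.90]]
greedy = greedy_assign(contested)    # detection 0 grabs track 0 first
optimal = optimal_assign(contested)  # detection 0 yields track 0
```

On the contested matrix the greedy result costs 1.0 in total, while the optimal assignment costs 0.35, because giving up the cheapest single pairing avoids the expensive leftover.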

World tracker association

The world tracker combines the Kalman filter, Hungarian algorithm, and appearance embeddings into a single floor-plane tracker. There is no per-camera tracker; one WorldTracker instance processes observations from all cameras in a single pass.

Association cost

The cost matrix (app/tracking/world/cost_matrix.py) passed to the Hungarian algorithm is a weighted blend of three components:

cost(PH i, observation j) = alpha_geo * geo_cost + alpha_app * app_cost + alpha_height * height_cost

Component | Default weight | Measurement
geo_cost | alpha_geo = 0.5 | Normalized Mahalanobis distance on the floor plane
app_cost | alpha_app = 0.4 | Cosine distance to the best-matching PH view prototype (max-over-views), falling back to the single gallery mean when no prototypes exist
height_cost | alpha_height = 0.1 | Gaussian score on height difference

Two hard gates return GATE_INF (never match):

  • Geometric gate: squared Mahalanobis distance > gate_chi2 (9.21). For uncalibrated observations the gate is widened to uncalibrated_gate_chi2 (21.0) and appearance is weighted more heavily, because synthetic floor points are not metric.
  • Identity-conflict gate: the observation carries a recognized face person_id that differs from the PH's identity at confidence >= face_conflict_threshold (0.70).
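
The blend and the two gates can be sketched as one function. Several details are assumptions rather than facts from cost_matrix.py: GATE_INF is represented as floating-point infinity, geo_cost is normalized by dividing the squared distance by the active gate, a PH with no identity never triggers the conflict gate, and the heavier appearance weighting for uncalibrated observations is omitted because its weights are not given. The function names are hypothetical.

```python
from typing import Optional

GATE_INF = float("inf")  # assumption: the page names GATE_INF but not its value


def has_face_conflict(obs_person_id: Optional[str], obs_confidence: float,
                      ph_identity: Optional[str],
                      face_conflict_threshold: float = 0.70) -> bool:
    """True when a confident recognized face disagrees with the PH identity."""
    if obs_person_id is None or ph_identity is None:
        return False  # assumption: nothing to conflict with
    return obs_person_id != ph_identity and obs_confidence >= face_conflict_threshold


def association_cost(maha_sq: float, app_cost: float, height_cost: float, *,
                     calibrated: bool = True, face_conflict: bool = False,
                     gate_chi2: float = 9.21, uncalibrated_gate_chi2: float = 21.0,
                     alpha_geo: float = 0.5, alpha_app: float = 0.4,
                     alpha_height: float = 0.1) -> float:
    gate = gate_chi2 if calibrated else uncalibrated_gate_chi2
    if maha_sq > gate or face_conflict:
        return GATE_INF  # hard gate: this pair can never be matched
    geo_cost = maha_sq / gate  # assumption: normalize into [0, 1] by the gate
    return alpha_geo * geo_cost + alpha_app * app_cost + alpha_height * height_cost
```

A squared distance of 10.0 is rejected for a calibrated observation but still scored for an uncalibrated one, which is the widening described in the geometric gate.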

PH lifecycle

Parameter | Default | Meaning
min_observations_to_publish | 3 | Minimum observations before a PH appears in downstream outputs
ph_close_grace_s | 15.0 s | Grace period before an unmatched PH is closed
revive_max_age_s | 30.0 s | A closed PH older than this is not revived
revive_max_distance_m | 2.0 m | Spawn point versus closed PH last position (same-camera revival)
revive_appearance_min_sim | 0.55 | Minimum appearance cosine for same-camera revival
inferred_handoff_max_s | 600 s | Window for publishing a continuation candidate to CC

A PH with observation_count < min_observations_to_publish is tracked internally but produces no WorldFrameSnapshot and no downstream output, filtering out single-frame detection noise.

PH revival (continuity)

Before spawning a brand-new UNKNOWN PH for an unmatched observation, the tracker tries to revive a recently-closed PH that matches on space, time, and appearance, reusing its ph_id, identity, and gallery state. This keeps one person mapped to one PH through brief occlusions, turns, and association gaps, instead of fragmenting into many short-lived hypotheses. Same-camera revival uses revive_* thresholds above; cross-camera revival is gated by the learned camera topology and multi-view appearance (see How identity crosses cameras).
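
The same-camera check on space, time, and appearance can be sketched as a single predicate. This is an assumption-laden illustration: the function name is hypothetical, the comparisons are taken to be inclusive, and any tie-breaking between several eligible closed PHs is left out. The defaults are the revive_* values from the table above.

```python
import math
from typing import Tuple


def can_revive_same_camera(seconds_since_close: float,
                           spawn_xy: Tuple[float, float],
                           closed_last_xy: Tuple[float, float],
                           appearance_sim: float, *,
                           revive_max_age_s: float = 30.0,
                           revive_max_distance_m: float = 2.0,
                           revive_appearance_min_sim: float = 0.55) -> bool:
    """True when a closed PH is recent, nearby, and looks like the new observation."""
    distance_m = math.hypot(spawn_xy[0] - closed_last_xy[0],
                            spawn_xy[1] - closed_last_xy[1])
    return (seconds_since_close <= revive_max_age_s
            and distance_m <= revive_max_distance_m
            and appearance_sim >= revive_appearance_min_sim)
```

All three conditions must hold together, so a look-alike who appears across the room, or the same person returning after several minutes, spawns a new PH instead.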

PersonHypothesis (PH)

A PersonHypothesis (PH, frozen dataclass in domain/__init__.py) is the world-level entity representing one tracked person. It is the single physical-track identifier in the system. Every downstream output -- trajectory rows, room dwells, dementia signals, Redis stream events, and MCP tool responses -- references the ph_id (a UUID string) as the primary identity anchor.

A PH aggregates evidence from multiple cameras simultaneously. When two calibrated cameras share a field of view, the pre-association dedup pass ensures they produce exactly one PH, not two.

Key PH fields:

Field | Meaning
state_mean | (x, y, vx, vy) floor-plane metres and m/s (Kalman mean)
state_cov | 16 floats, 4x4 row-major covariance
born_at, last_seen_at, closed_at | Lifecycle timestamps
last_seen_camera, active_cameras | Camera attribution; active_cameras accumulates every camera that has contributed
observation_count | Total observations folded in; gates publication via min_observations_to_publish
current_identity_id, current_identity_committed_at | The committed identity and when it was committed
gallery_mean | Online L2-normalised EMA of SOLIDER embeddings; alpha = 1 / min(count+1, 100)
height_estimate_m | EMA of height estimates (alpha = 0.1)
mean_quality | EMA of per-observation CropQuality scores: 0.1 * obs.quality + 0.9 * prev. Travels to CC as the quality field on PersonLocationEnvelope.
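
The two EMA rules in the table can be written out directly. This sketch assumes that count is the number of observations already folded in and that the blended gallery vector is re-normalised after every update; neither detail is confirmed by this page, and the function names are hypothetical.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def update_gallery_mean(mean: List[float], embedding: List[float],
                        count: int, cap: int = 100) -> List[float]:
    """alpha = 1 / min(count + 1, cap), then re-normalise to unit length."""
    alpha = 1.0 / min(count + 1, cap)
    blended = [(1.0 - alpha) * m + alpha * e for m, e in zip(mean, embedding)]
    return l2_normalize(blended)


def update_mean_quality(prev: float, obs_quality: float) -> float:
    """Fixed-rate EMA from the table: 0.1 * obs.quality + 0.9 * prev."""
    return 0.1 * obs_quality + 0.9 * prev
```

With this schedule the first embedding replaces the initial value outright (alpha = 1), early updates behave like a running average, and after 100 observations alpha settles at 0.01 so the gallery keeps adapting slowly.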

Uncalibrated cameras. When a camera has no homography, SpatialProjectionStage yields FloorPoint(calibrated=False). WorldTrackingStage._synthetic_floor_point maps the bbox centre into a 4 m virtual room offset into a per-camera 200 m tile, still flagged calibrated=False. Because synthetic floor points jump as a person walks, the association gate is widened for uncalibrated observations (uncalibrated_gate_chi2 = 21.0) and appearance is weighted over geometry, so a turning or walking person is not dropped and respawned. Floor-distance dedup still skips uncalibrated observations, but appearance dedup within a declared overlap group does not. For uncalibrated home cameras, identity crosses cameras through the shared ReID gallery, recognized-face propagation, PH continuation, group-appearance dedup, and cross-camera revival.

Quality and provenance

Each observation contributes a quality score from CropQuality (app/pipeline/crop_quality.py). The PH's mean_quality accumulates these scores across frames. Downstream consumers receive quality as an explicit field in the response envelope; they never compute it client-side.

Cross-camera dedup

When two cameras see the same person simultaneously (the canonical case: a hallway camera and an adjacent camera at a bathroom door), naive association would produce two PHs for one person. The pre-association floor-point dedup (dedup_observations()) prevents this.

Before the Hungarian assignment runs, all detections with calibrated floor positions are grouped by floor proximity and identity compatibility. Each group elects one representative detection. Only representatives enter the associate() call, preserving the 1-to-1 contract of the Hungarian solver. After association, the cluster membership map propagates all source camera IDs back to the winning PH.

See Frame Pipeline stage 7 for the configuration knobs and the integration proof.

Handling noisy readings

A single camera's view of a person is noisy. Detection boxes jitter frame to frame, homography projection adds error near the edges of a calibrated floor, and two cameras watching the same person from different angles disagree about both where the person is and what posture they hold. CTS does not persist any single camera's raw reading. Both floor position and posture are recomputed every frame as a fused estimate that combines overlapping cameras in the same time window and smooths across time.

The cadence is per frame, but the value is filtered, not raw.

Floor position

Position fusion happens in three layers before a value is written.

  1. Per-camera measurement. SpatialProjectionStage projects each detection's foot point (the bottom-centre of the bounding box) through that camera's homography to a calibrated FloorPoint. Uncalibrated cameras yield a synthetic per-camera tile point instead.
  2. Cross-camera collapse, same window. Before association, dedup_observations() merges observations from different cameras that fall within a residual-aware distance gate into one representative. The representative's floor point is the quality-weighted mean of the cluster's calibrated points, so two cameras seeing one person contribute one averaged position rather than two competing tracks. See Cross-camera dedup.
  3. Temporal fusion. The representative drives one Kalman update for the PH. The filter blends the measurement with the motion model, weighted by observation_noise_m. See Kalman filter.
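
The averaging in layer 2 can be sketched as follows. The function name and the tuple layout are hypothetical, and the fallback to a plain mean when every quality is zero is an assumption. The residual-aware distance gate that decides cluster membership is not shown.

```python
from typing import List, Tuple


def fuse_floor_points(cluster: List[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Quality-weighted mean of (x_m, y_m, quality) points from one cluster."""
    if not cluster:
        raise ValueError("cluster must contain at least one point")
    total_quality = sum(q for _, _, q in cluster)
    if total_quality <= 0:
        # Assumption: with no usable weights, fall back to an unweighted mean.
        n = len(cluster)
        return (sum(x for x, _, _ in cluster) / n,
                sum(y for _, y, _ in cluster) / n)
    return (sum(x * q for x, _, q in cluster) / total_quality,
            sum(y * q for _, y, q in cluster) / total_quality)
```

For two cameras reporting (0.0, 0.0) at quality 1.0 and (1.0, 2.0) at quality 3.0, the representative lands at (0.75, 1.5), three quarters of the way toward the better view.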

Two more mechanisms suppress outliers and premature output. The Mahalanobis gate (gate_chi2 = 9.21) prevents a wild measurement from matching a PH at all, and min_observations_to_publish (default: 3) withholds a PH from downstream output until it has accumulated enough evidence to be more than a single-frame spike.

The persisted trajectory point uses state_mean, the Kalman posterior, not any camera's measurement. Raw per-camera observations are still saved for trails and audit, but they are not the authoritative position.

Mechanism | Scope | Parameter | Default
Quality-weighted dedup mean | Cross-camera, same frame | dedup_max_distance_m | 0.9 m
Residual-widened dedup gate | Cross-camera, same frame | dedup_residual_coeff_k, dedup_max_distance_ceiling_m | 1.0, 1.5 m
Kalman blend | Per camera and over time | observation_noise_m, process_noise_accel_m_s2 | 0.25 m, 0.5 m/s²
Velocity decay on coasting | Over time | velocity_decay_s | 3.0 s
Mahalanobis outlier gate | Per frame | gate_chi2 | 9.21
Publication threshold | Per PH | min_observations_to_publish | 3

Posture

Posture follows the same shape: soft per-camera evidence, quality-weighted fusion across overlapping cameras, then temporal smoothing.

Soft scoring. PostureStage scores every detection into a PostureScores record (lying, sitting, standing_walking, and keypoint_confidence) rather than committing to one label per camera. Additive soft evidence lets several weak geometric cues accumulate instead of firing on a single threshold, which is the first layer of jitter reduction. In a bedroom, a small lying prior is added when the body is occluded.

Multi-camera fusion. GlobalPostureTracker keeps the latest score snapshot per camera per person. Every camera that sees a person in a frame contributes its own snapshot, keyed by that camera. Fusion is a quality-weighted average across the person's active cameras, where the weight is each camera's keypoint confidence, so a clear full-body view outweighs a partial or occluded one. Depth-only cameras keep a small floor weight. A camera that has lost sight of the person stops contributing after camera_stale_after_s (default: 10 s).
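
A minimal sketch of the confidence-weighted average, assuming each camera's snapshot carries its scores, its keypoint confidence, and the time it was last seen. The data layout and function name are hypothetical and do not mirror GlobalPostureTracker; the depth-only floor weight is left out.

```python
from typing import Dict, Tuple

LABELS = ("lying", "sitting", "standing_walking")
# (scores per label, keypoint_confidence, last-seen timestamp in seconds)
Snapshot = Tuple[Dict[str, float], float, float]


def fuse_posture(snapshots: Dict[str, Snapshot], now_s: float,
                 camera_stale_after_s: float = 10.0) -> Dict[str, float]:
    """Average per-camera scores, weighting each camera by keypoint confidence."""
    fused = {label: 0.0 for label in LABELS}
    total_weight = 0.0
    for scores, keypoint_confidence, seen_at_s in snapshots.values():
        if now_s - seen_at_s > camera_stale_after_s:
            continue  # this camera lost sight of the person too long ago
        for label in LABELS:
            fused[label] += keypoint_confidence * scores.get(label, 0.0)
        total_weight += keypoint_confidence
    if total_weight > 0:
        fused = {label: value / total_weight for label, value in fused.items()}
    return fused
```

A camera with keypoint confidence 0.9 therefore has nine times the say of one at 0.1, and a snapshot older than the stale window is ignored entirely.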

The resolve-and-smooth step runs once per person per frame on the dedup-representative frame. Non-representative cameras ingest their evidence without advancing the smoother, so adding cameras never distorts the temporal hysteresis.

Temporal smoothing. The fused soft scores resolve to a label by clinical priority, then a hysteresis gate requires the new label to persist for required_consecutive frames (default: 2) before the committed posture flips. This prevents a single ambiguous frame from toggling posture.
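
The hysteresis gate can be sketched as a small state machine. The class name is hypothetical, and two behaviours are assumptions: the first label ever seen is committed immediately, and a frame that agrees with the committed posture resets the pending streak. Resolving fused scores to a label by clinical priority happens before this step and is not shown.

```python
from typing import Optional


class PostureHysteresis:
    """Flip the committed posture only after N consecutive frames agree."""

    def __init__(self, required_consecutive: int = 2) -> None:
        self.required_consecutive = required_consecutive
        self.committed: Optional[str] = None
        self._candidate: Optional[str] = None
        self._streak = 0

    def update(self, label: str) -> str:
        if self.committed is None:
            self.committed = label  # assumption: first label commits at once
            return self.committed
        if label == self.committed:
            self._candidate, self._streak = None, 0
            return self.committed
        if label == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = label, 1
        if self._streak >= self.required_consecutive:
            self.committed = label
            self._candidate, self._streak = None, 0
        return self.committed
```

Feeding the sequence standing_walking, lying, standing_walking, lying, lying yields a committed posture that stays standing_walking for the first four frames and flips to lying only on the fifth.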

The persisted posture on each trajectory point is the fused, smoothed value. room_dwells records the dominant posture across the interval.

Mechanism | Scope | Parameter | Default
Soft additive scoring | Per detection | _MIN_EVIDENCE | 0.5
Quality-weighted fusion | Cross-camera | weight = keypoint confidence | n/a
Depth-only floor weight | Cross-camera | depth_weight | 0.15
Stale-camera expiry | Cross-camera | camera_stale_after_s | 10.0 s
Hysteresis flip threshold | Over time | required_consecutive | 2 frames

PH continuation

When a person leaves the camera field and then re-enters, the world tracker detects the handoff via PHContinuationCandidate. When a new PH spawns, the tracker looks back at recently closed PHs: if a closed PH is within inferred_handoff_max_s (600 s) and inferred_handoff_max_distance_m (5.0 m), a PHContinuationCandidate is emitted with both PH IDs, the elapsed time, and the spatial distance.

The continuation candidate carries predecessor_identity_id, so the CC side can inherit the prior identity without waiting for the Bayesian resolver to re-accumulate evidence. CTS also acts on the handoff internally through cross-camera revival (path 6 in How identity crosses cameras): when the appearance and learned topology agree, the closed PH is revived on the new camera and keeps its ph_id and identity, rather than spawning a fresh UNKNOWN track. The Bayesian identity resolver then runs normally; the inherited gallery and face lock maintain the identity across the gap.

Identity resolver

For each PH that received an observation this frame, IdentityResolver.resolve (app/tracking/identity_resolver.py) builds a posterior over {enrolled identities} ∪ {UNKNOWN} from four parts and applies a commit rule.

Commit thresholds: commit_prob = 0.65, commit_margin = 0.15; in dense scenes (2 or more identities with posterior > 0.3) commit_prob_dense = 0.80, commit_margin_dense = 0.20. The temporal prior alone cannot commit an identity: at least one sensory source (face or ReID) must support the top identity unless the PH is inside its maintenance window.
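
The commit rule can be sketched as a function of the posterior. Several points are assumptions: the margin is measured against the runner-up entry in the posterior (which may be UNKNOWN), the dense-scene flag is supplied by the caller rather than derived here, and the function name and dictionary layout are hypothetical.

```python
from typing import Dict, Optional

UNKNOWN = "UNKNOWN"


def commit_identity(posterior: Dict[str, float], *, dense_scene: bool,
                    has_sensory_support: bool,
                    in_maintenance_window: bool = False) -> Optional[str]:
    """Return the identity to commit, or None when the rule is not met."""
    if not posterior:
        return None
    prob_min, margin_min = (0.80, 0.20) if dense_scene else (0.65, 0.15)
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    top_id, top_p = ranked[0]
    runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_id == UNKNOWN:
        return None
    # The temporal prior alone cannot commit: face or ReID must back the top
    # identity unless the PH is inside its maintenance window.
    if not (has_sensory_support or in_maintenance_window):
        return None
    if top_p >= prob_min and top_p - runner_up_p >= margin_min:
        return top_id
    return None
```

A posterior of 0.70 for one identity and 0.30 for UNKNOWN commits in a sparse scene with sensory support, but the same posterior is held back in a dense scene because 0.70 is below commit_prob_dense.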

Face lock and maintenance window. When a face anchor's confidence clears face_commit_min_confidence = 0.70, the resolver sets a face lock on the PH. A face-locked identity is held without fresh face evidence for face_lock_maintenance_max_age_s = 300 s. Without a face lock, an existing identity is held by the prior for prior_maintenance_max_age_s = 120 s. After the window expires with no sensory evidence, the identity decays.

Sticky maintenance (favor continuity). When a person turns away, the face anchor vanishes and body ReID drifts, so the per-frame posterior argmax can flip to UNKNOWN even though the PH persists. Sticky maintenance holds the committed identity within the maintenance window unless it is strongly contradicted: a recognized face for a different identity at or above contradiction_face_confidence = 0.70, or a different identity that clears the dense-scene posterior thresholds. A candidate or unrecognized face never contradicts a held identity, so a resident at a bad angle keeps their label instead of dropping to UNKNOWN. Two enrolled people in one room still separate, because a recognized different-identity face does contradict.

Three-valued face evidence and head pose

The person-identification-service already distinguishes a face that was detected but not recognized from no face region at all, and it computes head pose. CTS consumes all three states (recognized / candidate / unrecognized) plus head yaw:

State | Similarity | Effect on the posterior
recognized | at or above recognition.threshold | Strong positive for the matched identity, weighted face_weight_multiplier = 3.0, scaled by frontality
candidate | between unknown_threshold and threshold | Weak positive for the best candidate (candidate_face_weight_multiplier); corroborates a held identity through near-profile frames; never negative
unrecognized | below unknown_threshold | Small mass toward UNKNOWN (face_present_unknown_unknown_mass = 0.10); never subtracts from a held identity
no face region | n/a | Neutral; rely on body ReID, the prior, and sticky maintenance

A frontality factor down-weights off-axis matches: full weight at or below frontality_full_yaw_deg = 15 degrees yaw, ramping to a floor (frontality_min_factor = 0.3) at or above frontality_zero_yaw_deg = 60 degrees. This reduces false commits from a glancing match while leaving frontal matches at full strength.
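
The frontality factor can be sketched as a piecewise function of absolute yaw. The text says only that the factor ramps between the two angles; a linear ramp is assumed here, and the function name is hypothetical.

```python
def frontality_factor(yaw_deg: float,
                      frontality_full_yaw_deg: float = 15.0,
                      frontality_zero_yaw_deg: float = 60.0,
                      frontality_min_factor: float = 0.3) -> float:
    """Weight in [frontality_min_factor, 1.0] for a face at the given yaw."""
    yaw = abs(yaw_deg)  # left and right turns are treated alike
    if yaw <= frontality_full_yaw_deg:
        return 1.0
    if yaw >= frontality_zero_yaw_deg:
        return frontality_min_factor
    # Assumption: linear interpolation between the two breakpoints.
    t = (yaw - frontality_full_yaw_deg) / (frontality_zero_yaw_deg - frontality_full_yaw_deg)
    return 1.0 - t * (1.0 - frontality_min_factor)
```

Under the linear assumption, a face at 37.5 degrees yaw (halfway along the ramp) gets a factor of 0.65, so a recognized match there would carry 3.0 * 0.65 = 1.95 instead of the full face_weight_multiplier of 3.0.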

A single averaged body embedding cannot retrieve a person who turned around, because front and back are far apart in SOLIDER-REID space. The resolver estimates body orientation per observation from pose keypoints (front, back, left, right, unknown) and represents appearance as a small set of view-binned prototypes per PH and per identity. The gallery query runs once per orientation bin and takes the maximum logistic similarity across views, so a back-facing query that matches an identity's back entries scores high even when the front entries do not. A weak match leaves residual probability mass on UNKNOWN, so a non-matching body cannot be normalized onto the only enrolled identity.

Non-frontal gallery coverage grows online: once a PH has a recognized-face committed identity, subsequent observation embeddings are written to that identity's gallery tagged with their estimated orientation, subject to quality and orientation-confidence gates. Seeding requires a recognized face lock, so a candidate or unknown face never poisons the shared gallery.

How a face signal updates a PH

This is the exact stage-by-stage path when ArcFace identifies a person on a camera:

The key detail: FaceIdentityStage produces anchors keyed by detection_id (the PH does not exist yet at stage 6). After association, WorldTracker._resolve_identities remaps each anchor's tracklet_id to the assigned ph_id using det_to_ph. The identity-conflict hard gate in the cost matrix simultaneously prevents a strong, differing face from being matched to the wrong PH.

How identity crosses cameras

Path 1 requires calibration. For uncalibrated home cameras, identity crosses through the ReID gallery (path 2), recognized-face propagation (path 3, gated twice: gallery similarity >= cross_gt_face_propagation_threshold (0.72) and synthetic_confidence = source_confidence * gallery_sim >= face_commit_min_confidence (0.70)), continuation candidates published to CC (path 4), appearance dedup within a declared overlap group (path 5), and cross-camera revival acted on inside CTS (path 6).
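
The double gate on recognized-face propagation (path 3) is easy to misread, so a small sketch may help. The function name and return convention are hypothetical; the two thresholds and the product formula are the ones quoted above.

```python
from typing import Optional


def propagated_face_confidence(source_confidence: float, gallery_sim: float,
                               cross_gt_face_propagation_threshold: float = 0.72,
                               face_commit_min_confidence: float = 0.70) -> Optional[float]:
    """Return the synthetic confidence if both gates pass, otherwise None."""
    if gallery_sim < cross_gt_face_propagation_threshold:
        return None  # gate 1: appearance match across cameras is too weak
    synthetic_confidence = source_confidence * gallery_sim
    if synthetic_confidence < face_commit_min_confidence:
        return None  # gate 2: discounted confidence cannot support a commit
    return synthetic_confidence
```

Because the second gate applies to the product, a source face at 0.95 with gallery similarity 0.75 passes (about 0.71), while a source face at 0.90 with the same similarity does not (about 0.68), even though both clear the similarity gate.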

Cross-camera revival (path 6) extends PH revival across cameras. A closed PH may be revived on a different camera when the learned camera topology says the transit is plausible (plausible_transit >= cross_camera_min_plausibility = 0.05, with a floor so a first handoff is never blocked) and the multi-view appearance similarity clears cross_camera_revive_appearance_min_sim = 0.60. A recognized different-identity face blocks revival. When the predecessor PH is still open (a genuine co-presence on two overlapping cameras), CTS does not merge the PHs; it adopts the identity onto the new PH and records a co-presence link, so the dashboard shows one person on two views rather than two people.

Camera topology is learned online: each accepted handoff records a directed edge with a running transit-time distribution (camera_topology_edges). Until an edge has enough samples it returns the plausibility floor, so topology widens linking over time without blocking legitimate first handoffs.


Released under the AGPL-3.0 License.