
Landmark Detection and Smoothing

Overview

Function 2 of the Hand Landmark Detection model performs the core landmark prediction and applies exponential-moving-average smoothing for temporal stability. It processes the warped 224x224 hand crops produced by Function 1 and predicts the 21 hand landmarks with sub-pixel precision.

Purpose and Functionality

This function handles the critical landmark processing tasks:

  • Model Inference: Predicts 21 hand landmarks within the cropped region
  • Coordinate Normalization: Converts pixel coordinates to normalized space
  • Temporal Smoothing: Applies Exponential Moving Average (EMA) for stability
  • Quality Assessment: Validates landmark accuracy and confidence

Landmark Normalization

Coordinate Normalization Implementation

def normalize_landmarks_correctly(landmarks_px, image_width=224, image_height=224, normalize_z_factor=1.0):
    """Follow MediaPipe's exact normalization from the C++ source.

    landmarks_px is a list/array of [x_px, y_px, z_px] coordinates.
    """
    normalized_landmarks = []
    for landmark in landmarks_px:
        x, y, z = landmark[:3]

        norm_x = x / image_width
        norm_y = y / image_height
        norm_z = (z / image_width) / normalize_z_factor  # Z uses width and an optional normalization factor

        normalized_landmarks.append([norm_x, norm_y, norm_z])
    return normalized_landmarks
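As a quick sanity check, the normalization can be exercised with a couple of hand-picked pixel coordinates (the function is re-stated compactly here so the snippet runs standalone; the input values are illustrative):

```python
def normalize_landmarks_correctly(landmarks_px, image_width=224, image_height=224, normalize_z_factor=1.0):
    # Compact restatement of the function above.
    normalized = []
    for landmark in landmarks_px:
        x, y, z = landmark[:3]
        normalized.append([x / image_width, y / image_height, (z / image_width) / normalize_z_factor])
    return normalized

# Two illustrative pixel-space landmarks from a 224x224 crop.
landmarks_px = [[112.0, 56.0, 0.0], [224.0, 112.0, 56.0]]
result = normalize_landmarks_correctly(landmarks_px)
print(result)  # [[0.5, 0.25, 0.0], [1.0, 0.5, 0.25]]
```

Note that Z is divided by the image width, not by a separate depth range: MediaPipe expresses depth in the same scale as X.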

EMA Smoothing Algorithm
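The smoothing used in lm_postprocess below is a per-coordinate exponential moving average, where alpha weights the current frame against the previous smoothed value. A minimal sketch of the update:

```python
def ema(current, previous, alpha=0.6):
    """Blend the current raw value with the previous smoothed value.

    alpha = 1.0 disables smoothing entirely; smaller alpha trusts history more.
    """
    return alpha * current + (1 - alpha) * previous

print(ema(1.0, 0.0))  # 0.6
```

With the default alpha of 0.6, a sudden jump in a landmark position is only 60% reflected in the next frame, which damps jitter at the cost of a small amount of lag.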

IoU Calculation for Tracking

def calculate_iou(box1, box2):
    """
    Calculates Intersection over Union (IoU) between two bounding boxes.
    Box format: [x_min, y_min, width, height] (normalized coordinates)
    """
    x1_min, y1_min, w1, h1 = box1
    x1_max, y1_max = x1_min + w1, y1_min + h1

    x2_min, y2_min, w2, h2 = box2
    x2_max, y2_max = x2_min + w2, y2_min + h2

    inter_x_min = max(x1_min, x2_min)
    inter_y_min = max(y1_min, y2_min)
    inter_x_max = min(x1_max, x2_max)
    inter_y_max = min(y1_max, y2_max)

    inter_w = max(0, inter_x_max - inter_x_min)
    inter_h = max(0, inter_y_max - inter_y_min)
    intersection_area = inter_w * inter_h

    area1 = w1 * h1
    area2 = w2 * h2
    union_area = area1 + area2 - intersection_area

    if union_area == 0:
        return 0.0
    return intersection_area / union_area

Main Processing Function

Complete Landmark Processing with EMA

import numpy as np

def lm_postprocess(current_region, inference, previous_frame_regions,
                   alpha=0.6, iou_threshold=0.4, crop_width=224, crop_height=224):
    """Process landmark model output, applying EMA smoothing."""

    # Hand presence score - Identity_1
    current_region.lm_score = float(np.squeeze(inference['Identity_1']))

    # Handedness (left/right) - Identity_2
    current_region.handedness = float(np.squeeze(inference['Identity_2']))

    # Use 'Identity' (output 0) for pixel coordinates from the 224x224 crop
    pixel_landmarks_raw = np.squeeze(inference['Identity'])  # Shape (63,)

    # Reshape to (21, 3) for [x, y, z]
    pixel_landmarks_reshaped = pixel_landmarks_raw.reshape(21, 3)

    # Apply correct normalization
    current_raw_normalized_landmarks = normalize_landmarks_correctly(
        pixel_landmarks_reshaped,
        image_width=crop_width,
        image_height=crop_height
    )

    # --- EMA Smoothing ---
    best_match_prev_region = None
    max_iou = 0.0

    if previous_frame_regions and hasattr(current_region, 'pd_box'):
        for prev_region in previous_frame_regions:
            if hasattr(prev_region, 'pd_box'):
                # decode_bboxes stores pd_box as [x_center - w*0.5, y_center - h*0.5, w, h],
                # i.e. [x_min, y_min, w, h], so no format conversion is needed here.
                iou = calculate_iou(current_region.pd_box, prev_region.pd_box)
                if iou > max_iou:
                    max_iou = iou
                    best_match_prev_region = prev_region

    final_landmarks_to_set = list(current_raw_normalized_landmarks)  # Default to raw

    if best_match_prev_region and max_iou >= iou_threshold and \
            hasattr(best_match_prev_region, 'landmarks') and \
            best_match_prev_region.landmarks is not None and \
            len(best_match_prev_region.landmarks) == len(current_raw_normalized_landmarks):

        prev_smoothed_landmarks = best_match_prev_region.landmarks

        temp_smoothed_ema = []
        for i in range(len(current_raw_normalized_landmarks)):
            smooth_pt = [
                alpha * current_raw_normalized_landmarks[i][k] + (1 - alpha) * prev_smoothed_landmarks[i][k]
                for k in range(3)  # x, y, z
            ]
            temp_smoothed_ema.append(smooth_pt)
        final_landmarks_to_set = temp_smoothed_ema

    current_region.landmarks = final_landmarks_to_set
    return current_region

Palm Detection Decision Logic

Tracking Quality Assessment

def should_run_palm_detection(tracked_regions, lm_score_threshold=0.7):
    """Determine if palm detection is needed."""
    if not tracked_regions:
        return True  # No existing tracking, need to detect.

    for region in tracked_regions:
        if not hasattr(region, 'lm_score') or region.lm_score < lm_score_threshold:
            return True  # A region has low landmark score or no score, re-detect.

    return False  # All tracked regions have good scores.
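A brief standalone demonstration of the decision logic (the function is repeated here, with types.SimpleNamespace standing in for the tracked region objects):

```python
from types import SimpleNamespace

def should_run_palm_detection(tracked_regions, lm_score_threshold=0.7):
    # Repeated from the snippet above so this demo runs standalone.
    if not tracked_regions:
        return True  # No existing tracking, need to detect.
    for region in tracked_regions:
        if not hasattr(region, 'lm_score') or region.lm_score < lm_score_threshold:
            return True  # Low or missing landmark score, re-detect.
    return False  # All tracked regions have good scores.

print(should_run_palm_detection([]))                               # True: nothing tracked yet
print(should_run_palm_detection([SimpleNamespace(lm_score=0.9)]))  # False: tracking is healthy
print(should_run_palm_detection([SimpleNamespace(lm_score=0.5)]))  # True: score below 0.7
```

Skipping palm detection while tracking is healthy is the main latency win of the pipeline: the landmark model alone runs per frame, and the detector only fires when confidence drops.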



Performance Optimizations

Batch Landmark Processing

def batch_landmark_processing(warped_images, regions, landmark_model):
    """
    Process multiple hand images in batch for efficiency.

    Args:
        warped_images: List of 224x224 hand images
        regions: List of region objects matching warped_images, updated in place
        landmark_model: Loaded landmark detection model

    Returns:
        List of processed landmark results
    """
    if not warped_images:
        return []

    # Prepare batch input
    batch_input = np.stack([
        preprocess_image_for_inference(img) for img in warped_images
    ])

    # Run batch inference
    batch_outputs = landmark_model.infer(batch_input)

    # Process each result
    results = []
    for i, region in enumerate(regions):
        # Slice this sample's outputs into the per-sample dict lm_postprocess expects
        inference = {
            'Identity': batch_outputs['Identity'][i],      # 21x3 pixel landmarks
            'Identity_1': batch_outputs['Identity_1'][i],  # hand presence score
            'Identity_2': batch_outputs['Identity_2'][i],  # handedness score
            'Identity_3': batch_outputs['Identity_3'][i],  # 3D world coordinates
        }

        # Post-process (no previous-frame regions available in batch mode)
        processed = lm_postprocess(region, inference, None)
        results.append(processed)

    return results

Configuration Parameters

Detection Settings

  • landmark_confidence_threshold: Minimum confidence for landmarks (default: 0.5)
  • hand_presence_threshold: Hand presence detection threshold (default: 0.7)
  • anatomical_validation: Enable pose validation (default: true)

Smoothing Parameters

  • ema_smoothing_factor: Base smoothing strength (default: 0.7)
  • iou_matching_threshold: Threshold for frame matching (default: 0.3)
  • motion_adaptation: Enable motion-adaptive smoothing (default: true)
  • max_smoothing_frames: Maximum frames for smoothing history (default: 5)
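The motion_adaptation option listed above is not shown in the code in this section; a common approach (a sketch under that assumption, not the pipeline's actual implementation) is to scale the EMA alpha with inter-frame displacement, so fast-moving hands are smoothed less and lag is reduced:

```python
def adaptive_alpha(base_alpha, displacement, motion_scale=10.0, max_alpha=0.95):
    """Raise alpha (trust the current frame more) as motion grows.

    displacement: mean per-landmark movement between frames, in normalized units.
    motion_scale and max_alpha are illustrative tuning constants, not pipeline defaults.
    """
    return min(max_alpha, base_alpha + motion_scale * displacement)

print(round(adaptive_alpha(0.7, 0.0), 2))   # 0.7  - static hand: full smoothing retained
print(round(adaptive_alpha(0.7, 0.01), 2))  # 0.8  - moderate motion: lighter smoothing
print(round(adaptive_alpha(0.7, 0.05), 2))  # 0.95 - fast motion: nearly raw landmarks
```

The resulting alpha would replace the fixed value passed to the EMA blend, trading a little extra jitter during fast motion for noticeably lower lag.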

Quality Control

  • enable_world_coordinates: Process 3D world coordinates (default: true)
  • strict_anatomical_validation: Enforce strict anatomical constraints (default: false)
  • interpolate_missing_landmarks: Fill low-confidence landmarks (default: true)

Output Format

ProcessedLandmarks Structure

class ProcessedLandmarks:
    def __init__(self, landmarks, normalized_landmarks, presence_score,
                 handedness_score, handedness_classification):
        self.landmarks = landmarks                        # Original space coordinates
        self.normalized_landmarks = normalized_landmarks  # [0,1] normalized
        self.presence_score = presence_score              # Hand presence confidence
        self.handedness_score = handedness_score          # Handedness confidence
        self.handedness = handedness_classification       # 'left' or 'right'
        self.world_landmarks = None                       # 3D world coordinates
        self.quality_score = None                         # Overall quality metric
        self.smoothing_applied = False                    # EMA smoothing flag
        self.processing_time = None                       # Processing latency
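A short construction example (the class is repeated here so the snippet runs standalone; the scores and the single landmark are illustrative stand-ins for real model output):

```python
class ProcessedLandmarks:
    # Repeated from the structure above.
    def __init__(self, landmarks, normalized_landmarks, presence_score,
                 handedness_score, handedness_classification):
        self.landmarks = landmarks
        self.normalized_landmarks = normalized_landmarks
        self.presence_score = presence_score
        self.handedness_score = handedness_score
        self.handedness = handedness_classification
        self.world_landmarks = None
        self.quality_score = None
        self.smoothing_applied = False
        self.processing_time = None

result = ProcessedLandmarks(
    landmarks=[[112.0, 56.0, 0.0]],            # one point shown instead of all 21
    normalized_landmarks=[[0.5, 0.25, 0.0]],
    presence_score=0.93,
    handedness_score=0.88,
    handedness_classification='right',
)
print(result.handedness, result.smoothing_applied)  # right False
```

The optional fields (world_landmarks, quality_score, processing_time) start as None and are filled in by later pipeline stages when the corresponding options are enabled.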

Integration Points

Input Interface

  • Warped Images: Receives 224x224 hand images from Function 1
  • Transform Matrices: Uses coordinate transformation data
  • Previous Landmarks: Accesses previous frame data for smoothing

Output Interface

  • Landmark Coordinates: Provides 21-point hand structure
  • Confidence Scores: Reports detection and handedness confidence
  • Quality Metrics: Supplies processing quality information
  • Next Stage: Feeds into Model 3 (Gesture Embedder)

Next Steps

After landmark detection and smoothing, the pipeline proceeds to:

  • Model 3 (Gesture Embedder): Generate compact feature representations from landmarks
  • Coordinate Transformation: Convert landmarks to appropriate coordinate systems
  • Temporal Tracking: Maintain landmark identity across video frames
  • Quality Assessment: Monitor landmark stability and accuracy for downstream processing