
Landmark Detection and Smoothing

Overview

Function 2 of the Hand Landmark Detection model performs the core landmark prediction and applies exponential-moving-average smoothing for temporal stability. It processes the warped 224x224 hand crops produced by Function 1 and predicts the 21 hand landmarks with sub-pixel precision.

Purpose and Functionality

This function handles the critical landmark processing tasks:

  • Model Inference: Predicts 21 hand landmarks within the cropped region
  • Coordinate Normalization: Converts pixel coordinates to normalized space
  • Temporal Smoothing: Applies Exponential Moving Average (EMA) for stability
  • Quality Assessment: Validates landmark accuracy and confidence

Landmark Normalization

Coordinate Normalization Implementation

def normalize_landmarks_correctly(landmarks_px, image_width=224, image_height=224, normalize_z_factor=1.0):
    """Follow MediaPipe's exact normalization from the C++ source.

    landmarks_px is a list/array of [x_px, y_px, z_px] coordinates.
    """
    normalized_landmarks = []
    for landmark in landmarks_px:
        x, y, z = landmark[:3]

        norm_x = x / image_width
        norm_y = y / image_height
        norm_z = (z / image_width) / normalize_z_factor  # Z uses width and an optional normalization factor

        normalized_landmarks.append([norm_x, norm_y, norm_z])
    return normalized_landmarks
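As a quick sanity check, the normalization can be exercised with a couple of hand-picked pixel coordinates (the function is re-stated compactly here so the snippet runs standalone; the input values are illustrative):

```python
def normalize_landmarks_correctly(landmarks_px, image_width=224, image_height=224, normalize_z_factor=1.0):
    # Compact restatement of the function above.
    normalized = []
    for landmark in landmarks_px:
        x, y, z = landmark[:3]
        normalized.append([x / image_width, y / image_height, (z / image_width) / normalize_z_factor])
    return normalized

# Two illustrative pixel-space landmarks from a 224x224 crop.
landmarks_px = [[112.0, 56.0, 0.0], [224.0, 112.0, 56.0]]
result = normalize_landmarks_correctly(landmarks_px)
print(result)  # [[0.5, 0.25, 0.0], [1.0, 0.5, 0.25]]
```

Note that Z is divided by the image width, not by a separate depth range: MediaPipe expresses depth in the same scale as X.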

EMA Smoothing Algorithm
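The smoothing used in lm_postprocess below is a per-coordinate exponential moving average, where alpha weights the current frame against the previous smoothed value. A minimal sketch of the update:

```python
def ema(current, previous, alpha=0.6):
    """Blend the current raw value with the previous smoothed value.

    alpha = 1.0 disables smoothing entirely; smaller alpha trusts history more.
    """
    return alpha * current + (1 - alpha) * previous

print(ema(1.0, 0.0))  # 0.6
```

With the default alpha of 0.6, a sudden jump in a landmark position is only 60% reflected in the next frame, which damps jitter at the cost of a small amount of lag.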

IoU Calculation for Tracking

def calculate_iou(box1, box2):
    """
    Calculates Intersection over Union (IoU) between two bounding boxes.
    Box format: [x_min, y_min, width, height] (normalized coordinates)
    """
    x1_min, y1_min, w1, h1 = box1
    x1_max, y1_max = x1_min + w1, y1_min + h1

    x2_min, y2_min, w2, h2 = box2
    x2_max, y2_max = x2_min + w2, y2_min + h2

    inter_x_min = max(x1_min, x2_min)
    inter_y_min = max(y1_min, y2_min)
    inter_x_max = min(x1_max, x2_max)
    inter_y_max = min(y1_max, y2_max)

    inter_w = max(0, inter_x_max - inter_x_min)
    inter_h = max(0, inter_y_max - inter_y_min)
    intersection_area = inter_w * inter_h

    area1 = w1 * h1
    area2 = w2 * h2
    union_area = area1 + area2 - intersection_area

    if union_area == 0:
        return 0.0
    return intersection_area / union_area

Main Processing Function

Complete Landmark Processing with EMA

import numpy as np

def lm_postprocess(current_region, inference, previous_frame_regions,
                   alpha=0.6, iou_threshold=0.4, crop_width=224, crop_height=224):
    """Process landmark model output, applying EMA smoothing."""

    # Hand presence score - Identity_1
    current_region.lm_score = float(np.squeeze(inference['Identity_1']))

    # Handedness (left/right) - Identity_2
    current_region.handedness = float(np.squeeze(inference['Identity_2']))

    # Use 'Identity' (output 0) for pixel coordinates from the 224x224 crop
    pixel_landmarks_raw = np.squeeze(inference['Identity'])  # Shape (63,)

    # Reshape to (21, 3) for [x, y, z]
    pixel_landmarks_reshaped = pixel_landmarks_raw.reshape(21, 3)

    # Apply correct normalization
    current_raw_normalized_landmarks = normalize_landmarks_correctly(
        pixel_landmarks_reshaped,
        image_width=crop_width,
        image_height=crop_height
    )

    # --- EMA Smoothing ---
    best_match_prev_region = None
    max_iou = 0.0

    if previous_frame_regions and hasattr(current_region, 'pd_box'):
        for prev_region in previous_frame_regions:
            if hasattr(prev_region, 'pd_box'):
                # decode_bboxes stores pd_box as [x_center - w*0.5, y_center - h*0.5, w, h],
                # i.e. [x_min, y_min, w, h], so no format conversion is needed here.
                iou = calculate_iou(current_region.pd_box, prev_region.pd_box)
                if iou > max_iou:
                    max_iou = iou
                    best_match_prev_region = prev_region

    final_landmarks_to_set = list(current_raw_normalized_landmarks)  # Default to raw

    if best_match_prev_region and max_iou >= iou_threshold and \
            hasattr(best_match_prev_region, 'landmarks') and \
            best_match_prev_region.landmarks is not None and \
            len(best_match_prev_region.landmarks) == len(current_raw_normalized_landmarks):

        prev_smoothed_landmarks = best_match_prev_region.landmarks

        temp_smoothed_ema = []
        for i in range(len(current_raw_normalized_landmarks)):
            smooth_pt = [
                alpha * current_raw_normalized_landmarks[i][k] + (1 - alpha) * prev_smoothed_landmarks[i][k]
                for k in range(3)  # x, y, z
            ]
            temp_smoothed_ema.append(smooth_pt)
        final_landmarks_to_set = temp_smoothed_ema

    current_region.landmarks = final_landmarks_to_set
    return current_region

Palm Detection Decision Logic

Tracking Quality Assessment

def should_run_palm_detection(tracked_regions, lm_score_threshold=0.7):
    """Determine if palm detection is needed."""
    if not tracked_regions:
        return True  # No existing tracking, need to detect.

    for region in tracked_regions:
        if not hasattr(region, 'lm_score') or region.lm_score < lm_score_threshold:
            return True  # A region has low landmark score or no score, re-detect.

    return False  # All tracked regions have good scores.
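A brief standalone demonstration of the decision logic (the function is repeated here, with types.SimpleNamespace standing in for the tracked region objects):

```python
from types import SimpleNamespace

def should_run_palm_detection(tracked_regions, lm_score_threshold=0.7):
    # Repeated from the snippet above so this demo runs standalone.
    if not tracked_regions:
        return True  # No existing tracking, need to detect.
    for region in tracked_regions:
        if not hasattr(region, 'lm_score') or region.lm_score < lm_score_threshold:
            return True  # Low or missing landmark score, re-detect.
    return False  # All tracked regions have good scores.

print(should_run_palm_detection([]))                               # True: nothing tracked yet
print(should_run_palm_detection([SimpleNamespace(lm_score=0.9)]))  # False: tracking is healthy
print(should_run_palm_detection([SimpleNamespace(lm_score=0.5)]))  # True: score below 0.7
```

Skipping palm detection while tracking is healthy is the main latency win of the pipeline: the landmark model alone runs per frame, and the detector only fires when confidence drops.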



Performance Optimizations

Batch Landmark Processing

def batch_landmark_processing(warped_images, regions, landmark_model):
    """
    Process multiple hand images in batch for efficiency.

    Args:
        warped_images: List of 224x224 hand images
        regions: List of region objects matching warped_images, updated in place
        landmark_model: Loaded landmark detection model

    Returns:
        List of processed landmark results
    """
    if not warped_images:
        return []

    # Prepare batch input
    batch_input = np.stack([
        preprocess_image_for_inference(img) for img in warped_images
    ])

    # Run batch inference
    batch_outputs = landmark_model.infer(batch_input)

    # Process each result
    results = []
    for i, region in enumerate(regions):
        # Slice this sample's outputs into the per-sample dict lm_postprocess expects
        inference = {
            'Identity': batch_outputs['Identity'][i],      # 21x3 pixel landmarks
            'Identity_1': batch_outputs['Identity_1'][i],  # hand presence score
            'Identity_2': batch_outputs['Identity_2'][i],  # handedness score
            'Identity_3': batch_outputs['Identity_3'][i],  # 3D world coordinates
        }

        # Post-process (no previous-frame regions available in batch mode)
        processed = lm_postprocess(region, inference, None)
        results.append(processed)

    return results

Configuration Parameters

Detection Settings

  • landmark_confidence_threshold: Minimum confidence for landmarks (default: 0.5)
  • hand_presence_threshold: Hand presence detection threshold (default: 0.7)
  • anatomical_validation: Enable pose validation (default: true)

Smoothing Parameters

  • ema_smoothing_factor: Base smoothing strength (default: 0.7)
  • iou_matching_threshold: Threshold for frame matching (default: 0.3)
  • motion_adaptation: Enable motion-adaptive smoothing (default: true)
  • max_smoothing_frames: Maximum frames for smoothing history (default: 5)
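The motion_adaptation option listed above is not shown in the code in this section; a common approach (a sketch under that assumption, not the pipeline's actual implementation) is to scale the EMA alpha with inter-frame displacement, so fast-moving hands are smoothed less and lag is reduced:

```python
def adaptive_alpha(base_alpha, displacement, motion_scale=10.0, max_alpha=0.95):
    """Raise alpha (trust the current frame more) as motion grows.

    displacement: mean per-landmark movement between frames, in normalized units.
    motion_scale and max_alpha are illustrative tuning constants, not pipeline defaults.
    """
    return min(max_alpha, base_alpha + motion_scale * displacement)

print(round(adaptive_alpha(0.7, 0.0), 2))   # 0.7  - static hand: full smoothing retained
print(round(adaptive_alpha(0.7, 0.01), 2))  # 0.8  - moderate motion: lighter smoothing
print(round(adaptive_alpha(0.7, 0.05), 2))  # 0.95 - fast motion: nearly raw landmarks
```

The resulting alpha would replace the fixed value passed to the EMA blend, trading a little extra jitter during fast motion for noticeably lower lag.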

Quality Control

  • enable_world_coordinates: Process 3D world coordinates (default: true)
  • strict_anatomical_validation: Enforce strict anatomical constraints (default: false)
  • interpolate_missing_landmarks: Fill low-confidence landmarks (default: true)

Output Format

ProcessedLandmarks Structure

class ProcessedLandmarks:
    def __init__(self, landmarks, normalized_landmarks, presence_score,
                 handedness_score, handedness_classification):
        self.landmarks = landmarks                        # Original space coordinates
        self.normalized_landmarks = normalized_landmarks  # [0,1] normalized
        self.presence_score = presence_score              # Hand presence confidence
        self.handedness_score = handedness_score          # Handedness confidence
        self.handedness = handedness_classification       # 'left' or 'right'
        self.world_landmarks = None                       # 3D world coordinates
        self.quality_score = None                         # Overall quality metric
        self.smoothing_applied = False                    # EMA smoothing flag
        self.processing_time = None                       # Processing latency
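A short construction example (the class is repeated here so the snippet runs standalone; the scores and the single landmark are illustrative stand-ins for real model output):

```python
class ProcessedLandmarks:
    # Repeated from the structure above.
    def __init__(self, landmarks, normalized_landmarks, presence_score,
                 handedness_score, handedness_classification):
        self.landmarks = landmarks
        self.normalized_landmarks = normalized_landmarks
        self.presence_score = presence_score
        self.handedness_score = handedness_score
        self.handedness = handedness_classification
        self.world_landmarks = None
        self.quality_score = None
        self.smoothing_applied = False
        self.processing_time = None

result = ProcessedLandmarks(
    landmarks=[[112.0, 56.0, 0.0]],            # one point shown instead of all 21
    normalized_landmarks=[[0.5, 0.25, 0.0]],
    presence_score=0.93,
    handedness_score=0.88,
    handedness_classification='right',
)
print(result.handedness, result.smoothing_applied)  # right False
```

The optional fields (world_landmarks, quality_score, processing_time) start as None and are filled in by later pipeline stages when the corresponding options are enabled.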

Integration Points

Input Interface

  • Warped Images: Receives 224x224 hand images from Function 1
  • Transform Matrices: Uses coordinate transformation data
  • Previous Landmarks: Accesses previous frame data for smoothing

Output Interface

  • Landmark Coordinates: Provides 21-point hand structure
  • Confidence Scores: Reports detection and handedness confidence
  • Quality Metrics: Supplies processing quality information
  • Next Stage: Feeds into Model 3 (Gesture Embedder)

Next Steps

After landmark detection and smoothing, the pipeline proceeds to:

  • Model 3 (Gesture Embedder): Generate compact feature representations from landmarks
  • Coordinate Transformation: Convert landmarks to appropriate coordinate systems
  • Temporal Tracking: Maintain landmark identity across video frames
  • Quality Assessment: Monitor landmark stability and accuracy for downstream processing