CVPR 2026

Real-Time Multimodal Fingertip Contact Detection
via Depth and Motion Fusion for
Vision-Based Human-Computer Interaction


Mukhiddin Toshpulatov1,2,4,5 · Wookey Lee2 · Suan Lee3 · Geehyuk Lee1

1SpaceTop, SoC, KAIST    2VoiceAI, BMSE, Inha University    3SoCS, Semyung University    4Dep. of CE, Gachon University, South Korea    5Jizzakh branch of the National University of Uzbekistan


VR Keyboard Demo — Real-time fingertip contact detection with depth estimation

Real-time virtual keyboard demo. Our system tracks all fingertips, estimates metric depth via fine-tuned Depth Anything V2, and detects contact events using a velocity-gated hysteresis state machine — all at 30 FPS.


Real-time typing with depth estimation

Fingertip contact detection

Multi-finger virtual keyboard

94.2% contact detection accuracy  |  3.8 mm depth MAE (68% reduction)  |  45.6 WPM typing speed  |  30 FPS real-time on RTX 3060 Ti

Abstract

We present a real-time system for detecting fingertip contact events on flat surfaces using monocular depth estimation and motion analysis. Our approach fine-tunes Depth Anything V2 on a custom close-range dataset captured with an Intel RealSense D405 depth camera, reducing mean absolute depth error by 68%, from 12.3 mm to 3.84 mm in the 25–45 cm operating range. Contact detection employs a velocity-gated hysteresis state machine that fuses depth-based surface distance (4.5 mm entry / 6.0 mm exit thresholds) with fingertip motion velocity cues, achieving 94.2% accuracy and 94.4% F1-score. The system operates at 30 FPS on consumer hardware, enabling practical vision-based virtual keyboard interaction at 45.6 words per minute with 3.1% character error rate. We release a multi-user, multi-angle D405 depth dataset comprising 53,300 RGB-depth pairs from 15 participants at four camera angles, with per-frame hand landmark annotations and contact/hover labels.

Method

System Pipeline

System Pipeline. An RGB camera captures the typing scene. Our fine-tuned Depth Anything V2 estimates metric depth, from which fingertip-to-surface distances are computed. A velocity-gated hysteresis state machine fuses depth distance and fingertip motion to detect contact events in real time.
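The fingertip-to-surface distance step can be sketched as follows. This is a minimal illustration, assuming the depth map has already been back-projected to metric 3D points and that the typing surface is modeled as a least-squares plane; the paper's exact surface-fitting procedure is not specified here.

```python
import numpy as np

def fit_surface_plane(points):
    """Least-squares plane z = a*x + b*y + c over (N, 3) metric surface points."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return coeffs  # (a, b, c)

def fingertip_surface_distance(tip_xyz, plane):
    """Perpendicular distance from a 3D fingertip point to the fitted plane."""
    a, b, c = plane
    num = abs(a * tip_xyz[0] + b * tip_xyz[1] + c - tip_xyz[2])
    return num / np.sqrt(a * a + b * b + 1.0)
```

For a flat surface at z = 30 cm, a fingertip at z = 29.55 cm yields a 4.5 mm distance, exactly the contact-entry threshold used below.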

Velocity-Gated Hysteresis State Machine

Contact entry threshold: 4.5 mm (fingertip-to-surface)
Contact exit threshold: 6.0 mm (hysteresis prevents flicker)
Cooldown period: 450 ms (13.5 frames at 30 FPS)
Depth model: DA2-ViTS fine-tuned, max_depth = 0.5 m
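A minimal sketch of how these parameters could drive a per-finger state machine. The entry/exit thresholds and cooldown come from the table above; the velocity gate value (`vel_gate_mm_s`) is an assumption, as the paper does not list it here.

```python
from dataclasses import dataclass

ENTRY_MM, EXIT_MM, COOLDOWN_MS = 4.5, 6.0, 450  # values from the table above

@dataclass
class FingerState:
    in_contact: bool = False
    last_event_ms: float = float("-inf")

def update(state, dist_mm, vel_mm_s, now_ms, vel_gate_mm_s=50.0):
    """Advance one fingertip's state by one frame; returns True on a new
    contact event. vel_gate_mm_s (downward speed required to register a
    contact) is a hypothetical value -- the actual gate is not published here."""
    if not state.in_contact:
        cooled = now_ms - state.last_event_ms >= COOLDOWN_MS
        # entry: near the surface, moving down fast enough, out of cooldown
        if dist_mm <= ENTRY_MM and vel_mm_s >= vel_gate_mm_s and cooled:
            state.in_contact = True
            state.last_event_ms = now_ms
            return True
    elif dist_mm >= EXIT_MM:
        # exit at a higher threshold than entry so depth noise cannot flicker
        state.in_contact = False
    return False
```

The 1.5 mm gap between entry and exit thresholds means a fingertip hovering right at 4.5 mm cannot oscillate in and out of contact frame to frame.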

Results

Depth Estimation

Model                  | MAE (mm) | δ1 (%) | RMSE (mm)
DA2-ViTS (pre-trained) | 12.3     | 87.2   | 18.4
Ours (fine-tuned)      | 3.8      | 95.96  | 4.8
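These are the standard monocular-depth metrics; a reference implementation (masking out invalid zero-depth pixels, with δ1 as the fraction of pixels whose prediction/ground-truth ratio stays under 1.25):

```python
import numpy as np

def depth_metrics(pred_mm, gt_mm):
    """MAE (mm), delta1 accuracy (%), and RMSE (mm) over valid gt pixels."""
    mask = gt_mm > 0                      # zero depth = invalid measurement
    p, g = pred_mm[mask], gt_mm[mask]
    mae = np.abs(p - g).mean()
    rmse = np.sqrt(((p - g) ** 2).mean())
    ratio = np.maximum(p / g, g / p)      # symmetric ratio
    delta1 = 100.0 * (ratio < 1.25).mean()
    return mae, delta1, rmse
```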

Contact Detection

Method        | Acc (%) | F1 (%) | FPR (%)
Depth only    | 87.3    | 86.1   | 8.7
Velocity only | 89.1    | 88.5   | 6.3
Ours (fusion) | 94.2    | 94.4   | 4.2
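For reference, the three columns follow the standard definitions over per-frame binary contact labels (contact = positive class); the per-frame framing is an assumption about the evaluation protocol:

```python
def detection_metrics(pred, gt):
    """Accuracy, F1, and false-positive rate (all %) for binary labels."""
    tp = sum(p and g for p, g in zip(pred, gt))
    tn = sum(not p and not g for p, g in zip(pred, gt))
    fp = sum(p and not g for p, g in zip(pred, gt))
    fn = sum(not p and g for p, g in zip(pred, gt))
    acc = 100.0 * (tp + tn) / len(gt)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 100.0 * 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    fpr = 100.0 * fp / (fp + tn) if fp + tn else 0.0
    return acc, f1, fpr
```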

Typing Performance

WPM: 45.6
CER: 3.1%
F1-score: 94.4%
Inference: 30 FPS (RTX 3060 Ti)
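WPM and CER follow the conventional text-entry definitions (one "word" = 5 characters; CER = Levenshtein distance over reference length), sketched below; the paper's exact scoring script may differ.

```python
def wpm(transcribed_chars, seconds):
    """Words per minute, with the standard 5-characters-per-word convention."""
    return (transcribed_chars / 5.0) / (seconds / 60.0)

def cer(typed, reference):
    """Character error rate (%): Levenshtein distance / reference length."""
    n = len(reference)
    d = list(range(n + 1))                     # DP row: distances vs. ref prefix
    for i in range(1, len(typed) + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,               # deletion
                       d[j - 1] + 1,           # insertion
                       prev + (typed[i - 1] != reference[j - 1]))  # substitution
            prev = cur
    return 100.0 * d[n] / n
```

For example, typing 228 correct characters in one minute gives 45.6 WPM, and "helo" against reference "hello" gives a 20% CER.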

Sample RGB-Depth Pairs

RGB and metric depth pair from D405

Left: RGB input from Intel RealSense D405. Right: Ground-truth metric depth (TURBO colormap, uint16 millimeters).
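The uint16 millimeter depth can be prepared for TURBO rendering by normalizing over the paper's 25–45 cm operating range; a sketch (the actual colormap lookup would then be, e.g., `cv2.applyColorMap(..., cv2.COLORMAP_TURBO)` or matplotlib's `turbo`, and the range choice here is an assumption based on the stated operating range):

```python
import numpy as np

def normalize_depth_for_turbo(depth_mm, near_mm=250, far_mm=450):
    """Clip uint16 millimeter depth to the 25-45 cm operating range and
    rescale to uint8, ready for a TURBO colormap lookup."""
    clipped = np.clip(depth_mm.astype(np.float32), near_mm, far_mm)
    return ((clipped - near_mm) / (far_mm - near_mm) * 255.0).astype(np.uint8)
```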

Dataset

Total frames: 53,300 RGB-depth pairs
Participants: 15 users
Camera angles: 30°, 45°, 60°, 90°
Splits: 42,640 / 5,330 / 5,330 (train / val / test, 8:1:1)
Depth sensor: Intel RealSense D405 (640×480, 30 FPS)
Depth accuracy: <0.5 mm at 35 cm
Annotations: 21 hand landmarks + 5 fingertip depths per frame
Labels: per-fingertip contact/hover state
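The split sizes above are consistent with an exact 8:1:1 partition of 53,300 frames; one way to derive them deterministically (the rounding scheme is an illustration, not the paper's exact procedure):

```python
def split_counts(total, ratios=(8, 1, 1)):
    """Integer train/val/test counts in the given ratio, summing to `total`."""
    s = sum(ratios)
    counts = [total * r // s for r in ratios]
    counts[0] += total - sum(counts)  # assign any rounding remainder to train
    return counts
```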

Multi-Angle Recording Setup

Dataset samples at 30, 45, 60, and 90 degree camera angles

Data captured at four camera angles: 30° (top-left), 45° (top-right), 60° (bottom-left), 90° overhead (bottom-right).

Dataset Samples (RGB + Depth)

Dataset samples showing RGB and depth pairs

Three example frames from the dataset. Left: RGB input. Right: Metric depth from D405 (TURBO colormap). Hands at varying poses and distances from the keyboard surface.

Data Recording Setup

Recording setup
Camera angle view
D405 camera closeup
Typing scene

Recording setup with Intel RealSense D405 at various camera angles.

Recording demo

Data capture session

Camera setup view

Quick Access (QR Codes)

Scan the QR codes to access project resources: Project Page · GitHub Code · HF Models · HF Dataset

BibTeX

@inproceedings{toshpulatov2026realtime,
  title={Real-Time Multimodal Fingertip Contact Detection via Depth and
         Motion Fusion for Vision-Based Human-Computer Interaction},
  author={Toshpulatov, Mukhiddin and Lee, Wookey and Lee, Suan and Lee, Geehyuk},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision
             and Pattern Recognition (CVPR)},
  year={2026}
}