
Trimming-then-augmentation: A Mask-and-Recover Framework Towards Robust Depth and Odometry Estimation for Endoscopic Images

Junyang Wu, Yun Gu, Guang-Zhong Yang

Medical Image Analysis. 2026 Jan, 103736

Abstract


Depth and odometry estimation for endoscopic imaging is an essential task for robot-assisted endoluminal intervention. Due to the difficulty of obtaining sufficient in vivo ground truth data, unsupervised learning is preferred in practical settings. Existing methods, however, are hampered by imaging artifacts and the paucity of unique anatomical markers, coupled with tissue motion and specular reflections, leading to poor accuracy and generalizability. In this work, a trimming-then-augmentation framework is proposed. It uses a “mask-then-recover” training strategy to first mask out the artifact regions and then reconstruct the depth and pose information based on the global perception of a convolutional network. Subsequently, an augmentation module is used to provide stable correspondence between endoscopic image pairs. A task-specific loss function guides the augmentation module to adaptively establish stable feature pairs, improving the overall accuracy of subsequent 3D structural reconstruction. Detailed validation has been performed, with results showing that the proposed method can significantly improve the accuracy of existing state-of-the-art unsupervised methods, demonstrating the effectiveness of the method and its resilience to image artifacts, in addition to its stability when applied to in vivo settings.


Introduction and Methods


Although the performance of SfM-based methods has been steadily improved in recent years, further challenges remain for in vivo applications. Given an image triplet, the core task of SfM is to align the consecutive frames by finding matching patches or feature points. In contrast to natural images with rich and distinctive features, endoscopic scenes are often affected by tissue deformation, artifacts such as specular highlights and bleeding, and the paucity of distinctive image features.
Image artifacts and tissue motion: Imaging artifacts and motion blur are common problems in endoscopy videos. As shown in Fig. 1(a), the depth of a region with artifacts and motion blur cannot be correctly predicted. To circumvent this problem, Shen et al. proposed a conditional adversarial network to model these imaging artifacts. The method, however, requires detailed manual labeling of artifacts in image frames. Shao et al. proposed appearance-flow networks that specifically eliminate illumination changes by introducing additional modules. However, this method only addresses illumination variations and does not reduce the impact of other artifacts.
Paucity of distinctive features: For feature correspondence in endoscopic videos, it is difficult to extract sufficient features from video frames due to the paucity of distinctive anatomical landmarks. As shown in Fig. 1(a), we cannot extract sufficient feature points from endoscopic images using operators such as SIFT. Although feature points are not explicitly extracted when using deep learning, the same difficulty remains, since methods such as convolutional neural networks still inherently rely on low-level image features.
The aforementioned issues can all affect the performance of depth and odometry estimation. To address these issues, we propose in this paper a “trimming-then-augmentation” framework for unsupervised depth and odometry estimation. This method consists of two main modules, namely the trimming and augmentation modules.
Trimming module: To alleviate the problem introduced by image artifacts, we introduce a “mask-then-recover” strategy to first mask out artifact regions and then estimate the depth of the invisible regions through contextual features, thereby improving the network’s perception of global information. Specifically, by considering specular artifacts and model uncertainty, an optimal masking strategy is proposed to better handle these artifacts. As shown in Fig. 1(c), despite the presence of significant artifacts, our method accurately estimates the depth of the lumens.
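The uncertainty side of this masking strategy (also described in Fig. 3) can be sketched as follows: apply several random masks to the input, predict a depth map for each masked variant, and take the per-pixel standard deviation as an uncertainty score. This is a minimal illustrative sketch, not the authors' implementation; `predict_depth`, the square-patch mask shape, and all parameters here are hypothetical stand-ins.

```python
import numpy as np

def random_square_masks(shape, n_masks=8, patch=16, frac=0.3, rng=None):
    """Binary masks hiding random square patches (1 = keep, 0 = mask out)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h, w = shape
    n_patches = max(1, int(frac * h * w / patch ** 2))
    masks = np.ones((n_masks, h, w), dtype=np.float32)
    for m in masks:
        for _ in range(n_patches):
            y = rng.integers(0, h - patch + 1)
            x = rng.integers(0, w - patch + 1)
            m[y:y + patch, x:x + patch] = 0.0
    return masks

def pixel_uncertainty(image, predict_depth, n_masks=8):
    """Per-pixel uncertainty: std of depth predictions under random masks."""
    masks = random_square_masks(image.shape[:2], n_masks=n_masks)
    depths = np.stack([predict_depth(image * m[..., None]) for m in masks])
    return depths.std(axis=0)  # high std = unstable, likely artifact region
```

Pixels whose predicted depth varies strongly across maskings are treated as unreliable and are candidates for trimming.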
Augmentation module: In this work, we propose a novel loss-driven framework to enrich endoscopic image details and ensure stable correspondence. Specifically, a lightweight neural network is designed, which incorporates distinctive features to establish stable correspondences. As shown in Fig. 1(c), more keypoints can be extracted from our “augmented” image.
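While the actual augmentation network is learned, the underlying idea (visualized later in Fig. 8: a blurry structural reconstruction plus a stable detail map yields the augmented image) can be illustrated with a simple unsharp-masking analogue. This is only an analogy under assumed parameters; the box blur, `alpha`, and kernel size are illustrative choices, not the paper's design.

```python
import numpy as np

def box_blur(img, k=5):
    """Separable box blur standing in for the blurry structural reconstruction."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def augment(img, alpha=1.5, k=5):
    """Augmented image = smooth structure + amplified detail map."""
    structure = box_blur(img, k)   # blurry structural information
    detail = img - structure       # high-frequency detail map
    return np.clip(structure + alpha * detail, 0.0, 1.0)
```

Amplifying the detail map makes local texture more distinctive, which is the property that lets more keypoints be extracted from the augmented image.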
The proposed method is generic and can be incorporated into existing 3D depth and pose estimation frameworks. To demonstrate this, we adapted our “trimming-then-augmentation” method to six existing methods to illustrate the improvement that can be achieved after incorporating our module. Detailed experiments on multiple datasets also demonstrate the effectiveness of the method in practical settings.

 

Key Results and Conclusions


In this work, we have proposed a generic “Trimming-then-Augmentation” framework that can provide robust features for establishing stable correspondences. To address the issue of artifacts, the “trimming” module efficiently masks artifact regions based on pixel characteristics and uncertainty, mitigating their detrimental impact during the learning process of neural networks. To address the feature-paucity issue, the “augmentation” module adaptively enriches input images with distinctive features to establish stable correspondences. Unlike previous work focused on specifically designed neural network architectures, the “Trimming-then-Augmentation” framework is designed to be modular, making it adaptable to existing architectures without the need for extensive modifications. This versatility ensures that our method can be easily adopted across different endoscopic navigation systems, enhancing its potential for broad applicability. The framework’s adaptability has been extensively validated across multiple state-of-the-art models, consistently yielding substantial performance improvements. In addition to the promising results in depth and odometry estimation, the proposed strategy also demonstrates potential for applications in other domains. For example, as demonstrated in Section 4.10, the “augmented” images allow rich and distinctive features to be extracted, which can be applied to other fields such as 3D scene reconstruction and SLAM.
Despite the promising results achieved, it is worth noting that the performance on datasets with strong artifacts can be further improved. This is evident from the experiments on the Lowcam dataset, wherein the presence of strong artifacts significantly compromises depth and odometry accuracy. These limitations can be attributed to the low quality of the training data in the Lowcam dataset: during training, the strong artifacts limit the reliable information necessary for establishing accurate correspondences, causing the neural network to fail. Therefore, further work is required on the construction of large-scale, high-quality datasets tailored for endoscopic odometry. With the development of large foundation models, the demand for large-scale datasets has grown rapidly. Despite the availability of numerous endoscopic datasets, the heterogeneity across them is significant, with notable divergence in intrinsic camera parameters, which makes combining multiple datasets for training a significant challenge. As a result, creating a unified and comprehensive endoscopic dataset emerges as a crucial focus for future endeavors.
It should also be noted that the present module is constrained by the intrinsic scale-ambiguity problem inherent to monocular depth and odometry frameworks. Current experimental results need alignment with the ground truth, which may not be practical in actual clinical settings, indicating the need for improved scale accuracy. Prospective research needs to concentrate on the incorporation of multiple modalities. Utilizing pre-operative CT scans or intra-operative robotic control signals could serve as supplementary scale supervision measures, thereby enhancing the accuracy and applicability of depth and odometry estimation methods.
In summary, we have presented in this work an effective way of achieving robust depth and odometry estimation for endoscopic navigation in the presence of artifacts and a paucity of distinctive features, using a “Trimming-then-Augmentation” strategy. The “trimming” module is designed to mask artifact regions in endoscopic images to alleviate their negative effect, while the “augmentation” module provides stable features for robust correspondences. Experimental results demonstrate the effectiveness of the proposed method both as a standalone approach and as an add-on module for improving the performance of self-supervised depth and pose estimation.

 


 

Fig. 1. Overview of “Trimming-then-Augmentation” for robust depth and odometry estimation. (a) Image artifacts and the paucity of distinctive features hamper the performance of traditional methods. (b) Our proposed framework comprises two modules, the trimming module and the augmentation module, allowing more accurate and resilient depth and pose estimation, as illustrated by the example in (c).

 


 

Fig. 2. The overall workflow of our “Trimming-then-Augmentation” module. In the “Trimming” module, artifacts and uncertain regions are detected and masked, and then the Recon network reconstructs the structural information. In the “Augmentation” module, the AugNet extracts stable features, generating “augmented” images. In the SfM training module, depth and pose are estimated, supervised by the photometric loss function.
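The photometric supervision mentioned in the caption above is commonly formulated in self-supervised depth estimation as a weighted combination of SSIM and L1 between the target frame and the view synthesized from the estimated depth and pose. The sketch below shows that standard formulation under simplifying assumptions (a single global SSIM window rather than the usual local windows); it is not the paper's exact loss.

```python
import numpy as np

def photometric_loss(target, warped, alpha=0.85, eps=1e-6):
    """Weighted SSIM + L1 photometric loss between a target frame and a
    frame synthesized (warped) from the estimated depth and pose."""
    l1 = np.abs(target - warped).mean()
    # Global single-window SSIM, a simplified stand-in for windowed SSIM.
    mu_x, mu_y = target.mean(), warped.mean()
    var_x, var_y = target.var(), warped.var()
    cov = ((target - mu_x) * (warped - mu_y)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2) + eps)
    return alpha * (1 - ssim) / 2 + (1 - alpha) * l1
```

A perfectly warped frame drives the loss toward zero; artifact regions violate the photometric-constancy assumption behind this loss, which is precisely why trimming them first helps training.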

 

 

Fig. 3. The process of estimating uncertainty in the proposed processing workflow. Different random masks are applied to the input image and the standard deviation of different depth maps represents the uncertainty of each pixel.

 

 

Fig. 4. Qualitative depth comparison on the SCARED dataset. The T&A module has been incorporated into six different state-of-the-art self-supervised depth and pose estimation methods, showing marked improvements in estimation results.

 

 

Fig. 5. Qualitative depth comparison on the Lowcam and Lung datasets. After incorporating the T&A module, the depth maps are more accurate and the structures are clearer. By incorporating T&A into EndoDAC, we can estimate accurate depth on challenging datasets.

 

 

Fig. 6. Qualitative pose comparison on three datasets. For each dataset, we visualize two trajectories for comparison.

 

 

Fig. 7. Visualization of the “Trimming” module. The five columns present the original image, the masked image, the uncertainty map, and the result of the depth estimation with/without the mask, respectively.

 

 

Fig. 8. Visualization of the different components in the “augmentation” module. The original images have complex artifacts. The reconstructed images contain blurry structural information but lack details. The detail map extracts stable detail features, and the augmented images have stable features.

 

 

Fig. 9. Visualization of spatial correspondence on the SCARED and Lowcam datasets. ORB keypoints are extracted, and matching is performed using the KNN matching algorithm.

 

https://www.sciencedirect.com/science/article/pii/S136184152500283X

Copyright © 2025 Institute of Medical Robotics, Shanghai Jiao Tong University. All rights reserved. 沪交ICP备20190057