
BIASNet: A Bidirectional Feature Alignment and Semantics-guided Network for Weakly-Supervised Medical Image Registration

Housheng Xie, Xiaoru Gao, Guoyan Zheng

Medical Image Analysis, 2025, 103913.

Abstract
Medical image registration, which establishes spatial correspondences between different medical images, serves as a fundamental process in numerous clinical applications and diagnostic workflows. Despite significant advancement in unsupervised deep learning-based registration methods, these approaches consistently yield suboptimal results compared to their weakly-supervised counterparts. Recent advancements in universal segmentation models have made it easier to obtain anatomical labels from medical images. However, existing registration methods have not fully leveraged the rich anatomical and structural prior information provided by segmentation labels. To address this limitation, we propose a BIdirectional feature Alignment and Semantics-guided Network, referred to as BIASNet, for weakly-supervised image registration. Specifically, starting from multi-scale features extracted from the pre-trained VoCo, fine-tuned using Low-Rank Adaptation (LoRA), we propose a dual-attribute learning scheme, incorporating a novel BIdirectional Alignment and Fusion (BIAF) module for extracting both semantics-wise and intensity-wise features. These two types of features are subsequently fed into a semantics-guided progressive registration framework for accurate deformation field estimation. We further propose an anatomical region deformation consistency learning strategy to regularize the deformation of target anatomical regions. Comprehensive experiments conducted on three typical yet challenging datasets demonstrate that our method achieves consistently better results than other state-of-the-art deformable registration approaches. The source code is publicly available at https://github.com/xiehousheng/BIASNet.
Introduction and Methods
First, from the feature representation learning point of view, existing weakly-supervised registration methods primarily rely on intensity-wise features, learned from the registration task supervision alone, to achieve deformation prediction. However, as illustrated by our empirical study results shown in Fig. 1-(a), these intensity-wise features capture image edges and textures with significant gradient variations, exhibiting limited anatomically-meaningful semantic information, and are sensitive to low intensity contrast, noise, and modality-specific acquisition artifacts. In contrast, semantics-wise features obtained through additional segmentation task supervision provide representations with explicit semantic meaning, encoding discriminative information such as anatomical boundaries and organ structures, and are robust to noise, artifacts, and modality-specific intensity variations. When used separately for deformation prediction, semantics-wise features consistently outperform intensity-wise features (Fig. 1-(a)). Although recent studies have explored the fusion of semantics-wise and intensity-wise features to facilitate registration, these methods primarily focus on utilizing semantics-wise features as auxiliary information to enhance intensity-wise features, overlooking the advantages of semantics-wise features over intensity-wise features in registration tasks.
Second, from the perspective of feature fusion, existing registration architectures primarily rely on interpolation-based upsampling and skip connections for multi-scale feature fusion. However, fixed-position interpolation sampling potentially fails to maintain precise spatial correspondence between deep semantic features and shallow detailed features, causing progressive misalignment across decoder levels and resulting in semantic inconsistencies and blurred details (see the empirical investigation detailed in section 4.6.4). This issue has a particularly negative impact on progressive registration methods that require deformation field prediction at each decoder level, where accurate cross-scale feature fusion is essential. For segmentation feature-assisted registration methods, feature fusion also occurs when combining semantics-wise and intensity-wise features. BFM-Net relies on spatial attention to enhance semantics-wise and intensity-wise features before concatenation-based fusion, but it does not explicitly model the correlations between the two feature types. AC-DMiR directly applies cross-attention mechanisms for feature fusion, which may suffer from insufficient feature aggregation.
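The distinction between fixed-position interpolation and dynamic alignment can be sketched in 1D. This is a toy illustration with hypothetical helper names, not the paper's implementation: standard upsampling always samples each output at a fixed source coordinate, whereas an alignment module would shift the sampling position by a (learned, content-dependent) offset before interpolating.

```python
def linear_upsample_1d(f, scale):
    """Fixed-position linear interpolation upsampling of a 1D feature row.
    Each output position maps back to a fixed source coordinate (i / scale),
    regardless of the feature content."""
    n = len(f)
    out = []
    for i in range(n * scale):
        x = i / scale                      # fixed sampling position
        x0 = min(int(x), n - 1)
        x1 = min(x0 + 1, n - 1)
        w = x - x0
        out.append((1 - w) * f[x0] + w * f[x1])
    return out


def aligned_sample_1d(f, scale, offsets):
    """Same interpolation, but each output position is shifted by an offset
    before sampling -- the kind of dynamic cross-scale correspondence a
    BIAF-style alignment module is meant to provide. In a real network the
    offsets would be predicted from the features themselves."""
    n = len(f)
    out = []
    for i in range(n * scale):
        x = max(0.0, min(i / scale + offsets[i], n - 1))  # offset-corrected
        x0 = int(x)
        x1 = min(x0 + 1, n - 1)
        w = x - x0
        out.append((1 - w) * f[x0] + w * f[x1])
    return out
```

With all-zero offsets the two functions coincide; nonzero offsets let the sampler follow local misalignment instead of being locked to the grid.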
Third, regarding how segmentation labels are exploited for weak supervision, existing weakly-supervised registration frameworks primarily use them for Dice loss calculation in the label space (Fig. 1-(b) and Fig. 1-(c)). The supervision for registration learning in the image space relies solely on predefined similarity metric functions, which inherently lack explicit anatomical information. Therefore, how to efficiently leverage label information to provide additional supervisory signals in the image space to guide the registration learning process remains an open problem.
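For concreteness, the label-space Dice supervision mentioned above can be sketched as a minimal soft-Dice loss between a warped moving label and the fixed label. Flat probability lists stand in for voxel grids here; the function name and layout are illustrative, not the paper's implementation.

```python
def dice_loss(warped_label, fixed_label, eps=1e-6):
    """Soft Dice loss between a warped moving label and the fixed label.
    Labels are flat lists of (soft) foreground probabilities standing in
    for voxel grids; the loss is 1 - DSC, so perfect overlap gives 0 and
    no overlap gives (nearly) 1."""
    inter = sum(w * f for w, f in zip(warped_label, fixed_label))
    total = sum(warped_label) + sum(fixed_label)
    return 1.0 - (2.0 * inter + eps) / (total + eps)
```

Because the loss operates purely on label overlap, it carries anatomical information that intensity-based similarity metrics in the image space lack, which is exactly the gap this paragraph points out.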
To address the above issues, in this paper, we propose a BIdirectional feature Alignment and Semantics-guided network, referred to as BIASNet, for weakly-supervised image registration. Our framework comprises three key components: a) in the feature representation learning phase, we employ VoCo fine-tuned with Low-Rank Adaptation (LoRA) as our feature encoder. Subsequently, we design a dual-attribute feature decoding architecture to extract both semantics-wise and intensity-wise features. During the decoding process, we propose BIdirectional Alignment and Fusion (BIAF) modules to achieve dynamic cross-scale spatial alignment and mitigate misalignment generated during cross-scale feature fusion; b) additionally, we propose a semantics-guided progressive registration framework. Considering that semantics-wise features generate better registration performance than intensity-wise features, we explicitly utilize intensity-wise features to enhance semantics-wise features (see Fig. 1-(d)) through our carefully designed Mixed Attention Calibration (MAC) modules operating across multiple resolutions. The enhanced representation is then leveraged within a progressive registration paradigm to predict deformation fields at multiple resolutions and to output the deformation field at the highest resolution level; and c) finally, to further exploit the prior information within segmentation labels, we introduce an Anatomical Region Deformation Consistency (ARDC) learning strategy to provide additional label-based supervisory signals. After training, BIASNet requires no labels as input during inference and outputs a deformation field given a pair of input images. The main contributions of this work are summarized as follows:
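The LoRA fine-tuning used for the VoCo encoder keeps the pre-trained weight frozen and trains only a low-rank update. A minimal sketch of the effective weight follows, using plain nested lists in place of tensors and one common scaling convention (alpha / r); this is an illustration of the LoRA idea, not the paper's training code.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def lora_effective_weight(W, A, B, alpha):
    """LoRA: frozen pre-trained weight W (d_out x d_in) plus a scaled
    low-rank update B @ A, with B (d_out x r) and A (r x d_in), r << d.
    Only A and B receive gradients, which is why a large pre-trained
    encoder such as VoCo can be adapted cheaply."""
    r = len(A)                      # rank of the update
    delta = matmul(B, A)            # low-rank weight update
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wrow, drow)]
            for wrow, drow in zip(W, delta)]
```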
• We propose BIASNet, a weakly-supervised image registration framework that establishes a semantics-centric registration framework through dual-attribute feature representation learning and a semantics-guided progressive registration process.
• We design two key modules, i.e., the BIAF module and the MAC module. In particular, the BIAF module dynamically establishes cross-scale positional correspondences between feature maps through bidirectional cross-scale feature alignment and fusion while the MAC module captures correlation information between intensity-wise and semantics-wise features, and incorporates a learnable gating mechanism to control information flow, enabling explicit enhancement of semantics-wise features guided by intensity-wise features for registration.
• We propose an anatomical region deformation consistency learning strategy that leverages prior information from segmentation labels to provide additional registration supervisory signals by establishing image-level consistency, facilitating accurate registration across diverse anatomical structures.
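The gating idea behind the MAC module described above can be sketched in toy form. A single scalar gate on 1D feature lists stands in for the module's attention-based formulation over feature maps; the names and the additive combination are illustrative assumptions.

```python
from math import exp


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def gated_enhancement(sem_feat, int_feat, gate_logit):
    """Gated feature enhancement in the spirit of the MAC module: the
    semantics-wise feature remains the main representation, and a
    learnable gate (here a single scalar logit) controls how much
    intensity-wise information flows in. A gate near 0 leaves the
    semantics-wise feature untouched; a gate near 1 adds the full
    intensity-wise contribution."""
    g = sigmoid(gate_logit)
    return [s + g * i for s, i in zip(sem_feat, int_feat)]
```

This ordering reflects the paper's design choice: intensity-wise features enhance semantics-wise features, not the other way around, since the latter register better on their own.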
Key Results and Conclusions
It is important to acknowledge the limitations of the present study. First, BIASNet still relies on segmentation labels during training to learn semantics-wise features and to derive supervisory signals, though, at the testing stage, BIASNet does not require any segmentation label. Nevertheless, our ablation study results presented in Table 14 showed that even when automatically generated segmentation labels were used for training, BIASNet still achieved a mean DSC of 82.10% on the abdominal dataset, better than the performance obtained by the second-best method RDP as presented in Table 3, where ground truth segmentation labels were used for training. Such results demonstrate that BIASNet can be used together with a foundation model (e.g., TotalSegmentator) to learn semantics-awareness for accurate medical image registration, which is a clear advantage for future clinical adoption. Second, the results presented in Table 14 show that employing automatically generated labels does lead to modest registration performance degradation, as the automatically generated labels are not entirely accurate and adversely impact registration performance. One possible solution is to develop noise-robust training strategies for weakly supervised registration learning, enabling models to handle performance degradation caused by using imperfect segmentation labels as supervision. The last limitation lies in the selection of hyperparameters, which are empirically determined. Better results might be obtained if a grid search over the hyperparameters were performed. Nonetheless, even with the empirically determined hyperparameters, BIASNet still achieves consistently better results than other SOTA approaches on all three datasets.
In this paper, we proposed a new bidirectional feature alignment and semantics-guided network, referred to as BIASNet, for weakly-supervised image registration. It adopted VoCo fine-tuned with low-rank adaptation for effective feature extraction. Subsequently, we designed a dual-attribute feature decoding architecture to extract both semantics-wise and intensity-wise features. The extracted features were then fed to a semantics-guided progressive registration framework. To alleviate feature misalignment issues arising from repeated upsampling operations and skip connections during progressive registration, we proposed a bidirectional alignment and fusion module to achieve precise bidirectional cross-scale feature alignment and fusion. Considering that semantics-wise features generated better registration performance than intensity-wise features, we designed a mixed attention calibration module to explicitly utilize intensity-wise features to enhance semantics-wise features. The enhanced representation was then leveraged within a progressive registration paradigm to predict deformation fields at multiple resolution levels and to output the deformation field at the highest resolution level. To further exploit the prior information within segmentation labels, we introduced an anatomical region deformation consistency learning strategy to provide additional label-based supervisory signals. Experiments on three typical yet challenging datasets demonstrated the superior performance of BIASNet over other SOTA methods.

Fig. 1. Conceptual design of the proposed method. (a) Results of an empirical study (details will be presented in section 4.6.3) comparing the performance (measured by average Dice Similarity Coefficient (DSC)) of intensity-wise features with that of semantics-wise features for deformation prediction on three datasets; (b) Conventional weakly-supervised registration methods using intensity-wise features with Dice loss for registration learning; (c) Segmentation feature-assisted registration framework treats semantics-wise features as auxiliary information to enhance intensity-wise features for deformation prediction, supervised by Dice loss; and (d) Overall architecture of our method.

Fig. 2. A schematic illustration of the network architecture of the proposed framework, where we first leverage a weight-shared pre-trained VoCo, fine-tuned using Low-Rank Adaptation (LoRA) during training, to extract encoded features from the input moving and fixed images (Im and If), respectively. The feature decoding incorporates dual decoders: a semantics-wise decoder (Dsem) and an intensity-wise decoder (Dint), which respectively process the encoded features to generate semantics-wise and intensity-wise feature representations. These multi-level features are subsequently integrated into the semantics-guided progressive registration process to estimate the deformation field Φ. The final registered image Iw is obtained by warping the moving image Im using the predicted deformation field Φ. To improve registration accuracy within target anatomical regions, we propose a novel anatomical region deformation consistency learning strategy. In this strategy, masked versions of the input images are first processed through an identical network architecture. Then, consistency constraints are imposed to enhance registration accuracy.

Fig. 3. A schematic illustration of our dual-attribute feature representation learning framework. The LoRA-finetuned VoCo extracts multi-scale feature representations from the input image pairs. These features are then processed through a dual-decoder architecture, comprising a semantics-wise decoder Dsem and an intensity-wise decoder Dint. Both decoders incorporate our proposed BIAF module, which enables precise cross-scale feature fusion within a multi-resolution framework. The Dsem generates semantics-wise features, while the Dint produces intensity-wise features, forming a comprehensive feature representation for accurate registration.

Fig. 4. A schematic illustration of the proposed semantics-guided progressive registration, which operates hierarchically, starting from the lowest resolution level. At each resolution level i, semantics-wise features and intensity-wise features are processed through a MAC module to generate refined features. These enhanced features are then utilized to estimate the residual deformation field ϕi, which is subsequently fused with the accumulated deformation field from the previous resolution level to estimate Φi at the current resolution level. The composite deformation field guides the warping of moving image features at the subsequent resolution level. This progressive refinement continues until reaching the highest resolution level, where the final deformation field Φ is computed, enabling precise registration.
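The fusion of a residual field with the accumulated field described in this caption can be sketched in 1D under one common composition convention for displacement fields: the residual displacement is applied first, and the upsampled coarse field is sampled at the displaced position. The paper's exact fusion operator may differ; this is an illustrative sketch only.

```python
def compose_fields_1d(phi_coarse_up, phi_residual):
    """Compose an upsampled coarse displacement field with the residual
    field of the current level. The total displacement at position i is
    the residual step plus the coarse displacement sampled (by linear
    interpolation, clamped to the domain) at the residual-displaced
    position -- so the two warps chain rather than merely add."""
    n = len(phi_coarse_up)
    out = []
    for i, r in enumerate(phi_residual):
        x = max(0.0, min(i + r, n - 1))    # residual-displaced position
        x0 = int(x)
        x1 = min(x0 + 1, n - 1)
        w = x - x0
        coarse = (1 - w) * phi_coarse_up[x0] + w * phi_coarse_up[x1]
        out.append(r + coarse)
    return out
```

When either field is zero the composition reduces to the other field, which is the sanity check one expects of any fusion operator in a progressive scheme.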

Fig. 5. A schematic illustration of the proposed anatomical region deformation consistency learning strategy. The original image pairs and their anatomically masked variants are processed through a weight-shared network to generate corresponding deformation fields Φ and ΦM. A consistency loss is computed between the two warped images obtained by warping with Φ and ΦM, respectively.
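The consistency constraint described in this caption can be sketched in 1D. Linear-interpolation warping and a mean-squared consistency term are assumptions for illustration; the paper's actual loss and 3D warping operator are not reproduced here.

```python
def warp_1d(image, field):
    """Warp a 1D image by a displacement field via linear interpolation,
    clamping displaced positions to the image domain."""
    n = len(image)
    out = []
    for i, d in enumerate(field):
        x = max(0.0, min(i + d, n - 1))    # displaced position
        x0 = int(x)
        x1 = min(x0 + 1, n - 1)
        w = x - x0
        out.append((1 - w) * image[x0] + w * image[x1])
    return out


def consistency_loss(image, phi, phi_masked):
    """Penalize disagreement between warping with the field predicted from
    the original pair (phi) and the field predicted from the anatomically
    masked pair (phi_masked): identical deformations give zero loss."""
    wa = warp_1d(image, phi)
    wb = warp_1d(image, phi_masked)
    return sum((a - b) ** 2 for a, b in zip(wa, wb)) / len(wa)
```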

Fig. 6. Qualitative comparison of our method (Ours) with other competing approaches on two cases from the OASIS dataset. For each case, the comparison is organized in three rows for different methods. Top: Warped moving images; Middle: Warped segmentation labels of the moving images; Bottom: Visualization of the deformation fields obtained by each method.

Fig. 7. Detailed analysis of results achieved by different registration methods on the OASIS dataset. (a) Box plots illustrating the DSC performance of four anatomical structures (the cerebrospinal fluid (CSF), the gray matter (GM), the white matter (WM), and the cortex (CORT)) achieved by different methods, along with the overall DSC performance. (b) Per-organ scatter plots comparing the proposed approach against the second-best method (NICE-Trans). The x-axis represents the DSC (%) performance of NICE-Trans, while the y-axis represents the DSC (%) performance of our approach (Ours). The dashed diagonal line denotes parity (y = x), where points above the line indicate superior performance of our method over NICE-Trans.

Fig. 8. Qualitative comparison of our method (Ours) with other competing approaches on two cases from the abdominal dataset. For each case and for each method, the comparison is presented in two rows with the first row showing the warped moving image overlaid with its corresponding warped segmentation label and the second row visualizing the estimated deformation field.

Fig. 9. Detailed analysis of results achieved by different registration methods on the abdominal dataset. (a) Box plots illustrating the DSC performance of four anatomical structures (the liver, the spleen, the right kidney, and the left kidney), along with the overall DSC performance. (b) Per-organ scatter plots comparing the proposed approach against the second-best method (RDP). The x-axis represents the DSC (%) performance of RDP, while the y-axis represents the DSC (%) performance of our approach (Ours). The dashed diagonal line denotes parity (y = x), where points above the line indicate superior performance of our method over RDP.

Fig. 10. Qualitative comparison of our method (Ours) with other competing approaches on two cases from the hip dataset. For each case and for each method, the comparison is presented in two rows with the first row showing the warped moving image overlaid with its corresponding warped segmentation label and the second row visualizing the estimated deformation field.

Fig. 11. Detailed analysis of results achieved by different registration methods on the hip dataset. (a) Box plots illustrating the DSC performance of three anatomical structures (the pelvis, the right femur, and the left femur), along with the overall DSC performance. (b) Per-organ scatter plots comparing the proposed approach against the second-best method (GroupMorph). The x-axis represents the DSC (%) performance of GroupMorph, while the y-axis represents the DSC (%) performance of our approach (Ours). The dashed diagonal line denotes parity (y = x), where points above the line indicate superior performance of our method over GroupMorph.

Fig. 12. Learning curves of the proposed approach with (w/) and without (w/o) incorporating the BIAF module on (a) the OASIS dataset, (b) the abdominal dataset, and (c) the hip dataset.

Fig. 13. Stratified analysis of the performance changes obtained by incorporating the BIAF module across samples of varying deformation magnitudes. (a) Evaluation on the OASIS dataset; (b) Evaluation on the abdominal dataset; and (c) Evaluation on the hip dataset. Each dot denotes one test sample. The x-axis represents the average deformation magnitude, which is obtained for each test sample by computing the average deformation magnitude of organ masks in the corresponding regions of the deformation field from the ground truth (binary segmentation labels). The y-axis reports the DSC change yielded by incorporating the BIAF module. The red solid line of each sub-figure is the least-squares fitting line, indicating the trend of the DSC changes with respect to the average deformation magnitudes.

Fig. 14. Detailed analysis of failure cases. (a) Input image pairs and DSC changes of all failure cases of the abdominal dataset; and (b) Coronal view of three testing images from the abdominal dataset. Top: sample S1 contains bilateral kidney atrophy (red bounding box); middle: sample S2 has an abnormal mass lesion (red bounding box) superior to the spleen which may displace the spleen out of its normal position; and bottom: sample S7 has an abnormal spleen (red bounding box) which may displace the left kidney out of its normal position.

Fig. 15. Visualization of the semantic feature maps of a fixed image at the highest resolution, learned by the proposed approach with (w/) and without (w/o) incorporating the BIAF module. Here we show 8 randomly selected channels of features. Top: feature maps learned by the proposed approach with the BIAF module; Middle: feature maps learned by the proposed approach without incorporating the BIAF module; Bottom: difference maps computed as the element-wise subtraction between the top and the middle feature maps, highlighting the impact of incorporating the BIAF module on feature representation learning.

Fig. 16. Visualization of 8 randomly selected channels of the input and the output feature maps of the MAC module at the highest resolution for both moving (top) and fixed (bottom) images. The inputs are the intensity-wise and semantics-wise features, and the outputs are the corresponding calibrated features of the MAC module.

Fig. 17. Qualitative results of analyzing the progressive registration process of the proposed method. Here we show an input image pair from each dataset as well as the deformation fields (Φ3, Φ2, Φ1, and Φ0) generated at different resolution levels and the corresponding warped images. At each level, the deformation field is upsampled to the same resolution as the original image, then used to warp the moving image for visualization.
https://www.sciencedirect.com/science/article/pii/S1361841525004591