
As the core carrier of syndrome differentiation and treatment principles, Traditional Chinese Medicine Electronic Medical Records (TCM-EMRs) play a crucial role in knowledge inheritance and clinical decision-making. While nested named entity recognition (NNER) serves as a fundamental task in EMR analysis, it faces several challenges, including ambiguous entity boundaries, complex semantic nesting, and diverse domain-specific terminology. Traditional sequence labeling methods are often constrained by single-granularity features, significantly limiting their semantic representation capability and generalization performance, particularly when applied to mixed corpora containing ancient TCM texts and modern EMRs.

To address these limitations, this study proposes DG-SpanTCM, a novel NNER model specifically designed for TCM-EMRs, with its technical framework illustrated in Fig. 1. The model employs a multi-dimensional feature fusion mechanism, where character-level contextual semantic features are first captured using the BERT pre-trained language model, while SoftLexicon is incorporated to integrate domain-specific lexicon information, enabling a more comprehensive representation of TCM terminology at both the character and word levels. Additionally, an adversarial training strategy is introduced to enhance model robustness by generating perturbations, along with a multiplicative attention mechanism that dynamically focuses on key semantic units. Finally, a span-based classification architecture is designed to model nested entity relationships through multi-level feature interaction, while a class-balanced loss function is incorporated to mitigate label imbalance issues effectively. This framework ensures improved nested entity recognition accuracy, enhanced semantic understanding of TCM texts, and greater robustness in handling diverse and complex TCM-EMR corpora.

Coding layer

In the semantic encoding stage for TCM-EMRs, we design a dual-granularity semantic enhancement module that operates on characters and words, targeting the multi-granularity nesting of terminology (e.g., the hierarchical containment relationship between “yin deficiency of the liver and kidney” and “hypertension with yin deficiency of the liver and kidney”). The module fuses the pre-trained language model with the domain dictionary to sharpen the perception of TCM entity boundaries, which alleviates the ambiguity and hierarchical nesting of terminological expressions.

Fig. 1

Overall architecture of the proposed DG-SpanTCM model. The encoder integrates BERT with SoftLexicon for character-level semantic and lexical embedding, while the decoder performs span classification and entity extraction.

Traditional Chinese Medicine (TCM) texts are characterized by rich semantic ambiguity, hierarchical expressions, and domain-specific terminology. These characteristics pose challenges for traditional static embedding methods, such as Word2Vec or GloVe, which assign the same vector to a word regardless of context. In contrast, BERT generates contextualized embeddings that dynamically adjust based on surrounding text, enabling more accurate semantic representation, even for the same character appearing in different contexts. This makes BERT particularly well-suited for named entity recognition tasks in TCM electronic medical records.

Specifically, given a TCM text sequence \(S=\{S_1,S_2,\dots,S_n\}\), a domain-adapted BERT model is used to obtain character-level contextual dynamic representations. Each character \(S_i\) is mapped to a dense vector that captures its semantic features within the TCM context, as defined in Eq. (1), where n is the sequence length and d denotes the hidden dimension size:

$$\:\text{H}=\text{B}\text{E}\text{R}\text{T}\left(\text{S}\right),\:\text{H}\in\:{\mathbb{R}}^{\text{n}\times\:\text{d}}$$

(1)

During BERT training, the input embeddings are formed by summing the Token Embedding, Position Embedding, and Segment Embedding, which together encode both lexical meaning and sequential information. This embedding process is illustrated in Fig. 2, which shows how the input sequence is transformed into contextualized representations for downstream entity recognition.

Fig. 2

Input sequence construction for the BERT model. Each input token is represented by the sum of its Token Embedding, Segment Embedding, and Position Embedding.

To further enhance the boundary awareness of TCM terms, we combine the SoftLexicon algorithm with the National Standard Dictionary of Classification and Codes of TCM Diseases to construct a four-part (BMES) lexicon feature set for each character \(S_i\):

1) B-word set: TCM terms that begin with \(S_i\).

2) M-word set: TCM compound words that contain \(S_i\) in a middle position.

3) E-word set: syndrome expressions that end with \(S_i\).

4) S-word set: single-character TCM terms consisting of \(S_i\) alone.

With TCM dictionary matching, we construct a feature mapping function that assigns each character its four word sets, drawing on a specialized dictionary of Chinese medicine. For example, for the character “肠” in the phrase “right half colon cancer”, the matched lexicon is: B = {“intestinal cancer”}, M = {“colon cancer”}, E = {“colon”, “hemicolon”, “right hemicolon”}, S = {“intestine”}.
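The BMES matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `soft_lexicon_sets`, the toy lexicon, and the Chinese strings (assumed back-translations of the English glosses in the example) are all hypothetical.

```python
def soft_lexicon_sets(sentence, index, lexicon):
    """Collect the B/M/E/S word sets for the character at sentence[index].

    Every substring that covers the character and appears in the lexicon is
    assigned to one set according to where the character sits in the word.
    """
    sets = {"B": set(), "M": set(), "E": set(), "S": set()}
    n = len(sentence)
    for start in range(index + 1):        # candidate words must start at or before index
        for end in range(index, n):       # and end at or after index
            word = sentence[start:end + 1]
            if word not in lexicon:
                continue
            if start == index and end == index:
                sets["S"].add(word)       # the character alone is a term
            elif start == index:
                sets["B"].add(word)       # character begins the word
            elif end == index:
                sets["E"].add(word)       # character ends the word
            else:
                sets["M"].add(word)       # character is inside the word
    return sets


# Hypothetical toy lexicon; a real system would load the domain dictionary.
lexicon = {"肠", "肠癌", "结肠", "结肠癌", "半结肠", "右半结肠"}
sentence = "右半结肠癌"  # assumed source phrase for "right half colon cancer"
sets = soft_lexicon_sets(sentence, sentence.index("肠"), lexicon)
for tag in "BMES":
    print(tag, sorted(sets[tag]))
```

The brute-force scan over all covering substrings is quadratic in sentence length; for long records a trie over the dictionary would be the more usual choice.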

To account for the long-tailed distribution of TCM terms, we propose a weight decay compression algorithm based on TF-IDF that computes the domain saliency weight of each matched term, as given in Eq. (2), where \(L(S_i)\) denotes the set of lexicon words matched for character \(S_i\):

$$\alpha_w=\frac{\text{TF-IDF}_{\text{TCM}}(w)}{\sum_{w'\in L(S_i)}\text{TF-IDF}_{\text{TCM}}(w')}$$

(2)

The weighted lexicon embeddings are integrated with character-level representations through a concatenation strategy. Specifically, for each character, word-level features from matched lexicon entries are aggregated using BMES-based attention weights, as shown in Eq. (3). The resulting word-level representation \(X_i^w\) is then concatenated with the BERT-generated character embedding \(X_i^c\), and the fused vector is passed through a Layer Normalization operation, as defined in Eq. (4), to obtain the final dual-granularity semantic representation. This approach enables effective encoding of both contextual semantics and lexicon-level structure within Chinese medical texts.

$$\:\begin{array}{r}{X}_{i}^{w}=\sum\:_{c\in\:\{B,M,E,S\}}\sum\:_{w\in\:c}{\alpha\:}_{w}\cdot\:\text{E}\text{m}\text{b}\text{e}\text{d}\left(w\right)\end{array}$$

(3)

$$\:{X}_{i}^{\text{f}\text{i}\text{n}\text{a}\text{l}}=\text{L}\text{a}\text{y}\text{e}\text{r}\text{N}\text{o}\text{r}\text{m}\left({X}_{i}^{c}\oplus\:{X}_{i}^{w}\right)$$

(4)
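As a rough illustration of Eqs. (2) to (4), the sketch below normalizes TF-IDF scores into weights, forms the weighted word vector, concatenates it with a character vector, and applies layer normalization. All numbers, vector sizes, and function names are hypothetical, and the layer normalization here omits the learnable gain and bias that a trained model would normally include.

```python
import math


def normalize_weights(tfidf):
    """Eq. (2): divide each matched word's TF-IDF score by the sum over matches."""
    total = sum(tfidf.values())
    return {word: score / total for word, score in tfidf.items()}


def weighted_word_vector(weights, embed):
    """Eq. (3): weighted sum of word embeddings over all matched words."""
    dim = len(next(iter(embed.values())))
    out = [0.0] * dim
    for word, alpha in weights.items():
        for k in range(dim):
            out[k] += alpha * embed[word][k]
    return out


def layer_norm(x, eps=1e-5):
    """Eq. (4), simplified: zero-mean, unit-variance scaling without gain/bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Hypothetical TF-IDF scores and 2-dimensional word embeddings.
tfidf = {"肠癌": 1.0, "结肠癌": 3.0}
embed = {"肠癌": [1.0, 0.0], "结肠癌": [0.0, 1.0]}

alpha = normalize_weights(tfidf)
x_w = weighted_word_vector(alpha, embed)   # word-level vector X_i^w
x_c = [0.5, -0.5, 1.0]                     # hypothetical character vector X_i^c
x_final = layer_norm(x_c + x_w)            # list concatenation, then LayerNorm
print(alpha, x_w, x_final)
```

Because the weights in Eq. (2) are normalized over every word matched for the character, the double sum over the four sets in Eq. (3) reduces to the single pass over matched words used here.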

Through the synergistic enhancement of character and word features, the method improves the accuracy of TCM entity recognition and helps address the semantic differences between ancient and modern terminology, the hierarchical nesting of syndromes and diseases, and the ambiguity in mapping dialect expressions to standard terminology.

Adversarial training

In the field of traditional Chinese medicine (TCM), textual information is rich and complex. To accurately extract key information and construct robust models, it is essential to fully encode TCM text features and further enhance the model’s generalization and robustness. Given the small sample size, high annotation cost, and diverse word expressions of TCM corpora, adversarial training is adopted as an effective strategy. Specifically, we apply the Fast Gradient Method (FGM) at the embedding layer to generate adversarial samples by adding small perturbations to the word vectors. This helps the model better handle noisy or ambiguous inputs. The generation process is illustrated in Fig. 3.

Fig. 3

Adversarial training mechanism used in the proposed model. During training, controlled perturbations are added to the input embeddings to simulate worst-case variations.

Specifically, given a set of original TCM text samples X and their corresponding labels y, the objective of adversarial training is to identify an optimal perturbation ΔX that maximizes the model’s loss function while keeping the model parameters \(\:\theta\) fixed:

$$\:\triangle\:X=arg\underset{\parallel\:\triangle\:X\parallel\:\le\:\epsilon}{max}\mathcal{L}\left(X+\triangle\:X,y;\theta\:\right)$$

(5)

In this formulation, \(\mathcal{L}(\cdot)\) denotes the loss function (e.g., cross-entropy), and \(\triangle X\) represents a small perturbation added to the input. The constraint \(\parallel\triangle X\parallel\le\epsilon\), where \(\epsilon\) is a positive constant, ensures that the perturbation is limited in magnitude to maintain semantic consistency with the original input. By training the model to correctly classify both original and perturbed inputs, adversarial training helps the model learn more robust and stable representations. This, in turn, reduces sensitivity to minor semantic shifts or variations in word order, improving generalization performance.
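The Fast Gradient Method approximates the maximization in Eq. (5) with a single step along the loss gradient, scaled to the budget \(\epsilon\). The sketch below assumes an L2 norm and uses a toy quadratic loss with a hand-written gradient; the text specifies neither the norm nor the value of \(\epsilon\), so both are illustrative choices.

```python
import math


def fgm_perturbation(grad, epsilon):
    """Return epsilon * grad / ||grad||_2, the standard FGM step.

    A zero gradient yields a zero perturbation to avoid division by zero.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


def toy_loss(x, target):
    """Toy stand-in for the model loss: 0.5 * ||x - target||^2."""
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))


# Hypothetical 2-dimensional "embedding" and target.
x = [1.0, 2.0]
target = [1.0, -2.0]
grad = [xi - ti for xi, ti in zip(x, target)]   # analytic gradient of toy_loss

delta = fgm_perturbation(grad, epsilon=0.5)
x_adv = [xi + di for xi, di in zip(x, delta)]

print(toy_loss(x, target), toy_loss(x_adv, target))
```

In a real training loop the gradient would come from backpropagation through the embedding layer, and the loss on the perturbed embeddings would be added to the clean loss before the parameter update.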

After the feature vectors produced under adversarial training are concatenated, semantic fusion is performed with a Bidirectional Long Short-Term Memory network (BiLSTM), which captures bidirectional contextual information in text sequences. For TCM electronic medical records, the BiLSTM processes the concatenated feature vectors to further extract semantic associations between words in the text. Through its forward and backward recurrent structure, the BiLSTM uses both preceding and following context, yielding a dual-granularity, feature-enhanced sentence encoding that fuses character and word features.

Specifically, let \(X_t^{final}\) be the fused feature vector of the t-th character; the forward and backward passes of the BiLSTM can then be expressed as:

$$\:\overrightarrow{{h}_{t}}={\text{L}\text{S}\text{T}\text{M}}_{\text{f}\text{w}\text{d}}\left({X}_{t}^{\text{f}\text{i}\text{n}\text{a}\text{l}},\overrightarrow{{h}_{t-1}}\right)$$

(6)

$$\:\overleftarrow{{h}_{t}}={\text{L}\text{S}\text{T}\text{M}}_{\text{b}\text{w}\text{d}}\left({X}_{t}^{\text{f}\text{i}\text{n}\text{a}\text{l}},\overleftarrow{{h}_{t+1}}\right)$$

(7)

$$\:{H}_{t}=\left[\overrightarrow{{h}_{t}};\overleftarrow{{h}_{t}}\right]$$

(8)

Ultimately, the vector \(H_t\) output by the encoding layer is the contextual semantic representation of the t-th character, fused with word features. Because each character in the TCM-EMR text sequence is fused one-to-one with its corresponding lexical information, the model can understand the semantics of the record more accurately, which supports downstream tasks such as disease diagnosis and treatment recommendation.

Span generation and decoding

In TCM electronic medical records, symptom descriptions are often complex and contain nested entities; for example, “red tongue with little moss” contains the subject “tongue” together with features such as “red” and “little moss”. To address this challenge, after obtaining sentence-level vector representations from the BiLSTM, we introduce a span classification strategy that uses multiplicative attention to focus dynamically on key semantic units, so that the nested relationships of complex entities in TCM-EMRs are identified and modeled more accurately.

We first apply two independent feedforward neural networks (FFNNs) to compute the representation of each character as the start and end of the entity, respectively:

$$\:{h}_{s}={\text{F}\text{F}\text{N}\text{N}}_{\text{s}\text{t}\text{a}\text{r}\text{t}}\left(H\right),\:{h}_{e}={\text{F}\text{F}\text{N}\text{N}}_{\text{e}\text{n}\text{d}}\left(H\right)$$

(9)

where H is the BiLSTM-encoded text feature representation. Based on this, we construct the set \(\mathcal{H}\) of all possible spans in the sentence, i.e., all (start, end) index pairs with start ≤ end. For example, for the sentence “left₁lower₂limb₃”, the possible spans are \(\mathcal{H}=\{(1,1),(1,2),(1,3),(2,2),(2,3),(3,3)\}\).
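Candidate span construction amounts to enumerating every (start, end) pair with start ≤ end. A minimal sketch follows, using 1-based indices to match the example; the optional `max_len` cap is a hypothetical addition that span-based models often use to bound cost, not something stated in the text.

```python
def enumerate_spans(n, max_len=None):
    """List all 1-based (start, end) spans over a sentence of n characters.

    If max_len is given, spans longer than max_len characters are skipped.
    """
    spans = []
    for start in range(1, n + 1):
        for end in range(start, n + 1):
            if max_len is not None and end - start + 1 > max_len:
                break  # longer spans from this start are also too long
            spans.append((start, end))
    return spans


spans = enumerate_spans(3)   # the three-character "left lower limb" example
print(spans)
print(len(enumerate_spans(10)))
```

Without a length cap the number of candidates is n(n + 1)/2, which grows quadratically with sentence length.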

The span matching algorithm is given in Algorithm 1.

Algorithm 1

After obtaining the set of candidate entity spans, we validate each span and predict its type. To better capture the semantics of the text, we introduce a multiplicative attention mechanism that dynamically focuses on key semantic units and strengthens the association between key pieces of information by assigning different weights to different words. The mechanism first identifies and weights key words, such as “pulse” and “number” in “pulse count”, to focus on the “pulse” entity. At the same time, it captures long-range dependencies so that nested entities such as “red tongue with little moss” are recognized as “tongue image” rather than miscategorized. This dynamic focusing improves the model’s ability to understand and categorize complex TCM texts. The multiplicative attention mechanism is shown in Fig. 4.

Fig. 4

Multiplicative attention mechanism for span classification. Element-wise multiplication is used to compute relevance between spans and context, enhancing nested entity recognition.

Specifically, for the span (i, j), we compute its attention-enhanced representation:

$$\:{A}_{ij}=\text{s}\text{o}\text{f}\text{t}\text{m}\text{a}\text{x}\left(\frac{{\left({W}_{q}{h}_{s}^{i}\right)}^{{\top\:}}\left({W}_{k}{h}_{e}^{j}\right)}{\sqrt{d}}\right)$$

(10)

where \(\:{W}_{q}\) and \(\:{W}_{k}\) are trainable parameters and d is the hidden layer dimension. Ultimately, the span-level contextual features are constructed by combining the attentively weighted representations.
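The score in Eq. (10) has the form of a scaled dot product between a projected start vector and a projected end vector. The sketch below is one possible reading: it assumes the softmax is taken over end positions j for each start position i, which the text does not specify, and it uses identity matrices as stand-ins for the trainable \(W_q\) and \(W_k\). It does not mask pairs with j < i.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def span_attention(h_start, h_end, W_q, W_k):
    """Return A where A[i][j] is the normalized start-i / end-j score.

    Assumption: normalization runs over end positions j for each start i.
    """
    d = len(W_q)  # dimension of the projected vectors
    queries = [matvec(W_q, h) for h in h_start]
    keys = [matvec(W_k, h) for h in h_end]
    A = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        A.append(softmax(scores))
    return A


# Hypothetical 2-dimensional start/end representations for two characters.
identity = [[1.0, 0.0], [0.0, 1.0]]
h_start = [[1.0, 0.0], [0.0, 1.0]]
h_end = [[1.0, 0.0], [0.0, 1.0]]
A = span_attention(h_start, h_end, identity, identity)
print(A)
```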

After obtaining the enhanced span representations, we use a multilayer perceptron (MLP) to score each span over the entity classes (including the non-entity class):

$$\text{Score}(i,j)=\text{MLP}(A_{ij})$$

(11)

The multilayer perceptron learns complex features of the span representation and models entity nesting relationships through multilayer feature interactions. Finally, the most likely category c is selected as the prediction through the argmax operation:

$$W_{ij}=\underset{c}{\text{argmax}}\;\text{Score}(i,j,c)$$

(12)

where \(\:{W}_{ij}\) represents the category index of span (i, j) in the decoding matrix.

In the decoding phase, we parse out all valid entity spans and their categories from the upper triangular part of the prediction matrix W. The upper triangular elements of the matrix correspond to a total of n(n + 1)/2 valid spans, and the positional coordinates of each element are used together with its label for the final entity identification. For example, in Fig. 5, if \(W_{3,4}\) corresponds to the category “nature”, it means that the text “puffiness” is correctly recognized as a “nature” entity.
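Reading entities off the upper triangle of W can be sketched as follows. The tokens, the label list, and the use of index 0 for the non-entity class are hypothetical; only the “nature” label at position (3, 4) echoes the example in the text.

```python
def decode_spans(tokens, W, labels, non_entity=0):
    """Extract (start, end, text, label) tuples from the upper triangle of W.

    W[i][j] holds the predicted class index for the span from token i to
    token j (0-based in the matrix, reported 1-based in the output).
    """
    n = len(tokens)
    entities = []
    for i in range(n):
        for j in range(i, n):             # upper triangle only: start <= end
            c = W[i][j]
            if c != non_entity:
                text = " ".join(tokens[i:j + 1])
                entities.append((i + 1, j + 1, text, labels[c]))
    return entities


# Hypothetical four-token sentence with two nested predictions.
tokens = ["t1", "t2", "t3", "t4"]
labels = ["O", "nature", "symptom"]       # index 0 is the non-entity class
W = [[0] * 4 for _ in range(4)]
W[2][3] = 1                               # span (3, 4) predicted as "nature"
W[0][3] = 2                               # enclosing span (1, 4) as "symptom"

entities = decode_spans(tokens, W, labels)
print(entities)
```

Because each span is classified independently, an inner span and the span that contains it can both be returned, which is what allows nested entities to be recovered.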

Our decoding strategy is structurally similar to the biaffine span decoder [32], which models span-level interactions via a scoring matrix. However, unlike the standard biaffine formulation, we do not explicitly apply a bilinear transformation between token pairs. Instead, our model directly leverages span-level classification scores derived from multi-level contextual embeddings, and integrates entity-specific attention signals. This design reduces computational complexity while preserving span discriminability, and is better suited for nested entity recognition tasks in long-text TCM narratives.

Fig. 5

Example of span decoding for nested entity recognition. The model identifies valid start–end index pairs to extract nested entities from TCM texts.

Category imbalance losses

In the task of entity recognition in TCM electronic medical records, data categories often show long-tailed distribution, i.e., some entity categories (e.g., common disease names, common symptoms) occupy a large number of samples, while data of rare categories (e.g., special TCM symptoms, rare causes of disease) are relatively scarce. This data imbalance phenomenon easily leads to overfitting of high-frequency categories and insufficient recognition of low-frequency categories during model training, which affects the overall performance and generalization ability. To address this problem, we propose the Adaptive Class Weight Loss (ACW Loss) function, which optimizes the model’s classification ability on long-tailed distribution data by dynamically adjusting the class weights, and combines with a noise-aware training strategy to reduce the impact of mislabeled samples, thereby improving the robustness of entity recognition.

The traditional cross-entropy loss function assigns the same weight to all categories, so the gradients of high-frequency categories dominate training and entities in low-frequency categories are difficult to recognize correctly. For this reason, we design a dynamic category weight adjustment mechanism that lets the model adaptively focus on rare categories and improves the classification accuracy of low-frequency categories. The ACW Loss is calculated as follows:

$$\:L=-\sum\:_{i=1}^{C}{w}_{i}^{{\prime\:}}{y}_{i}\text{l}\text{o}\text{g}\left({\text{l}\text{o}\text{g}\text{i}\text{t}\text{s}}_{i}\right)$$

(13)

where C is the total number of categories; \(y_i\in\{0,1\}\) is the ground-truth indicator for category i (i.e., \(y_i=1\) if the true label is category i, and 0 otherwise); \(\text{logits}_i\) is the unnormalized model output (i.e., before softmax) for category i; and \(w_i'\) is the adjusted category weight that compensates for label imbalance.

To compute \(\:{w}_{i}^{{\prime\:}}\), we first define the initial weight \(\:{\text{w}}_{\text{i}}\) based on the inverse category frequency:

$$w_i=\frac{1}{f_i^{\gamma}}$$

(14)

where fi is the frequency of category i in the training data, and γ∈[0,1] is a hyperparameter controlling the degree of frequency-based adjustment. A small γ leads to minor reweighting, while a larger γ gives significantly higher weights to rare categories.

The final normalized weight \(\:{w}_{i}^{{\prime\:}}\) is then computed as:

$$\:{w}_{i}^{{\prime\:}}=\frac{{w}_{i}}{\sum\:_{j=1}^{C}{w}_{j}}$$

(15)

This normalization ensures that the class weights sum to 1, so their total contribution remains constant. The adaptive weight assignment gives low-frequency categories higher gradient weights, which strengthens the model’s ability to learn rare entities and keeps them from being ignored during training. As shown in Fig. 6, ACW Loss demonstrates faster convergence than Cross-Entropy, highlighting its effectiveness in stabilizing training under class imbalance.
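The weighting in Eqs. (14) and (15) can be checked numerically with a short sketch. The class counts and the value of γ are hypothetical. The loss function assumes that the logarithm in Eq. (13) is applied to the softmax probability of the true class, which is the usual form of weighted cross-entropy; that reading is an assumption rather than something the text states.

```python
import math


def acw_weights(freqs, gamma=0.5):
    """Eqs. (14)-(15): inverse-frequency weights 1 / f^gamma, normalized to sum to 1."""
    raw = [1.0 / (f ** gamma) for f in freqs]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_cross_entropy(logits, true_class, weights):
    """Class-weighted cross-entropy for one sample (assumed log-softmax form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))  # log-sum-exp
    log_prob = logits[true_class] - log_z
    return -weights[true_class] * log_prob


# Hypothetical class counts: one frequent, one medium, one rare category.
freqs = [900, 90, 10]
weights = acw_weights(freqs, gamma=0.5)
uniform = acw_weights(freqs, gamma=0.0)   # gamma = 0 removes the reweighting

loss_rare = weighted_cross_entropy([0.0, 0.0, 0.0], 2, weights)
loss_common = weighted_cross_entropy([0.0, 0.0, 0.0], 0, weights)
print(weights, uniform, loss_rare, loss_common)
```

Since the weights are normalized, it makes no difference whether \(f_i\) is a raw count or a relative frequency: any common scale factor cancels in Eq. (15).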

Labeling inconsistencies and data noise (e.g., different expressions of the same concept, differences in doctors’ labeling styles) are common in TCM-EMRs. If the model fits these noisy samples too closely, the final classification quality may suffer. Therefore, we incorporate a Noise-aware Training (NAT) strategy in the training process to reduce the impact of mislabeled data on the model. Specifically, at the beginning of training, we assign the same weight to all samples; as training proceeds, if the prediction confidence of a sample keeps fluctuating, we reduce its gradient contribution to limit the interference of mislabeled data.

Fig. 6

Loss convergence comparison between ACW Loss and Cross-Entropy Loss. The ACW Loss shows faster convergence across training epochs, highlighting its effectiveness in stabilizing model training for imbalanced entity distributions.


Nested named entity recognition in traditional Chinese medicine electronic medical records via dual-granularity feature augmentation and span classification
