# Chinese EMR ICD Classification

CCL2025 Chinese Electronic Medical Record ICD Diagnosis Coding Evaluation Competition

## Background

In recent years, population aging and rising health awareness have put growing service pressure on healthcare systems. The widespread adoption of Electronic Medical Records (EMR) as part of healthcare digitization offers new ways to address these challenges. To standardize the management and sharing of medical data, the World Health Organization (WHO) developed the International Classification of Diseases (ICD). This standard maps tens of thousands of diseases and their combinations to a standardized alphanumeric coding system, laying the foundation for cross-regional and cross-institutional exchange and analysis of medical data.

However, manually converting EMR text into ICD codes is time-consuming, labor-intensive, and error-prone. An automated ICD coding system can improve coding efficiency and consistency while providing more reliable data for disease research and medical management.

To advance this goal, the competition provides a dedicated dataset for evaluating automated ICD coding on Chinese EMRs. The dataset is built from de-identified medical records and covers five primary diagnosis ICD-10 codes and 53 additional diagnosis ICD-10 codes, with 1,485 records in total.

## Task Description

### Task Objective

The goal of this task is to use Natural Language Processing (NLP) techniques to analyze patients’ clinical symptoms in electronic medical record texts, extract key symptom information, and determine both the primary diagnosis code and the additional diagnosis codes. The resulting predictions are intended to help doctors and medical coders complete ICD coding accurately.

### Task Definition

Given a text description of a patient's medical history, chief complaint, current illness history, past medical history, and discharge status, the model needs to predict:

- Primary Diagnosis Code (only one ICD-10 code per record)
- Additional Diagnosis Codes (one or more ICD-10 codes per record)

### Task Input & Output

- Input: A string (str type) containing clinical information extracted from medical records.
- Output: A list containing a single string, in which the primary diagnosis code is separated from the additional diagnosis codes by a `"|"` delimiter, and the additional diagnosis codes are separated from one another by a `";"` delimiter.

    - Example Output Format:

        ```text
        ["Primary_Diagnosis_Code|Additional_Diagnosis_Code1;Additional_Diagnosis_Code2;..."]
        ```

    - Primary Diagnosis Code: Exactly one code from the predefined list.
    - Additional Diagnosis Codes: One or more codes from the predefined list.
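
The output format above can be sketched in Python as follows. This is a minimal illustration, not official competition code: the helper names `format_prediction` and `parse_prediction` are hypothetical, and the codes in the usage example are taken from the lists in this document.

```python
def format_prediction(primary: str, additional: list[str]) -> list[str]:
    """Join codes into the single-string list format described above."""
    if not primary:
        raise ValueError("exactly one primary diagnosis code is required")
    if not additional:
        raise ValueError("at least one additional diagnosis code is required")
    # '|' separates primary from additional codes; ';' separates additional codes.
    return [f"{primary}|{';'.join(additional)}"]


def parse_prediction(prediction: list[str]) -> tuple[str, list[str]]:
    """Inverse of format_prediction: split on '|' first, then on ';'."""
    primary, _, rest = prediction[0].partition("|")
    additional = [code.strip() for code in rest.split(";") if code.strip()]
    return primary.strip(), additional


formatted = format_prediction("I20.000", ["E11.900", "I25.103"])
parsed = parse_prediction(formatted)
```

Splitting on `"|"` before `";"` keeps the primary code unambiguous even if the additional list is long.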
 
### Available ICD Code Categories

Primary Diagnosis Codes:

`I10.x00x032`, `I20.000`, `I20.800x007`, `I21.401`, `I50.900x018`

Additional Diagnosis Codes (Partial List):

`E04.101`, `E04.102`, `E11.900`, `E14.900x001`, `E72.101`, `E78.500`, `E87.600`, `I10.x00x023`, `I10.x00x024`, `I10.x00x027`, `I10.x00x028`, `I10.x00x031`, `I10.x00x032`, `I20.000`, `I25.102`, `I25.103`, `I25.200`, `I31.800x004`, `I38.x01`, `I48.x01`, `I48.x02`, `I49.100x001`, `I49.100x002`, `I49.300x001`, `I49.300x002`, `I49.400x002`, `I49.400x003`, `I49.900`, `I50.900x007`, `I50.900x008`, `I50.900x010`, `I50.900x014`, `I50.900x015`, `I50.900x016`, `I50.900x018`, `I50.907`, `I63.900`, `I67.200x011`, `I69.300x002`, `I70.203`, `I70.806`, `J18.900`, `J98.414`, `K76.000`, `K76.807`, `N19.x00x002`, `N28.101`, `Q24.501`, `R42.x00x004`, `R91.x00x003`, `Z54.000x033`, `Z95.501`, `Z98.800x612`
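
Because predictions must come from these predefined lists, it can help to check a prediction before submitting it. The sketch below assumes a hypothetical helper `find_invalid_codes`; the primary code set is copied from the list above, while the set of allowed additional codes is passed in by the caller (the tiny set in the usage example is for illustration only).

```python
# The five primary diagnosis codes listed above.
PRIMARY_CODES = frozenset({
    "I10.x00x032", "I20.000", "I20.800x007", "I21.401", "I50.900x018",
})


def find_invalid_codes(
    primary: str,
    additional: list[str],
    allowed_additional: set[str],
) -> list[str]:
    """Return every predicted code that is not in its predefined list."""
    invalid = []
    if primary not in PRIMARY_CODES:
        invalid.append(primary)
    invalid.extend(code for code in additional if code not in allowed_additional)
    return invalid


# Illustrative subset of the additional diagnosis codes, not the full list.
allowed = {"E11.900", "I25.103", "I49.900"}
ok = find_invalid_codes("I20.000", ["E11.900", "I25.103"], allowed)
bad = find_invalid_codes("J81.x00x002", ["E11.900", "X00.000"], allowed)
```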

## Submission Format

The competition has two leaderboard phases:

- A-Leaderboard (Validation Set): Evaluated on the Alibaba Cloud Tianchi platform. Participants can submit up to 5 times per day (failed submissions do not count towards this limit).
- B-Leaderboard (Test Set): Likely used for final ranking.

Example Submission:

```json
[
    {
        "Record_ID": "ZYxxxxxxx",
        "Prediction": "[Primary_Diagnosis_Code|Additional_Diagnosis_Code1;Additional_Diagnosis_Code2;...]"
    },
    {
        "Record_ID": "ZYyyyyyyy",
        "Prediction": "[Primary_Diagnosis_Code|Additional_Diagnosis_Code1;Additional_Diagnosis_Code2;...]"
    }
]
```
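
A submission in this shape could be assembled as sketched below. The function name `build_submission` and the record IDs are hypothetical, and wrapping the prediction string in square brackets simply mirrors the example above; it is worth confirming the exact string format against `ICD-Coding-A.json` before submitting.

```python
import json


def build_submission(
    predictions: dict[str, tuple[str, list[str]]],
) -> list[dict[str, str]]:
    """Turn {record_id: (primary, additional_codes)} into submission entries."""
    entries = []
    for record_id, (primary, additional) in predictions.items():
        entries.append({
            "Record_ID": record_id,
            # Bracket-wrapped string, mirroring the example above (assumption).
            "Prediction": f"[{primary}|{';'.join(additional)}]",
        })
    return entries


# Hypothetical record IDs and predictions, for illustration only.
submission = build_submission({
    "ZY0000001": ("I20.000", ["E11.900", "I25.103"]),
    "ZY0000002": ("I21.401", ["I49.900"]),
})
submission_json = json.dumps(submission, ensure_ascii=False, indent=4)
```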

## Dataset Description

The evaluation dataset is built from de-identified hospital medical records and contains 1,485 records. It is divided into:

- Training set: 800 records
- Validation set: 200 records
- Test set: 485 records (not publicly available)

Only the training set and validation set are publicly available, while the test set remains private.

### Dataset Files

- `ICD-Coding-train.json`: Annotated training data.
- `ICD-Coding-test-A.json`: Validation set (A-Leaderboard test set).
- `ICD-Coding-A.json`: Submission example for the A-Leaderboard.
- `ICD-Coding-test-B.json`: Test set (B-Leaderboard). This dataset is not publicly available.
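
A minimal loading sketch for these files is shown below. It assumes, without confirmation from the organizers, that each file holds a top-level JSON array of record objects encoded as UTF-8; the helper names `parse_records` and `load_records` are hypothetical.

```python
import json
from pathlib import Path


def parse_records(text: str) -> list[dict]:
    """Parse JSON text that is assumed to be a top-level array of records."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of records")
    return data


def load_records(path: str) -> list[dict]:
    """Read a dataset file such as ICD-Coding-train.json (UTF-8 assumed)."""
    return parse_records(Path(path).read_text(encoding="utf-8"))


# Tiny in-memory example instead of a real dataset file.
records = parse_records('[{"病案标识": "ZY1", "主诉": "胸闷"}]')
```

If the real files turn out to use a different layout (for example, one JSON object per line), only `parse_records` would need to change.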

### Field Descriptions

Each record in the dataset is stored in JSON format and contains the following fields:

- Case ID (`病案标识`): The unique patient case identifier in the hospital.
- Chief Complaint (`主诉`): The primary symptom described by the patient during consultation, typically summarized in a short sentence.
- Present Illness History (`现病史`): A detailed description of the patient's current illness, including onset, symptom characteristics, disease progression, past treatments, and response to therapy.
- Past Medical History (`既往史`): The patient’s previous health conditions, major diseases, surgeries, injuries, and allergies.
- Personal History (`个人史`): The patient’s lifestyle habits, occupational exposure, and epidemiological history.
- Marital History (`婚姻史`): Information on marital status, age at marriage, spouse’s health status, and number of children.
- Family History (`家族史`): Family history of hereditary or specific diseases among direct relatives.
- Admission Condition (`入院情况`): A summary of the patient's symptoms, signs, and overall condition at the time of hospital admission.
- Admission Diagnosis (`入院诊断`): The initial diagnosis made by the physician upon hospital admission based on the medical history and tests.
- Treatment Course (`诊疗经过`): Detailed records of the patient’s examinations, treatments, and disease progression during hospitalization.
- Discharge Condition (`出院情况`): A brief description of the patient’s health status at discharge.
- Discharge Instructions (`出院医嘱`): Guidelines provided by the physician regarding medication, follow-up, and lifestyle adjustments after discharge.
- Primary Diagnosis Code (`主诊断编码`): The primary ICD-10 code corresponding to the main diagnosis during hospitalization.
- Additional Diagnosis Codes (`其他诊断编码`): One or more ICD-10 codes corresponding to other diagnoses.

Sample Data (JSON Format):

```json
{
    "病案标识": "ZY020000982397",
    "主诉": "胸闷、喘7天。",
    "现病史": "患者于7天前无明显诱因出现胸闷、喘，呈阵发性，活动及情绪激动后明显加重，不能从事日常活动...",
    "既往史": "有“冠状动脉粥样硬化性心脏病”10余年，2021年于****行“冠状动脉移植术”（具体不详）",
    "个人史": "生长于原籍，否认疫区及地方病流行区长期居住史，生活规律...",
    "婚姻史": "适龄结婚，育有1子，配偶及孩子身体健康。",
    "家族史": "父母已逝，具体不详。否认家族性遗传病及传染病史。",
    "入院情况": "患者老年男性，76岁，因“胸闷、喘7天”入院...",
    "入院诊断": "1.急性失代偿性心力衰竭心功能II级（Killip分级）2.肺炎3.急性呼吸衰竭（I型）",
    "诊疗经过": "入院后完善相关辅助检查，凝血常规：凝血酶原时间：13.1秒...",
    "出院情况": "双侧瞳孔等大等圆，对光反射及调节反射存在...",
    "出院医嘱": "1、低盐低脂饮食，注意休息，避免劳累，按时服药...",
    "主诊断编码": "J81.x00x002",
    "其他诊断编码": "I50.907; I50.903; I25.103; I20.000; I49.900; I48.x01;E11.900"
}
```
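
A record like the one above has to be turned into a single input string and a clean label list before modeling. The sketch below is one possible way to do that, with hypothetical helper names `record_to_text` and `split_codes`; the choice to prefix each section with its field name is an assumption, not a requirement of the task. Note that the additional-code string in the sample mixes `"; "` and `";"`, so the splitter tolerates optional whitespace around the delimiter.

```python
import re

# Free-text fields from the field descriptions, in document order.
TEXT_FIELDS = [
    "主诉", "现病史", "既往史", "个人史", "婚姻史", "家族史",
    "入院情况", "入院诊断", "诊疗经过", "出院情况", "出院医嘱",
]


def record_to_text(record: dict) -> str:
    """Concatenate the non-empty text fields, one labeled section per line."""
    parts = [
        f"{field}：{record[field]}"
        for field in TEXT_FIELDS
        if record.get(field)
    ]
    return "\n".join(parts)


def split_codes(raw: str) -> list[str]:
    """Split a ';'-separated code string, ignoring whitespace around ';'."""
    return [code for code in re.split(r"\s*;\s*", raw.strip()) if code]


sample = {
    "病案标识": "ZY020000982397",
    "主诉": "胸闷、喘7天。",
    "主诊断编码": "J81.x00x002",
    "其他诊断编码": "I50.907; I50.903; I25.103; I20.000; I49.900; I48.x01;E11.900",
}
text = record_to_text(sample)
labels = split_codes(sample["其他诊断编码"])
```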

## Evaluation Metrics

Submissions are scored with an accuracy (Acc) metric that gives equal weight to the primary diagnosis code and the additional diagnosis codes:

$$
\text{Acc} = \frac{1}{N} \sum_{i=1}^{N} \lbrace 0.5 \cdot I(\hat{y}_{\text{main}} = y_{\text{main}}) + 0.5 \cdot \frac{\text{NUM}(y_{\text{other}} \cap \hat{y}_{\text{other}})}{\text{NUM}(y_{\text{other}})} \rbrace_{i}
$$

Where $I(\cdot)$ is an indicator function that returns `1` if the condition is met and `0` otherwise. $\hat{y}_{\text{main}}$ and $y_{\text{main}}$ are the predicted and true labels of the main diagnosis code, respectively. $\text{NUM}(x)$ counts the number of elements in the set $x$. $\hat{y}_{\text{other}}$ and $y_{\text{other}}$ are the predicted and true label sets of the other diagnosis codes, respectively. $N$ is the number of test samples, and $\lbrace \cdot \rbrace_i$ is the prediction accuracy for the $i$-th Chinese electronic medical record.
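
For local validation, the formula can be sketched in Python as below. This is an unofficial reimplementation based only on the definition above: the function names are hypothetical, and scoring the additional-code term as `0` when the true set is empty is an assumption, since the formula is undefined in that case.

```python
def record_score(
    true_main: str,
    pred_main: str,
    true_other: list[str],
    pred_other: list[str],
) -> float:
    """Score one record: 0.5 * main-code match + 0.5 * share of true other codes found."""
    main_hit = 1.0 if pred_main == true_main else 0.0
    true_set, pred_set = set(true_other), set(pred_other)
    # Assumption: an empty true set contributes 0 (the formula does not cover it).
    other_hit = len(true_set & pred_set) / len(true_set) if true_set else 0.0
    return 0.5 * main_hit + 0.5 * other_hit


def accuracy(
    gold: list[tuple[str, list[str]]],
    pred: list[tuple[str, list[str]]],
) -> float:
    """Average record_score over aligned (main_code, other_codes) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of records")
    if not gold:
        raise ValueError("at least one record is required")
    total = sum(record_score(g[0], p[0], g[1], p[1]) for g, p in zip(gold, pred))
    return total / len(gold)


gold = [("I20.000", ["E11.900", "I25.103"]), ("I21.401", ["I49.900"])]
pred = [("I20.000", ["E11.900"]), ("I50.900x018", ["I49.900", "E11.900"])]
score = accuracy(gold, pred)  # (0.75 + 0.5) / 2 = 0.625
```

In the second record the extra predicted code `E11.900` does not lower the score: as the formula is written, the additional-code term only measures how many true codes were recovered.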

## Baseline

Performance of the baseline model:

|  Acc   |
|--------|
| 41.34% |