Open Collections
UBC Theses and Dissertations
Extraction of social determinants of health from electronic health records using natural language processing
Chen, Zhenghua
Abstract
Purpose: Social Determinants of Health (SDoH) have a significant impact on human health outcomes and disparities. Collecting SDoH from electronic health records can facilitate decision-making and downstream research. With thousands of clinical records to process, automated extraction using Artificial Intelligence (AI) would be more efficient and cost-effective than manual review. This study aims to automatically extract comprehensive SDoH details from Electronic Health Records (EHR) using a Natural Language Processing (NLP) based AI pipeline.
Methods: One thousand documents from BC Cancer with concentrated SDoH information were carefully selected and labeled to provide the ground truth for training and assessing the NLP models. Two pipelines were applied for SDoH extraction: an open-source pipeline trained on the BC Cancer dataset and an industrial pre-trained solution used as a benchmark. To optimize the performance of the first pipeline, three experiments were conducted to assess how including subtype word positions during training affects extraction performance. The results of the two pipelines were compared, and the best-performing one was subsequently employed to extract SDoH information from a total of 13,258 oncology documents.
Results: The open-source pipeline achieved an average F1 score of 0.88 on the validation dataset for extracting 13 SDoH factors, outperforming the benchmark by 5%. This pipeline also demonstrated a notably superior capability to extract detailed subtypes compared with the benchmark. The benchmark was advantageous in identifying SDoH types that were rarely documented in the data used in this work. Subsequently, the pipeline was applied to the oncology documents from BC Cancer to extract 60,717 SDoH factors and associated details. The most frequently extracted SDoH were Tobacco Use, Employment Status, Marital Status, Alcohol Consumption, and Living Status, each of which occurred between 8k and 12k times.
Conclusion: The NLP pipeline successfully extracted a wide array of SDoH factors from clinical notes, achieving commendable performance despite being trained on a relatively small labeled dataset.
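For readers unfamiliar with the metric, the following is a minimal sketch of how an average F1 score across several SDoH types can be computed. It assumes an unweighted (macro) average over per-type scores; the abstract does not state which averaging scheme the thesis uses. The function names and the counts are hypothetical and are not taken from the thesis.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of the per-type F1 scores (macro averaging, an assumption)."""
    if not counts:
        raise ValueError("no SDoH types supplied")
    return sum(f1_score(*c) for c in counts.values()) / len(counts)


# Hypothetical (true positive, false positive, false negative) counts per type.
# These are illustrative values only, not results reported in the thesis.
toy_counts = {
    "Tobacco Use": (90, 10, 10),
    "Employment Status": (80, 20, 20),
    "Marital Status": (45, 5, 15),
}

for name, c in toy_counts.items():
    print(f"{name}: F1 = {f1_score(*c):.3f}")
print(f"Macro-averaged F1 = {macro_f1(toy_counts):.3f}")
```

Under macro averaging, a rarely documented SDoH type counts as much as a frequent one, so weak performance on rare types lowers the average noticeably. A micro average (pooling counts across types before computing F1) would instead be dominated by the most frequent types.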
Item Metadata
Title |
Extraction of social determinants of health from electronic health records using natural language processing
|
Creator |
Chen, Zhenghua
|
Supervisor | |
Publisher |
University of British Columbia
|
Date Issued |
2023
|
Genre | |
Type | |
Language |
eng
|
Date Available |
2024-01-03
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0438396
|
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor |
University of British Columbia
|
Graduation Date |
2024-02
|
Campus | |
Scholarly Level |
Graduate
|
Rights URI | |
Aggregated Source Repository |
DSpace
|