Extraction of social determinants of health from electronic health records using natural language processing

UBC Theses and Dissertations

Chen, Zhenghua

Abstract

Purpose: Social Determinants of Health (SDoH) have a significant impact on human health outcomes and disparities. Collecting SDoH from electronic health records can facilitate decision-making and downstream research. With thousands of clinical records, automated extraction methods using Artificial Intelligence (AI) would be more efficient and cost-effective. This study aims to automatically extract comprehensive SDoH details from Electronic Health Records (EHR) using an AI pipeline based on Natural Language Processing (NLP).
Methods: One thousand documents from BC Cancer with concentrated SDoH information were carefully selected and labeled to provide the ground truth for training and assessing the NLP models. Two pipelines were applied for SDoH extraction: an open-source pipeline trained on the BC Cancer dataset and an industrial pre-trained solution used as a benchmark. To optimize the performance of the first pipeline, three experiments were conducted to assess the effect of including subtype word positions during training on the extraction performance. The results of the two pipelines were compared, and the best-performing one was subsequently employed to extract SDoH information from a total of 13,258 oncology documents.
Results: The open-source pipeline achieved an average F1 score of 0.88 on the validation dataset for extracting 13 SDoH factors, outperforming the benchmark by 5%. This pipeline also demonstrated a notably superior capability to extract detailed subtypes compared with the benchmark. The benchmark, however, was better at identifying SDoH types that were rarely documented in the data used in this work. Subsequently, the pipeline was applied to the oncology documents from BC Cancer to extract 60,717 SDoH factors and associated details. The most frequently extracted SDoH were Tobacco Use, Employment Status, Marital Status, Alcohol Consumption, and Living Status, each of which occurred between 8,000 and 12,000 times.
Conclusion: The NLP pipeline successfully extracted a wide array of SDoH factors from clinical notes, achieving commendable performance despite being trained on a relatively small labeled dataset.
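
For readers unfamiliar with the metric, the following is a minimal sketch of how a per-factor F1 score and its unweighted average can be computed for an extraction task. It is not the thesis's evaluation code: the abstract does not say which averaging scheme or matching rule was used, so macro-averaging and exact-match comparison are assumptions here. The function names `per_label_f1` and `macro_f1` and the toy records are hypothetical and purely illustrative; only the factor names come from the abstract.

```python
from collections import defaultdict


def per_label_f1(gold, predicted):
    """Return {label: F1} for iterables of (doc_id, label, text) tuples.

    Assumption: a prediction counts as correct only on an exact match
    with a gold tuple; the thesis may use a different matching rule.
    """
    gold, predicted = set(gold), set(predicted)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for item in predicted:
        # item[1] is the SDoH label of this extraction
        counts[item[1]]["tp" if item in gold else "fp"] += 1
    for item in gold - predicted:
        counts[item[1]]["fn"] += 1
    scores = {}
    for label, c in counts.items():
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores[label] = 2 * c["tp"] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 scores (assumed averaging)."""
    return sum(scores.values()) / len(scores) if scores else 0.0


# Invented toy data, not taken from the BC Cancer dataset.
gold = [
    (1, "Tobacco Use", "former smoker"),
    (1, "Marital Status", "married"),
    (2, "Tobacco Use", "never smoked"),
    (2, "Employment Status", "retired"),
]
predicted = [
    (1, "Tobacco Use", "former smoker"),
    (1, "Marital Status", "married"),
    (2, "Tobacco Use", "smoked"),  # wrong span: one false positive, one false negative
    (2, "Employment Status", "retired"),
]

scores = per_label_f1(gold, predicted)
print(scores)
print(macro_f1(scores))
```

Under macro-averaging every factor contributes equally, so a rarely documented factor with a low score pulls the average down as much as a frequent one would.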

Item Metadata

Title	Extraction of social determinants of health from electronic health records using natural language processing
Creator	Chen, Zhenghua
Supervisor	Rajapakshe, Rasika; Lasserre, Patricia
Publisher	University of British Columbia
Date Issued	2023
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2024-01-03
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0438396
URI	http://hdl.handle.net/2429/87045
Degree	Master of Science - MSc
Program	Computer Science
Affiliation	Science, Irving K. Barber Faculty of (Okanagan); Computer Science, Mathematics, Physics and Statistics, Department of (Okanagan)
Degree Grantor	University of British Columbia
Graduation Date	2024-02
Campus	UBCO
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
