UBC Theses and Dissertations

Extraction of social determinants of health from electronic health records using natural language processing

Chen, Zhenghua

Abstract

Purpose: Social Determinants of Health (SDoH) have a significant impact on human health outcomes and disparities. Collecting SDoH from electronic health records can facilitate decision-making and downstream research. With thousands of clinical records, automated extraction methods using Artificial Intelligence (AI) would be more efficient and cost-effective. This study aims to autonomously extract comprehensive SDoH details from Electronic Health Records (EHR) using a Natural Language Processing (NLP) based AI pipeline.

Methods: One thousand documents from BC Cancer with concentrated SDoH information were carefully selected and labeled to provide the ground truth for training and assessing the NLP models. Two pipelines were applied for SDoH extraction: an open-source pipeline trained on the BC Cancer dataset and an industrial pre-trained solution used as a benchmark. To optimize the performance of the first pipeline, three experiments were conducted to justify the effect of including subtype word positions during training on the extraction performance. The results of two pipelines were compared and the best-performing one was subsequently employed for the extraction of SDoH information from a total of 13,258 oncology documents.

Results: The open-source pipeline gained an average F1 score of 0.88 on the validation dataset for extracting 13 SDoH factors, outperforming the benchmark by 5%. This pipeline also demonstrated a notably superior capability to extract detailed subtypes compared with the benchmark. The benchmark was advantageous in identifying rarely documented SDoH types in the data for extraction in this work. Subsequently, the pipeline was applied to the oncology documents from BC Cancer to extract 60,717 SDoH factors and associated details. The most frequently extracted SDoH were Tobacco Use, Employment Status, Marital Status, Alcohol Consumption, and Living Status, which occurred from 8k to 12k times.

Conclusion: The NLP pipeline successfully extracted a wide array of SDoH factors from clinical notes, achieving commendable performance despite being trained on a relatively small labeled dataset.

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International