Integrating heterogeneous environmental permitting spreadsheets through LLM-assisted schema matching and deterministic normalization

UBC Theses and Dissertations

Dehghani, Dorna

Abstract

Environmental assessment and permitting processes generate large volumes of administrative data, typically stored in project-specific spreadsheets that record iterative comment–response exchanges across multiple review rounds. Although these spreadsheets contain semantically similar information (such as comment text, response text, dates, agencies, and review status), their schemas, layouts, and value encodings vary across projects. This structural heterogeneity prevents systematic cross-project analysis and limits transparency, reproducibility, and scalability. This thesis formulates the integration of heterogeneous permitting spreadsheets as a schema matching and normalization problem and presents a deterministic end-to-end pipeline that transforms project-level Excel files into a unified long-format dataset. The pipeline combines column-level schema alignment with value-level normalization. Instruction-tuned large language models (LLMs) generate enriched column descriptions and support interpretable semantic equivalence decisions within a structured matching framework, while preserving auditability by serving as contextual representations rather than autonomous transformation logic. Following schema alignment, rule-based transformations standardize dates, agency names, compound identifiers, null placeholders, and wide-format review rounds, producing a consistent representation in which each record corresponds to a single comment–response pair with associated metadata. Evaluation is conducted using a project-level train–test split to assess generalization to previously unseen spreadsheet structures. Results indicate that LLM-assisted semantic matching combined with deterministic preprocessing enables robust schema alignment across heterogeneous files while maintaining interpretability and reproducibility. Question answering over the normalized dataset further demonstrates that integration reduces analytical effort for cross-project queries.
This work provides a practical and reproducible framework for structured data integration in regulatory and administrative contexts under local, single-GPU deployment constraints.
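
The value-level normalization step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the placeholder set, the date formats, the column names (`comment_r1`, `response_r1`, and so on), and the helper functions are all hypothetical, chosen only to show how null placeholders, dates, and wide-format review rounds might be reduced deterministically to one record per comment–response pair.

```python
from datetime import datetime

# Hypothetical placeholder strings treated as missing values (assumption).
NULL_PLACEHOLDERS = {"", "n/a", "na", "-", "none", "tbd"}

# Hypothetical date formats, tried in order (assumption).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def clean_value(value):
    """Strip whitespace and map null placeholders to None."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_PLACEHOLDERS else text


def normalize_date(value):
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = clean_value(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def wide_to_long(row, rounds):
    """Emit one record per review round that has a comment or a response."""
    records = []
    for n in rounds:
        # Column naming scheme is illustrative, not taken from the thesis.
        comment = clean_value(row.get(f"comment_r{n}"))
        response = clean_value(row.get(f"response_r{n}"))
        if comment is None and response is None:
            continue  # skip rounds that carry no content
        records.append({
            "project": row.get("project"),
            "agency": clean_value(row.get("agency")),
            "date": normalize_date(row.get("date")),
            "round": n,
            "comment": comment,
            "response": response,
        })
    return records
```

Because every rule is a fixed lookup or format list, the same input always yields the same output, which is the property the abstract refers to as deterministic preprocessing. Schema matching would happen before this step, mapping each project's column headers onto the shared names assumed here.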

Item Metadata

Title	Integrating heterogeneous environmental permitting spreadsheets through LLM-assisted schema matching and deterministic normalization
Creator	Dehghani, Dorna
Supervisor	Lakshmanan, Laks V. S., 1959-
Publisher	University of British Columbia
Date Issued	2026
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2026-04-15
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0451952
URI	http://hdl.handle.net/2429/94079
Degree (Theses)	Master of Science - MSc
Program (Theses)	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2026-05
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace