UBC Theses and Dissertations

Integrating heterogeneous environmental permitting spreadsheets through LLM-assisted schema matching and deterministic normalization

Dehghani, Dorna

Abstract

Environmental assessment and permitting processes generate large volumes of administrative data, typically stored in project-specific spreadsheets that record iterative comment–response exchanges across multiple review rounds. Although these spreadsheets contain semantically similar information—such as comment text, response text, dates, agencies, and review status—their schemas, layouts, and value encodings vary across projects. This structural heterogeneity prevents systematic cross-project analysis and limits transparency, reproducibility, and scalability. This thesis formulates the integration of heterogeneous permitting spreadsheets as a schema matching and normalization problem and presents a deterministic end-to-end pipeline that transforms project-level Excel files into a unified long-format dataset. The pipeline combines column-level schema alignment with value-level normalization. Instruction-tuned large language models (LLMs) generate enriched column descriptions and support interpretable semantic equivalence decisions within a structured matching framework, while preserving auditability by serving as contextual representations rather than autonomous transformation logic. Following schema alignment, rule-based transformations standardize dates, agency names, compound identifiers, null placeholders, and wide-format review rounds, producing a consistent representation in which each record corresponds to a single comment–response pair with associated metadata. Evaluation is conducted using a project-level train–test split to assess generalization to previously unseen spreadsheet structures. Results indicate that LLM-assisted semantic matching combined with deterministic preprocessing enables robust schema alignment across heterogeneous files while maintaining interpretability and reproducibility. Question answering over the normalized dataset further demonstrates that integration reduces analytical effort for cross-project queries.
This work provides a practical and reproducible framework for structured data integration in regulatory and administrative contexts under local, single-GPU deployment constraints.
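
The deterministic normalization step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the column names, the null placeholder set, the date formats, the "Round N Comment" naming pattern, and the function names are all hypothetical, and the column mapping is assumed to have been produced already by the schema matching stage. Agency handling here is limited to whitespace trimming; a real pipeline would map name variants to canonical names.

```python
from datetime import datetime
import re

# Hypothetical placeholder strings treated as missing values.
NULL_PLACEHOLDERS = {"", "n/a", "na", "-", "none", "tbd"}

# Hypothetical date layouts; a real pipeline would list the formats it observes.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")

# Hypothetical pattern for wide-format round columns such as "Round 2 Comment".
ROUND_COLUMN = re.compile(r"^round\s*(\d+)\s+(comment|response)$", re.IGNORECASE)


def clean_value(value):
    """Return a stripped string, or None for a null placeholder."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_PLACEHOLDERS else text


def normalize_date(value):
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = clean_value(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def to_long_format(row, column_mapping, project_id):
    """Turn one wide spreadsheet row into one record per review round."""
    # Rename source columns to the unified schema; unmapped columns keep their name.
    renamed = {column_mapping.get(name, name): value for name, value in row.items()}

    # Group the comment and response cells by review round number.
    rounds = {}
    for name, value in renamed.items():
        match = ROUND_COLUMN.match(name)
        if match:
            round_no = int(match.group(1))
            field = match.group(2).lower()
            rounds.setdefault(round_no, {})[field] = clean_value(value)

    records = []
    for round_no in sorted(rounds):
        pair = rounds[round_no]
        if pair.get("comment") is None and pair.get("response") is None:
            continue  # skip rounds that were never filled in
        records.append({
            "project_id": project_id,
            "agency": clean_value(renamed.get("agency")),
            "date": normalize_date(renamed.get("date")),
            "round": round_no,
            "comment": pair.get("comment"),
            "response": pair.get("response"),
        })
    return records
```

Because every transformation is a fixed rule applied after the mapping is decided, the same input row and mapping always yield the same records, which is the property the abstract refers to as auditability. Values that match no known date format become missing rather than being guessed.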

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International