Open Collections
UBC Theses and Dissertations
Integrating heterogeneous environmental permitting spreadsheets through LLM-assisted schema matching and deterministic normalization
Dehghani, Dorna
Abstract
Environmental assessment and permitting processes generate large volumes of administrative data, typically stored in project-specific spreadsheets that record iterative comment–response exchanges across multiple review rounds. Although these spreadsheets contain semantically similar information (such as comment text, response text, dates, agencies, and review status), their schemas, layouts, and value encodings vary across projects. This structural heterogeneity prevents systematic cross-project analysis and limits transparency, reproducibility, and scalability. This thesis formulates the integration of heterogeneous permitting spreadsheets as a schema matching and normalization problem and presents a deterministic end-to-end pipeline that transforms project-level Excel files into a unified long-format dataset. The pipeline combines column-level schema alignment with value-level normalization. Instruction-tuned large language models (LLMs) generate enriched column descriptions and support interpretable semantic equivalence decisions within a structured matching framework, while preserving auditability by serving as contextual representations rather than autonomous transformation logic. Following schema alignment, rule-based transformations standardize dates, agency names, compound identifiers, null placeholders, and wide-format review rounds, producing a consistent representation in which each record corresponds to a single comment–response pair with associated metadata.
Evaluation is conducted using a project-level train–test split to assess generalization to previously unseen spreadsheet structures. Results indicate that LLM-assisted semantic matching combined with deterministic preprocessing enables robust schema alignment across heterogeneous files while maintaining interpretability and reproducibility. Question answering over the normalized dataset further demonstrates that integration reduces analytical effort for cross-project queries.
This work provides a practical and reproducible framework for structured data integration in regulatory and administrative contexts under local, single-GPU deployment constraints.
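As an illustration of the value-level normalization the abstract describes, the following is a minimal Python sketch, not the thesis's actual implementation. It assumes the columns have already been aligned to a hypothetical wide schema (`agency`, `comment_r1`, `response_r1`, `date_r1`, and so on); the placeholder strings, date formats, and function names are likewise assumptions chosen for the example. It shows null-placeholder cleanup, date standardization, and the wide-to-long step in which each output record is a single comment–response pair.

```python
from datetime import datetime

# Hypothetical placeholder strings treated as missing values.
NULL_PLACEHOLDERS = {"", "n/a", "na", "-", "none", "tbd"}
# Hypothetical date layouts; a real pipeline would list the formats it has observed.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def clean(value):
    """Return stripped text, or None for a null placeholder."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_PLACEHOLDERS else text


def normalize_date(value):
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = clean(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def to_long(row, project, max_rounds=3):
    """Unpivot one wide row into one record per comment-response pair."""
    records = []
    for n in range(1, max_rounds + 1):
        comment = clean(row.get(f"comment_r{n}"))
        response = clean(row.get(f"response_r{n}"))
        if comment is None and response is None:
            continue  # this review round is empty for the row
        records.append({
            "project": project,
            "agency": clean(row.get("agency")),
            "round": n,
            "comment": comment,
            "response": response,
            "date": normalize_date(row.get(f"date_r{n}")),
        })
    return records


# Made-up example row: round 1 is filled in, round 2 holds only placeholders.
wide_row = {
    "agency": " Ministry of Environment ",
    "comment_r1": "Clarify water sampling sites.",
    "response_r1": "Sites added to Appendix B.",
    "date_r1": "03/02/2025",
    "comment_r2": "N/A",
    "response_r2": "-",
    "date_r2": "",
}
long_records = to_long(wide_row, project="Project A")
```

Because every step is a fixed rule with no model call, the same input always yields the same output, which is the property the abstract refers to as deterministic normalization. Ambiguous day/month ordering is resolved here simply by the order of `DATE_FORMATS`, which is an assumption of this sketch.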
Item Metadata
| Title | Integrating heterogeneous environmental permitting spreadsheets through LLM-assisted schema matching and deterministic normalization |
| Creator | |
| Supervisor | |
| Publisher | University of British Columbia |
| Date Issued | 2026 |
| Genre | |
| Type | |
| Language | eng |
| Date Available | 2026-04-15 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
| DOI | 10.14288/1.0451952 |
| URI | |
| Degree (Theses) | |
| Program (Theses) | |
| Affiliation | |
| Degree Grantor | University of British Columbia |
| Graduation Date | 2026-05 |
| Campus | |
| Scholarly Level | Graduate |
| Rights URI | |
| Aggregated Source Repository | DSpace |