Introduction
Literature review
Research methods
Data Collection
NLP Data Preparation
Training, cross-validation, and model selection
Model Validation
Results and Discussion
EBM database investigation and collected data
Potential of NLP-Based ML for Building Material Eco-Label Database Wrangling
Conclusions
Appendix
Introduction
The construction industry, a major contributor to global greenhouse gas emissions and resource consumption, is increasingly embracing sustainable practices to mitigate its environmental impact [1]. Green building materials (GBM), derived from renewable resources or recycled waste, offer a promising avenue for reducing the carbon footprint of buildings and promoting circular economies. However, the widespread adoption of GBMs is hindered by the lack of comprehensive and accessible information about their properties, performance, and environmental impacts [2].
Green building (GB) eco-label material databases serve as comprehensive information repositories, simplifying the comparison and selection of sustainable building materials [3]. They enhance transparency and credibility through eco-labels and certifications, fostering market growth for environmentally friendly alternatives. These databases play a crucial role in informing eco-labels, providing information in a user-friendly format. However, the current landscape of EBM databases is fragmented, with data scattered across various platforms, maintained by different organizations, and presented in diverse formats [4]. This fragmentation poses challenges for stakeholders attempting to compare and select EBMs based on specific criteria. Inconsistent terminology, data organization, and quality further complicate the extraction of meaningful insights.
On the other hand, the combination of NLP and ML has the potential to revolutionize data management tasks such as error detection, data cleaning, integration, and query inference within databases [5]. By automating the extraction and codification of scientific literature, NLP techniques can generate rich datasets essential for data science and ML applications [6]. Additionally, NLP-based ML models have demonstrated effectiveness in analyzing incident narratives and classifying data accurately. In short, the synergy between NLP and ML offers a powerful approach for enhancing database efficiency and accuracy through automated data extraction, management, and analysis.
Considering these challenges and leveraging the advantages of NLP, this research aims to introduce a conceptual model that combines natural language processing (NLP) and machine learning (ML) to streamline the wrangling of EBM databases. NLP techniques enable the extraction of relevant information from unstructured text, such as product descriptions, technical specifications, and environmental certifications. ML algorithms, on the other hand, can learn patterns in the data to standardize terminology, resolve inconsistencies, and integrate disparate information sources. The following sections of this manuscript will delve into a comprehensive literature review of existing research on GBM databases and the application of NLP and ML in data management. Subsequently, the research methods employed in this study, including data collection, NLP data preparation, and the training and validation of ML models, will be elucidated. The results and discussion section will present the findings of the research, including the analysis of collected data and the performance of the ML-NLP models in predicting eco-labels. Finally, the conclusion will summarize the key findings, discuss the implications of the research, and suggest potential avenues for future research.
Literature review
Recently, various studies have sought to automate the materials information management process by applying Information Technology (IT) [2]. Among them, BIM has been identified as a potential candidate that can assist in integrating the fragmented Architecture, Engineering, and Construction (AEC) industry by eliminating inefficiencies and redundancies, improving collaboration and communication, and enhancing overall productivity [7]. Thus, studies have commonly focused on managing and retrieving material information to perform calculations aimed at optimizing the sustainability of the building envelope in BIM models [8, 9]. These optimizations aim to reduce a building’s carbon emissions [10, 11, 12], lower life cycle costs [3], or assess compatibility with standard GB systems [8, 13]. Some studies also conducted thermal performance analyses using various sustainability simulation tools, applying material data extracted from BIM models [14]. With the conventional catalog-based method, regularly feeding the latest information into BIM models poses a challenge, because fresh data must be uploaded in distinct steps to keep product information up to date. This ongoing need to update SBM management often leads to a time-consuming and labor-intensive process, increasing the workload of the project team.
In summary, existing research on automating the management of SBM information (Table 1), often integrated with Building Information Modeling (BIM), is highly valued for its role in enhancing design and construction efficiency. The effectiveness of these approaches depends heavily on the quality of the gathered and analyzed material data. However, significant challenges remain in identifying and collecting data for this process due to inconsistencies across databases. Numerous studies have aimed to address this issue by developing strategic guidelines, database management standards, and automated systems for information collection. Despite these efforts, the rapidly growing field of sustainable building materials (SBM) and the ever-evolving nature of SBM data make it difficult to keep these methods current. Combining various methods does not ensure access to the latest information, as updating the data requires substantial human resources, which can be both costly and time-consuming.
Table 1.
Existing research approaches for improving the collection and management of SBM Information
Approach | Main purpose | Limitation |
Centralizing Data Repositories (Utilizing database and BIM Data) | - Utilize APIs from various databases to look up information, leveraging resources from different databases [15]. - Integrate databases and computational tools for real-time assessment in BIM [8, 16] and update simulation baselines to ensure accurate and up-to-date simulations [17] | - Uncertain results [18] due to varied participants. The quantity and quality of BIM databases are limited [1]. - Interoperability gap between BIM platforms and tools [19]. - Operating a centralized platform is hindered by the large and continuously growing volume of SBM data. |
Strategies for Creating Extensive Databases | - Standardize data structures and formats to ensure consistency and compatibility across different databases [20, 21]. | - Due to variations in local scenarios, no universally adopted global guideline exists [4], leading to inconsistencies [15]. |
Automating Data Classification | - Propose rule-based ontologies for automated data classification, reducing dependency on labor [2, 12] | - Rules must be manually modified for each data source [2] |
On the other hand, ML-NLP classifiers outperform rule-based ontology systems by continuously learning and adjusting to new and evolving scenarios. NLP models have shown their effectiveness in classifying and identifying information in data from various fields, such as medicine [22], safety engineering [23], and finance [24]. Therefore, this study proposes the use of web crawlers and ML-NLP models to improve the collection and analysis of SBM information from various sources.
Then, the gathered data are evaluated by a rule-based assessment against the LEED requirements, which are central to the project’s ongoing evaluation needs. This approach is designed to mimic the project team’s process of collecting, analyzing, and managing SBM information, but automatically collects and analyzes data from various sources, thus minimizing the risk of overlooking available SBM information.
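As a hedged illustration of such a rule-based screen, the sketch below checks assumed material records against an illustrative accepted-label set; the label names and the qualification rule are placeholders, not actual LEED requirements.

```python
# Hedged sketch of a rule-based screen: material records are checked against an
# illustrative set of accepted eco-labels. The label names and the qualification
# rule below are placeholders, NOT actual LEED requirements.

# Eco-labels an illustrative credit might accept (assumed for this sketch).
ACCEPTED_LABELS = {"EPD", "FSC", "Cradle to Cradle", "GreenGuard"}

def assess_material(record: dict) -> dict:
    """Flag whether a collected material record satisfies the illustrative rule."""
    labels = set(record.get("eco_labels", []))
    matched = labels & ACCEPTED_LABELS
    return {
        "name": record.get("name", "unknown"),
        "matched_labels": sorted(matched),
        "qualifies": bool(matched),
    }

materials = [
    {"name": "Low-VOC Paint", "eco_labels": ["GreenGuard", "HPD"]},
    {"name": "Standard Concrete", "eco_labels": []},
]
results = [assess_material(m) for m in materials]
```

In practice, each LEED credit would contribute its own rule function, and the record fields would come from the crawled databases rather than hand-written dictionaries.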
Research methods
In this study, web-crawling techniques and NLP combined with ML were applied. Interviews with nine experts in GB consulting were conducted to identify the primary steps in sourcing and evaluating sustainable construction material design options. The selected experts each possess a minimum of 10 years of experience and are employed at LEED consulting firms, developer companies, or the certification authority (GBCI). Web crawlers were then developed to collect specified data on sustainable building materials (SBM), and NLP-ML models were trained to recognize and distinguish the sustainability properties of these materials. The data collected by the web crawlers were analyzed to identify the attributes of the materials. These attributes (or eco-labels) were categorized based on the guidelines of the LEED standard system, which is highly recognized and shares similarities with other well-known GB standard systems. Pseudocode for this process is presented in Table 2.
Table 2.
Pseudocode for NLP-Based ML EBM Database Wrangling
Data Collection
The study’s foundation is built upon a robust dataset of EBM amassed through web crawling and scraping. This data collection process targets diverse sources to ensure comprehensive coverage, as recommended by the interviewed experts. Web crawlers systematically navigate these sites, following links to index pages containing EBM data. The specific databases included in this study are:
ㆍhttps://building-material-scout.com
ㆍhttps://transparencycatalog.com/
ㆍhttps://www.originmaterials.com/
ㆍhttps://www.ecomedes.com/
ㆍhttps://spot.ul.com/
ㆍhttps://www.energystar.gov/products
ㆍhttps://www.epeat.net/
ㆍhttps://www.epa.gov/watersense
Data collection from these databases involves extracting relevant information, including material names, descriptions, technical specifications, environmental certifications, manufacturer details, and any other available attributes related to sustainability, performance, and compliance. Upon identifying relevant pages, web scraping tools extract specific data elements. The data is collected in both structured (e.g., tables, databases) and unstructured (e.g., text descriptions) formats. The extracted data is then meticulously organized and stored in structured formats like CSV. This organized data serves as the input for subsequent NLP and ML analyses.
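To illustrate the extract-and-store step, here is a minimal sketch under assumed markup; the HTML snippet, CSS class names, and field names are hypothetical, and a production scraper would use a proper parser tuned to each target site.

```python
# Hedged sketch of the extract-and-store step: pull (name, label) fields out of
# scraped markup and write them to CSV. The HTML snippet and class names are
# hypothetical; a real scraper would use a proper parser tuned to each site.
import csv
import io
import re

html = """
<div class="product"><span class="name">Recycled Gypsum Board</span>
<span class="label">EPD</span></div>
<div class="product"><span class="name">FSC Plywood</span>
<span class="label">FSC</span></div>
"""

# Naive non-greedy extraction of (material name, eco-label) pairs.
pattern = re.compile(
    r'class="name">([^<]+)</span>.*?class="label">([^<]+)</span>', re.S)
rows = pattern.findall(html)

# Store in the structured CSV format used as input for the NLP/ML stage.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["material_name", "eco_label"])
writer.writerows(rows)
csv_text = buf.getvalue()
```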
NLP Data Preparation
The raw EBM data collected through web crawling and scraping requires meticulous preparation to be suitable for subsequent NLP and ML analysis. This preparation stage involves several crucial steps that transform unstructured text into a clean and usable format.
-Data Cleaning: This step involves eliminating redundancies, such as duplicate entries, and rectifying any errors or inconsistencies present in the raw data. Missing values are handled appropriately, either by imputation or removal, depending on their extent and potential impact on analysis.
-Text Normalization: Textual data is standardized by converting all characters to lowercase, removing punctuation marks, and unifying measurement units. This step ensures consistency and eliminates variations that could hinder accurate analysis.
-Tokenization: The normalized text is then divided into smaller units, such as words or phrases, called tokens. This segmentation facilitates further processing and analysis by enabling the identification of individual semantic units within the text.
-Stop Word Removal: Common words such as “the,” “and,” and “of,” which carry little semantic meaning, are removed. This step reduces noise and focuses the analysis on more informative terms.
-Stemming/Lemmatization: Words are reduced to their base or root form. Stemming involves removing suffixes, while lemmatization considers the context to derive the dictionary form of the word. This step reduces dimensionality and groups related words together.
-Feature Extraction: The prepared text is transformed into numerical representations that machine learning models can interpret. Common techniques include bag-of-words, TF-IDF (the approach used in this study), and word embeddings. These representations capture semantic relationships between words and enable ML algorithms to learn patterns within the data.
These NLP data preparation steps were used to ensure that the EBM data is clean, consistent, and suitable for extracting the available eco-labels. This lays the groundwork for training robust ML models capable of extracting meaningful insights from the data and wrangling EBM databases effectively.
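As a rough illustration of the steps above, the following minimal sketch processes a toy description; the stop-word list is a small assumed sample, and the suffix-stripper is a crude stand-in for a real stemmer such as Porter’s.

```python
# Hedged sketch of the preparation steps on a toy description. The stop-word
# list is a small assumed sample, and the suffix-stripper is a crude stand-in
# for a real stemmer (e.g. Porter) or a lemmatizer.
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "a", "is", "are", "with"}

def prepare(text: str) -> list:
    # Text normalization: lowercase and strip punctuation.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Tokenization: split on whitespace.
    tokens = text.split()
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude stemming: drop a few common suffixes (illustrative only).
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

desc = "The panels are made of recycled fibers, with certified low emissions."
tokens = prepare(desc)
features = Counter(tokens)  # bag-of-words counts; TF-IDF would build on these
```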
Training, cross-validation, and model selection
Supervised learning algorithms were examined in this study since they are commonly used for material classification tasks. Such algorithms include artificial neural networks (ANN), support vector machines (SVM), Naive Bayes (NB), and others [25]. In detail, NB is a probabilistic algorithm that applies Bayes’ theorem with the assumption of independence between features. Despite its simplicity, NB has been effective in text classification tasks such as spam filtering and sentiment analysis [26]. SVMs are commonly used in text classification tasks such as sentiment analysis and spam detection, as they effectively separate different classes in feature space by finding an optimal hyperplane [27]. Decision tree algorithms offer several advantages, including effectiveness in classification, high speed, easy interpretability, and the ability to handle both classification and regression problems [28]. Random Forest is an ensemble learning algorithm that combines multiple decision trees to make predictions; it has been successfully applied to NLP classification tasks, including sentiment analysis, topic classification, and text categorization [23]. These algorithms offer advantages such as interpretability, efficiency, and the ability to handle both categorical and numerical features [25]. However, the choice of algorithm should consider the specific requirements of the task, the characteristics of the dataset, and the desired interpretability of the model. Because the information trends of eco-labels differ across concepts, markets, and material objects, this study examined the five algorithms mentioned above on the training and cross-validation dataset of each eco-label, then selected the most appropriate algorithm based on accuracy and F1 score for the final workflow.
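The selection step described above can be sketched as follows; the dataset is randomly generated stand-in data rather than the study’s eco-label corpus, and only three of the named classifiers are shown.

```python
# Hedged sketch of the selection step: compare several of the classifiers named
# above with 5-fold cross-validation and keep the best by mean F1. The dataset
# is randomly generated stand-in features, not the study's eco-label corpus.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

candidates = {
    "NB": GaussianNB(),
    "SVM": SVC(kernel="linear"),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean F1 over 5 folds for each candidate, as in the selection step above.
mean_f1 = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}
best_name = max(mean_f1, key=mean_f1.get)
```

In the study’s workflow, this comparison would be repeated once per eco-label, since each eco-label has its own dataset and class balance.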
Model Validation
In binary classification, classifiers’ predictive performance relies on four key statistics: true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). In balanced class scenarios, these statistics enable the computation of evaluation metrics such as accuracy, precision (P), recall (R), and F1 [29]. However, with imbalanced class distributions, such metrics may mislead. For instance, in a dataset with 90 negatives and 10 positives, a classifier predicting only negatives would still achieve 90% accuracy, appearing effective when it is not. In such scenarios, it is vital to consider both the TP rate and the FP rate [30]. The TP rate gauges correctly classified positives against actual positives, while the FP rate (FPR) measures incorrectly classified negatives against true negatives. Given the risk of skewed predictions from imbalanced data, the Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) offers a robust evaluation, considering both sensitivity and specificity and remaining consistent across varied base rates [31]. ROC graphs visually represent classifier performance through the TP rate and FP rate, with the curve’s proximity to the upper left corner indicating better prediction. An ideal ROC curve has an Area Under the ROC (AUROC) near unity, signifying impeccable classification. AUROC values are categorized as: 0.9-1.0 excellent, 0.8-0.9 good, 0.7-0.8 fair, 0.6-0.7 poor, and 0.5-0.6 failure [32]. Consequently, the combination of evaluation metrics such as accuracy, P, R, F1, and the ROC curve provides a comprehensive and in-depth assessment of the reliability and effectiveness of the NLP models in this context. Hence, the ROC-AUC metric serves as the foundational criterion for choosing the most suitable algorithms and models for each eco-label’s dataset. To further validate the chosen models, metrics such as accuracy, P, R, and F1 were computed on independent test datasets.
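The 90/10 example above can be made concrete with a short numeric check; the labels are synthetic and mirror the argument, not the study data.

```python
# Numeric check of the 90/10 example above: a degenerate classifier that always
# predicts the majority class reaches 90% accuracy yet only chance-level
# ROC-AUC. Labels are synthetic; this mirrors the argument, not the study data.
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0] * 90 + [1] * 10   # 90 negatives, 10 positives
y_pred = [0] * 100             # always predict the majority (negative) class
y_score = [0.0] * 100          # constant decision scores

acc = accuracy_score(y_true, y_pred)   # high despite learning nothing
auc = roc_auc_score(y_true, y_score)   # chance level, exposing the failure
```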
Results and Discussion
EBM database investigation and collected data
A brief green material assessment process is presented in Figure 1. According to the interviewed experts, sustainability aims, including green and net-zero building aspirations, are typically outlined in the early design stages. These goals are predicated on myriad factors such as energy efficiency, environmental implications, and the health and well-being of occupants, necessitating judicious decision-making in the project’s conception [4]. To meet these benchmarks, selected materials should exhibit distinct qualities, verifiable through eco-labels authenticated by third parties. Materials should align with set sustainability criteria, reflecting attributes like recycled content, resource renewability, reduced embodied energy, minimal VOC emissions, and proven durability [33]. A holistic approach necessitates the active involvement of a cross-section of stakeholders: from architects and designers to contractors and suppliers, each offering invaluable expertise to fortify the sustainability quotient of chosen materials [34]. These eco-labels furnish standardized insights on the environmental efficacy and sustainability features of construction materials, rendering them invaluable to architects, designers, and other decision-makers. By distilling complex data into digestible information, eco-labels streamline the decision-making workflow, offering clear insights into the environmental bona fides of materials [35]. This facilitates a comparative evaluation of materials, focusing on LEED criteria like environmental repercussions, energy conservation, among others [35]. 
As a result of the database investigation, a total of 16 eco-labels were tested in this study, including WaterSense (US EPA), ENERGY STAR, EPEAT, Cradle to Gate (C2G), Environmental Product Declarations (EPD), Forest Stewardship Council (FSC), Cradle to Cradle (C2C), UN Global Compact, Extended Producer Responsibility (EPR), BioPreferred®, Material Reuse, Recycled Content, GreenScreen, Health Product Declaration (HPD), and UL GREENGUARD.
From January 2021 to March 2021, extensive data collection was undertaken using the developed crawlers, amassing over 300,000 initial observations. Following qualification procedures, the dataset was manually refined to 64,350 data points. A significant portion of the initial observations was removed due to duplication or a lack of essential information. These discrepancies align with concerns noted in earlier studies regarding inherent deficiencies in current databases, including data duplication and inconsistent information formats, often stemming from variations in the proficiency of data-entry personnel. A statistical analysis of the collected data (Table 3) highlights the issues arising from the unbalanced data typical of existing databases. This imbalance underscores the emerging need for hybrid models to accurately discern material information in the future, rather than depending solely on databases. Within the databases, most sustainable construction materials are distributed among the following material groups: Appliances (100,048), Building Finishes (91,291), Building Furniture (41,782), Office Electronics (35,151), and HVAC/Mechanical (16,466). Regarding eco-label representation, dominant labels include ENERGY STAR® Certified (84,964), UL GREENGUARD (79,811), Environmental Product Declaration (EPD) (31,037), Green Label Plus (19,383), and BioPreferred® (16,869). These figures indirectly provide insights into the prevailing sustainable product market dynamics, suggesting a tilt towards premium product categories, notably interior finishes and electrical appliances. Evaluating the eco-label market panorama, there is a discernible inclination towards attributes such as energy efficiency and user health protection, with diminished emphasis on pivotal sustainability facets like material origin and emissions.
Grasping these market tendencies equips project teams with the requisite acumen to judiciously select suitable materials for their projects. The imbalance in the number of samples for these eco-labels can lead to a lack of training data for the models. Thus, the special nature of the data requires caution when validating the effectiveness of the NLP-ML models built in this study.
Table 3.
Number of samples collected for each eco-label and sustainable attributes
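One common mitigation for such label imbalance, not necessarily the one used in this study, is inverse-frequency class weighting; the minimal sketch below uses hypothetical label counts echoing the 90/10 pattern.

```python
# Hedged sketch: inverse-frequency class weights, one common mitigation for the
# label imbalance noted above (the heuristic behind class_weight="balanced" in
# several scikit-learn classifiers). The label counts here are hypothetical.
from collections import Counter

def balanced_weights(labels: list) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

labels = ["no_label"] * 90 + ["ENERGY STAR"] * 10
weights = balanced_weights(labels)
# Minority eco-label samples are up-weighted, the majority class down-weighted.
```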
Potential of NLP-Based ML for Building Material Eco-Label Database Wrangling
During training and cross-validation of the models, analysis of the ROC-AUC scores (detailed in Figure 2) [32] for the 16 NLP-ML predictive models across various eco-labels revealed distinct performance trends. In all groups of models trained to predict eco-labels, the majority of models showed good predictive performance, with ROC-AUC values between 0.70 and 1.0 [36]. The Random Forest (RF) algorithm consistently emerged as a top-performing classifier, achieving notably high AUC scores in models such as “PBTs”, “Cradle to Gate”, and “UL GREENGUARD”. Conversely, the performance of the SVM algorithm varied, excelling for eco-labels like “WaterSense” but lagging with an AUC of 0.483 for “UN Global Compact”. Considering precision, recall, and F1-score for the various eco-labels, as outlined in Appendix 1, two algorithms consistently demonstrated superior outcomes: RF and SVM showed remarkable reliability across the datasets.
Models were selected during cross-validation to match the dataset of each eco-label (results shown in Appendix 1) and then evaluated on the test dataset. The metrics assessed encompass recall, precision, and F1-score across both the cross-validation and testing datasets (see Table 4). For many eco-labels, the RF algorithm consistently delivered commendable performance, exemplified by its F1-scores for attributes like “PBTs” (94.19% in cross-validation, 94.72% in testing) and “Cradle to Gate” (92.78% in cross-validation, 93.25% in testing). Its capacity to yield high F1-scores indicates an adept balance of precision and recall, making it proficient at identifying true positive cases while minimizing false positives and false negatives. Conversely, the SVM algorithm showed exceptional performance for specific attributes: it achieved perfect scores across all performance metrics for the “UL” eco-label during cross-validation and an F1-score of 96.56% for the “WS” eco-label in the same phase. Attributes such as “PBTs” and “Cradle to Gate” showed consistency between the two datasets, with only minor differences in F1-scores. This consistency is indicative of the models’ generalization capabilities, suggesting their applicability in real-world scenarios beyond the confines of the training dataset. Nevertheless, a degree of variability in performance exists among different attributes, even within the same algorithm, reflecting the intrinsic complexities and unique characteristics each attribute presents.
Table 4.
Recall, Precision and F-measure results of selected ML-NLP models on test data set
Conclusions
This study successfully applied natural language processing (NLP) and machine learning (ML) techniques to the challenge of consolidating and extracting valuable information from disparate eco-labeled building materials (EBM) databases. By automating the standardization of terminology, resolution of inconsistencies, and integration of diverse data sources, our approach offers a streamlined solution to the labor-intensive task of EBM data wrangling.
The curated dataset of 64,350 data points, compiled through investigation of EBM databases and web scraping/crawling, served as the foundation for subsequent analysis. Leveraging NLP and ML algorithms, we demonstrated the effective classification of EBM attributes across various eco-labels, with the Random Forest algorithm consistently exhibiting superior performance. Notably, high AUC scores and F1-scores for attributes like “PBTs” and “Cradle to Gate” showcase the methodological efficacy in accurately categorizing EBM data.
The resulting structured EBM database may serve as a resource for stakeholders to efficiently query and analyze relevant information, thereby informing decision-making processes in the selection and utilization of sustainable building materials. By enhancing the accessibility and organization of EBM data, this research contributes to the broader efforts towards promoting sustainable construction practices within the built environment.
However, this study has some limitations. The research focused on a specific set of eco-labels and databases, which may not be exhaustive. Additionally, the performance of the ML-NLP models varied across different eco-labels, indicating the need for further refinement and optimization. Future research could explore the inclusion of additional eco-labels and databases, as well as the development of more sophisticated ML-NLP models to improve the accuracy and consistency of SBM information extraction and analysis. Furthermore, the integration of this framework with Building Information Modeling (BIM) could enhance the practical application of the findings in real-world construction projects.