General Article

International Journal of Sustainable Building Technology and Urban Development. 30 September 2024. 367-380
https://doi.org/10.22712/susb.20240026

ABSTRACT


MAIN

  • Introduction

  • Literature review

  • Research methods

  •   Data Collecting

  •   NLP Data Preparation

  •   Training, cross-validation and selection of the models

  •   Models Validation

  • Results and Discussion

  •   EBM Databases investigation and collected data

  •   Potential of NLP based ML on Building Material Eco-Label Databases Wrangling

  • Conclusions

  • Appendix

Introduction

The construction industry, a major contributor to global greenhouse gas emissions and resource consumption, is increasingly embracing sustainable practices to mitigate its environmental impact [1]. Green building materials (GBM), derived from renewable resources or recycled waste, offer a promising avenue for reducing the carbon footprint of buildings and promoting circular economies. However, the widespread adoption of GBMs is hindered by the lack of comprehensive and accessible information about their properties, performance, and environmental impacts [2].

Green building (GB) eco-label material databases serve as comprehensive information repositories, simplifying the comparison and selection of sustainable building materials [3]. They enhance transparency and credibility through eco-labels and certifications, fostering market growth for environmentally friendly alternatives. These databases play a crucial role in informing eco-labels, providing information in a user-friendly format. However, the current landscape of eco-labeled building material (EBM) databases is fragmented, with data scattered across various platforms, maintained by different organizations, and presented in diverse formats [4]. This fragmentation poses challenges for stakeholders attempting to compare and select EBMs based on specific criteria. Inconsistent terminology, data organization, and quality further complicate the extraction of meaningful insights.

On the other hand, the combination of natural language processing (NLP) and machine learning (ML) has the potential to revolutionize data management tasks such as error detection, data cleaning, integration, and query inference within databases [5]. By automating the extraction and codification of scientific literature, NLP techniques can generate rich datasets essential for data science and ML applications [6]. Additionally, NLP-based ML models have demonstrated effectiveness in analyzing incident narratives and classifying data accurately. Overall, the synergy between NLP and ML presents a powerful approach for enhancing databases’ efficiency and accuracy through automated data extraction, management, and analysis.

To address these challenges, this research leverages the advantages of NLP and aims to introduce a conceptual model that combines natural language processing (NLP) and machine learning (ML) to streamline the wrangling of EBM databases. NLP techniques enable the extraction of relevant information from unstructured text, such as product descriptions, technical specifications, and environmental certifications. ML algorithms, in turn, can learn patterns in the data to standardize terminology, resolve inconsistencies, and integrate disparate information sources. The following sections of this manuscript review existing research on GBM databases and the application of NLP and ML in data management. The research methods employed in this study are then described, including data collection, NLP data preparation, and the training and validation of ML models. The results and discussion section presents the findings, including the analysis of the collected data and the performance of the ML-NLP models in predicting eco-labels. Finally, the conclusion summarizes the key findings, discusses the implications of the research, and suggests potential avenues for future research.

Literature review

Recently, various studies have sought to automate the management of material information by applying Information Technology (IT) [2]. Among the available technologies, Building Information Modeling (BIM) has been identified as a potential candidate that can assist in integrating the fragmented Architecture, Engineering, and Construction (AEC) industry by eliminating inefficiencies and redundancies, improving collaboration and communication, and enhancing overall productivity [7]. Consequently, many studies have focused on managing and retrieving material information to perform calculations aimed at optimizing the sustainability of the building envelope in BIM models [8, 9]. These optimizations target a building’s carbon emissions [10, 11, 12] or life cycle costs [3], or assess compatibility with standard GB systems [8, 13]. Some studies also conducted thermal performance analyses using various sustainability simulation tools, applying material data extracted from BIM models [14]. With the conventional catalog-based method, however, keeping BIM models supplied with the latest information is a challenge, because fresh data must be uploaded through separate actions to keep product information up to date. This ongoing need to update sustainable building material (SBM) information is time-consuming and labor-intensive, and it increases the workload of the project team.

In summary, existing approaches to automating the management of sustainable building material (SBM) information (Table 1), often integrated with Building Information Modeling (BIM), are highly valued for their role in enhancing design and construction efficiency. Their effectiveness depends heavily on the quality with which material data are gathered and analyzed. However, identifying and collecting these data remains difficult because of inconsistencies across databases. Numerous studies have aimed to address this issue by developing strategic guidelines, database management standards, and automated systems for information collection. Despite these efforts, the rapid growth of the SBM field and the ever-evolving nature of SBM data make it challenging to keep these methods current. Combining various methods does not ensure access to the latest information, as updating the data requires substantial human resources, which can be both costly and time-consuming.

Table 1.

Existing research approaches for improving the collection and management of SBM Information

| Approach | Main purpose | Limitation |
|---|---|---|
| Centralizing Data Repositories (Utilizing database and BIM Data) | Utilize APIs from various databases to look up information, leveraging resources from different databases [15]; integration of databases and computational tools for real-time assessment in BIM [8, 16] and updating baselines for simulation data to ensure accurate and up-to-date simulations [17] | Uncertain results [18] due to various participants; quantity and quality of BIM databases is limited [1]; interoperability gap between BIM platforms and tools [19]; the potential to operate a centralized platform is hindered by the large and continuously growing volume of SBM data |
| Strategies to Creating Extensive Databases | Standardizing data structures and formats to ensure consistency and compatibility across different databases [20, 21] | Due to variations in local scenarios, no universally adopted global guideline exists [4], leading to inconsistencies [15] |
| Automating Data Classification | Propose rule-based ontologies for automated data classification, reducing dependency on labor [2, 12] | Rules need to be modified manually for each data source object [2] |

On the other hand, ML-NLP classifiers can outperform rule-based ontology systems because they continue to learn and adjust to new and evolving scenarios. NLP models have proven effective in classifying and identifying information in data from various fields, such as medicine [22], safety engineering [23], finance, and others [24]. Therefore, this study proposes the use of web crawlers and ML-NLP models to improve the collection and analysis of SBM information from various sources.

Then, the gathered data will be evaluated by a rule-based assessment according to the LEED requirements, which are crucial for ongoing evaluation needs in the project. This approach is designed to mimic the project team’s processes of collecting, analyzing, and managing SBM information but will automatically collect and analyze data from various sources, thus minimizing the risk of overlooking available SBM information.
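As a rough illustration of the rule-based step, the sketch below groups the eco-labels detected for one material by the LEED criterion they support. The label-to-criterion pairs follow the grouping in Table 3; the names `LEED_RULES` and `assess_material`, and the input format (a list of label strings), are assumptions made for illustration and are not taken from the study's implementation.

```python
# Hypothetical rule table: eco-label -> LEED criterion (grouping follows Table 3).
BPDO = "Building Product Disclosure and Optimization"
LEED_RULES = {
    "EPD": f"{BPDO} - Environmental Product Declarations",
    "FSC": f"{BPDO} - Sourcing of Raw Materials",
    "Recycled content": f"{BPDO} - Sourcing of Raw Materials",
    "HPD": f"{BPDO} - Material Ingredients",
    "Cradle to Cradle": f"{BPDO} - Material Ingredients",
    "ENERGY STAR": "Energy Performance",
    "EPEAT": "Energy Performance",
    "WaterSense": "Indoor Water Use Reduction",
}


def assess_material(detected_labels):
    """Group a material's detected eco-labels by the LEED criterion they support."""
    result = {}
    for label in detected_labels:
        criterion = LEED_RULES.get(label)
        if criterion is None:
            continue  # label is not covered by the rule table
        result.setdefault(criterion, []).append(label)
    return result


example = assess_material(["ENERGY STAR", "FSC", "Unknown label"])
```

Keeping the rules in a plain dictionary means that a new eco-label only requires a new entry, while labels without a rule are skipped rather than guessed.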

Research methods

In this study, web-crawling techniques and NLP with ML were applied. Interviews with nine experts in GB consulting were conducted to identify the primary steps in sourcing and evaluating sustainable construction material design options. Each expert has at least 10 years of experience and is employed at a LEED consulting firm, a developer company, or the certification authority (GBCI). Web crawlers were then developed to collect specified data on sustainable building materials (SBM), and NLP-ML models were trained to recognize and distinguish the sustainability properties of these materials. The data collected by the web crawlers were analyzed to identify the attributes of the materials. These attributes (or eco-labels) were categorized based on the guidelines of the LEED standard system, which is highly recognized and shares similarities with other well-known GB standard systems. Pseudocode for this process is presented in Table 2.

Table 2.

Pseudocode for NLP-Based ML EBM Database Wrangling

## 1. Data Collection
1. **Initialize:**
   * `target_websites` = List of URLs (EBM databases, manufacturers, etc.)
   * `ebm_data` = Empty list to store EBM data
2. **Crawl and Scrape (using BeautifulSoup):**
   * **FOR** each `url` in `target_websites`:
     * `html_content` = `fetch_html(url)`
     * `extracted_data` = `scrape_ebm_data(html_content)`
     * Append `extracted_data` to `ebm_data`
3. **Save Data:**
   * Save `ebm_data` as CSV file (e.g., "ebm_raw_data.csv")
## 2. NLP Data Preparation
1. **Load Data:**
   * `ebm_df` = Load "ebm_raw_data.csv" into pandas DataFrame
2. **Preprocessing:**
   * **FOR** each row in `ebm_df`:
     * `text_data` = row['text_column']
     * `text_data` = `clean_text(text_data)` # Remove HTML tags, punctuation, lowercase, etc.
     * `text_data` = `tokenize(text_data)`
     * `text_data` = `remove_stopwords(text_data)`
     * `text_data` = `lemmatize(text_data)`
     * Update row['text_column'] with `text_data` (tokens joined back into a string)
3. **TF-IDF Vectorization:**
   * `vectorizer` = TfidfVectorizer()
   * `tfidf_matrix` = `vectorizer.fit_transform(ebm_df['text_column'])`
## 3. ML Training and Validation
1. **Choose Models:**
   * `ner_model` = NER Model (e.g., SpaCy)
   * `re_model` = RE Model (Rule-based or ML-based)
   * `classifier_models` = [Naive Bayes, Random Forest, Decision Tree, SVM, kNN]
2. **Train Models:**
   * Split data into `train_data` and `validation_data`
   * **FOR** each `classifier_model` in `classifier_models`:
     * `model` = `classifier_model.fit(train_data['tfidf_matrix'], train_data['labels'])`
     * `predictions` = `model.predict(validation_data['tfidf_matrix'])`
     * Evaluate and store performance metrics (precision, recall, F1-score) on `validation_data`
3. **Select Best Model:**
   * Choose the classifier with the highest F1-score (or other preferred metric) on the `validation_data`.
## 4. Testing and Evaluation
1. **Load Unseen Test Data:**
   * `ebm_test_df` = Load new EBM data
   * Preprocess `ebm_test_df` (same as training data)
   * `test_tfidf` = `vectorizer.transform(ebm_test_df['text_column'])` # Reuse the vectorizer fitted on the training data
2. **Predict:**
   * `test_predictions` = `best_classifier_model.predict(test_tfidf)`
3. **Evaluate:**
   * Calculate precision, recall, F1-score by comparing `test_predictions` with the true test labels

Data Collecting

The study’s foundation is a robust dataset of EBM amassed through web crawling and scraping. To ensure comprehensive coverage, the data collection process targets diverse sources recommended by the interviewed experts. Web crawlers systematically navigate these sites, following links to index pages containing EBM data. The specific databases included in this study are:

ㆍhttps://building-material-scout.com

ㆍhttps://transparencycatalog.com/

ㆍhttps://www.originmaterials.com/

ㆍhttps://www.ecomedes.com/

ㆍhttps://spot.ul.com/

ㆍhttps://www.energystar.gov/products

ㆍhttps://www.epeat.net/

ㆍhttps://www.epa.gov/watersense

Data collection from these databases involves extracting relevant information, including material names, descriptions, technical specifications, environmental certifications, manufacturer details, and any other available attributes related to sustainability, performance, and compliance. Upon identifying relevant pages, web scraping tools extract specific data elements. The data is collected in both structured (e.g., tables, databases) and unstructured (e.g., text descriptions) formats. The extracted data is then meticulously organized and stored in structured formats like CSV. This organized data serves as the input for subsequent NLP and ML analyses.
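The extraction step can be sketched with Python's standard library. The example below parses a small, made-up HTML table and writes its rows as CSV text. It is only an illustration under stated assumptions: the sample markup, the product and manufacturer names, and the helper names `TableScraper` and `scrape_to_csv` are hypothetical, and `html.parser` stands in here for the BeautifulSoup-based scraper named in Table 2. Real database pages will have different markup and need their own selectors.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; names and markup are invented for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Material</th><th>Manufacturer</th><th>Certification</th></tr>
  <tr><td>Acoustic panel</td><td>Acme Interiors</td><td>UL GREENGUARD</td></tr>
  <tr><td>Low-flow faucet</td><td>Example Plumbing</td><td>WaterSense</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_to_csv(html_text):
    """Return the scraped rows and the same rows serialized as CSV text."""
    scraper = TableScraper()
    scraper.feed(html_text)
    buffer = io.StringIO()
    csv.writer(buffer).writerows(scraper.rows)
    return scraper.rows, buffer.getvalue()


rows, csv_text = scrape_to_csv(SAMPLE_HTML)
```

Writing to an in-memory buffer keeps the sketch free of file-system side effects; a real crawler would write to a file such as the "ebm_raw_data.csv" mentioned in Table 2.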

NLP Data Preparation

The raw EBM data collected through web crawling and scraping requires meticulous preparation to be suitable for subsequent NLP and ML analysis. This preparation stage involves several crucial steps that transform unstructured text into a clean and usable format.

- Data Cleaning: This step involves eliminating redundancies, such as duplicate entries, and rectifying any errors or inconsistencies present in the raw data. Missing values are handled appropriately, either by imputation or removal, depending on their extent and potential impact on analysis.

- Text Normalization: Textual data is standardized by converting all characters to lowercase, removing punctuation marks, and unifying measurement units. This step ensures consistency and eliminates variations that could hinder accurate analysis.

- Tokenization: The normalized text is then divided into smaller units, such as words or phrases, called tokens. This segmentation facilitates further processing and analysis by enabling the identification of individual semantic units within the text.

- Stop Word Removal: Common words, such as “the,” “and,” and “of,” which carry little semantic meaning, are removed. This step reduces noise and focuses the analysis on more informative terms.

- Stemming/Lemmatization: Words are reduced to their base or root form. Stemming involves removing suffixes, while lemmatization considers the context to derive the dictionary form of the word. This step reduces dimensionality and groups related words together.

- Feature Extraction: The prepared text is transformed into numerical representations that machine learning models can interpret. Common techniques include bag-of-words, TF-IDF, and word embeddings; TF-IDF was used in this study. These representations capture the relative importance of terms (and, in the case of embeddings, semantic relationships between words) and enable ML algorithms to learn patterns within the data.
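The cleaning, normalization, tokenization, stop-word removal, and stemming steps listed above can be sketched in a few lines of standard-library Python. This is a simplified stand-in, not the study's pipeline: the stop-word list is deliberately tiny, and the `stem` function is a crude suffix stripper used in place of a real stemmer or lemmatizer. All function names are assumptions that mirror the pseudocode in Table 2.

```python
import re

# Small illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "and", "of", "a", "an", "is", "are", "in", "for", "with", "to"}


def clean_text(text):
    """Strip HTML tags, lowercase, and replace punctuation with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", " ", text)


def tokenize(text):
    """Split normalized text on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Crude suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the full chain: clean, tokenize, remove stop words, stem."""
    tokens = remove_stopwords(tokenize(clean_text(text)))
    return [stem(t) for t in tokens]


tokens = preprocess("<p>The panels are certified for Low-Emitting Materials.</p>")
# tokens == ["panel", "certifi", "low", "emitt", "material"]
```

The truncated stems ("certifi", "emitt") show why lemmatization is often preferred when readable terms matter; for classification, consistent stems are usually sufficient.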

These NLP data preparation steps were used to ensure that the EBM data is clean, consistent, and suitable for extracting the available eco-labels. This lays the groundwork for training robust ML models capable of extracting meaningful insights from the data and wrangling EBM databases effectively.
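To make the TF-IDF representation concrete, the following sketch computes term weights for three toy token lists. It uses the textbook form, term frequency multiplied by log(N / document frequency), which is an assumption here: library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization, so their numbers will differ. The function name `tfidf` and the sample documents are illustrative.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc_tokens in documents:
        doc_freq.update(set(doc_tokens))
    weights = []
    for doc_tokens in documents:
        counts = Counter(doc_tokens)
        total = len(doc_tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents, already preprocessed into tokens.
docs = [
    ["energy", "star", "certified", "appliance"],
    ["fsc", "certified", "wood", "panel"],
    ["energy", "efficient", "wood", "stove"],
]
vectors = tfidf(docs)
```

In the first document, "star" occurs in only one of the three documents and therefore receives a higher weight than "certified", which occurs in two. This is the property that lets label-specific terms stand out from generic product vocabulary.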

Training, cross-validation and selection of the models

Supervised learning algorithms were examined in this study since they are commonly used for material classification tasks. Such algorithms include artificial neural networks (ANN), support vector machines (SVM), Naive Bayes (NB), and others [25]. In detail, NB is a probabilistic algorithm that applies Bayes’ theorem with the assumption of independence between features. Despite its simplicity, NB has been effective in text classification tasks such as spam filtering and sentiment analysis [26]. SVMs are commonly used in text classification tasks such as sentiment analysis and spam detection, as they effectively separate different classes in feature space by finding an optimal hyperplane [27]. Decision tree algorithms offer several advantages, including effectiveness in classification, high speed, easy interpretability, and the ability to handle both classification and regression problems [28]. Random Forest (RF) is an ensemble learning algorithm that combines multiple decision trees to make predictions. It has been successfully applied to NLP classification tasks, including sentiment analysis and topic classification [23]. These algorithms offer advantages such as interpretability, efficiency, and the ability to handle both categorical and numerical features [25]. However, the choice of algorithm should consider the specific requirements of the task, the characteristics of the dataset, and the desired interpretability of the model. Moreover, because the information patterns of eco-labels differ with their concepts, markets, and material objects, this study trained and cross-validated five algorithms (SVM, NB, Decision Tree, RF, and k-nearest neighbors (kNN)) on the dataset of each eco-label, then selected the most appropriate algorithm based on accuracy and F1 score for developing the final workflow.
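The selection procedure, cross-validating several candidates per eco-label and keeping the one with the best score, can be sketched as follows. To stay self-contained, the example uses two toy classifiers written in plain Python (a majority-class baseline and a small multinomial Naive Bayes) as stand-ins for the five algorithms examined in the study. The class and function names, the toy documents, and the choice of k = 3 folds are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


class MajorityBaseline:
    """Always predicts the most frequent training label."""

    def fit(self, docs, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, docs):
        return [self.label for _ in docs]


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over token lists."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)
        self.counts = defaultdict(Counter)
        for doc_tokens, label in zip(docs, labels):
            self.counts[label].update(doc_tokens)
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def _log_prob(self, doc_tokens, label):
        total = sum(self.counts[label].values()) + len(self.vocab)
        # Unnormalized class count as prior; the shared constant does not affect argmax.
        score = math.log(self.priors[label])
        for t in doc_tokens:
            score += math.log((self.counts[label][t] + 1) / total)
        return score

    def predict(self, docs):
        return [max(self.priors, key=lambda c: self._log_prob(d, c)) for d in docs]


def stratified_folds(labels, k):
    """Deal the indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for i in indices:
            folds[position % k].append(i)
            position += 1
    return folds


def cross_val_f1(model_factory, docs, labels, k=3):
    """Mean F1 over k stratified folds, refitting a fresh model for each fold."""
    scores = []
    for fold in stratified_folds(labels, k):
        held_out = set(fold)
        train = [i for i in range(len(docs)) if i not in held_out]
        model = model_factory().fit([docs[i] for i in train], [labels[i] for i in train])
        preds = model.predict([docs[i] for i in fold])
        scores.append(f1_score([labels[i] for i in fold], preds))
    return sum(scores) / k


# Toy data: 1 = product text carrying the eco-label, 0 = product text without it.
docs = [
    ["energy", "star", "certified", "refrigerator"],
    ["energy", "star", "qualified", "monitor"],
    ["energy", "star", "rated", "dishwasher"],
    ["energy", "star", "certified", "freezer"],
    ["oak", "flooring", "plank"],
    ["ceramic", "wall", "tile"],
    ["acoustic", "ceiling", "panel"],
    ["steel", "door", "frame"],
    ["vinyl", "flooring", "roll"],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]

candidates = {"majority": MajorityBaseline, "naive_bayes": TinyNaiveBayes}
scores = {name: cross_val_f1(factory, docs, labels) for name, factory in candidates.items()}
best = max(scores, key=scores.get)
```

On this toy data the Naive Bayes candidate is selected. The same loop would be repeated for the dataset of each eco-label, which is how different labels can end up with different algorithms.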

Models Validation

In binary classification, a classifier’s predictive performance rests on four key statistics: true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). When classes are balanced, these statistics enable the computation of evaluation metrics such as accuracy, precision (P), recall (R), and F1 score [29]. With imbalanced class distributions, however, such metrics may mislead. For instance, in a dataset with 90 negatives and 10 positives, a classifier predicting only negatives would still achieve 90% accuracy, appearing effective when it is not. In such scenarios, it is vital to consider both the TP rate and the FP rate [30]. The TP rate is the share of actual positives that are correctly classified, while the FP rate is the share of actual negatives that are incorrectly classified as positive. Given the risk of skewed predictions from imbalanced data, the area under the Receiver Operating Characteristic curve (ROC-AUC, also AUROC) offers a robust evaluation that considers both sensitivity and specificity and remains consistent across varied base rates [31]. ROC graphs plot the TP rate against the FP rate, and the closer the curve lies to the upper left corner, the better the prediction. An ideal classifier has an AUROC near unity. AUROC values are categorized as follows: 0.9-1.0 excellent, 0.8-0.9 good, 0.7-0.8 fair, 0.6-0.7 poor, and 0.5-0.6 failure [32]. Consequently, the combination of accuracy, P, R, F1, and the ROC curve provides a comprehensive assessment of the reliability and effectiveness of the NLP model in this context. The ROC-AUC metric therefore serves as the foundational criterion for choosing the most suitable algorithm and model for the dataset of each eco-label. To further validate the chosen models, accuracy, P, R, and F1 were computed on both the cross-validation and the independent test datasets.
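The metrics discussed above can be computed directly from the four confusion counts. The sketch below reproduces the 90/10 example from the text and computes ROC-AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative, a standard equivalent of the area under the ROC curve. The function names are illustrative. How band boundaries are assigned, and the treatment of values below 0.5 as "failure", are assumptions, since the source lists overlapping ranges and does not cover values below 0.5.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def safe_div(numerator, denominator):
    """Division that returns 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0


def metrics(y_true, y_pred):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # identical to the TP rate
    return {
        "accuracy": safe_div(tp + tn, tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),
    }


def roc_auc(y_true, scores):
    """Share of positive/negative pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_band(auc):
    """Map an AUROC value to the qualitative bands quoted in the text."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "good"
    if auc >= 0.7:
        return "fair"
    if auc >= 0.6:
        return "poor"
    return "failure"  # assumption: values below 0.5 are also treated as failure


# The imbalanced example from the text: 90 negatives, 10 positives, all predicted negative.
y_true = [0] * 90 + [1] * 10
always_negative = [0] * 100
m = metrics(y_true, always_negative)

# A constant score cannot rank any positive above any negative.
auc_constant = roc_auc(y_true, [0.0] * 100)
```

Here `m["accuracy"]` is 0.9 while recall and F1 are both 0.0, and `auc_constant` is 0.5, which falls in the "failure" band. This is the gap between accuracy and ranking-based evaluation that motivates using ROC-AUC on imbalanced eco-label data.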

Results and Discussion

EBM Databases investigation and collected data

A brief green material assessment process is presented in Figure 1. According to the interviewed experts, sustainability aims, including green and net-zero building aspirations, are typically outlined in the early design stages. These goals are predicated on myriad factors such as energy efficiency, environmental implications, and the health and well-being of occupants, necessitating judicious decision-making in the project’s conception [4]. To meet these benchmarks, selected materials should exhibit distinct qualities, verifiable through eco-labels authenticated by third parties. Materials should align with set sustainability criteria, reflecting attributes like recycled content, resource renewability, reduced embodied energy, minimal VOC emissions, and proven durability [33]. A holistic approach necessitates the active involvement of a cross-section of stakeholders: from architects and designers to contractors and suppliers, each offering invaluable expertise to fortify the sustainability quotient of chosen materials [34]. These eco-labels furnish standardized insights on the environmental efficacy and sustainability features of construction materials, rendering them invaluable to architects, designers, and other decision-makers. By distilling complex data into digestible information, eco-labels streamline the decision-making workflow, offering clear insights into the environmental bona fides of materials [35]. This facilitates a comparative evaluation of materials, focusing on LEED criteria like environmental repercussions, energy conservation, among others [35]. 
As a result of the database investigation, a total of 16 eco-labels were tested in this study, including WaterSense (US EPA), ENERGY STAR, EPEAT, Cradle to Gate (C2G), Environmental Product Declarations (EPD), Forest Stewardship Council (FSC), Cradle to Cradle (C2C), UN Global Compact, Extended Producer Responsibility (EPR), Biopreferred®, Material reuse, Recycled content, GreenScreen, Health Product Declaration (HPD), and UL GREENGUARD.

https://cdn.apub.kr/journalsite/sites/durabi/2024-015-03/N0300150305/images/Figure_susb_15_03_05_F1.jpg
Figure 1.

The assistance of eco-labels in determining its sustainable attributes.

From January 2021 to March 2021, extensive data collection was undertaken using the developed crawlers, amassing over 300,000 initial observations. Following qualification procedures, the dataset was manually refined to 64,350 data points. A significant portion of the initial observations was removed due to duplication or a lack of essential information. These discrepancies align with concerns noted in earlier studies regarding inherent deficiencies in current databases. Such shortcomings include data duplication and information format inconsistencies, often stemming from variations in the proficiency of data-entry personnel. A statistical analysis of the collected data (Table 3) highlights the issues arising from the unbalanced data typical of existing databases. This imbalance underscores the emerging need for hybrid models to accurately discern material information in the future, rather than solely depending on databases. Within the databases, most sustainable construction materials are distributed among the following material groups: Building Finishes (91,291), Building Furniture (41,782), Office Electronics (35,151), HVAC/Mechanical (16,466), and Appliances (100,048). Regarding eco-label representation, dominant labels include ENERGY STAR® Certified (84,964), UL GREENGUARD (79,811), Environmental Product Declaration (EPD) (31,037), Biopreferred® (16,869), and Green Label Plus (19,383). These figures indirectly provide insights into the prevailing sustainable product market dynamics, suggesting a tilt towards premium product categories, notably within the realms of Interior and Electrical Appliances. Across the eco-label market, there is a discernible inclination towards attributes such as energy efficiency and user health protection, with diminished emphasis on pivotal sustainability facets like material origin and emissions. 
Understanding these market tendencies helps project teams select suitable materials for their projects. The imbalance in the number of samples per eco-label can leave some models with too little training data. Thus, the special nature of the data requires caution when validating the effectiveness of the NLP-ML models built in this study.

Table 3.

Number of samples collected for each eco-label and sustainable attributes

| Criteria | Eco-label | Code | Number of samples |
|---|---|---|---|
| Persistent Bioaccumulative Toxic | Free Mercury | E1 | - |
| | No Lead, Cadmium, and Copper | E2 | 5,231 |
| Building Product Disclosure and Optimization - Environmental Product Declarations | Cradle to Gate | E3 | 3,543 |
| | Cradle to Grave | E4 | 1,314 |
| | EPD | E5 | 31,037 |
| Building Product Disclosure and Optimization - Sourcing of Raw Materials | CSR | E6 | 591 |
| | Extended producer responsibility | E7 | 570 |
| | UN Global Compact | E8 | 1,080 |
| | FSC | E9 | 5,382 |
| | Recycled content | E10 | 5,557 |
| Building Product Disclosure and Optimization - Material Ingredients | HPD | E11 | 35,804 |
| | Cradle to Cradle | E12 | 567 |
| | Green Screen | E13 | 4,336 |
| | REACH | E14 | - |
| Indoor Air Quality | Low-Emitting Materials (UL) | E15 | 64,350 |
| Energy Performance | ENERGY STAR | E16 | 100,525 |
| | EPEAT | E17 | 7,964 |
| Indoor Water Use Reduction | Water sense | E18 | 39,625 |
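The imbalance visible in Table 3 can be quantified with simple arithmetic. The sketch below uses the sample counts from Table 3 (E1 and E14 are omitted because no count is reported) to find the largest and smallest classes, their ratio, and the labels that fall under a sample threshold. The threshold of 1,000 and the variable names are assumptions chosen only for illustration.

```python
# Sample counts per eco-label code, taken from Table 3.
samples = {
    "E2": 5231, "E3": 3543, "E4": 1314, "E5": 31037,
    "E6": 591, "E7": 570, "E8": 1080, "E9": 5382,
    "E10": 5557, "E11": 35804, "E12": 567, "E13": 4336,
    "E15": 64350, "E16": 100525, "E17": 7964, "E18": 39625,
}

largest = max(samples, key=samples.get)
smallest = min(samples, key=samples.get)
imbalance_ratio = samples[largest] / samples[smallest]

# Hypothetical threshold for flagging labels with scarce data.
MIN_SAMPLES = 1000
scarce = [code for code, n in samples.items() if n < MIN_SAMPLES]
```

With these counts, ENERGY STAR (E16) has roughly 177 times as many samples as Cradle to Cradle (E12), and three labels (E6, E7, E12) fall below the illustrative threshold. Labels of this kind are where validation results deserve the most caution.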

Potential of NLP based ML on Building Material Eco-Label Databases Wrangling

During the training and cross-validation of the models, analysis of the ROC-AUC scores (detailed in Figure 2) [32] for the 16 NLP-ML predictive models across the various eco-labels revealed distinct performance trends. In every group of models trained to predict an eco-label, the majority achieved ROC-AUC values between 0.70 and 1.0, which indicates good predictive performance [36]. The Random Forest (RF) algorithm consistently emerged as a top-performing classifier, achieving notably high AUC scores in models such as “PBTs”, “Cradle to Gate”, and the “UL GREENGUARD label”. Conversely, the performance of the SVM algorithm varied, excelling in eco-labels like “Water sense” but lagging with an AUC of 0.483 for “UN Global Compact”. Based on the precision, recall, and F1-score for the various eco-labels outlined in Appendix 1, two algorithms consistently demonstrated superior outcomes: RF and SVM showed remarkable reliability across the datasets.

https://cdn.apub.kr/journalsite/sites/durabi/2024-015-03/N0300150305/images/Figure_susb_15_03_05_F2.jpg
Figure 2.

ROC-AUC chart during training and cross-validation.

Models matching the data of each eco-label were selected during cross-validation (results in Appendix 1) and then evaluated on the test dataset. The metrics assessed encompass recall, precision, and F1-score across both the cross-validation and testing datasets (see Table 4). For many eco-labels, the RF algorithm consistently delivered commendable performance, exemplified by its F1-scores for attributes like “PBTs” (94.19% in cross-validation, 94.72% in testing) and “C2Gate” (92.78% in cross-validation, 93.25% in testing). Its high F1-scores suggest that RF balances precision and recall effectively, identifying true positive cases while minimizing false positives and false negatives. The SVM algorithm, in turn, showed exceptional performance for specific attributes. Notably, it achieved 100% on all performance metrics for the “UL” eco-label during cross-validation and reported an F1-score of 96.56% for the “WS” eco-label in the same phase. Results were also consistent between cross-validation and testing: attributes such as “PBTs” and “C2Gate” showed only minor differences in F1-scores between the two datasets. This consistency is indicative of the models’ generalization capabilities, suggesting potential applicability in real-world scenarios beyond the confines of the training dataset. Performance nevertheless varies among attributes, even within the same algorithm, which reflects the intrinsic complexities and unique characteristics of each attribute.

Table 4.

Recall, Precision and F-measure results of selected ML-NLP models on test data set

| Attributes | Code | Algo. | n | Cross-validation R (%) | Cross-validation P (%) | Cross-validation F1 (%) | Testing R (%) | Testing P (%) | Testing F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| Persistent Bioaccumulative Toxic | PBTs | RF | 759 | 94.23 | 94.16 | 94.19 | 94.67 | 94.8 | 94.72 |
| Environmental Product Declarations | C2Gate | RF | 825 | 94.02 | 91.8 | 92.78 | 92.52 | 94.11 | 93.25 |
| | C2Grave | RF | 987 | 91.82 | 84.84 | 87.76 | 87.31 | 91.2 | 89.07 |
| | EPD | RF | 428 | 87.93 | 80.31 | 83.34 | 81.19 | 87.15 | 83.69 |
| Sourcing of Raw Materials | CSR | RF | 1537 | 98.75 | 97.56 | 98.14 | 98.29 | 99.11 | 98.68 |
| | EPR | RF | 427 | 98.11 | 98.97 | 98.54 | 98.97 | 98.11 | 98.54 |
| | U.N. GC | RF | 761 | 91.16 | 86.8 | 88.74 | 89.33 | 91.04 | 84.35 |
| | FSC | RF | 1515 | 92.26 | 91.42 | 91.44 | 90.63 | 91.45 | 90.65 |
| | RC | RF | 1584 | 97.38 | 97.45 | 97.41 | 96.4 | 96.39 | 96.39 |
| Material Ingredients | HPD | RF | 1500 | 92.84 | 92.8 | 92.8 | 92.2 | 92.34 | 92.2 |
| | C2C | RF | 293 | 90.26 | 86.9 | 88.35 | 88.56 | 92.07 | 90.07 |
| | GS | RF | 427 | 99.14 | 96.47 | 97.74 | 99.86 | 99.43 | 99.64 |
| Indoor Air Quality | UL | SVM | 1392 | 100 | 100 | 100 | 99.94 | 99.93 | 99.93 |
| Energy Performance | ES | RF | 5617 | 96.36 | 96.34 | 96.35 | 96.57 | 96.51 | 96.54 |
| | EPEAT | SVM | 540 | 82.14 | 80.13 | 80.91 | 84.76 | 87.5 | 85.8 |
| Indoor Water Use Reduction | WS | SVM | 1630 | 96.57 | 96.57 | 96.56 | 95.77 | 95.77 | 95.77 |

Conclusions

This study successfully applied natural language processing (NLP) and machine learning (ML) techniques to the challenge of consolidating and extracting valuable information from disparate eco-labeled building materials (EBM) databases. By automating the standardization of terminology, resolution of inconsistencies, and integration of diverse data sources, our approach offers a streamlined solution to the labor-intensive task of EBM data wrangling.

The curated dataset of 64,350 data points, compiled through investigation of EBM databases and web scraping/crawling, served as the foundation for subsequent analysis. Leveraging NLP and ML algorithms, we demonstrated the effective classification of EBM attributes across various eco-labels, with the Random Forest algorithm consistently exhibiting superior performance. Notably, high AUC scores and F1-scores for attributes like “PBTs” and “Cradle to Gate” showcase the methodological efficacy in accurately categorizing EBM data.

The resulting structured EBM database may serve as a resource for stakeholders to efficiently query and analyze relevant information, thereby informing decision-making processes in the selection and utilization of sustainable building materials. By enhancing the accessibility and organization of EBM data, this research contributes to the broader efforts towards promoting sustainable construction practices within the built environment.

However, this study has some limitations. The research focused on a specific set of eco-labels and databases, which may not be exhaustive. Additionally, the performance of the ML-NLP models varied across different eco-labels, indicating the need for further refinement and optimization. Future research could explore the inclusion of additional eco-labels and databases, as well as the development of more sophisticated ML-NLP models to improve the accuracy and consistency of SBM information extraction and analysis. Furthermore, the integration of this framework with Building Information Modeling (BIM) could enhance the practical application of the findings in real-world construction projects.

Acknowledgements

This work was supported by Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Korea government (MOTIE) (20202020800030, Development of Smart Hybrid Envelope Systems for Zero Energy Buildings through Holistic Performance Test and Evaluation Methods and Fields Verifications).

Appendix

Appendix 1.

Result of the training-Cross validation to select the ML prediction models.

| Code | P | R | F1 | n | Algo. | Code | P | R | F1 | n |
|---|---|---|---|---|---|---|---|---|---|---|
| E2 | 91.83% | 91.81% | 91.82% | 759 | SVM | E10 | 94.88% | 94.86% | 94.87% | 1584 |
| | 90.05% | 90.06% | 89.92% | | Naïve Bayes | | 94.35% | 94.47% | 94.38% | |
| | 91.55% | 91.62% | 91.77% | | Decision Trees | | 92.74% | 92.58% | 92.65% | |
| | 94.23% | 94.16% | 94.19% | | RF | | 97.38% | 97.45% | 97.41% | |
| | 84.11% | 75.29% | 74.34% | | KNN | | 87.25% | 87.08% | 86.74% | |
| E3 | 91.92% | 91.15% | 91.51% | 825 | SVM | E11 | 90.15% | 90.14% | 90.13% | 1500 |
| | 16.85% | 50.00% | 25.21% | | Naïve Bayes | | 87.81% | 87.80% | 87.80% | |
| | 89.54% | 88.89% | 89.20% | | Decision Trees | | 87.33% | 87.34% | 87.34% | |
| | 94.02% | 91.80% | 92.78% | | RF | | 92.84% | 92.80% | 92.80% | |
| | 85.87% | 69.96% | 71.86% | | KNN | | 81.73% | 76.20% | 75.12% | |
| E4 | 88.61% | 86.81% | 87.67% | 987 | SVM | E12 | 89.70% | 85.72% | 87.39% | 293 |
| | 83.73% | 85.44% | 84.37% | | Naïve Bayes | | 82.28% | 84.34% | 83.17% | |
| | 85.81% | 86.62% | 86.21% | | Decision Trees | | 86.38% | 83.94% | 85.01% | |
| | 91.82% | 84.84% | 87.76% | | RF | | 90.26% | 86.90% | 88.35% | |
| | 91.74% | 67.90% | 72.54% | | KNN | | 75.99% | 81.36% | 75.99% | |
| E5 | 84.68% | 83.11% | 83.86% | 428 | SVM | E13 | 99.00% | 95.88% | 97.35% | 427 |
| | 84.68% | 83.11% | 83.86% | | Naïve Bayes | | 96.63% | 95.69% | 96.14% | |
| | 82.62% | 79.00% | 80.60% | | Decision Trees | | 98.51% | 96.33% | 97.37% | |
| | 87.93% | 80.31% | 83.34% | | RF | | 99.14% | 96.47% | 97.74% | |
| | 88.57% | 63.83% | 67.39% | | KNN | | 96.34% | 84.12% | 88.66% | |
| E6 | 98.68% | 98.09% | 98.38% | 1537 | SVM | E15 | 100.00% | 100.00% | 100.00% | 1392 |
| | 95.25% | 95.80% | 95.52% | | Naïve Bayes | | 98.12% | 98.12% | 98.12% | |
| | 98.51% | 97.33% | 97.91% | | Decision Trees | | 99.87% | 99.85% | 99.86% | |
| | 98.75% | 97.56% | 98.14% | | RF | | 99.86% | 99.86% | 99.86% | |
| | 96.16% | 79.77% | 85.33% | | KNN | | 92.66% | 90.13% | 90.56% | |
| E7 | 98.68% | 99.12% | 98.89% | 427 | SVM | E16 | 96.36% | 96.34% | 96.35% | 5617 |
| | 81.70% | 92.11% | 84.94% | | Naïve Bayes | | 95.80% | 95.67% | 95.72% | |
| | 98.68% | 99.12% | 98.89% | | Decision Trees | | 94.59% | 94.62% | 94.61% | |
| | 98.11% | 98.97% | 98.54% | | RF | | 97.01% | 97.03% | 97.02% | |
| | 95.37% | 79.17% | 84.42% | | KNN | | 92.13% | 92.27% | 91.99% | |
| E8 | 91.14% | 85.94% | 88.20% | 761 | SVM | E17 | 82.14% | 80.13% | 80.91% | 540 |
| | 78.83% | 83.85% | 80.79% | | Naïve Bayes | | 76.57% | 78.21% | 76.93% | |
| | 87.42% | 85.49% | 86.40% | | Decision Trees | | 81.61% | 80.41% | 80.92% | |
| | 91.16% | 86.80% | 88.74% | | RF | | 81.51% | 76.65% | 78.00% | |
| | 74.56% | 81.73% | 76.66% | | KNN | | 75.41% | 66.12% | 66.66% | |
| E9 | 91.11% | 91.08% | 91.09% | 1515 | SVM | E18 | 96.57% | 96.57% | 96.56% | 1630 |
| | 91.11% | 91.08% | 91.09% | | Naïve Bayes | | 96.75% | 96.75% | 96.75% | |
| | 88.78% | 88.78% | 88.78% | | Decision Trees | | 95.10% | 95.09% | 95.09% | |
| | 92.26% | 91.42% | 91.44% | | RF | | 96.57% | 96.57% | 96.56% | |
| | 74.76% | 73.21% | 72.88% | | KNN | | 93.03% | 92.48% | 92.38% | |
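
As a reading aid for the table above, the following minimal Python sketch computes macro-averaged precision, recall, and F1 from scratch. It assumes, since the appendix does not state the averaging scheme, that the three percentages reported per row are macro-averaged precision, recall, and F1. Under that assumption, a classifier that assigns every sample to one class in a two-class problem has a macro recall of exactly 50%, and the toy data below, with a hypothetical 33.7% share for the predicted class, gives values that match the E3 Naïve Bayes row (16.85%, 50.00%, 25.21%). This is one possible explanation of that row, not a confirmed one. The labels, class share, and function name are illustrative and are not taken from the study.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over the labels in y_true."""
    labels = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (no predictions or no true samples) are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(labels)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


# Hypothetical two-class data: 337 "labelled" items and 663 "other" items.
y_true = ["labelled"] * 337 + ["other"] * 663
# Degenerate classifier (assumption): every sample is predicted as "labelled".
y_pred = ["labelled"] * len(y_true)

p, r, f1 = macro_scores(y_true, y_pred)
print(f"{p:.2%} {r:.2%} {f1:.2%}")  # prints "16.85% 50.00% 25.21%"
```

A row with this signature suggests that the model collapsed to a single class for that eco-label, so its scores say little about how separable the classes are. The scores of the other algorithms on the same eco-label give a more useful picture.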
