Data Analysis of Indoor Thermal Environment

1. Introduction

This project was developed for a research topic at Tokyo Tech’s Yasuda Laboratory. It involves visualizing indoor thermal environment data of buildings using Seaborn, sklearn, and pandas, and conducting Random Forest Classification on various factors to identify the main factors affecting the indoor thermal environment.

Google Colab Link

Data Source: Official Github 

2. Environment

  • Python (Pandas, Seaborn, Sklearn)
  • Jupyter Notebook

3. Analysis Process

3.1 Reading Data

  1. The first thing we need to do is create the feature data set and the target variable.
  2. Let’s use the following columns as input features for the classification model. The model will use these features to try to predict `ThermalSensation_rounded`.
  3. The features relate to the building context (e.g. `Country`, `City`), the environmental conditions (e.g. `Air temperature (C)`, `Relative humidity (%)`), and personal factors (e.g. `Sex`, `Clo`).
  4. The target variable is the column that we want to predict, in this case thermal sensation. We will use the “rounded” version to minimize the number of categories.
				
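    import pandas as pd

    # Load the data; the unnamed first CSV column is used as the row index
    ieq_data = pd.read_csv("ashrae_thermal_comfort_database_2.csv", index_col='Unnamed: 0')
    ieq_data.info()

    # Class distribution of the target
    ieq_data["ThermalSensation_rounded"].value_counts()

    # Input features; names are kept exactly as written in the source
    feature_columns = [
        'Year',
        'Season',
        'Climate',
        'City',
        'Country',
        'Building type',
        'Cooling startegy_building level',
        'Sex',
        'Clo',
        'Met',
        'Air temperature (C)',
        'Relative humidity (%)',
        'Air velocity (m/s)',
    ]
    features = ieq_data[feature_columns]
    features.info()

    # Target variable: the rounded thermal sensation vote
    target = ieq_data['ThermalSensation_rounded']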
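
To illustrate why the rounded target keeps the number of categories small, here is a minimal sketch. It assumes the rounded column comes from rounding a continuous thermal sensation vote on the seven-point scale from -3 to +3 to the nearest integer; the sample votes and variable names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical continuous thermal sensation votes (not from the dataset)
votes = pd.Series([-2.6, -0.4, 0.2, 1.7, 2.9, 0.3])

# Round to the nearest integer so that each vote falls into one class
votes_rounded = votes.round().astype(int)

print(votes.nunique())          # number of distinct raw values
print(votes_rounded.nunique())  # number of distinct classes after rounding
print(votes_rounded.value_counts().sort_index())
```

Fewer classes means more samples per class, which generally makes the classification task easier to learn and to evaluate.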
				
			

3.2 Create dummy variables for the categories

  • We need to convert the categorical variables to dummy variables (a simple form of one-hot encoding), as that is the input the model expects.
  • Next, we divide the data set into random train and test subsets.
  • Then we use the Random Forest classifier from sklearn and specify several parameters that influence the way the model is constructed.
				
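    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    features_withdummies = pd.get_dummies(features)  # one-hot encode the categorical columns
    features_withdummies.head()
    # Hold out 30% of the rows for testing
    features_train, features_test, target_train, target_test = train_test_split(features_withdummies, target, test_size=0.3, random_state=2)
    # 'sqrt' is what the removed 'auto' option meant for classifiers
    model_rf = RandomForestClassifier(oob_score=True, max_features='sqrt', n_estimators=100, min_samples_leaf=2, random_state=0)
    model_rf.fit(features_train, target_train)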
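
The model above is created with `oob_score=True`, which asks the forest to also estimate accuracy from the training rows that each tree did not see in its bootstrap sample. A minimal sketch of reading that estimate next to the test accuracy, using synthetic data from `make_classification` as a stand-in for the real features (the data and variable names here are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data, only a stand-in for the real feature table
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2)

model = RandomForestClassifier(n_estimators=100, oob_score=True,
                               min_samples_leaf=2, random_state=0)
model.fit(X_train, y_train)

# Accuracy estimated from out-of-bag training samples
oob_accuracy = model.oob_score_
# Mean accuracy on the held-out test set
test_accuracy = model.score(X_test, y_test)

print("OOB accuracy: " + str(oob_accuracy))
print("test accuracy: " + str(test_accuracy))
```

The two numbers are usually close; a large gap can hint at a problem with the train/test split.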
				
			
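  • The model is accurate about half the time in predicting a person's thermal sensation. That looks low, but let’s find out where the baseline is. The baseline is the accuracy obtained by random guessing; here a dummy classifier draws its predictions at random from the class distribution of the training set.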
				
    from sklearn.dummy import DummyClassifier

    # Dummy classifier that guesses at random, to get a baseline
    baseline_rf = DummyClassifier(strategy='stratified', random_state=0)
    baseline_rf.fit(features_train, target_train)

    baseline_model_accuracy = baseline_rf.score(features_test, target_test)
    print("base accuracy: " + str(baseline_model_accuracy))
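
For intuition about the baseline: a classifier that guesses each label at random according to the class proportions is expected to be right with a probability equal to the sum of the squared class proportions, provided the test set has the same proportions. The sketch below computes that number for a made-up label list; the labels and the helper name `expected_stratified_accuracy` are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter

def expected_stratified_accuracy(labels):
    """Expected accuracy of guessing labels at random with the
    same class proportions as `labels` (sum of squared proportions)."""
    counts = Counter(labels)
    total = len(labels)
    return sum((n / total) ** 2 for n in counts.values())

# Hypothetical thermal sensation labels: 50% "0", 30% "1", 20% "-1"
toy_labels = [0] * 5 + [1] * 3 + [-1] * 2

baseline = expected_stratified_accuracy(toy_labels)
print(round(baseline, 2))  # 0.5**2 + 0.3**2 + 0.2**2 = 0.38
```

A model is only useful if its test accuracy is clearly above this kind of baseline.
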
				
			
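Sample rows of the dummy-encoded feature table (selected columns):

Year | Clo  | Met | Air temperature (C) | Relative humidity (%) | Air velocity (m/s) | Season_Autumn | Season_Spring | Season_Summer | Season_Winter
2012 | 0.75 | 1   | 25.2                | 64                    | 0.1                | 0             | 0             | 0             | 1
2012 | 0.64 | 1   | 25.2                | 64                    | 0.1                | 0             | 0             | 0             | 1
2012 | 0.64 | 1   | 25.2                | 64                    | 0.1                | 0             | 0             | 0             | 1
2012 | 0.75 | 1   | 25.2                | 64                    | 0.1                | 0             | 0             | 0             | 1
2012 | 0.72 | 1   | 25.2                | 64                    | 0.1                | 0             | 0             | 0             | 1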
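
As a small illustration of what the dummy-variable step does, the sketch below applies `pd.get_dummies` to a made-up three-row table. The data frame `toy` and its values are assumptions for the example only; the column names are borrowed from the feature list.

```python
import pandas as pd

# Toy data (hypothetical values): one numeric and two categorical columns
toy = pd.DataFrame({
    "Clo": [0.75, 0.64, 0.72],
    "Sex": ["Male", "Female", "Male"],
    "Season": ["Winter", "Summer", "Winter"],
})

# Numeric columns pass through; each category gets its own indicator column
toy_dummies = pd.get_dummies(toy)

print(list(toy_dummies.columns))
print(toy_dummies)
```

Depending on the pandas version, the indicator columns hold 0/1 or True/False; either way, each row has exactly one positive entry within each group of dummy columns.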

3.3 Classification Report

  • Classification is often evaluated by more than just accuracy; several other metrics help us understand how well the classification works. We can print a report that outlines the precision, recall, f1-score, and support metrics for each of the classes being predicted.
				
    import numpy as np
    from sklearn.metrics import classification_report

    y_pred = model_rf.predict(features_test)
    y_true = np.array(target_test)
    categories = np.array(target.sort_values().unique())  # sorted class labels
    print(classification_report(y_true, y_pred))
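
To make the report's columns concrete, the following sketch computes precision, recall, f1-score, and support for a single class by hand. The label lists and the helper `class_metrics` are hypothetical and only serve as a worked example; the definitions are the standard ones.

```python
def class_metrics(y_true, y_pred, label):
    """Precision, recall, f1 and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support

# Hypothetical true and predicted thermal sensation classes
y_true_toy = [0, 0, 0, 1, 1, -1, -1, 0]
y_pred_toy = [0, 0, 1, 1, 0, -1, 0, 0]

precision, recall, f1, support = class_metrics(y_true_toy, y_pred_toy, label=0)
print(precision, recall, f1, support)
```

Here class 0 is predicted five times and three of those are correct (precision 3/5), while three of its four true samples are found (recall 3/4).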
				
			

3.4 Feature Importance

  • Random Forest models have a built-in capability to calculate feature importance, which tells us which elements really matter for the built environment.
  • According to the feature importance analysis, the conventional environmental metrics (air temperature, relative humidity, air velocity) are the best predictors of comfort, followed by the personal factors.
				
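    # Importance of each feature, averaged over all trees in the forest
    importances = model_rf.feature_importances_
    # Spread of each feature's importance across the individual trees
    std = np.std([tree.feature_importances_ for tree in model_rf.estimators_], axis=0)
    # Feature indices sorted from most to least important
    indices = np.argsort(importances)[::-1]
    # Print the feature ranking
    print("Feature ranking:")
    for f in range(10):
        print(f + 1, features_withdummies.columns[indices[f]], importances[indices[f]])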
				
			
Rank | Feature               | Importance
1    | Air temperature (C)   | 0.2261
2    | Relative humidity (%) | 0.1920
3    | Air velocity (m/s)    | 0.1540
4    | Clo                   | 0.1494
5    | Met                   | 0.0700
6    | Year                  | 0.0195
7    | Sex_Male              | 0.0167
8    | Sex_Female            | 0.0166
9    | Season_Summer         | 0.0106
10   | Season_Winter         | 0.0094
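
One thing to keep in mind when reading the ranking: a categorical feature is split into several dummy columns, so its importance is spread over them. Below is a minimal sketch of summing the dummy columns back to their parent feature, using only the ten values listed in the ranking. Dummy columns outside the top ten are missing, so the totals for the categorical features are lower bounds; the helper `group_importances` is an illustrative assumption.

```python
# Importances copied from the top-ten ranking
top_importances = {
    "Air temperature (C)": 0.2261,
    "Relative humidity (%)": 0.1920,
    "Air velocity (m/s)": 0.1540,
    "Clo": 0.1494,
    "Met": 0.0700,
    "Year": 0.0195,
    "Sex_Male": 0.0167,
    "Sex_Female": 0.0166,
    "Season_Summer": 0.0106,
    "Season_Winter": 0.0094,
}

# Categorical features that were expanded into dummy columns
categorical = ["Sex", "Season"]

def group_importances(importances, categorical_features):
    """Sum dummy-column importances under their parent feature name."""
    grouped = {}
    for column, value in importances.items():
        parent = column
        for feature in categorical_features:
            # pd.get_dummies names columns "<feature>_<category>"
            if column.startswith(feature + "_"):
                parent = feature
                break
        grouped[parent] = grouped.get(parent, 0.0) + value
    return grouped

grouped = group_importances(top_importances, categorical)
for name, value in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.4f}")
```

Grouped this way, `Sex` moves ahead of `Year`, although both remain far below the environmental metrics.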

3.5 Plot Feature Importance

  • We can also plot the importance of the top features in a horizontal bar chart to get a better visual sense.
				
    import matplotlib.pyplot as plt

    # Plot the feature importances of the forest (top 15 features)
    plt.figure(figsize=(15, 6))
    plt.title("Feature Importances")
    plt.barh(range(15), importances[indices][:15], align="center")
    plt.yticks(range(15), features_withdummies.columns[indices][:15])
    plt.gca().invert_yaxis()  # most important feature at the top
    plt.tight_layout(pad=0.4)
    plt.show()
				
			
[Figure: Plot of Feature Importance]