[Side Project After Work] Big Data Analysis Certification Practical Exam (Type 1, 2, 3) Course

The part you should pay close attention to in this data is

This concerns re-training on the entire training set (step 5 in the code comments). The full training code is included after the comparison. Let's compare the results!
Adding or changing parameters (class_weight, hyperparameters, etc.) does not necessarily improve the score. You must check each change yourself on the validation set.
Decide whether to adopt a change by comparing the validation results with and without it.
If it is difficult to judge the parameters during the exam, go with the default values (leave them unset). Stability seems to be important!
Simply always re-training on the entire training set is also a good choice!
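
The comparison described above can be sketched roughly as follows. This is only an illustration: it uses synthetic, imbalanced data from make_classification as a stand-in (an assumption, not the exam data), and the names candidates, scores, and best are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the exam dataset)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Same model with and without the parameter under consideration
candidates = {
    'default': RandomForestClassifier(random_state=0),
    'class_weight=balanced': RandomForestClassifier(random_state=0,
                                                    class_weight='balanced'),
}

# Score every candidate on the same validation split
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_val, model.predict(X_val), average='macro')
    print(name, round(scores[name], 4))

# Adopt whichever setting scored higher on validation
best = max(scores, key=scores.get)
print('adopt:', best)
```

Which setting wins depends on the data, so the point is the procedure: change one thing, score it on the same validation split, and keep it only if it helps.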

I have updated the Type 2 tasks on Coding Pang (https://code.sideonai.com/).
I have added lightgbm to all the code in the Big Data Analysis Practical Exam lecture section.

I am providing the baseline for Coding Pang Task Type 2. The EDA section has been excluded.

Question 1

import pandas as pd

# 1) Load data
train = pd.read_csv('data/car_train.csv')
test = pd.read_csv('data/car_test.csv')


# 2) One-hot encoding for categorical variables 
target = train.pop('target')

print(train.shape, test.shape)
train = pd.get_dummies(train)
test = pd.get_dummies(test)
print(train.shape, test.shape)

# 3) Split for validation (Performance comparison before submission)
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train, target, test_size=0.2, random_state=0, stratify=target)


# 4) Train three models 
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_val)

from sklearn.metrics import f1_score
print(f1_score(y_val, pred, average='macro'))

from lightgbm import LGBMClassifier
lgb = LGBMClassifier(random_state=0, verbose=-1)
lgb.fit(X_tr, y_tr)
pred = lgb.predict(X_val)
print(f1_score(y_val, pred, average='macro'))

from xgboost import XGBClassifier
xgb = XGBClassifier(random_state=0, verbosity=0)
xgb.fit(X_tr, y_tr)
pred = xgb.predict(X_val)
print(f1_score(y_val, pred, average='macro'))

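# 5) Predict on test with the selected model (lgb is used here as an example)
#    Without this step, pred would still hold the validation predictions from step 4.
# Optional: re-train the selected model on the full train set first
# lgb.fit(train, target)
pred = lgb.predict(test)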

# 6) Save submission file (Only 1 pred column, remove index)
result = pd.DataFrame({'pred': pred})
result.to_csv('result.csv', index=False)


# Check submission file
print("\n ===== Submission File (Sample) =====")
print(pd.read_csv("result.csv").head())

print("\n ===== Submission File (Check Size) =====")
print(pd.read_csv("result.csv").shape)
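
One thing to watch when pd.get_dummies is called separately on train and test, as in the baseline: if a category appears in only one of the two files, the encoded column sets differ and prediction on test fails. A minimal sketch of one way to align them, using toy DataFrames with made-up column names (fuel, doors) rather than the actual car data:

```python
import pandas as pd

# Toy data (assumption): 'ev' and 'diesel' appear only in train, 'hybrid' only in test
train = pd.DataFrame({'fuel': ['gas', 'diesel', 'ev'], 'doors': [2, 4, 4]})
test = pd.DataFrame({'fuel': ['gas', 'hybrid'], 'doors': [4, 2]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
print(train_enc.shape, test_enc.shape)  # column counts differ

# Force test to have exactly the train columns, in the same order:
# columns missing in test are filled with 0, columns unseen in train are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

If the two shapes printed after encoding already show the same number of columns, this step changes nothing, so it is safe to leave in.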

Question 2


import pandas as pd

# 1) Load data
train = pd.read_csv('data/bike_train.csv')
test = pd.read_csv('data/bike_test.csv')

# 2) One-hot encoding for categorical variables
target = train.pop('count')

print(train.shape, test.shape)
train = pd.get_dummies(train)
test = pd.get_dummies(test)
print(train.shape, test.shape)

# 3) Split for validation (Performance comparison before submission)
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train, target, test_size=0.2, random_state=0)

# 4) Train three models
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_val)

from sklearn.metrics import root_mean_squared_error
print(root_mean_squared_error(y_val, pred))

from lightgbm import LGBMRegressor
lgb = LGBMRegressor(random_state=0, verbose=-1)
lgb.fit(X_tr, y_tr)
pred = lgb.predict(X_val)
print(root_mean_squared_error(y_val, pred))

from xgboost import XGBRegressor
xgb = XGBRegressor(random_state=0, verbosity=0)
xgb.fit(X_tr, y_tr)
pred = xgb.predict(X_val)
print(root_mean_squared_error(y_val, pred))

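# 5) Predict on test with the selected model (lgb is used here as an example)
#    Without this step, pred would still hold the validation predictions from step 4.
# Optional: re-train the selected model on the full train set first
# lgb.fit(train, target)
pred = lgb.predict(test)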

# 6) Save submission file (Only 1 'pred' column, remove index)
result = pd.DataFrame({'pred': pred})
result.to_csv('result.csv', index=False)

# Check submission file
print("\n ===== Submission File (Sample) =====")
print(pd.read_csv("result.csv").head())

print("\n ===== Submission File (Check Size) =====")
print(pd.read_csv("result.csv").shape)
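
A note on the metric import: root_mean_squared_error was added in scikit-learn 1.4, so the import fails on older versions. Whether that applies to your environment is something to check yourself. A minimal fallback sketch, with small made-up arrays just to show the calculation:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

try:
    # Available in scikit-learn 1.4 and later
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    # Fallback for older versions: RMSE is the square root of MSE
    def root_mean_squared_error(y_true, y_pred):
        return mean_squared_error(y_true, y_pred) ** 0.5

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

rmse = root_mean_squared_error(y_true, y_pred)
print(rmse)
```

Here the squared errors are 1, 0, 4, and 1, so the MSE is 1.5 and the RMSE is its square root.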