[Side Project After Work] Big Data Analysis Certification Practical Exam (Type 1, 2, 3)
We guide non-majors and beginners to quickly obtain the Big Data Analysis Certification (Practical Exam)! Keep the theory light and the practice solid—focusing on core points that are guaranteed to appear on the exam through past questions, without the need for complex background knowledge.


Type 2 Task Code Summary (Full training, default parameters recommended)
The part to pay close attention to in this material is training on the entire dataset (comment No. 5 in the code). I have included the full-training code after the model comparison. Compare the results yourself!
Adding or changing parameters (class_weight, hyperparameters, etc.) does not necessarily improve the results. Check every change yourself on the validation set,
and decide whether to adopt it by comparing the scores.
If the parameters are hard to judge during the exam, go with the default values (leave them unset). Stability seems to matter most!
Always re-training on the entire dataset before the final prediction is also a fine choice!
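As a rough sketch of the comparison described above, the snippet below scores a default model against one with class_weight changed on a validation split, then re-trains the chosen one on all the data. The synthetic dataset from make_classification and the names candidates, scores, best_name, and final_model are assumptions made for illustration; they are not part of the exam data or the lecture code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the exam data (assumption, not the real dataset)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Candidates: default settings vs. one changed parameter
candidates = {
    'default': RandomForestClassifier(random_state=0),
    'balanced': RandomForestClassifier(random_state=0, class_weight='balanced'),
}

# Score every candidate on the same validation split
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_val, model.predict(X_val), average='macro')
    print(name, round(scores[name], 4))

# Adopt the changed parameter only if it actually scores higher; ties go to the default
best_name = 'balanced' if scores['balanced'] > scores['default'] else 'default'
print('selected:', best_name)

# Re-train the chosen configuration on the full data before the final prediction
final_model = candidates[best_name]
final_model.fit(X, y)
```

Which candidate wins depends on the data, so the point is the procedure, not the particular outcome: same split, same metric, and the default stays unless the change proves itself.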
I have updated the Type 2 tasks on Coding Pang (https://code.sideonai.com/).
I have added lightgbm to all the code in the Big Data Analysis Practical Exam lecture section.

I am providing the baseline for Coding Pang Task Type 2. The EDA section has been excluded.
Question 1
import pandas as pd
# 1) Load data
train = pd.read_csv('data/car_train.csv')
test = pd.read_csv('data/car_test.csv')
# 2) One-hot encoding for categorical variables
target = train.pop('target')
print(train.shape, test.shape)
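train = pd.get_dummies(train)
test = pd.get_dummies(test)
# Align test to the train columns (a category may appear in only one of the files)
test = test.reindex(columns=train.columns, fill_value=0)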
print(train.shape, test.shape)
# 3) Split for validation (Performance comparison before submission)
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train, target, test_size=0.2, random_state=0, stratify=target)
# 4) Train three models
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_val)
from sklearn.metrics import f1_score
print(f1_score(y_val, pred, average='macro'))
from lightgbm import LGBMClassifier
lgb = LGBMClassifier(random_state=0, verbose=-1)
lgb.fit(X_tr, y_tr)
pred = lgb.predict(X_val)
print(f1_score(y_val, pred, average='macro'))
from xgboost import XGBClassifier
xgb = XGBClassifier(random_state=0, verbosity=0)
xgb.fit(X_tr, y_tr)
pred = xgb.predict(X_val)
print(f1_score(y_val, pred, average='macro'))
# 5) Predict on test with the selected model (lgb shown as an example)
#    Optional: uncomment the fit line to re-train on the full train set first
# lgb.fit(train, target)
pred = lgb.predict(test)
# 6) Save submission file (Only 1 pred column, remove index)
result = pd.DataFrame({'pred': pred})
result.to_csv('result.csv', index=False)
# Check submission file
print("\n ===== Submission File (Sample) =====")
print(pd.read_csv("result.csv").head())
print("\n ===== Submission File (Check Size) =====")
print(pd.read_csv("result.csv").shape)

Question 2
import pandas as pd
# 1) Load data
train = pd.read_csv('data/bike_train.csv')
test = pd.read_csv('data/bike_test.csv')
# 2) One-hot encoding for categorical variables
target = train.pop('count')
print(train.shape, test.shape)
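train = pd.get_dummies(train)
test = pd.get_dummies(test)
# Align test to the train columns (a category may appear in only one of the files)
test = test.reindex(columns=train.columns, fill_value=0)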
print(train.shape, test.shape)
# 3) Split for validation (Performance comparison before submission)
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train, target, test_size=0.2, random_state=0)
# 4) Train three models
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_val)
# root_mean_squared_error requires scikit-learn 1.4 or later
from sklearn.metrics import root_mean_squared_error
print(root_mean_squared_error(y_val, pred))
from lightgbm import LGBMRegressor
lgb = LGBMRegressor(random_state=0, verbose=-1)
lgb.fit(X_tr, y_tr)
pred = lgb.predict(X_val)
print(root_mean_squared_error(y_val, pred))
from xgboost import XGBRegressor
xgb = XGBRegressor(random_state=0, verbosity=0)
xgb.fit(X_tr, y_tr)
pred = xgb.predict(X_val)
print(root_mean_squared_error(y_val, pred))
# 5) Predict on test with the selected model (lgb shown as an example)
#    Optional: uncomment the fit line to re-train on the full train set first
# lgb.fit(train, target)
pred = lgb.predict(test)
# 6) Save submission file (Only 1 'pred' column, remove index)
result = pd.DataFrame({'pred': pred})
result.to_csv('result.csv', index=False)
# Check submission file
print("\n ===== Submission File (Sample) =====")
print(pd.read_csv("result.csv").head())
print("\n ===== Submission File (Check Size) =====")
print(pd.read_csv("result.csv").shape)
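One thing to watch in step 2 of both questions: calling pd.get_dummies on train and test separately can produce different columns when a category appears in only one of the files, and the model then fails at prediction time. Below is a minimal sketch of one way to handle this with reindex, assuming a toy fuel/age dataset invented for illustration (the names train_enc and test_enc are hypothetical too). Concatenating train and test before encoding is another common approach.

```python
import pandas as pd

# Toy data (hypothetical): 'diesel' appears only in train, 'hybrid' only in test
train = pd.DataFrame({'fuel': ['gas', 'diesel', 'gas'], 'age': [3, 5, 1]})
test = pd.DataFrame({'fuel': ['gas', 'hybrid'], 'age': [2, 4]})

# Encoding each frame separately gives different column sets
train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
print(list(train_enc.columns))
print(list(test_enc.columns))

# Give test exactly the train columns, in the same order:
# categories missing from test are filled with 0, categories unseen in train are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
print(train_enc.shape, test_enc.shape)
```

After the reindex, the second number of both shapes matches, which is the same check the print(train.shape, test.shape) lines in the baseline are meant to support.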



