[Side Project After Work] Big Data Analysis Certification Practical Exam (Type 1, 2, 3)
We guide non-majors and beginners to quickly obtain the Big Data Analysis Certification (Practical Exam)! Keep the theory light and the practice solid—focusing on core points that are guaranteed to appear on the exam through past questions, without the need for complex background knowledge.
5,685 learners
Level Beginner
Course period 12 months


News
91 articles
Many people have asked about receiving a score of 0 on Task Type 2, so I have summarized the details in a video.
Conclusion: since we cannot know how negative predicted values are handled, let's wait for the announcement!
A final check for those of you who have worked hard but still feel a bit anxious
[Common]
help(), dir(), and __all__ cannot solve everything. They feel more awkward than you'd expect the first time you use them at the test site, so please try them in the exam environment in advance.
If there are any issues with your keyboard, mouse, or computer, request a seat change immediately after the test period and before the exam begins. Changing seats in the middle of the exam can be mentally unsettling, so it is better to resolve any problems right at the start.
You can submit multiple times, but once you have submitted, the status shows "Submitted." Be careful: seeing "Submitted" might make you forget to re-submit after making changes. Grading is based on the final submission.
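To rehearse the lookup tools mentioned above, something like the following sketch may help. The choice of pd.DataFrame and sklearn.ensemble as lookup targets is only an illustration, not something the exam prescribes.

```python
import pandas as pd
import sklearn.ensemble

# dir() lists the attributes of an object; filter it to find a half-remembered name
sort_methods = [name for name in dir(pd.DataFrame) if name.startswith("sort")]
print(sort_methods)

# __all__ lists the public names a module exports (handy for recalling model classes)
forest_names = [name for name in sklearn.ensemble.__all__ if "Forest" in name]
print(forest_names)

# help(pd.DataFrame.sort_values) would print the full docstring;
# here only the first line of that docstring is shown
print(pd.DataFrame.sort_values.__doc__.strip().splitlines()[0])
```

Filtering the output of dir() keeps the result short enough to scan quickly, which matters when the clock is running.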
[Task Type 1]
Go to the Data tab at the top → click 'Default View' and, if you have to, check by searching with Ctrl + F. Approach it with the mindset of "I will solve it even if it's just with my eyes."
As long as the answer is correct, it doesn't matter what the code looks like. An accurate answer takes priority over clean code.
Make sure to fully master groupby before the exam. Even if you don't go as far as pivoting, group-wise aggregation is difficult to solve just by looking, so you must handle it with code.
Be sure to check the instructions for rounding, decimal places, and integer conversion in the results. Many people miss the correct answer at the last moment by omitting round() or getting the number of decimal places wrong.
For sorting problems, pay close attention to ascending/descending order and tie-breaking. Check the ascending option of sort_values() and whether you need reset_index().
When filtering on conditions, use &, |, and parentheses accurately. Because & and | bind more tightly than comparisons such as == and >, omitting the parentheses in df[(condition1) & (condition2)] will usually raise an error. The parentheses are not required if you first store each condition in a variable such as cond.
Convert date data with pd.to_datetime() so that you can use .dt.year, .dt.month, .dt.dayofweek, and so on. Problems involving aggregation by day of the week or month appear frequently.
Get familiar with frequently used functions: value_counts(), nlargest()/nsmallest(), quantile() (for IQR outlier problems), fillna(), drop_duplicates(), astype().
For outlier and missing value problems, strictly follow the criteria given in the question (IQR, standard deviation, specific conditions). Do not arbitrarily apply your own methods.
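The Type 1 tips above can be practiced together in one small sketch. The DataFrame, its column names (date, city, sales), and the 1.5 × IQR rule are assumptions made up for illustration; in the exam, always follow the criteria stated in the question.

```python
import pandas as pd

# Hypothetical data: column names and values are made up for illustration
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-06", "2024-01-07",
             "2024-02-03", "2024-02-05", "2024-02-10"],
    "city": ["A", "A", "B", "B", "A", "B"],
    "sales": [100, 250, 80, 900, 120, 130],
})

# Convert to datetime so the .dt accessors become available
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek  # Mon=0 ... Sun=6

# Inline filtering: each condition needs its own parentheses
weekend_a = df[(df["city"] == "A") & (df["dayofweek"] >= 5)]

# Storing each condition in a variable first avoids the extra parentheses
cond1 = df["city"] == "A"
cond2 = df["dayofweek"] >= 5
weekend_a_again = df[cond1 & cond2]

# Group aggregation, then sort descending with a tie-breaker and reset the index
by_city = (
    df.groupby("city")["sales"].mean()
    .reset_index()
    .sort_values(["sales", "city"], ascending=[False, True])
    .reset_index(drop=True)
)
top_city_mean = round(by_city.loc[0, "sales"], 2)

# IQR outliers: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1 = df["sales"].quantile(0.25)
q3 = df["sales"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["sales"] < q1 - 1.5 * iqr) | (df["sales"] > q3 + 1.5 * iqr)]

print(len(weekend_a), top_city_mean, len(outliers))
```

Passing a list to sort_values() with a matching ascending list is one way to handle ties explicitly, and the final reset_index(drop=True) makes position 0 the top row.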
[Task Type 2]
If you are going to use only one model, you can just train LightGBM on the entire dataset and finish.
If you are going to use two or three or more, compare them after validation. If you are confused about the evaluation metrics, you can compare them using a metric you are certain about.
Focus on RF, LGB, and XGB. It is very rare for other models to deliver better performance.
Once the comparison is finished, I also recommend retraining on the entire dataset. While I can't guarantee a performance boost as it varies by data, if the imbalance is as severe as it was in the 11th session, I would choose to train on the whole dataset.
Just because you have imbalanced data, adjusting parameters or hyperparameters doesn't necessarily guarantee a performance boost. In fact, sometimes the best performance comes from the default values. If you're feeling uncertain, stick with the defaults + training on the entire dataset.
Don't put too much effort into scaling. Since RF, LGB, and XGB are all tree-based models, the performance change due to scaling is minimal.
Match the train/test columns exactly. Encoding and preprocessing must be applied identically to both train and test sets. There are common mistakes where the number of columns differs after one-hot encoding. Although it hasn't appeared in previous exams yet, it is included in the sample questions, so you should also know how to process them by combining train and test sets.
Be clear about the target of your prediction. Read the problem carefully to see whether it requires probabilities (predict_proba) or classes (predict). Usually, roc_auc requires probabilities, while f1 and accuracy require classes.
Match the submission format exactly as specified in the problem. This includes the filename, column names, and whether to include the index (index=False). If the number of rows is incorrect, you will receive 0 points.
If you are short on time, don't get greedy with performance improvements; just run it through LightGBM and complete the submission file first. Completion comes first, optimization comes next.
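The column-matching and prediction-target tips above could look like the following sketch. The data, the column names (age, grade, target), and the output column name pred are assumptions for illustration, and RandomForestClassifier stands in for whichever model you choose; use the file and column names given in the actual problem.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: the test set has no grade "C", so encoding train and test
# separately would produce a different number of columns
train = pd.DataFrame({"age": [25, 32, 47, 51, 38, 29],
                      "grade": ["A", "B", "C", "A", "B", "C"],
                      "target": [0, 1, 1, 0, 1, 0]})
test = pd.DataFrame({"age": [30, 45, 27], "grade": ["A", "B", "A"]})

target = train.pop("target")

# Encode train and test together, then split them back by row count
combined = pd.concat([train, test], axis=0)
combined = pd.get_dummies(combined)
train_enc = combined.iloc[:len(train)]
test_enc = combined.iloc[len(train):]

model = RandomForestClassifier(random_state=0)
model.fit(train_enc, target)

# Class labels (e.g. when the metric is f1 or accuracy)
pred_class = model.predict(test_enc)
# Probability of the positive class (e.g. when the metric is roc_auc);
# column 1 corresponds to model.classes_[1]
pred_proba = model.predict_proba(test_enc)[:, 1]

# One prediction column, no index
result = pd.DataFrame({"pred": pred_proba})
result.to_csv("result.csv", index=False)

# The submission must have exactly one row per test row
print(pd.read_csv("result.csv").shape)
```

Encoding the concatenated frame guarantees identical columns on both sides, at the cost of having to split the rows back afterwards.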
[Task Type 3]
Task Type 3 is not a descriptive or free-form analysis problem like "Analyze this." You only need to accurately perform the specific analysis requested in the problem. There is no need to arbitrarily add or expand the analysis. Simply calculate and output the requested values. Do not perform analyses that were not asked for, such as testing for equal variance if the problem doesn't require it.
The use of C() depends on the type of analysis. This is where many people get really confused.
Analysis of Variance (ANOVA): Use C() for all independent variables (categorical).
Example: ols('y ~ C(group)', data=df); for two-way ANOVA: ols('y ~ C(A) + C(B) + C(A):C(B)', data=df)
This is because ANOVA examines differences between groups, so the independent variables must be treated as categorical.
Regression / Logistic Regression: Do not use C() indiscriminately.
Insert continuous (numeric) variables as they are.
Use C() only when the problem explicitly states, "This variable looks like a number, but treat it as categorical." No arbitrary judgments!
Be sure to learn how to read summary(). In regression and logistic regression, you must be able to find and answer coefficients (coef), p-values, R-squared, and odds ratios directly from the table. Don't give up; at least review this before you go.
The core of hypothesis testing is comparing the significance level (usually 0.05) with the p-value. If p < 0.05, the null hypothesis is rejected. Check the problem to identify which is the null/alternative hypothesis.
Choose the correct type of test. Decide among the one-sample t-test, paired-sample t-test, independent-samples t-test, chi-square (independence/goodness-of-fit), correlation analysis, ANOVA, and so on.
Pay close attention to decimal places and rounding instructions in Type 3 tasks as well. Don't forget the print() output either.
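As a rough sketch of the Type 3 points above, the example below uses C() for ANOVA, leaves a continuous variable as-is in regression, and reads an odds ratio from a logistic regression. The data is randomly generated and the column names (group, x, y, passed) are hypothetical; it assumes statsmodels is available, as the ols() examples above suggest.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols, logit
from statsmodels.stats.anova import anova_lm

# Hypothetical data generated with a fixed seed; column names are made up
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=n),
    "x": rng.normal(size=n),
})
group_effect = df["group"].map({"A": 0.0, "B": 0.5, "C": 1.0})
df["y"] = 2.0 * df["x"] + group_effect + rng.normal(size=n)
df["passed"] = (df["x"] + rng.normal(size=n) > 0).astype(int)

# One-way ANOVA: the independent variable is wrapped in C()
anova_model = ols("y ~ C(group)", data=df).fit()
anova_table = anova_lm(anova_model)
p_group = anova_table.loc["C(group)", "PR(>F)"]

# Linear regression: the continuous variable goes in as-is, without C()
reg_model = ols("y ~ x", data=df).fit()
print(round(reg_model.params["x"], 3), round(reg_model.rsquared, 3))

# Logistic regression: odds ratio = exp(coefficient)
logit_model = logit("passed ~ x", data=df).fit(disp=0)
odds_ratio = np.exp(logit_model.params["x"])
print(round(odds_ratio, 3))

# Decision rule: reject the null hypothesis when p < significance level
alpha = 0.05
print("reject H0" if p_group < alpha else "fail to reject H0")
```

The same numbers appear in the summary() table (coef, P>|t|, R-squared), so pulling them from params, pvalues, and rsquared is just a way to avoid copying them by eye.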
One or two difficult problems may appear in each task type. Don't spend all your time stuck on those specific questions. Leave the difficult problems for a moment and secure your score by thoroughly verifying the other questions you can solve. A perfect score is not the goal. 70 points and passing is the goal!
You are already well within the passing range just with what you have prepared so far. Fighting!! Good luck on your exam! 💪
I am rooting for your success. - Twae-geun-hu-ttan-jit -
Have you solved all the problems?
If so, try your hand at creating derived variables from date data in Task Type 2 (Question 3)!
If you haven't reached that point yet, I recommend organizing what you've studied so far step by step rather than rushing to learn new material.

# Date derived variables (day of the week/month/weekend): apply identically to both train and test!
train['date'] = pd.to_datetime(train['date'])
train['dayofweek'] = train['date'].dt.dayofweek  # Day of week (Mon=0~Sun=6)
train['month'] = train['date'].dt.month  # Month (1~12)
train['is_weekend'] = (train['date'].dt.dayofweek >= 5).astype(int)

test['date'] = pd.to_datetime(test['date'])
test['dayofweek'] = test['date'].dt.dayofweek
test['month'] = test['date'].dt.month
test['is_weekend'] = (test['date'].dt.dayofweek >= 5).astype(int)

Keep it up!
The part you should pay close attention to with this data is training on the entire dataset (Comment No. 5). I have included the full-training code after the comparison. Let's compare the results!
Adding or changing parameters (class_weight, hyperparameters, etc.) does not necessarily make things better. You must check them yourself using the validation set.
You must decide whether to adopt them by comparing the results.
If it is difficult to judge the parameters during the exam, go with the default values (the state where they are not set). Stability seems to be important!
Training on the entire dataset unconditionally is also good!
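One way to run the comparison described above is sketched below. It uses synthetic imbalanced data and RandomForestClassifier purely so the example is self-contained; these are assumptions, and the same pattern applies to LightGBM or XGBoost and to the metric named in the problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives), standing in for the exam's train set
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Two candidate settings: defaults versus class_weight="balanced"
candidates = {
    "default": RandomForestClassifier(random_state=0),
    "balanced": RandomForestClassifier(random_state=0, class_weight="balanced"),
}

# Score each candidate on the same validation set
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_val, model.predict(X_val), average="macro")
    print(name, round(scores[name], 4))

# Keep whichever setting scored higher on validation, then retrain it on all the data
best_name = max(scores, key=scores.get)
final_model = candidates[best_name]
final_model.fit(X, y)
print("selected:", best_name)
```

Which setting wins depends on the data, which is the point: the validation score decides, not the expectation that an extra parameter must help.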
I have updated the Type 2 tasks on Coding Pang (https://code.sideonai.com/).
I have added lightgbm to all the code in the Big Data Analysis Practical Exam lecture section.

I am providing the baseline for Coding Pang Task Type 2. The EDA section has been excluded.
Question 1
import pandas as pd

# 1) Load data
train = pd.read_csv('data/car_train.csv')
test = pd.read_csv('data/car_test.csv')

# 2) One-hot encoding for categorical variables
target = train.pop('target')
print(train.shape, test.shape)
train = pd.get_dummies(train)
test = pd.get_dummies(test)
print(train.shape, test.shape)

# 3) Split for validation (performance comparison before submission)
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train, target, test_size=0.2, random_state=0, stratify=target)

# 4) Train three models
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_val)

from sklearn.metrics import f1_score
print(f1_score(y_val, pred, average='macro'))

from lightgbm import LGBMClassifier
lgb = LGBMClassifier(random_state=0, verbose=-1)
lgb.fit(X_tr, y_tr)
pred = lgb.predict(X_val)
print(f1_score(y_val, pred, average='macro'))

from xgboost import XGBClassifier
xgb = XGBClassifier(random_state=0, verbosity=0)
xgb.fit(X_tr, y_tr)
pred = xgb.predict(X_val)
print(f1_score(y_val, pred, average='macro'))

# 5) Predict on test with the selected model
#    (Optional: re-train it on the full train set first)
# lgb.fit(train, target)
pred = lgb.predict(test)  # must be test predictions, not the validation predictions from step 4

# 6) Save submission file (only 1 'pred' column, remove index)
result = pd.DataFrame({'pred': pred})
result.to_csv('result.csv', index=False)

# Check submission file
print("\n ===== Submission File (Sample) =====")
print(pd.read_csv("result.csv").head())
print("\n ===== Submission File (Check Size) =====")
print(pd.read_csv("result.csv").shape)

Question 2
import pandas as pd

# 1) Load data
train = pd.read_csv('data/bike_train.csv')
test = pd.read_csv('data/bike_test.csv')

# 2) One-hot encoding for categorical variables
target = train.pop('count')
print(train.shape, test.shape)
train = pd.get_dummies(train)
test = pd.get_dummies(test)
print(train.shape, test.shape)

# 3) Split for validation (performance comparison before submission)
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train, target, test_size=0.2, random_state=0)

# 4) Train three models
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_val)

from sklearn.metrics import root_mean_squared_error
print(root_mean_squared_error(y_val, pred))

from lightgbm import LGBMRegressor
lgb = LGBMRegressor(random_state=0, verbose=-1)
lgb.fit(X_tr, y_tr)
pred = lgb.predict(X_val)
print(root_mean_squared_error(y_val, pred))

from xgboost import XGBRegressor
xgb = XGBRegressor(random_state=0, verbosity=0)
xgb.fit(X_tr, y_tr)
pred = xgb.predict(X_val)
print(root_mean_squared_error(y_val, pred))

# 5) Predict on test with the selected model
#    (Optional: re-train it on the full train set first)
# lgb.fit(train, target)
pred = lgb.predict(test)  # must be test predictions, not the validation predictions from step 4

# 6) Save submission file (only 1 'pred' column, remove index)
result = pd.DataFrame({'pred': pred})
result.to_csv('result.csv', index=False)

# Check submission file
print("\n ===== Submission File (Sample) =====")
print(pd.read_csv("result.csv").head())
print("\n ===== Submission File (Check Size) =====")
print(pd.read_csv("result.csv").shape)

I've updated the mock exams for Task Type 1 and Task Type 3! 🎉
I created the Coding Pang website so you can solve problems in an environment as similar to the actual exam as possible, and I have uploaded new questions there.
You can solve problems by writing code yourself, just like in the actual exam, so please come and practice 😊 I hope you pass the exam!
👉 Coding Pang Link: code.sideonai.com

It's still in beta, so please let me know if you find any bugs!
The exam is now just one week away.
I am sharing the recording of the Big Data Analysis Certification prep live session held one week prior (6.13).
It's an hour long, so try skimming through it at a faster playback speed!
Good luck to everyone :)

(Because it counts toward the total lecture time, this will be deleted after the 12th exam.)
The exam is only two weeks away!
Please take a look at the exam environment.
(Just check it out, and please continue your actual learning in the existing Colab for better speed!)
👉 Experience the exam environment
However, in the actual exam environment, it is impossible to load data via external links or upload files.
To compensate for this, we have made it possible to use both the existing link method and file uploads.
I have created a new beta environment for "Coding Pang."
Link: code.sideonai.com
Since it is still in the beta stage, there may be some shortcomings. Please feel free to give us feedback if you experience any inconveniences while using it. We will actively incorporate your suggestions. 🙏
We recommend using the Chrome browser, and the execution speed may vary depending on the performance of your laptop (PC)!
The announcement for the 12th exam has been posted.
The content is the same as the existing exam; the changes only improve phrasing and notation.
Similar wording has appeared before, but the Task Type 2 description includes "To obtain excellent evaluation metrics, the optimal model must be explored," so you need to prepare at least two models (Random Forest, LightGBM).
https://www.dataq.or.kr/www/board/view.do?boardKind=notice&bbsKey=eyJiYnNhdHRyU2VxIjoxLCJiYnNTZXEiOjU3OTAxM30=
I'm rooting for you for the remaining period!!

