[Side Project After Work] Big Data Analysis Certification Practical Exam (Type 1, 2, 3)
We guide non-majors and beginners to quickly obtain the Big Data Analysis Certification (Practical Exam)! Keep the theory light and the practice solid—focusing on core points that are guaranteed to appear on the exam through past questions, without the need for complex background knowledge.
4,957 learners
Level Beginner
Course period 12 months

News
77 articles
The final results for the 11th Big Data Analytics Engineer Practical Exam have been announced!
Congratulations to those who passed. If you received disappointing results, let's use this experience as a stepping stone and join us again next year with the determination to grow even more!!
I will also reflect on this exam content and the feedback you've provided, and come back next year with an even more updated course. 💪💪💪
And
I'm a bit embarrassed, but thanks to all of you, I received an award at the Inflearn Awards yesterday! Thank you so much :)
Wrap up the year well and have a happy Christmas and New Year! 🙇🏼♂️🙇🏼♂️🙇🏼♂️
We'll have to see how it turns out, but I've put together a recap video of the 11th exam (excluding the t-test and sensitivity questions).
Congratulations to everyone who took the Big Data Analytics Engineer exam - great job! 😊
How did you find it compared to previous exams? I've heard opinions that it was similar to past questions and relatively manageable, but I'm curious about your experience! 🤔
Why is equal_var=True used when the problem doesn't mention equal variance?
Thank you to Song** for the question. In Type 3, Subproblem 3 of the practice problems, the term "equal variance" does not appear directly in the problem text. However, the solution is as follows:

#3
from scipy import stats
result = stats.ttest_ind(df[cond1]['Resistin'], df[cond2]['Resistin'], equal_var=True)
print(round(result.pvalue, 3))

I used the equal-variance assumption (Student's t-test). The reason is as follows: the problem was a typical three-stage testing problem with this flow.
1. Check the variance difference between the two groups with an F-test
2. Calculate the pooled variance estimator
3. Perform an independent-samples t-test using the pooled variance
The very instruction to calculate a pooled variance already presupposes that the variances of the two groups are equal. Therefore, I approached the solution with equal_var=True.
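To make the three-stage flow above concrete, here is a minimal sketch in Python. The data is randomly generated for illustration, and the group names a and b are my own placeholders, not from the exam:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(10, 2, 30)  # group 1 (illustrative data)
b = rng.normal(11, 2, 35)  # group 2 (illustrative data)

# Step 1: F-test for equal variances (ratio of sample variances, two-sided p-value)
f_stat = np.var(a, ddof=1) / np.var(b, ddof=1)
f_p = 2 * min(stats.f.cdf(f_stat, len(a) - 1, len(b) - 1),
              stats.f.sf(f_stat, len(a) - 1, len(b) - 1))

# Step 2: pooled variance estimator
n1, n2 = len(a), len(b)
pooled_var = ((n1 - 1) * np.var(a, ddof=1) + (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2)

# Step 3: Student's t-test, which uses exactly this pooled variance
result = stats.ttest_ind(a, b, equal_var=True)
print(round(pooled_var, 3), round(result.pvalue, 3))
```

Note that the t-statistic from equal_var=True is (mean difference) / sqrt(pooled_var * (1/n1 + 1/n2)), which is why Step 2 implies Step 3 must use the equal-variance form.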
Additionally:
One-sample t-test: no equal-variance test needed (there are no two groups to compare)
Paired t-test: no equal-variance test needed (it uses only the difference values)
Independent-samples t-test: consider an equal-variance test
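A quick sketch of the three cases above, using scipy's standard functions on made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(5, 1, 20)
after = before + rng.normal(0.3, 0.5, 20)   # paired measurements on the same subjects
other = rng.normal(5.5, 1, 25)              # an independent second group

# One-sample t-test: one group vs a fixed value, no equal-variance question arises
one = stats.ttest_1samp(before, popmean=5)

# Paired t-test: internally works only on the per-subject differences
paired = stats.ttest_rel(before, after)

# Independent-samples t-test: here equal variance matters
#   equal_var=True  -> Student's t-test (pooled variance)
#   equal_var=False -> Welch's t-test (no equal-variance assumption)
ind = stats.ttest_ind(before, other, equal_var=True)
```

The paired test is equivalent to a one-sample test on the differences, which is exactly why it needs no equal-variance check.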
Tomorrow is the Big Data Analytics Engineer exam.
Best of luck, and I've put together some examples of how the practical Type 3 questions are worded.
Good luck on your exam 👏👏

Example Problem Type Learning
- Non-parametric methods are excluded due to low priority
Differences between Past Exam Questions vs Practice Problems
In past exam questions or example problems, there were no cases where columns were deleted.
However, when dealing with more complex data in practice/mock problems, situations arise where column deletion becomes necessary.
1⃣ When all values are unique
# Example: ID, customer number, order number, etc.
df['customer_id'].nunique() == len(df)  # Consider deletion if True

Numeric type: even if left as is, the model automatically assigns it low importance
No major issue even if you don't delete it
String type: deletion recommended, because encoding causes a dimension explosion! ⚠
Label Encoding creates a meaningless ordinal relationship
One-Hot Encoding makes the number of columns jump to the number of rows (it may not even run within the 1-minute execution limit)
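A minimal sketch of the check-and-delete step for ID-like string columns. The customer_id data here is hypothetical:

```python
import pandas as pd

# Hypothetical data: customer_id is a unique string key, amount is numeric
df = pd.DataFrame({
    "customer_id": [f"C{i:04d}" for i in range(100)],
    "amount": range(100),
})

# Drop string (object-dtype) columns where every value is unique,
# i.e. ID-like columns that would explode under one-hot encoding
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() == len(df):
        df = df.drop(columns=col)
```

Numeric unique columns (like amount here) can be left alone, matching the rule above.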
2⃣ When encoding is difficult
# Example: free text, addresses, emails, etc.
df['comment'].head()  # "Fast delivery", "Clean packaging", "Will repurchase"...

Baseline: delete it first and run the model
Advanced strategy: if you have time left, think about ways to salvage it
Create derived variables such as text length or the presence of specific keywords
e.g. Flight number (KE1234) → Airline (KE) + Flight number (1234), extracted separately
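The flight-number split above can be sketched with a regex extract (the column and data here are made up to match the KE1234 example):

```python
import pandas as pd

# Hypothetical flight-number column, as in the KE1234 example
df = pd.DataFrame({"flight": ["KE1234", "OZ567", "KE890"]})

# Split the letter prefix (airline) from the numeric part (flight number)
df[["airline", "flight_no"]] = df["flight"].str.extract(r"([A-Z]+)(\d+)")
df["flight_no"] = df["flight_no"].astype(int)
```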
3⃣ When there are excessively many missing values (80-90% or more)
df['컬럼'].isnull().sum() / len(df)

Baseline: delete it first and play it safe
Advanced strategy: if you have time left, think about ways to salvage it
Fill the missing entries with substitute values (even arbitrary ones)
Compare the evaluation metric after deletion vs. after filling
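The missing-ratio check can be written as a short sketch. The 0.8 threshold and the column names are my own assumptions for illustration, not an official rule:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column is 90% missing, one is complete
df = pd.DataFrame({
    "mostly_missing": [np.nan] * 90 + list(range(10)),
    "ok": list(range(100)),
})

# Missing ratio per column, then drop columns above an assumed 0.8 threshold
ratio = df.isnull().sum() / len(df)
df = df.drop(columns=ratio[ratio > 0.8].index)
```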
💡 What if you encounter columns that are difficult to process like the above?
Phase 1: Quickly Complete the Baseline (30~40 minutes)
Cases 2 and 3 should be boldly deleted
For item 1, if it's a string type, delete it; if it's a numeric type, it's OK to leave it as is.
Complete the code that can be submitted for now
Phase 2: Advanced work (only when there's spare time)
Attempting to recover deleted columns
Performance improvement verification
⚠ Precautions
Time management is the top priority! Submittable code is more important than perfect preprocessing
1st submission: the baseline with the deletions. Then, if time remains, try the advanced steps and make a 2nd submission!
✅1. ANOVA / Two-way ANOVA / One-way ANOVA
→ For categorical factors, C() is the standard practice
Example:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
model = ols("y ~ C(group)", data=df).fit()
anova_lm(model)

ANOVA is by nature an analysis that compares differences in means between groups → the factor is categorical.
Therefore, even if the problem doesn't explicitly use the word "categorical",
the factor itself is a group variable, so C() is the default.
In other words,
✔ Even if it's numeric → C()
✔ Even if it's text → C()

❌2. Regression Analysis (OLS)
➡ Only variables explicitly stated as categorical in the problem should use C()
Example:
from statsmodels.formula.api import ols
ols("y ~ x1 + region", data=df)

Just because a variable is numeric does not mean it should automatically be treated as categorical.
Treat numeric variables as continuous unless the problem explicitly states they are "categorical variables".
❌3. Logistic Regression (logit)
➡Same principle as ols
Example:
from statsmodels.formula.api import logit
logit("target ~ x1 + job_type", data=df)

logit needs C() only when the problem explicitly says "categorical".
Otherwise, never add C() automatically.
Unfortunately, there are no execution shortcuts.
Comment: Ctrl + /
Multi-line comment: select the block, then Ctrl + /
Zoom in: Ctrl + '+'
Zoom out: Ctrl + '-' (handy if the monitor is small)
Move to beginning of line: Ctrl + Left arrow (mainly used when adding brackets)
Move to end of line: Ctrl + Right arrow (mainly used when adding brackets)
Find (search): Ctrl + F
Ctrl + F can also be used in the basic data tab

Copy and paste the output of the dir and help commands into Notepad (this must be done with the mouse).
There, the search functionality is available.
Searching within the execution results (output) pane itself is not possible.
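As an alternative to the Notepad workaround, you can also filter dir() output directly in code. This is a small sketch; "fill" is just an example search term:

```python
import pandas as pd

# List DataFrame attributes/methods whose name contains "fill",
# instead of searching the pasted dir() output by hand
fill_methods = [name for name in dir(pd.DataFrame) if "fill" in name]
print(fill_methods)  # 'fillna' will be among them
```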

Hands-on Experience Link

