[Side Project After Work] Big Data Analysis Certification Practical Exam (Type 1, 2, 3) Course

✅1. ANOVA / Two-way ANOVA / One-way ANOVA

→ For ANOVA factors, wrapping the variable in C() is standard practice

Example:

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
model = ols("y ~ C(group)", data=df).fit()
anova_lm(model)

ANOVA is, by definition, an analysis that compares differences in means between groups, so its factors are categorical.
Therefore, even if the problem does not explicitly say "categorical",
the factor itself is a grouping variable, so C() is the default.

In other words,
✔ Even if the factor is coded as numbers → C()
✔ Even if the factor is stored as text → C()
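
To see why C() matters for a number-coded factor, here is a minimal sketch on made-up data (the column names group and y and all values are hypothetical). It compares the degrees of freedom that anova_lm reports with and without C().

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical data: three groups coded as the numbers 1, 2, 3
df = pd.DataFrame({
    "group": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "y": [5.1, 4.9, 5.3, 5.0, 6.8, 7.1, 6.9, 7.2, 5.9, 6.1, 6.0, 5.8],
})

# With C(): group is a factor with 3 levels, so it uses 3 - 1 = 2 degrees of freedom
table_cat = anova_lm(ols("y ~ C(group)", data=df).fit())

# Without C(): group is read as one numeric slope, so it uses 1 degree of freedom
table_num = anova_lm(ols("y ~ group", data=df).fit())

print(table_cat)
print(table_num)
```

Without C(), the model is a straight-line regression on the code values rather than a comparison of group means, so the F statistic and p-value answer a different question.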

➡In regression (ols), only variables explicitly specified as categorical in the problem should use C()

Example:

model = ols("y ~ x1 + region", data=df).fit()

A variable being stored as numbers is not a reason to treat it as categorical.
Treat numeric variables as continuous unless the problem specifically states that they are categorical variables.
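
The difference shows up in the coefficients. The following is a small sketch on made-up data (x1, grade, y and all values are hypothetical), fitting the same regression with the number-coded column left continuous and then wrapped in C().

```python
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical data: x1 is a measurement, grade is a number-coded column (1, 2, 3)
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "grade": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "y": [2.1, 3.9, 6.2, 4.8, 7.1, 9.0, 8.2, 9.9, 12.1],
})

# Default for regression: the numeric column stays continuous (one slope)
m_cont = ols("y ~ x1 + grade", data=df).fit()

# Only if the problem states that grade is categorical: wrap it in C()
m_cat = ols("y ~ x1 + C(grade)", data=df).fit()

print(list(m_cont.params.index))  # a single coefficient for grade
print(list(m_cat.params.index))   # one dummy coefficient per non-reference level
```

Because the two fits estimate a different number of coefficients, adding C() when the problem did not ask for it changes every reported estimate and p-value.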

➡logit follows the same principle as ols

Example:

from statsmodels.formula.api import logit
model = logit("target ~ x1 + job_type", data=df).fit()

logit needs C() only when the problem explicitly states that a variable is categorical.
Otherwise, do not add C() automatically.
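
A note on text columns: the statsmodels formula interface dummy-codes string columns on its own, so the rule above mainly affects number-coded columns. The sketch below, on made-up data (x1, job_type, target and all values are hypothetical), fits a logit with and without C() on a text column and compares the two fits.

```python
import pandas as pd
from statsmodels.formula.api import logit

# Hypothetical data: job_type is stored as text, target is 0/1
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6] * 3,
    "job_type": ["a"] * 6 + ["b"] * 6 + ["c"] * 6,
    "target": [0, 0, 1, 0, 1, 1,
               0, 1, 0, 1, 1, 1,
               0, 0, 0, 1, 0, 1],
})

# No C(): the text column is still expanded into dummy variables
m_plain = logit("target ~ x1 + job_type", data=df).fit(disp=0)

# With C(): same dummy coding, only the coefficient labels differ
m_c = logit("target ~ x1 + C(job_type)", data=df).fit(disp=0)

print(list(m_plain.params.index))
print(list(m_c.params.index))
print(m_plain.llf, m_c.llf)  # log-likelihoods of the two fits
```

For a text column, leaving C() out gives the same model. For a number-coded column it does not, which is why the problem statement decides.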