[Side Project After Work] Big Data Analysis Certification Practical Exam (Type 1, 2, 3)

We guide non-majors and beginners to quickly obtain the Big Data Analysis Certification (Practical Exam)! Keep the theory light and the practice solid—focusing on core points that are guaranteed to appear on the exam through past questions, without the need for complex background knowledge.

(4.9) 768 reviews

4,964 learners

Level Beginner

Course period 12 months

✅ Practical Exam Type 2: When do you delete columns?

Past Exam Questions vs. Practice Problems

None of the past exam questions or example problems have required deleting a column.

Practice and mock problems, however, often use more complex data, and there column deletion can become necessary.

1⃣ When all values are unique

# Example: ID, customer number, order number, etc.
df['customer_id'].nunique() == len(df)  # Consider deletion if True

Numeric type: if left as is, the model tends to assign it low importance on its own
- No major problem even if it is not deleted
String type: deletion recommended, because encoding causes a dimension explosion! ⚠
- Label Encoding creates a meaningless ordinal relationship
- One-Hot Encoding makes the number of columns jump to roughly the number of rows (a problem when the code has to finish running within 1 minute)
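The dimension explosion can be seen on a toy example. The sketch below uses made-up data and illustrative column names (`customer_id`, `grade`, `amount` are assumptions, not exam columns); it finds non-numeric columns whose values are all unique and shows how many columns one-hot encoding would create for one of them.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004"],
    "grade": ["A", "B", "A", "C"],
    "amount": [100, 250, 80, 130],
})

# Non-numeric columns where every value is unique (ID-like columns).
id_like = [
    col for col in df.columns
    if not is_numeric_dtype(df[col]) and df[col].nunique() == len(df)
]

# One-hot encoding an ID-like column creates one new column per row.
n_dummy_cols = pd.get_dummies(df["customer_id"]).shape[1]

# Drop the ID-like string columns before encoding the rest.
df_clean = df.drop(columns=id_like)
```

On real exam-sized data the same check would turn a single ID column into thousands of dummy columns, which is why dropping it first is the safer default.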

2⃣ When encoding is difficult

# Example: Free text, addresses, emails, etc.
df['comment'].head()
# "Fast delivery", "Clean packaging", "Will repurchase"...

Baseline: delete the column first and get the model running
Advanced strategy: if you have time left, look for ways to salvage the column
- Create derived variables such as text length or the presence of specific keywords
- e.g. flight code KE1234 → airline (KE) and flight number (1234) extracted as separate columns
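The derived-variable ideas above could look roughly like the following. This is a minimal sketch on invented data: the `flight` column, the keyword `"delivery"`, and the new column names are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data; "flight" and the derived column names are illustrative.
df = pd.DataFrame({
    "comment": ["Fast delivery", "Clean packaging", "Will repurchase"],
    "flight": ["KE1234", "OZ567", "KE89"],
})

# Derived variable 1: text length.
df["comment_len"] = df["comment"].str.len()

# Derived variable 2: presence of a specific keyword (1 = present, 0 = absent).
df["has_delivery"] = df["comment"].str.contains("delivery", case=False).astype(int)

# Derived variables 3 and 4: split a flight code into airline and number.
parts = df["flight"].str.extract(r"^([A-Z]+)(\d+)$")
df["airline"] = parts[0]
df["flight_no"] = parts[1].astype(int)

# The raw text columns can now be dropped; the derived ones are easy to encode.
df = df.drop(columns=["comment", "flight"])
```

The resulting `airline` column has only a few categories, so it can be encoded like any other categorical feature.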

3⃣ When there are excessively many missing values (80-90% or more)

df['column'].isnull().sum() / len(df)  # fraction of missing values in the column

Baseline: delete the column first and play it safe
Advanced strategy: if you have time left, look for ways to salvage the column
- Fill the missing entries with an arbitrary placeholder value, so that the missing status itself becomes a value
- Compare the evaluation metric obtained with the column deleted against the result after filling
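As a rough sketch of both options, the code below computes the missing ratio for every column at once, drops the sparse ones for the baseline, and builds a second version where the gaps are filled with a placeholder. The data, the 0.8 threshold, and the `"missing"` placeholder are assumptions for illustration; which version scores better has to be checked with the problem's own evaluation metric.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "coupon" is missing in 9 of 10 rows.
df = pd.DataFrame({
    "age": [25, 31, 40, 22, 35, 28, 45, 30, 27, 33],
    "coupon": [np.nan] * 8 + ["A", np.nan],
})

THRESHOLD = 0.8  # assumed cutoff, matching the "80-90% or more" rule of thumb

# Fraction of missing values per column.
missing_ratio = df.isnull().mean()
sparse_cols = missing_ratio[missing_ratio >= THRESHOLD].index.tolist()

# Baseline: drop the sparse columns.
df_dropped = df.drop(columns=sparse_cols)

# Alternative: keep them and mark the gaps with a placeholder category.
df_filled = df.copy()
for col in sparse_cols:
    df_filled[col] = df_filled[col].fillna("missing")
```

Training the same model once on `df_dropped` and once on `df_filled` gives the two scores to compare.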

💡 What if you encounter columns that are difficult to process like the above?

Phase 1: Quickly complete the baseline (30~40 minutes)
- Delete Case 2 and Case 3 columns without hesitation
- For Case 1, delete the column if it is a string type; a numeric type can be left as is
- Get code that can be submitted working first
Phase 2: Refine if time permits (only when there is spare time)
- Try to recover the deleted columns
- Check whether performance actually improves
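The Phase 1 rules can be collected into one small helper. This is a sketch under assumptions: `baseline_drop_columns`, its parameters, and the 0.8 threshold are hypothetical names and values, and hard-to-encode columns (Case 2) still have to be listed by hand because they cannot be detected reliably.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def baseline_drop_columns(df, hard_to_encode=(), missing_threshold=0.8):
    """Return the columns to drop for a quick baseline (hypothetical helper)."""
    to_drop = []
    for col in df.columns:
        all_unique = df[col].nunique() == len(df)
        if col in hard_to_encode:                           # Case 2
            to_drop.append(col)
        elif df[col].isnull().mean() >= missing_threshold:  # Case 3
            to_drop.append(col)
        elif all_unique and not is_numeric_dtype(df[col]):  # Case 1, string type
            to_drop.append(col)
    return to_drop

# Hypothetical example data.
df = pd.DataFrame({
    "order_id": ["A1", "A2", "A3", "A4", "A5"],   # unique strings -> drop
    "seq": [1, 2, 3, 4, 5],                       # unique numbers -> keep
    "comment": ["ok", "good", "ok", "bad", "ok"],  # free text -> listed by hand
    "memo": [None, None, None, None, "x"],        # 80% missing -> drop
    "grade": ["A", "B", "A", "B", "A"],           # ordinary category -> keep
})

drop_cols = baseline_drop_columns(df, hard_to_encode=["comment"])
baseline_df = df.drop(columns=drop_cols)
```

Keeping the dropped column names in a list makes Phase 2 easier, since each one can be restored and re-evaluated individually.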

⚠ Precautions

Time management is the top priority! Code that can be submitted matters more than perfect preprocessing.
Delete the columns for the baseline and make your 1st submission; if time is left, try again with the recovered columns and make a 2nd submission.
