[Side Project After Work] Big Data Analysis Certification Practical Exam (Type 1, 2, 3)

We guide non-majors and beginners to quickly obtain the Big Data Analysis Certification (Practical Exam)! Keep the theory light and the practice solid—focusing on core points that are guaranteed to appear on the exam through past questions, without the need for complex background knowledge.

✅ Practical Exam Type 2: When do you delete columns?

Past Exam Questions vs. Practice Problems

In past exam questions or example problems, there were no cases where columns were deleted.

However, when dealing with more complex data in practice/mock problems, situations arise where column deletion becomes necessary.

1⃣ When all values are unique

# Example: ID, customer number, order number, etc.
df['customer_id'].nunique() == len(df)  # Consider deletion if True
  • Numeric: If left as is, a tree-based model will usually assign it low importance

    • No major issues even if not deleted

  • String type: Deletion recommended due to dimension explosion during encoding!

    • Label Encoding creates meaningless ordinal relationships

    • One-Hot Encoding adds one column per unique value, so here the number of new columns equals the number of rows. (The code still has to finish running within about 1 minute)
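The rule for case 1 can be sketched as follows. This is only an illustration on toy data: the DataFrame and its column names are assumptions, not part of any exam dataset.

```python
import pandas as pd

# Hypothetical toy data; every column name here is an assumption.
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004"],  # string, all unique
    "order_no": [1001, 1002, 1003, 1004],             # numeric, all unique
    "grade": ["A", "B", "A", "C"],                    # string, repeated values
    "amount": [10, 20, 10, 30],                       # numeric, repeated values
})

# Columns whose number of distinct values equals the number of rows.
unique_cols = [c for c in df.columns if df[c].nunique() == len(df)]

# Of those, drop only the non-numeric ones; numeric ID columns may stay.
drop_cols = [c for c in unique_cols if not pd.api.types.is_numeric_dtype(df[c])]

# One-hot encoding an all-unique string column yields one column per row.
n_dummy_cols = pd.get_dummies(df["customer_id"]).shape[1]

df_base = df.drop(columns=drop_cols)
print(unique_cols, drop_cols, n_dummy_cols)
print(df_base.columns.tolist())
```

Checking the dtype with `pd.api.types.is_numeric_dtype` rather than comparing against `"object"` keeps the sketch independent of how a given pandas version stores strings.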

2⃣ When encoding is difficult

# Example: Free text, addresses, emails, etc.
df['comment'].head()
# "Fast delivery", "Clean packaging", "Will repurchase"...
  • Baseline: Delete first and run the model

  • Advanced Strategy: If you have time left, look for ways to salvage the column

    • Create derived variables such as text length or the presence of specific keywords

    • e.g. Flight number (KE1234) → airline (KE) and flight number (1234) extracted separately
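A minimal sketch of the derived-variable idea, assuming hypothetical `comment` and `flight` columns. The keyword `"delivery"` and the pattern "letters followed by digits" are also assumptions; real airline codes can contain digits, in which case the regex would need adjusting.

```python
import pandas as pd

# Hypothetical toy data modelled on the examples in the text.
df = pd.DataFrame({
    "comment": ["Fast delivery", "Clean packaging", "Will repurchase"],
    "flight": ["KE1234", "OZ0567", "KE0089"],
})

# Free text -> simple numeric features.
df["comment_len"] = df["comment"].str.len()
df["has_delivery"] = df["comment"].str.contains("delivery", case=False).astype(int)

# Flight code -> airline prefix (letters) and flight number (digits).
parts = df["flight"].str.extract(r"^([A-Z]+)(\d+)$")
df["airline"] = parts[0]
df["flight_no"] = parts[1].astype(int)

# The original hard-to-encode columns are no longer needed.
df = df.drop(columns=["comment", "flight"])
print(df)
```

The resulting `airline` column has few distinct values, so it can be label- or one-hot-encoded like any other categorical column.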

3⃣ When there are excessively many missing values (80-90% or more)

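df['column'].isnull().sum() / len(df)  # fraction of missing values in the column
  • Baseline: Delete first and play it safe

  • Advanced Strategy: If you have time left, look for ways to salvage the column

    • Fill the missing values with an arbitrary placeholder value, so that the missing status itself is kept as information

      Compare the evaluation metric obtained with the column deleted against the metric obtained after filling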
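Both options for case 3 can be sketched side by side. The 0.8 threshold follows the "80-90% or more" guideline above; the toy data, the column names, and the placeholder string `"missing"` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'coupon' is 90% missing, 'city' is 10% missing.
df = pd.DataFrame({
    "age": [25, 31, 47, 52, 38, 29, 44, 60, 35, 41],
    "coupon": [None] * 9 + ["VIP"],
    "city": ["Seoul", None, "Busan", "Seoul", "Daegu",
             "Busan", "Seoul", "Daegu", "Seoul", "Busan"],
})

# Fraction of missing values per column.
missing_ratio = df.isnull().mean()

THRESHOLD = 0.8  # assumed cut-off, taken from the 80-90% guideline
high_missing = missing_ratio[missing_ratio >= THRESHOLD].index.tolist()

# Baseline: drop the mostly-empty columns.
df_drop = df.drop(columns=high_missing)

# Alternative: keep them, marking missing entries with a placeholder category.
df_fill = df.copy()
for col in high_missing:
    df_fill[col] = df_fill[col].fillna("missing")

print(missing_ratio)
print(high_missing)
```

`df_drop` and `df_fill` can then each be fed to the same model so that the two evaluation scores are directly comparable.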

💡 What if you encounter columns that are difficult to process like the above?

  1. Phase 1: Quickly Complete the Baseline (30~40 minutes)

    • Delete the columns from cases 2 and 3 without hesitation

    • For case 1, delete the column if it is a string type; a numeric type can be left as is

    • Finish code that can be submitted as it stands

  2. Phase 2: Refine if time permits

    • Try to recover the deleted columns

    • Verify whether performance actually improves
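One way to verify Phase 2 is to score the same model on a hold-out split with and without the recovered column. The sketch below assumes a binary classification task scored with ROC-AUC and uses `RandomForestClassifier`; the synthetic data and the `comment_len` column are stand-ins, not exam data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; 'comment_len' plays the role of a recovered column.
rng = np.random.default_rng(0)
n = 400
train = pd.DataFrame({
    "amount": rng.normal(100, 20, n),
    "visits": rng.integers(1, 30, n),
    "comment_len": rng.integers(0, 200, n),
})
target = (train["amount"] + rng.normal(0, 20, n) > 100).astype(int)


def validation_auc(X, y):
    """Fit on 80% of the rows and return ROC-AUC on the remaining 20%."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    pred = model.predict_proba(X_val)[:, 1]
    return roc_auc_score(y_val, pred)


auc_baseline = validation_auc(train.drop(columns=["comment_len"]), target)
auc_recovered = validation_auc(train, target)
print(f"baseline: {auc_baseline:.4f}, with recovered column: {auc_recovered:.4f}")
```

Keeping the split and `random_state` identical in both runs means any difference in score comes from the extra column rather than from a different sample. If the recovered column does not help, the baseline version remains the one to submit.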

Precautions

  • Time management is the top priority! Code that can be submitted matters more than perfect preprocessing

  • Make the 1st submission with the baseline that deletes the columns; if time remains, try recovering them and make a 2nd submission
