Question about Task Type 2, Mock Problem 3!

Question

Hello. I solved Task Type 2, Mock Problem 3 (Airbnb price) on my own and entered the code below. I used minmax_scale, but when I compared my work with your solution, I found that you did not use it, and my results came out different as well. I apply scaling every time I solve a Task Type 2 problem. Are there cases where min-max scaling should be used and cases where it should not? If so, how can I tell them apart? I would also like to know the specific reason you did not apply it in this problem. Thank you!

import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# print(train.head())
# print(test.head())
# print(train.info())
# print(test.info())

train = train.drop(columns = 'id')
test_id = test.pop('id')
train = train.drop(columns = 'name')
test = test.drop(columns = 'name')
train = train.drop(columns = 'host_id')
test = test.drop(columns = 'host_id')
train = train.drop(columns = 'host_name')
test = test.drop(columns = 'host_name')
train = train.drop(columns = 'neighbourhood')
test = test.drop(columns = 'neighbourhood')
train = train.drop(columns = 'neighbourhood_group')
test = test.drop(columns = 'neighbourhood_group')
train = train.drop(columns = 'last_review')
test = test.drop(columns = 'last_review')

# print(train.info())
# print(test.info())
# print(train.isnull().sum())
# print(test.isnull().sum())

# last_review, reviews_per_month
train['reviews_per_month'] = train['reviews_per_month'].fillna(0)
test['reviews_per_month'] = test['reviews_per_month'].fillna(0)

# print(train.isnull().sum())
# print(test.isnull().sum())
# print(train.info())  room_type,last_review
# print(test.info())

from sklearn.preprocessing import LabelEncoder
cols = train.select_dtypes(include = 'object').columns
for col in cols:
    encoder = LabelEncoder()
    train[col] = encoder.fit_transform(train[col])
    test[col] = encoder.transform(test[col])

# print(train.describe())
# print(test.describe())

from sklearn.preprocessing import minmax_scale
cols2 = train.select_dtypes(exclude = 'object').columns
for col in cols2:
    train[col] = minmax_scale(train[col])
cols3 = test.select_dtypes(exclude = 'object').columns
for col in cols3:
    test[col] = minmax_scale(test[col])

# print(train.describe())
# print(test.describe())
# print(train.info())
# print(test.info())

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(train.drop('price', axis = 1), train['price'], test_size=0.2, random_state = 20)

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
pred_val = rf.predict(X_val)

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# print(mean_squared_error(y_val, pred_val))
# print(mean_absolute_error(y_val, pred_val))
# print(r2_score(y_val, pred_val))

pred = rf.predict(test)
pd.DataFrame({'id': test_id, 'price': pred}).to_csv('5959.csv', index=False)

퇴근후딴짓 · Answer

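Hello~ Here is how I think about the data preprocessing steps:

- Missing values: required if there are any
- Outliers: optional
- Encoding: required if there are categorical columns
- Scaling: optional

Scaling is not an essential step for building a baseline model! :) That is one reason I left it out. The other is that, unlike linear regression, tree-based models such as random forest gain very little performance from scaling, so I skipped it here.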
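To see how little scaling changes a random forest, here is a minimal sketch rather than the solution code for this problem. It uses synthetic data (the column names are only borrowed as placeholders, and the fixed `random_state` values are my own assumption so that both models are grown the same way). It also shows one common way to apply scaling when you do want it: fit `MinMaxScaler` on the training features only, reuse it for validation/test, and leave the target unscaled. In the question's code, `cols2` includes `price`, so the target is scaled to [0, 1] too, and `test` is scaled with its own min/max. That may explain part of the difference in results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical; NOT the actual Airbnb dataset).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "minimum_nights": rng.integers(1, 30, size=n),
    "number_of_reviews": rng.integers(0, 300, size=n),
    "availability_365": rng.integers(0, 366, size=n),
})
# Made-up target in "price" units, with some noise.
y = 50 + 2.0 * X["minimum_nights"] + 0.3 * X["availability_365"] + rng.normal(0, 5, size=n)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=20)

# Fit the scaler on the training features only, then reuse it for validation.
# The target is left unscaled so predictions stay in the original units.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Same random_state for both models so only the feature scale differs.
rf_raw = RandomForestRegressor(random_state=0).fit(X_train, y_train)
rf_scaled = RandomForestRegressor(random_state=0).fit(X_train_scaled, y_train)

mae_raw = mean_absolute_error(y_val, rf_raw.predict(X_val))
mae_scaled = mean_absolute_error(y_val, rf_scaled.predict(X_val_scaled))

print(f"MAE without scaling:   {mae_raw:.3f}")
print(f"MAE with MinMaxScaler: {mae_scaled:.3f}")
```

The two MAE values should come out essentially the same, because tree splits depend only on the ordering of feature values, and min-max scaling does not change that ordering.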