Section 4. Predicting Job Change

Question

1. In this problem, train and test were combined before one-hot encoding:

    combined = pd.concat([train, test])
    combined_dummies = pd.get_dummies(combined)
    n_train = len(train)
    train = combined_dummies[:n_train]
    test = combined_dummies[n_train:]

My understanding is that this was done because the city column has more unique values in train than in test. Is that right?

2. Here is the code I wrote:

    print(train.shape, test.shape)
    # print(train.isnull().sum())
    # print(test.isnull().sum())
    print(train.info())
    print(test.info())
    print(train.describe(include="O"))
    print(test.describe(include="O"))

    a=set(train['city'])
    b=set(test['city'])
    print(a-b)
    print(b-a)

    target=train.pop('target')
    df=pd.concat([train, test])
    df=pd.get_dummies(df)
    train=train.iloc[:len(train)]
    test=test.iloc[len(train):]

    from sklearn.model_selection import train_test_split
    X_tr, X_val, y_tr, y_val = train_test_split(target, train, test_size=0.2, random_state=0)
    print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier(random_state=0)
    rf.fit(X_tr, y_tr)
    pred = rf.predict_proba(X_val)

Running it gives:

    ~~~~~~
    (12260,) (3066,) (12260, 13) (3066, 13)
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    /tmp/ipython-input-1337250417.py in ()
         24 from sklearn.ensemble import RandomForestClassifier
         25 rf = RandomForestClassifier(random_state=0)
    ---> 26 rf.fit(X_tr, y_tr)
         27 pred = rf.predict_proba(X_val)

    4 frames
    /usr/local/lib/python3.12/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_all_finite, ensure_non_negative, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
       1091                     "if it contains a single sample."
       1092                 )
    -> 1093             raise ValueError(msg)
       1094
       1095         if dtype_numeric and hasattr(array.dtype, "kind") and array.dtype.kind in "USV":

    ValueError: Expected a 2-dimensional container but got instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.

The split itself ran, but the error appears as soon as I get to the random forest.

퇴근후딴짓 · Answer

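Yes, it is because the number of unique values differs between train and test; encoding them separately would give the two sets different columns. As 정현 points out below, train and target are in the wrong positions in train_test_split(target, train). If you still get an error, please compare your code with the model answer I wrote.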
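As a rough illustration of why the unique counts matter, here is a small sketch on made-up data (the frames and the experience column are hypothetical, not the course dataset). It compares encoding train and test separately with encoding them together.

```python
import pandas as pd

# Toy data (hypothetical): "city_3" appears only in train.
train = pd.DataFrame({"city": ["city_1", "city_2", "city_3"], "experience": [1, 5, 3]})
test = pd.DataFrame({"city": ["city_1", "city_2"], "experience": [2, 4]})

# Encoding separately: test never sees "city_3", so it lacks that column.
sep_train = pd.get_dummies(train)
sep_test = pd.get_dummies(test)
missing_in_test = set(sep_train.columns) - set(sep_test.columns)
print(missing_in_test)

# Encoding the concatenated frame: both parts end up with the same columns.
combined = pd.get_dummies(pd.concat([train, test]))
n_train = len(train)
enc_train = combined.iloc[:n_train]
enc_test = combined.iloc[n_train:]
print(list(enc_train.columns) == list(enc_test.columns))
```

A model fitted on the separately encoded train would expect a column that the separately encoded test does not have, which is what combining avoids.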

문정현 · Answer

It looks like train and target are swapped in the split.
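For reference, a minimal sketch of the corrected flow on synthetic data (the data and the training_hours column are assumptions for illustration, not the real dataset). Besides passing the features first and the target second to train_test_split, it slices the encoded df rather than the original train and test, which the posted code appears to miss: the 13 columns in its printed shapes suggest the one-hot encoding was never applied.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical columns and values).
rng = np.random.default_rng(0)
n_train_rows, n_test_rows = 100, 20
cities = ["city_1", "city_2", "city_3"]
train = pd.DataFrame({
    "city": rng.choice(cities, n_train_rows),
    "training_hours": rng.integers(1, 100, n_train_rows),
    "target": np.tile([0, 1], n_train_rows // 2),
})
test = pd.DataFrame({
    "city": rng.choice(cities[:2], n_test_rows),
    "training_hours": rng.integers(1, 100, n_test_rows),
})

target = train.pop("target")
n_train = len(train)  # remember the row count before train is reassigned

df = pd.get_dummies(pd.concat([train, test]))
train = df.iloc[:n_train]  # slice the encoded df, not the original train
test = df.iloc[n_train:]

# Features (2-D DataFrame) first, target (1-D Series) second.
X_tr, X_val, y_tr, y_val = train_test_split(
    train, target, test_size=0.2, random_state=0
)
print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

rf = RandomForestClassifier(random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict_proba(X_val)[:, 1]  # probability of class 1
```

With the arguments in this order, X_tr is two-dimensional and y_tr is one-dimensional, so the printed shapes should come out the other way round from the ones in the question.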