Difference between pipeline transform and fit_transform

Question

Hello, I have been enjoying the NLP course as well, and I have one question.

In the code below, the data was vectorized with the pipeline's fit_transform:

# Set the vectorizer parameters to create a vectorization template
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=2,
                             ngram_range=(1, 3),
                             max_features=20000)

pipeline = Pipeline([
    ('vect', vectorizer),
])

%time train_data_features = pipeline.fit_transform(train["review_clean"])
train_data_features

Later on, when vectorizing the actual test data, you used transform instead:

%time test_data_features = pipeline.transform(clean_test_reviews)

test_data_features = test_data_features.toarray()

Could you explain whether there is a difference between fit_transform() and transform()?

I am very curious about NLP, but it is complex and difficult.

Answer

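Thank you so much for the detailed explanation.

The lecture quality was already great, and the answers to questions are just as impressive.

I think this is very helpful for data analysis beginners like me. Viewing the source code with two question marks (??) is a great tip.

The terminology and content are still hard for me, but I now have a feel for how to approach this and for what fit and fit_transform do.

Thank you!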

Answer

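Hello.

That is a good question.

scikit-learn provides the API methods fit, fit_transform, and transform.

These methods are used for training, transformation, preprocessing, and so on.

fit is mainly used for training, while fit_transform and transform are used for scaling numeric values, preprocessing text data, encoding data, imputing missing values, and similar tasks.

This API is used throughout scikit-learn, not only in natural language processing.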
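
To illustrate that the same pattern applies outside NLP, here is a minimal sketch using StandardScaler on numeric data. The toy arrays are made up for illustration. The statistics are learned from the training data only, and the test data is scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a single numeric feature.
X_train = np.array([[0.0], [10.0]])
X_test = np.array([[5.0], [20.0]])

scaler = StandardScaler()

# fit_transform: learn the mean and standard deviation from X_train, then scale X_train.
X_train_scaled = scaler.fit_transform(X_train)

# transform: reuse the statistics learned above; nothing is re-learned from X_test.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)      # the mean comes from X_train only
print(X_test_scaled)
```

Calling fit_transform on the test data instead would re-learn the mean and standard deviation from the test set, so the two sets would no longer be on the same scale.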

If you take CountVectorizer and list its methods, fit, fit_transform, and transform all appear.
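
One way to produce such a listing in plain Python is with dir(). This is only a sketch; the exact set of names printed depends on the installed scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Collect the public (non-underscore) callables defined on the class.
public_methods = [
    name for name in dir(CountVectorizer)
    if not name.startswith("_") and callable(getattr(CountVectorizer, name))
]
print(public_methods)
```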

So let's look at when to use fit, fit_transform, and transform.

I called up the help for each method with ? as follows.

What they do internally is similar, but you can see that each one differs slightly.

Signature: CountVectorizer.fit(self, raw_documents, y=None)
Docstring:
Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters
----------
raw_documents : iterable
    An iterable which yields either str, unicode or file objects.

Returns
-------
self

Signature: CountVectorizer.fit_transform(self, raw_documents, y=None)
Docstring:
Learn the vocabulary dictionary and return term-document matrix.

This is equivalent to fit followed by transform, but more efficiently
implemented.

Parameters
----------
raw_documents : iterable
    An iterable which yields either str, unicode or file objects.

Returns
-------
X : array, [n_samples, n_features]
    Document-term matrix.

Signature: CountVectorizer.transform(self, raw_documents)
Docstring:
Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary
fitted with fit or the one provided to the constructor.

Parameters
----------
raw_documents : iterable
    An iterable which yields either str, unicode or file objects.

Returns
-------
X : sparse matrix, [n_samples, n_features]
    Document-term matrix.

Also, if you look at the source code of fit, you can see that it calls fit_transform internally.

Signature: CountVectorizer.fit(self, raw_documents, y=None)
Source:   
    def fit(self, raw_documents, y=None):
        """Learn a vocabulary dictionary of all tokens in the raw documents.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields either str, unicode or file objects.

        Returns
        -------
        self
        """
        self._warn_for_unused_params()
        self.fit_transform(raw_documents)
        return self
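
Because fit simply delegates to fit_transform, calling fit and then transform should give the same matrix as calling fit_transform once. A small check on a made-up corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
docs = ["good movie", "bad movie", "good film"]

# Two steps: learn the vocabulary, then count tokens against it.
two_steps = CountVectorizer().fit(docs).transform(docs)

# One step: learn the vocabulary and count tokens in a single pass.
one_step = CountVectorizer().fit_transform(docs)

matrices_match = np.array_equal(two_steps.toarray(), one_step.toarray())
print(matrices_match)
```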

Internally, they all end up using _count_vocab.

        vocabulary, X = self._count_vocab(raw_documents,
                                          self.fixed_vocabulary_)

transform performs the conversion and returns a document-term matrix.

Source:   
    def transform(self, raw_documents):
        """Transform documents to document-term matrix.

        Extract token counts out of raw text documents using the vocabulary
        fitted with fit or the one provided to the constructor.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields either str, unicode or file objects.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Document-term matrix.
        """
        if isinstance(raw_documents, str):
            raise ValueError(
                "Iterable over raw text documents expected, "
                "string object received.")
        self._check_vocabulary()

        # use the same matrix-building strategy as fit_transform
        _, X = self._count_vocab(raw_documents, fixed_vocab=True)
        if self.binary:
            X.data.fill(1)
        return X

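According to the docstring, fit_transform is equivalent to fit followed by transform, but is implemented more efficiently.

It likewise returns a document-term matrix. This is why the training data goes through fit_transform, which learns the vocabulary and vectorizes in one pass, while the test data goes through transform only: the test data has to be encoded with the vocabulary learned from the training data so that both matrices have the same columns.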
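
The train/test usage from the question can be sketched on a tiny made-up corpus. The review strings here are hypothetical; the pipeline has the same single 'vect' step as in the question, but with default CountVectorizer parameters.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the cleaned reviews.
train_reviews = ["good movie", "bad movie", "good film"]
test_reviews = ["good plot", "terrible movie"]

pipeline = Pipeline([("vect", CountVectorizer())])

# fit_transform: learn the vocabulary from the training reviews and vectorize them.
train_features = pipeline.fit_transform(train_reviews)
vocabulary = sorted(pipeline.named_steps["vect"].vocabulary_)
print(vocabulary)

# transform: vectorize the test reviews with the vocabulary learned above.
# Words never seen during fitting ("plot", "terrible") are simply ignored.
test_features = pipeline.transform(test_reviews)

# Both results are sparse matrices; toarray() converts them to dense arrays.
print(test_features.toarray())
```

Both matrices have one column per training vocabulary term, which is what a downstream model expects. Calling fit_transform on the test reviews would build a different vocabulary and therefore different columns.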

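In most other components, fit and transform are implemented separately and fit_transform simply runs fit and then transform in one call. Here, however, to avoid side effects in TfidfVectorizer, fit_transform does not call transform; it builds the matrix itself according to the parameters it was given.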

Source:   
    def fit_transform(self, raw_documents, y=None):
        """Learn the vocabulary dictionary and return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which yields either str, unicode or file objects.

        Returns
        -------
        X : array, [n_samples, n_features]
            Document-term matrix.
        """
        # We intentionally don't call the transform method to make
        # fit_transform overridable without unwanted side effects in
        # TfidfVectorizer.
        if isinstance(raw_documents, str):
            raise ValueError(
                "Iterable over raw text documents expected, "
                "string object received.")

        self._validate_params()
        self._validate_vocabulary()
        max_df = self.max_df
        min_df = self.min_df
        max_features = self.max_features

        vocabulary, X = self._count_vocab(raw_documents,
                                          self.fixed_vocabulary_)

        if self.binary:
            X.data.fill(1)

        if not self.fixed_vocabulary_:
            X = self._sort_features(X, vocabulary)

            n_doc = X.shape[0]
            max_doc_count = (max_df
                             if isinstance(max_df, numbers.Integral)
                             else max_df * n_doc)
            min_doc_count = (min_df
                             if isinstance(min_df, numbers.Integral)
                             else min_df * n_doc)
            if max_doc_count < min_doc_count:
                raise ValueError(
                    "max_df corresponds to < documents than min_df")
            X, self.stop_words_ = self._limit_features(X, vocabulary,
                                                       max_doc_count,
                                                       min_doc_count,
                                                       max_features)

            self.vocabulary_ = vocabulary

        return X
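
The pruning step near the end of the source above (the min_df, max_df, and max_features handling) runs only while fitting. As a rough illustration on a made-up corpus, with min_df=2 as in the question, terms that occur in fewer than two documents are dropped from the vocabulary, and transform later ignores them:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "good" and "movie" appear in two documents each,
# "film" and "bad" in only one.
docs = ["good movie", "good film", "bad movie"]

vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(docs)

# Only terms with a document frequency of at least 2 survive.
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(X.toarray())

# transform uses the pruned vocabulary, so dropped terms produce no counts.
pruned_only = vectorizer.transform(["bad film"])
print(pruned_only.nnz)
```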

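In scikit-learn, fit is mainly used either with both X and y (supervised learning) or with X alone (unsupervised learning),

while fit_transform and transform are typically used with X alone (unsupervised learning). As the signatures above show, fit_transform does accept an optional y, but CountVectorizer does not use it.

Supervised learning applies when labels (ground-truth values) exist, as with the True/False sentiment labels of the movie reviews here.

Unsupervised learning works without labels and is mainly used for clustering, dimensionality reduction, data preprocessing, and so on.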
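
The two kinds of fit can be seen side by side in one pipeline. This is a minimal sketch with made-up reviews and labels (1 for positive, 0 for negative), and LogisticRegression is chosen here only as an example classifier. The vectorizer step is fitted on X alone, while the classifier step is fitted on X and y.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = positive review, 0 = negative review.
train_reviews = ["good movie", "great film", "bad movie", "terrible film"]
train_labels = [1, 1, 0, 0]
test_reviews = ["good film", "bad film"]

clf = Pipeline([
    ("vect", CountVectorizer()),      # transformer: learns a vocabulary from X
    ("model", LogisticRegression()),  # estimator: learns from X and y
])

# fit: the pipeline calls fit_transform on "vect" and then fit on "model".
clf.fit(train_reviews, train_labels)

# predict: the pipeline calls transform (not fit_transform) on "vect" first.
predictions = clf.predict(test_reviews)
print(predictions)
```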

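You can view the internal source code shown above by appending two question marks (??) to each API.

One question mark (?) shows the help text.

Two question marks (??) show the source code, so you can look up whatever you are curious about.

I mostly use this method myself when learning an internal API :)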
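
The ? and ?? shortcuts are available in IPython and Jupyter. Outside a notebook, a roughly equivalent sketch uses the standard inspect module:

```python
import inspect
from sklearn.feature_extraction.text import CountVectorizer

# Similar to `CountVectorizer.transform?`: fetch the docstring.
doc_text = inspect.getdoc(CountVectorizer.transform)
print(doc_text.splitlines()[0])

# Similar to `CountVectorizer.fit??`: fetch the source code.
source_text = inspect.getsource(CountVectorizer.fit)
print(source_text)
```

The printed source depends on the installed scikit-learn version, so it may differ from the listing quoted in this answer.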

Answer

Just knowing how to view the source code and the help text seems to be a big help! Thank you :)

전재웅
