keras-yolo3 training error

Question

Hello, instructor. I ran into an error while training keras-yolo3 on the raccoon dataset. Training works fine while the layers are frozen, but as soon as I unfreeze all layers the error occurs right from the start. It looks like a memory capacity problem. Since this is my personal GPU, I assume the only fix is to use a GPU with more memory?

    ---------------------------------------------------------------------------
    ResourceExhaustedError                    Traceback (most recent call last)
         82         epochs=100,
         83         initial_epoch=50,
    ---> 84         callbacks=[logging, checkpoint, reduce_lr, early_stopping])
         85     model.save_weights(log_dir + 'trained_weights_final.h5')

    ~\anaconda3\envs\tf113\lib\site-packages\keras\legacy\interfaces.py in wrapper(*args, **kwargs)
         89                 warnings.warn('Update your `' + object_name +
         90                               '` call to the Keras 2 API: ' + signature, stacklevel=2)
    ---> 91             return func(*args, **kwargs)
         92         wrapper._original_function = func
         93         return wrapper

    ~\anaconda3\envs\tf113\lib\site-packages\keras\engine\training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
       1413             use_multiprocessing=use_multiprocessing,
       1414             shuffle=shuffle,
    -> 1415             initial_epoch=initial_epoch)
       1416
       1417     @interfaces.legacy_generator_methods_support

    ~\anaconda3\envs\tf113\lib\site-packages\keras\engine\training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
        211                 outs = model.train_on_batch(x, y,
        212                                             sample_weight=sample_weight,
    --> 213                                             class_weight=class_weight)
        214
        215                 outs = to_list(outs)

    ~\anaconda3\envs\tf113\lib\site-packages\keras\engine\training.py in train_on_batch(self, x, y, sample_weight, class_weight)
       1213             ins = x + y + sample_weights
       1214         self._make_train_function()
    -> 1215         outputs = self.train_function(ins)
       1216         return unpack_singleton(outputs)
       1217

    ~\anaconda3\envs\tf113\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
       2664                 return self._legacy_call(inputs)
       2665
    -> 2666             return self._call(inputs)
       2667         else:
       2668             if py_any(is_tensor(x) for x in inputs):

    ~\anaconda3\envs\tf113\lib\site-packages\keras\backend\tensorflow_backend.py in _call(self, inputs)
       2634                                 symbol_vals,
       2635                                 session)
    -> 2636         fetched = self._callable_fn(*array_vals)
       2637         return fetched[:len(self.outputs)]
       2638

    ~\anaconda3\envs\tf113\lib\site-packages\tensorflow\python\client\session.py in __call__(self, *args, **kwargs)
       1437           ret = tf_session.TF_SessionRunCallable(
       1438               self._session._session, self._handle, args, status,
    -> 1439               run_metadata_ptr)
       1440         if run_metadata:
       1441           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

    ~\anaconda3\envs\tf113\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
        526             None, None,
        527             compat.as_text(c_api.TF_Message(self.status.status)),
    --> 528             c_api.TF_GetCode(self.status.status))
        529     # Delete the underlying status object from memory otherwise it stays alive
        530     # as there is a reference to status from this from the traceback due to

    ResourceExhaustedError: OOM when allocating tensor with shape[1024,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node training_1/Adam/gradients/conv2d_58/convolution_grad/Conv2DBackpropFilter}}]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

One more question: during the initial training stage, is the first layer kept frozen even if I don't set freezing explicitly in code? Thank you.

권 철민 · Answer

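Hello. An OOM error is almost always a memory problem. Please check how much GPU memory your PC has. The lab code was tested with 16 GB of GPU memory. For this exercise, I would recommend training with the Colab version of the lab code. Thank you.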
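
As a rough illustration of why this points to memory, the size of the tensor named in the OOM message can be worked out from its shape and dtype. The sketch below is plain Python and does not call TensorFlow. The helper name `oom_tensor_bytes` and the `DTYPE_BYTES` table are made up for this example, and it assumes that `type float` in the message means 32-bit floats.

```python
import re

# Bytes per element for dtype names as they appear in TensorFlow OOM messages.
# Assumption for this sketch: "float" is float32, "half" is float16, "double" is float64.
DTYPE_BYTES = {"float": 4, "half": 2, "double": 8}


def oom_tensor_bytes(message):
    """Return the size in bytes of the tensor named in an OOM message (hypothetical helper)."""
    match = re.search(r"shape\[([\d,]+)\] and type (\w+)", message)
    if match is None:
        raise ValueError("no tensor shape found in message")
    count = 1
    for dim in match.group(1).split(","):
        count *= int(dim)
    return count * DTYPE_BYTES[match.group(2)]


# The message text is copied from the traceback in the question.
msg = ("OOM when allocating tensor with shape[1024,512,3,3] and type float "
       "on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc")
size = oom_tensor_bytes(msg)
print(size, "bytes =", size / 2**20, "MiB")  # 18874368 bytes = 18.0 MiB
```

The failed allocation is only 18 MiB, which suggests the GPU was already almost full when the gradient for `conv2d_58` was requested, rather than this one tensor being unusually large.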

젓인 · Answer

Thank you.

젓인 · Answer

The error still occurs even with a batch size of 1.
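
One possible reason a batch size of 1 does not help is that part of the training memory does not depend on the batch size at all. Every trainable weight needs a gradient, and the Adam optimizer (visible in the traceback's node name) keeps two extra moment estimates per trainable weight. The sketch below estimates that fixed cost in plain Python. The function name, the parameter count of roughly 62 million for YOLOv3, and the trainable count used for the frozen case are all assumptions chosen for illustration, not values measured from this model.

```python
def fixed_training_bytes(total_params, trainable_params, bytes_per_value=4):
    """Estimate batch-independent memory for Adam training with float32 values."""
    weights = total_params * bytes_per_value
    gradients = trainable_params * bytes_per_value
    # Adam stores a first and a second moment estimate for each trainable weight.
    adam_slots = 2 * trainable_params * bytes_per_value
    return weights + gradients + adam_slots


ASSUMED_TOTAL_PARAMS = 62_000_000      # assumption: approximate YOLOv3 size
ASSUMED_FROZEN_TRAINABLE = 100_000     # hypothetical: only the output layers train

unfrozen = fixed_training_bytes(ASSUMED_TOTAL_PARAMS, ASSUMED_TOTAL_PARAMS)
frozen = fixed_training_bytes(ASSUMED_TOTAL_PARAMS, ASSUMED_FROZEN_TRAINABLE)

print("all layers unfrozen:", round(unfrozen / 2**20), "MiB")
print("most layers frozen: ", round(frozen / 2**20), "MiB")
```

This estimate leaves out activations. With every layer unfrozen, backpropagation also has to keep the intermediate feature maps of the whole network, and their size depends on the input resolution even when the batch size is 1. So an unfrozen run can plausibly exceed a small GPU's memory regardless of the batch size.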