[Python Debug] Kernel Crash While Running Neural Network with Keras | Why the Jupyter Notebook Kernel Dies When Running Keras, and How to Fix It


For a recent Machine Learning assignment I had to build a Neural Network with Keras in a Jupyter Notebook. Even the simplest one-layer network wouldn't run. Stranger still, I first ran the same code on the iris dataset with no problem at all, but as soon as I used the fashion mnist dataset the professor gave us, Jupyter reported that the kernel had died and was restarting. Even stranger, the exact same code ran perfectly fine on a classmate's computer, which for a while made me suspect my ancient MacBook was just too underpowered, and I almost bought a new laptop >_<

In class today, after a few rounds of debugging by my ML professor, the problem was completely solved. CMU people really are something! (Big shout-out to the Prof here, even though he can't read Chinese ><) I've only been learning Python for a short while and am still quite new to it, so I picked up a lot of new skills from this episode ✌️

The complete code that triggered the problem is below. It implements logistic regression with Keras as a simple one-layer network, but every time it reached the last line the kernel died and Jupyter restarted it.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, FastICA
from sklearn.linear_model import LogisticRegression
from keras.models import Sequential
from keras.layers import Dense, Activation, Conv2D
from keras.utils import to_categorical
from keras.datasets import fashion_mnist

(x3_train, y_train), (x3_test, y_test) = fashion_mnist.load_data()
n_classes = np.max(y_train) + 1

# Vectorize image arrays, since most methods expect this format
x_train = x3_train.reshape(x3_train.shape[0], np.prod(x3_train.shape[1:]))
x_test = x3_test.reshape(x3_test.shape[0], np.prod(x3_test.shape[1:]))

# Binary vector representation of targets (for one-hot or multinomial output networks)
y3_train = to_categorical(y_train)
y3_test = to_categorical(y_test)

from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)    # reuse the statistics fitted on the training set

n_output = y3_train.shape[1]
n_input = x_train_scaled.shape[1]

nn_lr = Sequential() 
nn_lr.add(Dense(units=n_output, input_dim= n_input, activation = 'softmax'))
nn_lr.compile(optimizer = 'sgd', loss = 'categorical_crossentropy', metrics = ['accuracy'])

Since Jupyter Notebook just kept restarting the kernel without showing any error message, there was nothing obvious to go on. But the professor pointed out that the terminal window that opens when you launch Jupyter Notebook logs what the server is doing (first time this newbie noticed that...), including the detailed process of, and reason for, the kernel being killed and restarted:

[I 22:11:54.603 NotebookApp] Kernel interrupted: 7e7f6646-97b0-4ec7-951c-1dce783f60c4

[I 22:13:49.160 NotebookApp] Saving file at /Documents/[Rutgers]Study/2019Spring/MACHINE LEARNING W APPLCTN LARGE DATASET/hw/Untitled1.ipynb

2019-03-28 22:13:49.829246: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

2019-03-28 22:13:49.829534: I tensorflow/core/common_runtime/process_util.cc:69] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.

OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.

OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

[I 22:13:51.049 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports

kernel c1114f5a-3829-432f-a26a-c2db6c330352 restarted

Another approach is to paste the code into ipython, which produces similar output. Either way, the error was finally pinned down to:

OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.

A quick Google search turned up a very detailed discussion thread on GitHub, although the original poster hit this problem while running XGBoost. That reminded me that installing XGBoost over winter break had been quite a bumpy process; I may have accidentally downloaded some file twice into different paths, causing a conflict when the package is loaded. The thread offered several possible causes and fixes:

1. Uninstall clang-omp

brew uninstall libiomp clang-omp

as long as u got gcc v5 from brew it come with openmp

follow steps in:
https://github.com/dmlc/xgboost/tree/master/python-package

I tried uninstalling and reinstalling xgboost, then uninstalling clang-omp, and got the error:

No such keg: /usr/local/Cellar/libiomp

pip uninstall xgboost
pip install xgboost
brew uninstall libiomp clang-omp

 

2. Run the following directly in the jupyter notebook:

# DANGER! DANGER!
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

The professor said this command tells the system to ignore the package conflict and just pick one of the copies to use. I tried it and it does work, but it is a very dangerous workaround and strongly discouraged.
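If you do fall back on this workaround, the variable apparently needs to be set before Keras/TensorFlow (and with them libiomp5) gets loaded, since the OpenMP runtime reads it when it initializes. A minimal sketch of the ordering:

import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'  # set before any library that loads libiomp5

# only import the packages that pull in TensorFlow / MKL afterwards
from keras.models import Sequential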

 

3. Find the duplicate libiomp5.dylib files and delete one of them

In Finder I did indeed find two copies, one in ~/anaconda3/lib and the other in ~/anaconda3/lib/python3.6/site-packages/_solib_darwin/_U@mkl_Udarwin_S_S_Cmkl_Ulibs_Udarwin___Uexternal_Smkl_Udarwin_Slib (??). But I wasn't sure which one to delete, and this approach also felt pretty risky: delete the wrong one and nothing runs at all.
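Instead of hunting through Finder, the duplicate copies can also be listed programmatically. A small sketch, assuming the Anaconda install lives at ~/anaconda3:

import glob, os

# print every copy of libiomp5.dylib bundled somewhere under the anaconda3 tree
pattern = os.path.expanduser('~/anaconda3/**/libiomp5.dylib')
for path in glob.glob(pattern, recursive=True):
    print(path)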

 

4. An OpenMP conflict

Hint: This means that multiple copies of the OpenMP runtime have been linked into the program 

Following the Hint in the error message, I searched for TensorFlow OpenMP. OpenMP is a platform for multi-threaded parallel programming; TensorFlow apparently has its own parallelism machinery and does not need OpenMP (see https://github.com/tensorflow/tensorflow/issues/12434).
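The startup log above also hints at tuning TensorFlow's own thread pools (inter_op_parallelism_threads). A sketch of how that looked with the TensorFlow 1.x / standalone Keras API that was current at the time (the thread counts are just example values):

import tensorflow as tf
from keras import backend as K

# let TensorFlow manage its own thread pools instead of relying on OpenMP
config = tf.ConfigProto(intra_op_parallelism_threads=4,
                        inter_op_parallelism_threads=4)
K.set_session(tf.Session(config=config))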

 

5. Install nomkl

I had the same error on my Mac with a python program using numpy, keras, and matplotlib. I solved it with 'conda install nomkl'.

This was the fix that finally worked! MKL is Intel's Math Kernel Library, a module for speeding up mathematical computation; packages installed through conda use MKL by default, and nomkl is the package that opts out of it. More details are in Anaconda's official documentation:

To opt out, run conda install nomkl and then use conda install to install packages that would normally include MKL or depend on packages that include MKL, such as scipy, numpy, and pandas.

There was probably some conflict introduced when numpy or a related package got updated; after installing nomkl the problem magically went away. Later I even uninstalled MKL and the program still ran fine. The uninstall command:

conda remove mkl mkl-service
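To double-check that the MKL-linked builds are really gone afterwards, numpy can print its build configuration (a sketch; the exact output differs between numpy versions):

import numpy as np

# with nomkl in place, the BLAS/LAPACK sections printed here should no longer mention mkl
np.show_config()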

 

Summary:

1. The professor is awesome: he fixed the problem in no time ><

2. As the expert suggested, creating a virtual environment before running python is a good way to avoid package conflicts like this one (a conda sketch follows below); see: https://www.jianshu.com/p/d8e7135dca40
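For example, a throwaway conda environment for this homework could look roughly like the following (the environment name and package list are only illustrative):

conda create -n ml-hw python=3.6 numpy scipy matplotlib scikit-learn keras jupyter
conda activate ml-hw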

