Contents
  1. Create Keras Model
  2. Pipeline
  3. GridSearchCV

Putting Keras Models into a Sklearn Pipeline

Create Keras Model

from keras.layers import (Input, Embedding, Reshape, Conv2D, MaxPool2D,
                          Concatenate, Flatten, Dropout, Dense)
from keras.models import Model
from keras.wrappers.scikit_learn import KerasClassifier

# sequence_length, vocabulary_size, embedding_dim, num_filters and
# filter_sizes are assumed to be defined earlier in the script.
def create_model(kernel_initializer='he_normal', optimizer='adam',
                 activation='relu', dropout=0.5):
    inputs = Input(shape=(sequence_length,), dtype='int32')
    embedding = Embedding(input_dim=vocabulary_size, output_dim=embedding_dim,
                          input_length=sequence_length)(inputs)
    # Add a channels dimension so the embedded sequence can feed Conv2D
    reshape = Reshape((sequence_length, embedding_dim, 1))(embedding)

    # One convolution per filter size, each spanning the full embedding width
    conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim),
                    padding='valid', kernel_initializer=kernel_initializer,
                    activation=activation)(reshape)
    conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim),
                    padding='valid', kernel_initializer=kernel_initializer,
                    activation=activation)(reshape)
    conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim),
                    padding='valid', kernel_initializer=kernel_initializer,
                    activation=activation)(reshape)

    # Max-pool each feature map over time down to one value per filter
    maxpool_0 = MaxPool2D(pool_size=(sequence_length - filter_sizes[0] + 1, 1),
                          strides=(1, 1), padding='valid')(conv_0)
    maxpool_1 = MaxPool2D(pool_size=(sequence_length - filter_sizes[1] + 1, 1),
                          strides=(1, 1), padding='valid')(conv_1)
    maxpool_2 = MaxPool2D(pool_size=(sequence_length - filter_sizes[2] + 1, 1),
                          strides=(1, 1), padding='valid')(conv_2)

    concatenated_tensor = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])
    flatten = Flatten()(concatenated_tensor)
    dropped = Dropout(dropout)(flatten)
    output = Dense(units=2, activation='softmax')(dropped)

    # This builds the model from the input tensor to the output tensor
    model = Model(inputs=inputs, outputs=output)
    model.compile(optimizer=optimizer, loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

keras_clf = KerasClassifier(build_fn=create_model)

To build a regression model instead, use from keras.wrappers.scikit_learn import KerasRegressor. As the function above shows, all you need to do is define your existing model inside a function and pass that function to the wrapper class (KerasRegressor or KerasClassifier), as sketched below.
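
As a minimal sketch (the network, layer sizes, and the n_features placeholder here are illustrative assumptions, not from the original post), a regression model would be wrapped the same way:

from keras.wrappers.scikit_learn import KerasRegressor
from keras.models import Sequential
from keras.layers import Dense

# n_features is a hypothetical placeholder for the input dimension.
def create_regression_model(optimizer='adam', units=64):
    model = Sequential()
    model.add(Dense(units, activation='relu', input_dim=n_features))
    model.add(Dense(1))  # single linear output unit for regression
    model.compile(optimizer=optimizer, loss='mse')
    return model

keras_reg = KerasRegressor(build_fn=create_regression_model)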

Pipeline

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # ('preprocess_step1', None),
    # ('preprocess_step2', None),
    # ('preprocess_step3', None),
    ('clf', keras_clf)
])

Pipeline is sklearn's tool for running the whole training workflow as a stream of steps: preprocessing, feature selection, and so on can each be written in as one step, with the estimator last.
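
For instance, here is a minimal sketch of how a real preprocessing step would slot in before the classifier (the identity FunctionTransformer is just an illustrative stand-in, not part of the original workflow):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# An identity transform standing in for real preprocessing (cleaning, padding, ...).
pipeline_with_preprocess = Pipeline([
    ('preprocess', FunctionTransformer()),  # replace with a real transformer
    ('clf', keras_clf)
])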

GridSearchCV

from sklearn.model_selection import GridSearchCV

param_grid = {
    'clf__optimizer': ['rmsprop', 'adam', 'adagrad'],
    'clf__epochs': [200, 300, 400, 700, 1000],
    'clf__batch_size': [32, 64, 128],
    'clf__dropout': [0.1, 0.2, 0.3, 0.4, 0.5],
    'clf__kernel_initializer': ['he_normal', 'glorot_uniform', 'normal', 'uniform']
}

# The 'clf__' prefix routes each parameter to the pipeline step named 'clf'.
grid = GridSearchCV(pipeline, cv=3, param_grid=param_grid)
grid.fit(X_train, y_train)

Grid search sounds fancy, but it is just automated parameter tuning by brute-force enumeration. Note that it is useful on small datasets but does not scale: the grid above already has 3 × 5 × 3 × 5 × 4 = 900 parameter combinations, and with 3-fold CV that means 2,700 model fits.
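
When the grid gets this large, one common workaround (not used in the original post) is sklearn's RandomizedSearchCV, which samples a fixed number of combinations instead of enumerating them all:

from sklearn.model_selection import RandomizedSearchCV

# Fits only n_iter sampled combinations instead of all 900.
random_search = RandomizedSearchCV(pipeline, param_distributions=param_grid,
                                   n_iter=20, cv=3)
random_search.fit(X_train, y_train)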

print(" Best {} using {}".format(grid.best_score_, grid.best_params_))
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
params = grid.cv_results_['params']

for mean, stdev, param in zip(means, stds, params):
print('{} {} with {}'.format(mean, stdev, param))

After training finishes, the best accuracy and the corresponding parameters are available through grid.best_score_ and grid.best_params_. The full code is available here.
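
Because GridSearchCV refits the best configuration on the whole training set by default (refit=True), the tuned model can also be used directly; X_test below is an assumed hold-out set:

# grid.best_estimator_ is the whole pipeline refit with the best parameters.
y_pred = grid.best_estimator_.predict(X_test)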

Brute-force search does help somewhat, but a better approach is to look up the hyperparameter values used in a few published papers (in general they will not differ much), pick several of them as the candidates to search over, and compare how the model performs under those settings. After all, those values have already been tuned by someone else.