TextCNN Code Walkthrough (with Test Dataset and GitHub Link)
Published: 2019-06-14


Preface: this is the second post in the TextCNN series, and it walks through the TextCNN code.

The previous two posts can be found here:

I. TextCNN Overall Framework

1. Model architecture

Figure 1: TextCNN model structure

2. Code structure

Figure 2: Code structure

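  • text_cnn.py defines the TextCNN network structure

  • model.py defines the training code

  • data.py defines the data preprocessing

  • data_set holds the test dataset: polarity.neg contains negative-sentiment text and polarity.pos contains positive-sentiment text

  • train-eval.sh is the run script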

 

3. Code location

    Partial code, code for this post

4. Training results

Figure 3: Training results


 

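II. TextCNN Model Code

2.1 Word embedding

Figure 4: Word embedding example

Summary:

vocab_size: vocabulary size, 18758

embedding_dim: word vector size, 128

seq_length: sentence length, capped at 56

embedding_lookup: a table lookup. Each word's id is used to fetch the corresponding row from the randomly initialized matrix W, which gives a tensor of shape [batch_size, seq_length, embedding_size], i.e. [?, 56, 128]. The ? stands for the batch dimension, whose size is not known in advance.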

    # embedding layer
    with tf.name_scope("embedding"):
        self.W = tf.Variable(
            tf.random_uniform([self._config.vocab_size, self._config.embedding_dim], -1.0, 1.0),
            name="W")
        self.char_emb = tf.nn.embedding_lookup(self.W, self.input_x)
        self.char_emb_expanded = tf.expand_dims(self.char_emb, -1)
        tf.logging.info("Shape of embedding_chars:{}".format(str(self.char_emb_expanded.shape)))

 

 

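Example: suppose the vocabulary has three words, "今天" (today), "天气" (weather) and "很好" (very good), with ids 0, 1 and 2, and W = [[0,0,0,1],[0,0,1,0],[0,1,0,0]].

Take two sentences. The first, "今天天气" (today's weather), becomes [0, 1] after preprocessing. embedding_lookup uses id 0 to fetch the first row of W, [0,0,0,1], and id 1 to fetch the second row, [0,0,1,0], so char_emb is [[0,0,0,1],[0,0,1,0]].

Likewise, the second sentence, "天气很好" (the weather is very good), becomes [1, 2] after preprocessing, and embedding_lookup gives char_emb = [[0,0,1,0],[0,1,0,0]].

Because conv2d expects a four-dimensional tensor, char_emb is expanded by one dimension, from [?, 56, 128] to [?, 56, 128, 1].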
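The lookup in the example above can be reproduced without TensorFlow. The following is a minimal sketch in plain Python; the helper names `embedding_lookup` and `expand_last_dim` are illustrative stand-ins for `tf.nn.embedding_lookup` and `tf.expand_dims`, not the library functions themselves, and they handle a single sentence rather than a batch.

```python
# Toy embedding table from the example: 3 words, 4-dimensional vectors.
W = [
    [0, 0, 0, 1],  # id 0
    [0, 0, 1, 0],  # id 1
    [0, 1, 0, 0],  # id 2
]


def embedding_lookup(table, ids):
    """Return the row of `table` for each id in one sentence."""
    return [list(table[i]) for i in ids]


def expand_last_dim(matrix):
    """Wrap every scalar in a list: [seq_len, dim] -> [seq_len, dim, 1]."""
    return [[[value] for value in row] for row in matrix]


sentence_a = embedding_lookup(W, [0, 1])
sentence_b = embedding_lookup(W, [1, 2])
expanded = expand_last_dim(sentence_a)

print(sentence_a)  # [[0, 0, 0, 1], [0, 0, 1, 0]]
print(sentence_b)  # [[0, 0, 1, 0], [0, 1, 0, 0]]
print(len(expanded), len(expanded[0]), len(expanded[0][0]))  # 2 4 1
```

The extra trailing dimension plays the role of the single input channel that the convolution expects.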

 

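2.2 Convolution + max-pooling

Figure 5: Convolution example

Summary:

filter_size = 3, 4, 5. Every filter is as wide as the word vector, so it can only slide along one dimension (down the sentence).

For each filter size, the convolution outputs a tensor of shape [batch_size, seq_length - filter_size + 1, 1, num_filter].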
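To make the one-dimensional sliding concrete, here is a small sketch in plain Python of a single full-width filter applied with VALID padding and stride 1, followed by ReLU and max-over-time pooling. The toy sentence, the filter values and the function name `conv_valid` are made up for illustration; the real model does the same thing per filter with `tf.nn.conv2d`.

```python
def conv_valid(sentence, kernel, bias=0.0):
    """Slide a full-width kernel down the sentence (VALID padding, stride 1), then ReLU.

    sentence: seq_length rows of embedding_dim floats
    kernel:   filter_size rows of embedding_dim floats
    returns:  seq_length - filter_size + 1 activations
    """
    seq_length, filter_size = len(sentence), len(kernel)
    outputs = []
    for start in range(seq_length - filter_size + 1):
        window = sentence[start:start + filter_size]
        total = sum(
            w * x
            for kernel_row, word_vec in zip(kernel, window)
            for w, x in zip(kernel_row, word_vec)
        )
        outputs.append(max(0.0, total + bias))  # ReLU
    return outputs


# Hypothetical toy input: 5 words with 2-dimensional embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
# One hypothetical filter with filter_size = 3 and width = embedding_dim = 2.
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]

feature_map = conv_valid(sentence, kernel)
pooled = max(feature_map)  # max-over-time pooling keeps one value per filter

print(feature_map)  # [2.0, 1.0, 2.0], length 5 - 3 + 1 = 3
print(pooled)       # 2.0
```

Because the filter spans the whole embedding width, there is exactly one output position per window of words, which is where the size-1 third dimension in the output shape comes from.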

 

    # convolution + pooling layer
    pooled_outputs = []
    for i, filter_size in enumerate(self._config.filter_sizes):
        with tf.variable_scope("conv-maxpool-%s" % filter_size):
            # convolution layer
            filter_width = self._config.embedding_dim
            input_channel_num = 1
            output_channel_num = self._config.num_filters
            filter_shape = [filter_size, filter_width, input_channel_num, output_channel_num]

            n = filter_size * filter_width * input_channel_num
            kernal = tf.get_variable(name="kernal",
                                     shape=filter_shape,
                                     dtype=tf.float32,
                                     initializer=tf.random_normal_initializer(stddev=np.sqrt(2.0 / n)))
            bias = tf.get_variable(name="bias",
                                   shape=[output_channel_num],
                                   dtype=tf.float32,
                                   initializer=tf.zeros_initializer())

            # apply convolution process
            # conv shape: [batch_size, max_seq_len - filter_size + 1, 1, output_channel_num]
            conv = tf.nn.conv2d(
                input=self.char_emb_expanded,
                filter=kernal,
                strides=[1, 1, 1, 1],
                padding="VALID",
                name="cov")
            tf.logging.info("Shape of Conv:{}".format(str(conv.shape)))

            # apply non-linearity
            h = tf.nn.relu(tf.nn.bias_add(conv, bias), name="relu")
            tf.logging.info("Shape of h:{}".format(str(h.shape)))

            # max-pooling over the whole sequence dimension
            pooled = tf.nn.max_pool(
                value=h,
                ksize=[1, self._config.max_seq_length - filter_size + 1, 1, 1],
                strides=[1, 1, 1, 1],
                padding="VALID",
                name="pool"
            )
            tf.logging.info("Shape of pooled:{}".format(str(pooled.shape)))
            pooled_outputs.append(pooled)
            tf.logging.info("Shape of pooled_outputs:{}".format(str(np.array(pooled_outputs).shape)))

    # concatenate the outputs of all filters
    total_filter_num = self._config.num_filters * len(self._config.filter_sizes)
    all_features = tf.reshape(tf.concat(pooled_outputs, axis=-1), [-1, total_filter_num])
    tf.logging.info("Shape of all_features:{}".format(str(all_features.shape)))

 

 

Since there are three filter sizes, the convolution produces three tensors:

First tensor, filter_size = 3: [?, 56-3+1, 1, 128] -> [?, 54, 1, 128]

Second tensor, filter_size = 4: [?, 56-4+1, 1, 128] -> [?, 53, 1, 128]

Third tensor, filter_size = 5: [?, 56-5+1, 1, 128] -> [?, 52, 1, 128]

Each one is then max-pooled with ksize=[1, seq_length - filter_size + 1, 1, 1], which gives a tensor of shape [?, 1, 1, num_filter]. After max-pooling:

First tensor: [?, 54, 1, 128] -> [?, 1, 1, 128]

Second tensor: [?, 53, 1, 128] -> [?, 1, 1, 128]

Third tensor: [?, 52, 1, 128] -> [?, 1, 1, 128]

The three results are concatenated into a [?, 1, 1, num_filter*3] tensor, which is then reshaped to [-1, num_filter*3] so that it can feed the fully connected layer that follows:

[?, 1, 1, 128], [?, 1, 1, 128], [?, 1, 1, 128] -> [?, 384]
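The shape bookkeeping above can be checked with a few lines of arithmetic. This is a sketch only: `textcnn_shapes` is a hypothetical helper, and num_filters = 128 is inferred from the shapes quoted in the text rather than read from the project's configuration.

```python
def textcnn_shapes(seq_length, embedding_dim, filter_sizes, num_filters):
    """Trace per-example tensor shapes through conv -> max-pool -> concat + reshape."""
    trace = {"input": (seq_length, embedding_dim, 1)}
    for size in filter_sizes:
        conv_len = seq_length - size + 1  # VALID padding, stride 1
        trace["conv-%d" % size] = (conv_len, 1, num_filters)
        trace["pool-%d" % size] = (1, 1, num_filters)
    trace["features"] = (num_filters * len(filter_sizes),)
    return trace


shapes = textcnn_shapes(seq_length=56, embedding_dim=128,
                        filter_sizes=[3, 4, 5], num_filters=128)
for name, shape in shapes.items():
    # "?" marks the batch dimension, which is unknown until data is fed in.
    print(name, ("?",) + shape)
```

The last line printed is `features ('?', 384)`, matching the [?, 384] tensor that feeds the output layer.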

 

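2.3 k-class classification with softmax

Figure 6: Softmax illustration

Summary:

label_size is the number of text classes, which is 2 here (binary classification). The output layer produces scores, and predictions holds the index of the predicted class in the label dictionary. The loss is the cross-entropy.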
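As a rough illustration of what the scores, predictions and cross-entropy loss mean, here is a plain-Python sketch for a batch of two examples. The score values are invented, the label order follows the preprocessing code in this post ([0, 1] for positive, [1, 0] for negative), and the L2 regularization term that the model adds to the loss is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one example's class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, one_hot_label):
    """Softmax cross-entropy for one example, given raw scores (logits)."""
    probs = softmax(scores)
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))


# Hypothetical scores for two examples, columns ordered [negative, positive].
batch_scores = [[0.2, 1.5], [2.0, -1.0]]
batch_labels = [[0, 1], [1, 0]]

predictions = [row.index(max(row)) for row in batch_scores]  # argmax per example
losses = [cross_entropy(s, y) for s, y in zip(batch_scores, batch_labels)]
mean_loss = sum(losses) / len(losses)  # counterpart of tf.reduce_mean(losses)

print(predictions)  # [1, 0]
print(mean_loss)
```

Both toy examples are classified correctly, so the mean loss is small; swapping the two labels would make it much larger.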

 

    with tf.name_scope("output"):
        W = tf.get_variable(
            name="W",
            shape=[total_filter_num, self._config.label_size],
            initializer=tf.contrib.layers.xavier_initializer())
        b = tf.Variable(tf.constant(0.1, shape=[self._config.label_size]), name="b")
        # l2_loss is defined earlier in the model (not shown in this excerpt)
        l2_loss += tf.nn.l2_loss(W)
        l2_loss += tf.nn.l2_loss(b)
        self.scores = tf.nn.xw_plus_b(all_features, W, b, name="scores")
        self.predictions = tf.argmax(self.scores, 1, name="predictions")

    # compute loss
    with tf.name_scope("loss"):
        losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
        self.loss = tf.reduce_mean(losses) + self._config.l2_reg_lambda * l2_loss

 

 


 

III. TextCNN Training Module

Summary: the data is loaded through the preprocessing module, the optimizer is Adam, and training uses a batch size of 64.

    import datetime
    import os

    import tensorflow as tf

    import data  # the project's data.py


    def train(x_train, y_train, vocab_processor, x_dev, y_dev, model_config):
        with tf.Graph().as_default():
            sess = tf.Session()
            with sess.as_default():
                cnn = TextCNNModel(
                    config=model_config,
                    is_training=FLAGS.is_train
                )

                # Define the training procedure
                global_step = tf.Variable(0, name="global_step", trainable=False)
                optimizer = tf.train.AdamOptimizer(1e-3)
                grads_and_vars = optimizer.compute_gradients(cnn.loss)
                train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

                # Checkpoint directory. TensorFlow assumes this directory already exists, so create it
                checkpoint_dir = os.path.abspath(os.path.join(FLAGS.output_dir, "checkpoints"))
                checkpoint_prefix = os.path.join(checkpoint_dir, "model")
                if not os.path.exists(checkpoint_dir):
                    os.makedirs(checkpoint_dir)
                saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.keep_checkpoint_max)

                # Write vocabulary
                vocab_processor.save(os.path.join(FLAGS.output_dir, "vocab"))

                # Initialize all variables
                sess.run(tf.global_variables_initializer())

                def train_step(x_batch, y_batch):
                    """A single training step."""
                    feed_dict = {
                        cnn.input_x: x_batch,
                        cnn.input_y: y_batch
                    }
                    _, step, loss, accuracy = sess.run(
                        [train_op, global_step, cnn.loss, cnn.accuracy],
                        feed_dict)
                    time_str = datetime.datetime.now().isoformat()
                    tf.logging.info("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))

                def dev_step(x_batch, y_batch, writer=None):
                    """Evaluates the model on a dev set."""
                    feed_dict = {
                        cnn.input_x: x_batch,
                        cnn.input_y: y_batch
                    }
                    step, loss, accuracy = sess.run(
                        [global_step, cnn.loss, cnn.accuracy],
                        feed_dict)
                    time_str = datetime.datetime.now().isoformat()
                    tf.logging.info("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))

                # Generate batches
                batches = data.DataSet.batch_iter(list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)

                # Training loop: for each batch, train, then periodically evaluate and checkpoint
                for batch in batches:
                    x_batch, y_batch = zip(*batch)
                    train_step(x_batch, y_batch)
                    current_step = tf.train.global_step(sess, global_step)
                    if current_step % FLAGS.save_checkpoints_steps == 0:
                        tf.logging.info("\nEvaluation:")
                        dev_step(x_dev, y_dev)
                        path = saver.save(sess, checkpoint_prefix, global_step=current_step)
                        tf.logging.info("Saved model checkpoint to {}\n".format(path))

 


 

IV. TextCNN Data Preprocessing

Summary: processes the input data.

    import re

    import numpy as np


    class DataSet(object):
        def __init__(self, positive_data_file, negative_data_file):
            self.x_text, self.y = self.load_data_and_labels(positive_data_file, negative_data_file)

        def load_data_and_labels(self, positive_data_file, negative_data_file):
            # load data from files
            with open(positive_data_file, "r", encoding='utf-8') as f:
                positive_data = [s.strip() for s in f.readlines()]
            with open(negative_data_file, "r", encoding='utf-8') as f:
                negative_data = [s.strip() for s in f.readlines()]

            # clean every sentence
            x_text = positive_data + negative_data
            x_text = [self.clean_str(sent) for sent in x_text]

            # generate one-hot labels: [0, 1] = positive, [1, 0] = negative
            positive_labels = [[0, 1] for _ in positive_data]
            negative_labels = [[1, 0] for _ in negative_data]
            y = np.concatenate([positive_labels, negative_labels], 0)
            return [x_text, y]

        def clean_str(self, string):
            """
            Tokenization/string cleaning for all datasets except for SST.
            Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
            """
            string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
            string = re.sub(r"\'s", " 's", string)
            string = re.sub(r"\'ve", " 've", string)
            string = re.sub(r"n\'t", " n't", string)
            string = re.sub(r"\'re", " 're", string)
            string = re.sub(r"\'d", " 'd", string)
            string = re.sub(r"\'ll", " 'll", string)
            string = re.sub(r",", " , ", string)
            string = re.sub(r"!", " ! ", string)
            string = re.sub(r"\(", r" \( ", string)
            string = re.sub(r"\)", r" \) ", string)
            string = re.sub(r"\?", r" \? ", string)
            string = re.sub(r"\s{2,}", " ", string)
            return string.strip().lower()

        @staticmethod
        def batch_iter(data, batch_size, num_epochs, shuffle=True):
            """
            Generates a batch iterator for a dataset.
            """
            # dtype=object keeps each (x, y) pair intact even though x and y differ in length
            data = np.array(data, dtype=object)
            data_size = len(data)
            num_batches_per_epoch = int((len(data) - 1) / batch_size) + 1
            for epoch in range(num_epochs):
                # Shuffle the data at each epoch
                if shuffle:
                    shuffle_indices = np.random.permutation(np.arange(data_size))
                    shuffled_data = data[shuffle_indices]
                else:
                    shuffled_data = data
                for batch_num in range(num_batches_per_epoch):
                    start_index = batch_num * batch_size
                    end_index = min((batch_num + 1) * batch_size, data_size)
                    yield shuffled_data[start_index:end_index]
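The batching logic is easier to see on a tiny dataset. The sketch below re-implements the same idea with only the standard library; it is not the project's `batch_iter`, and the `seed` parameter and the toy `pairs` list are additions made purely so the example is reproducible.

```python
import random


def batch_iter(data, batch_size, num_epochs, shuffle=True, seed=None):
    """Yield mini-batches (lists) that cover `data` once per epoch."""
    rng = random.Random(seed)
    data = list(data)
    data_size = len(data)
    num_batches_per_epoch = (data_size - 1) // batch_size + 1  # ceiling division
    for _ in range(num_epochs):
        indices = list(range(data_size))
        if shuffle:
            rng.shuffle(indices)  # new order every epoch
        for batch_num in range(num_batches_per_epoch):
            start = batch_num * batch_size
            end = min(start + batch_size, data_size)
            yield [data[i] for i in indices[start:end]]


# Hypothetical toy dataset: 10 (x, y) pairs, batch size 4, 2 epochs.
pairs = [(i, i % 2) for i in range(10)]
batches = list(batch_iter(pairs, batch_size=4, num_epochs=2, seed=0))

print(len(batches))               # 6 batches: 3 per epoch
print([len(b) for b in batches])  # [4, 4, 2, 4, 4, 2]

# Unzip one batch into inputs and labels, as the training loop does.
x_batch, y_batch = zip(*batches[0])
```

The last batch of each epoch is shorter whenever the dataset size is not a multiple of the batch size, which is why the model's batch dimension is left as ?.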

 

 


 

V. Model Training

Summary: change CODE_DIR, then run train-eval.sh.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=0

    # To run this, change CODE_DIR to your own directory
    CODE_DIR="/home/work/work/modifyAI/textCNN"
    MODEL_DIR=$CODE_DIR/model
    TRAIN_DATA_DIR=$CODE_DIR/data_set

    nohup python3 $CODE_DIR/model.py \
        --is_train=true \
        --num_epochs=200 \
        --save_checkpoints_steps=100 \
        --keep_checkpoint_max=50 \
        --batch_size=64 \
        --positive_data_file=$TRAIN_DATA_DIR/polarity.pos \
        --negative_data_file=$TRAIN_DATA_DIR/polarity.neg \
        --model_dir=$MODEL_DIR > $CODE_DIR/train_log.txt 2>&1 &

 

 


 

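VI. Summary

  • Introduced the basic TextCNN architecture, the code structure, the project location and the training results

  • Explained in detail how TextCNN is implemented in TensorFlow

  • Covered the TextCNN training code and the data preprocessing module

  • Explained how to run the project

  • The next post will cover how to tune the TextCNN model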

 

  

 

 

Reposted from: https://www.cnblogs.com/ModifyRong/p/11442595.html
