Text Summarization


Preface

This article takes a practical angle on how to generate summaries from text. Honestly, even with zero understanding you can get results just by calling a library, but then you depend entirely on the third-party implementation and cannot tune anything yourself. As the saying goes, "What I cannot create, I do not understand", so it is worth learning what happens underneath. Limited by my own level, this article stays fairly shallow.

Survey Of Text Summarization

There are two main approaches (this article focuses on the extractive kind): abstractive and extractive. Extractive summarization directly takes the important sentences of the original text (judging importance is up to you) as the summary. Abstractive summarization is more ambitious: it generates a summary through paraphrasing, substitution, abbreviation, and similar techniques. Extractive methods are more mature, but some problems remain: extracted sentences can be far longer than the average sentence; sentences that depend on context lose the relevant information once pulled out; and argumentative content (opposing viewpoints) is hard to extract well. Methods used for extractive text summarization include:

  • TF-IDF
  • Cluster Based Model
  • Graph theoretic approach
  • Machine Learning approach
  • LSA Method
  • An approach to concept-obtained text summarization
  • Neural networks
  • Automatic text summarization based on fuzzy logic
  • Text summarization using regression for estimating feature weights
  • Multi-document extractive summarization
  • Query based extractive text summarization
  • Multilingual Extractive Text summarization

In addition, PageRank and TextRank can also be applied here. A minimal sketch of the extractive idea is given below.
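
To ground the list above, here is a bare-bones sketch of the extractive idea (all the names are mine, not from any library): score each sentence by the average frequency of its words, then keep the top-n sentences in their original order.

# Minimal extractive summarizer sketch: frequency-based sentence scoring.
from collections import Counter

def extract_summary(sentences, n=2):
    words = [w for s in sentences for w in s.lower().split()]
    freq = Counter(words)
    # average word frequency per sentence, so long sentences are not
    # automatically favored over short ones
    scores = [sum(freq[w] for w in s.lower().split()) / max(len(s.split()), 1)
              for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # restore document order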

NLP Basics

Let me first throw out a pile of NLP terms: word segmentation, POS tagging, training, keyword extraction, named entity recognition, text classification. These are all fairly basic, and each has quite a few possible implementations; word segmentation alone has many: N-gram, CRF-based analysis, and segmentation with a custom dictionary. Most libraries have already implemented these for you; a glance at the HanLP README makes that clear. Below I introduce TF-IDF and N-gram.

TF-IDF

Term frequency-inverse document frequency is actually easy to understand. What is inverse document frequency? Words that occur in few documents get a large weight, and words that occur in many documents get a small weight. First compute the term frequency, then the inverse document frequency, and then TF-IDF = TF * IDF, where IDF = log(total number of documents in the corpus / (number of documents containing the term + 1)). In Python's sklearn library it is even simpler:

# https://stackoverflow.com/questions/34449127/sklearn-tfidf-transformer-how-to-get-tf-idf-values-of-given-words-in-documen
from sklearn.feature_extraction.text import TfidfVectorizer

# input='filename' means `corpus` must be a list of file paths to read from
tf = TfidfVectorizer(input='filename', analyzer='word', ngram_range=(1, 6),
                     min_df=0, stop_words='english', sublinear_tf=True)
tfidf_matrix = tf.fit_transform(corpus)
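
To make the formula concrete, here is a hand-rolled version on a toy corpus (my own sketch; natural log assumed, and note that sklearn's own smoothing differs slightly from the formula above):

import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)         # term frequency
    df = sum(1 for d in docs if term in d)  # documents containing the term
    idf = math.log(len(docs) / (df + 1))    # IDF formula from the text above
    return tf * idf

print(tfidf("dog", docs[1], docs))  # df=1 -> log(3/2) > 0: rare word, higher weight
print(tfidf("the", docs[1], docs))  # df=3 -> log(3/4) < 0: ubiquitous word, downweighted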

N-GRAM

A picture is worth a thousand words; the example figure on Wikipedia makes it clear at once. [figure: n-gram-example]
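
In place of the figure, a quick sketch of what n-grams are (my own toy example):

def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]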

Word2vec, Doc2Vec, Sentence2Vec

Word Embedding

For this one, take the explanation from Zhihu:

A word embedding is a mapping, or function, that produces a representation of each word in a new space; that representation is the word representation. The idea behind the name: words from their original space X are mapped to multidimensional vectors in a space Y, so each word is embedded into Y, one radish per hole, as the saying goes. This is a fairly important topic that cannot be covered in a sentence or two; see the reference links for a detailed treatment.

to vec

  • word2vec uses two training methods: skip-gram and CBOW. [figure: cbow_vs_skipgram] The difference between the two (see the gensim sketch after this list):

The skipgram model learns to predict a target word thanks to a nearby word. On the other hand, the cbow model predicts the target word according to its context.

  • sentence2vec
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # cut_sentence is assumed to segment a Chinese sentence and return its
    # tokens joined by spaces, since CountVectorizer expects strings
    def sentence2vec(sentences):
      segmented = [cut_sentence(s) for s in sentences]
      vectorizer = CountVectorizer()
      transformer = TfidfTransformer()
      tfidf = transformer.fit_transform(vectorizer.fit_transform(segmented))
      return tfidf.toarray()  # one TF-IDF weight vector per sentence
    
  • Also Doc2vec….
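
A minimal sketch of both word2vec training modes with gensim (toy corpus; parameter names follow gensim 4.x, where 3.x used size instead of vector_size):

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]

# sg=1 -> skip-gram, sg=0 -> CBOW
skipgram = Word2Vec(sentences, sg=1, vector_size=50, window=2, min_count=1)
cbow = Word2Vec(sentences, sg=0, vector_size=50, window=2, min_count=1)

print(skipgram.wv["cat"].shape)          # (50,)
print(cbow.wv.similarity("cat", "dog"))  # cosine similarity of two word vectors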

PageRank And TextRank

PageRank is famous; I first came across it when writing crawlers, but applying it to text summarization was new to me.

Cosine similarity Or NN

Cosine similarity: once words, sentences, or documents have been turned into vectors, you can naturally measure their similarity as the cosine of the angle between the vectors. While using fastText I also found that kNN can be used to find similar items.
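
As a quick sketch (plain numpy, my own helper):

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707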

In Action: TextRank

Since the task this time is Chinese text summarization, the TextRank implementation used is the TextRank4ZH library. TextRank was only tested in passing; the main method used is PageRank. I have not found the specific paper yet, so the link will be added later. (learn-note)

# coding: utf-8
# textrank
import json
import re
import os
import codecs
from textrank4zh import TextRank4Keyword, TextRank4Sentence

trainfilepath = "./train_with_summ.txt"

def fenju2file(trainfilepath, outputdir):
    """Split each article into sentences, one output file per article.

    Please make sure the destination directory exists.
    """
    count = 1

    with open(trainfilepath, encoding="utf-8") as f:
        for text in f.readlines():
            article = json.loads(text)['article']
            outputfilepath = outputdir + str(count)

            with open(outputfilepath, 'w', encoding="utf-8") as o:
                # keep runs starting with a CJK character and ending at 。!?
                for sentence in re.findall('[\u4e00-\u9fa5].*?[。!?]', article):
                    o.writelines(sentence + '\n')
            count = count + 1

    print("PreProcess Done!\n")

def summary(filename):
    text = codecs.open(filename, 'r', 'utf-8').read()

    tr4w = TextRank4Keyword()
    tr4w.analyze(text=text, lower=True, window=2)  # keywords, not printed here

    tr4s = TextRank4Sentence()
    tr4s.analyze(text=text, lower=True, source='all_filters')
    for item in tr4s.get_key_sentences(num=1):
        print("TextRank Summarization Is: ", item.sentence)

count = 1
with open(trainfilepath, encoding="utf-8") as f:
    for text in f.readlines():
        summarization = json.loads(text)['summarization']
        article = json.loads(text)['article']
        tmpfilepath = './output/' + str(count)
        with open(tmpfilepath, 'w', encoding="utf-8") as o:
            for sentence in re.findall('[\u4e00-\u9fa5].*?[。!?]', article):
                o.writelines(sentence + '\n')

        print("Origin Summarization : ", summarization)
        # summarize the file just written; increment count only afterwards,
        # otherwise we would try to read the next, not-yet-written file
        summary(os.path.join('./output/', str(count)))
        count = count + 1
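
For intuition about what happens under the hood, here is a bare-bones TextRank-style ranker (my own sketch, not TextRank4ZH's implementation): build a cosine-similarity graph over sentence vectors, obtained with any of the methods above, and run PageRank power iteration on it.

import numpy as np

def textrank_sentences(senvecs, d=0.85, iters=50):
    """Rank sentences by PageRank over a sentence-similarity graph.

    senvecs: array of shape (n_sentences, dim). Returns indices, best first.
    """
    n = len(senvecs)
    unit = senvecs / np.linalg.norm(senvecs, axis=1, keepdims=True)
    sim = np.clip(unit @ unit.T, 0.0, None)  # cosine similarities, negatives dropped
    np.fill_diagonal(sim, 0.0)               # no self-loops
    # row-normalize into transition probabilities
    trans = sim / np.maximum(sim.sum(axis=1, keepdims=True), 1e-12)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):                   # PageRank power iteration
        scores = (1 - d) / n + d * (trans.T @ scores)
    return np.argsort(-scores)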

In Action: FastText

FastText Basic Usage

Installation is simple: just run make. Nearly everything available on the command line also has a Python interface.

  • Train a model (supports skipgram or cbow; unsupervised learning of word representations)
    fasttext skipgram -input trainingText -output trainedModel
    fasttext cbow -input trainingText -output trainedModel
    
import fasttext
# the same from Python (old pip `fasttext` wrapper API)
model = fasttext.skipgram('data.txt', 'model')
model = fasttext.cbow('data.txt', 'model')
  • Print vectors (word vectors or sentence vectors)
    fasttext print-word-vectors trainedModel.bin < yourFile
    fasttext print-sentence-vectors trainedModel.bin < yourFile
    
  • Text classification (supervised learning; a Python sketch follows the sample data below)
    fasttext supervised -input train.txt -output model  # train.txt contains one training sentence per line, along with its labels
    fasttext supervised -input train.txt -output model -label '__label__'  # custom label prefix
    fasttext test model.bin test.txt k  # evaluate at top-k labels
    fasttext predict model.bin test.txt k
    fasttext predict-prob model.bin test.txt k  # probability for each label
    
$ head amazon_review_polarity.train
__label__2 , black lawn mower cover , been searching for ever to get suitable cover and this is just perfect . did try making my own cover but was not successful . 
__label__2 , much better than expected , i have to admit i had very low expectations for this product . i couldn ' t really imagine a product ( short of a needle and thread ) that would fix a piece of fabric for under $10 . but , this kit did a pretty good job . it basically consists of some glue-type stuff that you spread in the hole and then you pour fabric shavings on top to blend it in . the kit comes with a bunch of different colors that you can mix to match the color of your fabric this being the trickiest part of the process . i happen to have dark charcoal colored seats , so the mix was pretty easy black and a little white . the finished product looks pretty good , not perfect , but really good for a $10 fix . 
__label__1 , don ' t blame lucasfilm . . , it bothers me how many star wars fans bash george lucas and lucasfilm for continually releasing the star wars movies in ' new ' editions . star wars is a franchise , and the films are a product . if you are stupid enough to buy these films over and over again , then do not complain if they try to sell you the same films every 2 years . you are creating demand for an old product . 
__label__1 , miata mx5 covercraft cover , the quality was fine however , it did not fit the seats as stated . it bulged in areas and unable to stretch enough to reach around the lower part of the seat . 
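
The same supervised workflow can be driven from Python; a sketch assuming the official fastText Python binding installed from source (see the next section), with file names carried over from the commands above:

from fastText import train_supervised

# train.txt: one example per line, labels prefixed with __label__
model = train_supervised(input="train.txt", label="__label__", epoch=5)
model.save_model("model.bin")

labels, probs = model.predict("black lawn mower cover , fits perfectly", k=2)
print(labels, probs)  # top-2 predicted labels and their probabilities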

FastText Python Binding

Don't pip install fasttext (I installed it that way out of habit; note it also requires Cython first, via pip install Cython). Moreover, the pip-installed package cannot load the model files (model.bin) trained by the fastText binary. Install from source instead: run pip install . inside the fastText directory (only afterwards did I notice the official site has a guide).


import numpy as np
from fastText import load_model

def sen2vec_by_fasttext(sentences, model=None):
    """
    Args:
        sentences: A list of sentences from a document
        model: A model pre-trained with fastText (loaded lazily if not given)
    """
    if model is None:
        # load at call time, not at definition time as a default argument would
        model = load_model('./oh_no.bin')
    senvecs = [model.get_sentence_vector(sentence) for sentence in sentences]
    return np.array(senvecs)

ROUGE And Automatic Evaluation of Summaries

ROUGE is a method for evaluating automatic summaries; the evaluation procedure and underlying theory are not covered here (see the linked paper). It is written in Perl; apart from the installation being painful, it is fine.

Install the XML dependency and point ROUGE at its data directory:

cpan install XML::DOM
export ROUGE_EVAL_HOME=/usr/local/ROUGE-1.5.4/data

After installation, run the test file, then install the Python binding with pip install pyrouge. Write the generated summaries and the reference summaries to the corresponding designated files, then write the evaluation code:

# coding:utf-8
from pyrouge import Rouge155

r = Rouge155('/home/angela/ROUGE')
r.system_dir = '../docs/system'  # generated summaries
r.model_dir = '../docs/gold'     # reference ("gold") summaries
r.system_filename_pattern = r'system.(\d+).txt'
r.model_filename_pattern = 'gold.[A-Z].#ID#.txt'

output = r.convert_and_evaluate()
print(output)
output_dict = r.output_to_dict(output)

Other:

  • Show your GPU memory info

    nvidia-smi -l 1  # refresh the GPU status display once per second

  • gensim is very handy

  • How to implement a project from a paper

References

