ASR/CV/ML: Demonstrating naive Bayes classification concepts with Python

This article draws on:
https://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html
https://cuijiahua.com/blog/2017/11/ml_4_bayes_1.html
https://www.zhihu.com/question/27462939

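1. Conditional probability

[Example 1] Six ordinary cases:
Symptom     Disease
---------------------
Sneezing    Cold
Sneezing    Allergy
Headache    Concussion
Headache    Cold
Sneezing    Cold
Headache    Concussion
=====================
Prediction: a new patient is sneezing. What is the probability that this patient has a cold? That is, estimate P(cold|sneezing).
▶ Among all samples, cold appears in 3 of 6. The sample space for cold is the whole sample set, so P(cold) = P(cold|all samples) = 3/6.
▶ Among all samples, sneezing appears in 3 of 6. The sample space for sneezing is also the whole sample set, so P(sneezing) = P(sneezing|all samples) = 3/6.
▶ Since the question is how likely a cold is given sneezing, note that 2 of the 3 cold cases involve sneezing, a ratio of 2/3. So P(sneezing|cold) = 2/3.
● P(cold|sneezing) = P(sneezing|cold) x P(cold) / P(sneezing)
                   = 2/3 x 3/6 / (3/6)
                   ≈ 0.67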
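
The arithmetic in Example 1 can be checked by counting over the six records. The following is a minimal sketch; the variable names (`records`, `direct`, and so on) are illustrative and do not come from the original article.

```python
from fractions import Fraction

# The six (symptom, disease) records from Example 1.
records = [
    ("sneezing", "cold"),
    ("sneezing", "allergy"),
    ("headache", "concussion"),
    ("headache", "cold"),
    ("sneezing", "cold"),
    ("headache", "concussion"),
]

total = len(records)
p_cold = Fraction(sum(1 for _, d in records if d == "cold"), total)        # 3/6
p_sneeze = Fraction(sum(1 for s, _ in records if s == "sneezing"), total)  # 3/6

# P(sneezing|cold): restrict the sample space to the cold cases.
cold_cases = [s for s, d in records if d == "cold"]
p_sneeze_given_cold = Fraction(cold_cases.count("sneezing"), len(cold_cases))  # 2/3

# Bayes' rule: P(cold|sneezing) = P(sneezing|cold) * P(cold) / P(sneezing)
p_cold_given_sneeze = p_sneeze_given_cold * p_cold / p_sneeze

# Cross-check: count cold cases directly among the sneezing patients.
sneeze_cases = [d for s, d in records if s == "sneezing"]
direct = Fraction(sneeze_cases.count("cold"), len(sneeze_cases))

print(p_cold_given_sneeze, float(p_cold_given_sneeze))
```

Both routes give 2/3, which matches the ≈ 0.67 obtained by hand.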

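[Example 2] Six cases that also record occupation:
Symptom     Occupation            Disease
-----------------------------------------
Sneezing    Nurse                 Cold
Sneezing    Farmer                Allergy
Headache    Construction worker   Concussion
Headache    Construction worker   Cold
Sneezing    Teacher               Cold
Headache    Teacher               Concussion
=========================================
Prediction: a sneezing construction worker arrives. What is the probability that this patient has a cold? That is, estimate P(cold|sneezing x construction worker).
▶ In the full set, cold appears in 3 of 6, so P(cold) = P(cold|all samples) = 3/6.
▶ In the full set, sneezing appears in 3 of 6, so P(sneezing) = P(sneezing|all samples) = 3/6.
▶ In the full set, construction worker appears in 2 of 6, so P(construction worker) = P(construction worker|all samples) = 2/6.
▶ Among those with a cold, sneezing appears in 2 of 3, so P(sneezing|cold) = 2/3.
▶ Among those with a cold, construction worker appears in 1 of 3, so P(construction worker|cold) = 1/3.
● P(cold|sneezing x construction worker) = P(sneezing x construction worker|cold) x P(cold) / P(sneezing x construction worker)
Symptom and occupation have no causal relationship with each other; they are unrelated features of equal standing for predicting the disease, so the features are usually treated as independent. The conditional probability formula then becomes:
  P(cold|sneezing x construction worker) = P(sneezing|cold) x P(construction worker|cold) x P(cold) / (P(sneezing) x P(construction worker))
                                         = 2/3 x 1/3 x 3/6 / (3/6 x 2/6)
                                         ≈ 0.67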
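
The independence-based formula in Example 2 can be reproduced by counting. This is a sketch under the same assumption the text makes (symptom and occupation are treated as independent); the helper `prob` and the predicate functions are hypothetical names introduced only for illustration.

```python
from fractions import Fraction

# The six (symptom, occupation, disease) records from Example 2.
records = [
    ("sneezing", "nurse", "cold"),
    ("sneezing", "farmer", "allergy"),
    ("headache", "construction worker", "concussion"),
    ("headache", "construction worker", "cold"),
    ("sneezing", "teacher", "cold"),
    ("headache", "teacher", "concussion"),
]

def prob(match, given=None):
    """Share of records satisfying `match` among those satisfying `given`."""
    pool = [r for r in records if given is None or given(r)]
    return Fraction(sum(1 for r in pool if match(r)), len(pool))

def has_cold(r):
    return r[2] == "cold"

def is_sneezing(r):
    return r[0] == "sneezing"

def is_builder(r):
    return r[1] == "construction worker"

# Numerator: P(sneezing|cold) * P(construction worker|cold) * P(cold)
numerator = prob(is_sneezing, has_cold) * prob(is_builder, has_cold) * prob(has_cold)
# Denominator: P(sneezing) * P(construction worker), features treated as independent
denominator = prob(is_sneezing) * prob(is_builder)

p_cold_given_both = numerator / denominator
print(p_cold_given_both, float(p_cold_given_both))
```

Note that the whole product P(sneezing) x P(construction worker) sits in the denominator, which is why the parentheses matter when evaluating the expression by hand.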

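2. Total probability

The total probability of an event (law of total probability) is the sum, over every cause that can lead to the event, of [the probability that the event occurs through that cause].

In the figure, B1, B2, and B3 are known, mutually exclusive events that together make up the whole sample space; A is the unknown event to be predicted. Within the whole sample space,
the total probability of A is P(A) = P(A|all samples) = P(A∩B1) + P(A∩B2) + P(A∩B3)
If the sample space consists of n known events (B1, B2, ... Bn), the total probability of the unknown event A is written as a sum: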

       n
P(A) = ∑ P(A∩Bi)
      i=1
 
       n
P(A) = ∑ P(A|Bi) * P(Bi)
      i=1

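Returning to the figure,
the total probability formula for A is P(A) = P(A|B1) * P(B1) + P(A|B2) * P(B2) + P(A|B3) * P(B3).
The total probability often serves as the "known event" from which one infers, in reverse, the probability of the "cause event" that led to it.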

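[Example 3] The cases from Example 1 again:
Symptom     Disease
---------------------
Sneezing    Cold
Sneezing    Allergy
Headache    Concussion
Headache    Cold
Sneezing    Cold
Headache    Concussion
=====================
Let the symptom sneezing be B1 and headache be B2; let the disease cold be A.
▶ In the full set, sneezing appears in 3 of 6, so P(B1) = P(B1|all samples) = 3/6.
▶ In the full set, headache appears in 3 of 6, so P(B2) = P(B2|all samples) = 3/6.
▶ Of the 3 sneezing patients, 2 have a cold, so P(A|B1) = 2/3.
▶ Of the 3 headache patients, 1 has a cold, so P(A|B2) = 1/3.
Now a patient is known to have a cold; estimate the conditional probability that this patient is sneezing.
By the total probability formula, the total probability of a cold is P(A) = P(A|B1) * P(B1) + P(A|B2) * P(B2),
so P(A) = 2/3 * 3/6 + 1/3 * 3/6
        = 0.5
Next, given that the disease is a cold, find the probability of sneezing, i.e. P(B1|A).
From the conditional probability formula P(B1|A) = P(B1) * P(A|B1) / P(A), substituting the total probability P(A) of a cold,
we get P(B1|A) = 3/6 * 2/3 / 0.5
               ≈ 0.67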
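
The two steps of Example 3 (total probability first, then the reverse inference) can be written as a short sketch. The dictionaries `p_symptom` and `p_cold_given` are illustrative names that simply hold the four values read off above.

```python
from fractions import Fraction

# P(Bi) for each symptom and P(A|Bi), the chance of a cold given that symptom.
p_symptom = {"sneezing": Fraction(3, 6), "headache": Fraction(3, 6)}
p_cold_given = {"sneezing": Fraction(2, 3), "headache": Fraction(1, 3)}

# Law of total probability: P(A) = sum over i of P(A|Bi) * P(Bi)
p_cold = sum(p_cold_given[s] * p_symptom[s] for s in p_symptom)

# Reverse inference: P(B1|A) = P(B1) * P(A|B1) / P(A)
p_sneeze_given_cold = p_symptom["sneezing"] * p_cold_given["sneezing"] / p_cold

print(p_cold, p_sneeze_given_cold)
```

The result agrees with the direct count in Example 1, where 2 of the 3 cold cases were sneezing.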

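3. Bayesian inference

The conditional probability formula P(A|B) = P(B|A) * P(A) / P(B)
can be rearranged as P(A|B) = P(A) * ( P(B|A) / P(B) ).
▷ P(A|B)       is the "posterior probability": our reassessment of the probability of A after event B has occurred.
▷ P(A)         is the "prior probability": our judgment of the probability of A before event B occurs.
▷ P(B|A)/P(B)  is called the "likelihood function" (Likelihood) here; strictly, P(B|A) alone is the likelihood and this ratio is the likelihood normalized by P(B). It acts as an "adjustment factor" that brings the estimated probability closer to the true probability.
The conditional probability can therefore be read as:
posterior probability = prior probability x adjustment factor
If the "adjustment factor" > 1, the "prior probability" is strengthened and event A becomes more likely;
if the "adjustment factor" < 1, the "prior probability" is weakened and event A becomes less likely;
if the "adjustment factor" = 1, event B does not help in judging the likelihood of A; the two are unrelated, independent events.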

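[Example 4] Two boxes: the red box holds 30 red keys + 10 green keys, and the blue box holds 20 red keys + 20 green keys.
Lao Wang has drawn a red key. Question: what is the probability that his red key came from the red box?

▶ Let the red box be H1 and the blue box be H2. Then P(H1) = P(H2) = 1/2, which is the "prior probability" of H1.
▶ Let E be the red key Lao Wang drew. The problem is then to find P(H1|E), the "posterior probability" of H1, because H1 has to be revised in light of E.
   P(E|H1) is the probability of drawing a red key from the red box = 30/(30+10),
   P(E|H2) is the probability of drawing a red key from the blue box = 20/(20+20),
1) By the total probability formula:
P(E) = P(E|H1)*P(H1) + P(E|H2)*P(H2)
     = 30/(30+10) * 1/2 + 20/(20+20) * 1/2
     = 0.625
2) By the conditional probability formula:
P(H1|E) = P(H1) * ( P(E|H1) / P(E) ), substituting P(E),
        = 1/2 * (30/(30+10) / 0.625)
        = 0.6
The result shows that the probability that Lao Wang's red key came from the red box is 0.6. Because the adjustment factor (30/(30+10) / 0.625) = 1.2 is greater than 1, the prior of 0.5 has been strengthened.
If the goal is only to decide whether the red box or the blue box is more likely, note that P(H1|E) and P(H2|E) share the same denominator P(E), so computing the total probability is unnecessary:
P(H1|E) = P(H1) * P(E|H1) / P(E),
P(H2|E) = P(H2) * P(E|H2) / P(E),
In the end only P(H1) * P(E|H1) and P(H2) * P(E|H2) need to be compared.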
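
Example 4 generalizes to any number of hypotheses. The sketch below is one possible way to write it; the function name `posterior` and the dictionary keys are assumptions made for illustration, not something the original article defines.

```python
def posterior(priors, likelihoods):
    """Return P(H|E) for every hypothesis H, given P(H) and P(E|H)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}  # P(H) * P(E|H)
    evidence = sum(joint.values())                            # P(E), the total probability
    return {h: joint[h] / evidence for h in joint}

priors = {"red box": 0.5, "blue box": 0.5}
likelihoods = {
    "red box": 30 / (30 + 10),   # P(red key|red box)
    "blue box": 20 / (20 + 20),  # P(red key|blue box)
}

post = posterior(priors, likelihoods)

# Ranking the hypotheses does not need P(E): compare P(H) * P(E|H) only.
best = max(priors, key=lambda h: priors[h] * likelihoods[h])

print(post, best)
```

Dividing by P(E) only rescales every candidate by the same amount, so the ranking from `best` always agrees with the ranking of the full posteriors.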

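4. Naive Bayesian inference

"Naive" refers to the assumption made in [Example 2]:
symptom and occupation have no causal relationship with each other; they are unrelated features of equal standing for predicting the disease, so the features are usually treated as independent.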
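
To make the "naive" assumption concrete, here is a minimal sketch of a classifier that multiplies the per-feature conditional probabilities for every class and picks the largest score. It reuses the Example 2 records; `naive_bayes_scores` is a hypothetical helper name, and no smoothing is applied, so a feature value never seen with a class drives that class's score to zero.

```python
from collections import Counter

# ((symptom, occupation), disease) records from Example 2.
records = [
    (("sneezing", "nurse"), "cold"),
    (("sneezing", "farmer"), "allergy"),
    (("headache", "construction worker"), "concussion"),
    (("headache", "construction worker"), "cold"),
    (("sneezing", "teacher"), "cold"),
    (("headache", "teacher"), "concussion"),
]

def naive_bayes_scores(records, query):
    """Unnormalized score P(class) * product of P(feature_j|class) per class."""
    class_counts = Counter(label for _, label in records)
    total = len(records)
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for j, value in enumerate(query):
            matches = sum(
                1 for feats, lab in records if lab == label and feats[j] == value
            )
            score *= matches / n  # P(feature_j = value|class), treated as independent
        scores[label] = score
    return scores

scores = naive_bayes_scores(records, ("sneezing", "construction worker"))
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Because only the ranking matters, the shared denominator is left out, in line with the remark at the end of Example 4.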

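5. Offensive-word detection example

First we need a sufficient set of sample words that have already been labeled as abusive or not. The natural-language words must also be converted into numeric vectors that the algorithm can use.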

import numpy as np

def getArticleVocabMat():  # tokenized sample articles
    articleMat = [  # 6 articles
        ['my',    'dog',       'has',       'flea',      'problems', 'help', 'please'],
        ['maybe', 'not',       'take',      'him',       'to',       'dog',  'park', 'stupid'],
        ['my',    'dalmation', 'is',        'so',        'cute',     'I',    'love', 'him'],
        ['stop',  'posting',   'stupid',    'worthless', 'garbage'],
        ['mr',    'licks',     'ate',       'my',        'steak',    'how',  'to',   'stop',  'him'],
        ['quit',  'buying',    'worthless', 'dog',       'food',     'stupid']]
    labelList = [0, 1, 0, 1, 0, 1]  # labels for the 6 articles: 1 = abusive, 0 = not abusive
    return articleMat, labelList


def getVocabList(normalMat):  # collect every word that appears in any article
    vocabSet = set()
    for normalList in normalMat:
        vocabSet = vocabSet | set(normalList)  # set union
    vocabList = list(vocabSet)
    return vocabList  # 32 words for the sample data; the order can differ between runs


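def word2Vector(vocabList, wordList):  # convert an article's words to a vector
    # vocabList: vocabulary built from the 6 articles (32 words)
    # wordList:  the tokens of one article
    vectorList = [0] * len(vocabList)  # one slot per vocabulary word, so length 32
    for word in wordList:      # visit each token
        if word in vocabList:  # if the token is in the vocabulary, set its slot to 1
            idx = vocabList.index(word)
            vectorList[idx] = 1
    # Each article occupies 32 slots even though it has only 5-9 tokens; slots of words that occur are 1.
    # Slots are positioned by the vocabulary, so the same index means the same word in all 6 article vectors.
    return vectorList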


def ready():  # build the sample data
    # 1. Load the samples
    articleMat, labelList = getArticleVocabMat()  # create the sample articles (tokenized)
    vocabList = getVocabList(articleMat)          # create the vocabulary
    # 2. Convert to vectors
    articleVectorMat = []  # word-occurrence vectors for all articles; 1 marks a word that occurs
    for article in articleMat:  # tokens of each article
        vectorList = word2Vector(vocabList, article)  # convert to a vector
        articleVectorMat.append(vectorList)
    npArticleVectorMat = np.array(articleVectorMat)
    npLabelList = np.array(labelList)
    return npArticleVectorMat, npLabelList, vocabList


if __name__ == '__main__':
    npArticleVectorMat, npLabelList, vocabList = ready()
    print(npArticleVectorMat)
    # [[0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1]
    #  [0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0]
    #  [0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1]
    #  [1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0]
    #  [0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0 0 0 1 1]
    #  [0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]]
    print(npLabelList)
    # [0 1 0 1 0 1]
    print(vocabList)
    # ['posting', 'ate', 'please', 'help', 'him', 'cute', 'maybe', 'food', 
    #   'worthless', 'dalmation', 'buying', 'dog', 'take', 'quit', 'love', 'park', 
    #   'is', 'has', 'stupid', 'to', 'stop', 'not', 'steak', 'flea', 
    #   'how', 'so', 'licks', 'garbage', 'I', 'problems', 'mr', 'my']
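
Because the vocabulary above comes from a `set`, its order (and therefore the column order of the printed matrix) can change from one run to the next. A possible variation, sketched below with hypothetical helper names `build_vocab` and `to_vector`, sorts the vocabulary for reproducible columns and uses a dictionary for the word-to-column lookup instead of `list.index`. It uses plain lists and only two of the sample articles to stay short.

```python
def build_vocab(articles):
    """Sorted vocabulary, so the column order is the same on every run."""
    return sorted({word for article in articles for word in article})

def to_vector(vocab_index, words):
    """Set-of-words vector: 1 where a vocabulary word occurs in `words`."""
    vec = [0] * len(vocab_index)
    for word in words:
        idx = vocab_index.get(word)  # None for out-of-vocabulary words
        if idx is not None:
            vec[idx] = 1
    return vec

articles = [
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
]
vocab = build_vocab(articles)
vocab_index = {word: i for i, word in enumerate(vocab)}  # word -> column

vec = to_vector(vocab_index, ['stupid', 'dog', 'unknown'])
print(vocab)
print(vec)
```

Words that are not in the vocabulary, such as `'unknown'` here, are simply ignored, which mirrors the `if word in vocabList` check in `word2Vector`.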

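With the sample data ready, the next step is to train the classifier.
Compute the probability of the abusive class over all samples: P(abusive) = P(abusive|all sample articles)
Compute the probability of each word within the abusive class: P(word|abusive)
Do the same for the non-abusive class.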


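def getClassify(npArticleVectorMat, npLabelList):  # train the naive Bayes classifier
    # npArticleVectorMat: word-occurrence vectors of the 6 articles
    # npLabelList: the given label vector, abusive (1) or not abusive (0)
    articleCount = len(npArticleVectorMat)  # number of articles = 6
    villainCount = sum(npLabelList)  # summing the labels counts the elements equal to 1 = 3

    # 1. [ P(abusive) = P(abusive|all articles) = 3/6 ], the prior; P(not abusive) is likewise 0.5
    villainRatio = villainCount / float(articleCount)  # probability of the abusive class (articles) = 0.5

    vocabCount = len(npArticleVectorMat[0])  # length of each article vector, i.e. size of the deduplicated vocabulary = 32
    p1Num = np.ones(vocabCount)  # per-word counts in the abusive class; starts at 32 ones [Laplace smoothing], so the real count is this minus 1
    p0Num = np.ones(vocabCount)  # per-word counts in the non-abusive class
    p1Denom = 2.0  # total word count over the abusive articles; the denominator starts at 2 [Laplace smoothing]
    p0Denom = 2.0  # total word count over the non-abusive articles
    for i in range(articleCount):
        label = npLabelList[i]
        articleVec = npArticleVectorMat[i]  # word vector of one article; 0 means the word is absent
        wordCount = sum(articleVec)  # summing counts the elements equal to 1 = number of distinct words in the article
        if 1 == label:  # 2. gather the data for the abusive-class conditionals P(w0|1), P(w1|1), P(w2|1)...
            p1Num += articleVec   # element-wise add: accumulates each word's count
            p1Denom += wordCount  # accumulate the number of elements equal to 1
        else:           # gather the data for the non-abusive-class conditionals P(w0|0), P(w1|0), P(w2|0)...
            p0Num += articleVec
            p0Denom += wordCount

    # p1Num now holds the (smoothed) count of each word in the abusive class
    # p1Denom is now the (smoothed) total word count over the abusive articles
    # 3. [ P(word|abusive) ]: conditional probability of each word given the abusive class
    p1Ratio = p1Num/p1Denom  # divide every element by p1Denom: each word's share of the abusive-class words
    p0Ratio = p0Num/p0Denom  # [ P(word|not abusive) ]
    p1Vect = np.log(p1Ratio)  # take logs to avoid underflow
    p0Vect = np.log(p0Ratio)

    # p1V: log of [ P(word|abusive) ], each word's conditional probability given the abusive class
    # p0V: log of [ P(word|not abusive) ], each word's conditional probability given the non-abusive class
    # pAb: [ P(abusive) = P(abusive|all articles) = 3/6 ]; P(not abusive) is 1 - pAb
    return p1Vect, p0Vect, villainRatio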


if __name__ == '__main__':
    npArticleVectorMat, npLabelList, vocabList = ready()
    p1V, p0V, pAb = getClassify(npArticleVectorMat, npLabelList)
    print(p1V)
    # [-3.04452244 -2.35137526 -3.04452244 -1.94591015 -3.04452244 -2.35137526
    #  -2.35137526 -2.35137526 -2.35137526 -3.04452244 -3.04452244 -2.35137526
    #  -3.04452244 -1.65822808 -3.04452244 -2.35137526 -2.35137526 -2.35137526
    #  -3.04452244 -2.35137526 -3.04452244 -3.04452244 -2.35137526 -3.04452244
    #  -3.04452244 -1.94591015 -3.04452244 -3.04452244 -2.35137526 -3.04452244
    #  -3.04452244 -3.04452244]
    print(p0V)
    # [-1.87180218 -3.25809654 -2.56494936 -2.56494936 -2.56494936 -2.15948425
    #  -3.25809654 -3.25809654 -2.56494936 -2.56494936 -2.56494936 -3.25809654
    #  -2.56494936 -3.25809654 -2.56494936 -3.25809654 -2.56494936 -3.25809654
    #  -2.56494936 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936
    #  -2.56494936 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936
    #  -2.56494936 -2.56494936]
    print(pAb)
    # 0.5
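
Two details of the training code are worth seeing in isolation: why the counts start at 1 (with the denominator at 2), and why logarithms are taken. The sketch below is illustrative only. `smoothed_log_prob` is a hypothetical helper that mirrors the initialization used in `getClassify`, and `word_probs` is made-up data standing in for a long document.

```python
import math

def smoothed_log_prob(count, total):
    """log((count + 1) / (total + 2)), mirroring the initial values in getClassify."""
    return math.log((count + 1) / (total + 2))

# The three abusive sample articles contain 8 + 5 + 6 = 19 words in total.
never_seen = smoothed_log_prob(0, 19)    # word absent from the abusive articles
seen_3_times = smoothed_log_prob(3, 19)  # 'stupid' occurs in all three of them
print(never_seen, seen_3_times)

# Without logs, multiplying many small probabilities underflows to zero.
word_probs = [1e-5] * 100  # hypothetical per-word probabilities

product = 1.0
for p in word_probs:
    product *= p  # the true value, 1e-500, is too small for a float

log_sum = sum(math.log(p) for p in word_probs)  # stays finite

print(product, log_sum)
```

The two smoothed values correspond to the -3.04452244 and -1.65822808 entries in the printed `p1V` above. Without the added 1, a word never seen in a class would have probability 0, and its logarithm would be undefined.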

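Finally, test the classifier: it is enough to compare the probabilities of the words under the two classes.
As before, first convert the natural-language words to a vector, then combine it with the prior and the conditional probabilities to obtain a score for each test sample under each class.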

def classifyTest(vec2Classify, p1Vec, p0Vec, pClass1):  # naive Bayes classification: prediction
    # To compare only, use P(words|abusive)*P(abusive) and P(words|not abusive)*P(not abusive)
    # Multiply element-wise, then sum. log(A * B) = logA + logB, so log(pClass1) is added here
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)        # log[ P(words|abusive) * P(abusive) ]
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)  # log[ P(words|not abusive) * P(not abusive) ]
    if p1 > p0:
        return 1
    else:
        return 0
 
 
def test(p1V, p0V, pAb, vocabList):  # prediction
    testEntry = ['love', 'my', 'dalmation']  # test sample 1
    vectorEntry = np.array(word2Vector(vocabList, testEntry))  # vectorize the test sample
    if classifyTest(vectorEntry, p1V, p0V, pAb):
        print(testEntry, 'is classified as abusive')
    else:
        print(testEntry, 'is classified as not abusive')

    testEntry = ['stupid', 'garbage']  # test sample 2
    vectorEntry = np.array(word2Vector(vocabList, testEntry))  # vectorize the test sample
    if classifyTest(vectorEntry, p1V, p0V, pAb):
        print(testEntry, 'is classified as abusive')
    else:
        print(testEntry, 'is classified as not abusive')


if __name__ == '__main__':
    npArticleVectorMat, npLabelList, vocabList = ready()
    p1V, p0V, pAb = getClassify(npArticleVectorMat, npLabelList)
    test(p1V, p0V, pAb, vocabList)
    # ['love', 'my', 'dalmation'] is classified as not abusive
    # ['stupid', 'garbage'] is classified as abusive
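
The classifier only compares two log scores, but they can also be turned back into a posterior probability when a confidence value is wanted. The following is a sketch rather than part of the original code. `posterior_from_log_scores` is a hypothetical helper, and the numbers plugged in are the smoothed counts for ['stupid', 'garbage'] derived from the six sample articles (4/21 and 2/21 in the abusive class, 1/26 each in the non-abusive class).

```python
import math

def posterior_from_log_scores(log_p1, log_p0):
    """Normalize two log joint scores into P(class 1|words)."""
    m = max(log_p1, log_p0)    # subtract the max before exponentiating
    e1 = math.exp(log_p1 - m)
    e0 = math.exp(log_p0 - m)
    return e1 / (e1 + e0)

# log[ P(words|class) * P(class) ] for the test sample ['stupid', 'garbage']
log_p1 = math.log(4 / 21) + math.log(2 / 21) + math.log(0.5)  # abusive class
log_p0 = 2 * math.log(1 / 26) + math.log(0.5)                 # non-abusive class

p_abusive = posterior_from_log_scores(log_p1, log_p0)
print(p_abusive)
```

With these counts the posterior comes out at roughly 0.92, consistent with the sample being labeled abusive. Subtracting the larger score before calling `exp` keeps the computation stable even when both log scores are very negative.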

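The example computes the statistics of the two classes separately even though the classes are related, which is not entirely obvious. For more detail, see: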
https://cuijiahua.com/blog/2017/11/ml_5_bayes_2.html

- end

Notice

This article is an original work by cuiweiyou. When reposting, please credit the source: http://www.gaohaiyan.com/3529.html
