首页 - 技术栈

用php实现一个敏感词过滤功能

作者: 五速梦信息网
时间: 2026年06月17日 18:05

用php实现一个敏感词过滤功能曹操 2023-03-28 369 ℃

        9575字

free

网站内容有过多的敏感词，会导致K站。一方面是你懂的，另一方面是我们自己可能也要过滤一些人身攻击或者广告信息等，具体词库可以google下，有很多。

过滤敏感词，使用简单的循环str_replace是性能很低效的，还会随着词库的增加，性能指数下降，而且简单的替换，不能解决一些不是完全匹配的词。这时候就需要先构建一个字典树(trie)，单纯的字典树占用空间较大，使用Double-Array Trie或者Ternary Search Tree可以在保证性能的同时节省一部分空间，但是敏感词基本不会很多，几千甚至上万个词基本没压力，所以就实现就选择先构建一个字典树，然后逐字做匹配。

代码不多，就贴到这里。

函数部分

PHP

&lt;?php
class SensitiveWordFilter
{

private $dict;
private $dictPath;
public function __construct($dictPath)
{
    $this-&gt;dict = array();
    $this-&gt;dictPath = $dictPath;
    $this-&gt;initDict();
}
private function initDict()
{
    $handle = fopen($this-&gt;dictPath, &#39;r&#39;);
    if (!$handle) {
        throw new RuntimeException(&#39;open dictionary file error.&#39;);
    }
    while (!feof($handle)) {
        $word = trim(fgets($handle, 128));
        if (empty($word)) {
            continue;
        }
        $uWord = $this-&gt;unicodeSplit($word);
        $pdict = &amp;$this-&gt;dict;
        $count = count($uWord);
        for ($i = 0; $i &lt; $count; $i++) {
            if (!isset($pdict[$uWord[$i]])) {
                $pdict[$uWord[$i]] = array();
            }
            $pdict = &amp;$pdict[$uWord[$i]];
        }
        $pdict[&#39;end&#39;] = true;
    }
    fclose($handle);
}
public function filter($str, $maxDistance = 5)
{
    if ($maxDistance &lt; 1) {
        $maxDistance = 1;
    }
    $uStr = $this-&gt;unicodeSplit($str);
    $count = count($uStr);
    for ($i = 0; $i &lt; $count; $i++) {
        if (isset($this-&gt;dict[$uStr[$i]])) {
            $pdict = &amp;$this-&gt;dict[$uStr[$i]];
            $matchIndexes = array();
            for ($j = $i + 1, $d = 0; $d &lt; $maxDistance &amp;&amp; $j &lt; $count; $j++, $d++) {
                if (isset($pdict[$uStr[$j]])) {
                    $matchIndexes[] = $j;
                    $pdict = &amp;$pdict[$uStr[$j]];
                    $d = -1;
                }
            }
            if (isset($pdict[&#39;end&#39;])) {
                $uStr[$i] = &#39;*&#39;;
                foreach ($matchIndexes as $k) {
                    if ($k - $i == 1) {
                        $i = $k;
                    }
                    $uStr[$k] = &#39;*&#39;;
                }
            }
        }
    }
    return implode($uStr);
}
public function unicodeSplit($str)
{
    $str = strtolower($str);
    $ret = array();
    $len = strlen($str);
    for ($i = 0; $i &lt; $len; $i++) {
        $c = ord($str[$i]);
        if ($c &amp; 0x80) {


if ((\(c &amp; 0xf8) == 0xf0 &amp;&amp; \)len - \(i &gt;= 4) {
if ((ord(\)str[\(i + 1]) &amp; 0xc0) == 0x80 &amp;&amp; (ord(\)str[\(i + 2]) &amp; 0xc0) == 0x80 &amp;&amp; (ord(\)str[\(i + 3]) &amp; 0xc0) == 0x80) {
\)uc = substr(\(str, \)i, 4);
\(ret[] = \)uc;
\(i += 3;
}
} else if ((\)c & 0xf0) == 0xe0 && \(len - \)i &gt;= 3) {
if ((ord(\(str[\)i + 1]) & 0xc0) == 0x80 && (ord(\(str[\)i + 2]) & 0xc0) == 0x80) {
\(uc = substr(\)str, \(i, 3);
\)ret[] = \(uc;
\)i += 2;
}
} else if ((\(c &amp; 0xe0) == 0xc0 &amp;&amp; \)len - \(i &gt;= 2) {
if ((ord(\)str[\(i + 1])  &amp; 0xc0) == 0x80) {
\)uc = substr(\(str, \)i, 2);
\(ret[] = \)uc;
\(i += 1;
}
}
} else {
\)ret[] = \(str[\)i];
}
}
return \(ret;
}
}</code></pre>使用方法<p><img src="https://ccooc.cn/zb_system/image/admin/page_copy.png"/></p><p> PHP</p><pre><code>&lt;?php
require &#39;SensitiveWordFilter.php&#39;;
/*
初始化传入词库文件路径，词库文件每个词一个换行符。
如：
敏感1
敏感2
目前只支持UTF-8编码
*/
\)filter = new SensitiveWordFilter(DIR . ‘/data/minganwords.txt’);
/*
第一个参数传入要过滤的字符串，第二个是匹配的字间距，
比如‘枪支’是一个敏感词，想过滤‘枪||||支’的时候，
就需要指定一个两个字的间距，可以根据情况设定，
超过指定间距就不会过滤。所有匹配的敏感词会被替换为‘*’。
*/
\(groupname = &#34;这是一个敏感词&#34;;
\)check = \(filter-&gt;filter(\)groupname,2);
echo($check);