PHP Code for Information Collection (Web Page Content Scraping)

Author: 五速梦信息网
Date: March 19, 2026, 18:00
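
The script below locates links with regular expressions and cuts the address out of each matched tag with fixed `substr` offsets, which only works while the target site's markup stays exactly the same. As a minimal sketch of an alternative, assuming PHP's bundled DOM extension is available, the same link-extraction step could be done with `DOMDocument`. The helper name `extractLinks` and the sample markup are hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper (not part of the original script): returns every
// <a href="..."> in the given HTML as ['href' => ..., 'text' => ...] pairs.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        // Skip anchors that carry no address (e.g. <a name="...">).
        if (!$a->hasAttribute('href')) {
            continue;
        }
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

// Sample markup made up for illustration.
$sample = '<ul><li><a href="/news/1.html" target="_blank">First title</a></li>'
        . '<li><a href="/news/2.html">Second title</a></li></ul>';
foreach (extractLinks($sample) as $link) {
    echo $link['href'], ' ', $link['text'], "\n";
}
```

Because the address is read from the `href` attribute itself, extra attributes such as `target="_blank"` no longer affect the result. Character-encoding handling for non-ASCII pages and resolving relative addresses against the site URL are left out of this sketch.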


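<?php
// URL of the home page to scrape
$url = "http://www.xz-src.com/";
// Fetch the page source
$rs = file_get_contents($url);
// Set up the matching regex
// (optional debugging aid: append the fetched source to text.txt)
//$fp = fopen("text.txt", "a");
//$fw = fwrite($fp, $rs);
//fclose($fp);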
/*/
$preg = '/<a\s+href="[^>]+">(.)/i';
// Run the regex search
preg_match_all($preg, $rs, $title);
// Count the matched titles
$count = count($title[0]);
echo $count . "\n";
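// Scrape one content page per matched title
for ($i = 0; $i < $count; $i++) {
// Build the content page URL
$pr = '/<a\s+href=\"[^>]+\">/isU';
preg_match_all($pr, $title[0][$i], $jurl);
// Drop the leading 9 characters: <a href="
$substr = substr($jurl[0][0], 9);
// Drop the trailing 18 characters (presumably " target="_blank">); this offset is specific to the target site's markup
$curl = substr($substr, 0, -18);
// Fetch the content page source
$c = file_get_contents($curl);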
//设置内容页匹配正则
\)pc=‘/&lt;a\s+href=&#34;[^&gt;]+&#34;&gt;/i’;
//进行正则匹配搜索
preg_match(\(pc,\)c,\(content);
// Output the title
echo $title[0][$i] . "<br/>";
echo $title[1][$i] . "<br/>";
$concount = count($content[0]);
echo $concount . "\n";
echo $content[0][0];
for ($j=0;$j