Fixing Garbled Chinese Text in QueryPath

Introduction to QueryPath

QueryPath describes itself as a quick and easy way to work with XML and HTML.
Official site: QueryPath.org
QueryPath documentation

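QueryPath is one of the better tools for scraping (crawling) in PHP.
While scraping data with QueryPath recently, I ran into a problem: Chinese text came back garbled.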

require 'QueryPath/QueryPath.php';

$html = '<td>广西-南宁</td>';

$str = htmlqp($html, 'td')->text();

// The parsed Chinese text comes out as ??-??

Solution

The htmlqp() function takes three parameters:

/**
 * Get an htmlqp instance.
 *
 * @param mixed  $document HTML/XML document
 * @param string $selector CSS selector, e.g. #box > .hoe p
 * @param array  $options  array of options
 * @return \QueryPath\DOMQuery
 */
function htmlqp($document = NULL, $selector = NULL, $options = array()) {

  return QueryPath::withHTML($document, $selector, $options);
}

To fix the garbled Chinese text, just pass the following settings in $options:

require 'QueryPath/QueryPath.php';

$options = [
    'convert_from_encoding' => 'UTF-8', // these two options look pointless (UTF-8 to UTF-8),
    'convert_to_encoding'   => 'UTF-8', // but they are what fixes the problem
    'strip_low_ascii'       => false,
];

$html = '<td>广西-南宁</td>';

$str = htmlqp($html, 'td', $options)->text(); // 广西-南宁
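
Why would a UTF-8 to UTF-8 conversion matter? One plausible explanation, and this is an assumption rather than something stated in the QueryPath docs, is that without these options the text ends up being converted to a single-byte encoding such as ISO-8859-1 (the value the option docs suggest for old HTML). The sketch below uses only PHP's mbstring extension, not QueryPath, to show what that kind of conversion does to Chinese text:

```php
<?php
// Standalone illustration using mbstring only; QueryPath is not involved.
// Assumption: the "??-??" symptom comes from a conversion to a single-byte
// encoding that has no Chinese characters, such as ISO-8859-1.

$text = '广西-南宁';

// Characters that ISO-8859-1 cannot represent are replaced by mbstring's
// substitute character, which is "?" by default.
$lossy = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

// Converting UTF-8 to UTF-8 leaves the string unchanged.
$safe = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

echo $lossy, PHP_EOL; // ??-??
echo $safe, PHP_EOL;  // 广西-南宁
```

If that assumption holds, setting both options to UTF-8 simply keeps the text in an encoding that can represent it.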

If the text is still garbled, manually add the HTML declaration and the basic elements:

$html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
$html .= '<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>';
$html .= '<td>广西-南宁</td>';
$html .= '</body></html>';
$str = htmlqp($html, 'td', $options)->text(); // 广西-南宁
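
If you scrape many fragments, the wrapping above can be pulled into a small function. This is only a sketch: wrapHtmlFragment() and its $charset parameter are hypothetical names, not part of QueryPath, and the function reuses the same doctype and meta tag as the snippet above.

```php
<?php
// wrapHtmlFragment() is a hypothetical helper, not part of QueryPath.
// It wraps a bare fragment in the same doctype, charset meta tag and body
// used in the snippet above, so the parser sees an explicit encoding.
function wrapHtmlFragment(string $fragment, string $charset = 'utf-8'): string
{
    $head = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
          . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
          . '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
          . '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '" />'
          . '</head><body>';

    return $head . $fragment . '</body></html>';
}

$html = wrapHtmlFragment('<td>广西-南宁</td>');
// $html can now be passed to htmlqp($html, 'td', $options) as before.
```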

See the related GitHub issues for details.

For reference, here is the documentation for the $options parameter:

Supported Options

  • context: A stream context object. This is used to pass context info
    to the underlying file IO subsystem.
  • encoding: A valid character encoding, such as 'utf-8' or 'ISO-8859-1'.
    The default is system-dependent, typically UTF-8. Note that this is
    only used when creating new documents, not when reading existing content.
    (See convert_to_encoding below.)
  • parser_flags: An OR-combined set of parser flags. The flags supported
    by the DOMDocument PHP class are all supported here.
  • omit_xml_declaration: Boolean. If this is TRUE, then certain output
    methods (like {@link QueryPath::xml()}) will omit the XML declaration
    from the beginning of a document.
  • format_output: Boolean. If this is set to TRUE, QueryPath will format
    the HTML or XML output to make it more readable. If this is set to
    FALSE, QueryPath will minimize whitespace to keep the document smaller
    but harder to read.
  • replace_entities: Boolean. If this is TRUE, then any of the insertion
    functions (before(), append(), etc.) will replace named entities with
    their decimal equivalent, and will replace un-escaped ampersands with
    a numeric entity equivalent.
  • ignore_parser_warnings: Boolean. If this is TRUE, then E_WARNING messages
    generated by the XML parser will not cause QueryPath to throw an exception.
    This is useful when parsing badly mangled HTML, or when failure to find
    files should not result in an exception. By default, this is FALSE -- that
    is, parsing warnings and IO warnings throw exceptions.
  • convert_to_encoding: Use the MB library to convert the document to the
    named encoding before parsing. This is useful for old HTML (set it to
    iso-8859-1 for best results). If this is not supplied, no character set
    conversion will be performed. See {@link mb_convert_encoding()}.
    (QueryPath 1.3 and later)
  • convert_from_encoding: If 'convert_to_encoding' is set, this option can be
    used to explicitly define what character set the source document is using.
    By default, QueryPath will allow the MB library to guess the encoding.
    (QueryPath 1.3 and later)
  • strip_low_ascii: If this is set to TRUE then markup will have all low ASCII
    characters (<32) stripped out before parsing. This is good in cases where
    icky HTML has (illegal) low characters in the document.
  • use_parser: If 'xml', parse the document as XML. If 'html', parse the
    document as HTML. Note that the XML parser is very strict, while the
    HTML parser is more lenient, but does enforce some of the DTD/Schema.
    By default, QueryPath autodetects the type.
  • escape_xhtml_js_css_sections: XHTML needs script and css sections to be
    escaped. Yet older readers do not handle CDATA sections, and comments do not
    work properly (for numerous reasons). By default, QueryPath's *XHTML methods
    will wrap a script body with a CDATA declaration inside of C-style comments.
    If you want to change this, you can set this option with one of the
    JS_CSS_ESCAPE_* constants, or you can write your own.
  • QueryPath_class: (ADVANCED) Use this to set the actual classname that
    {@link qp()} loads as a QueryPath instance. It is assumed that the
    class is either {@link QueryPath} or a subclass thereof. See the test
    cases for an example.
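
Going by the descriptions of convert_from_encoding and convert_to_encoding above, the same pair of options should also help when the source page is not UTF-8 at all, for example an older Chinese site served as GBK. The following is a sketch under that assumption, not tested QueryPath behavior: buildScrapeOptions() is a hypothetical helper, and the conversion at the end calls mb_convert_encoding() directly (the function the option docs point to) instead of going through QueryPath.

```php
<?php
// buildScrapeOptions() is a hypothetical helper, not part of QueryPath.
// It only assembles option keys described in the documentation above.
function buildScrapeOptions(string $sourceEncoding): array
{
    return [
        'convert_from_encoding' => $sourceEncoding, // encoding the page is served in
        'convert_to_encoding'   => 'UTF-8',         // encoding we want to work with
        'strip_low_ascii'       => false,
    ];
}

$options = buildScrapeOptions('GBK');

// Simulate a GBK-encoded page, then apply the same conversion by hand.
$gbkHtml  = mb_convert_encoding('<td>广西-南宁</td>', 'GBK', 'UTF-8');
$utf8Html = mb_convert_encoding(
    $gbkHtml,
    $options['convert_to_encoding'],
    $options['convert_from_encoding']
);

echo $utf8Html, PHP_EOL; // <td>广西-南宁</td>
```

Naming the source encoding explicitly avoids relying on the MB library's guess, which the docs mention as the default.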
