Notes on the “Remote end closed connection without response” error in Python crawlers

Some websites load fine in a browser but raise an error when requested from Python:

>>> import requests
>>> requests.get("https://example.com")

Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    requests.get("https://example.com")
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
……
requests.exceptions.ProxyError: HTTPSConnectionPool(host='example.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Unable to connect to proxy', RemoteDisconnected('Remote end closed connection without response')))

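This usually means the server has blocked the crawler. By default, Requests sends python-requests/*.*.* (the library version) as its User-Agent. If the server inspects the User-Agent header and you are using the default, it may forcibly close the connection without sending a response.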

Setting the User-Agent to that of an ordinary browser is enough:

>>> import requests
>>> requests.get("https://example.com", headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:136.0) Gecko/20100101 Firefox/136.0"})
<Response [200]>
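When a crawler makes many requests, passing headers= on every call gets repetitive. One possible approach, sketched below, is to set the header once on a requests.Session so that every request made through it carries the browser-like UA. The names BROWSER_UA and make_session are made up for this illustration, and the UA string is only an example value.

```python
import requests

# Hypothetical constant for illustration: any current browser UA string works.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:136.0) "
    "Gecko/20100101 Firefox/136.0"
)


def make_session(user_agent=BROWSER_UA):
    """Return a requests.Session that sends `user_agent` on every request."""
    session = requests.Session()
    # Session-level headers are merged into each request made via the session.
    session.headers.update({"User-Agent": user_agent})
    return session


# A fresh session starts with the library default, python-requests/<version>.
default_ua = requests.Session().headers["User-Agent"]

session = make_session()
# session.get("https://example.com") would now send BROWSER_UA instead.
```

Headers passed to an individual call, such as session.get(url, headers={...}), still override the session-level value for that request.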

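If the server recognizes you as a crawler for other reasons, such as a high request rate, it may block you in the same way. Handle that accordingly: slow down your requests, switch IP addresses, and so on.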
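As a rough sketch of lowering the request rate, the helper below enforces a minimum, slightly randomized gap between consecutive requests. The Throttle class and its min_interval and jitter parameters are hypothetical names chosen for this example, and the right interval depends entirely on the target site.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between consecutive requests."""

    def __init__(self, min_interval=1.0, jitter=0.5):
        self.min_interval = min_interval  # seconds between requests, at least
        self.jitter = jitter              # extra random delay, 0..jitter seconds
        self._last = None                 # monotonic time of the previous call

    def wait(self):
        """Sleep just long enough to honor the interval, then record the time."""
        if self._last is not None:
            target = self.min_interval + random.uniform(0, self.jitter)
            remaining = target - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Intended usage (not executed here, since it needs network access):
#
#     throttle = Throttle(min_interval=1.0)
#     for url in urls:
#         throttle.wait()
#         requests.get(url, headers={"User-Agent": "..."})
```

The first call to wait() returns immediately and only later calls sleep. To switch IP addresses, Requests accepts a proxies= argument on each request, though which proxies to use is outside the scope of this note.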
