Fixing Web Spider Errors with Guzzle Cookie Header Redirects and User Agent
November 21, 2022
Over the weekend, I ran a spider on the websites listed in OpenStreetMap.
These are the error codes returned, with the count for each:
| Error | Count |
| --- | --- |
| 401 | 13 |
| 402 | 6 |
| 403 | 1274 |
| 404 | 2512 |
| 406 | 1 |
| 409 | 9 |
| 410 | 13 |
| 423 | 2 |
| 426 | 1 |
| 429 | 15 |
| 453 | 1 |
| 500 | 131 |
| 502 | 7 |
| 503 | 165 |
| 523 | 1 |
| 526 | 2 |
| 530 | 4 |
| connection exception | 3558 |
| request exception | 400 |
The numeric rows are HTTP status codes returned by the sites (a few, such as 523, 526 and 530, are Cloudflare-specific); the last two rows are Guzzle exceptions thrown before any usable response came back.
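Roughly, the spider buckets failures like this (a simplified sketch rather than the real code, assuming a $client configured as shown in the next section):

use GuzzleHttp\Exception\ConnectException;
use GuzzleHttp\Exception\RequestException;

// Simplified sketch: classify each request the same way the table above does.
try {
    $response = $client->get($url);
    $result = $response->getStatusCode();   // 401, 403, 404, ... ('http_errors' is off)
} catch (ConnectException $e) {
    $result = 'connection exception';        // DNS failure, refused connection, timeout
} catch (RequestException $e) {
    $result = 'request exception';           // other request-level failures, e.g. too many redirects
}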
The spider uses Guzzle and works reasonably well.
My first attempt at a fix was to add the following headers to the Guzzle client, but this didn't help much:
$jar = new \GuzzleHttp\Cookie\CookieJar();
$client = new \GuzzleHttp\Client(
    [
        'cookies'         => $jar,
        'timeout'         => 8.0,
        'http_errors'     => false,
        'base_uri'        => $url,
        'allow_redirects' => ['strict' => true],
        // Request headers have to be nested under the 'headers' option;
        // at the top level of the client config they are never sent.
        'headers' => [
            'Referer'         => 'http://www.google.com/',
            'Accept-Encoding' => 'gzip, deflate, br',
            'Accept-Language' => 'en-GB,en-US;q=0.9,en;q=0.8',
            'Accept'          => 'text/html',
            'User-Agent'      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
        ],
    ]
);
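With 'http_errors' set to false, Guzzle does not throw for 4xx/5xx responses, so the spider has to look at the status code itself. A minimal usage sketch (the logging here is just illustrative):

$response = $client->request('GET', $url);

if ($response->getStatusCode() >= 400) {
    // record it against the status code buckets in the table above
    echo $url . ' returned HTTP ' . $response->getStatusCode() . PHP_EOL;
} else {
    $html = (string) $response->getBody();
}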
My second attempt was to use Laravel Dusk and nunomaduro's laravel-console-dusk, but this had the same problem:
use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Laravel\Dusk\Chrome\ChromeProcess;

// the cookie file name
$cookie_file = 'cookies.txt';

// start chromedriver and create the WebDriver instance
$process = (new ChromeProcess())->toProcess();
$process->start();
$options = (new ChromeOptions())->addArguments(['--disable-gpu', '--enable-file-cookies', '--no-sandbox', '--headless']);
$capabilities = DesiredCapabilities::chrome()->setCapability(ChromeOptions::CAPABILITY, $options);

// retry while chromedriver starts up: 5 attempts, 50 ms apart
$driver = retry(5, function () use ($capabilities) {
    return RemoteWebDriver::create('http://localhost:9515', $capabilities);
}, 50);

$this->browse(function ($browser) use ($id, $nwr, $url) {
    $browser->visit($url)
            ->pause(5000); // pause() takes milliseconds
});
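One way to pull the rendered HTML back out is through the php-webdriver instance that Dusk exposes on the Browser object; this is a sketch of the idea rather than the exact code the spider ended up with:

$this->browse(function ($browser) use ($url, &$html) {
    $browser->visit($url)->pause(5000);

    // Dusk exposes the underlying RemoteWebDriver as $browser->driver,
    // so the fully rendered page source can be captured here.
    $html = $browser->driver->getPageSource();
});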
My third attempt was to use cURL, and it worked like magic:
$ch = curl_init($url);
curl_setopt_array($ch, array(
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
    CURLOPT_ENCODING => 'gzip, deflate',
    CURLOPT_HTTPHEADER => array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
        'Accept-Encoding: gzip, deflate',
        'Connection: keep-alive',
        'Upgrade-Insecure-Requests: 1',
    ),
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 6);
curl_setopt($ch, CURLOPT_TIMEOUT, 6);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// curl_setopt($ch, CURLOPT_VERBOSE, true);

$htmlORIG = curl_exec($ch);

if (curl_errno($ch)) {
    print "CURL ERROR: " . curl_error($ch);
}

// read the response details before closing the handle
$http_status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$ip          = curl_getinfo($ch, CURLINFO_PRIMARY_IP);
curl_close($ch);

echo PHP_EOL . $http_status . PHP_EOL;
echo "IP: " . $ip . PHP_EOL;
This allowed me to spider sites that had blocked the previous two approaches, and it resolved the false errors reported by the first, Guzzle-based spider.
It is still not perfect, as it has problems with sites behind Cloudflare, but it was a huge step in the right direction.
I would be interested in hearing from others about how they handle spidering sites protected by Cloudflare, and about ways to achieve the same result with Guzzle and Laravel Dusk.
If you would like to contact me, you can use the form on londinium.com or ilminster.net, or reach me on Twitter @andylondon.