Fixing Web Spider Errors with Guzzle Cookie Header Redirects and User Agent
November 21, 2022
Over the weekend, I ran a spider on the websites listed in OpenStreetMap.
These are the error codes returned, with the count for each:
| Error | Count |
| --- | --- |
| 401 | 13 |
| 402 | 6 |
| 403 | 1274 |
| 404 | 2512 |
| 406 | 1 |
| 409 | 9 |
| 410 | 13 |
| 423 | 2 |
| 426 | 1 |
| 429 | 15 |
| 453 | 1 |
| 500 | 131 |
| 502 | 7 |
| 503 | 165 |
| 523 | 1 |
| 526 | 2 |
| 530 | 4 |
| connection exception | 3558 |
| request exception | 400 |
The numeric rows are HTTP status codes returned by the sites (a few, such as 523, 526 and 530, are Cloudflare-specific); the last two rows are Guzzle exceptions thrown before any usable response came back.
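Roughly, the spider buckets failures like this (a simplified sketch rather than the real code, assuming a $client configured as shown in the next section):

use GuzzleHttp\Exception\ConnectException;
use GuzzleHttp\Exception\RequestException;

// Simplified sketch: classify each request the same way the table above does.
try {
    $response = $client->get($url);
    $result = $response->getStatusCode();   // 401, 403, 404, ... ('http_errors' is off)
} catch (ConnectException $e) {
    $result = 'connection exception';        // DNS failure, refused connection, timeout
} catch (RequestException $e) {
    $result = 'request exception';           // other request-level failures, e.g. too many redirects
}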
The spider uses Guzzle and works reasonably well.
My first attempt at a fix was to add the following headers to the Guzzle client, but this didn't help much:
$jar = new \GuzzleHttp\Cookie\CookieJar();
$client = new \GuzzleHttp\Client(
    [
        'cookies'         => $jar,
        'timeout'         => 8.0,
        'http_errors'     => false,
        'base_uri'        => $url,
        'allow_redirects' => ['strict' => true],
        // Request headers have to be nested under the 'headers' option;
        // at the top level of the client config they are never sent.
        'headers' => [
            'Referer'         => 'http://www.google.com/',
            'Accept-Encoding' => 'gzip, deflate, br',
            'Accept-Language' => 'en-GB,en-US;q=0.9,en;q=0.8',
            'Accept'          => 'text/html',
            'User-Agent'      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
        ],
    ]
);
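With 'http_errors' set to false, Guzzle does not throw for 4xx/5xx responses, so the spider has to look at the status code itself. A minimal usage sketch (the logging here is just illustrative):

$response = $client->request('GET', $url);

if ($response->getStatusCode() >= 400) {
    // record it against the status code buckets in the table above
    echo $url . ' returned HTTP ' . $response->getStatusCode() . PHP_EOL;
} else {
    $html = (string) $response->getBody();
}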
My second attempt was to use Laravel Dusk and nunomaduro's laravel-console-dusk, but this had the same problem:
use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Laravel\Dusk\Chrome\ChromeProcess;

// the cookie file name
$cookie_file = 'cookies.txt';

// start chromedriver and create the WebDriver instance
$process = (new ChromeProcess())->toProcess();
$process->start();
$options = (new ChromeOptions())->addArguments(['--disable-gpu', '--enable-file-cookies', '--no-sandbox', '--headless']);
$capabilities = DesiredCapabilities::chrome()->setCapability(ChromeOptions::CAPABILITY, $options);

// retry while chromedriver starts up: 5 attempts, 50 ms apart
$driver = retry(5, function () use ($capabilities) {
    return RemoteWebDriver::create('http://localhost:9515', $capabilities);
}, 50);

$this->browse(function ($browser) use ($id, $nwr, $url) {
    $browser->visit($url)
            ->pause(5000); // pause() takes milliseconds
});
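One way to pull the rendered HTML back out is through the php-webdriver instance that Dusk exposes on the Browser object; this is a sketch of the idea rather than the exact code the spider ended up with:

$this->browse(function ($browser) use ($url, &$html) {
    $browser->visit($url)->pause(5000);

    // Dusk exposes the underlying RemoteWebDriver as $browser->driver,
    // so the fully rendered page source can be captured here.
    $html = $browser->driver->getPageSource();
});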
My third attempt was to use cURL, and it worked like magic:
$ch = curl_init($url);
curl_setopt_array($ch, array(
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
    CURLOPT_ENCODING => 'gzip, deflate',
    CURLOPT_HTTPHEADER => array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
        'Accept-Encoding: gzip, deflate',
        'Connection: keep-alive',
        'Upgrade-Insecure-Requests: 1',
    ),
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 6);
curl_setopt($ch, CURLOPT_TIMEOUT, 6);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// curl_setopt($ch, CURLOPT_VERBOSE, true);

$htmlORIG = curl_exec($ch);

if (curl_errno($ch)) {
    print "CURL ERROR: " . curl_error($ch);
}

// read the response details before closing the handle
$http_status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$ip          = curl_getinfo($ch, CURLINFO_PRIMARY_IP);
curl_close($ch);

echo PHP_EOL . $http_status . PHP_EOL;
echo "IP: " . $ip . PHP_EOL;
This allowed me to spider sites that had blocked the previous two approaches, and it resolved the false errors reported by the first, Guzzle-based spider.
It is still not perfect, as it has problems with sites behind Cloudflare, but it was a huge step in the right direction.
I would be interested in hearing from others about how they handle spidering sites protected by Cloudflare, and about ways to achieve the same result with Guzzle and Laravel Dusk.
If you would like to contact me, you can use the form on londinium.com or ilminster.net, or reach me on Twitter @andylondon.