Fixing Web Spider Errors with Guzzle Cookie Header Redirects and User Agent

November 21, 2022

laravel londinium

Over the weekend, I ran a spider on the websites listed in the OpenStreetMap system.

These are the error codes returned, and the count for each:

HTTP Error              Count
401                     13
402                     6
403                     1274
404                     2512
406                     1
409                     9
410                     13
423                     2
426                     1
429                     15
453                     1
500                     131
502                     7
503                     165
523                     1
526                     2
530                     4
connection exception    3558
request exception       400

Here is a list explaining the error codes.

The spider uses Guzzle and works reasonably well.

My first attempt to fix this was to add the following cookie jar and headers to the Guzzle client, but this didn't help much.

$jar = new \GuzzleHttp\Cookie\CookieJar();

$client = new \GuzzleHttp\Client(
    [
    'cookies' => $jar,
    'timeout' => 8.0,
    'http_errors' => false,
    'base_uri' => $url,
    'allow_redirects' => ['strict' => true],
    // request headers need to go under the 'headers' option
    'headers' => [
        'Referer' => 'http://www.google.com/',
        'Accept-Encoding' => 'gzip, deflate, br',
        'Accept-Language' => 'en-GB,en-US;q=0.9,en;q=0.8',
        'Accept' => 'text/html',
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
    ],
    ]
);
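
For context, here is a minimal sketch of how a request is then issued with this client and the result read back; the $response, $status and $html variable names are illustrative rather than taken from the spider. Because 'http_errors' is false, non-2xx responses are returned rather than thrown.

// Illustrative usage of the client above: fetch the page and inspect the result
$response = $client->request('GET', $url);

$status = $response->getStatusCode();     // e.g. 200, 403, 404
$html   = (string) $response->getBody();  // page HTML as a string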

My second attempt was to use Laravel Dusk and nunomaduro's laravel-console-dusk, but this had the same problem.

    //the cookie file name
    $cookie_file = 'cookies.txt';

    //create the driver
    $process = (new ChromeProcess())->toProcess();
    $process->start();
    $options = (new ChromeOptions())->addArguments(['--disable-gpu','--enable-file-cookies','--no-sandbox', '--headless']);
    $capabilities = DesiredCapabilities::chrome()->setCapability(ChromeOptions::CAPABILITY, $options);
    $driver = retry(5, function () use ($capabilities) {
        return RemoteWebDriver::create('http://localhost:9515', $capabilities);
    }, 50);

    $this->browse(function ($browser) use ($id, $nwr, $url) {
        $browser->visit($url)
                ->pause(5);
        // ... the rest of the scraping steps
    });
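
For anyone trying the same route, here is a hedged sketch of how the rendered HTML could be pulled back out of the Dusk browser; the $html variable, the 5000 ms pause, and the getPageSource() call are my additions rather than part of the original spider.

    // Hypothetical sketch: capture the rendered HTML from the browser
    $html = null;

    $this->browse(function ($browser) use ($url, &$html) {
        $browser->visit($url)
                ->pause(5000); // pause() takes milliseconds

        // Dusk exposes the underlying RemoteWebDriver as $browser->driver
        $html = $browser->driver->getPageSource();
    });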

My third attempt was to use cURL, and it worked like magic.

    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
        CURLOPT_ENCODING => 'gzip, deflate',
        CURLOPT_HTTPHEADER => array(
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
            'Accept-Encoding: gzip, deflate',
            'Connection: keep-alive',
            'Upgrade-Insecure-Requests: 1',
        ),
    ));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 6);
    curl_setopt($ch, CURLOPT_TIMEOUT, 6);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // note: disables TLS certificate verification
    // curl_setopt($ch, CURLOPT_VERBOSE, true);

    $htmlORIG = curl_exec($ch);

    if (curl_errno($ch)) {
        print "CURL ERROR: " . curl_error($ch);
    }

    // read the status code and server IP before closing the handle
    $http_status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $ip = curl_getinfo($ch, CURLINFO_PRIMARY_IP);
    curl_close($ch);

    echo PHP_EOL . $http_status . PHP_EOL;
    echo "IP: " . $ip . PHP_EOL;

This allowed me to spider sites that had blocked the previous two approaches, and it removed all the false errors from the first Guzzle-based spider.

Although not perfect (it still has problems with sites behind Cloudflare), it was a huge step in the right direction.

I would be interested in hearing from others about how they handle spidering sites protected by Cloudflare, and about ways to do the same when using Guzzle and Laravel Dusk.


If you would like to contact me, use the form on londinium.com or ilminster.net, or reach me via Twitter @andylondon.