Monday Morning Updating The Database And Improving The Website Spider
February 28, 2022
laravel londinium openstreetmap
Over the weekend, I like to explore new ideas and potential improvements for londinium.com, discover new applications and find new code bases. One area I was looking at is the ability to get scores for websites. Many people have heard of Google PageRank and Alexa site ratings, but I want to score a website on a variety of factors and use these scores to display a scorecard for each website URL.
But now it is Monday morning, and I like to start the week by updating the data, which means downloading a new data file from https://overpass-api.de/api/interpreter.
I import this into the database and then run a website spider over the links to find more data for each entry and see what has changed.
This is the Overpass API query I run each week, which covers a large area around London:
data=[out:json];nwr[~"^(brand|website|twitter|facebook|contact:website|contact:twitter|contact:facebook)$"~"."]
(51.11386850819646,-1.197509765625,51.92394344554469,0.85418701171875);out center;
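For readers who want to script the download, here is a minimal Python sketch of how the same query could be assembled and sent. It is not the code used on londinium.com (which is Laravel); the function names, the `london.json` output path and the timeout are assumptions made for illustration. The tag keys, bounding box and endpoint are taken from the query above.

```python
from urllib.parse import urlencode
import urllib.request

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# Tag keys and bounding box (south, west, north, east) from the weekly query.
KEYS = ["brand", "website", "twitter", "facebook",
        "contact:website", "contact:twitter", "contact:facebook"]
LONDON_BBOX = (51.11386850819646, -1.197509765625,
               51.92394344554469, 0.85418701171875)


def build_overpass_query(keys, bbox):
    """Build an Overpass QL query for nodes, ways and relations (nwr)
    that have any of the given tag keys with a non-empty value."""
    south, west, north, east = bbox
    key_regex = "^(" + "|".join(keys) + ")$"
    return (
        "[out:json];"
        f'nwr[~"{key_regex}"~"."]'
        f"({south},{west},{north},{east});"
        "out center;"
    )


def build_request_body(query):
    """URL-encode the query as the `data` form field."""
    return urlencode({"data": query}).encode("utf-8")


def download(path="london.json", timeout=300):
    """Hypothetical helper: POST the query and save the JSON response."""
    body = build_request_body(build_overpass_query(KEYS, LONDON_BBOX))
    with urllib.request.urlopen(OVERPASS_URL, data=body, timeout=timeout) as resp:
        with open(path, "wb") as fh:
            fh.write(resp.read())
```

Calling `download()` would write the raw JSON to disk, ready for whatever import step follows. Keeping the query builder separate from the network call makes it easy to change the bounding box or tag list without touching the request code.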
There is a big fall in the number of entries this week: approximately 55,000, compared to 62,000 last week, a drop of about 7,000 (roughly 11%). I wonder what is causing this difference? I will investigate in a future blog post.
One error I was having with the spider occurs when a value is too long for its database field. For example, a page's title tag can be longer than the 255 characters that the varchar field allows. One of the nice things in Laravel is its string helpers: the Str::limit helper lets me truncate the string to the correct length, thus preventing the error.
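The truncation idea can be sketched in a few lines of Python. This is an illustration only, not Laravel's implementation, and `limit_for_column` is a made-up name. One detail worth checking in the real code: as far as I know, Str::limit appends its default "..." suffix after cutting to the limit, so the result can end up a few characters longer than the limit passed in. The sketch below reserves room for the suffix so the total never exceeds the column size.

```python
def limit_for_column(value, max_len=255, end="..."):
    """Truncate `value` so the result, suffix included, fits in `max_len` characters."""
    if len(value) <= max_len:
        return value  # already short enough, leave untouched
    keep = max(max_len - len(end), 0)  # room left for the original text
    # The final slice guards the edge case where the suffix alone is too long.
    return (value[:keep].rstrip() + end)[:max_len]
```

For example, `limit_for_column("hello world", 8)` gives `"hello..."`, which is exactly 8 characters long.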
Another feature of the spider is to determine the HTTP response codes for the website. Here is a table of the most common results and their meanings.
HTTP Code | Count | Meaning |
---|---|---|
200 | 30592 | OK |
404 | 2167 | Not Found |
403 | 739 | Forbidden |
500 | 76 | Internal Server Error |
503 | 75 | Service Unavailable |
409 | 22 | Conflict |
400 | 20 | Bad Request |
410 | 14 | Gone |
The following HTTP codes also appeared, but only a handful of times each: 429, 502, 401, 402, 530, 423, 415, 406, 521, 301, 300, 526, 426.
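A table like the one above can be produced by tallying the status codes the spider records. Here is a small Python sketch using only the standard library. The function names are hypothetical, and the "Non-standard" label is my own choice for codes such as 521, 526 and 530 that Python's `http.HTTPStatus` does not define.

```python
from collections import Counter
from http import HTTPStatus


def describe_status(code):
    """Return the standard reason phrase, or a fallback for unregistered codes."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        return "Non-standard"


def summarise(codes):
    """Count each status code and return (code, count, meaning) rows,
    most frequent first."""
    counts = Counter(codes)
    return [(code, n, describe_status(code)) for code, n in counts.most_common()]


def print_table(codes):
    """Print the rows in the same pipe-separated layout as the table above."""
    print("HTTP Code | Count | Meaning |")
    print("---|---|---|")
    for code, n, meaning in summarise(codes):
        print(f"{code} | {n} | {meaning} |")
```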
Other errors which I am investigating occur when Guzzle throws an exception, or when the spider itself fails afterwards:
- request exception
- connection exception
- Undefined variable $response {"exception":"[object] (ErrorException(code: 0): Undefined variable $response ...
More information about these errors is available in the Guzzle Docs.
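I have not tracked down the cause yet, but one common way to get an "undefined variable" error is to assign the response inside a try block and then use it after an exception has skipped that assignment. Whether that is what happens in this spider is an assumption. The Python sketch below shows the general pattern of setting a default before the try; it uses urllib in place of Guzzle, and `check_url` and `default_fetch` are names invented for this example.

```python
import urllib.error
import urllib.request


def default_fetch(url, timeout=10):
    """Fetch the URL and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_url(url, fetch=default_fetch):
    """Return a result record for `url` that is always fully populated."""
    status = None  # set before the try so it exists even if the request fails
    error = None
    try:
        status = fetch(url)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        status = exc.code
    except (urllib.error.URLError, TimeoutError, ConnectionError) as exc:
        # No usable answer at all: DNS failure, refused connection, timeout.
        error = f"{type(exc).__name__}: {exc}"
    return {"url": url, "status": status, "error": error}
```

Because `status` and `error` always exist, whatever stores the result can write a row for every link, including the ones that never returned a response.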
Once this process is finished, the website will be updated with the new database tables and I also plan to update the site with a few more features based on this new spider table.
Enjoy..
If you would like to contact me, you can use the form on londinium.com or ilminster.net, or reach me via Twitter @andylondon.