I recently needed a link checker to create a CSV-formatted list of all links (especially hosted PDFs) on a client's site.
There is a tool called webcheck by Arthur de Jong which does a great job of checking all of the links on a website and creating a pretty HTML report.
This got me most of the way there: the output includes a page listing every URL encountered during the crawl, which was exactly what I wanted, except that it was formatted as HTML.
I wrote a small script which uses webcheck’s own code to read in its stored .dat file and write all of the links to a CSV file with the format:
path | extension | internal | errors
Where path is the URL, extension is the URL ending (for example .pdf, .html, ..), internal is a boolean (True or False) indicating whether the link is internal, and errors is any error (for example a 404, ..) recorded for that link.
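Here is a minimal sketch of what such a script can look like. It assumes the .dat file is a pickle of webcheck's site object and that each link exposes url, isinternal and linkproblems attributes; those names are my assumptions and may differ between webcheck versions, so check webcheck's own crawler module for the real ones before relying on this.

# export_links.py -- a sketch, not webcheck's documented API.
# ASSUMPTION: webcheck.dat is a pickled site object whose links carry
# url, isinternal and linkproblems; adjust names to your version.
import csv
import os.path
import pickle
from urllib.parse import urlparse

def link_rows(site):
    """Yield one (path, extension, internal, errors) row per link."""
    for link in site.linkMap.values():  # assumed: dict of url -> link object
        path = link.url
        # The extension is whatever follows the last dot in the URL path.
        extension = os.path.splitext(urlparse(path).path)[1]
        internal = bool(getattr(link, 'isinternal', False))
        # Collapse any recorded problems (e.g. "404 Not Found") into one cell.
        errors = '; '.join(str(p) for p in getattr(link, 'linkproblems', []))
        yield path, extension, internal, errors

def write_csv(datfile='webcheck.dat', csvfile='links.csv'):
    with open(datfile, 'rb') as f:
        site = pickle.load(f)
    with open(csvfile, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['path', 'extension', 'internal', 'errors'])
        writer.writerows(link_rows(site))

if __name__ == '__main__':
    write_csv()

Run it from the directory containing webcheck.dat and open links.csv in a spreadsheet; filtering the extension column to .pdf gives the list of hosted PDFs.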