Access to Non-RSS content
One of the features that I wanted to provide was access to websites that do not provide RSS feeds. So today I have been experimenting with a custom demo/module that takes any website, reads its HTML into a string, and then pulls out only the content which the user is interested in.
To do this, the system needs to carry out several steps:
- Pull the content (HTML code) from a website into a string or buffer
- The user must identify two things on the website that don’t normally change (e.g. text like Home or News, or an HTML tag like the start/end of a table etc.)
- The system will then search for the first occurrence of that text/HTML and then remove anything before it
- Then it will look for the last occurrence of the second thing which doesn’t normally change and delete everything after it
- The final step is to remove all the HTML tags (except for hyperlinks and line breaks, <br>) so that only the text is left
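
The steps above can be sketched roughly as follows. This is only an illustration in Python, not the code from the demo itself. The function names (fetch_html, trim_between, strip_tags) are hypothetical, and the page is assumed to be UTF-8. The tag stripping uses a simple regular expression, which is enough for a sketch but will not cope with every kind of HTML (comments or scripts, for example).

```python
import re
from urllib.request import urlopen


def fetch_html(url):
    """Read a page's HTML into a string (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def trim_between(html, start_marker, end_marker):
    """Keep the text from the first start_marker to the last end_marker."""
    start = html.find(start_marker)      # first occurrence of the start marker
    if start == -1:
        return ""                        # start marker missing: nothing to show
    end = html.rfind(end_marker)         # last occurrence of the end marker
    if end == -1 or end < start:
        return html[start:]              # no usable end marker: keep the rest
    return html[start:end + len(end_marker)]


# Matches an opening or closing tag and captures the tag name.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def strip_tags(html, keep=("a", "br")):
    """Remove every tag except the ones named in keep (hyperlinks and <br>)."""
    def replace(match):
        return match.group(0) if match.group(2).lower() in keep else ""
    return TAG.sub(replace, html)
```

For example, strip_tags(trim_between(fetch_html(url), "News", "<table>")) would keep everything from the first "News" up to the last "<table>", with only the links and line breaks left as markup. The markers here are made up; in practice they are whatever the user picks for that site.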
After a few hours of coding I managed to get it working using various commands such as strpos, substr and strripos. The only problem I encountered was with relative hyperlinks: they should have pointed to a file on the original website, but instead they pointed to my own test server (my Mac mini), which gave a 404 File not found message. After a bit of hunting online, I found a command on a coding discussion forum that converts a relative link into an absolute link, which proved very useful. An example can be seen in the following screenshot: