Access to Non-RSS content

One of the features I wanted to provide was access to websites that do not provide RSS feeds. So today I have been experimenting with a custom demo/module that reads the HTML of any website into a string and then pulls out only the content the user is interested in.

To do this, the system needs to perform several steps:

  1. Pull the content (HTML code) from a website into a string or buffer
  2. The user must identify two things on the website that don’t normally change (e.g. text like Home or News, or an HTML tag like the start/end of a table etc.)
  3. The system then searches for the first occurrence of the first piece of text/HTML and removes anything before it
  4. It then looks for the second one (the last part which doesn’t normally change) and deletes everything after it
  5. Finally, it removes all the HTML tags (except for hyperlinks and line breaks <br>) so that only the text is left
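
The steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than the code from my module: the function names (fetch_html, trim_between, strip_tags_except_links) are made up for the example, str.find and str.rfind stand in for strpos and strripos (unlike strripos, rfind is case-sensitive), and keeping the two markers themselves in the result is an assumption.

```python
import re
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Step 1: read the page's HTML into a string."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def trim_between(html: str, start_marker: str, end_marker: str) -> str:
    """Steps 2-4: keep only the text from the first start_marker
    to the last end_marker (both markers are kept, by assumption)."""
    start = html.find(start_marker)      # first occurrence, like strpos
    if start == -1:
        start = 0                        # marker missing: keep from the top
    end = html.rfind(end_marker, start)  # last occurrence at or after start
    if end == -1:
        end = len(html)                  # marker missing: keep to the end
    else:
        end += len(end_marker)
    return html[start:end]


# Matches any tag except <a ...>, </a> and <br> / <br/>.
TAG_PATTERN = re.compile(r"<(?!/?(?:a|br)\b)[^>]*>", re.IGNORECASE)


def strip_tags_except_links(html: str) -> str:
    """Step 5: remove every tag except hyperlinks and line breaks."""
    return TAG_PATTERN.sub("", html)


if __name__ == "__main__":
    page = ('<html><body><div>Home</div><table><tr><td><b>Story</b> '
            '<a href="/news/1.html">more</a><br></td></tr></table>'
            '<p>footer</p></body></html>')
    section = trim_between(page, "<table>", "</table>")
    print(strip_tags_except_links(section))
```

A regular expression is a blunt tool for HTML: it removes the tags but leaves the contents of script and style blocks behind as text, so a real page may need those cut out first.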

After a few hours of coding I managed to get it working using functions such as strpos, substr and strripos. The only problem I encountered was with relative hyperlinks: they should have pointed to files on the original website, but instead they resolved against my own test server (my Mac mini), which returned a 404 File Not Found message. After a bit of hunting online, I found a snippet on a coding discussion forum that converts a relative link into an absolute link, which proved very useful. An example can be seen in the following screenshot:

non-rss-to-rss.jpg
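
The relative-link fix can be illustrated along these lines. This is not the forum snippet I used, just a small Python sketch of the same idea: absolutize_links is a hypothetical name, the example.com URLs are placeholders, and it assumes every href value is quoted. The standard library's urljoin does the actual resolving.

```python
import re
from urllib.parse import urljoin

# Captures: the 'href=' prefix, the opening quote, and the link itself.
HREF_PATTERN = re.compile(r'(href\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def absolutize_links(html: str, base_url: str) -> str:
    """Rewrite each quoted href so it is resolved against base_url
    (the address the page was fetched from)."""
    def replace(match: re.Match) -> str:
        prefix, quote, link = match.groups()
        return f"{prefix}{quote}{urljoin(base_url, link)}{quote}"

    return HREF_PATTERN.sub(replace, html)


if __name__ == "__main__":
    base = "http://example.com/section/index.html"
    snippet = '<a href="/news/1.html">more</a> <a href="story.html">story</a>'
    print(absolutize_links(snippet, base))
```

Links that are already absolute pass through unchanged, so the function can be run over the whole extracted section without checking each link first.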