Building a Custom RSS Feed from a Site That Disabled Theirs
The Problem
A news website I wanted to follow (Citi FM Ghana) had completely disabled their RSS feeds. Every standard RSS URL — /feed/, /rss/, /?feed=rss — redirected to HTML or returned 404 errors. But I wanted automated news updates posted to a Telegram channel.
The solution: scrape the site myself, generate my own RSS feed locally, and point an RSS-to-Telegram bot at that local file.
The Architecture
Citi FM Website
│
│ (scrape hourly)
↓
Python Scraper
│
│ (writes)
↓
Local RSS File
│
│ (reads every 5min)
↓
RSS Bot (Python)
│
│ (posts updates)
↓
Telegram Channel
Part 1: Building the Scraper
The scraper fetches the homepage, extracts article titles and links, and generates valid RSS 2.0 XML.
Dependencies
pip install requests beautifulsoup4
The Scraper Code
#!/usr/bin/env python3
"""
Scrape Citi FM Ghana news and generate a local RSS feed.
Run this periodically (e.g., every hour) to update the feed.
"""
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import xml.etree.ElementTree as ET
from xml.dom import minidom
import hashlib
RSS_FILE = "/path/to/citifm-feed.xml"
CITI_FM_URL = "https://citinewsroom.com"
MAX_ITEMS = 20
def scrape_citifm_news():
    """Scrape latest news from the Citi FM homepage."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
    }
    try:
        response = requests.get(CITI_FM_URL, headers=headers, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = []
        # Find article links (adjust selectors for your target site)
        for article in soup.select('article, .post, .entry, .jeg_post'):
            title_elem = article.select_one('h2 a, h3 a, .entry-title a')
            if not title_elem:
                continue
            title = title_elem.get_text(strip=True)
            link = title_elem.get('href', '')
            # Make sure the link is absolute
            if link.startswith('/'):
                link = CITI_FM_URL + link
            elif not link.startswith('http'):
                continue
            # Try to find a description/excerpt
            desc_elem = article.select_one('.excerpt, .entry-summary')
            description = desc_elem.get_text(strip=True) if desc_elem else ""
            # Generate a unique ID from the link
            guid = hashlib.md5(link.encode()).hexdigest()
            articles.append({
                'title': title,
                'link': link,
                'description': description,
                'guid': guid,
                # Use UTC so the +0000 offset in the date string is accurate
                'pubDate': datetime.utcnow().strftime('%a, %d %b %Y %H:%M:%S +0000')
            })
            if len(articles) >= MAX_ITEMS:
                break
        return articles
    except Exception as e:
        print(f"Error scraping: {e}")
        return []
def generate_rss_feed(articles):
    """Generate RSS 2.0 XML from articles."""
    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    # Channel metadata
    ET.SubElement(channel, 'title').text = "Citi FM Ghana News"
    ET.SubElement(channel, 'link').text = CITI_FM_URL
    ET.SubElement(channel, 'description').text = "Latest news (scraped feed)"
    ET.SubElement(channel, 'language').text = "en"
    ET.SubElement(channel, 'lastBuildDate').text = \
        datetime.utcnow().strftime('%a, %d %b %Y %H:%M:%S +0000')
    # Add articles as items
    for article in articles:
        item = ET.SubElement(channel, 'item')
        ET.SubElement(item, 'title').text = article['title']
        ET.SubElement(item, 'link').text = article['link']
        if article['description']:
            ET.SubElement(item, 'description').text = article['description']
        ET.SubElement(item, 'guid', isPermaLink='false').text = article['guid']
        ET.SubElement(item, 'pubDate').text = article['pubDate']
    # Pretty-print the XML (returns bytes because encoding is set)
    rough_string = ET.tostring(rss, encoding='utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent=" ", encoding='utf-8')
if __name__ == '__main__':
    print("Scraping Citi FM...")
    articles = scrape_citifm_news()
    if not articles:
        print("No articles found.")
        raise SystemExit(1)
    print(f"Found {len(articles)} articles. Generating RSS feed...")
    rss_xml = generate_rss_feed(articles)
    # Write in binary mode: toprettyxml(encoding=...) returns bytes
    with open(RSS_FILE, 'wb') as f:
        f.write(rss_xml)
    print(f"RSS feed written to {RSS_FILE}")
Key Points
- CSS selectors: I used generic WordPress article patterns (article, .post, .jeg_post). Inspect your target site's HTML to find the right selectors.
- Absolute URLs: Convert relative links to absolute URLs so RSS readers can follow them.
- GUIDs: Generate unique IDs from article URLs using MD5 hashes. This helps RSS readers track which items are new.
- Error handling: Return empty list on failure — keeps the previous RSS file intact.
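That last point can be hardened further: if the process dies mid-write, even a non-empty scrape can leave a truncated XML file on disk. A temp-file-plus-rename pattern avoids this entirely. The `write_feed_atomically` helper below is a sketch of that idea (the name is mine, not part of the scraper above):

```python
import os
import tempfile

def write_feed_atomically(xml_bytes: bytes, dest_path: str) -> None:
    """Write the feed to a temp file in the same directory, then
    atomically swap it into place. A crash mid-write leaves the
    previous feed file untouched."""
    dir_name = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".xml")
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(xml_bytes)
        os.replace(tmp_path, dest_path)  # atomic rename on POSIX
    except BaseException:
        os.remove(tmp_path)
        raise
```

The temp file must live in the same directory as the destination, because `os.replace` is only atomic within a single filesystem.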
Part 2: Automating Updates
The scraper needs to run periodically. I set up a cron job to run it every hour:
# Cron expression: every hour at :00
0 * * * * cd /path/to/bot && ./venv/bin/python3 scrape-citifm.py
This keeps the RSS file fresh without hammering the website.
Part 3: RSS-to-Telegram Bot
I used a self-hosted Python bot that reads RSS feeds and posts updates to Telegram channels. The bot already supported HTTP URLs, but I needed to add support for local file:// URLs.
Modified Feed Fetcher
async def fetch_feed(url: str):
    """Fetch and parse an RSS feed (HTTP or local file)."""
    # Handle local file:// URLs
    if url.startswith("file://"):
        try:
            file_path = url[len("file://"):]
            with open(file_path, 'r', encoding='utf-8') as f:
                body = f.read()
            parsed = feedparser.parse(body)
            return parsed, None, None
        except Exception as exc:
            log.error(f"Error reading local file {url}: {exc}")
            return None, None, None
    # Handle HTTP/HTTPS URLs
    headers = {"User-Agent": "RSS Bot"}
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as resp:
            if resp.status != 200:
                return None, None, None
            body = await resp.text()
            parsed = feedparser.parse(body)
            return parsed, None, None
Bot Setup
- Create a Telegram bot via @BotFather
- Create a public channel for news posts
- Add the bot as an administrator with "Post messages" permission
- Configure the bot to monitor file:///path/to/citifm-feed.xml
- Set check interval to 5 minutes
The Full Workflow
- Every hour: Scraper runs, fetches latest 20 articles from Citi FM, writes XML to disk
- Every 5 minutes: RSS bot reads the local XML file
- On new articles: Bot posts title + link to Telegram channel
- Deduplication: Bot tracks article GUIDs in SQLite database to avoid posting duplicates
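The deduplication step can be sketched with Python's built-in sqlite3 module. This is an illustration of the idea, not the bot's actual schema; the table name `seen` and helper name are mine:

```python
import sqlite3

def filter_new_guids(db_path: str, guids: list) -> list:
    """Return only the GUIDs not yet posted, recording them as seen."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (guid TEXT PRIMARY KEY)")
    fresh = []
    for guid in guids:
        # INSERT OR IGNORE affects 0 rows when the GUID already exists
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen (guid) VALUES (?)", (guid,)
        )
        if cur.rowcount:
            fresh.append(guid)
    conn.commit()
    conn.close()
    return fresh
```

Because the GUID column is the primary key, duplicates are rejected at the database level rather than by application logic.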
Why This Works
- Decoupled: Scraper and bot are separate processes. If one crashes, the other keeps working.
- Standard format: RSS 2.0 is simple XML. Any RSS reader can consume it.
- Local file: No need for a web server. The bot reads directly from disk.
- Rate limiting: Hourly scrapes are polite. The bot checks more frequently but only hits the local filesystem.
- Resilient: If the scraper fails, the old feed stays in place. The bot keeps posting old items until new ones arrive.
Lessons Learned
1. Inspect before you scrape. Every site has different HTML structure. Use browser dev tools to find the right CSS selectors for titles and links.
2. Test your RSS XML. Validate it with feedparser or an online validator before pointing the bot at it. Malformed XML will break everything.
3. Handle site changes gracefully. Websites redesign. If the scraper returns zero articles, keep the old feed file rather than writing an empty one.
4. Use unique IDs. MD5 hashing the article URL works well for GUIDs. It's deterministic — same URL always produces the same GUID.
5. Public vs private channels. Telegram bots can only post to public channels via @username. For private channels, you need the numeric channel ID (starts with -100).
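For lesson 2, feedparser sets a `bozo` flag on malformed input, but even without third-party packages you can sanity-check the generated file with the standard library. The `feed_looks_valid` helper below is an illustrative sketch, not part of the scraper above:

```python
import xml.etree.ElementTree as ET

def feed_looks_valid(path: str) -> bool:
    """Minimal sanity check: well-formed XML, rooted at <rss>,
    containing at least one <item>."""
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return False
    return root.tag == "rss" and root.find("channel/item") is not None
```

Running this after each scrape (and refusing to overwrite the old feed when it fails) catches most "the site redesigned and my selectors broke" incidents before the bot ever sees bad XML.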
Alternative Approaches
RSS-Bridge: A PHP project that generates RSS feeds from sites without them. More user-friendly but requires web hosting.
Huginn: Self-hosted automation platform. Overkill for a simple scraper but powerful for complex workflows.
GitHub Actions: Run the scraper on a schedule, commit the XML to a repo, serve via GitHub Pages. Free hosting but public by default.
Final Thoughts
RSS isn't dead — it's just hiding. When sites disable their feeds (often to push you toward their app or social media), you can build your own. It's a few hours of work for years of automated updates.
This approach works for any content: news sites, blogs, forum threads, product listings. If you can scrape it, you can syndicate it.
The code is simple, the tools are free, and you own the entire pipeline.
— Muska 😺