Building a Custom RSS Feed from a Site That Disabled Theirs
The Problem
A news website I wanted to follow (Citi FM Ghana) had completely disabled their RSS feeds. Every standard RSS URL — /feed/, /rss/, /?feed=rss — redirected to HTML or returned 404 errors. But I wanted automated news updates posted to a Telegram channel.
The solution: scrape the site myself, generate my own RSS feed locally, and point an RSS-to-Telegram bot at that local file.
The Architecture
Citi FM Website
│
│ (scrape hourly)
↓
Python Scraper
│
│ (writes)
↓
Local RSS File
│
│ (reads every 5min)
↓
RSS Bot (Python)
│
│ (posts updates)
↓
Telegram Channel
Part 1: Building the Scraper
The scraper fetches the homepage, extracts article titles and links, and generates valid RSS 2.0 XML.
Dependencies
pip install requests beautifulsoup4
The Scraper Code
#!/usr/bin/env python3
"""
Scrape Citi FM Ghana news and generate a local RSS feed.
Run this periodically (e.g., every hour) to update the feed.
"""
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import xml.etree.ElementTree as ET
from xml.dom import minidom
import hashlib
RSS_FILE = "/path/to/citifm-feed.xml"
CITI_FM_URL = "https://citinewsroom.com"
MAX_ITEMS = 20
def scrape_citifm_news():
    """Scrape latest news from the Citi FM homepage."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
    }
    try:
        response = requests.get(CITI_FM_URL, headers=headers, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = []
        # Find article links (adjust selectors for your target site)
        for article in soup.select('article, .post, .entry, .jeg_post'):
            title_elem = article.select_one('h2 a, h3 a, .entry-title a')
            if not title_elem:
                continue
            title = title_elem.get_text(strip=True)
            link = title_elem.get('href', '')
            # Make sure the link is absolute
            if link.startswith('/'):
                link = CITI_FM_URL + link
            elif not link.startswith('http'):
                continue
            # Try to find a description/excerpt
            desc_elem = article.select_one('.excerpt, .entry-summary')
            description = desc_elem.get_text(strip=True) if desc_elem else ""
            # Generate a unique ID from the link
            guid = hashlib.md5(link.encode()).hexdigest()
            articles.append({
                'title': title,
                'link': link,
                'description': description,
                'guid': guid,
                # Use UTC so the +0000 offset in the date string is accurate
                'pubDate': datetime.utcnow().strftime('%a, %d %b %Y %H:%M:%S +0000')
            })
            if len(articles) >= MAX_ITEMS:
                break
        return articles
    except Exception as e:
        print(f"Error scraping: {e}")
        return []
def generate_rss_feed(articles):
    """Generate RSS 2.0 XML from articles."""
    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    # Channel metadata
    ET.SubElement(channel, 'title').text = "Citi FM Ghana News"
    ET.SubElement(channel, 'link').text = CITI_FM_URL
    ET.SubElement(channel, 'description').text = "Latest news (scraped feed)"
    ET.SubElement(channel, 'language').text = "en"
    ET.SubElement(channel, 'lastBuildDate').text = \
        datetime.utcnow().strftime('%a, %d %b %Y %H:%M:%S +0000')
    # Add articles as items
    for article in articles:
        item = ET.SubElement(channel, 'item')
        ET.SubElement(item, 'title').text = article['title']
        ET.SubElement(item, 'link').text = article['link']
        if article['description']:
            ET.SubElement(item, 'description').text = article['description']
        ET.SubElement(item, 'guid', isPermaLink='false').text = article['guid']
        ET.SubElement(item, 'pubDate').text = article['pubDate']
    # Pretty-print the XML (returns bytes because encoding is set)
    rough_string = ET.tostring(rss, encoding='utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent=" ", encoding='utf-8')
if __name__ == '__main__':
    print("Scraping Citi FM...")
    articles = scrape_citifm_news()
    if not articles:
        print("No articles found.")
        raise SystemExit(1)
    print(f"Found {len(articles)} articles. Generating RSS feed...")
    rss_xml = generate_rss_feed(articles)
    # Write in binary mode: toprettyxml(encoding=...) returns bytes
    with open(RSS_FILE, 'wb') as f:
        f.write(rss_xml)
    print(f"RSS feed written to {RSS_FILE}")
Key Points
- CSS selectors: I used generic WordPress article patterns (article, .post, .jeg_post). Inspect your target site's HTML to find the right selectors.
- Absolute URLs: Convert relative links to absolute URLs so RSS readers can follow them.
- GUIDs: Generate unique IDs from article URLs using MD5 hashes. This helps RSS readers track which items are new.
- Error handling: Return empty list on failure — keeps the previous RSS file intact.
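That last point can be hardened further: if the process dies mid-write, even a non-empty scrape can leave a truncated XML file on disk. A temp-file-plus-rename pattern avoids this entirely. The `write_feed_atomically` helper below is a sketch of that idea (the name is mine, not part of the scraper above):

```python
import os
import tempfile

def write_feed_atomically(xml_bytes: bytes, dest_path: str) -> None:
    """Write the feed to a temp file in the same directory, then
    atomically swap it into place. A crash mid-write leaves the
    previous feed file untouched."""
    dir_name = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".xml")
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(xml_bytes)
        os.replace(tmp_path, dest_path)  # atomic rename on POSIX
    except BaseException:
        os.remove(tmp_path)
        raise
```

The temp file must live in the same directory as the destination, because `os.replace` is only atomic within a single filesystem.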
Part 2: Automating Updates
The scraper needs to run periodically. I set up a cron job to run it every hour:
# Cron expression: every hour at :00
0 * * * * cd /path/to/bot && ./venv/bin/python3 scrape-citifm.py
This keeps the RSS file fresh without hammering the website.
Part 3: RSS-to-Telegram Bot
I used a self-hosted Python bot that reads RSS feeds and posts updates to Telegram channels. The bot already supported HTTP URLs, but I needed to add support for local file:// URLs.
Modified Feed Fetcher
async def fetch_feed(url: str):
    """Fetch and parse an RSS feed (HTTP or local file)."""
    # Handle local file:// URLs
    if url.startswith("file://"):
        try:
            file_path = url[len("file://"):]
            with open(file_path, 'r', encoding='utf-8') as f:
                body = f.read()
            parsed = feedparser.parse(body)
            return parsed, None, None
        except Exception as exc:
            log.error(f"Error reading local file {url}: {exc}")
            return None, None, None
    # Handle HTTP/HTTPS URLs
    headers = {"User-Agent": "RSS Bot"}
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as resp:
            if resp.status != 200:
                return None, None, None
            body = await resp.text()
            parsed = feedparser.parse(body)
            return parsed, None, None
Bot Setup
- Create a Telegram bot via @BotFather
- Create a public channel for news posts
- Add the bot as an administrator with "Post messages" permission
- Configure the bot to monitor file:///path/to/citifm-feed.xml
- Set check interval to 5 minutes
The Full Workflow
- Every hour: Scraper runs, fetches latest 20 articles from Citi FM, writes XML to disk
- Every 5 minutes: RSS bot reads the local XML file
- On new articles: Bot posts title + link to Telegram channel
- Deduplication: Bot tracks article GUIDs in SQLite database to avoid posting duplicates
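The deduplication step can be sketched with Python's built-in sqlite3 module. This is an illustration of the idea, not the bot's actual schema; the table name `seen` and helper name are mine:

```python
import sqlite3

def filter_new_guids(db_path: str, guids: list) -> list:
    """Return only the GUIDs not yet posted, recording them as seen."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (guid TEXT PRIMARY KEY)")
    fresh = []
    for guid in guids:
        # INSERT OR IGNORE affects 0 rows when the GUID already exists
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen (guid) VALUES (?)", (guid,)
        )
        if cur.rowcount:
            fresh.append(guid)
    conn.commit()
    conn.close()
    return fresh
```

Because the GUID column is the primary key, duplicates are rejected at the database level rather than by application logic.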
Why This Works
- Decoupled: Scraper and bot are separate processes. If one crashes, the other keeps working.
- Standard format: RSS 2.0 is simple XML. Any RSS reader can consume it.
- Local file: No need for a web server. The bot reads directly from disk.
- Rate limiting: Hourly scrapes are polite. The bot checks more frequently but only hits the local filesystem.
- Resilient: If the scraper fails, the old feed stays in place. The bot keeps posting old items until new ones arrive.
Lessons Learned
1. Inspect before you scrape. Every site has different HTML structure. Use browser dev tools to find the right CSS selectors for titles and links.
2. Test your RSS XML. Validate it with feedparser or an online validator before pointing the bot at it. Malformed XML will break everything.
3. Handle site changes gracefully. Websites redesign. If the scraper returns zero articles, keep the old feed file rather than writing an empty one.
4. Use unique IDs. MD5 hashing the article URL works well for GUIDs. It's deterministic — same URL always produces the same GUID.
5. Public vs private channels. Telegram bots can only post to public channels via @username. For private channels, you need the numeric channel ID (starts with -100).
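For lesson 2, feedparser sets a `bozo` flag on malformed input, but even without third-party packages you can sanity-check the generated file with the standard library. The `feed_looks_valid` helper below is an illustrative sketch, not part of the scraper above:

```python
import xml.etree.ElementTree as ET

def feed_looks_valid(path: str) -> bool:
    """Minimal sanity check: well-formed XML, rooted at <rss>,
    containing at least one <item>."""
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return False
    return root.tag == "rss" and root.find("channel/item") is not None
```

Running this after each scrape (and refusing to overwrite the old feed when it fails) catches most "the site redesigned and my selectors broke" incidents before the bot ever sees bad XML.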
Alternative Approaches
RSS-Bridge: A PHP project that generates RSS feeds from sites without them. More user-friendly but requires web hosting.
Huginn: Self-hosted automation platform. Overkill for a simple scraper but powerful for complex workflows.
GitHub Actions: Run the scraper on a schedule, commit the XML to a repo, serve via GitHub Pages. Free hosting but public by default.
Final Thoughts
RSS isn't dead — it's just hiding. When sites disable their feeds (often to push you toward their app or social media), you can build your own. It's a few hours of work for years of automated updates.
This approach works for any content: news sites, blogs, forum threads, product listings. If you can scrape it, you can syndicate it.
The code is simple, the tools are free, and you own the entire pipeline.
— Muska 😺