To build a good forum base for Xrumer or similar software, it is enough to find one topic where someone posts their own topics (advertisements) and links to the same topics on other forums to reinforce them.
With this script, you can collect that database of forums.
Requirements:
Install the necessary libraries using:
pip install requests beautifulsoup4
Script:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin


def is_external(url, base_url):
    # A link is external if its host differs from the page it was found on.
    return urlparse(url).netloc != urlparse(base_url).netloc


def get_links(url):
    # Fetch a page and return every href found in its <a> tags.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/58.0.3029.110 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        return [a.get('href') for a in soup.find_all('a', href=True)]
    except requests.RequestException as e:
        print(f"Failed to fetch {url}: {e}")
        return []


def scrape_forums(starting_urls, max_depth=2):
    visited = set()
    external_links = set()

    def scrape(url, depth):
        if url in visited or depth > max_depth:
            return
        print(f"Scraping {url} at depth {depth}")
        visited.add(url)
        for link in get_links(url):
            # Resolve relative links against the current page.
            full_url = urljoin(url, link)
            if is_external(full_url, url):
                external_links.add(full_url)
            else:
                scrape(full_url, depth + 1)

    for url in starting_urls:
        scrape(url, 1)
    return external_links


def save_links_to_file(links, filename):
    # Write the collected links to a text file, one per line.
    with open(filename, 'w') as f:
        for link in links:
            f.write(link + '\n')


if __name__ == '__main__':
    starting_urls = [
        # Add your starting forum URLs here
        'http://example-forum.com/topic1',
        'http://example-forum.com/topic2',
    ]
    filename = 'external_links.txt'
    external_links = scrape_forums(starting_urls)
    save_links_to_file(external_links, filename)
    print(f"Collected {len(external_links)} external links. Saved to {filename}.")
How the script works:
- Function get_links: sends a request to the given URL, parses the HTML, and collects all the links on the page.
- Function is_external: checks whether a link points to a different host than the page it was found on.
- Function scrape_forums: recursively scrapes forums starting from the given URLs, up to max_depth, and collects all external links.
- Function save_links_to_file: saves all collected external links to a text file, one per line.
- Main part of the script: sets the initial forum URLs, starts the scraping process, and saves the collected links to a file.
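To illustrate the internal/external split: is_external only compares hostnames, which is why each link is first resolved with urljoin. A minimal check (the forum and site URLs here are just placeholders):

```python
from urllib.parse import urlparse, urljoin

def is_external(url, base_url):
    # Same logic as in the script: compare the two hosts.
    return urlparse(url).netloc != urlparse(base_url).netloc

base = 'http://example-forum.com/topic1'
# A relative link resolves to the same host, so it is internal.
print(is_external(urljoin(base, '/topic2'), base))                  # False
# An absolute link to another host is external.
print(is_external(urljoin(base, 'http://other-site.com/x'), base))  # True
```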
Instructions for use:
- Insert the initial forum URLs into the starting_urls list.
- Run the script:
python script_name.py
- The collected links will be saved to the external_links.txt file.
This script can be improved and adapted to specific needs, such as more complex parsing rules or error handling.
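One possible improvement (a sketch, not part of the script above; dedupe_by_domain and the sample URLs are hypothetical): collapse the collected links to one entry per domain before saving, since for base building each forum is usually only needed once.

```python
from urllib.parse import urlparse

def dedupe_by_domain(links):
    # Keep only the first (alphabetically) link per host,
    # so each site appears once in the final list.
    per_domain = {}
    for link in sorted(links):
        domain = urlparse(link).netloc
        if domain and domain not in per_domain:
            per_domain[domain] = link
    return list(per_domain.values())

sample = {
    'http://site-a.com/thread/1',
    'http://site-a.com/thread/2',
    'http://site-b.com/board',
}
print(dedupe_by_domain(sample))
# ['http://site-a.com/thread/1', 'http://site-b.com/board']
```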