Reddit hosts discussions on a wide range of topics. Its posts cut across diverse subjects ranging from technology, history, and health to business, lifestyle, and much more, and its many subreddits keep adding to that body of knowledge. Beyond the original posts, there are also comments, which often complement or expand on the posts themselves.
You may need Reddit content for any number of reasons, which means extracting data from the site. Doing that by hand is tiresome and time-consuming, so you need a reliable web scraper. Luckily, there are several options to choose from, ranging from free tools to proxy-based scrapers.
Below, we describe the best approach to scraping Reddit data and highlight the best tool for the job. Read on to learn more.
Scraping Reddit Using the Python Reddit API Wrapper (PRAW)
Install PRAW. You can install it using conda or pip.
conda install -c conda-forge praw
pip install praw
Import PRAW with the statement import praw.
Set up a Reddit instance using your client_id, client_secret, and user_agent. This step authenticates you and must be done before any scraping.
reddit = praw.Reddit(client_id='my_client_id', client_secret='my_client_secret', user_agent='my_user_agent')
Next, create a Reddit app, which gives you the authentication information above.
Open the app preferences page on Reddit and click "create app" (or "create another app" if you already have one). A form will appear prompting you for a name, description, and redirect URL. The PRAW documentation gives a complete guide to this process; for the redirect URL, http://localhost:8080 works.
Click the "create app" button to register the new app. Its details page shows the client_id and client_secret you need to set up the praw.Reddit instance.
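As an alternative to passing credentials in code, PRAW can also read them from a praw.ini configuration file, which praw.Reddit() picks up automatically. A minimal sketch, with placeholder values standing in for your real app credentials:

```ini
[DEFAULT]
client_id=my_client_id
client_secret=my_client_secret
user_agent=my_user_agent
```

With this file in place, reddit = praw.Reddit() works without any arguments; see the PRAW configuration documentation for where the file may live and how named sites work.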
With the Reddit instance created, we can move on to getting subreddit data. The praw.Reddit instance grants access to all the features needed for scraping.
For instance, you can use the code below to scrape the ten hottest posts from the real estate investing subreddit.
# get ten hot posts from the real estate investing subreddit
hot_posts = reddit.subreddit('realestateinvesting').hot(limit=10)
for post in hot_posts:
    print(post.title)
Its output should look like the one below.
[D] What is the best ML paper you read in 2018 and why? [D] real estate investment - WAYR (What Are You Reading) - Week 34 [R] A Geometric Theory of Higher-Order Automatic Differentiation UC Berkeley and Berkeley AI Research published all materials of CS 188: Introduction to Real Estate Investment Strategies, Fall 2018 [Research] Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks ...
To get the ten hottest posts across all subreddits, pass 'all' as the subreddit name. See below.
# get ten hot posts from all subreddits
hot_posts = reddit.subreddit('all').hot(limit=10)
for post in hot_posts:
    print(post.title)
Its output should be as follows.
I've been lying to my wife about film plots for years.
I don't care if this gets downvoted into oblivion! I DID IT, REDDIT!!
I've had enough of your shit, Karen
Stranger Things 3: Coming July 4th, 2019 ...
You can iterate over these posts to extract each post's id, title, URL, and other fields, then save the result as a .csv file with pandas.
import pandas as pd

posts = []
ml_subreddit = reddit.subreddit('realestateinvesting')
for post in ml_subreddit.hot(limit=10):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
posts = pd.DataFrame(posts, columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
posts.to_csv('posts.csv', index=False)
print(posts)
You can get an overview of a subreddit from the .description attribute of the subreddit object.
# get the real estate investing subreddit description
ml_subreddit = reddit.subreddit('realestateinvesting')
print(ml_subreddit.description)
Its output should be as follows.
[Rules For Posts](https://www.reddit.com/r/Real Estate Investment/about/rules/) -------- +[Research](https://www.reddit.com/r/Real Estate Investment/search?sort=new&restrict_sr=on&q=flair%3AResearch) -------- +[Discussion](https://www.reddit.com/r/Real Estate Investment/search?sort=new&restrict_sr=on&q=flair%3ADiscussion) -------- +[Project](https://www.reddit.com/r/Real Estate Investment/search?sort=new&restrict_sr=on&q=flair%3AProject) -------- +[News](https://www.reddit.com/r/Real Estate Investment/search?sort=new&restrict_sr=on&q=flair%3ANews) -------- ...
Can You Scrape Content From Reddit Post Comments?
Yes, you can scrape data from Reddit comments in Reddit posts. To do so, use the procedure below.
Create a Submission object, then loop through its comments attribute.
Use reddit.submission() to access a post, passing either the submission URL or its id. Alternatively, you can iterate over a subreddit's submissions. See below.
submission = reddit.submission(url="https://www.reddit.com/r/MapPorn/comments/a3p0uq/an_image_of_gps_tracking_of_multiple_wolves_in/")
# or
submission = reddit.submission(id="a3p0uq")
Iterate over submission.comments to extract only the top-level comments, as shown below.
for top_level_comment in submission.comments:
    print(top_level_comment.body)
However, this method only works for posts or submissions with few comments. Those with more comments may raise the AttributeError shown below.
AttributeError: 'MoreComments' object has no attribute 'body'
One fix is to check the type of each comment and skip the MoreComments objects.
from praw.models import MoreComments

for top_level_comment in submission.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    print(top_level_comment.body)
Alternatively, you can remove the MoreComments objects by calling the replace_more method, which PRAW provides for exactly this purpose. Its limit parameter, when set to zero, removes all MoreComments objects.
submission.comments.replace_more(limit=0)
for top_level_comment in submission.comments:
    print(top_level_comment.body)
After replacing the MoreComments objects and printing each comment body, the output should look like the following.
(https://www.healthline.com/ fit bottomed girls/) For those who get frustrated with the status quo and ideas of what we "should" be, Fit Bottomed Girls offers a refreshing change of pace. The founders, both certified fitness pros, preach confidence and body positivity. They take a thoughtful approach to fitness instead of quick, lose-fat-in-10-days results. Their roadmap to a healthier life combines nutrition-packed recipes, doable daily workouts, and a good dose of meditation.
Use the CommentForest to extract comments nested under the top-level comments. It provides the .list() method, which returns all comments in the thread, including replies to other comments.
submission.comments.replace_more(limit=0)
for comment in submission.comments.list():
    print(comment.body)
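To see what flattening a comment tree means without touching the Reddit API, here is a stand-in tree built from plain dictionaries and a small recursive flatten. The structure and names are illustrative only, not PRAW's internals, and the traversal here is depth-first for simplicity; PRAW's own .list() ordering may differ.

```python
# Illustrative comment tree: each node has a body and a list of nested replies.
tree = [
    {"body": "top comment A", "replies": [
        {"body": "reply to A", "replies": []},
    ]},
    {"body": "top comment B", "replies": [
        {"body": "reply to B", "replies": [
            {"body": "nested reply", "replies": []},
        ]},
    ]},
]

def flatten(comments):
    """Yield every comment body, including replies nested at any depth."""
    for comment in comments:
        yield comment["body"]
        yield from flatten(comment["replies"])

print(list(flatten(tree)))
# → ['top comment A', 'reply to A', 'top comment B', 'reply to B', 'nested reply']
```

The point is that a single flat loop over the flattened sequence reaches every comment, which is exactly why .list() is convenient for scraping whole threads.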
Does Reddit Authorize Scraping Reddit Data?
No, Reddit does not generally allow scraping. The website deploys anti-scraping measures that detect automated activity and block it effectively, so generic web crawlers and scrapers have little chance of harvesting its content.
However, Reddit offers an official API, which tools such as the Python Reddit API Wrapper (PRAW) use to extract data from posts and comments legitimately. You do need to authenticate through your Reddit account credentials to use the API.
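For context on what that authentication involves: under the hood, a client exchanges its client_id and client_secret for an OAuth2 token at Reddit's token endpoint before making API calls. A minimal sketch of assembling such a request with the standard library; the credentials are placeholders, and the request is built but deliberately not sent.

```python
import base64
import urllib.parse
import urllib.request

CLIENT_ID = "my_client_id"          # placeholder app credential
CLIENT_SECRET = "my_client_secret"  # placeholder app credential

# Reddit's token endpoint uses HTTP Basic auth with the app credentials.
auth = base64.b64encode(f"{CLIENT_ID}:{CLIENT_SECRET}".encode()).decode()

request = urllib.request.Request(
    "https://www.reddit.com/api/v1/access_token",
    data=urllib.parse.urlencode({"grant_type": "client_credentials"}).encode(),
    headers={
        "Authorization": f"Basic {auth}",
        # Reddit requires a descriptive User-Agent on every request.
        "User-Agent": "my_user_agent",
    },
)

# Not sent here; urllib.request.urlopen(request) would return a JSON access token.
print(request.get_header("Authorization"))
```

PRAW performs this exchange (and token refreshing) for you, which is the main reason to use it rather than raw HTTP calls.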
What Is a Reddit Scraper?
A Reddit scraper is a tool for extracting data from Reddit posts and their top comments, giving you access to popular and resourceful subreddits. You can also use it to collect publicly available user data without logging in. Such tools can run locally on your device or inside other applications.
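Whatever scraper is used, the extracted records typically end up in a flat file such as a CSV. A minimal standard-library sketch of that final step; the field names and sample rows are illustrative, and an in-memory buffer stands in for a real file opened with open("posts.csv", "w", newline=""):

```python
import csv
import io

# Illustrative records a Reddit scraper might return: id, title, url.
posts = [
    {"id": "a3p0uq", "title": "GPS tracking of multiple wolves", "url": "https://redd.it/a3p0uq"},
    {"id": "b4x9zz", "title": "Another example post", "url": "https://redd.it/b4x9zz"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "title", "url"])
writer.writeheader()   # header row: id,title,url
writer.writerows(posts)

print(buffer.getvalue())
```

The same pattern works with pandas' to_csv, as shown earlier; the standard library version just avoids the extra dependency.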