Hacking Reddit with PyCharm


As some of you might know, until last week Reddit was open source. But we can still go to GitHub and check out a version of the source. So let’s go ahead and play around with it!

Things you’ll need to follow along

Reddit uses Vagrant to spin up a pre-configured development VM on demand. To be able to use it you’ll need:

Vagrant

VirtualBox

Connecting to virtual machines for debugging is a feature that is only available in PyCharm Professional Edition. If you haven’t got it yet, you can get a free trial from our website.

Check out Reddit

To start, we need to get the code, so let’s go to Reddit’s GitHub account and clone their repo. Please make sure to check it out into a folder called ‘reddit’; you’ll run into issues later if you don’t.

If you’re using Windows, there’s an important side note here: Reddit uses shell files, config files, and Python files to configure the Vagrant box, and you need to check these out with Linux-style line endings. The easiest way to get this is to clone my fork instead; I’ve added a .gitattributes file to ensure the correct line endings.

At the time of writing, there’s an issue in the reddit codebase that prevents you from running the code. GitHub user ironyman has fixed the problem; however, his pull request hasn’t been accepted. If you checked out the code from reddit, you’ll need to apply this change manually. I’ve already made this change in my fork, so if you’re not sure how to apply the change, simply clone my fork instead.

After checking out the code (and applying the fix if necessary) you can start the vagrant VM. Either run vagrant up on the command line or choose Tools | Vagrant | Up in PyCharm (if you’re asked to choose between default and Travis, choose default). After a couple of minutes of installation, you should see a line similar to:

==> default: reddit:15655 started 03baa2a at 15:40:26 (took 2.22s)

Now you’re probably excited, and want to see your very own Reddit. To do so, we’ll need to first take a quick detour to our hosts file. If you’re on macOS or Linux, this is in /etc/hosts , on Windows it’s in C:\Windows\System32\drivers\etc\hosts . Reddit’s Vagrantfile statically specifies the IP address, and we’ll need to add it here to access Reddit over the reddit.local domain name they use for development:

192.168.56.111 reddit.local
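If the page doesn’t load later on, it can help to confirm that the hosts entry really is in place. The following is a minimal sketch, not part of Reddit or PyCharm: `hosts_maps` is a hypothetical helper that scans hosts-file text for a line mapping a name to an address, and the sample text stands in for your real file.

```python
def hosts_maps(hosts_text, hostname, ip):
    """Return True if hosts_text contains a line mapping hostname to ip."""
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        fields = entry.split()
        # The first field is the address; the rest are names for it.
        if fields[0] == ip and hostname in fields[1:]:
            return True
    return False


# Sample content standing in for /etc/hosts (an assumption for this sketch).
sample = "127.0.0.1 localhost\n192.168.56.111 reddit.local  # dev VM\n"
print(hosts_maps(sample, "reddit.local", "192.168.56.111"))
```

To check the real file, read it with `open(path).read()` and pass the text in place of `sample`.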

And now you can open your browser, and check out your local reddit instance by going to http://reddit.local (the http is important this time!):

Another provisioning script in the Vagrantfile will populate your local reddit setup with test data. So if it’s still empty, just wait a little, and there’ll be some content.

Let’s get hacking!

To make PyCharm able to do anything with the Reddit inside of our VM, we need to configure the Python interpreter. Go to Settings | Project Interpreter, and choose ‘Add Remote’:

Then select ‘Vagrant’, and make sure to select the ‘default’ machine. Reddit also specified another machine (Travis) to make unit testing easier, but we’re not using that machine here:

Furthermore, Reddit is a Python 2 application, and they’ve configured it using the system Python, so we don’t need to select a virtualenv.

If you get a scary looking warning about the remote host identification, don’t worry, this is normal for Vagrant boxes that are regularly recreated.

Reddit is a complex application, and they use overlayfs to merge configuration files on the VM with the code base mounted from the host machine. Therefore even though PyCharm correctly detects that the code is mounted in /media/reddit_code , we will need to manually add a path mapping.

Click the ‘…’ button next to the path mappings in the interpreter settings window and add the project directory (ending in ‘reddit’) on your local machine, and /home/vagrant/src/reddit as the remote path:
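Conceptually, a path mapping is just a prefix substitution between the two file systems. Here is a rough sketch of that idea, not PyCharm’s actual implementation: `to_remote` is a hypothetical helper, and the local checkout path is made up for the example. Only the remote path comes from the setup above.

```python
import posixpath


def to_remote(local_path, local_root, remote_root):
    """Translate a path under local_root to its counterpart under remote_root."""
    # Normalise Windows separators so the same logic works on every host OS.
    local_path = local_path.replace("\\", "/")
    local_root = local_root.replace("\\", "/").rstrip("/")
    if local_path != local_root and not local_path.startswith(local_root + "/"):
        raise ValueError("%s is not inside %s" % (local_path, local_root))
    relative = local_path[len(local_root):].lstrip("/")
    return posixpath.join(remote_root, relative) if relative else remote_root


LOCAL_ROOT = "/Users/me/projects/reddit"   # hypothetical local checkout
REMOTE_ROOT = "/home/vagrant/src/reddit"   # remote path used in this post

print(to_remote(LOCAL_ROOT + "/r2/r2/config/routing.py", LOCAL_ROOT, REMOTE_ROOT))
```

This is why the local directory needs to end in ‘reddit’: everything below it is expected to line up one-to-one with the remote tree.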

After we do this, we’re able to run and debug Python scripts in the Vagrant box. However, we’re not done yet: Reddit is a Pylons application, and uses a tool called ‘paster’ to start the application.

Running Reddit from PyCharm

If we open an SSH terminal window (Ctrl+Shift+A, ‘Start SSH session’, and then pick the Vagrant box) and run which paster, we can see that they’ve installed the script in /usr/bin/paster. By running cat /usr/bin/paster we can see that it’s a very simple two-line script. The easiest way to enable PyCharm to run reddit is to copy this script into the project. So let’s create a script called paster.py in the root of the project, with the following code:

from paste.script import command
command.run()

If you’re using my fork rather than the code checked out from Reddit, this file should already be there.

One more thing to do before running Reddit from PyCharm: stop the Reddit that’s already running on the VM (you can’t have two applications listening on the same port). So let’s go back to the SSH terminal window; if you’ve closed it, you can go to Tools | Start SSH Session to start a new one. Reddit uses Upstart to run all of its services, and we can use standard Upstart commands to manage these services. To stop the main application, run sudo initctl stop reddit-paster.

At this point, we can create our run configuration, and start Reddit from PyCharm. Be sure to specify full paths; if you specify relative paths, there’s a high chance you’ll get a FileNotFoundError somewhere.

Name: ‘reddit’

Script: <full/path/to>paster.py

Script parameters: serve run.ini --reload http_port=8001

Working directory: <full/path/to/local/reddit>/r2

Click OK. Now let’s click the bug icon (or use Shift+F9) to start debugging. If everything is set up correctly, you should see:

And if you open http://reddit.local (again, the http:// is important) in your browser, you should see your local instance of Reddit running. If you’re wondering how we’re seeing our application on port 80 even though we specified 8001, Reddit is using HAProxy in the VM.

Let’s Mess with Reddit!

After all this effort, we should do something more interesting than just clicking play, right? So let’s change the main menu to say “Hello from PyCharm”:

To find out how to do this, we should have a look at the routing first, to see which controller serves the front page. Use Ctrl+Shift+N to find routing.py (in reddit/r2/r2/config/). After scrolling through this file, we find (on line 283):

mc('/', controller='hot', action='listing')

In Pylons, this means that we need to have a HotController, which exposes a ‘listing’ action. So let’s find the controller: press Ctrl+N and look for HotController, which we find in listingcontroller.py. In HotController, there’s a GET_listing method that looks promising. So let’s put a breakpoint there.
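The jump from the route’s keywords to the names we searched for follows a simple naming pattern. The sketch below only illustrates that pattern as inferred from this one route (‘hot’ to HotController, ‘listing’ to GET_listing); `handler_names` is a hypothetical helper, and the real dispatch inside Pylons and Reddit is more involved.

```python
def handler_names(controller, action, http_method="GET"):
    """Guess the class and method names implied by a route's keywords."""
    # 'hot' -> 'HotController' (naming pattern assumed from the example route).
    class_name = controller.capitalize() + "Controller"
    # 'listing' handled via GET -> 'GET_listing'.
    method_name = "%s_%s" % (http_method.upper(), action)
    return class_name, method_name


print(handler_names("hot", "listing"))
```

Knowing the pattern gives you two concrete names to feed into Ctrl+N and the structure view, which is usually faster than reading the dispatcher itself.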

When we refresh the page in the browser, PyCharm breaks here, so we’ve found the right place. FYI: if you take some time debugging, you might get a 504 gateway timeout in the browser. That’s just because HAProxy gets impatient; the backend Python application is still running.

Let’s use step over (F8), and step into (F7) to see where the code goes. We see a lot of places in the Reddit codebase and the Pylons code, but we don’t find anything useful.

The code here calls the GET_listing method of its superclass, ListingController. Let’s go have a look at that method: we can go there by putting the cursor on the name where we’re calling it (on line 573) and then using Ctrl+B to navigate to the function’s declaration (it’s on line 115). At the end of that function there’s a call to self.build_listing, which looks interesting. So let’s follow the path further (Ctrl + click on build_listing).
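As a toy model of the call chain we just walked through: the class and method names below come from the steps above, but the method bodies are invented placeholders and have nothing to do with Reddit’s real code.

```python
class ListingController(object):
    def build_listing(self):
        # Placeholder body; the real method assembles the page.
        return "listing built by " + type(self).__name__

    def GET_listing(self):
        # The superclass method ends by calling build_listing.
        return self.build_listing()


class HotController(ListingController):
    def GET_listing(self):
        # Delegate to the superclass implementation.
        return ListingController.GET_listing(self)


print(HotController().GET_listing())
```

Stepping into a chain like this is exactly where Ctrl+B pays off: each hop is one keystroke instead of a search.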

In this method we see nav_menus=self.menus , which looks promising. So let’s go to self.menus, and we see an empty list. Let’s see what happens if we change this:

If we go to the console in the debugger, we can see that the server automatically reloaded with our code changes. If your server didn’t automatically reload, check to see if you have --reload in the script parameters of your run configuration. You can manually restart the server by using the icon with the green round arrow in the top-left of the debugger tool window. Let’s mute the breakpoints, and refresh:

Close, but no cigar:

Let’s undo our changes first. And then let’s see if there’s another way to find the menu we’re interested in. If we look in the Chrome inspector, we find that the menu we’re interested in has the CSS class tabmenu :

Let’s use Find in Path (Ctrl+Shift+F) in the r2 directory to see if we find anything when we look for tabmenu. We find a file called menus.py, so let’s have a closer look!

And what do we see in this file? class NavMenu , which in the docstring says it generates a navigation menu. So let’s see if this is the right place. Let’s put a breakpoint here (line 223 in menus.py), and then go to the breakpoints overview (Ctrl+Shift+F8) to remove our old breakpoint:

Don’t forget to unmute the breakpoints, and then let’s refresh! Now when we see the variables in this point, there’s an interesting looking options list. So let’s inspect it further, and we notice that every NamedButton has a title variable.

Clicking through this for every menu (and there are a couple) seems like a lot of work, so let’s make our life easier. We’re interested in seeing the titles of the menus to make sure that the one we’re interested in is here. In other words, we only want to see: [button.title for button in options] . We can check if our list comprehension is correct by going to the ‘Console’ tab of the debugger and running it. Looks good!
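If you’d like to try the expression outside the debugger first, here is a small stand-alone sketch. The `NamedButton` class below is a hypothetical stand-in with only a title attribute, and the button titles are made up; Reddit’s real class does much more.

```python
class NamedButton(object):
    """Hypothetical stand-in; only the title attribute matters here."""

    def __init__(self, title):
        self.title = title


# Made-up options list, shaped like the one seen in the debugger.
options = [NamedButton(t) for t in ("hot", "new", "rising")]

# The same expression we type into the debugger console.
titles = [button.title for button in options]
print(titles)
```

Because the expression only reads attributes, it is safe to evaluate repeatedly without changing the state of the running application.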

Let’s go back to the breakpoints overview: for now, we don’t want to stop, we only want to log, so we’ll uncheck ‘Suspend Thread’, and check ‘Evaluate and Log’ with that expression:

Now when we refresh the page, and look in the console tab, we see which menus are being handled by this class:

We can clearly see that the ‘hot’, ‘new’, ‘rising’, etc menu is here. So let’s break on that one, and that one alone:

When we refresh, we hit the breakpoint again (we may need to switch from the ‘Console’ tab to the ‘Debugger’ tab to see the variables), we can see in the Frames view that we’re being called by build_toolbars on line 971 in pages.py:

Let’s remove our breakpoint here, and have a look at the build_toolbars function. Use Ctrl+Shift+N to open pages.py, and then use Ctrl+G to go to line 971 (or click in the Frames view). Looking around in this function, we see that the main_buttons list is being populated with ‘hot’, ‘new’, etc. so it looks like we’ve found our menu!

Let’s see what happens when we add our message here:

Refresh, and:

Oops.

If you have debug mode enabled, which you should have with the Vagrant box, you’ll just see the stack trace, but this picture is funnier. If we have a look at the stack trace (if you don’t have debug mode enabled, you can see it in the debug console in PyCharm), we can see that we’re getting a KeyError in strings.py on line 249.

Looking at the code there, it looks like Reddit doesn’t know how to make plurals out of the words we’ve added, and that’s what’s breaking it. So let’s add these words to the list:
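The actual change lives in Reddit’s strings.py. As a toy illustration of the failure mode only, assuming nothing about how that file is really structured, a plain dictionary lookup behaves the same way: unknown keys raise a KeyError until they are registered. The `plurals` table and `plural_for` function are hypothetical.

```python
# Hypothetical lookup table; Reddit's real strings.py is structured differently.
plurals = {
    "hot": "hot",
    "new": "new",
}


def plural_for(word):
    # A plain dict lookup raises KeyError for any word that was never registered.
    return plurals[word]


try:
    plural_for("Hello from PyCharm")
except KeyError as exc:
    print("KeyError: %s" % exc)

# Registering the new word first makes the same lookup succeed.
plurals["Hello from PyCharm"] = "Hello from PyCharm"
print(plural_for("Hello from PyCharm"))
```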

And let’s refresh again to see if we’ve fixed Reddit:

We didn’t just fix it, we successfully altered the menu!

The Journey

We just used the debugger to quickly learn how a new, large, unknown codebase works. We did so by finding clues, and then following them until we either found what we were looking for or realized that we were on the wrong path.

Breakpoints are useful, but they become very powerful when you explore the additional options they have (like conditions, and logging expressions).

Let us know in the comments if you enjoyed these shenanigans, and what kind of content you’d like to see on this blog in the future!