Having a good collection of Cross-Site Scripting (XSS) payloads is useful when you want to thoroughly test a web site’s ability to defend itself against exploitation. In most cases you can just run one or more open source or commercial scanning tools against your site, and each of these tools has its own collection of payloads. However, if you want to maximize your coverage to include edge-case payloads or new techniques that tools may not cover yet, there are various payload resources available on the Internet. One of those resources is Reddit’s XSS subreddit, where there are frequent posts demonstrating actual XSS exploitation in the wild, some of it on big-brand sites (example).

https://reddit.com/r/xss

Note: Many of the XSS vulnerabilities posted on the subreddit have been fixed, but some may still be active if the site owners haven't patched them. The payloads are typically benign proofs of concept, but check each link's payload before clicking.

In an effort to build a payload collection from various sources, I’ve published the following project on Github: https://github.com/foospidy/payloads. The repository includes a script to download payloads from other repositories, as well as payload files from one-off sources like web sites, blogs, Reddit, and wherever else they can be found. Given the varying sources there is overlap in payload signatures among the files, but that’s okay; perhaps in the future I’ll work out a way to remove duplicate payload signatures. Back to the point of this post: the XSS subreddit was an ideal candidate for this collection. The trick was extracting the six-plus years of payloads from it.

Getting the Payloads

There are a few options for pulling data from Reddit: one is to use the Reddit API, another is to use the RSS feature. Using the API is probably the better way to go, but I already had a Python script from a previous experimental project (XSSwat-SG) that pulls and parses data from the XSS subreddit RSS feed. In a nutshell, the script reads the RSS feed of the latest XSS subreddit entries. If an entry is a GET-request-based XSS, it tests whether the XSS works using the Selenium Python library, then stores the data and results in a MySQL database. All that was needed was a simple tweak of the script so a search URL can be specified to query for past entries based on a time range. To do this the search URL needs to specify “search.rss” for RSS output, and use Reddit’s Cloudsearch syntax to specify the subreddit and the time range.
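The per-entry filtering the script performs can be sketched roughly like this, using only the Python standard library. This is an illustrative sketch, not the actual XSSwat-SG code; the function name and the assumption that candidate links sit in `<link>` elements are mine:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

def extract_get_xss_candidates(rss_xml):
    """Parse a feed string and return entry links that carry a query
    string, i.e. candidates for GET-request-based XSS testing."""
    root = ET.fromstring(rss_xml)
    candidates = []
    for elem in root.iter():
        # Reddit's .rss endpoints actually serve Atom; strip namespaces
        # from tag names so we can match 'link' elements directly.
        if elem.tag.split('}')[-1] == 'link':
            href = elem.get('href') or (elem.text or '')
            if urlparse(href).query:  # no query string -> no GET payload
                candidates.append(href)
    return candidates
```

Entries that pass this filter would then be handed off to Selenium for live testing and stored in the database.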

Example search URL with RSS output:
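Based on the Cloudsearch syntax described above, such a URL might look like the following. The timestamp values and the exact query parameters here are illustrative assumptions, not a verified live URL:

```
https://www.reddit.com/r/xss/search.rss?q=timestamp%3A1451606400..1451692800&syntax=cloudsearch&restrict_sr=on&sort=new
```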

Since a Reddit search will only return a maximum of 1000 results, and I wanted to process at least six years of entries, I needed a way to divide my searches into short time periods. I decided that a time period of one day was enough to guarantee fewer than 1000 results per request. Sure, it would take longer to go day by day, but I was in no rush. I didn’t want to overcomplicate the Python script to handle this, so I created a shell script that takes how far back to go in days, generates the URL with the appropriate time range, and then calls the Python script with that URL. In turn, the Python script queries the subreddit and stores the results. You can find the shell script here: https://github.com/foospidy/XSSwat-SG/blob/master/redditxss. As an example, to gather payloads starting from six years ago, which is 2190 days, I would run:

./redditxss 2190
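The day-by-day splitting can be sketched in a few lines of Python. This is an illustrative rewrite of the idea, not the actual redditxss shell script, and the URL format is an assumption based on the Cloudsearch syntax described earlier:

```python
import time

def daily_search_urls(days_back, now=None):
    """Yield one Cloudsearch RSS search URL per day, newest first, so
    that no single query can exceed Reddit's 1000-result cap."""
    now = int(now if now is not None else time.time())
    day = 86400  # seconds in one day
    for i in range(days_back):
        start = now - (i + 1) * day
        end = now - i * day
        yield ("https://www.reddit.com/r/xss/search.rss"
               f"?q=timestamp%3A{start}..{end}"
               "&syntax=cloudsearch&restrict_sr=on")
```

Each generated URL could then be passed to the Python script in place of the live feed URL, one request per day.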

Note: As mentioned above, the Python script only gathers GET-request-based XSS payloads, so any POST-request-based XSS payloads are not collected from the XSS subreddit.

Creating the Payloads File

After running the redditxss script all the payloads are stored in a MySQL database. This makes extraction easy. To extract all the payloads to a file I ran the following command:

mysql -u root -D rxss -B -e "select url from signatures where url like '%?%';" | awk -F ? '{ print $2 }' > reddit_xss_get.txt

There are a few things to note about this command. First, I’m specifying the database name "rxss"; that’s the name I used when configuring the Python script. Second, some results may not have query string parameters, and without query string parameters there is no XSS payload, so I specify "like '%?%'" in the where clause to ensure the rows I extract have query strings.

While this approach should capture all GET-request-based payloads, the resulting payload file is not going to be perfect. Some rows may contain query string parameters but no payload, and each row may contain extra parameters that are not really needed, so some cleanup may be required before feeding this file to a script or tool. In my case these issues were not a problem, but I did need to prefix each line with "&" due to the way I handle payloads when performing tests. Eventually I’d like to clean up the file to be consistent with the other payload files published in the payloads project on Github. The final payload file can be found here: https://github.com/foospidy/payloads/blob/master/other/xss/reddit_xss_get.txt.

In Conclusion

The XSS subreddit is a great payload source, and fortunately Reddit provides convenient ways to extract the payloads in an automated fashion with a little scripting work. Even though the scripts described here only collect GET-request-based payloads, being able to extract those payloads into a usable format should prove useful for testing other web sites and web applications.

Both the scripts and the payloads collection are open source and published on Github (payloads, XSSwat-SG), so contributions are welcome, as are feedback and comments.