The dataset is up to date as of October 2015 and uses the official HN API as its data source. (Unfortunately, this means the comments include their HTML formatting.) The dataset is about 4GB in total; since BigQuery allows 1000GB of processing for free each month, it effectively costs nothing to analyze.
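If the HTML in the comment text gets in the way, something like the following rough sketch could clean it up after export. It assumes the markup is limited to simple tags such as <p>, <a> and <i> plus HTML entities; the function name strip_hn_html is made up for illustration and is not part of the dataset or the API.

```python
import html
import re


def strip_hn_html(text):
    """Return comment text with simple HTML tags removed (hypothetical helper)."""
    # Assumption: <p> marks a paragraph break, so turn it into a blank line.
    text = re.sub(r"<p\s*/?>", "\n\n", text, flags=re.IGNORECASE)
    # Drop any remaining tags (<a href=...>, <i>, <pre>, <code>, ...),
    # keeping the text between them.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities last, so an escaped "&lt;b&gt;" stays as literal text.
    return html.unescape(text).strip()


if __name__ == "__main__":
    sample = 'I&#x27;m here<p>See <a href="http://x.com" rel="nofollow">http://x.com</a> <i>now</i>'
    print(strip_hn_html(sample))
```

A regex is enough for a quick look at the text; for anything where malformed markup matters, a real HTML parser would be the safer choice.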

Felipe Hoffa (who uploaded the dataset) has links to the dataset along with some sample Python code for analysis: https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb

I have a few scripts for downloading all the data manually (https://github.com/minimaxir/get-all-hacker-news-submissions-comments ), but the BigQuery table may be more pragmatic. I can do some more sample queries if requested. (See my BigQuery tutorial for Reddit data: http://minimaxir.com/2015/10/reddit-bigquery/ )
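For a sense of what the manual route involves, here is a minimal sketch of fetching single items from the official HN API. This is not the code from the repository linked above, just an illustration using the public item and maxitem endpoints; the function names are made up, and there is no rate limiting or error handling.

```python
import json
import urllib.request

# Base URL of the official Hacker News API.
API_BASE = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id):
    """Build the URL for a single item (story, comment, etc.)."""
    return f"{API_BASE}/item/{item_id}.json"


def fetch_item(item_id, timeout=10):
    """Fetch one item as a dict; the API returns null (None) for missing ids."""
    with urllib.request.urlopen(item_url(item_id), timeout=timeout) as resp:
        return json.load(resp)


def fetch_max_item_id(timeout=10):
    """Fetch the largest item id currently in use."""
    with urllib.request.urlopen(f"{API_BASE}/maxitem.json", timeout=timeout) as resp:
        return json.load(resp)
```

Downloading everything means walking the ids from 1 up to fetch_max_item_id() one request at a time, which is the main reason the BigQuery table is the more practical option.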