Overview

After a post from this site made the front page of Hacker News (HN), a curious quirk was spotted. The post was first submitted to HN at 2016-06-12 09:27:01.000Z and then again at 2016-06-15 02:03:36.000Z. Notice that they are only about three days apart. The submission from the 12th had received only one or two points by the 15th. The repost, published on the 15th, received over 170 points within hours of being published and held position one for a while.

In some cases, a piece of content will be posted and then, minutes later, posted again, where the almost immediate repost makes the front page and the earlier version forever stays on the back page. In other cases, it can take years before a repost finally graces the front page of Hacker News. Of course, there are pieces of content that get reposted year after year, with each repost doing equally well.

To illustrate, here is an example of a piece of content that was posted and then reposted nine times; the ninth repost made the front page. The span of time from the first post to the front page was 122 days, 2 hours, 47 minutes, and 24 seconds.

| ID | User | Score | Time | Title | URL | Front Page Story ID |
|----------|-------------|-------|------------|----------------------------------------------------------------|----------------------------------------------------------------------|----------|
| 11540337 | dnetesn | 2 | 1461225802 | Why Physics Is Not a Discipline | http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline | 12330360 |
| 11550152 | dnetesn | 2 | 1461341001 | Why Physics Is Not a Discipline | http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline | 12330360 |
| 11556082 | dnetesn | 2 | 1461427428 | Why Physics Is Not a Discipline | http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline | 12330360 |
| 11562342 | celadevra_ | 2 | 1461552906 | Why Physics Is Not a Discipline | http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline | 12330360 |
| 11564691 | feelthepain | 3 | 1461597092 | Why Physics Is Not a Discipline | http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline | 12330360 |
| 11570675 | dnetesn | 1 | 1461666275 | Why Physics Is Not a Discipline | http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline | 12330360 |
| 11578913 | dnetesn | 1 | 1461749668 | Why Physics Is Not a Discipline | http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline | 12330360 |
| 11621763 | dnetesn | 1 | 1462291025 | Why Physics Is Not a Discipline | http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline | 12330360 |
| 11666213 | dnetesn | 1 | 1462877056 | Why Physics Is Not a Discipline | http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline | 12330360 |
| 12330360 | dnetesn | 22 | 1471776646 | Physics is not just what happens in the Department of Physics | http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline | 12330360 |
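The 122-day span can be checked directly from the Unix timestamps in the table, taking the first post (ID 11540337) and the repost that made the front page (ID 12330360):

```python
# Time between the first post and the front page repost,
# using the Unix timestamps from the table above.
first_post = 1461225802   # ID 11540337
front_page = 1471776646   # ID 12330360

span = front_page - first_post          # total seconds
days, rem = divmod(span, 86400)         # 86400 seconds per day
hours, rem = divmod(rem, 3600)
minutes, seconds = divmod(rem, 60)

print(days, hours, minutes, seconds)    # → 122 2 47 24
```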

The following is a look at Hacker News front page stories and their back page counterparts. We will look at how they differ as we search for any patterns that distinguish the two.

Terminology

HN: Hacker News

Post: the first time a piece of content is submitted to HN

Repost: the second to nth time the same piece of content is submitted to HN

Front candidates: Stories gathered from the front route and HN item API

Back candidates: Hits gathered from the Algolia search API after querying a front page story’s URL

Front(s): Front candidates that had one post that never made it to the front page and only one repost that made it to the front page

Back(s): Back candidates, that when grouped by their related front, had only one repost that made it to the front page

Collection

In order to collect the stories that at one time made the front page, story IDs, usernames, titles, scores, and post dates were indexed from the Hacker News front route. These IDs were then queried against the Hacker News item API, and each returned item was stored in a SQLite database.

While there is an API endpoint for top stories, it only reflects the top stories right now and is not an explicit confirmation of having been on the front page. There is also a best stories API endpoint, but it only gives the highest voted recent links. Thus the front route was used to rule out any kind of ambiguity. Unfortunately, the HN items returned from the item endpoint do not indicate whether or not they made the front page.

To find the back candidates, Algolia search was queried using the front page story URLs.
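Such a query can be built against the public Algolia HN Search API, restricting the search to the url attribute. A minimal sketch (the helper name is illustrative, not from the original pipeline):

```python
from urllib.parse import urlencode

ALGOLIA_SEARCH = "https://hn.algolia.com/api/v1/search"

def algolia_query_url(story_url):
    # Build a search request for other submissions of the same URL,
    # restricting matching to the url attribute only.
    params = urlencode({
        "query": story_url,
        "restrictSearchableAttributes": "url",
    })
    return ALGOLIA_SEARCH + "?" + params

print(algolia_query_url("http://nautil.us/issue/35/boundaries/why-physics-is-not-a-discipline"))
```

The JSON response carries a `hits` array; each hit includes fields such as `objectID`, `author`, `points`, `created_at_i`, and `url`, which map onto the columns stored for the back candidates.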

Overall, 5,746 front candidates were collected along with 4,681 back candidates.

Front

All of the front candidates collected were indexed from the HN front route and verified via the HN item API endpoint. The front candidates' timestamps ranged from Mon, 27 Jun 2016 20:58:30 GMT to Sat, 03 Sep 2016 16:28:55 GMT. After the back candidates were collected, any front page story that did not retrieve any hits (besides itself) from Algolia was ignored during the analysis.

Hacker News Item

The front page stories collected from the front route did not have all of their information. One of the missing pieces of information was the exact date and time the story was posted. Each ID collected from the front route was queried against the HN items API endpoint. These HN items were saved in their own SQLite database table.

Algolia Hit

To find the back candidates corresponding to the front page stories, the Algolia API was used. For every front page story collected, its cleaned URL was used to search Algolia. The returned Algolia hits were stored in their own SQLite database table, which had a foreign key column relating each row to a row in the HN items table. There was a many-to-one relationship between Algolia hits and HN items.
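That many-to-one relationship can be sketched with SQLite directly (table and column names here are illustrative, not the exact schema used):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

# One row per front page story pulled from the HN item API.
con.execute("""
    CREATE TABLE hackerNewsItems (
        _id INTEGER PRIMARY KEY, by TEXT, score INTEGER,
        time INTEGER, title TEXT, url TEXT
    )""")

# Many Algolia hits can point at one HN item (many-to-one).
con.execute("""
    CREATE TABLE algoliaHits (
        _id INTEGER PRIMARY KEY, author TEXT, points INTEGER,
        created_at_i INTEGER, title TEXT, url TEXT,
        hackerNewsItemId INTEGER REFERENCES hackerNewsItems(_id)
    )""")

con.execute(
    "INSERT INTO hackerNewsItems VALUES (12330360, 'dnetesn', 22, 1471776646, "
    "'Physics is not just what happens in the Department of Physics', 'http://nautil.us/')")
con.execute(
    "INSERT INTO algoliaHits VALUES (11540337, 'dnetesn', 2, 1461225802, "
    "'Why Physics Is Not a Discipline', 'http://nautil.us/', 12330360)")
con.execute(
    "INSERT INTO algoliaHits VALUES (11550152, 'dnetesn', 2, 1461341001, "
    "'Why Physics Is Not a Discipline', 'http://nautil.us/', 12330360)")

# All back candidates for one front page story:
rows = con.execute(
    "SELECT _id FROM algoliaHits WHERE hackerNewsItemId = ? ORDER BY _id",
    (12330360,)
).fetchall()
print(rows)  # → [(11540337,), (11550152,)]
```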

Pre-processing

Most of the pre-processing centered around evaluating back and front candidates. To find content that was posted and then reposted one or more times, the URL was used. Each URL was broken up into five major parts. Country code top-level domains (ccTLDs) were collapsed to a single top-level domain (TLD).

The following procedure was carried out when comparing a back candidate’s URL to its front candidate’s URL:

Break the back candidate URL into subdomain, domain sans the TLD, TLD, path, and query

Break the front candidate URL into subdomain, domain sans the TLD, TLD, path, and query

When looking at a URL, if it had a path then the query was set to the empty string

Compute the Levenshtein distance (LD) between:

the back candidate subdomain and the front candidate subdomain

the back candidate domain sans the TLD and the front candidate domain sans the TLD

the back candidate TLD and the front candidate TLD

the back candidate path and the front candidate path

the back candidate query and the front candidate query

Declare the back candidate a URL match to its front candidate if the subdomain LD is less than 10 AND the domain sans the TLD LD equals zero AND the TLD LD equals zero AND the path LD is less than 2 AND the query LD equals zero



The following demonstrates some of the cases that were seen:

https://lettier.github.io   vs  http://lettier.github.io
https://lettier.github.io   vs  http://lettier.github.io/index.html
https://lettier.github.io?  vs  http://lettier.github.io/index.html
https://www.lettier.com     vs  http://lettier.com
https://www.bbc.co.uk       vs  https://www.bbc.com

The differences span the scheme, a www subdomain, a trailing index.html, an empty query string, and a ccTLD.

Every potential URL match was reviewed for correctness along with the potential mismatches.
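The matching rule can be sketched as follows: a plain dynamic-programming Levenshtein distance plus the thresholds from the procedure above (the helper names are illustrative; the original used Haskell's edit-distance package):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def urls_match(back, front):
    # back/front: (subdomain, domain sans TLD, TLD, path, query) tuples.
    sub, dom, tld, path, query = (levenshtein(b, f) for b, f in zip(back, front))
    return sub < 10 and dom == 0 and tld == 0 and path < 2 and query == 0

# "https://www.lettier.com" vs "http://lettier.com":
print(urls_match(("www", "lettier", "com", "", ""),
                 ("",    "lettier", "com", "", "")))  # → True
```

The loose subdomain threshold tolerates a missing or differing www, while the zero-distance requirements on domain, TLD, and query keep genuinely different pages apart.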

Front and Back Criteria

The following criteria were used to confirm an Algolia hit (back candidate) as a back:

Does not have the same ID as any front page story

Has a URL

Has an equivalent URL to its related front page story

Was posted before its related front page story

Given a front page story and its Algolia hits (besides itself), there exists no Algolia hit with four or more points

Has three or fewer points

These criteria were necessary to pull out all examples of a piece of content being posted and then reposted until it finally reached the front page.

Back -> Back -> Back -> ... => Front
Post -> Repost -> Repost -> ... => Repost
<---------------- Time ---------------->

Three or fewer points was used as a criterion because four was the minimum number of points observed for front candidates that had related back candidates.

Fronts were defined as having a matching ID in the backs’ collection of foreign key front page story IDs. Note that no two front page stories, HN items, or Algolia hits had the same ID. Within each database table, all row primary key IDs were unique.

Analysis

Analysis of HN content typically only focuses on what made the front page and when. Rarely do you see an analysis of what did not make the front page. Given the exact same piece of content (news, blog post, software project, etc.), why does an early submission never reach the front page while a later resubmission does? By comparing pieces of content that were posted and then reposted, with only one repost making the front page, we can search for discernible patterns alluding to what makes a piece of content a front page HN story.

The analysis consisted of comparing 425 fronts against 570 corresponding backs. Facets looked at were day of the week, time of day, title word usage, title sentiment, vectorized title projections, and users.

To ease the analysis, all of the fronts and their backs were indexed into Elasticsearch. Fronts were indexed as parent documents with their related backs being indexed as child documents.
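Parent-child indexing along those lines might look like the following sketch, in the Elasticsearch 2.x mapping style current when this analysis was done; the index, type, and field names here are illustrative, not the exact mapping used:

```python
# Illustrative Elasticsearch 2.x-style parent/child mapping:
# fronts are parents; backs are children joined via the _parent field.
mapping = {
    "mappings": {
        "front": {
            "properties": {
                "title": {"type": "string"},
                "score": {"type": "integer"},
                "time":  {"type": "integer"},
            }
        },
        "back": {
            "_parent": {"type": "front"},   # each back points at one front
            "properties": {
                "title":        {"type": "string"},
                "points":       {"type": "integer"},
                "created_at_i": {"type": "integer"},
            }
        },
    }
}

# A child (back) document is then indexed with its parent's ID, e.g.:
#   PUT /hn/back/11540337?parent=12330360
```

This keeps each front and its backs on the same shard, so per-front aggregations stay cheap.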

One possible reason a front succeeded where its backs did not may be the day of the week and/or the time of day it was reposted. Does knowing when a story was reposted tell you something about whether it reached the front page or stayed on the back page? Are date and time independent of front versus back status?

Note that all time stamps were converted to EDT from GMT.

Day of the Week

Using Elasticsearch to aggregate by day of the week, we analyze the relative frequency of fronts versus backs.

You can see very little relative difference for Monday. The largest relative differences can be seen on Saturday and Sunday, the weekend.

In this chart we explicitly visualize the absolute difference between the relative frequencies. There is a large difference on Tuesday, with relatively more backs than fronts. After Tuesday, the difference drops the next day and slowly grows until Sunday. More backs than fronts were observed on Wednesday, Thursday, and Friday. On Saturday and Sunday, however, more fronts than backs were observed.

Having two nominal categories–day of the week and page status–we can use the Chi-Square test of independence.

Assumptions of the Chi-Square test of independence are:

The data is frequencies or counts

The category levels are mutually exclusive (an HN submission cannot have two timestamps or be both a front and a back at the same time)

Each subject (HN submission) can only contribute data to one cell in the contingency table (an HN submission cannot be both (Monday, Front) and (Tuesday, Back) for example)

The expected values are 5 or more in at least 80% of the cells

|       | Mon        | Tue        | Wed        | Thur       | Fri        | Sat        | Sun        | Totals |
|-------|------------|------------|------------|------------|------------|------------|------------|--------|
| Back  | 85 (87.08) | 97 (78.48) | 99 (96.24) | 97 (88.79) | 95 (86.50) | 50 (67.03) | 47 (65.88) | 570    |
| Front | 67 (64.92) | 40 (58.52) | 69 (71.76) | 58 (66.21) | 56 (64.50) | 67 (49.97) | 68 (49.12) | 425    |
|       | 152        | 137        | 168        | 155        | 151        | 117        | 115        | 995    |

Expected values shown in parentheses.

Null Hypothesis = H0 = Page Status (Front vs Back) and Day of the Week are independent
Alternative Hypothesis = H1 = Page Status (Front vs Back) and Day of the Week are dependent
P(Type I Error) = alpha = 0.05
Chi-Square statistic = 37.050859063487195
Degrees of Freedom = 6
Cramer's V = 0.19296902415909076
P(Chi-Square statistic >= observed | H0) = p-value < 0.0001
Reject H0.

Based on the p-value being less than the alpha value, the data supports the alternative hypothesis. However, based on the Cramer’s V, the strength of the association is weak.
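The day-of-week statistic can be reproduced directly from the observed counts with a minimal pure-Python chi-square computation (a sketch, not the R routines actually used; it recovers the reported statistic and Cramér's V):

```python
import math

# Observed counts, Mon..Sun, from the contingency table above.
backs  = [85, 97, 99, 97, 95, 50, 47]
fronts = [67, 40, 69, 58, 56, 67, 68]

row_totals = [sum(backs), sum(fronts)]
col_totals = [b + f for b, f in zip(backs, fronts)]
n = sum(row_totals)

chi_square = 0.0
for row, row_total in zip((backs, fronts), row_totals):
    for observed, col_total in zip(row, col_totals):
        expected = row_total * col_total / n       # E = (row total * col total) / N
        chi_square += (observed - expected) ** 2 / expected

# Cramer's V for a 2 x 7 table: sqrt(chi2 / (N * min(rows - 1, cols - 1))).
cramers_v = math.sqrt(chi_square / (n * 1))

print(round(chi_square, 2), round(cramers_v, 3))   # → 37.05 0.193
```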

Time of Day

Separating out the time of day from their time stamps, we visualize the relative frequencies of backs versus fronts.

Looking at the chart, 0600 and 1100 immediately stand out. However, are these observed differences just chance occurrences because of the sample we took?

| Hour   | Back       | Front      | Totals |
|--------|------------|------------|--------|
| 0000   | 16 (14.89) | 10 (11.11) | 26     |
| 0100   | 17 (15.47) | 10 (11.53) | 27     |
| 0200   | 18 (14.89) | 8 (11.11)  | 26     |
| 0300   | 15 (14.89) | 11 (11.11) | 26     |
| 0400   | 21 (24.63) | 22 (18.37) | 43     |
| 0500   | 21 (22.91) | 19 (17.09) | 40     |
| 0600   | 15 (22.91) | 25 (17.09) | 40     |
| 0700   | 21 (22.91) | 19 (17.09) | 40     |
| 0800   | 20 (24.06) | 22 (17.94) | 42     |
| 0900   | 31 (32.08) | 25 (23.92) | 56     |
| 1000   | 41 (38.38) | 26 (28.62) | 67     |
| 1100   | 23 (32.65) | 34 (24.35) | 57     |
| 1200   | 31 (30.36) | 22 (22.64) | 53     |
| 1300   | 41 (36.66) | 23 (27.34) | 64     |
| 1400   | 33 (31.51) | 22 (23.49) | 55     |
| 1500   | 42 (42.39) | 32 (31.61) | 74     |
| 1600   | 33 (28.64) | 17 (21.36) | 50     |
| 1700   | 27 (23.49) | 14 (17.51) | 41     |
| 1800   | 15 (13.75) | 9 (10.25)  | 24     |
| 1900   | 20 (20.62) | 16 (15.38) | 36     |
| 2000   | 17 (15.47) | 10 (11.53) | 27     |
| 2100   | 17 (14.89) | 9 (11.11)  | 26     |
| 2200   | 21 (17.19) | 9 (12.81)  | 30     |
| 2300   | 14 (14.32) | 11 (10.68) | 25     |
| Totals | 570        | 425        | 995    |

Expected values shown in parentheses.

Null Hypothesis = H0 = Page Status (Front vs Back) and Time of Day are independent
Alternative Hypothesis = H1 = Page Status (Front vs Back) and Time of Day are dependent
P(Type I Error) = alpha = 0.05
Chi-Square statistic = 26.806877906649206
Degrees of Freedom = 23
Cramer's V = 0.1641389223670862
P(Chi-Square statistic >= observed | H0) = p-value = 0.2643
Fail to reject H0.

Because the p-value is greater than the alpha value, we fail to reject the null hypothesis that the two nominal categories are independent.

Title

With the way in which HN is laid out, the title is all one has to go on while scanning stories on the new page. Is the title the defining factor between falling off the new page, never to be seen again, or making a giant splash on the front page?

Out of the 570 backs, 200 had the exact same title as their related front.

Word Usage

Using Elasticsearch’s inverted index and English analyzer, we can easily aggregate the title word usage frequencies for both fronts and backs.
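Such an aggregation might look like the following sketch (standard Elasticsearch terms-aggregation DSL; the field and bucket names are illustrative, and the title field is assumed to be indexed with the English analyzer):

```python
# Illustrative Elasticsearch request body: top stemmed title words.
# Run once against the front documents and once against the backs,
# then compare the per-word bucket counts.
word_usage_query = {
    "size": 0,                        # skip hits; return only the aggregation
    "aggs": {
        "title_words": {
            "terms": {
                "field": "title",     # analyzed (stemmed, stop-worded) title
                "size": 50            # top 50 terms per page status
            }
        }
    }
}
```

Because the English analyzer stems at index time, "show" and "shows" land in the same bucket, which is what makes the front/back frequency comparison straightforward.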

Here is the top of the chart. The full version can be found here (1.3 MB).

PDF was the most frequent word for both backs and fronts, usually appearing in the title as ... ... [pdf] . Looking over the rest of the chart, there are no clear differences between fronts and backs. If you look at show and hn , a collocation, they are slightly more frequent for fronts.

Sentiment

It could be the case that the titles for backs have a different sentiment from the titles for fronts. To analyze the sentiment, sentimentAPI was used.

Using a LinearSVC classifier model, sentimentAPI was trained on movie reviews and claims to be 88% accurate. A more appropriate model for this analysis would have been one trained on human-labeled article titles, but no such data set was found.

Looking at the chart, the sentiment for both backs and fronts is nearly identical. Curiously, the sentiment is slightly negative for both but this could be entirely due to noise.

Using R, a t-test was run comparing the difference between the mean sentiment for backs and fronts.

Welch Two Sample t-test

data: data.fronts and data.backs
t = 0.71427, df = 882.12, p-value = 0.4752
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01694742  0.03634044
sample estimates:
  mean of x   mean of y
-0.08209647 -0.09179298

Vector Projection

Using the bag-of-words model, we vectorize the back and front titles into their own title-by-word matrices. Taking these two high-dimensional matrices (M HN submission titles by N words), we reduce their dimensionality in order to visualize them. The three dimensionality reduction methods used were Locally Linear Embedding (LLE), Multidimensional Scaling (MDS), and t-distributed Stochastic Neighbor Embedding (t-SNE).
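The vectorization step can be sketched in a few lines (a minimal pure-Python bag-of-words; the real pipeline would also strip stop words and stem, as the Elasticsearch English analyzer does):

```python
def bag_of_words(titles):
    # Build a shared vocabulary, then one word-count vector per title.
    vocab = sorted({word for title in titles for word in title.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for title in titles:
        row = [0] * len(vocab)
        for word in title.lower().split():
            row[index[word]] += 1
        matrix.append(row)
    return vocab, matrix  # M titles by N words

vocab, matrix = bag_of_words([
    "Why Physics Is Not a Discipline",
    "Physics is not just what happens in the Department of Physics",
])
print(len(matrix), len(vocab))  # → 2 13
```

Each title becomes one row of the M-by-N count matrix; the dimensionality reduction methods then project those N-dimensional rows down to 2D or 3D for plotting.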

Using LLE, we see a nice separation between backs (the blue circles) and fronts (the red pentagons). There is some overlap in the large grouping towards the center of the plot.

Here we see the dimensions reduced down to 3D.

In this projection, there is no separation between fronts and backs.

t-SNE is non-deterministic (hence the stochastic). Every run may produce slightly different plots. We can see some separation and overlap which makes sense given the 200 backs that had the same exact title as their front.

User

The users who made the submissions are the last attribute we will look at. Maybe some stories are favored over others by the HN community merely because of who submitted them?

Here is the top of the chart. The full version can be found here (521 KB).

Based on the sample taken, dnetesn ties with adamnemecek but dnetesn has had more backs than fronts. adamnemecek ’s relative frequency of fronts surpasses that of their own relative frequency of backs.

Recap

Using Haskell, R, Python, SQLite, and Elasticsearch, we evaluated posted and reposted pieces of content where only one of the reposts made it to the front page of Hacker News. We looked at the day of the week, time of day, title, and the users who made the submissions. We found statistical significance regarding the dependence between page status and the day of the week. Otherwise, the attributes examined turned up few if any patterns distinguishing backs from their related fronts.

Appendix

Below you will find some supplementary material.

Full Source Code

The source code is written in Haskell and embeds the R and Python scripts used. Beyond the languages and their libraries used, you will need SQLite and Elasticsearch to run the following source code.

stack.yaml

resolver: lts-6.11
packages:
- '.'
extra-deps: []
flags: {}
extra-package-dbs: []

hackerNewsFrontVsBack.cabal

name:                hackerNewsFrontVsBack
version:             0.0.0.1
synopsis:            Hacker News front page vs back page story analysis.
description:         .
homepage:            https://lettier.github.com
license:             Apache
license-file:        LICENSE
author:              David Lettier
copyright:           2016 David Lettier
category:            Analysis
build-type:          Simple
cabal-version:       >=1.10

executable hackerNewsFrontVsBack
  hs-source-dirs:      src
  main-is:             Main.hs
  default-language:    Haskell2010
  other-modules:       DB
                     , AlgoliaHits
                     , HackerNewsItems
                     , FrontPageItems
                     , FrontsAndBacks
                     , Elasticsearch
                     , DateTimeUtil
                     , URIUtil
                     , TimeAnalysis
                     , TitleAnalysis
                     , UserAnalysis
                     , StopWords
                     , SentimentApi
                     , Common
  build-depends:       base >= 4.7 && < 5
                     , wreq
                     , aeson
                     , lens
                     , bytestring
                     , sqlite-simple
                     , network-uri
                     , text
                     , hxt
                     , HandsomeSoup
                     , time
                     , containers
                     , unordered-containers
                     , data-default-class
                     , colour
                     , Chart
                     , cairo
                     , Chart-diagrams
                     , Chart-cairo
                     , process
                     , edit-distance
                     , neat-interpolation
                     , MissingH

Main.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

module Main (main) where

import Control.Monad

import qualified FrontPageItems as FPI
import qualified HackerNewsItems as HNI
import qualified AlgoliaHits as AH

main :: IO ()
main = do
  FPI.createDbTableIfNotExists
  HNI.createDbTableIfNotExists
  AH.createDbTableIfNotExists
  FPI.getAllInDb >>= mapM_ (
      (return . FPI._id)
      >=> HNI.idIfNotInDb
      >=> HNI.getItemMaybe
      >=> HNI.insertOrReplaceMaybe
    )
  HNI.getAllInDb >>= mapM_ (AH.getAlgoliaHits >=> mapM_ AH.insertOrReplace)

DB.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

{-# LANGUAGE QuasiQuotes, OverloadedStrings #-}

module DB where

import GHC.Generics
import Data.Text as DT
import Database.SQLite.Simple
import Database.SQLite.Simple.FromRow
import NeatInterpolation as NI

dbLocation :: String
dbLocation = "data/db/hackerNewsProject.db"

withConnection' :: (Connection -> IO a) -> IO a
withConnection' = withConnection dbLocation

openDb :: IO Connection
openDb = open dbLocation

createDbTableIfNotExists :: String -> String -> IO ()
createDbTableIfNotExists tableName attributes =
    withConnection' (run (sqlString tableName attributes))
  where
    sqlString tableName attributes = [NI.text|
        CREATE TABLE IF NOT EXISTS ${tableName'} (${attributes'});
      |]
      where
        [tableName', attributes'] = Prelude.map DT.pack [tableName, attributes]
    run statement con = execute_ con (Query statement)

Common.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

{-# LANGUAGE OverloadedStrings #-}

module Common where

import System.Process
import Control.Monad
import Data.Maybe
import Data.List as DL
import Data.Map as DM
import Data.Set as DS
import Data.Text as DT
import qualified Data.ByteString.Lazy as DBL
import Data.Aeson
import NeatInterpolation as NI

runRFile :: String -> IO ()
runRFile rfile = void $ createProcess (proc "R" ["--quiet", "-f", rfile])

esAggBackFrontFieldResult
  :: (a -> [(String, Integer)])
  -> IO [a]
  -> IO ([(String, Integer)], [(String, Integer)])
esAggBackFrontFieldResult f r = do
  r' <- r
  let backs  = f $ Prelude.head r'
  let fronts = f $ Prelude.last r'
  return (backs, fronts)

makeKeyMap :: Integer -> [(String, Integer)] -> Map String Double
makeKeyMap t l =
  DM.fromListWith (+) (Prelude.map (\(k, x) -> (k, fromInteger x / fromInteger t)) l)

makeAllKeyMap :: [String] -> DM.Map String Double -> DM.Map String Double
makeAllKeyMap keys m =
  Prelude.foldl (\acc (k, v) -> DM.insertWith (+) k v acc) m [(k, 0.0) | k <- keys]

lookupKey :: String -> DM.Map String Double -> Double
lookupKey key m = fromMaybe 0.0 (DM.lookup key m)

secondSum :: Num b => [(a, b)] -> b
secondSum = Prelude.sum . Prelude.map snd

writeROrPyFile :: (Text -> Text -> Text) -> String -> String -> String -> IO ()
writeROrPyFile f a b rpyf = writeFile rpyf $ DT.unpack $ f (DT.pack a) (DT.pack b)

notNull :: [a] -> Bool
notNull = Prelude.not . Prelude.null

fst' :: (a, b, c, d) -> a
fst' (a, b, c, d) = a

DateTimeUtil.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

module DateTimeUtil where

import Data.Time.Clock
import Data.Time.Clock.POSIX
import Data.Time.LocalTime
import Data.Time.Format

edt :: TimeZone
edt = hoursToTimeZone (-4)

utcTime :: Integer -> UTCTime
utcTime = posixSecondsToUTCTime . fromIntegral

edtZonedTime :: Integer -> ZonedTime
edtZonedTime i = utcToZonedTime edt $ utcTime i

monthOfYearEdtZonedTime :: Integer -> Integer
monthOfYearEdtZonedTime i = read (fEdtZonedTime "%m" i) :: Integer

dayOfWeekEdtZonedTime :: Integer -> Integer
dayOfWeekEdtZonedTime i = read (fEdtZonedTime "%u" i) :: Integer

hourOfDayEdtZonedTime :: Integer -> Integer
hourOfDayEdtZonedTime i = read (fEdtZonedTime "%H" i) :: Integer

minOfHourEdtZonedTime :: Integer -> Integer
minOfHourEdtZonedTime i = read (fEdtZonedTime "%M" i) :: Integer

fEdtZonedTime :: String -> Integer -> String
fEdtZonedTime s i = formatTime defaultTimeLocale s $ edtZonedTime i

URIUtil.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

{-# LANGUAGE OverloadedStrings #-}

module URIUtil where

import Control.Monad
import Network.URI hiding (query)
import Data.List as DL
import Data.Maybe
import Data.String.Utils as DSU
import Data.Text as DT
import Data.ByteString.Lazy

import qualified Common as CO

uriNoQuery :: String -> String
uriNoQuery s = uriParsed
  where
    s' = cleanUri s
    uri' = uri s'
    uriAuth' = uriAuth uri'
    isYoutube = "youtube" `DL.isInfixOf` s'
    uriParsed =
      if isURI s' && not isYoutube
        then Prelude.concat [uriScheme uri', "//", uriRegName uriAuth', uriPath uri']
        else s'

uri :: String -> URI
uri string = fromMaybe nullURI (parseURI $ DT.unpack $ DT.toLower $ DT.pack string)

uriAuth :: URI -> URIAuth
uriAuth uri = fromMaybe (URIAuth "" "" "") (uriAuthority uri)

uriHost :: URI -> String
uriHost = uriRegName . uriAuth

uriTLD :: URI -> String
uriTLD u = grabTLD chunks
  where
    chunks = uriHostChunks u
    grabTLD :: [String] -> String
    grabTLD [] = ""
    grabTLD [a, b, "co", _] = "com"
    grabTLD [a, "co", _] = "com"
    grabTLD [t] = t
    grabTLD [d, t] = t
    grabTLD [s, d, t] = t
    grabTLD [ss, s, d, t] = t
    grabTLD (x:y) = Prelude.last (x:y)

uriDomain :: URI -> String
uriDomain u = grabDomain chunks
  where
    chunks = uriHostChunks u
    grabDomain :: [String] -> String
    grabDomain [] = ""
    grabDomain [a, b, "co", _] = b
    grabDomain [a, "co", _] = a
    grabDomain [t] = ""
    grabDomain [d, t] = d
    grabDomain [s, d, t] = d
    grabDomain [ss, s, d, t] = d
    grabDomain (x:y) = Prelude.last $ Prelude.init (x:y)

uriSubdomain :: URI -> String
uriSubdomain u = grabSubdomain chunks
  where
    chunks = uriHostChunks u
    grabSubdomain :: [String] -> String
    grabSubdomain [] = ""
    grabSubdomain ["www", a, "co", _] = ""
    grabSubdomain [a, b, "co", _] = a
    grabSubdomain [a, "co", _] = ""
    grabSubdomain ("www":_) = ""
    grabSubdomain [t] = ""
    grabSubdomain [d, t] = ""
    grabSubdomain [s, d, t] = s
    grabSubdomain [ss, s, d, t] = ss ++ "." ++ s
    grabSubdomain (x:y) = x

parseURIHost :: URI -> [String]
parseURIHost u = [uriSubdomain u, uriDomain u, uriTLD u]

uriHostChunks :: URI -> [String]
uriHostChunks = Prelude.map DT.unpack . DT.split (== '.') . DT.toLower . DT.pack . uriHost

cleanUri :: String -> String
cleanUri s =
    scheme
    ++ "//"
    ++ DL.intercalate "." (Prelude.filter CO.notNull [subdomain, domain, tld])
    ++ path
    ++ query
  where
    uri' = (uri . DT.unpack . DT.toLower . DT.strip . DT.pack) s
    scheme = cleanUriScheme $ uriScheme uri'
    subdomain = uriSubdomain uri'
    domain = uriDomain uri'
    tld = uriTLD uri'
    path = cleanUriPath $ uriPath uri'
    query = cleanUriQuery $ uriQuery uri'

cleanUriScheme :: String -> String
cleanUriScheme "https:" = "https:"
cleanUriScheme "http:" = "https:"
cleanUriScheme x = x

cleanUriPath :: String -> String
cleanUriPath "" = ""
cleanUriPath x = (removeLastIndexExt . removeLastSlash . removeDoubleSlash) x
  where
    removeDoubleSlash :: String -> String
    removeDoubleSlash = DSU.replace "//" "/"
    removeLastIndexExt :: String -> String
    removeLastIndexExt = recombine . parts
      where
        parts :: String -> [String]
        parts = DSU.split "/"
        lastPart :: [String] -> String
        lastPart [] = ""
        lastPart y = Prelude.last y
        recombine [] = ""
        recombine z =
          DL.intercalate "/"
            (if "index." `DL.isInfixOf` lastPart z then Prelude.init z else z)
    removeLastSlash :: String -> String
    removeLastSlash "" = ""
    removeLastSlash a = if Prelude.last a == '/' then Prelude.init a else a

cleanUriQuery :: String -> String
cleanUriQuery "" = ""
cleanUriQuery "?" = ""
cleanUriQuery "?=" = ""
cleanUriQuery "?=&" = ""
cleanUriQuery q =
  if DL.isInfixOf "?" q && DL.isInfixOf "=" q && Prelude.length q > 3
    then q
    else ""

StopWords.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

{-# LANGUAGE OverloadedStrings #-}

module StopWords where

-- Taken from http://xpo6.com/list-of-english-stop-words/

stopWords :: [String]
stopWords = [
    -- ...
  ]

FrontPageItems.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

{-# LANGUAGE DeriveGeneric, QuasiQuotes, OverloadedStrings #-}

module FrontPageItems where

import GHC.Generics
import Control.Lens
import Control.Monad
import Control.Exception
import Control.Applicative
import Control.Concurrent
import Network.URI hiding (query)
import Data.Maybe
import Data.Either
import Data.Time
import Data.Time.Clock.POSIX
import Data.Text as DT
import Data.ByteString.Lazy
import Data.ByteString.Lazy.Char8
import Data.Aeson
import Data.Aeson.Types
import Network.Wreq
import Database.SQLite.Simple
import Database.SQLite.Simple.FromRow
import Text.XML.HXT.Core
import Text.HandsomeSoup
import NeatInterpolation as NI

import DB

data FrontPageItem = FrontPageItem {
    _id :: Integer
  , user :: String
  , points :: Integer
  , url :: String
  , title :: String
  , timestamp :: Integer
} deriving (Show, Generic)

instance FromRow FrontPageItem where
  fromRow = FrontPageItem <$> field <*> field <*> field <*> field <*> field <*> field

instance ToRow FrontPageItem where
  toRow (FrontPageItem _id user points url title timestamp) =
    toRow (_id, user, points, url, title, timestamp)

dbTableName :: String
dbTableName = "frontPageItems"

startDay :: Day
startDay = fromGregorian 2016 09 05 -- 2016 06 24

daysBack :: [Integer]
daysBack = [0..73]

frontPageItemsApi :: String
frontPageItemsApi = "https://news.ycombinator.com/front"

byteStringToString :: ByteString -> IO String
byteStringToString = return . Data.ByteString.Lazy.Char8.unpack

dayToInteger :: Day -> Integer
dayToInteger d = read (Prelude.init $ show $ utcTimeToPOSIXSeconds $ UTCTime d 0) :: Integer

getRawHtml :: Day -> Integer -> IO (Response ByteString)
getRawHtml day page = getWith opts frontPageItemsApi
  where
    opts =
      defaults
        & param "day" .~ [(DT.pack . showGregorian) day]
        & param "p" .~ [DT.pack $ show page]

processResponse :: Response ByteString -> IO ByteString
processResponse rbs = return (rbs ^. responseBody)

getFrontPageDays :: IO ()
getFrontPageDays = mapM_ processDay daysBack
  where
    processDay :: Integer -> IO ()
    processDay i = frontPageItems >>= mapM_ insertOrReplace
      where
        day = addDays ((-1) * i) startDay
        frontPageItems :: IO [FrontPageItem]
        frontPageItems =
          foldM (\acc page -> do
              Control.Monad.when (page > 1) $ threadDelay (30 * 1000000)
              print day
              result <- getFrontPageDay day page
              return (acc ++ result)
            ) [] [1..5]

getFrontPageDay :: Day -> Integer -> IO [FrontPageItem]
getFrontPageDay day page =
    getRawHtml day page >>= processResponse >>= byteStringToString >>= processHtml
  where
    processHtml :: String -> IO [FrontPageItem]
    processHtml htmlString = makeFrontPageItems
      where
        doc :: IOStateArrow s b XmlTree
        doc = readString [withParseHTML yes, withWarnings yes] htmlString
        tr = doc >>> css "tr"
        idUrlTitle =
          hasAttrValue "class" (== "athing")
          >>> hasAttr "id"
          >>> (
            getAttrValue "id"
            &&& (
              css "a"
              >>> hasAttrValue "class" (== "storylink")
              >>> (getAttrValue "href" &&& (getChildren >>> getText))
            )
          )
        userPoints =
          css "td"
          >>> hasAttrValue "class" (== "subtext")
          >>> multi ((
              css "a"
              >>> hasAttrValue "class" (== "hnuser")
              >>> getChildren
              >>> getText
            ) &&& (
              css "span"
              >>> hasAttrValue "class" (== "score")
              >>> getChildren
              >>> getText
            ))
        idUrlTitleParsed = runX $ tr >>> idUrlTitle
        userPointsParsed = runX $ tr >>> userPoints
        mergedParsed = do
          x <- idUrlTitleParsed
          y <- userPointsParsed
          return (Prelude.zip x y)
        makeFrontPageItem :: ((String, (String, String)), (String, String)) -> FrontPageItem
        makeFrontPageItem ((i, (ur, t)), (u, p)) =
          FrontPageItem {
              _id = read i :: Integer
            , user = u
            , points = read (Prelude.head $ Prelude.words p) :: Integer
            , url = ur
            , title = t
            , timestamp = dayToInteger day
          }
        makeFrontPageItems = fmap (Prelude.map makeFrontPageItem) mergedParsed

createDbTableIfNotExists :: IO ()
createDbTableIfNotExists = DB.createDbTableIfNotExists dbTableName attributes
  where
    attributes = DT.unpack [NI.text|
        _id INTEGER PRIMARY KEY
        , user TEXT
        , points INTEGER
        , url TEXT
        , title TEXT
        , timestamp INTEGER
      |]

insertOrReplace :: FrontPageItem -> IO ()
insertOrReplace FrontPageItem {
    _id = i
  , user = u
  , points = p
  , title = t
  , url = url'
  , timestamp = ts
} = void insertOrReplaceTryResult
  where
    tryInsertOrReplace =
      try (withConnection' insertOrReplace) :: IO (Either SomeException ())
    insertOrReplaceTryResult = tryInsertOrReplace >>= either (void . print) return
    insertOrReplace :: Connection -> IO ()
    insertOrReplace con =
      execute con
        "REPLACE INTO ? (\
        \ _id \
        \ , user \
        \ , points \
        \ , url \
        \ , title \
        \ , timestamp \
        \ ) values (?, ?, ?, ?, ?, ?);"
        (dbTableName, i, u, p, url', t, ts)

getAllInDb :: IO [FrontPageItem]
getAllInDb = withConnection' queryForItems
  where
    queryForItems :: Connection -> IO [FrontPageItem]
    queryForItems con = query con "SELECT * FROM ?;" (Only dbTableName)

HackerNewsItems.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

{-# LANGUAGE DeriveGeneric, QuasiQuotes, OverloadedStrings #-}

module HackerNewsItems where

import GHC.Generics
import Control.Lens
import Control.Monad
import Control.Exception
import Control.Applicative
import Network.URI hiding (query)
import Data.Maybe
import Data.Either
import Data.Text as DT
import Data.ByteString.Lazy
import Data.Aeson
import Data.Aeson.Types
import Network.Wreq
import Database.SQLite.Simple
import Database.SQLite.Simple.FromRow
import NeatInterpolation as NI

import DB
import qualified FrontPageItems as FPI

data HackerNewsItem = HackerNewsItem {
      _id   :: Integer
    , by    :: String
    , score :: Integer
    , time  :: Integer
    , title :: String
    , typee :: String
    , url   :: String
  } deriving (Show, Generic)

instance FromJSON HackerNewsItem where
  parseJSON (Object v) =
    HackerNewsItem <$>
          v .: "id"
      <*> v .: "by"
      <*> v .: "score"
      <*> v .: "time"
      <*> v .: "title"
      <*> v .: "type"
      <*> v .: "url"

instance FromRow HackerNewsItem where
  fromRow =
    HackerNewsItem <$>
          field
      <*> field
      <*> field
      <*> field
      <*> field
      <*> field
      <*> field

instance ToRow HackerNewsItem where
  toRow (HackerNewsItem _id by score time title typee url) =
    toRow (_id, by, score, time, title, typee, url)

dbTableName :: String
dbTableName = "hackerNewsItems"

topStoriesApi :: String
topStoriesApi = "https://hacker-news.firebaseio.com/v0/topstories.json"

itemsApi :: Integer -> String
itemsApi itemId =
  Prelude.concat [
      "https://hacker-news.firebaseio.com/v0/item/"
    , show itemId
    , ".json"
  ]

getTopStoryItemIds :: IO (Maybe [Integer])
getTopStoryItemIds = do
  r <- get topStoriesApi
  return (decode (r ^. responseBody) :: Maybe [Integer])

getItems :: Maybe [Integer] -> IO [Maybe HackerNewsItem]
getItems = maybe (return []) (mapM getItem)

getItem :: Integer -> IO (Maybe HackerNewsItem)
getItem itemId = do
  print itemId
  r <- try (get $ itemsApi itemId) :: IO (Either SomeException (Response ByteString))
  case r of
    Left e   -> print e >> return Nothing
    Right bs -> return (decode (bs ^. responseBody) :: Maybe HackerNewsItem)

getItemMaybe :: Maybe Integer -> IO (Maybe HackerNewsItem)
getItemMaybe = maybe (return Nothing) getItem

createDbTableIfNotExists :: IO ()
createDbTableIfNotExists = DB.createDbTableIfNotExists dbTableName attributes
  where
    attributes = DT.unpack [NI.text|
        _id INTEGER PRIMARY KEY
      , by TEXT
      , score INTEGER
      , time INTEGER
      , title TEXT
      , typee TEXT
      , url TEXT
    |]

getItemInDb :: Integer -> IO (Maybe HackerNewsItem)
getItemInDb itemId = withConnection' queryForItem >>= firstItem
  where
    queryForItem :: Connection -> IO [HackerNewsItem]
    queryForItem con =
      query con "SELECT * FROM ? where _id = ? LIMIT 1;" (dbTableName, itemId) :: IO [HackerNewsItem]
    firstItem :: [HackerNewsItem] -> IO (Maybe HackerNewsItem)
    firstItem (x : y) = return (Just x)
    firstItem []      = return Nothing

getUrlsInDb :: IO [String]
getUrlsInDb = withConnection' queryForUrls >>= urls
  where
    queryForUrls :: Connection -> IO [[String]]
    queryForUrls con =
      query con "SELECT url FROM ?;" (Only dbTableName) :: IO [[String]]
    urls :: [[String]] -> IO [String]
    urls []      = return []
    urls (x : y) = return (Prelude.map extractUrl (x : y))
    extractUrl :: [String] -> String
    extractUrl []      = ""
    extractUrl (x : y) = x

getAllInDb :: IO [HackerNewsItem]
getAllInDb = withConnection' queryForItems
  where
    queryForItems :: Connection -> IO [HackerNewsItem]
    queryForItems con = query con "SELECT * FROM ?;" (Only dbTableName)

getStoriesInDb :: IO [HackerNewsItem]
getStoriesInDb = withConnection' queryForItems
  where
    queryForItems :: Connection -> IO [HackerNewsItem]
    queryForItems con =
      query
        con
        "SELECT * FROM ? where typee = 'story' and _id in (select _id from ?);"
        ( dbTableName
        , FPI.dbTableName
        )

existsInDb :: Integer -> IO Bool
existsInDb itemId = getItemInDb itemId >>= maybe (return False) (\ _ -> return True)

idIfNotInDb :: Integer -> IO (Maybe Integer)
idIfNotInDb itemId =
  existsInDb itemId >>= \ exists ->
    if exists then return Nothing else return (Just itemId)

insertOrReplaceMaybe :: Maybe HackerNewsItem -> IO ()
insertOrReplaceMaybe = maybe (return ()) insertOrReplace

insertOrReplace :: HackerNewsItem -> IO ()
insertOrReplace HackerNewsItem {
      _id   = i
    , by    = b
    , score = s
    , time  = t
    , title = title'
    , typee = typee'
    , url   = u
  } = withConnection' insertOrReplace
  where
    insertOrReplace :: Connection -> IO ()
    insertOrReplace con =
      execute
        con
        "REPLACE INTO ? (\
        \  _id\
        \, by\
        \, score\
        \, time\
        \, title\
        \, typee\
        \, url\
        \) values (?, ?, ?, ?, ?, ?, ?);"
        ( dbTableName
        , i
        , b
        , s
        , t
        , title'
        , typee'
        , u
        )

AlgoliaHits.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

{-# LANGUAGE DeriveGeneric, QuasiQuotes, OverloadedStrings #-}

module AlgoliaHits where

import GHC.Generics
import Control.Lens
import Control.Monad
import Control.Exception
import Control.Applicative
import Network.URI hiding (query)
import Data.Maybe
import Data.List as DL
import Data.Text as DT
import Data.ByteString.Lazy
import Data.Aeson
import Data.Aeson.Types
import Text.EditDistance
import Network.Wreq
import Database.SQLite.Simple
import Database.SQLite.Simple.FromRow
import NeatInterpolation as NI

import DB
import URIUtil
import qualified Common as CO
import qualified HackerNewsItems as HNI

data AlgoliaHit = AlgoliaHit {
      _id               :: Integer
    , author            :: String
    , points            :: Integer
    , createdAt         :: Integer
    , title             :: String
    , url               :: Maybe String
    , hackerNewsItemsId :: Integer
  } deriving (Show, Generic)

data AlgoliaHits = AlgoliaHits { hits :: [AlgoliaHit] } deriving (Show, Generic)

instance FromJSON AlgoliaHit where
  parseJSON = withObject "hit" $ \ o -> do
    _id'      <- o .: "objectID"
    author    <- o .: "author"
    points    <- o .: "points"
    createdAt <- o .: "created_at_i"
    title     <- o .: "title"
    url       <- o .: "url"
    let _id'' = read _id' :: Integer
    return (AlgoliaHit _id'' author points createdAt title url 0)

instance FromJSON AlgoliaHits where
  parseJSON (Object v) = AlgoliaHits <$> v .: "hits"

instance FromRow AlgoliaHit where
  fromRow =
    AlgoliaHit <$>
          field
      <*> field
      <*> field
      <*> field
      <*> field
      <*> field
      <*> field

instance ToRow AlgoliaHit where
  toRow (AlgoliaHit _id author points createdAt title url hackerNewsItemsId) =
    toRow (_id, author, points, createdAt, title, url, hackerNewsItemsId)

dbTableName :: String
dbTableName = "algoliaHits"

algoliaSearchApi :: String
algoliaSearchApi = "http://hn.algolia.com/api/v1/search"

getAlgoliaHits :: HNI.HackerNewsItem -> IO [AlgoliaHit]
getAlgoliaHits hni = do
  exists' <- urlExistsInDb url'
  if exists'
    then print ("Skipping " ++ url') >> return []
    else do
      print url'
      r <- try (getWith opts algoliaSearchApi) :: IO (Either SomeException (Response ByteString))
      case r of
        Left e   -> print e >> return []
        Right bs -> do
          let json = bs ^. responseBody
          let eitherAlgoliaHits = eitherDecode json :: Either String AlgoliaHits
          case eitherAlgoliaHits of
            Left s   -> print s >> return []
            Right ah ->
              return $ Prelude.map (\ h -> h { hackerNewsItemsId = foreignId }) (hits ah)
  where
    url'      = HNI.url hni
    foreignId = HNI._id hni
    opts =
        defaults
      & param "query" .~ [DT.pack $ uriNoQuery url']
      & param "tags"  .~ ["story"]

createDbTableIfNotExists :: IO ()
createDbTableIfNotExists = DB.createDbTableIfNotExists dbTableName attributes
  where
    attributes = DT.unpack [NI.text|
        _id INTEGER PRIMARY KEY
      , author TEXT
      , points INTEGER
      , createdAt INTEGER
      , title TEXT
      , url TEXT
      , hackerNewsItemsId INTEGER
      , FOREIGN KEY ('${hniDbTableName}Id') REFERENCES ${hniDbTableName}(_id)
    |]
      where
        hniDbTableName = DT.pack HNI.dbTableName

getHitInDb :: Integer -> IO (Maybe AlgoliaHit)
getHitInDb hitId = withConnection' queryForHit >>= firstHit
  where
    queryForHit :: Connection -> IO [AlgoliaHit]
    queryForHit con =
      query con "SELECT * FROM ? WHERE _id = ? LIMIT 1;" (dbTableName, hitId) :: IO [AlgoliaHit]
    firstHit :: [AlgoliaHit] -> IO (Maybe AlgoliaHit)
    firstHit (x : y) = return (Just x)
    firstHit []      = return Nothing

getHitsWithUrlInDb :: String -> IO [AlgoliaHit]
getHitsWithUrlInDb url' = withConnection' queryForHits
  where
    queryForHits :: Connection -> IO [AlgoliaHit]
    queryForHits con =
      query con "SELECT * FROM ? WHERE url LIKE ?;" (dbTableName, url' ++ "%") :: IO [AlgoliaHit]

getUrlMatchesInDb :: IO [(Integer, Integer, Maybe String, Maybe String)]
getUrlMatchesInDb = do
  result <- withConnection' queryDB
  let matches =
        Prelude.filter
          ((/= -1) . CO.fst')
          (Prelude.map
            (\ (ahId, hniId, ahUrl, hniUrl) ->
              if match ahUrl hniUrl
                then (ahId, hniId, ahUrl, hniUrl)
                else (-1, -1, Just "", Just "")
            )
            result
          )
  return matches
  where
    queryDB :: Connection -> IO [(Integer, Integer, Maybe String, Maybe String)]
    queryDB con =
      query_ con (Query $ sqlString (DT.pack dbTableName) (DT.pack HNI.dbTableName))
    sqlString dbTableName hniDbTableName = [NI.text|
      SELECT ah._id, hni._id, ah.url, hni.url
      FROM ${dbTableName} AS ah
      INNER JOIN ${hniDbTableName} AS hni
      ON ah.${hniDbTableName}Id = hni._id
    |]
    match :: Maybe String -> Maybe String -> Bool
    match Nothing Nothing = True
    match Nothing _       = False
    match _       Nothing = False
    match (Just ahUrl) (Just hniUrl) =
         tldDist       == 0
      && domainDist    == 0
      && subDomainDist <  10
      && pathDist      <  2
      && queryDist     == 0
      where
        ahUri  = uri $ cleanUri ahUrl
        hniUri = uri $ cleanUri hniUrl
        [ahTLD, hniTLD]             = Prelude.map uriTLD       [ahUri, hniUri]
        [ahDomain, hniDomain]       = Prelude.map uriDomain    [ahUri, hniUri]
        [ahSubdomain, hniSubdomain] = Prelude.map uriSubdomain [ahUri, hniUri]
        [ahPath, hniPath]           = Prelude.map uriPath      [ahUri, hniUri]
        [ahQuery, hniQuery]         = Prelude.map uriQuery     [ahUri, hniUri]
        ahQuery'  = if CO.notNull ahPath  then "" else ahQuery
        hniQuery' = if CO.notNull hniPath then "" else hniQuery
        [tldDist, domainDist, subDomainDist, pathDist, queryDist] =
          Prelude.map
            (uncurry (levenshteinDistance defaultEditCosts))
            [ (ahTLD, hniTLD)
            , (ahDomain, hniDomain)
            , (ahSubdomain, hniSubdomain)
            , (ahPath, hniPath)
            , (ahQuery', hniQuery')
            ]

algoliaHitsIdsFromUrlMatches :: IO [Integer]
algoliaHitsIdsFromUrlMatches = getUrlMatchesInDb >>= mapM (return . CO.fst')

fixHitsForeignIds :: HNI.HackerNewsItem -> IO ()
fixHitsForeignIds hni = hits' >>= mapM_ insertOrReplace
  where
    foreignId = HNI._id hni
    updateForeignId h = do
      let h' = h { hackerNewsItemsId = foreignId }
      print h'
      return h'
    hits' = getHitsWithUrlInDb (HNI.url hni) >>= mapM updateForeignId

existsInDb :: Integer -> IO Bool
existsInDb hitId = getHitInDb hitId >>= maybe (return False) (\ _ -> return True)

urlExistsInDb :: String -> IO Bool
urlExistsInDb url =
  ( try (withConnection' queryForUrl >>= exists) :: IO (Either SomeException Bool)
  ) >>= either (\ e -> print e >> return False) return
  where
    queryForUrl :: Connection -> IO [[String]]
    queryForUrl con =
      query con "SELECT url FROM ? WHERE url LIKE ? LIMIT 1;" (dbTableName, url ++ "%") :: IO [[String]]
    exists :: [[String]] -> IO Bool
    exists []      = return False
    exists (x : y) = if Prelude.null x then return False else return True

idIfNotInDb :: Integer -> IO (Maybe Integer)
idIfNotInDb hitId =
  existsInDb hitId >>= \ exists ->
    if exists then return Nothing else return (Just hitId)

insertOrReplaceMaybe :: Maybe AlgoliaHit -> IO ()
insertOrReplaceMaybe = maybe (return ()) insertOrReplace

insertOrReplace :: AlgoliaHit -> IO ()
insertOrReplace AlgoliaHit {
      _id               = i
    , author            = a
    , points            = p
    , createdAt         = c
    , title             = t
    , url               = u
    , hackerNewsItemsId = hniId
  } = void (try (withConnection' insertOrReplace) :: IO (Either SomeException ()))
  where
    insertOrReplace :: Connection -> IO ()
    insertOrReplace con =
      execute
        con
        "REPLACE INTO ? (\
        \  _id\
        \, author\
        \, points\
        \, createdAt\
        \, title\
        \, url\
        \, hackerNewsItemsId\
        \) values (?, ?, ?, ?, ?, ?, ?);"
        ( dbTableName
        , i
        , a
        , p
        , c
        , t
        , u
        , hniId
        )
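The heart of this module is the `match` function: two URLs are considered the same piece of content when their TLD, domain, and query string match exactly while the subdomain and path are allowed to differ slightly (measured by Levenshtein edit distance). Here is a minimal, standalone sketch of that idea. The Levenshtein distance is inlined (the module itself uses the `text-edit-distance` package), `closeEnough` is a hypothetical stand-in for `match` that takes pre-split URL parts, and the example parts are made up for illustration.

```haskell
-- Self-contained sketch of the fuzzy URL comparison used by `match` above.
-- The Levenshtein distance is inlined here instead of using the
-- text-edit-distance package; `closeEnough` and its tuple inputs are
-- hypothetical illustrations, not part of the module.

-- Classic dynamic-programming Levenshtein distance over two strings.
levenshtein :: String -> String -> Int
levenshtein s t = last (foldl transform [0 .. length s] t)
  where
    transform xs@(x : xs') c = scanl compute (x + 1) (zip3 s xs xs')
      where
        compute z (c', x', y') = minimum [y' + 1, z + 1, x' + fromEnum (c /= c')]

-- Mirror of the thresholds in `match`: TLD, domain, and query must match
-- exactly; the subdomain and path may differ a little.
closeEnough :: (String, String, String, String, String)
            -> (String, String, String, String, String)
            -> Bool
closeEnough (tld1, dom1, sub1, path1, q1) (tld2, dom2, sub2, path2, q2) =
     levenshtein tld1  tld2  == 0
  && levenshtein dom1  dom2  == 0
  && levenshtein sub1  sub2  <  10
  && levenshtein path1 path2 <  2
  && levenshtein q1    q2    == 0

main :: IO ()
main = do
  -- A "www" vs "m" subdomain and a trailing-slash path difference still match.
  print (closeEnough ("us", "nautil", "www", "/issue/35/boundaries",  "")
                     ("us", "nautil", "m",   "/issue/35/boundaries/", ""))
  -- A different domain does not.
  print (closeEnough ("us",  "nautil",  "www", "/a", "")
                     ("com", "example", "www", "/a", ""))
```

The loose subdomain threshold is what lets `www.example.com` and `example.com` (or a mobile subdomain) count as reposts of the same URL, while the strict domain and query checks keep genuinely different pages apart.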

Elasticsearch.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

{-# LANGUAGE OverloadedStrings #-}

module Elasticsearch where

import GHC.Generics
import Control.Lens hiding ((.=))
import Control.Monad
import Network.URI hiding (query)
import Data.Maybe
import Data.Text as DT
import Data.ByteString.Lazy
import Data.Aeson
import Data.Aeson.Types
import Data.Map
import Network.Wreq as W

import DB
import DateTimeUtil as DTU
import qualified FrontPageItems as FPI
import qualified HackerNewsItems as HNI
import qualified AlgoliaHits as AH
import qualified FrontsAndBacks as FAB
import qualified StopWords as SW

instance ToJSON HNI.HackerNewsItem where
  toJSON hni =
    object [
        "id"     .= HNI._id hni
      , "user"   .= HNI.by hni
      , "points" .= HNI.score hni
      , "url"    .= HNI.url hni
      , "title"  .= HNI.title hni
      , "date"   .= (1000 * HNI.time hni)
      , "month"  .= DTU.monthOfYearEdtZonedTime (HNI.time hni)
      , "day"    .= DTU.dayOfWeekEdtZonedTime (HNI.time hni)
      , "hour"   .= DTU.hourOfDayEdtZonedTime (HNI.time hni)
      , "minute" .= DTU.minOfHourEdtZonedTime (HNI.time hni)
    ]

instance ToJSON AH.AlgoliaHit where
  toJSON ah =
    object [
        "id"      .= AH._id ah
      , "user"    .= AH.author ah
      , "points"  .= AH.points ah
      , "url"     .= AH.url ah
      , "title"   .= AH.title ah
      , "date"    .= (1000 * AH.createdAt ah)
      , "month"   .= DTU.monthOfYearEdtZonedTime (AH.createdAt ah)
      , "day"     .= DTU.dayOfWeekEdtZonedTime (AH.createdAt ah)
      , "hour"    .= DTU.hourOfDayEdtZonedTime (AH.createdAt ah)
      , "minute"  .= DTU.minOfHourEdtZonedTime (AH.createdAt ah)
      , "frontId" .= AH.hackerNewsItemsId ah
    ]

elasticsearchLocation :: String
elasticsearchLocation = "http://localhost:9200/"

elasticsearchIndexName :: String
elasticsearchIndexName = "hackernewsfrontback/"

elasticsearchIndex :: String
elasticsearchIndex = elasticsearchLocation ++ elasticsearchIndexName

deleteIndex :: IO ()
deleteIndex = void $ W.delete elasticsearchIndex

createIndex :: IO ()
createIndex = void (post elasticsearchIndex json)
  where
    json =
      object [
        "mappings" .= object [
            "front" .= object [
              "properties" .= object [
                  "id"     .= object [ "type" .= ("integer" :: String), "index" .= ("not_analyzed" :: String) ]
                , "user"   .= object [ "type" .= ("string" :: String),  "index" .= ("not_analyzed" :: String) ]
                , "points" .= object [ "type" .= ("integer" :: String) ]
                , "url"    .= object [ "type" .= ("string" :: String),  "index" .= ("not_analyzed" :: String) ]
                , "title"  .= object [ "type" .= ("string" :: String),  "analyzer" .= ("english" :: String) ]
                , "date"   .= object [ "type" .= ("date" :: String) ]
                , "month"  .= object [ "type" .= ("integer" :: String) ]
                , "day"    .= object [ "type" .= ("integer" :: String) ]
                , "hour"   .= object [ "type" .= ("integer" :: String) ]
                , "minute" .= object [ "type" .= ("integer" :: String) ]
              ]
            ]
          , "back" .= object [
                "_parent" .= object [ "type" .= ("front" :: String) ]
              , "properties" .= object [
                    "id"      .= object [ "type" .= ("integer" :: String), "index" .= ("not_analyzed" :: String) ]
                  , "user"    .= object [ "type" .= ("string" :: String),  "index" .= ("not_analyzed" :: String) ]
                  , "points"  .= object [ "type" .= ("integer" :: String) ]
                  , "url"     .= object [ "type" .= ("string" :: String),  "index" .= ("not_analyzed" :: String) ]
                  , "title"   .= object [ "type" .= ("string" :: String),  "analyzer" .= ("english" :: String) ]
                  , "date"    .= object [ "type" .= ("date" :: String) ]
                  , "month"   .= object [ "type" .= ("integer" :: String) ]
                  , "day"     .= object [ "type" .= ("integer" :: String) ]
                  , "hour"    .= object [ "type" .= ("integer" :: String) ]
                  , "minute"  .= object [ "type" .= ("integer" :: String) ]
                  , "frontId" .= object [ "type" .= ("integer" :: String), "index" .= ("not_analyzed" :: String) ]
                ]
            ]
        ]
      ]

indexFronts :: IO ()
indexFronts = FAB.getFrontsInDb >>= mapM_ index
  where
    index x = put (elasticsearchIndex ++ "front/" ++ _id x) $ toJSON x
    _id = show . HNI._id

indexBacks :: IO ()
indexBacks = FAB.getBacksInDb >>= mapM_ index
  where
    index x = putWith (opts x) (elasticsearchIndex ++ "back/" ++ _id x) $ toJSON x
    opts x = defaults & param "parent" .~ [DT.pack $ show $ AH.hackerNewsItemsId x]
    _id = show . AH._id

createAndFillIndex :: IO ()
createAndFillIndex = deleteIndex >> createIndex >> indexFronts >> indexBacks

--------------------------------------------------------------------------------

runEsAgg :: Value -> IO Value
runEsAgg v = postWith opts (elasticsearchIndex ++ "_search") v >>= parseResponse
  where
    opts = defaults & param "search_type" .~ ["count"]
    parseResponse :: Response ByteString -> IO Value
    parseResponse r =
      return (fromMaybe (object []) (decode (r ^. responseBody) :: Maybe Value))

parseAgg :: (Value -> Parser [a]) -> Value -> Either String [a]
parseAgg = parseEither

aggResult :: (Value -> Either String [a]) -> IO Value -> IO [a]
aggResult f = fmap (either (const []) id . f)

keyDocCountParser :: FromJSON a => Value -> Parser (a, Integer)
keyDocCountParser =
  withObject "keyDocCount" (\ obj -> do
      k <- obj .: "key"
      v <- obj .: "doc_count"
      return (k, v)
    )

--------------------------------------------------------------------------------

esAggDayHour :: IO Value
esAggDayHour = runEsAgg json
  where
    json =
      object [
        "aggs" .= object [
          "type" .= object [
              "terms" .= object [
                  "field" .= ("_type" :: String)
                , "size"  .= (2 :: Integer)
                , "order" .= object [ "_term" .= ("asc" :: String) ]
              ]
            , "aggs" .= object [
                  "day" .= object [
                    "terms" .= object [
                        "field" .= ("day" :: String)
                      , "size"  .= (7 :: Integer)
                      , "order" .= object [ "_term" .= ("asc" :: String) ]
                    ]
                  ]
                , "hour" .= object [
                    "terms" .= object [
                        "field" .= ("hour" :: String)
                      , "size"  .= (24 :: Integer)
                      , "order" .= object [ "_term" .= ("asc" :: String) ]
                    ]
                  ]
              ]
          ]
        ]
      ]

data AggDayHourResult = AggDayHourResult {
      aggDayHourTypeKey :: String
    , aggDay            :: [(Integer, Integer)]
    , aggHour           :: [(Integer, Integer)]
  } deriving (Show)

aggDayHourParser :: Value -> Parser [AggDayHourResult]
aggDayHourParser =
  withObject "agg" (\ obj -> do
      aggs  <- obj  .: "aggregations"
      typee <- aggs .: "type"
      (bucket1 : bucket2 : y) <- typee .: "buckets"
      bucket1Key          <- (bucket1 .: "key") :: Parser String
      dayBucket1          <- bucket1 .: "day"
      dayBucket1Buckets   <- dayBucket1 .: "buckets"
      dayBucket1Buckets'  <- mapM keyDocCountParser dayBucket1Buckets :: Parser [(Integer, Integer)]
      hourBucket1         <- bucket1 .: "hour"
      hourBucket1Buckets  <- hourBucket1 .: "buckets"
      hourBucket1Buckets' <- mapM keyDocCountParser hourBucket1Buckets :: Parser [(Integer, Integer)]
      bucket2Key          <- (bucket2 .: "key") :: Parser String
      dayBucket2          <- bucket2 .: "day"
      dayBucket2Buckets   <- dayBucket2 .: "buckets"
      dayBucket2Buckets'  <- mapM keyDocCountParser dayBucket2Buckets :: Parser [(Integer, Integer)]
      hourBucket2         <- bucket2 .: "hour"
      hourBucket2Buckets  <- hourBucket2 .: "buckets"
      hourBucket2Buckets' <- mapM keyDocCountParser hourBucket2Buckets :: Parser [(Integer, Integer)]
      return [
          AggDayHourResult { aggDayHourTypeKey = bucket1Key, aggDay = dayBucket1Buckets', aggHour = hourBucket1Buckets' }
        , AggDayHourResult { aggDayHourTypeKey = bucket2Key, aggDay = dayBucket2Buckets', aggHour = hourBucket2Buckets' }
      ]
    )

parseAggDayHour :: Value -> Either String [AggDayHourResult]
parseAggDayHour = parseAgg aggDayHourParser

aggDayHourResult :: IO [AggDayHourResult]
aggDayHourResult = aggResult parseAggDayHour esAggDayHour

--------------------------------------------------------------------------------

esAggTitle :: IO Value
esAggTitle = runEsAgg json
  where
    json =
      object [
        "aggs" .= object [
          "type" .= object [
              "terms" .= object [
                  "field" .= ("_type" :: String)
                , "size"  .= (2 :: Integer)
                , "order" .= object [ "_term" .= ("asc" :: String) ]
              ]
            , "aggs" .= object [
                "title" .= object [
                  "terms" .= object [
                      "field"         .= ("title" :: String)
                    , "size"          .= (10000000 :: Integer)
                    , "min_doc_count" .= (1 :: Integer)
                    , "exclude"       .= SW.stopWords
                    , "order"         .= object [ "_term" .= ("asc" :: String) ]
                  ]
                ]
              ]
          ]
        ]
      ]

data AggTitleResult = AggTitleResult {
      aggTitleTypeKey :: String
    , aggTitle        :: [(String, Integer)]
  } deriving (Show)

aggTitleParser :: Value -> Parser [AggTitleResult]
aggTitleParser =
  withObject "agg" (\ obj -> do
      aggs  <- obj  .: "aggregations"
      typee <- aggs .: "type"
      (bucket1 : bucket2 : y) <- typee .: "buckets"
      bucket1Key           <- (bucket1 .: "key") :: Parser String
      titleBucket1         <- bucket1 .: "title"
      titleBucket1Buckets  <- titleBucket1 .: "buckets"
      titleBucket1Buckets' <- mapM keyDocCountParser titleBucket1Buckets :: Parser [(String, Integer)]
      bucket2Key           <- (bucket2 .: "key") :: Parser String
      titleBucket2         <- bucket2 .: "title"
      titleBucket2Buckets  <- titleBucket2 .: "buckets"
      titleBucket2Buckets' <- mapM keyDocCountParser titleBucket2Buckets :: Parser [(String, Integer)]
      return [
          AggTitleResult { aggTitleTypeKey = bucket1Key, aggTitle = titleBucket1Buckets' }
        , AggTitleResult { aggTitleTypeKey = bucket2Key, aggTitle = titleBucket2Buckets' }
      ]
    )

parseAggTitle :: Value -> Either String [AggTitleResult]
parseAggTitle = parseAgg aggTitleParser

aggTitleResult :: IO [AggTitleResult]
aggTitleResult = aggResult parseAggTitle esAggTitle

--------------------------------------------------------------------------------

esAggUser :: IO Value
esAggUser = runEsAgg json
  where
    json =
      object [
        "aggs" .= object [
          "type" .= object [
              "terms" .= object [
                  "field" .= ("_type" :: String)
                , "size"  .= (2 :: Integer)
                , "order" .= object [ "_term" .= ("asc" :: String) ]
              ]
            , "aggs" .= object [
                "user" .= object [
                  "terms" .= object [
                      "field"         .= ("user" :: String)
                    , "size"          .= (10000000 :: Integer)
                    , "min_doc_count" .= (1 :: Integer)
                    , "order"         .= object [ "_count" .= ("desc" :: String) ]
                  ]
                ]
              ]
          ]
        ]
      ]

data AggUserResult = AggUserResult {
      aggUserTypeKey :: String
    , aggUser        :: [(String, Integer)]
  } deriving (Show)

aggUserParser :: Value -> Parser [AggUserResult]
aggUserParser =
  withObject "agg" (\ obj -> do
      aggs  <- obj  .: "aggregations"
      typee <- aggs .: "type"
      (bucket1 : bucket2 : y) <- typee .: "buckets"
      bucket1Key          <- (bucket1 .: "key") :: Parser String
      userBucket1         <- bucket1 .: "user"
      userBucket1Buckets  <- userBucket1 .: "buckets"
      userBucket1Buckets' <- mapM keyDocCountParser userBucket1Buckets :: Parser [(String, Integer)]
      bucket2Key          <- (bucket2 .: "key") :: Parser String
      userBucket2         <- bucket2 .: "user"
      userBucket2Buckets  <- userBucket2 .: "buckets"
      userBucket2Buckets' <- mapM keyDocCountParser userBucket2Buckets :: Parser [(String, Integer)]
      return [
          AggUserResult { aggUserTypeKey = bucket1Key, aggUser = userBucket1Buckets' }
        , AggUserResult { aggUserTypeKey = bucket2Key, aggUser = userBucket2Buckets' }
      ]
    )

parseAggUser :: Value -> Either String [AggUserResult]
parseAggUser = parseAgg aggUserParser

aggUserResult :: IO [AggUserResult]
aggUserResult = aggResult parseAggUser esAggUser

FrontsAndBacks.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

{-# LANGUAGE QuasiQuotes, OverloadedStrings #-}

module FrontsAndBacks where

import GHC.Generics
import Control.Lens
import Control.Monad
import Control.Exception
import Control.Applicative
import Network.URI hiding (query)
import Data.Maybe
import Data.Either
import Data.List as DL
import Data.Text as DT
import Data.ByteString.Lazy
import Data.Aeson
import Data.Aeson.Types
import Database.SQLite.Simple
import Database.SQLite.Simple.FromRow
import NeatInterpolation as NI

import DB
import qualified FrontPageItems as FPI
import qualified HackerNewsItems as HNI
import qualified AlgoliaHits as AH

backsSqlSelectionString :: IO String
backsSqlSelectionString = fmap makeSqlString AH.algoliaHitsIdsFromUrlMatches
  where
    makeSqlString ids = DT.unpack $ [NI.text|
      SELECT * FROM ${ahDbTableName}
      WHERE _id NOT IN (
        SELECT _id FROM ${fpiDbTableName}
      )
      AND url in (
        SELECT url FROM ${ahDbTableName} GROUP BY url HAVING COUNT(url) > 1
      )
      AND points <= 3
      AND LENGTH(url) > 0
      AND _id in (
        SELECT ah._id
        FROM ${ahDbTableName} AS ah
        INNER JOIN ${hniDbTableName} AS hni
        ON ah.${hniDbTableName}Id = hni._id
        WHERE ah.createdAt <= hni.time
      )
      AND _id IN (
        ${ids'}
      )
      AND ${hniDbTableName}Id NOT IN (
        SELECT ah.${hniDbTableName}Id
        FROM ${ahDbTableName} AS ah
        INNER JOIN ${hniDbTableName} AS hni
        ON ah.${hniDbTableName}Id = hni._id
        WHERE ah._id != hni._id
        AND ah.createdAt != hni.time
        AND ah.points > 3
        AND LENGTH(ah.url) > 0
      )
    |]
      where
        ids' = DT.pack $ DL.intercalate "," (Prelude.map show ids)
        [ahDbTableName, hniDbTableName, fpiDbTableName] =
          Prelude.map DT.pack [
              AH.dbTableName
            , HNI.dbTableName
            , FPI.dbTableName
          ]

getBacksInDb :: IO [AH.AlgoliaHit]
getBacksInDb = withConnection' queryForHits
  where
    queryForHits :: Connection -> IO [AH.AlgoliaHit]
    queryForHits con =
      backsSqlSelectionString >>= (\ sql ->
          query_ con (Query $ DT.pack (sql ++ ";"))
        ) :: IO [AH.AlgoliaHit]

getFrontsInDb :: IO [HNI.HackerNewsItem]
getFrontsInDb = withConnection' queryForItems
  where
    queryForItems :: Connection -> IO [HNI.HackerNewsItem]
    queryForItems con =
      backsSqlSelectionString >>= (\ backSqlString ->
          query_ con (Query $ makeSqlString $ DT.pack backSqlString)
        )
    makeSqlString backSqlString = [NI.text|
      SELECT * FROM ${hniDbTableName}
      WHERE _id IN (
        SELECT ${hniDbTableName}Id FROM (${backSqlString})
      );
    |]
      where
        hniDbTableName = DT.pack HNI.dbTableName

SentimentApi.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -}

{-# LANGUAGE DeriveGeneric, OverloadedStrings #-}

module SentimentApi where

import GHC.Generics
import Control.Lens hiding ((.=))
import Control.Monad
import Control.Exception
import Control.Applicative
import Network.URI hiding (query)
import Data.Maybe
import Data.Either
import Data.Text
import Data.ByteString.Lazy
import Data.Aeson
import Data.Aeson.Types
import Network.Wreq

-- https://github.com/mikelynn2/sentimentAPI

data SentimentApiResult = SentimentApiResult {
      score  :: Double
    , result :: String
  } deriving (Show, Generic)

instance FromJSON SentimentApiResult where
  parseJSON (Object v) =
    SentimentApiResult <$>
          v .: "score"
      <*> v .: "result"

-- curl -H "Content-Type: application/json" -X POST -d '{"text":"text"}' http://127.0.0.1:8000/api/sentiment/v1

getSentimentApiResults :: [String] -> IO [SentimentApiResult]
getSentimentApiResults []      = return []
getSentimentApiResults (x : y) = fmap catMaybes (mapM getSentimentApiResult (x : y))

getSentimentApiResult :: String -> IO (Maybe SentimentApiResult)
getSentimentApiResult s = post apiUrl json >>= parseResponse
  where
    apiUrl = "http://localhost:8000/api/sentiment/v1"
    json = object [ "text" .= (s :: String) ]
    parseResponse :: Response ByteString -> IO (Maybe SentimentApiResult)
    parseResponse r =
      return (decode (r ^. responseBody) :: Maybe SentimentApiResult)

TimeAnalysis.hs

{- David Lettier (C) 2016 http://www.lettier.com/ -} {-# LANGUAGE OverloadedStrings #-} module TimeAnalysis where import GHC.Generics import GHC.OldList as GOL import Text.Printf import Data.Maybe import Data.List as DL import Data.Default.Class import Data.Colour import Data.Colour.Names import Graphics.Rendering.Chart import Graphics.Rendering.Chart.Easy import Graphics.Rendering.Chart.Backend.Cairo import qualified Elasticsearch as ES import qualified Common as CO degreesOfFreedom :: Int -> Int -> Int degreesOfFreedom rows cols = (rows - 1 ) * (cols - 1 ) expectedFreq :: Integer -> Integer -> Integer -> Double expectedFreq rowTotal tableTotal colTotal = (fromInteger rowTotal * fromInteger colTotal) / fromInteger tableTotal expectedFreqs :: Integer -> Integer -> [ Integer ] -> [ Double ] expectedFreqs _ _ [] = [] expectedFreqs rowTotal tableTotal colTotals = Prelude.map (expectedFreq rowTotal tableTotal) colTotals chiSquare :: [( Integer , Double )] -> Double chiSquare [] = - 1.0 chiSquare (x : y) = Prelude.sum ( Prelude.map (\ (o, e) -> ((fromInteger o - e) ** 2.0 ) / e ) (x : y) ) cramersV :: Integer -> Int -> Int -> Double -> ( Double , Int ) cramersV n rows cols chi = ((chi / (fromIntegral n * fromIntegral df)) ** ( 1 / 2 ), df) where df = min (rows - 1 ) (cols - 1 ) chiSquareHour :: IO ( Double , Int ) chiSquareHour = chiSquareAgg (hoursList . ES.aggHour) chiSquareDay :: IO ( Double , Int ) chiSquareDay = chiSquareAgg (daysList . 
ES.aggDay) chiSquareAgg :: ( ES.AggDayHourResult -> [ Integer ]) -> IO ( Double , Int ) chiSquareAgg f = do aggDayHourResult <- ES.aggDayHourResult if Prelude.length aggDayHourResult == 2 then do let backs = f $ Prelude.head aggDayHourResult let fronts = f $ Prelude.last aggDayHourResult let rowTotals = [Prelude.sum backs, Prelude.sum fronts] let colTotals = Prelude.zipWith ( + ) backs fronts let tableTotal = Prelude.sum rowTotals print tableTotal print (Prelude.sum colTotals) print rowTotals let observed = backs ++ fronts let backsExpected = expectedFreqs (Prelude.head rowTotals) tableTotal colTotals let frontsExpected = expectedFreqs (Prelude.last rowTotals) tableTotal colTotals printCountExpectedValueRow "Back" backs backsExpected printCountExpectedValueRow "Front" fronts frontsExpected print colTotals let expected = backsExpected ++ frontsExpected let expectedLtFive = Prelude.filter ( < 5.0 ) expected putStrLn $ "% Expected Values less than 5.0: " ++ show ( 100.0 * ( fromIntegral (Prelude.length expectedLtFive) / fromIntegral (Prelude.length expected) )) let observedExpected = Prelude.zip observed expected let numCols = Prelude.length colTotals let numRows = Prelude.length rowTotals let chi = chiSquare observedExpected let dof = degreesOfFreedom numRows numCols putStrLn $ "CramersV: " ++ show (cramersV tableTotal numRows numCols chi) return (chi, dof) else return ( - 1.0 , 0 ) where printCountExpectedValueRow t c ev = putStrLn $ t ++ "," ++ DL.intercalate "," ( Prelude.map (\(x,y) -> show x ++ " (" ++ (Text.Printf.printf "%.2f" y :: String ) ++ ")" ) $ Prelude.zip c ev ) timeList :: Int -> [ Integer ] -> [( Integer , Integer )] -> [ Integer ] timeList t _ [] = Prelude.take t [ 0 , 0 .. ] timeList t r (x : y) = Prelude.map (\ a -> maybe 0 snd (GOL.find (( == a) . fst) (x : y)) ) r daysList :: [( Integer , Integer )] -> [ Integer ] daysList = timeList 7 [ 1 .. 7 ] hoursList :: [( Integer , Integer )] -> [ Integer ] hoursList = timeList 24 [ 0 .. 
23 ]

backsDayAgg :: IO [(Integer, Integer)]
backsDayAgg = do
  aggDayHourResult <- ES.aggDayHourResult
  let backs = (daysList . ES.aggDay) $ Prelude.head aggDayHourResult
  return (Prelude.zip [1..7] backs)

frontsDayAgg :: IO [(Integer, Integer)]
frontsDayAgg = do
  aggDayHourResult <- ES.aggDayHourResult
  let fronts = (daysList . ES.aggDay) $ Prelude.last aggDayHourResult
  return (Prelude.zip [1..7] fronts)

chartDayAgg :: IO ()
chartDayAgg = do
  backsDayAgg'  <- backsDayAgg
  frontsDayAgg' <- frontsDayAgg
  let backsTotal  = CO.secondSum backsDayAgg'  :: Integer
  let frontsTotal = CO.secondSum frontsDayAgg' :: Integer
  let dow = ["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]
  let plotData = Prelude.map (
          \ ((d, b), (_, f)) ->
            ( dow !! fromInteger (d - 1)
            , [ (fromInteger b / fromInteger backsTotal)  :: Double
              , (fromInteger f / fromInteger frontsTotal) :: Double
              ]
            )
        ) $ Prelude.zip backsDayAgg' frontsDayAgg'
  print $ Prelude.sum $ map (\ (_, b : f : _) -> b) plotData
  print $ Prelude.sum $ map (\ (_, b : f : _) -> f) plotData
  let plotDataDiff = Prelude.map (
          \ (d, b : f : _) -> (d, [abs $ (f :: Double) - (b :: Double)])
        ) plotData
  toFile (FileOptions (1500, 1000) SVG) "./charts/chartDayAgg.svg" $ do
    layout_title .= "Hacker News Fronts vs Backs - Days of the Week"
    layout_y_axis . laxis_generate .= scaledAxis def (0.0, 0.2)
    layout_x_axis . laxis_generate .= autoIndexAxis (map fst plotData)
    setColors $ map opaque [lightskyblue, lightcoral]
    plot $ plotBars <$> bars' ["Backs", "Fronts"] (addIndexes (map snd plotData))
  toFile (FileOptions (1500, 1000) SVG) "./charts/chartDayDiffAgg.svg" $ do
    layout_title .= "Hacker News Fronts vs Backs - Days of the Week Difference"
    layout_y_axis . laxis_generate .= scaledAxis def (0.0, 0.2)
    layout_x_axis . laxis_generate .= autoIndexAxis (map fst plotDataDiff)
    setColors $ map opaque [lightsteelblue]
    plot $ plotBars <$> bars' ["|Fronts - Backs|"] (addIndexes (map snd plotDataDiff))
  return ()

backsHourAgg :: IO [(Integer, Integer)]
backsHourAgg = do
  aggDayHourResult <- ES.aggDayHourResult
  let backs = (hoursList . ES.aggHour) $ Prelude.head aggDayHourResult
  return (Prelude.zip [0..23] backs)

frontsHourAgg :: IO [(Integer, Integer)]
frontsHourAgg = do
  aggDayHourResult <- ES.aggDayHourResult
  let fronts = (hoursList . ES.aggHour) $ Prelude.last aggDayHourResult
  return (Prelude.zip [0..23] fronts)

chartHourAgg :: IO ()
chartHourAgg = do
  backsHourAgg'  <- backsHourAgg
  frontsHourAgg' <- frontsHourAgg
  let backsTotal  = CO.secondSum backsHourAgg'  :: Integer
  let frontsTotal = CO.secondSum frontsHourAgg' :: Integer
  let hod = map (\ x -> show x ++ "00") [0..23]
  let plotData = Prelude.map (
          \ ((h, b), (_, f)) ->
            ( hod !! fromInteger h
            , [ (fromInteger b / fromInteger backsTotal)  :: Double
              , (fromInteger f / fromInteger frontsTotal) :: Double
              ]
            )
        ) $ Prelude.zip backsHourAgg' frontsHourAgg'
  print $ Prelude.sum $ map (\ (_, b : f : _) -> b) plotData
  print $ Prelude.sum $ map (\ (_, b : f : _) -> f) plotData
  let plotDataDiff = Prelude.map (
          \ (h, b : f : _) -> (h, [abs $ (f :: Double) - (b :: Double)])
        ) plotData
  toFile (FileOptions (1500, 1000) SVG) "./charts/chartHourAgg.svg" $ do
    layout_title .= "Hacker News Fronts vs Backs - Hours of the Day"
    layout_y_axis . laxis_generate .= scaledAxis def (0.0, 0.1)
    layout_x_axis . laxis_generate .= autoIndexAxis (map fst plotData)
    setColors $ map opaque [lightskyblue, lightcoral]
    plot $ plotBars <$> bars' ["Backs", "Fronts"] (addIndexes (map snd plotData))
  toFile (FileOptions (1500, 1000) SVG) "./charts/chartHourDiffAgg.svg" $ do
    layout_title .= "Hacker News Fronts vs Backs - Hours of the Day Difference"
    layout_y_axis . laxis_generate .= scaledAxis def (0.0, 0.1)
    layout_x_axis . laxis_generate .= autoIndexAxis (map fst plotDataDiff)
    setColors $ map opaque [lightsteelblue]
    plot $ plotBars <$> bars' ["|Fronts - Backs|"] (addIndexes (map snd plotDataDiff))
  return ()

chartDayHourAgg :: IO ()
chartDayHourAgg = chartDayAgg >> chartHourAgg

-- Modified from
-- https://hackage.haskell.org/package/Chart-1.8/docs/src/Graphics-Rendering-Chart-Easy.html#bars
-- to remove borders.
bars' :: (PlotValue x, BarsPlotValue y) => [String] -> [(x, [y])] -> EC l (PlotBars x y)
bars' titles vals = liftEC $ do
  styles <- sequence [fmap mkStyle takeColor | _ <- titles]
  plot_bars_titles      .= titles
  plot_bars_values      .= vals
  plot_bars_style       .= BarsClustered
  plot_bars_spacing     .= BarsFixGap 30 5
  plot_bars_item_styles .= styles
  where
    mkStyle c = (solidFillStyle c, Nothing)
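The day and hour charts plot relative frequencies rather than raw counts: each bucket's count is divided by its group's total (so backs and fronts are comparable despite different sample sizes), and the companion "difference" chart plots the absolute gap per bucket. A minimal Python sketch of that normalization, using invented counts purely for illustration:

```python
# Sketch of the normalization performed in chartDayAgg: per-day counts are
# divided by each group's total, then the per-day absolute difference is
# taken.  The counts below are made up for illustration.
backs  = [(1, 40), (2, 55), (3, 60), (4, 50), (5, 45), (6, 20), (7, 30)]
fronts = [(1, 10), (2, 18), (3, 22), (4, 20), (5, 15), (6, 7),  (7, 8)]

backs_total  = sum(count for _, count in backs)
fronts_total = sum(count for _, count in fronts)

# Relative frequency per day of the week (Mon=1 .. Sun=7).
plot_data = [
    (day, b / backs_total, f / fronts_total)
    for (day, b), (_, f) in zip(backs, fronts)
]

# Absolute per-day difference, mirroring chartDayDiffAgg.
plot_data_diff = [(day, abs(f - b)) for day, b, f in plot_data]

# Each normalized series sums to 1 -- this is what the two print
# statements in chartDayAgg sanity-check.
assert abs(sum(b for _, b, _ in plot_data) - 1.0) < 1e-9
assert abs(sum(f for _, _, f in plot_data) - 1.0) < 1e-9
```

The same shape applies to `chartHourAgg`, just with 24 hour buckets instead of 7 day buckets.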

TitleAnalysis.hs

{-
  David Lettier (C) 2016
  http://www.lettier.com/
-}

{-# LANGUAGE QuasiQuotes, OverloadedStrings #-}

module TitleAnalysis where

import System.Process
import Control.Monad
import Data.Maybe
import Data.List as DL
import Data.Map as DM
import Data.Set as DS
import Data.Text as DT
import qualified Data.ByteString.Lazy as DBL
import Data.Aeson
import NeatInterpolation as NI

import qualified HackerNewsItems as HNI
import qualified AlgoliaHits as AH
import qualified FrontsAndBacks as FOB
import qualified Elasticsearch as ES
import qualified SentimentApi as SA
import qualified Common as CO

esAggTitleResult :: IO ([(String, Integer)], [(String, Integer)])
esAggTitleResult = CO.esAggBackFrontFieldResult ES.aggTitle ES.aggTitleResult

corpusIO :: IO [String]
corpusIO = fmap corpus esAggTitleResult

corpus :: ([(String, Integer)], [(String, Integer)]) -> [String]
corpus (b, f) = DL.sort $ DS.toList $ DS.fromList $ Prelude.map fst (b ++ f :: [(String, Integer)])

backsAndFrontsTitles :: IO ([String], [String])
backsAndFrontsTitles = do
  backsTitles  <- fmap (Prelude.map AH.title)  FOB.getBacksInDb
  frontsTitles <- fmap (Prelude.map HNI.title) FOB.getFrontsInDb
  return (backsTitles, frontsTitles)

chartTitleAgg :: IO ()
chartTitleAgg = do
  (backs, fronts) <- esAggTitleResult
  let backsTotal  = CO.secondSum backs
  let frontsTotal = CO.secondSum fronts
  let backsMap  = CO.makeKeyMap backsTotal  backs
  let frontsMap = CO.makeKeyMap frontsTotal fronts
  let keys = corpus (backs, fronts)
  let backsMap'  = CO.makeAllKeyMap keys backsMap
  let frontsMap' = CO.makeAllKeyMap keys frontsMap
  let backsFronts = [
          (k, [CO.lookupKey k backsMap', CO.lookupKey k frontsMap'])
          | k <- keys
          , CO.lookupKey k frontsMap' >= 0.0
        ]
  let tableString = DL.intercalate "\n" $
        "Type Word Value" : [
            "Backs \"" ++ k ++ "\" " ++ show b ++ "\nFronts \"" ++ k ++ "\" " ++ show f
            | (k, [b, f]) <- backsFronts
          ]
  let tableFile = "./data/txt/chartTitleAggTable.txt"
  writeFile tableFile tableString
  let rfile = "./src/chartTitleAgg.r"
  let chartFile = "./charts/chartTitleAgg.svg"
  CO.writeROrPyFile makeRFile tableFile chartFile rfile
  CO.runRFile rfile
  return ()
  where
    makeRFile :: Text -> Text -> Text
    makeRFile tableFile chartFile = [NI.text|
        library(ggplot2);
        data = read.table(file='${tableFile}', header=T);
        p = ggplot(data, aes(reorder(factor(Word), Value), Value, fill=Type)) +
          geom_bar(stat='identity', width=0.5, position=position_dodge(width=0.5)) +
          coord_flip() +
          scale_fill_manual(values=c('#87cefa', '#f08080')) +
          labs(x='', y='', title='Hacker News Fronts vs Backs - Title Word Relative Frequency') +
          theme(
              legend.position='top'
            , legend.title=element_blank()
            , plot.margin=unit(c(1, 1, 1, 1), 'in')
            , axis.text.y=element_text(size=12)
          );
        ggsave(filename='${chartFile}', plot=p, width=20, height=400, units='in', limitsize=F);
      |]

chartSentimentAnalysis :: IO ()
chartSentimentAnalysis = do
  (backsTitles, frontsTitles) <- backsAndFrontsTitles
  backsSents  <- sents backsTitles
  frontsSents <- sents frontsTitles
  let backsFronts = [("Backs", s) | s <- backsSents] ++ [("Fronts", s) | s <- frontsSents]
  let tableString = DL.intercalate "\n" $
        "Type Score" : [k ++ " " ++ show s | (k, s) <- backsFronts]
  let tableFile = "./data/txt/chartTitleSentimentTable.txt"
  writeFile tableFile tableString
  let rfile = "./src/chartTitleSentiment.r"
  let chartFile = "./charts/chartTitleSentiment.svg"
  CO.writeROrPyFile makeRFile tableFile chartFile rfile
  CO.runRFile rfile
  return ()
  where
    sents :: [String] -> IO [Double]
    sents t = fmap (Prelude.map SA.score) (SA.getSentimentApiResults t)
    makeRFile :: Text -> Text -> Text
    makeRFile tableFile chartFile = [NI.text|
        library(ggplot2);
        data = read.table(file='${tableFile}', header=T);
        slice = data$$Type == 'Fronts'
        data.fronts = data[slice,]$$Score
        data.backs = data[!slice,]$$Score
        t.test(data.fronts, data.backs)
        p = ggplot(data, aes(x=factor(Type), y=Score), fill=factor(Type)) +
          geom_boxplot(aes(fill=factor(Type)), outlier.colour='#ff00a2') +
          geom_jitter() +
          scale_fill_manual(values=c('#87cefa', '#f08080')) +
          coord_flip() +
          labs(
              x=''
            , y='
LinearSVC Confidence Score (Distance from Sample to Hyperplane)

0 - Neutral, >0 - Positive, <0 - Negative'
            , title='Hacker News Fronts vs Backs - Title Sentiment'
          ) +
          theme(
              legend.position='top'
            , legend.title=element_blank()
            , plot.margin=unit(c(1, 1, 1, 1), 'in')
            , axis.text.y=element_text(size=12)
          );
        ggsave(filename='${chartFile}', plot=p, width=20, height=20, units='in', limitsize=F);
      |]

chartTitleProjection :: IO ()
chartTitleProjection = do
  (backsTitles, frontsTitles) <- backsAndFrontsTitles
  -- Pair each key with its matching title list.
  let json = encode $ object ["fronts" .= frontsTitles, "backs" .= backsTitles]
  let jsonFile = "./data/json/chartTitleProjection.json"
  DBL.writeFile jsonFile json
  let chartFile = "./charts/chartTitleProjection"
  let pyfile = "./venv/src/chartTitleProjection.py"
  CO.writeROrPyFile makePyFile jsonFile chartFile pyfile
  _ <- createProcess (proc "./venv/bin/python" [pyfile])
  return ()
  where
    makePyFile :: Text -> Text -> Text
    makePyFile jsonFile chartFile = [NI.text|
import json
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from sklearn.manifold import TSNE, MDS, LocallyLinearEmbedding
from sklearn.feature_extraction.text import TfidfVectorizer

with open('${jsonFile}') as f:
    data = json.load(f)

tfidf = TfidfVectorizer(stop_words='english', strip_accents='unicode', norm=None)
tfidf.fit(data['backs'] + data['fronts'])

for projector in [MDS, TSNE, LocallyLinearEmbedding]:
    for two_d in [True, False]:
        backVecs  = tfidf.transform(data['backs'])
        frontVecs = tfidf.transform(data['fronts'])
        n_components = 2 if two_d else 3
        transBackVecs  = projector(n_components=n_components).fit_transform(backVecs.toarray())
        transFrontVecs = projector(n_components=n_components).fit_transform(frontVecs.toarray())
        transBackVecs1  = []
        transFrontVecs1 = []
        for r in transFrontVecs:
            transFrontVecs1.append(list(r) + ['Fronts'])
        for r in transBackVecs:
            transBackVecs1.append(list(r) + ['Backs'])
        transBackFrontVecs = transBackVecs1 + transFrontVecs1
        columns = ['x', 'y', 'Type'] if two_d else ['x', 'y', 'z', 'Type']
        df = pd.DataFrame(transBackFrontVecs, columns=columns)
        title = 'Hacker News Fronts vs Backs - Titles Projection ' + projector.__name__ + '\n'
        if two_d:
            sns.plt.clf()
            sns.set_style('darkgrid')
            sns.set_palette(sns.color_palette(['#87cefa', '#f08080']))
            p = sns.lmplot(
                x='x',
                y='y',
                data=df,
                fit_reg=False,
                hue='Type',
                size=15,
                markers=['o', 'p'],
                scatter_kws={'s': 100, 'edgecolor': 'black', 'linewidth': 1.0}
            )
            p.fig.subplots_adjust(top=0.90, bottom=0.10, left=0.10, right=0.90)
            chartFile = '${chartFile}' + projector.__name__ + '.svg'
            sns.plt.title(title, fontsize=20)
            sns.plt.savefig(chartFile)
        else:
            fig = plt.figure()
            ax = fig.add_subplot(1, 1, 1, axisbg='1.0')
            ax = fig.gca(projection='3d')
            for coords, color in [(transBackVecs1, '#87cefa'), (transFrontVecs1, '#f08080')]:
                df = pd.DataFrame(coords, columns=columns)
                ax.scatter(df['x'].tolist(), df['y'].tolist(), df['z'].tolist(), color=color)
            plt.title(title)
            plt.legend(loc=2)
            plt.show()
print('Done.')
|]
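Before projection, `chartTitleProjection` turns each title into a TF-IDF vector: a term's weight is its frequency in the title scaled by how rare it is across all titles. The following stdlib-only sketch mirrors that idea (it is not scikit-learn's `TfidfVectorizer`, though it uses the same smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, that scikit-learn applies by default; the titles are invented for illustration):

```python
import math
from collections import Counter

# Toy stand-in for the TF-IDF vectorization step: each title becomes a
# vector of term-frequency * inverse-document-frequency weights over a
# shared vocabulary.  Titles are made up for illustration.
titles = [
    "why physics is not a discipline",
    "physics of the impossible",
    "a new discipline of programming",
]

docs = [t.split() for t in titles]
vocab = sorted({w for d in docs for w in d})
n = len(docs)
# Document frequency: in how many titles each word appears.
df = Counter(w for d in docs for w in set(d))

def tfidf(doc):
    tf = Counter(doc)
    # Smoothed IDF, as in scikit-learn's default: ln((1 + n) / (1 + df)) + 1.
    return [tf[w] * (math.log((1 + n) / (1 + df[w])) + 1) for w in vocab]

vectors = [tfidf(d) for d in docs]
```

Words shared across titles (e.g. "physics", "discipline") get lower IDF than words unique to one title, so the resulting vectors emphasize what distinguishes each title; those vectors are what MDS, t-SNE, and locally linear embedding then compress to two or three dimensions for plotting.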

UserAnalysis.hs