\$\begingroup\$

First of all, good job! This is obviously a complex algorithm and it looks like it's working.

I'm going to do this incrementally. So I'll save this and keep editing as I go. And since it's so long, I won't get to everything. Plus I don't understand the algorithm too well.

First of all, doc strings in Clojure go before the arguments. I used to make this mistake all the time. The reason is that you can have multi-variate functions.

1:

(defn insert [m k] "Inserts a key k into a map m if it does not exist or increments the count if it does" (let [val (m k)] (assoc m k (inc (if (nil? val) 0 val)))))

My version:

(defn insert "Inserts a key k into a map m if it does not exist or increments the count if it does" [m k] (update-in m [k] (fnil inc 0)))

See update-in and fnil.

2:

#() construction is unnecessary here:

(defn sentence->tags [sentence] (map #(second %) sentence))

My version:

(defn sentence->tags [sentence] (map second sentence))

3:

(defn word-count [tagged-str] "Example of how to get word counts" (reduce #(insert %1 (first %2)) {} tagged-str))

Could this be replaced by using frequencies?

4:

Now it's starting to get hairy. I may make some mistakes here, because the code is not factored.

(defn viterbi-init [v path obs states start-p emit-p] "Initializes viterbi for us" (reduce #(into %1 {%2 [(* (start-p %2) (emit-p (first obs) %2)) (conj path %2)]}) {} states))

This definitely has too many levels of into/reduce. You could do a (reduce #(assoc %1 %s ...) {} states) pattern. Do we need v ? And why are we passing in a list obs when we only need the first? And it's often a good idea to put the driving sequence in the first or last position, so you can do threading. Let's try it this way:

(defn viterbi-init "Initializes viterbi for us" [states path ob start-p emit-p] (into {} (for [state states] [state [(* (start-p state) (emit-p ob state)) (conj path state)]])))

5:

This one is very hairy. I don't quite understand it, but I will try.

(defn viterbi-step [prior obs states trans-p emit-p] "Goes through one step of viterbi for us, taking a prior state and performing one step" (apply merge (map (comp (fn [[path v]] {(last path) [v path]}) #(apply max-key val %) #(apply merge %)) ((fn [obs] (map #(applyAll (map (comp (fn [[v past-st]] (fn [current-st] {(conj (second (prior past-st)) current-st) (* v (trans-p current-st past-st) (emit-p obs current-st))})) (partial extract-prob-state prior)) states) %) states)) obs))))

So, I tried and failed to refactor this myself. But I will give my general feedback. What this function, which is a map of a map of a map of a map, tells me is that there is a failure of abstraction. viterbi-step should be should be a high-level function which should read somewhat like the inner loop of a pseudo-code implementation of Viterbi. This function relies too much on the structure of the data structures involved. Deeply nested structures are common, but a single function that accesses them so deeply is not. A good rule of thumb is at most 1 nested map/reduce within a function.

There need to be functions which act as your primitive operations here. I can see that you began writing some near the top. You should continue that trend here. Then your functions would be operating at a certain level and calling functions from the level below.

An alternative approach would be to turn the algorithm into a sequential series of steps. This may or may not apply here, but it is hard for me to tell. As an example (not real code!):

(defn viterbi-step [prior obs states trans-p emit-p] (-> obs (calculate-priors) (extract-prob-states states) (extract-path) (merge)))

Again, it's just an example. But the idea is that each function takes the data it needs and creates a new data structure that is the result of that calculation. I don't know if this is possible with this algorithm. But it could be. One hint I can give is that you know you are on the right track when your functions are returning "appropriate" data structures. That is, when the data is a mapping, you return a map. When it's a set, you return a set. Also, the functions don't take much more data than they need to calculate the answer. I suggest you take the iterative algorithm description and work backwards from the final output.

Again, nice going. It was a pleasure to go through it.