This is the second post about my experiments with LSTMs. Here’s the first one. This is a great introduction by Karpathy, and this is an in-depth explanation of the math behind them.

Python or Scala?

Which should you use and when? Which should you learn first? Is type safety more important than flexibility? Is Python fast enough for performance-heavy applications? Is Scala’s machine learning ecosystem mature enough for serious data science? Are indents better than braces?

This post won’t answer any of those questions.

I will show how to solve a related problem, though. Given the following text, which was stitched together from bits of scikit-learn and scalaz code files, can you tell where Python ends and Scala begins?

package scalaz package syntax """ Extended math utilities. """ # Authors: Gael Varoquaux # Alex/** Wraps a value `selfandre Gramfort # Alexandre T. Passos # Olivier Grisel # Lars Buitinck # Stefan van der Walt # Kyle Kastner # Giorgio Patrini # License:` and provides methods related to `MonadPlus` */ final class MonadPlusOps[F[_],A] private[syntax](val self: BSD 3 clause from __future__ import division from functools import partial import warnings import numpy as np from scipy import linalg from scipy.sparse import issparse, csr_matr F[A])(implicit val F: MonadPlus[F]) extends Ops[F[A]] { //// impoix from . import check_random_state from .fixrt Leibniz.=== def filter(f: A => Boolean): F[A] = F.filter(self)(f) def withFilter(f: A => Boolean): F[A] = filter(f) final def uniteU[T](implicit T: Unapply[Foldable, Aes import np_version from ._logistic_sigmoid import _log_logistic_sigmoid from ..extern]): F[T.A] = F.uniteU(self)(T) def unite[T[_], B](implicit ev: A === T[B], T: Foldable[T]): F[B] = { val ftb: F[T[B]] = ev.subst(seals.six.moves import xrange from .sparsefuncs_fast import csr_row_norms from .validation import check_array from ..exceptions import NonBLASDotWarning lf) F.unite[T, B](ftb) } final def lefts[G[_, _], B, C](implicit ev: A === G[B, C], G: Bifoldable[G]): F[B] = F.lefts(ev.subst(self)) final def rigdef norm(x): """Compute the Euclidean or Frobenius norm of x. hts[G[_, _], B, C](implicit ev: A === G[B, C], G: Bifoldable[G]): F[C] = F.rights(ev.subst(self)) final def separate[G[_, _], Returns the Euclidean norm when x is a vector, the Frobenius norm when x is a matrix (2-d array). More precise than sqrt(squared_norm(x)). """ x = np.asarray(x) nrm2, = lin B, C](implicit ev: A === G[B, C], G: Bifoldable[G]): (F[B], F[C]) = F.separate(ev.subst(self)) //// } sealed trait ToMonadPlusOps0 { implicit def Talg.get_blas_funcs(['nrm2'], [x]) return nrm2(x) # Newer NumPy has a ravel that needs leoMonadPlusOpsUnapply[FA](v: FA)(implicit F0: Unapply[MonadPlus, FA]) = new MonadPlusOps[F0.M,F0.A](F0(v))ss copying. if np_version < (1, 7, 1): _ravel = np.ravel else: _ravel = partial(np.ravel, order='K') def squared_no(F0.TC) } trait ToMonadPlusOps extends ToMonadPlusOps0 with ToMonadOps with ToApplicatrm(x): """Squared Euclidean or Frobenius norm of x. Returns the Euclidean norm when x is a vector, the Frobenius norm when x is a matrix (2-d array). Faster than norm(ivePlusOps { implicit def ToMonadPlusOps[F[_],A](v: F[A])(implicit F0: MonadPlus[F]) = new MonadPlusOps[F,A](v) //// //// } trait MonadPlusSyntax[F[_]] extends MonadSyntax[F] withx) ** 2. """ x = _ravel(x) if np.issubdtype(x.dtype, np.integer): ApplicativePlusSyntax[F] { implicit def ToMonadPlusOps[A](v: F[A]): MonadPlusOps[F, A] = ne warnings.warn('Array type is integer, np.dot may overflow. ' 'Data should be float type to avoid this issue', UserWarning) return np.dot(xw MonadPlusOps[F,A](v)(MonadPlusSyntax.this.F) def F: MonadPlus[F] //// //// } package scalaz package syntax /** Wraps a value `self` and provides methods, x) def row_norms(X, squared=False): """Row-wise (squared) Euclidean norm of X. E related to `Traverse` */ final class Tquivalent to np.sqrt((X * X).sum(axis=1)), but also supporaverseOps[F[_],A] private[syntax](val self: F[A])(implicit val F: Traverse[F]) exterts sparse matrices and does not create an X.shape-sized temporary. Performs no input valnds Ops[F[A]] { //// import Leibniz.===

I will show how Keras LSTMs and bidirectional LSTMs can be used to neatly solve this problem. The post contains some snippets of code, but the full thing is here.

The problem

I once interviewed with a cyber security company that was scraping the web looking for people’s phone numbers, emails, credit card numbers etc. They asked me how I would go about building a model that finds those things in text files and also categorizes the files into types like ‘email’, ‘server logs’, ‘code’, etc.

The boring way

The boring answer is that with enough feature engineering you could classify files pretty well with any old ML algorithm. If all lines have a common prefix -

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

- then we’re probably dealing with a log file. If there’s a lot of camelCase(), we’re probably looking at code. And so on.
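Handcrafted features like these can be sketched in a few lines. This is only a toy illustration under my own assumptions - the function name, the 15-character prefix window, and the camelCase regex are all arbitrary choices, not from the original system:

```python
import re

def handcrafted_features(text):
    """A couple of hand-engineered features for file-type classification
    (illustrative only; a real system would use many more)."""
    lines = [l for l in text.splitlines() if l.strip()]
    # fraction of lines sharing the first line's prefix (log files score high)
    prefix = lines[0][:15] if lines else ''
    common_prefix_frac = (
        sum(l.startswith(prefix) for l in lines) / float(len(lines)) if lines else 0.0)
    # density of camelCase identifiers (code files score high)
    camel_count = len(re.findall(r'\b[a-z]+[A-Z]\w*', text))
    return {'common_prefix_frac': common_prefix_frac,
            'camel_per_char': camel_count / float(max(len(text), 1))}
```

A feature vector like this could then be fed to any off-the-shelf classifier.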

Finding, e.g., phone numbers in text is more involved but still doable this way. You would first generate potential matches using regular expressions and then classify each as true or spurious based on the context it appears in.
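That two-stage approach could be sketched as follows. The regex and helper below are my own illustration, not from the post; the pattern deliberately over-generates US-style phone-number candidates, and a downstream classifier would prune false positives using the surrounding context:

```python
import re

# A deliberately loose pattern that over-generates candidates.
CANDIDATE_RE = re.compile(r'\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b')

def phone_candidates(text, context=30):
    """Return (match, left context, right context) triples,
    ready to be classified as true or spurious matches."""
    out = []
    for m in CANDIDATE_RE.finditer(text):
        left = text[max(0, m.start() - context):m.start()]
        right = text[m.end():m.end() + context]
        out.append((m.group(), left, right))
    return out
```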

Inevitably, for every new file type and every type of entity to be found in the file, one would have to come up with new features and maybe train a separate classifier.

Super tedious.

The RNN way

The fun and potentially superior solution uses char-RNNs. Instead of all those handcrafted features and regular expressions and different models, we can train a single recurrent neural network to label each character in the text as either belonging to a phone number (credit card number, email …) or not. If we do it right and have enough training data, the network should be able to learn that phone numbers are more likely to occur in emails than in server logs and that Java code tends to use camel case while Python has indented blocks following a colon - and all kinds of other features that would otherwise have to be hardcoded.

Let’s do it!

Implementation

As it turned out, the hardest part was getting and preparing the data. Since I don’t have access to a labeled dataset with phone numbers and emails, I decided to create an artificial one. I took all the Python files from the scikit-learn repository and all the Scala files from scalaz and spliced them together into one giant sequence of characters. The sequence takes a few dozen consecutive characters from a Python file, then a few dozen from a Scala file, then Python again, and so on. The result is the Frankenstein’s monster at the top of the post (except tens of megabytes more of it).

Preparing training data

The sequence made up of all the Python and Scala files wouldn’t fit in my RAM (Big Data, as promised ;), so it is generated online during training, using a generator:

from random import choice

def chars_from_files(list_of_files):
    # reads a list of files in random order and yields
    # one character at a time
    while True:
        filename = choice(list_of_files)
        with open(filename, 'rb') as f:
            chars = f.read()
        for c in chars:
            yield c

def splice_texts(files_a, files_b):
    """Takes two lists of files and generates a sequence of characters
    from those files. Yields pairs: (character, index of the source - 0 or 1)
    """
    a_chars = chars_from_files(files_a)
    b_chars = chars_from_files(files_b)
    generators = [a_chars, b_chars]

    # take between 20 and 50 characters from one source
    # before moving to the other source
    jump_range = range(20, 50)
    source_ind = choice([0, 1])
    while True:
        jump_size = choice(jump_range)
        gen = generators[source_ind]
        for _ in range(jump_size):
            yield (gen.next(), source_ind)
        source_ind = 1 - source_ind

# it can be used like this
gen = splice_texts(["file1.txt", "file2.txt"], ["file3.txt", "file4.txt"])
char_1, label_1 = gen.next()
char_2, label_2 = gen.next()
# and so on ...

The other reason for using a generator is that the sequence can be randomized (both the order of files and the number of consecutive characters taken from one source). This way the network never sees the same sequence twice, which reduces overfitting.

The next step is encoding the characters as vectors (one-hot encoding):

import numpy as np

# Only allowing these characters:
chars = '\n !"#$%&\'()*+,-./0123456789:;<=>?@[\\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~'

char2ind = dict((c, i) for i, c in enumerate(chars))
char2vec = {}
for c in chars:
    vec = np.zeros(len(chars))
    vec[char2ind[c]] = 1
    char2vec[c] = vec

To take advantage of the parallel processing powers of the GPU, the input vectors need to be shaped into batches. Keras requires that batches for an LSTM be 3-dimensional arrays, where the first dimension is the number of sequences in a batch, the second is the number of characters in a sequence, and the third is the dimensionality of the input vector. The latter is in our case equal to the number of characters in our alphabet.

For example, if there were only two sequences to encode, both of length 4, and only 3 letters in the alphabet, this is how we would construct a batch:

# sequences to encode:
# 'abca'
# 'cacb'

# vectors corresponding to characters
a = [1, 0, 0]
b = [0, 1, 0]
c = [0, 0, 1]

batch = np.array([
    [a, b, c, a],
    [c, a, c, b]
])

# batch.shape gives (2, 4, 3)
# which is = (number of sequences, length of a sequence, number of available chars)

If the sequences are too long to fit in one batch - as they are in our case - they need to be split into multiple batches. This would ordinarily mean losing some context information for characters near the boundary of a sequence chunk. Fortunately, Keras’ LSTM has a stateful=True setting which tells the network that the sequences from one batch are continued in the next one. For this to work, the batches must be prepared in a specific way, with the n-th sequence in a batch being continued as the n-th sequence of the next batch.

# sequences to encode:
# 'abcdefgh'
# 'opqrstuv'

batch_1 = np.array([
    [a, b, c, d],  # first element of first batch
    [o, p, q, r]   # second element of first batch
])

# i-th element of second batch is the continuation of i-th element of first_batch
batch_2 = np.array([
    [e, f, g, h],  # first element of second batch
    [s, t, u, v]   # second element of second batch
])

In our case, each sequence is produced by a generator reading from files. We will have to start a number of generators equal to the desired batch size.

def generate_batches(files_a, files_b, batch_size, sequence_len):
    gens = [splice_texts(files_a, files_b) for _ in range(batch_size)]
    while True:
        X = []
        y = []
        for g in gens:
            vecs = []
            labels = []
            for _ in range(sequence_len):
                c, l = g.next()
                vecs.append(char2vec[c])
                labels.append([l])
            X.append(vecs)
            y.append(labels)
        yield (np.array(X), np.array(y))

Done. This generator produces batches accepted by Keras’ LSTM. The batch_size and sequence_len settings influence GPU/CPU utilisation but otherwise shouldn’t make any difference (as long as stateful=True!).

The network

Now for the easy part. Construct the network:

from keras.layers import Dense, Dropout, LSTM, TimeDistributed
from keras.models import Sequential

batch_size = 1024
seq_len = 100
n_chars = 96
rnn_size = 128
dropout_rate = 0.2  # any reasonable dropout strength will do
batch_shape = (batch_size, seq_len, n_chars)

model = Sequential()
# Let's use 3 LSTM layers, because why not
model.add(LSTM(rnn_size, return_sequences=True, batch_input_shape=batch_shape, stateful=True))
model.add(Dropout(dropout_rate))
model.add(LSTM(rnn_size, return_sequences=True, batch_input_shape=batch_shape, stateful=True))
model.add(Dropout(dropout_rate))
model.add(LSTM(rnn_size, return_sequences=True, batch_input_shape=batch_shape, stateful=True))
model.add(Dropout(dropout_rate))
model.add(TimeDistributed(Dense(units=1, activation='sigmoid')))
model.compile(optimizer='adam', loss='mse', metrics=['accuracy', 'binary_crossentropy'])

And train it:

from keras.callbacks import ModelCheckpoint

model_path = "models/my_model"
generator = generate_batches(files_a, files_b, batch_size, seq_len)
checkpointer = ModelCheckpoint(model_path)
model.fit_generator(generator, steps_per_epoch=1000, epochs=10, callbacks=[checkpointer])

Making predictions is just as easy:

predictions = model.predict_generator(generator, steps=50)

That’s it! The full code I used has a few more bells and whistles, but this is the core of it.
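The predictions come out as per-character sigmoid activations, so turning them into hard labels is just a matter of thresholding. A minimal sketch (the helper name and the 0.5 threshold are my own choices; I’m assuming label 0 = Python and 1 = Scala, matching the source index yielded by splice_texts):

```python
import numpy as np

def predictions_to_labels(predictions, threshold=0.5):
    """Collapse sigmoid outputs of shape (n_sequences, seq_len, 1)
    into one flat 0/1 label per character."""
    probs = np.asarray(predictions).reshape(-1)
    return (probs > threshold).astype(int)
```

Keeping the raw probabilities around is still useful - they are what the colour intensity in the visualizations below is based on.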

I split the Python and Scala files into train and test sets (80:20) and trained the network on the training set for a few hours. This is what the network’s prediction on the test set (the same text as at the top of this post) looks like:

[Visualization: the test text rendered character by character, with font size showing the true label and background colour showing the network’s prediction; see the legend below.]

Font size shows the true label (small - Python, big - Scala) and background color represents the network’s prediction (white - Python, dark red - Scala).

It’s pretty good overall, but the network keeps making a few unforced errors. Consider this bit:

package scalaz
package syntax

"""

It is very unsure about the first few characters of the input. Even though package scalaz should be a dead giveaway, the prediction only becomes confident at about the character ‘g’. It is also sometimes too slow to change the prediction - as with Python’s triple quotation marks """ following a stretch of Scala code. Triple quotes should immediately be labeled as Python, but only the third one is.

These mistakes stem from the fact that the RNN doesn’t look ahead and can only interpret a character in the context of characters that came before. Triple quotes almost certainly come from a stretch of Python code, but you don’t know that you’re seeing triple quotes until you’ve seen all three. That’s why the prediction gradually changes from Scala to Python (red to white) as the RNN encounters the second and third consecutive quote.

This problem actually has a straightforward solution - the bidirectional RNN. It’s a type of RNN where the sequence is fed in from both ends at the same time. This way, the network is already aware of the second and third quotation marks when it produces the label for the first one.

To make the LSTM bidirectional in Keras, one simply wraps it in the Bidirectional wrapper:

from keras.layers import Bidirectional

model.add(Bidirectional(LSTM(rnn_size, return_sequences=True, stateful=True),
                        batch_input_shape=batch_shape))

# instead of
# model.add(LSTM(rnn_size, return_sequences=True, batch_input_shape=batch_shape))

Everything else stays the same.

Here’s a sample of results from a bidirectional LSTM:

[Visualization: the bidirectional LSTM’s predictions on another test excerpt - scalaz lens tests interleaved with scikit-learn’s classifier-calibration docs - rendered with the same font-size/colour scheme.]

I think this looks better overall. The problem of updating the prediction too slowly is mostly gone - package scalaz is marked as Scala immediately, starting with the letter ‘p’. However, the network now makes weird mistakes in the middle of a word, for no apparent reason. Like this one:

[In the rendered output, the middle of the phrase ‘Comparison of Calibration’ is marked as Scala.]

Why is the middle of the ‘Calibration’ all of a sudden marked as Scala?

The culprit is statefulness. Remember that stateful=True means that for each sequence in a batch, the state of the network at the beginning of the sequence is reused from the state at the end of the previous sequence*. This acts as if there were no batches, just one unending sequence. But in a bidirectional layer the sequence is fed to the network twice, from both directions. So half of the state should be borrowed from the previous sequence, and half from the next sequence - which has not been seen yet! In reality, all of the state is reused from the previous sequence, so half of the network ends up in the wrong state. This is why those weird mispredictions appear, and why they appear at regular intervals: at the beginning of each new batch, half of the network is in the wrong state and starts out predicting the wrong label.

* or more precisely, the state at the end of the corresponding sequence in the previous batch

Let’s get rid of statefulness in the bidirectional version of the network:

model.add(Bidirectional(LSTM(rnn_size, return_sequences=True, stateful=False),
                        batch_input_shape=batch_shape))

Unfortunately, this means we have to use longer sequences to give the network more context for labeling each character (in the previous experiments I used 128 characters, now 200). Even so, predictions for characters near the boundary between consecutive sequences are bound to be poorer - just as in a regular unidirectional LSTM. To make up for it, I decided to give the network more layers (4) and more time to train (a day). Let’s see how it worked out:
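Putting those changes together, the final model might look something like this. This is a sketch under my own assumptions - the post doesn’t show the exact code, and dropout_rate is a placeholder value of my choosing:

```python
from keras.layers import Dense, Dropout, LSTM, TimeDistributed, Bidirectional
from keras.models import Sequential

batch_size = 1024
seq_len = 200        # longer sequences to compensate for the lost state
n_chars = 96
rnn_size = 128
dropout_rate = 0.2   # assumed value, not given in the post
batch_shape = (batch_size, seq_len, n_chars)

model = Sequential()
# first bidirectional layer carries the input shape
model.add(Bidirectional(LSTM(rnn_size, return_sequences=True, stateful=False),
                        batch_input_shape=batch_shape))
model.add(Dropout(dropout_rate))
# three more bidirectional layers - 4 in total
for _ in range(3):
    model.add(Bidirectional(LSTM(rnn_size, return_sequences=True, stateful=False)))
    model.add(Dropout(dropout_rate))
model.add(TimeDistributed(Dense(units=1, activation='sigmoid')))
model.compile(optimizer='adam', loss='mse',
              metrics=['accuracy', 'binary_crossentropy'])
```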

[Visualization: the final network’s predictions on a test excerpt - scalaz’s Either3 interleaved with scikit-learn’s Bayesian Gaussian mixture code - rendered with the same font-size/colour scheme.]



n _ f e a t " ) " )

c a s e M i d d l e 3 ( b ) = > C o r d ( " M i d d l e 3 ( " , b . s h o w s , " ) " )

c a s e R i g h t 3 ( c ) = > C o r d ( " R i g h t 3 ( " , c . s h o w s , " ) " )

}

}

}





/ / v i m : s e t t s = 4 s w = 4 e t :

p a c k a g e s c a l a z

p a c k a g e s y n t u r e s : i n t

T h e n u m b e r o f f e a t u r e s .



R e t u r n

- - - - - -

l o g _ w i s h a r t _ n o r m : a r r a y - l i k e , s h a p e ( n _ c o m p o n e n t s , )

T h e l o g n o a x



/ * * W r a p s a v a l u e ` s e l f ` a n d p r o v i d e s m e t h o d s r e l a t e d t o ` U n z i p ` * /

f i n a l c l a s s U n z i p O p s [ F [ _ ] , A ] p r i v a t e [ s y n t a x ] ( v a l s e l f : F [ A ] ) ( i m p l i c i t v a l F : U n z i p [ F ] ) e x t e n d s O p s [ F [ A r m a l i z a t i o n o f t h e W i s h a r t d i s t r i b u t i o n .

" " "

# T o s i m p l i f y t h e c o m p ] ] {

/ / / /

/ / / /

}



s e a l e d t r a i t T o U n z i p O p s 0 {

i m p l i c i t d e f T o U n z i p O p s U n a p p l y [ F A ] ( v : F A ) ( i m p l i c i t F 0 : U n a p p l y [ U n z i p , F A ] ) u t a t i o n w e h a v e r e m o v e d t h e n p . l o g ( n p . p i ) t e r m

r e t u r n - ( d e g r e e s _ o f _ f r e e d o m * l o g _ d e t _ p r e c i s i =

n e w U n z i p O p s [ F 0 . M , F 0 . A ] ( F 0 ( v ) ) ( F 0 . T C )



}



t r a i t T o U n z i p O p s e x t e n d s T o U n z i p O p s 0 {

i m p l i c i t d e f T o U n z i p O p s [ F [ _ ] , A ] ( v : F o n s _ c h o l +

d e g r e e s _ o f _ f r e e d o m * n _ f e a t u r e s * . 5 * m a t h . l o g ( 2 . ) +

n p . s u m ( g a m m a l n ( . 5 * ( d e g r e e s _ o f _ f r e e d o m -

[ A ] ) ( i m p l i c i t F 0 : U n z i p [ F ] ) =

n e w U n z i p O p s [ F , A ] ( v )



/ / / /

i m p l i c i t d e f T o U n z i p P a i r O p s [ F [ _ ] , A , B ] ( v : F [ ( A , B ) ] ) ( i m p l i c i t F 0 : U n z i p [ F ] ) =

n e w U n z i p P a i r O p s [ F , A , B ] ( v ) ( F 0 )



f i n a l c n p . a r a n g e ( n _ f e a t u r e s ) [ : , n p . n e w a x i s ] ) ) , 0 ) )





c l a s s B a y e s i a n G a u s s i a n M i x t u r e ( B a s e M i x l a s s U n z i p P a i r O p s [ F [ _ ] , A , B ] p r i v a t e [ s y n t a x ] ( s e l f : F [ ( A , B ) ] ) ( i m t u r e ) :

" " " V a r i a t i o n a l B a y e s i a n e s t i m a t i o n o f a G a u s s i a n m i x t

The weird mislabelings are gone, the boundaries between labels are crisp, and overall accuracy has improved. It's practically perfect. Thank you, François Chollet!

This is it for now. More experiments in the next post.

As a bonus, here is a prediction from a network trained on the collected works of Shakespeare mixed with .R files from the caret repository:





SCENE III.
CYMBELINE'S palace. An ante-chamber adjoining IMOGEN'S apartments

Enter CLtimestamp <- Sys.time()
library(caret)

model <- "nbSearch"

######################################OTEN and LORDS

FIRST LORD. Your lordship is the most patient man in loss, the most
coldest that ever turn'd up ac###################################

set.seed(2)
training <- LPH07_1(100, factors = TRUE, class = TRUE)
testing <- LPH07_1(100, factors = TRUE, class = TRUE)
trainX <- training[, -ncol(te.
CLOTEN. It would make any man cold to lose.
FIRST LORD. But not every man paraining)]
trainY <- training$Class

cctrl1 <- trainControl(method = "cv", tient after the noble temper of
your lordship. You are most hot and furious when you win.
CLOTEN. Winning will put any man into courage. If I counumber = 3, returnResld get this
foolish Imogen, I should have gold enough. It's almost morning,
is't not?
FIRST LORD. Day, my lord.
CLamp = "all",
classProbs = TRUE,
summaryFunction = twoClassSummary)
cctrl2 <OTEN. I would this music would come- trainControl(method = "LOOCV",
classProbs = TRUE, summaryFunction = twoClassSummary)
cctrl3 <- trainControl(method = ". I am advised to give her
music a mornings; they say it will penetrate.

Enter musicians

Come on, tune. If you none",
classcan penetrate her with your fingering, so.
We'll try with tongue too. If none wilProbs = TRUE, summaryFunction = twoClassSummary)
cctrlR <l do, let her remain; but
I'll nev- trainControl(method = "cv", number = 3, returnResamp = "all", search = "random")

set.seed(849)
test_class_cv_model <- train(trainX, trainY,
er give o'er. First, a very excellent good-conceited
thing; after, a wonderful sweet air, with admirable rich words to
it- and then let her consider.

SONG

Hark, harmethod = "nbSearch",
k! the lark at heaven's gate sings,
And Phoebus 'gins arise,
His steeds to water at those springs
On chalic'd flow'rs that lies;
And winking MatrControl = ccry-buds begin
To ope their golden eyes.
With every thing that pretty bin,
My lady sweet, arise;
Arise, arise!

So, get you gone. If this penetrate, Itrl1,
metric = "ROC")

test_class_pred <- predict(test_class_cv_model, testing[, -ncol(testing)])
test_class_prob <- predict(test_classwill consider your music
the better; if it do not, it is a vice in her ears which
horsehairs and calves' guts, nor the voice of unpaved eunuch to
boot, can_cv_model, testing[, -ncol(testing)], type = "prob")
never amend. Exeunt musicians

Enter CYMBELINE and QUEEN

SECOND LORD. Here comes the King.
CLOTEN. I am glad I was up so late, for that's the re
set.seed(849)
test_class_rand <- trainason I was up
so early. He cannot choose but take this service I hav(trainX, trainY,
method = "nbSearch",
trControl = cctrlR,
e done
fatherly.- Good morrow to your Majesty and to my gracious mother.
CYMBELINE. Attend you here the door of our stern daughter?
Will she not uneLength = 4)

set.seed(849)
test_class_loo_model <- train(trainX, trainY,
method = "nbt forth?
CLOTEN. I have assail'd her with musics, but she vouchsafes no
notice.
CYMBELINE. Search",



Conclusions

What have we learned?

constructing and training a network with Keras is embarrassingly easy

but preparing data for char-RNNs is still very manual and awkward

RNN can’t be both stateful and bidirectional. Duh!

distinguishing between programming languages with char-RNN works remarkably well with no parameter tuning or feature engineering

looks promising as a method of tagging special entities (code snippets or emails or phone numbers or…) included in text

Neural networks are in many ways overhyped. On most supervised machine learning problems you would be better off with a good old random forest. But tagging sequences is one of those applications that are difficult and tedious to even translate into a regular supervised learning task. How do you feed a stream of characters into a decision tree? And an RNN solves it straight out of the box, no questions asked.
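For reference, a per-character tagger of the kind described here takes only a few lines of Keras. This is a minimal sketch under assumed shapes: the vocabulary size, window length, and layer width are my placeholders, not the exact network from these experiments.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 128  # assumed: size of the character vocabulary
window = 64       # assumed: characters per training window

# One sigmoid output per character: P(this character is Python),
# with 1 - P implicitly meaning Scala.
model = keras.Sequential([
    layers.Input(shape=(window, vocab_size)),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
])
model.compile(loss="binary_crossentropy", optimizer="adam")
```

Note the model is bidirectional and therefore not stateful, consistent with the point above that an RNN can't be both.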