Wednesday, August 26, 2020, by artyom ; 2 comments

CppCMS moves to C++11 and cleanups code that was workaround for missing standard C++ libraries features.

Full changes can be read there: http://cppcms.com/wikipp/en/page/cppcms_2_0_whats_new

This is beta release so more backward incompatible modifications may be presented.

0 comments Monday, August 10, 2020, by artyom ; Posted in: Progress

I'm looking into implementing next major feature that I can complete with reasonable effort. I have a list of following topics:

HTTP/1.1 support - it would allow keep alive support and better performance overall for multiple requests. It would allow future implementation of web sockets. Multiple-Event-Loop support - today there is only 1 event loop for the service and it can be a bottleneck for small requests on systems with high core count. SSL support - for embedded systems that need proper web server. C++11 cleanup - replace all booster primitives that exist in C++11 with standard ( shared_ptr , thread , mutex , etc) provide move constructors for many objects, replace auto_ptr by unique_ptr .

Feel free to suggest other ideas that you can think of that would be beneficial.

Comment below or sent e-mail to cppcms-users mailing list.

ZX Spectrum was my first computer. Me and my brother got one when we were kids. I learning programming using one. I learned to write both BASIC and machine code on it. To this day, I know Z80's assembly and machine code better than of any other processor. I learned to do some system programming, working with interrupts and some graphics using this simple but genius machine.

Back than I studied at a school with strong emphasis on math and physics. I used to write some simple and not so simple simulations using this amazing machine. Even my brother had written some computational tasks during has physics degree at the university he attended.

Today I use much more powerful hardware I do lots of work in field of Deep Learning professionally. I run computations using powerful GPU's consuming huge amount of fast memory and measuring computational power in Terra FLOPS.

Recently I stumbled upon an interesting YouTube channel the 8-bit guy that talks a lot about "retro" hardware. It reminded me my first "computing love" that small 8 bit machine I used to study and play with it a lot. So I installed the simulator and started playing with it.

Than a crazy thought had came to my mind: can some of the state-of the AI art techniques that require enormous computational power be done on this simple 8-bit machine with 48KB of memory and 3.5MHz CPU? What was the simplest project I could start with?

There is a "Hello World" for AI: it is the MNIST challenge - hand written digits recognition.

Challenge and Network

MNIST database consists of 60,000 images of hand written digits 28x28 pixels size and 8 bit depth. It was clear that it wouldn't be possible to do the task as-is.

First of all I made smaller images - it will help both with data-memory and network size:

Remove 2 pixels padding that MNIST does to 24x24

Rescale it to 1:3 - so we get an image of 8x8 for each digit

Keep only 1 bit per pixel (it will help us later for computations)

Use our 6KB of video memory as training data buffer.

This way total train data-set would consist of 640 samples, test data set will be loaded from tape after training is completed and contain same amount of data

And this is how the training data set looks like:

Now I designed a simplest network that could fit the bill.

Getting Machine Learning Technical Alert {

The simplest network to implement was a Multilayer perceptron. However in order to get reasonable accuracy I needed to have enough hidden layers that made the network quite big for the RAM.

So I finally decided to implement more complex convolutional network like this:

Convolution layer with N kernels of 3x3 with bias, input 1x8x8, output Nx6x6

MAX Pooling 2x2: input Nx6x6, output Nx3x3

ReLU

Fully connected layer with bias giving M classes output

Euclidean Loss - since it is simpler and faster to implement than standard SoftMax and logistic loss.

So for 10 classes I used N=12 kernels with 1210 trainable parameters.

} End of Machine Learning Technical

Language

I decided to check first if the BASIC was feasible for the task? I tested simple matrix (72 by 10) by vector (72) multiplication. It took around 20 seconds. It clearly wasn't the way to go.

So I looked somewhere else and discovered z88dk C compiler for Z80 and ZX Spectrum that comes with decent set of tools. To my surprise the project was up-to-date and provided very good C compiler. Same test took around 1 second using C compiler, so I started with it.

I implemented first version and got around 2-hours training per epoch for 10 digits samples and the accuracy was around 77% - it is not high but not poor considering small samples set and low resolution it was OK.

But did we need full floating point computations to do all the calculations? It is known that deep learning can be done using half float (16 bit) operations and inference (the prediction only part) works with 8 bit integer operations as well?

So I rewritten the code for fixed point computations. With 1 bit for the sign, 3 bits for the integer part and 12 bits for the rational.

For example 1.5 is represented as 1.5*4096 = 6144 , and 1.25 represented as 1.25*4096 = 5120 . In order to compute 1.5*1.25 we calculate in integer numbers (6144 * 5120) / 4096 = 7680 equivalent of 1.875 in fixed representation. Since operation of / 4096 is basically shift operation in integers it is highly effective method of handling real numbers in integer representation.

However there is a catch, on 8-bit systems the typical integer size is 16 bit. So when you calculate 6144 * 5120 you get 31457280 that is larger than 16 bit. So if you write: int xy = x*y>>12; you will not get the result you are expecting. So the correct solution is to cast numbers to 32 bit variable: int xy=(long)x*y>>12; However than the costly multiplication becomes much heavier 32 bit operation.

So I found a sample assembly code pattern for multiplication of two 16 bit registers and getting 32 bit result. I adapted it for singed case, added shift and rounding. It boosted the performance even more and finally I managed to perform the training using fixed point computing.

Additional memory space if using 16 bit computations allowed me later to increase number of kernels number to 20 and parameters number to 2010.

Back to BASICs

However back than I din't have a good C compiler that allowed creation of an optimized C code with floating point support. So wasn't BASIC that unfeasible?

I decided to simplify the problem: train on 2 digits: 0 and 1 instead of all 10, reduce number of kernels to 4 and see what can we do in terms of performance. Another benefit of it got much higher accuracy - since it was much simpler to distinguish between "0" and "1" I got almost always 99% of accuracy.

Since I already had debugged C code it was quite easy to rewrite it in BASIC. I found a great program called bas2tap that significantly simplified my code typing allowing to create tape files directly from text source files outside ZX Spectrum and saving me lots of troubles typing the code without an original keyboard. The Sinclair basic had a unique feature: a single stoke on a key brought an entire keyword, for example pressing P lead to PRINT keyword to be inserted to the code. This method increased both typing speed and reduced code memory footprint, but on the other hand typing the code without the original keyboard with all keyword marks was quite hard.

So I had rewritten the code for BASIC and did 2 class training. It took around 10 hours for single epoch. I added some simple profiling and managed to cut the time in half to 5 hours per epoch. It was painful to train even with the emulator that allowed to increase the emulation speed by 100 times.

So I felt that BASIC was rather unfeasible and since back than I didn't have such a good C compiler, the entire concept of the project felt a little bit over-optimistic.

Than I discovered this BASIC compiler: ToBoS-FP. One of its key advantages was highly effective implementation of floating point routines.

So I compiled the code with it and it worked very fast! However it lacked documentation and I only managed to find some Russian source regarding compilation of big programs. Another problem was lack of "LOAD" function I related on to access the test data. But I found a workaround by simply calling to proper ROM routines and the problem was solved.

Performance

So how was the performance?

Two digits training. Note: train time is per epoch, total is for 2 train epochs and 1 testing,

BASIC BASIC C/float C/fixed - ToBoSFP z88dk z88dk+asm Train: 5h20m 12m 3.7m 1.5m Test: 2h19 6m 1.2m 0.5m Total: 12h59m 30m 8.6m 3.5m

10 digits training for 5 epochs:

BASIC C/float C/fixed ToBoSFP z88dk z88dk+asm Train: 3h15m 2h16m 41.4m Test: 1h21m 41m 13.6m Total: 17h36m 12h01m 3h40m

So I was really surprised that BASIC compiler had gave such a good results and that training times were quite feasible.

Summary

I had lot of fun doing such a project. It reminded me how well ZX Spectrum was designed. It was an excellent educational tool.

Now it is probably the time to find some real hardware and try it.

The full code is posted on github:

https://github.com/artyom-beilis/zx_spectrum_deep_learning/

one comment Friday, May 18, 2018, by artyom ; Posted in: Progress

Security Bug Fixes:

Fixed security bug fix in JSON parser module that can lead to DOS

Bugs Fixed:

Fixed issues #36 - building with GZIP disabled

Fixed issue #150 - incorrect parsing of multipart form

Changes:

By default CppCMS now uses OpenSSL instead of GNU-TLS if both available (you can change behavior back by adding -DDISABLE_OPENSSL=ON to cmake)

Special Thanks to Khaled Yakdan from code-intelligence.de for reporting this security issue

Tuesday, January 16, 2018, by artyom ; 0 comments

CppCMS 1.2.0 was released.

It is same as 1.1.1 with exception of modification of the license to MIT instead of LGPL.