Dr Piotr Wozniak, June 2018

By Jul 10, 1990, Tuesday, I reached Dev<3 and felt like the problem was almost "solved". On Jul 12, 1990, I improved to Dev=2.877 (incidentally, my Thesis speaks of 2.887241). However, by Aug 27, 1990, I declared the problem unsolvable. My notes from that day say:

Personal anecdote. Why use anecdotes?

I solved the problem of intermittent learning by showing that it is unsolvable! One parameter is not able to describe the strength of memory related to the whole page of items. This shows that there are no optimal intervals for items with low E-factors! Aug 27, 1990

On Aug 30, 1990, I described the model for my Master's Thesis. The text covered 15 pages that do not make for good reading. I bet nobody has ever had the patience to read it all. That chapter was not even published at supermemo.com when my Master's Thesis was put on-line in excerpts in the late 1990s.

However, the conclusions drawn on the basis of the model had a profound effect on my thinking about memory in the decades that followed. The whole idea behind the model is actually reminiscent of the optimizations used to deliver Algorithm SM-17 (2014-2016).

When I declared the problem unsolvable, I meant that I could not accurately describe the memory of "difficult pages", as heterogeneous materials require more complex models. However, my notes from Aug 31, 1990 sound far more optimistic:

Personal anecdote. Why use anecdotes?

Non-stop work on the intermittent learning model. By night, the computer did not manage to get me closer to the solution. However, I had a great idea to calculate optimal intervals using the record-breaking function of the IL model. When I saw the results on the screen I could not believe my [bleep] eyes. These were exactly the same intervals, which I found in 1985 while trying to formulate the SuperMemo method. I was happy like a dog with two tails jumping around the house. So I can say that I really solved the IL problem (compare August 27, 1990). But this success was not everything I was given to discover today. I found that:

optimal factors decrease with successive intervals (previously I had an intuition that it is so),

for the forgetting index equal 10% the retention is 94% (as in the EVF database),

retention is in a linear relation to the forgetting index [comment 2018: in a small range for heterogeneous material] (this could not be calculated from my simulation experiments carried out in January),

the model says that the desirable value of the forgetting index is 5-10% (workload-retention trade-off),

strength of memory increases most if the interval is twice as long as the optimal one!!!

the strength of memory increases most if the forgetting index is 20%. [...] my formulas work only when intervals are not much shorter than the previous strength. Aug 31, 1990

Past (1990) vs. Present (2018)

Conclusions at the end of the chapter and the procedure itself are reminiscent of the methodology I used in 2005 when looking for the universal formula for memory stability increase and then, in 2014, when Algorithm SM-17 was based on a far more accurate mathematical description of memory. Like the newest SuperMemo algorithm, the model made it possible to compute retention for any repetition schedule. Naturally, it was far less accurate as it was based on inferior data. Moreover, what SuperMemo 17 does in real time, it took many hours of computations back in 1990.

This old, seemingly boring portion of my Master's Thesis has grown in importance by now. I dare say that only inferior data separated that work from Algorithm SM-17, which emerged 25 long years later. I quote the text with minor notational and stylistic improvements, omitting the chapter on forgetting curves, which was erroneous due to the highly heterogeneous material used in computations:

Archive warning: Why use literal archives?

This text is part of: "Optimization of learning" by Piotr Wozniak (1990)

Model of intermittent learning

The SuperMemo model provides a basis for the calculation of optimal intervals that should separate repetitions in the process of time-optimal learning. However, it does not make it possible to predict the changes of memory variables if repetitions are done at irregular intervals. Below I present an attempt to augment the SuperMemo model so that it can be used in the description of the process of intermittent learning. In Chapter 3, I mentioned the way in which I had learnt English and biology before the SM-0 algorithm was developed. Data collected during that time (1982-1984) provide an excellent basis for the construction of the model of intermittent learning. Items, formulated in compliance with the minimum information principle (usually having the form of pairs of words), were grouped in pages subject to the irregular repetitory process. The collected data, available in computer-readable form, include the description of repetitions of 71 pages, and in addition, 80 similar pages participating in a process supervised by the SM-0 schedule.

Similarity to Algorithm SM-17

Note that the formulation of the problem is reminiscent of the procedure used to compute the stability increase matrix (SInc[]) in Algorithm SM-17. Memory stability was rescaled to make it possible to interpret it as an interval. Even the symbols are similar: S for stability and D for deviation, with page lapses substituting for R.

I loved playing with various optimization algorithms. You can still visually observe in SuperMemo 17 how the algorithm runs surface fitting optimizations (see picture). Doing it with 12 variables might have been a bit inefficient, but I never cared about the method as long as I got interesting results that provided new insights into how memory works.

For the sake of those familiar with Algorithm SM-17, we changed the notation in the text below. In addition, symbols such as In and Ln could easily be misread in print as logarithms.

The list of changes:

Ln -> Laps[n]

In -> Int[n]

Dn -> Dev[n]

R -> RepNo

Formulation of the problem of intermittent learning

Archive warning: Why use literal archives?

11.1. Formulation of the problem of intermittent learning

There are 161 pages. Each page contains about 40 items. For each page, the description of the learning process (collected during experimental repetitions) has the following form:

((-, Laps[1]), (Int[2], Laps[2]), (Int[3], Laps[3]), ..., (Int[n], Laps[n]))

where:

Int[i] - inter-repetition interval used before the i-th repetition (it ranges from 1 to 800),

Laps[i] - number of lapses of memory during the i-th repetition (it ranges from 0 to 40),

n - total number of repetitions (it ranges from 3 to 20).

Find the functions f and g described by the formulas:

S(1) = S1

S(n) = f(S(n-1), Int[n], Laps[n])

Laps(n) = g(S(n-1), Int[n])

where:

S(n) - any variable corresponding to the strength of memory after the n-th repetition (compare Chapter 10),

Int[n] - interval used before the n-th repetition; taken from data collected during intermittent learning,

Laps[n] - number of memory lapses in the n-th repetition; taken from data collected during intermittent learning,

Laps(n) - estimation of the number of memory lapses in the n-th repetition (it should correspond with Laps[n]),

S1 - a constant,

so as to minimize the function Dev:

Dev = sqrt((Dev[1] + Dev[2] + ... + Dev[161]) / RepNo)

Dev[i] = sqr(Laps(1) - Laps[1]) + sqr(Laps(2) - Laps[2]) + ... + sqr(Laps(n) - Laps[n])

where:

Dev - function that describes the difference between values yielded by the functions f and g, and values collected during intermittent learning (it reflects the difference between experimental and theoretically predicted data),

RepNo - total number of repetitions recorded on all pages,

Dev[i] - component of the function Dev describing the deviation for the i-th page,

Laps(j) - number of lapses calculated for the i-th page and j-th repetition using the functions f and g,

Laps[j] - number of lapses of memory for the i-th page and j-th repetition; taken from data collected during intermittent learning,

sqrt(x) - square root of x,

sqr(x) - second power of x.

Note that functions f and g will provide a basis for valuable biological considerations only if they are simple and defined by a limited number of parameters (e.g. a*ln()+b or a*exp()+b etc.). Otherwise, one could always construct a gigantic, meaningless formula to automatically put Dev to zero.
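To make the optimization target concrete, here is a minimal Python sketch of the Dev computation defined above. The data layout and the candidate functions f and g are illustrative assumptions, not code from the Thesis:

import math

# Each page is a list of (interval, lapses) records; the first repetition
# has no preceding interval, hence None (the "-" of the Thesis notation).
pages = [
    [(None, 3), (7, 5), (18, 4), (35, 6)],  # hypothetical page records
    [(None, 2), (4, 1), (12, 3)],
]

S1 = 1.0  # the constant S(1)

def f(S_prev, interval, lapses):
    # candidate formula for the new strength (illustrative guess)
    return 1.5 * interval * math.exp(-0.15 * lapses) + 1

def g(S_prev, interval):
    # candidate estimate of the number of lapses (illustrative guess)
    return interval / S_prev

def dev(pages):
    total, rep_no = 0.0, 0
    for page in pages:
        S = S1
        for interval, lapses in page:
            rep_no += 1
            if interval is None:
                continue  # no prediction is possible for the first repetition
            total += (g(S, interval) - lapses) ** 2  # sqr(Laps(j) - Laps[j])
            S = f(S, interval, lapses)
    return math.sqrt(total / rep_no)

print(dev(pages))  # Dev for the toy data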

Solution to the problem of intermittent learning

Archive warning: Why use literal archives?

11.2. Solution to the problem of intermittent learning

In the search for functions f and g that minimize the value of Dev, I used a numerical minimization procedure described in Wozniak, 1988b [A new algorithm for finding local maxima of a function within the feasible region. Credit paper]. Exemplary functions used in the search could look as follows:

S(1) = x[1]

S(n) = x[2]*Int[n]*exp(-Laps[n]*x[3]) + x[4]

Laps(n) = x[5]*(1 - exp(-Int[n]/S(n-1)))

where:

x[i] - variables that are computed by the minimization procedure,

S(n), Laps(n), Laps[n] and Int[n] - as defined in 11.1.

Note that the function f describing S(n) does not use S(n-1) as its argument (the formulation of the problem allows, but does not require, that the new strength be calculated on the base of the previous strength). In order to retain simplicity and save time, I set a limit of 12 variables used in the process of minimization. I tested a great gamut of mathematical functions constructed in accordance with obvious intuitions concerning memory (e.g. that with time passing by, the number of lapses of memory will increase). These included exponential, logarithmic, power, hyperbolic, sigmoidal, bell-shaped, polynomial and reasonable combinations thereof. In most cases, the minimization procedure reduced the value of Dev to less than 3, and functions f and g assumed similar shapes independent of their nature. The lowest value of Dev obtained with the use of fewer than 12 variables was 2.887241. The functions f and g were as follows:

constant S(1)=0.2104031;

function Sn(Intn, Lapsn, S(n-1));
begin
  S(n) := 0.4584914*(Intn+1.47)*exp(-0.1549229*Lapsn-0.5854939)+0.35;
  if Lapsn=0 then
    if S(n-1)>Intn then S(n) := S(n-1)*0.724994
    else S(n) := Intn*1.1428571;
end;

function Lapsn(Intn, S(n-1));
var quot;
begin
  quot := (Intn-0.16)/(S(n-1)-0.02)+1.652668;
  Lapsn := -0.0005408*quot*quot+0.2196902*quot+0.311335;
end;

Without significantly changing the value of Dev, these functions can be easily converted to the following form:

S(1) = 1

for Int[n] > S(n-1): S(n) = 1.5*Int[n]*exp(-0.15*Laps[n]) + 1

Laps(n) = Int[n]/S(n-1)

Note that:

particular elements of the functions were dropped or rounded whenever the operation did not considerably affect the value of Dev,

strength was rescaled to allow it to be interpreted as an interval for which the number of lapses equals 1 and the forgetting index equals 2.5% (there are 40 items on a page and 1/40=2.5%),

the formula for strength can only be valid if Int[n] is not much less than S(n-1). This is because the value S(n-1) must be used in the calculation of S(n) if the number of lapses is low, e.g. for Int[n] <= S(n-1): S(n) = S(n-1)*(1 + 0.5/(1-exp(S(n-1)))*(1-exp(-Int[n]))),

the function g intentionally did not involve S(n-1), to avoid recursive accumulation of errors in calculations for successive repetitions (note that the formula used does not consider the history of the process),

the formulas cannot be used to describe any process in which intervals are manifold longer than the optimal ones. This is because for Int[n] -> infinity, the value of Laps(n) exceeds 100%,

the formulas describe learning of collective items characterized by a more or less uniform distribution of E-Factors. Therefore, they cannot be used universally for items of variable difficulty.

As for now, the above formulas make up the best description of the process of intermittent learning, and will later be referred to as the model of intermittent learning (IL model for short).
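The simplified model is compact enough to run directly. A minimal Python sketch with the constants taken verbatim from the text; the short-interval branch is left out because, as noted above, the formulas are only valid when Int[n] is not much shorter than S(n-1):

import math

def strength(interval, lapses):
    # S(n) = 1.5*Int[n]*exp(-0.15*Laps[n]) + 1, valid for Int[n] > S(n-1)
    return 1.5 * interval * math.exp(-0.15 * lapses) + 1

def expected_lapses(S_prev, interval):
    # Laps(n) = Int[n]/S(n-1); deliberately ignores earlier history
    return interval / S_prev

# hypothetical repetition of a 40-item page after a 5-day interval
S = 1.0                        # S(1) = 1
laps = expected_lapses(S, 5)   # expected lapses out of 40 items
S = strength(5, laps)          # new strength after the repetition
print(laps, S)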

Simulations based on the model of intermittent learning

With the formulas found above, I could run a whole series of simulation experiments that would help me explore many hypothetical scenarios of the behavior of memory in various circumstances. Those simulations shaped the progress of SuperMemo for many years to follow. In particular, the trade-off between workload and retention played a major role in the optimization of learning as of SuperMemo 6 (1991). To this day, it is the forgetting index (or retrievability) that provides the guiding criterion in learning, not the intuitively natural increase in memory stability that may occur at lower levels of recall. A set level of memory lapses played the role of the forgetting index below.

Archive warning: Why use literal archives?

11.4. Verification of the model of intermittent learning

To verify the consistency of the model of intermittent learning with the SuperMemo theory, let us try to calculate optimal intervals that should separate repetitions. The optimal interval will be determined by the moment at which the number of lapses reaches a selected value Laps_o. The algorithm proceeds as follows:

1. i:=1
2. S(i):=1
3. Find Int(i+1) such that Laps(i+1) equals Laps_o, using the formula Int(n)=Laps_o*S(n-1) (taken from the IL model), where Int(n) denotes the (n-1)th optimal interval
4. i:=i+1
5. S(i):=1.5*Int(i)*exp(-0.15*Laps_o)+1 (taken from the IL model)
6. goto 3

If Laps_o equals 2.5 (forgetting index 6.25%) and the exact variant of the model of intermittent learning is used, then an amazing correspondence can be observed (compare the experiment presented on page 16, Chapter 3.1):

Rep - number of the repetition
Interval - optimal interval preceding the repetition, determined by Laps_o=2.5 on the base of the IL model
Factor - optimal factor equal to the quotient of the optimal interval and the previously used optimal interval
SM-0 - optimal interval calculated on the base of the experiments leading to the algorithm SM-0

Rep  Interval  Factor  SM-0
2    1.8       -       1
3    7.8       4.36    7
4    16.8      2.15    16
5    30.4      1.80    35
6    50.4      1.66    -
7    80.2      1.59    -
8    124       1.55    -
9    190       1.53    -
10   288       1.52    -
11   436       1.51    -
12   654       1.50    -
13   981       1.50    -
14   1462      1.49    -
15   2179      1.49    -
16   3247      1.49    -
17   4838      1.49    -
18   7209      1.49    -

Obviously, the exact correspondence is, to some extent, a coincidence, because the experiment leading to the formulation of the algorithm SM-0 was not that sensitive. It is worth noticing that optimal factors tend to decrease gradually! This fact seems to confirm recent observations based on the analysis of the matrix of optimal factors used in the algorithm SM-5. If Laps_o equals 4 (forgetting index 10%, as in the algorithm SM-5), then the sequence of optimal factors resembles a column of the OF matrix in the algorithm SM-5. Also, the knowledge retention matches almost ideally the one found in SM-5 databases.

Rep  Interval  Retention  Factor
2    3         93.21678   -
3    16        93.80946   4.89
4    43        93.97184   2.74
5    102       94.04083   2.39
6    232       94.06886   2.27
7    517       94.08418   2.23
8    1138      94.09256   2.20
9    2502      94.09737   2.20
10   5481      94.09967   2.19

The value of retention was obtained by averaging its value calculated for each day of the optimal process:

R=(R(1)+R(2)+...+R(n))/n

R(d)=100-2.5*Laps(d-dlr)

where: R - average retention

R(d) - retention on the d-th day of the process

Laps(Int) - expected number of lapses after the interval Int

dlr - day of the process on which the last repetition was scheduled
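The six-step procedure is easy to replay in code. A sketch in Python, assuming the simplified variant of the IL model; since the tables above were produced with the exact 12-parameter variant, the numbers will only roughly agree:

import math

def optimal_intervals(laps_o, n_reps):
    S = 1.0                     # steps 1-2: i:=1, S(i):=1
    intervals = []
    for _ in range(n_reps):
        interval = laps_o * S   # step 3: Int(n) = Laps_o*S(n-1)
        intervals.append(interval)
        # step 5: S(i) := 1.5*Int(i)*exp(-0.15*Laps_o) + 1
        S = 1.5 * interval * math.exp(-0.15 * laps_o) + 1
    return intervals

print([round(i, 1) for i in optimal_intervals(laps_o=2.5, n_reps=8)])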

Workload vs. Retention trade-off

Despite the inaccuracies coming from heterogeneous material, solid conclusions could be drawn about the impact of the forgetting index on the amount of time that needs to be invested in learning. Those observations survived the test of time:

Archive warning: Why use literal archives?

Conclusions: model of intermittent learning

The ultimate conclusions drawn at the end of the chapter stood the test of 3 decades. Only the claim about the non-exponential shape of forgetting curves is inaccurate. As the entire model was based on heterogeneous data, the exponential nature of forgetting could not have been revealed.

Archive warning: Why use literal archives?

This text is part of: "Optimization of learning" by Piotr Wozniak (1990)

Interim summary

The model of intermittent learning was constructed, making it possible to estimate knowledge retention upon different repetition schedules

The model strongly indicates that the forgetting curve is not exponential [wrong: see Exponential nature of forgetting]

The model satisfactorily corresponds to experimental data

With a striking accuracy, the model approximates optimal intervals and knowledge retention implied by the SuperMemo model

The model indicates that optimal factors decrease in successive repetitions and asymptotically approach the ultimate value

The model indicates that the desirable value of the forgetting index used in time optimal learning should fall into the range 5% to 10%

The model indicates an almost linear relation of the forgetting index and knowledge retention

The model shows that the greatest increase of the strength of memory occurs when intervals are approximately 2 times longer than those used in the SuperMemo method. This is equivalent to a forgetting index equal to 20%

1991: Employing forgetting curves

Painful birth of SuperMemo World (1991)

1991 was the most important year since the birth of SuperMemo. It was a year of big decisions, stress, drama, discovery and hard work. At the start of the year, there were three great believers in SuperMemo: Biedalak, Murakowski and myself. We were all in the same spot in our lives: transitioning from the years of unconcern at university to the uncertainty of independent adulthood. By default, we all dreamt of big science in the US: Biedalak dreamt of artificial intelligence, Murakowski of quantum physics, and I wanted to crack the secrets of molecular memory. In retrospect, graduate students from the Eastern Bloc with good transcripts, exam results and rock-solid recommendations were pretty welcome in the US. Things got more complex if they demanded full financial support. I did not have a penny. Moreover, eager Easterners were often treated as dutiful labor. Zeal for their own projects and great ideas might have been less welcome. I will never know. The three believers all had different visions for SuperMemo.

On Jan 3, 1991, I started the implementation of the new spaced repetition algorithm for SuperMemo 6. On the same day, Murakowski left for London where he would pursue his educational dreams while trying to sell SuperMemo 2. He would not sell via a distribution channel or in a shop. He would need to go from person to person, explain the merits of the program and hopefully collect a few bucks to keep the hope going.

In the meantime, Biedalak and I met regularly for a 10 km jog combined with winter swimming and a brainstorming session on the way back home. We mostly spoke of studying in the US and selling SuperMemo. More and more frequently, the idea of our own company started coming up.

I started my work on the new spaced repetition algorithm with some ideas that would change SuperMemo forever. Algorithm SM-6 used in SuperMemo 6 was a breakthrough that would power further development over the next 25 years. It would re-employ the simple experimental procedure that led to spaced repetition in 1985, but would do it in an automated manner. It would collect performance data and choose the best time for review: it would plot the user's forgetting curve. This also meant that the user would be able to decide the acceptable probability of forgetting for every single item (i.e. the optimum level of the retention-workload trade-off).

At that time, I was still bound to 360 KB diskettes. For that reason, SuperMemo still could not keep all the repetition histories that would fully replicate the 1985 approach on a massive scale. However, on Jan 6, 1991, I had a simple idea. I could just collect data about the forgetting curves for classes of items of different difficulty and stability. Instead of the full record, I would only update the approximation of how many items in a given class are retained in memory at a given time (i.e. at a given level of retrievability). That idea survives at the core of SuperMemo to this day. Even with the full record of repetition histories today, SuperMemo still instantly knows the expected retrievability of items in a given class.
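A minimal sketch of that bookkeeping idea: items are grouped into classes, and only running recall counts per class and time bin are stored, never the full repetition record. The binning scheme here is a hypothetical illustration:

from collections import defaultdict

# (difficulty bin, stability bin, time bin) -> [recalled, seen]
tallies = defaultdict(lambda: [0, 0])

def record_outcome(difficulty_bin, stability_bin, time_bin, recalled):
    # update the running recall approximation for one class of items
    cell = tallies[(difficulty_bin, stability_bin, time_bin)]
    cell[0] += int(recalled)
    cell[1] += 1

def recall_rate(difficulty_bin, stability_bin, time_bin):
    recalled, seen = tallies[(difficulty_bin, stability_bin, time_bin)]
    return recalled / seen if seen else None

record_outcome(2, 1, 3, True)
record_outcome(2, 1, 3, False)
print(recall_rate(2, 1, 3))  # 0.5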

At a crossroads in life, I was finally free from school. There is a powerful emotion that millions of teens and young adults face in their lives: the traumatic move from being a slave called "pupil" or "student" to the freedom of becoming an "unemployed adult". The psychological shakeup can be even more dramatic if one turns from "good student" into "unemployed 28-year-old living with his mom". As if at the flip of a light switch, the whole world seems to change from cheerful support to gloomy-faced condemnation mixed with pity.

I kept learning and worked on new ideas for SuperMemo in an atmosphere of freedom mixed with uncertainty. For me, uncertainty is an energizer. However, on Feb 12, 1991, I learned that my mom had been diagnosed with terminal cancer. To the mix of freedom and uncertainty, it added a sense of gloom. For me again, gloom can also be an energizer. I tripled my learning about cancer as if in hope of finding some magic therapy on my own. This shows how unreasonable optimism can be a key to productivity and to surviving hard times. By working harder I could dispel the gloom. High productivity is a sure anti-depressant. My hard work left no room for dark thoughts. I was confident I would cure my mom!

Incidentally, at the moment of mom's diagnosis, I was also writing a program to simulate the optimum behavior of memory in response to the environment; a way to show mathematically why the two component model of memory is optimal. At the diagnosis, I threw that effort out of my schedule to learn about cancer. I never completed that program, and the idea still lives in limbo, pushed away by other projects.

On Mar 6, 1991, during one of our jogging-cum-brainstorming days with Biedalak, someone tossed out the name SuperMemo World. Little did we know that four months later it would become the name of our company, which has now survived 27 years.

On Mar 12, 1991, I made my first repetitions with the new algorithm in SuperMemo 6 while my mom rested on her deathbed. A week later, she died peacefully in her sleep at the young age of 70. In similar circumstances, the usual picture involves family meetings, mourning, a funeral, and a whole host of traditions with roots in religion that I could never accept as rational. Instead, 9 hours after my mom's death, I worked on a better method for a fast approximation of the OF matrix. In that work, I capitalized on the job I had once done for the ZX Spectrum. I would employ linear regression along the difficulty columns and negative exponential regression along the repetition rows. Years later, I found that power regression is more appropriate for the latter. Only Algorithm SM-8, developed four years later, would make full use of those ideas. However, I mention it mostly to illustrate how hard work and productivity can work great as a remedy against gloom and possible depression. At that time I discovered that the impact of an emotional trauma follows a circadian curve. I would work hard in the morning, but the gloom would keep creeping back into my mind in the evening. Sleep would be the liberation and the best anti-depressant. Since those early days, I have been a firm believer in the idea that sleep and learning carry a solution to the problem of depression; however, I never truly had a chance to work on it. It would help if I suffered a bit myself, but either I have some good resilience endowment, or, more likely, I instinctively employ the tools of good sleep and high productivity in hard times. Ever since Sapolsky called depression the "worst disease in the world", I have wanted to find a formula for preventing depression. I sense there is a simple formula. Perhaps that naive childlike optimism itself is part of the solution?

On Apr 13, 1991, we decided that SuperMemo 2 should be released as freeware. We hoped it might educate potential users abroad about the power of spaced repetition. However, initially, we had to send out diskettes with free SuperMemo at our own cost. Only in 1993 did we upload SuperMemo to a local BBS called "Onkonet". It would take some more years before we could upload future versions to Simtel and freeware sites. That freeware idea had an interesting side effect: by the end of the year, it was clear that people would start using the program and then give up. This was a hint of an inherent problem with spaced repetition: poor motivation resulting from poor skills would produce a high dropout rate. We also heard that others would try to sell SuperMemo 2 as if it was a commercial product.

On May 2, 1991, I implemented the option for setting the requested forgetting index in SuperMemo 6. On July 5, 1991, SuperMemo World was born. One of the first investments was a PC with a hard disk that would finally help me move away from the slow era of floppy disks.

On Nov 23, 1991, SuperMemo was announced as a finalist of the Software for Europe competition. This saved SuperMemo World.

Slow start of commercial SuperMemo

When we set up SuperMemo World with Krzysztof Biedalak on Jul 5, 1991, the future looked so bright we needed to buy shades. The earth is populated by a highly intelligent species whose members all need to learn things. This whole population was our market. The only problem was how to convince all those smart people that two poor students educated behind the Iron Curtain had anything of value to offer. We could not have used the web for that job: SuperMemo is older than the web itself. We could not afford advertising for lack of capital. There was no venture capital culture in Poland in 1991. All we could do was put the first few copies of SuperMemo in file folders and place them on the shelves of nearby computer shops. As we aimed at global domination, we did not even have a manual in Polish. Instead of first sales, we had a long summer of silence and creeping doubts.

Figure: In 1991, we delivered the first copies of SuperMemo 5 for DOS to shops in Poznan (Poland) in pink folders with a sticker. The manual did not include a translation to Polish. Amazingly, we found a few buyers. The first sale took place some time between September 9 and 11, 1991 (computer shop Axe Prim)

(reconstructed on the basis of original folders and stickers)

Why was it hard to sell the first copy? I can reconstruct the scenario from the words of one of our first customers, who actually visited a shop and had a look at the first SuperMemo displayed in public. On a shelf with computer programs, along with shiny boxes from Microsoft, he noticed a shabby folder with the enticing words: "Your breakthrough speed-learning software". He picked up the folder and opened the manual, which was a stack of poorly xeroxed pages in English. In lofty words, it told a story that defied belief. It was all too good to be true: faster learning, great retention, a new scientific method, a little cost in time, etc. He did not contemplate a purchase, as the package was pretty costly (around $100, which was a lot in Poland in 1991); however, he approached the salesperson to find out who the people behind SuperMemo were. The owner of the shop knew SuperMemo pretty well and explained. The story started looking credible. The customer never forgot the episode. A few months later, he heard of SuperMemo in some local journal and became one of the first paying customers. His registration coupon arrived in January 1992, and the history of his upgrades says he stayed with SuperMemo for decades; now his son is one of the regular customers.

However, back in summer 1991, we had no sales and by fall, everyone except for myself started having serious doubts. Not about SuperMemo, but about the viability of the business.

It should help to know how we met. Biedalak and I had been friends since forever. I attended school with his brother, we lived 200 meters apart, and we qualified for the same year of computer science at university. I cannot say how I convinced Biedalak that SuperMemo is great. We have just been too close, and he has always been in the circle. This part was easy. Tomek Kuehn was one of the first great believers in SuperMemo. He was also a great programmer and a great inspiration, and he grasped the idea instantly. He wrote two versions of SuperMemo himself: for the Atari 800 in 1988, and for the Atari ST in 1989. In January 1989, he even sold 10 copies of SuperMemo 2 using an advert in one of the computer journals: Komputer. I presume he did not recover the invested money. Upon graduation, he already had his own business: a computer shop. This shop was also one of the first to present SuperMemo to its customers. His partner and friend was Marczello Georgiew, who did not need much convincing either. Last but not least, I met Janusz Murakowski during GRE exams in Budapest in 1990. A great mathematical mind, he might be the fastest convert to SuperMemo ever. During our train trip back to Poland, I mentioned SuperMemo. He was instantly captivated. A few days later, he was already an enthusiastic user of SuperMemo 2 (as of Jun 13, 1990). In our company rap anthem, we sang "we are the guys who sell SuperMemo". It was very hard to convince people that SuperMemo works, but the guys on the team have always been enthusiastic.

By November 1991, the enthusiasm was thawing. Had we continued without success, we would have gradually lost the team in proportion to their involvement and passion. Within a few more months, the company might have died. SuperMemo would not have died. I would certainly have looked for a buyer, or continued one way or another. I was too tied to the product. I used it myself, and all my knowledge was invested in my databases. I might have thought of returning to the idea of a PhD in the US. In the same way as I was able to combine work at the university in Holland in 1989 with programming "after hours", I would probably have continued until some breakthrough, e.g. on the web. Perhaps it would have become an open source product? Luckily, Dr Wojciech Makałowski of the Department of Biopolymer Biochemistry suggested we submit SuperMemo to the Software for Europe competition. By some miraculous stroke of good luck, we qualified for the final, and this was instantly noticed by the Polish media, especially computer journals. As of that point, SuperMemo had an easy ride with the Polish press, which became more and more intrigued. Andrzej Horodeński was first, and Pawel Wimmer was second and most faithful to this very day. Wimmer actually used SuperMemo 2, which he probably received from Tomasz Kuehn at the time of his Komputer journal advert in 1989.

1.5 years after its birth, SuperMemo World had finally become profitable. Not bad.

SuperMemo World was a fantastic setup from the get-go. We had no injection of venture capital in Poland in 1991, so we had to pull ourselves up by our own bootstraps by selling what others considered to be "snake oil". We might have easily failed, but we survived by the sheer power of passion, belief, and a big stroke of good luck.

Origins of Algorithm SM-6

Algorithm SM-6 was first used in SuperMemo 6 (1991); however, it kept evolving in SuperMemo 7 (1992). There has never been an SM-7 version despite multiple changes. Most notably, as of 1994, an exponential function was used to approximate forgetting curves in SuperMemo 7 for Windows. OF matrix approximations have also been improved over time.

Figure: SuperMemo 7 for Windows (1992) displaying a forgetting curve based on averages.

Figure: SuperMemo 7 for Windows (1994) displaying a forgetting curve approximated with an exponential function. Vertical axis represents recall in percent. Horizontal axis corresponds with time represented by U-Factor

The most important new component of Algorithm SM-6 was collecting data on the rate of forgetting. Forgetting curves make it easy to accurately determine optimum intervals. This eliminated the need for the slow and inaccurate bang-bang approach of Algorithm SM-5:

Archive warning: Why use literal archives?

This text is part of: "Economics of Learning" by Piotr Wozniak (1995)

In Algorithm SM-5, the process of determining the value of a single entry of the matrix of optimal factors looked as follows (see before):

1. Set the initial value to an average optimal factor value (OF) obtained in previous experiments
2. If the grade produced by the entry in question was (1) greater than the desired value, then increase the value of OF, (2) less than the desired value, then decrease OF, or (3) equal to the desired value, then do not change OF

The above approach shows that the optimum value of OF could be reached only after a great number of repetitions, and what is worse, the greater the ordinal number of a repetition, the longer it would take to execute the modification-verification cycle (i.e. the cycle in which an OF entry is changed, and verified upon scheduling another repetition with a correspondingly long interval).

Introducing the concept of the forgetting index

The novelty of Algorithm SM-6 is to approximate the slope of the forgetting curve corresponding to a given entry of the matrix of optimal factors, and compute the new value of the relevant optimal factor directly from the approximated curve. In other words, no modification-verification cycle is necessary in Algorithm SM-6 because of establishing the deterministic relationship between the forgetting curve and the optimum inter-repetition interval. The modification of the optimal factor occurs immediately after a repetition upon approximating the new forgetting curve derived from data that include the grade provided in the recent response. This modification not only made it possible to greatly accelerate the process of determining the optimum values of the matrix of optimal factors, but also provided a means for establishing the desired level of knowledge retention that will be reached in the course of the learning process (see an exemplary forgetting curve). The desired level of knowledge retention is determined by the proportion of items that are not remembered at repetitions. This proportion is called the forgetting index (items are classified as remembered or forgotten on the basis of grades provided by the student in self-assessment of his or her progress).

Figure: An exemplary forgetting curve plotted in the course of repetitions (over 40,000 repetition cases recorded).

In the figure presented above, the lapse of time is represented by the interval in days. The vertical axis represents knowledge retention stated as a percentage. The horizontal line located at the retention level of 90% determines the requested forgetting index, i.e. the desired proportion of items that should be forgotten at the moment of repetition. The optimum interval will then naturally come at the cross-section of the requested forgetting index line with the forgetting curve. In the example above, the optimum interval equals seven days. The presented forgetting curve has been plotted on the basis of 40489 recorded repetition cases. See later in the text for explanation of the values R-Factor (RF), O-Factor (OF), etc.

Because of the highly irregular nature of the matrix of optimal factors computed directly from forgetting curves, in Algorithm SM-6, the matrix used in spacing repetitions represents a smoothed version of the so-called matrix of retention factors (matrix RF), which is derived directly from forgetting curves corresponding to particular entries of the matrix OF. In other words, forgetting curves determine the value of entries of the matrix RF, and only the smoothed equivalent of the latter, the matrix OF, is used in computing optimum intervals.
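A minimal sketch of that deterministic relationship: fit the decay of a recorded forgetting curve, then read the optimum interval off the cross-section with the requested forgetting index. The linearized least-squares fit and the sample points are my own illustration, not SuperMemo's actual code:

import math

def fit_decay(points):
    # least-squares fit of k in R = exp(-k*u), after taking logarithms
    num = sum(u * -math.log(r) for u, r in points)
    den = sum(u * u for u, _ in points)
    return num / den

def optimum_u_factor(points, requested_fi=0.1):
    # cross-section of the fitted curve with the requested-FI line
    k = fit_decay(points)
    return -math.log(1 - requested_fi) / k

# hypothetical forgetting-curve record for one OF-matrix entry:
# (U-factor, measured recall) pairs
curve = [(1, 0.99), (3, 0.97), (7, 0.90), (12, 0.84), (20, 0.74)]
print(optimum_u_factor(curve))  # U-factor at which ~90% recall is expected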

Algorithm SM-6

The description of the algorithm below is taken, with some clarifications, from my PhD Thesis, and refers to the status quo as of 1994:

Archive warning: Why use literal archives?

This text is part of: "Economics of Learning" by Piotr Wozniak (1995)

The learned knowledge is split into smallest possible pieces called items

Items are formulated in the question-answer form

Items are memorized by means of a self-paced drop-out technique, i.e., by responding to the asked questions as long as it takes to provide all correct answers

After memorizing an item, the first repetition is scheduled after an interval that is the same for all of the items. Its value is determined by the desired level of knowledge retention, which in turn can be converted into an interval by using an average forgetting curve taken from an average database of an average student (Wozniak 1994a). The desired retention is specified by means of the so-called forgetting index, which corresponds to the proportion of items forgotten at repetitions (see Retention vs. the forgetting index below for how to compute retention from the forgetting index, and vice versa). Note that the first interval may be randomly shortened or lengthened for the sake of speeding up the optimization process (varying intervals increase the accuracy of approximating the forgetting curve)

The first interval is computed as for an average student and an average database. However, as soon as the recorded value of the forgetting index deviates from the requested level, the length of the first interval is modified accordingly. The new value of the interval is derived from the approximation of the negatively exponential forgetting curve plotted in the course of repetitions. With each repetition score recorded, the plot becomes more and more accurate, and the used value of the optimum inter-repetition interval settles at the point that ensures the selected level of knowledge retention

After each repetition, the student produces a grade, which determines the accuracy and easiness of reproducing the correct answer

On the basis of the grades, items are classified into difficulty categories. Their difficulty is reestimated in each successive repetition. The difficulty of each item is characterized by the earlier mentioned E-factors (E stands for "easiness"). E-factors are equal to 2.5 for all items on the entry to the learning process, and modified after subsequent repetitions. For example, grades above four result in slightly increasing the E-factor (good grades indicate easy items), while grades below four reduce the E-factor. Historically, E-factors were used to determine how many times intervals should increase in successive repetitions of items of a given difficulty. At present, E-factors are only used to index the matrices of optimal factors and retention factors, and may bear little relevance to the actual interval increase

Different optimal intervals are applied to items of different difficulty. Different intervals are applied to items that have been repeated a different number of times. The function of optimal intervals is constantly modified in order to produce the desired knowledge retention determined by the forgetting index. In other words, the algorithm will detect how well the student copes with repetitions and adjust the length of inter-repetition intervals accordingly

The function of optimal intervals is represented as the matrix of optimal factors, OF-matrix in short, defined as follows:

for n=1: I(n,EF) = OF(n,EF)
for n>1: I(n,EF) = I(n-1,EF)*OF(n,EF)

where:

I(n,EF) - n-th interval for difficulty EF

OF(n,EF) - optimal factor for the n-th repetition and the difficulty EF

The entries of the matrix of optimal factors are modified in the course of repetitions to ensure the desired level of knowledge retention

Matrix of optimal factors is produced by smoothing the so-called matrix of retention factors, RF matrix in short. Matrix of retention factors is defined in the same way as the matrix of optimal factors. Entries of the matrix of retention factors are intended to estimate the values of the entries of the matrix of optimal factors. Each optimal factor corresponds to an optimal interval that produces the desired retention at repetition (determined by the requested forgetting index). Each entry of the matrix of retention factors corresponds to a different value of E-factor and repetition number

Entries of the matrix of retention factors, called R-factors, are computed from forgetting curves whose shape is sketched on the basis of the history of repetitions

The lapse of time on the forgetting curve graph is measured by the so-called U-factor, which is the ratio of the current and the previous interval, except for the first repetition, where the U-factor equals the interval in days (as in Figure). The record of repetitions makes it possible to compute retention for different values of U-factor. The graph of retention plotted versus the lapse of time (U-factor) represents a forgetting curve. The cross-section of the forgetting curve with the desired retention level determines the optimum R-factor, which, upon smoothing the matrix of retention factors, yields the optimum O-factor

Each difficulty category and repetition number has its own record of repetitions used to sketch a separate forgetting curve. In other words, different intervals will be used for items of different difficulty, and for items repeated a different number of times

Intervals used in learning, including the first interval, are slightly dispersed round the optimal value in order to increase the accuracy of forgetting curve sketching, and consequently, to increase the convergence rate of the optimization procedure. By slightly dispersing intervals, the approximation of the forgetting curve will use a more scattered set of points on the graph
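The recurrence defining the function of optimal intervals translates directly into code. A toy sketch with a made-up OF matrix fragment for a single difficulty category:

def interval(n, ef, of_matrix):
    # I(n,EF) = OF(n,EF) for n=1; I(n,EF) = I(n-1,EF)*OF(n,EF) for n>1
    if n == 1:
        return of_matrix[(1, ef)]
    return interval(n - 1, ef, of_matrix) * of_matrix[(n, ef)]

# hypothetical O-factors for difficulty EF=2.5
of_matrix = {(1, 2.5): 4.0, (2, 2.5): 3.0, (3, 2.5): 2.2, (4, 2.5): 1.9}
print([round(interval(n, 2.5, of_matrix), 1) for n in range(1, 5)])
# -> [4.0, 12.0, 26.4, 50.2]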

1994: Exponential nature of forgetting

Forgetting curve: power or exponential

The shape of the forgetting curve is vital for understanding memory. The math behind the curve may even weigh in on the understanding of the role of sleep (see later). When Ebbinghaus first determined the rate of forgetting, he got a pretty nice set of data with a good fit to the power function. However, today we know forgetting is exponential. The discrepancy is explained here.

Figure: Forgetting curve adapted from Hermann Ebbinghaus (1885). The curve has been rendered from the original tabular data published by Ebbinghaus (Piotr Wozniak, 2017)

Wrong thinking helped spaced repetition

For many years, the actual shape of the curve did not play much of a role in spaced repetition. My early intuitions were all over the place depending on the context. Back in 1982, I thought that evolution had designed forgetting to make sure the brain does not run out of memory space. The optimum time for forgetting would be determined by the statistical properties of the environment. Decay would be programmed to maximize survival. If a review did not take place, the memory would get deleted to provide space for new learning.

I was wrong to think that there might be an optimum time for forgetting, and this error was actually helpful in inventing spaced repetition. That "optimum time" intuition helped the first experiment in 1985. An optimum time for forgetting would imply a sigmoidal forgetting curve with a clear inflection point that determines optimality. Before the review, forgetting would be minimal. A delayed review would result in rapid forgetting. This is why finding the optimum interval seemed so critical. When data started pouring in later on, with my confirmation bias, I still could not see my error. I wrote in my Master's Thesis about sigmoidal forgetting: "this follows directly from the observation that before the elapse of the optimal interval, the number of memory lapses is negligible". I must have forgotten my own forgetting curve plot produced in late 1984.

Today this seems preposterous, but even my model of intermittent learning provided some support for the theory. Exponential approximation yielded a particularly high deviation error for data collected in my work on the model of intermittent learning, and the superposition of sigmoid curves for different E-Factors could easily mimic early linearity. Linear approximation seemed to fit the model of intermittent learning excellently within the recall range of the available data. No wonder: with whole pages of heterogeneous material, the exponential nature of forgetting remained well hidden.

Contradictory models

I did not ponder forgetting curves much. However, my biological model dating back to 1988 spoke of exponential decay in retrievability. Apparently, in those days, the forgetting curve and retrievability could exist in my head as independent entities.

In my credit paper for a class in computer simulation (Dr Katulski, Jan 1988), my figures clearly show exponential forgetting curves:

Figure: Hypothetical mechanism involved in the process of optimal learning. (A) Molecular phenomena (B) Quantitative changes in the synapse.

By that time I might have picked up the better idea from the literature. In the years 1986-1987, I spent a lot of time in the university library looking for some good research on spaced repetition. I found none. I might have already been familiar with Ebbinghaus's forgetting curve. It is mentioned in my Master's Thesis.

Collecting data

I collected data for my first forgetting curve plot in late 1984. As all the learning was done for learning's sake over the course of 11 months, and the cost of the graph was minimal, I forgot about that graph and it lay unused for 34 years in my archives:

Figure: My very first forgetting curve for the retention of English vocabulary, plotted back in 1984, i.e. a few months before designing SuperMemo on paper. This graph was not part of the experiment. It was simply a cumulative assessment of the results of intermittent learning of English vocabulary. The graph was soon forgotten. It was re-discovered 34 years later. After memorization, 49 pages of ~40 word pairs of English were reviewed at different intervals, and the number of recall errors was recorded. After rejecting outliers and averaging, the curve appears to be far less steep than the curve obtained by Ebbinghaus (1885), in which he used nonsense syllables and a different measure of forgetting: savings on re-learning

My 1985 experiment could also be considered a noisy attempt to collect forgetting curve data. However, the first SuperMemos did not care about the forgetting curve. The optimization was bang-bang in nature, even though, from today's perspective, collecting retention data (as in 1985) seems such an obvious solution.

Until I started collecting data with SuperMemo software, where each item could be scrutinized independently, I could not fully recover from early erroneous notions about forgetting.

SuperMemo 1 for DOS (1987) collected full repetition histories that would have made it possible to determine the nature of forgetting. However, within 10 days (on Dec 23, 1987), I had to ditch the full record of repetitions. At that time, my disk space was 360 KB. That's correct. I would run SuperMemo from old-type 5.25in diskettes. The full record of repetition history returned to SuperMemo only 8 long years later (Feb 15, 1996), after a hectic effort from Dr Janusz Murakowski, who considered every ticking minute a waste of valuable data that could power future algorithms and memory research. Two decades later, we have more data than we can effectively process.

Without repetition history, I could still investigate forgetting with the help of forgetting curve data collected independently. On Jan 6, 1991, I figured out how to record forgetting curves in a small file that would not bloat the size of the database (i.e. without the full record of repetition history).

Only then, in 1991, did SuperMemo 6 start collecting forgetting curve data to determine optimum intervals. It was doing the same thing as my first experiment, except it did it automatically, on a massive scale, and for memories separated into individual questions (this solved the heterogeneity problem). SuperMemo 6 initially used a binary chop to find the best moment corresponding with the forgetting index. A good-fit approximation was still 3 years in the future.
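I do not have the original code, but a binary chop over a monotonically decreasing forgetting curve could look like this sketch; the stand-in exponential curve and the tolerance are assumptions:

import math

def chop_interval(recall_at, lo, hi, target=0.9, tol=0.01):
    # find the time at which recall falls to the target level
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if recall_at(mid) > target:
            lo = mid   # recall still above target: look later
        else:
            hi = mid   # recall below target: look earlier
    return (lo + hi) / 2

print(chop_interval(lambda t: math.exp(-t / 10), 0, 100))  # ~1.05 days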

First forgetting curve data

By May 1991, I had some first data to peek at, and this was a major disappointment. I predicted I would need a year to see any regularity. However, every couple of months, I kept noting down my disappointment with the minimal progress. The progress in collecting data was agonizingly slow and the wait was excruciating. A year later, I was no closer to the goal. If Ebbinghaus was able to plot a good curve with nonsense syllables, his pain of non-coherence must have been worth it. With meaningful data, the truth was very slow to emerge, even with the convenience of having it all done by a computer while having fun with learning.

On Sep 3, 1992, SuperMemo 7 for Windows made it possible to have a first nice peek at a real forgetting curve. The view was mesmerizing:

Figure: SuperMemo 7 for Windows was written in 1992. As of Sep 03, 1992, it was able to display the user's forgetting curve graph. The horizontal axis, labeled U-Factor, corresponded with days in this particular graph. The kinks between days 14 and 20 were one of the reasons it was difficult to determine the nature of forgetting. Old erroneous hypotheses were hard to falsify. Until day 13, forgetting seemed nearly linear and might also have provided a good exponential fit. It took two more years of data collecting to find answers (source: SuperMemo 7: User's Guide)

Forgetting curve approximations

By 1994, I still was not sure about the nature of forgetting. I took data collected in the previous 3 years (1991-1994) and set out to figure out the curve once and for all. I focused on my own data from over 200,000 repetitions. However, it was not easy. If SuperMemo schedules a repetition at R=0.9, you can draw a straight line from R=1.0 to R=0.9 and do great with noisy data:

Figure: Difficulty in approximating the forgetting curve. Back in 1994, it was difficult to understand the nature of forgetting in SuperMemo because most of the data was collected in the high recall range.

My notes from May 6, 1994 illustrate the degree of uncertainty:

Personal anecdote. Why use anecdotes?

All day of crazy attempts to better approximate forgetting curves. First I tried R=1-i^n/(H^n+i^n), where i - interval, H - memory half-life, and n - cooperativity factor. Late in the evening, I had it work quite slowly, but ... it appeared that R=exp(-a*i) works not much worse! Even the old linear approximation was not very much worse (sigmoid: D=8.6%, exponential: D=8.8%, and linear: D=10.8%). Perhaps forgetting curves are indeed exponential? Going to sleep at 2:50. May 6, 1994

It was not easy to separate linear, power, exponential, Zipf, Hill, and other functions. Exponential, power and even linear approximations brought pretty good outcomes depending on circumstances that were hard to separate. Only when looking at forgetting curves well sorted for complexity at higher levels of stability, despite those graphs being data poor, could I see the exponential nature of forgetting more clearly.
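The difficulty of telling these candidates apart is easy to reproduce with any least-squares fitter. A sketch using scipy on made-up recall data; the Hill form matches the sigmoid from the note above:

import numpy as np
from scipy.optimize import curve_fit

# hypothetical recall measurements: time in days, recall as a fraction
t = np.array([1, 2, 4, 7, 12, 20, 30], dtype=float)
r = np.array([0.96, 0.92, 0.86, 0.77, 0.66, 0.52, 0.42])

def exponential(t, k):
    return np.exp(-k * t)

def power(t, a):
    return (1 + t) ** (-a)  # shifted to avoid the singularity at t=0

def hill(t, h, n):
    return 1 - t**n / (h**n + t**n)  # R = 1 - i^n/(H^n + i^n)

for func, p0 in [(exponential, [0.03]), (power, [0.3]), (hill, [15.0, 1.5])]:
    params, _ = curve_fit(func, t, r, p0=p0)
    rmse = np.sqrt(np.mean((func(t, *params) - r) ** 2))
    print(func.__name__, params.round(3), round(rmse, 4))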

One of the red herrings in 1994 was that, naturally, I had most data collected for the first review. New items at the entry to the process still provide a heterogeneous group that obeys the power law of forgetting.

Figure: The first-review forgetting curve for newly learned knowledge collected with SuperMemo.

Later on, when items are sorted by complexity and stability, their forgetting curves start becoming exponential. In Algorithm SM-6, complexity and stability were imperfectly expressed by E-Factors and the repetition number respectively. This resulted in algorithmic imperfections that made for imperfect sorting. In addition, SuperMemo stays within the area of high retention, where forgetting is nearly linear.

By May 1994, the main first-review curve in my Advanced English database had collected 18,000 data points and seemed like the best analytical material. However, that curve encompasses all the learning material that enters the process, independent of its difficulty. Little did I know that this curve is governed by the power law. My best deviation was 2.0.

For a similar curve from 2018 see:

Figure: Forgetting curve obtained in 2018 with SuperMemo 17 for average difficulty (A-Factor=3.9). At 19,315 repetitions and least squares deviation of 2.319, it is pretty similar to the curve from 1994, except it is best approximated with an exponential function (for the power function example see: forgetting curve).

Exponential forgetting prevails

By summer 1994, I was reasonably sure of the exponential nature of forgetting. By 1995, we published "2 components of memory" with the formula R=exp(-t/S). Our publication remains largely ignored by mainstream science but is all over the web when forgetting curves are discussed.

Interestingly, in 1966, Nobel Prize winner Herbert Simon had a peek at Jost's Law, derived from Ebbinghaus's work in 1897. Simon noticed that the exponential nature of forgetting necessitates the existence of a memory property that today we call memory stability. Simon wrote a short paper and moved on to the hundreds of other projects he was busy with. His text was largely forgotten; however, it was prophetic. In 1988, similar reasoning led to the idea of the two component model of long-term memory.

Today we can add one more implication: If forgetting is exponential, it implies a constant probability of forgetting in unit time, which implies neural network interference, which implies that sleep might build stability not by strengthening memories, but by simply removing the cause of interference: unnecessary synapses. Giulio Tononi might then be right about the net loss of synapses in sleep. However, he believes that loss is homeostatic. Exponential forgetting indicates that this could be much more. It might be a form of " intelligent forgetting" of things that interfere with key memories reinforced in waking.

Negatively exponential forgetting curve

Only in 2005 did we write more extensively about the exponential nature of forgetting. In a paper presented by Dr Gorzelańczyk at a modelling conference in Poland, we wrote:

Archive warning: Why use literal archives?

Although it has always been suspected that forgetting is exponential in nature, proving this fact has never been simple. Exponential decay appears standardly in biological and physical systems, from radioactive decay to drying wood. It occurs anywhere where the expected decay rate is proportional to the size of the sample, and where the probability of a single particle's decay is constant. The following problems have hampered the effort of modeling forgetting since the years of Ebbinghaus (Ebbinghaus, 1885):

small sample size

sample heterogeneity

confusion between forgetting curves, re-learning curves, practise curves, savings curves, trials-to-learn curves, error curves, and others in the family of learning curves

By employing SuperMemo, we can overcome all these obstacles to study the nature of memory decay. As a popular commercial application, SuperMemo provides virtually unlimited access to huge bodies of data collected from students all over the world. The forgetting curve graphs available to every user of the program (Tools : Statistics : Analysis : Forgetting curves) are plotted on relatively homogeneous data samples and are a bona fide reflection of memory decay in time (as opposed to other forms of learning curves). The quest for homogeneity significantly affects the sample size though. It is important to note that the forgetting curves for material with different memory stability and different knowledge difficulty differ. Whereas memory stability affects the decay rate, heterogeneous learning material produces a superposition of individual forgetting curves, each characterized by a different decay rate. Consequently, even in bodies with hundreds of thousands of individual pieces of information participating in the learning process, only relatively small homogeneous samples of data can be filtered out. These samples rarely exceed several thousand repetition cases. Even then, these bodies of data go far beyond the sample quality available to researchers studying the properties of memory in controlled conditions. Yet the stochastic nature of forgetting still makes it hard to take an ultimate stand on the mathematical nature of the decay function (see two examples below). Having analyzed several hundred thousand samples, we have come closest yet to showing that forgetting is a form of exponential decay.

Figure: Exemplary forgetting curve sketched by SuperMemo. The database sample of nearly a million repetition cases has been sifted for average difficulty and low stability (A-Factor=3.9, S in [4,20]), resulting in 5850 repetition cases (less than 1% of the entire sample). The red line is a result of regression analysis with R=e^(-kt/S). Curve fitting with other elementary functions demonstrates that exponential decay provides the best match to the data. The measure of time used in the graph is the so-called U-Factor, defined as the quotient of the present and the previous inter-repetition interval. Note that the exponential decay in the range of R from 1 to 0.9 can plausibly be approximated with a straight line, which would not be the case had the decay been characterized by a power function.

Figure: Exemplary forgetting curve sketched by SuperMemo. The database sample of nearly a million repetition cases has been sifted for average difficulty and medium stability (A-Factor=3.3, S > 1 year), resulting in 1082 repetition cases. The red line is a result of regression analysis with R=e^(-kt/S).

Forgetting curve: Retrievability formula

In Algorithm SM-17, retrievability R corresponds with the probability of recall and represents the exponential forgetting curve. Retrievability is derived from stability and the interval:

R[n] := exp(-k*t/S[n-1])

where:

R[n] - retrievability at the n-th repetition

k - decay constant

t - time (the inter-repetition interval)

S[n-1] - stability after the (n-1)-th repetition
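This formula is simple enough to sketch in Python. The decay constant below is a hypothetical calibration chosen so that R drops to 0.9 when the interval equals the stability; it is not necessarily the constant used inside SuperMemo:

```python
import math

def retrievability(t: float, stability: float, k: float = math.log(1 / 0.9)) -> float:
    """Exponential forgetting curve R = exp(-k*t/S). The default decay constant
    is a hypothetical calibration making R = 0.9 when the interval t equals the
    stability S; it is not necessarily the constant used inside SuperMemo."""
    return math.exp(-k * t / stability)

# Recall probability after a 10-day interval for a memory with stability of 20 days
print(round(retrievability(t=10, stability=20), 3))  # ~0.949
```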

That neat theoretical approach becomes a bit more complex once we consider that forgetting may not be perfectly exponential for items that are difficult or of mixed difficulty. In addition, forgetting curves in SuperMemo can be marred by user strategies.

In Algorithm SM-8, we hoped that retrievability information might be derived from grades. This hope turned out to be unfounded. There is very little correlation between grades and retrievability, and what little there is comes primarily from the fact that complex items get worse grades and tend to be forgotten faster (at least at the beginning).

Retention vs. the forgetting index

Exponential nature of forgetting implies that the relationship between the measured forgetting index and knowledge retention can accurately be expressed using the following formula:

Retention = -FI/ln(1-FI)

where:

Retention - overall knowledge retention expressed as a fraction (0..1)

FI - forgetting index expressed as a fraction (the forgetting index equals 1 minus knowledge retention measured at repetitions)

For example, by default, well-executed spaced repetition should result in a retention of 0.949 (i.e. 94.9%) for a forgetting index of 0.1 (i.e. 10%). The figure of 94.9% illustrates how closely exponential decay resembles a linear function at first. For linear forgetting, the figure would be exactly 95.000% (i.e. 100% minus half the forgetting index).
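The relationship is easy to verify numerically; the following Python lines simply evaluate the formula above for a few values of the forgetting index:

```python
import math

def average_retention(forgetting_index: float) -> float:
    """Average retention over the inter-repetition interval under exponential
    forgetting: Retention = -FI / ln(1 - FI)."""
    return -forgetting_index / math.log(1.0 - forgetting_index)

for fi in (0.05, 0.10, 0.20):
    print(f"FI={fi:.0%} -> retention={average_retention(fi):.1%}")
# FI=5% -> 97.5%, FI=10% -> 94.9%, FI=20% -> 89.6%
```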

Forgetting curve for poorly formulated material

In 1994, I was lucky that my databases were largely well-formulated. This often wasn't the case with users of SuperMemo. For badly-formulated items, the forgetting curve is flattened; it is not purely exponential, being a superposition of several exponential curves. SuperMemo can never predict the moment of forgetting of a single item. Forgetting is a stochastic process, and the algorithm can only operate on averages. A frequently propagated fallacy about SuperMemo is that it predicts the exact moment of forgetting: this is not true, and this is not possible. What SuperMemo does is search for intervals at which items of a given difficulty are likely to show a given probability of forgetting (e.g. 10%).

Those flattened forgetting curves led to a paradox: neglecting complex items may still result in surprisingly good survival of memories after long breaks from review. For a purely exponential forgetting curve, a 10-fold overestimation of the interval results in a retention of R2=exp(10*ln(R1))=R1^10, which is equivalent to a drop from 98% to 81%. For a flattened forgetting curve typical of badly-formulated items, this drop may be as small as from 98% to 95%. This leads to the conclusion that keeping complex material at lower priorities is a good learning strategy.
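The arithmetic in the paragraph above can be checked with a couple of lines of Python (only the purely exponential case is computed; the 95% figure for flattened curves is quoted from the text, not derived here):

```python
def retention_after_overshoot(r_optimal: float, overshoot: float) -> float:
    """Retention when the interval is `overshoot` times longer than the interval
    that would have yielded retention `r_optimal`, assuming purely exponential
    forgetting: R2 = exp(overshoot * ln(R1)) = R1 ** overshoot."""
    return r_optimal ** overshoot

# A 10-fold overshoot on a purely exponential curve: 98% recall drops to about 81%
print(f"{retention_after_overshoot(0.98, 10):.1%}")  # 81.7%
```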

Power law emerges in superposition of exponential forgetting curves

To illustrate the importance of homogeneous samples in studying forgetting curves, let us see the effect of mixing difficult knowledge with easy knowledge on the shape of the forgetting curve. The figure below shows why heterogeneous samples may lead to wrong conclusions about the nature of forgetting. The heterogeneous sample in this demonstration is best approximated with a power function! The fact that power curves emerge through averaging of exponential forgetting curves has earlier been reported by others (Anderson & Tweney, 1997; Ritter & Schooler, 2002).

Figure: Superposition of forgetting curves may result in obscuring the exponential nature of forgetting. A theoretical sample of two types of memory traces has been composed: 50% of the traces in the sample with stability S=1 (thin yellow line) and 50% of the traces in the sample with stability S=40 (thin violet line). The superimposed forgetting curve will, naturally, exhibit retrievability R=0.5*Ra+0.5*Rb=0.5*(e^(-k*t)+e^(-k*t/40)). The forgetting curve of such a composite sample is shown in granular black in the graph. The thick blue line shows the exponential approximation (R^2=0.895), and the thick red line shows the power approximation of the same curve (R^2=0.974). In this case, it is the power function that provides the best match to data, even though the forgetting of sample subsets is negatively exponential.
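A hedged sketch of this demonstration in Python (fitting is done with scipy.optimize.curve_fit; the decay constant and the sampling grid are arbitrary choices for illustration, so the resulting R^2 values need not match those quoted in the caption):

```python
import numpy as np
from scipy.optimize import curve_fit

# Two equally sized subpopulations with stabilities S=1 and S=40
k = np.log(1 / 0.9)            # decay constant (arbitrary calibration)
t = np.linspace(0.5, 60, 200)  # time grid
r_mix = 0.5 * np.exp(-k * t / 1.0) + 0.5 * np.exp(-k * t / 40.0)

def exp_model(t, a, b):
    return a * np.exp(-b * t)

def pow_model(t, a, b):
    return a * np.power(t, -b)

for name, model in (("exponential", exp_model), ("power", pow_model)):
    params, _ = curve_fit(model, t, r_mix, p0=(1.0, 0.1), maxfev=10000)
    pred = model(t, *params)
    ss_res = np.sum((r_mix - pred) ** 2)
    ss_tot = np.sum((r_mix - r_mix.mean()) ** 2)
    print(f"{name}: R^2 = {1 - ss_res / ss_tot:.3f}")
```

With such a mixed sample, the power fit typically scores the higher R^2, which is the point the figure makes.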

SuperMemo 17 also includes a single forgetting curve that is best approximated by a power function. This is the first forgetting curve after memorizing items. At the time of memorization, we do not know item complexity. This is why the material is heterogeneous and we get a power curve of forgetting.

Figure: The first review forgetting curve for newly learned knowledge collected with SuperMemo. Power approximation is used in this case due to the heterogeneity of the learning material freshly introduced in the learning process. Lack of separation by memory complexity results in a superposition of exponential forgetting curves with different decay constants. On a semi-log graph, the power regression curve is logarithmic (in yellow) and appears almost straight. The curve shows that in the presented case recall drops merely to 58% in four years, which can be explained by a high reuse of memorized knowledge in real life. The first optimum interval for review at retrievability of 90% is 3.96 days. The forgetting curve can be described with the formula R=0.9907*interval^(-0.07), where 0.9907 is the recall after one day, while -0.07 is the decay constant. In this case, the formula yields 90% recall after 4 days. 80,399 repetition cases were used to plot the presented graph. A steeper drop in recall will occur if the material contains a higher proportion of difficult knowledge (esp. poorly formulated knowledge), or in new students with lesser mnemonic skills. Curve irregularity at intervals 15-20 comes from a smaller sample of repetitions (later interval categories on a log scale encompass a wider range of intervals).
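The caption's power formula can be checked directly; the coefficients below are taken from the caption, and small rounding differences against the quoted 58% figure are expected:

```python
def first_curve_recall(interval_days: float) -> float:
    """Power-law first forgetting curve from the caption: R = 0.9907 * interval^(-0.07)."""
    return 0.9907 * interval_days ** -0.07

print(f"{first_curve_recall(4):.1%}")        # ~89.9%, i.e. the ~90% recall at 4 days
print(f"{first_curve_recall(4 * 365):.1%}")  # ~59.5%, close to the quoted 58% at four years
```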

1995: Hypermedia SuperMemo

Birth of Algorithm SM-8 in a mountain hut

In 1995, SuperMemo was rewritten from the ground up, which was a great opportunity to implement a new spaced repetition algorithm based on data collected during 4 years of using Algorithm SM-6.

In March 1995, at CeBIT in Hannover, we saw a fantastic new development environment from Borland: Delphi. It lifted the old Borland Pascal to a new level and opened dozens of development opportunities for SuperMemo. We decided to redesign the program along the lines depicted in my PhD dissertation. In addition to spaced repetition, we wanted to have knowledge structure and hypermedia. Instead of a mass of items, users would build a knowledge tree. Instead of the old template of a question, answer, picture, and sound, we wanted to have all possible component types that could be combined into new hypermedia forms for expressing knowledge. There was also a dream of a programmable SuperMemo in which developers could write their own procedures for any form of training, incl. procedural training, touch typing, or solving quadratic equations. At the same time, we had collected a lot of data indicating that the algorithm used in SuperMemo could be improved. For example, the mathematical nature of the matrix of optimal factors had become pretty obvious.

In May 1995, I took my Pentium PC to a remote mountain hut in southern Poland to work on those ideas. That was a period of 100 days of total isolation, interrupted only by a short visit from Krzysztof Biedalak during which we re-synchronized our vision for the future of SuperMemo. By September 1995, the new algorithm was ready and tested on my own data. Back in Poznan, I started gradually moving my entire learning process from multiple collections in SuperMemo 7 to the new environment nicknamed "Genius". Genius became SuperMemo 8 only two years later, once the new program had caught up with all the functionality originally available in SuperMemo 7.

The main data that helped develop Algorithm SM-8 were forgetting curves and OF matrix data collected with SuperMemo 6 and SuperMemo 7. This data took away a great deal of guesswork from the algorithm. The work was pretty easy in comparison to Algorithm SM-17 (2014-2016), when I had mountains of repetition histories to process and the requirements for precision and good metrics had tripled. While Algorithm SM-17 took two years to develop, Algorithm SM-8 was designed, implemented, and well-tested in a mere 100 days.

The main ideas behind Algorithm SM-8:

precise mathematical determination of the OF matrix based on live approximations. Instead of the matrix smoothing known from SuperMemo 5, I wanted to know the exact mathematical function that could describe the matrix and perform live updates. It was easy to determine that a negative power function would describe OF=f(RepNo) (which is an expression of SInc=f(S) in today's terms). A bit more guesswork went into the impact of difficulty on SInc. I opted for a linear approximation of the function mapping difficulty (A-Factor) to the decay constant of the power function for SInc (D-Factor), which expressed the decline in stability increase with stability/interval. That linear bet has survived to this day. It was a good guess (see the sketch after this list).

with a good definition of the OF matrix, I could provide a precise definition of item difficulty: instead of a fluid E-Factor that could be manually controlled by grades at the whim of the user, I wanted to have an absolute difficulty A-Factor, which was defined as the stability increase after the first repetition timed for R=0.9. This made it possible for SuperMemo to adjust item difficulty with each repetition by correcting the fit of the item's performance with the expected performance based on the OF matrix

faster determination of startup difficulty by correlating the first grade with the A-factor. This is a weak mechanism of little significance, as shown by the fact that even with multi-repetition histories, item difficulty is still a hazy concept. In that context, users should be reminded that the best approach is to formulate items well and just keep them easy

approximating the first post-lapse interval by an exponential fit based on the number of memory lapses. The biggest value of that approach was to abolish a myth that reducing the length of intervals in case of memory lapses could speed up learning (some authors of software based on Algorithm SM-2 opted for such a solution, which has been proven wrong)

the idea to correlate grades with the forgetting index was a failure and did not contribute to improving the algorithm. That truth transpired slowly. It took nearly a decade to come to the ultimate verdict: grades correlate poorly with the forgetting index. The intuition born with Algorithm SM-2 is only weakly correct
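To make the first bullet more concrete, here is a hedged Python sketch of what a power-law OF matrix with a linear D-Factor could look like. The shape follows the text (O-Factors declining with repetition number as a negative power function, anchored so that OF[2]=A-Factor, with the decay constant linear in A-Factor); the coefficients, and even the sign of the linear relation, are hypothetical:

```python
def d_factor(a_factor: float, slope: float = 0.12, intercept: float = 0.2) -> float:
    """Hypothetical linear mapping from item difficulty (A-Factor) to the decay
    constant of the O-Factor power function; SuperMemo fits this relation to the
    learner's data, so both coefficients here are made up for illustration."""
    return intercept + slope * a_factor

def o_factor(rep_no: int, a_factor: float) -> float:
    """Illustrative O-Factor: declines with the repetition number as a negative
    power function, anchored so that OF[2, A-Factor] equals the A-Factor itself."""
    return a_factor * (rep_no / 2.0) ** -d_factor(a_factor)

# O-Factors shrink with successive repetitions; with these made-up coefficients,
# easier items (higher A-Factor) retain larger factors throughout.
for rep in range(2, 7):
    print(rep, round(o_factor(rep, a_factor=1.5), 2), round(o_factor(rep, a_factor=3.0), 2))
```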

Interestingly, Algorithm SM-8 did not require a full repetition history for elements. Full repetition histories were to be implemented only in Feb 1996. The advantage was an easier implementation. The disadvantage came with the fact that once the user intervened manually in the learning process, the algorithm had no record of that intervention, and could not defend itself from a possible inflow of incorrect data. Naturally, only a full record of repetition histories made it possible to implement Algorithm SM-17 two decades later.

My first "live" repetition in Algorithm SM-8 on my own data took place on Aug 16, 1995, Wed. For the test, I "sacrificed" a small 100 item collection with mnemonic peg list for memorizing numbers. Over the next two years, I gradually converted all my other collections to work with the new algorithm and in the new SuperMemo environment. In 1997, all my knowledge have finally been integrated into a single well-structured database. In 1995-1997, we called such a database a "knowledge system". Today we just call it collection (as in a collection of pieces of knowledge).

To this day, the core of the algorithm born in 1995 runs in SuperMemo 17 in the background, and the user can still choose intervals based on that old algorithm in case he is unhappy with the propositions of Algorithm SM-17.

Absolute item difficulty

In SuperMemo 1.0 through SuperMemo 3.0, E-Factors were defined in the same way as O-Factors (i.e. the ratio of successive intervals). They were an approximate measure of item difficulty (the higher the E-Factor, the easier the item). However, the spaced repetition optimization would force E-Factors to correspond with the stability increase, which drops with stability. In other words, by definition, in Algorithm SM-2, items would be tagged as more and more "difficult" as they were subjected to successive repetitions. This is a bit counter-intuitive, and users never seemed to notice.

Starting with SuperMemo 4.0, E-Factors were used to index the matrix of O-Factors. They were still used to reflect item difficulty. They were still used to compute O-Factors. However, they could differ from O-Factors and thus make for a better reflection of difficulty.

In SuperMemo 4 through SuperMemo 7, the difficulty of material in a given database would shape the relationship between O-Factors and E-Factors. For example, in an easy collection, the starting-point O-Factor (i.e. the one corresponding with the first repetition and the assumed starting difficulty) would be relatively high. As performance in repetitions determines E-Factors, items of the same difficulty in an easy collection would naturally have a lower E-Factor than exactly the same items in a difficult collection. This all changed in SuperMemo 8, where A-Factors were introduced. A-Factors are "bound" to the second row of the O-Factor matrix. This makes them an absolute measure of item difficulty. Their value does not depend on the content of the collection. For example, you know that if the A-Factor is 1.5, the third repetition will take place in an interval that is 50% longer than the first interval.

Archive warning: Why use literal archives?

A-Factor is a number associated with every element in a collection. A-Factor determines how much intervals increase in the learning process. The higher the A-Factor, the faster the intervals increase. A-Factors reflect item difficulty. The higher the A-Factor, the easier the item. The most difficult items have A-Factors equal to 1.2. A-Factor is defined as the quotient of the second optimum interval and the first optimum interval used in repetitions.

Post-lapse interval

Post-lapse interval approximation in Algorithm SM-8 abolished two myths:

shortening intervals after a lapse is a good idea (this idea was advocated multiple times in the years 1991-2000)

the first interval should always be 1 day (as in some older SuperMemo solutions)

In the graph presented below, we can see that with successive lapses, the optimum post-lapse interval keeps getting slightly shorter. This expresses nothing else but the fact that high lapse counts are reached only by badly formulated items, or items that are really hard to remember due to their semantic nature or knowledge interference. For memories with Lapses=10 or more, I suggested the term "toxic" to express their impact on the learning process. If the brain has rejected a piece of information that many times, we should get the message: this knowledge is badly formulated or has become toxic for other reasons (e.g. stress associated with learning, e.g. at school).

Archive warning: Why use literal archives?

First interval - the length of the first interval after the first repetition depends on the number of times a given item has been forgotten. Note that the first repetition here means the first repetition after forgetting, not the first repetition ever. In other words, a twice-repeated item will have the repetition number equal to one after it has been forgotten; the repetition number will not equal three. The first interval graph shows the exponential regression curve that approximates the length of the first interval for different numbers of memory lapses (including the zero-lapses category that corresponds with newly memorized items). In the graph below, blue circles correspond to data collected in the learning process (the greater the circle, the more repetitions have been recorded).

Figure: In the graph above, which includes data from over 130,000 repetitions, newly memorized items are optimally repeated after seven days. However, the items that have been forgotten 10 times (which is rare in SuperMemo) will require an interval of two days. (Due to logarithmic scaling, the size of the circle is not linearly proportional to the data sample; the number of repetition cases for Lapses=0 is by far larger than for Lapses=10, as can be seen in Distributions : Lapses)
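The exponential regression mentioned above can be illustrated with a short Python sketch; the data points below are invented solely to mimic the shape described in the caption (about seven days at zero lapses, about two days at ten lapses):

```python
import numpy as np

# Hypothetical (lapses, optimum first interval in days) pairs mimicking the caption
lapses = np.array([0, 1, 2, 4, 6, 8, 10], dtype=float)
first_interval = np.array([7.0, 5.5, 4.8, 3.8, 3.0, 2.4, 2.0])

# Exponential regression: interval = a * exp(b * lapses), fitted as a line in log space
b, log_a = np.polyfit(lapses, np.log(first_interval), 1)
a = np.exp(log_a)

print(f"interval(lapses) ~= {a:.2f} * exp({b:.3f} * lapses)")
print(f"predicted first interval for 3 lapses: {a * np.exp(b * 3):.1f} days")
```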

First grade vs. A-Factor

Correlating the first grade with the estimated item difficulty was to help classify items by difficulty at the entry to the learning process. The correlation appears to be weak and is highly dependent on the user's grading system. For some users, there is virtually no correlation (picture #1). For others, the correlation is good enough to cover the full range of difficulty (A-Factor) (picture #2).

In addition, in Algorithm SM-11, derived from Algorithm SM-8, the user was allowed to execute premature repetitions. Those repetitions would account for the spacing effect; however, they would still contribute to the graph and overestimate the grade for difficult items. With extensive use of incremental reading, this would flatten the graph.

Algorithm SM-17 does not use the grade-difficulty correlation and derives difficulty from the entire repetition history. Practice shows that even then the estimate is hard to make, and good learning practice is to keep all items easy (i.e. in an accepted mnemonic fit with the rest of the student's knowledge).

Archive warning: Why use literal archives?

First Grade vs. A-Factor - G-AF graph correlates the first grade obtained by an item with the ultimate estimation of its A-Factor value. At each repetition, the current element's old A-Factor estimation is removed from the graph and the new estimation is added. This graph is used by Algorithm SM-15 to quickly estimate the first value of A-Factor at the moment when all we know about an element is the first grade it has scored in its first repetition.

Grade vs. Forgetting Index

By correlating grades with the expected forgetting index (predicted retrievability), I hoped to be able to compute the estimated forgetting index (post-repetition estimate of the actual retrievability). This correlation appeared to be weak due to the fact that all users tend to deploy their own grading systems, which are often inconsistent. The grade and R correlation comes primarily from the fact that complex items get worse grades and tend to be forgotten faster (at least at the beginning). In that sense, grades provide a better reflection of complexity than a reflection of retrievability.

In the picture below, the entire range of the expected forgetting index seems to fall around the grade 3.

For Grade<=3 we can read the maximum estimated forgetting index, and for Grade>=4 we can read the minimum estimated forgetting index. In that light, a two-grade system would have exactly the same effect on the algorithm as the six-grade system.

For other users, the curve might even peak at some levels of the expected forgetting index as if grading reflected a wish to remember items that are really hard to remember (lenient grading).

Algorithm SM-17 makes extensive use of retrievability estimated after the repetition; however, it derives it from sheer recall data and the expected retrievability. Grade-retrievability correlations are also collected; however, their weight is negligible.

Archive warning: Why use literal archives?

Grade vs. Forgetting Index - FI-G graph correlates the expected forgetting index with the grade scored at repetitions. You need to understand Algorithm SM-15 to understand this graph. You can imagine that the forgetting curve graph might use the average grade instead of the retention on its vertical axis. If you correlated this grade with the forgetting index, you would arrive at the FI-G graph. This graph is used to compute an estimated forgetting index that is, in turn, used to normalize grades (for delayed or advanced repetitions) and estimate the new value of item's A-Factor. The grade is computed using the formula: Grade=exp(A*FI+B), where A and B are parameters of an exponential regression run over raw data collected during repetitions. The FI-G graph is updated after each repetition by using the expected forgetting index and actual grade scores. The expected forgetting index can easily be derived from the interval used between repetitions and the optimum interval computed from the OF matrix. The higher the value of the expected forgetting index, the lower the grade. From the grade and the FI-G graph, we can compute the estimated forgetting index which corresponds to the post-repetition estimation of the forgetting probability of the just-repeated item at the hypothetical pre-repetition stage. Because of the stochastic nature of forgetting and recall, the same item might or might not be recalled depending on the current overall cognitive status of the brain; even if the strength and retrievability of memories of all contributing synapses is/was identical! This way we can speak about the pre-repetition recall probability of an item that has just been recalled (or not). This probability is expressed by the estimated forgetting index.
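For illustration, here is a hedged Python sketch of the FI-G relationship described above; the regression parameters A and B are invented, whereas SuperMemo fits them to the user's own repetition data:

```python
import math

# Hypothetical regression parameters of Grade = exp(A*FI + B).
# With these values a grade of ~4.4 corresponds to FI near 0, and ~3.0 to FI near 0.25.
A = -1.5
B = math.log(4.4)

def expected_grade(forgetting_index: float) -> float:
    """Grade predicted from the expected forgetting index (FI-G regression)."""
    return math.exp(A * forgetting_index + B)

def estimated_forgetting_index(grade: float) -> float:
    """Invert the regression to get the estimated forgetting index from a grade."""
    return (math.log(grade) - B) / A

print(round(expected_grade(0.10), 2))             # grade expected at FI = 10%
print(round(estimated_forgetting_index(3.5), 3))  # estimated FI for a grade of 3.5
```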

Algorithm SM-15

Algorithm SM-8 has been improved over the years and evolved into Algorithm SM-11 (2002) and then Algorithm SM-15 (2011). Here I only present the latest version: Algorithm SM-15 (used in SuperMemo 15, SuperMemo 16, and as a backup in SuperMemo 17).

The key improvements added to Algorithm SM-8 over two decades were:

improved stability indexing: instead of using repetition numbers, as of SuperMemo 8 (1997), the algorithm used the concept of "repetition category" which roughly translates to stability

tolerance for advanced and delayed repetitions, as of SuperMemo 11 (2002): a heuristic has been added to account for the spacing effect

extending the representation of time in U-Factors from 60 days to 15 years (2011)

correcting forgetting curve data for repetition delay beyond the original U-Factor span (2011)

Archive warning: Why use literal archives?

Algorithm SM-15 begins the effort to compute the optimum inter-repetition intervals by storing the recall record of individual items (i.e. grades scored in learning). This record is used to estimate the current strength of a given memory trace, and the difficulty of the underlying piece of knowledge (item). The item difficulty expresses the complexity of memories, and reflects the effort needed to produce unambiguous and stable memory traces. SuperMemo takes the requested recall rate as the optimization criterion (e.g. 95%), and computes the intervals that satisfy this criterion. The function of optimum intervals is represented in a matrix form (OF matrix) and is subject to modification based on the results of the learning process. Although satisfying the optimization criterion is relatively easy, the complexity of the algorithm derives from the need to obtain the maximum speed of convergence possible in the light of the known memory models.

Important! Algorithm SM-15 is used only to compute the intervals between repetitions of items. Topics are reviewed at intervals computed with an entirely different algorithm (not described here). The timing of topic review is optimized with the view to managing the reading sequence and is not aimed at aiding memory. Long-term memories are formed in SuperMemo primarily with the help of items, which are reviewed along the schedule computed by Algorithm SM-15.

This is a more detailed description of Algorithm SM-15:

Optimum interval: Inter-repetition intervals are computed using the following formula:

I(1)=OF[1,L+1]
I(n)=I(n-1)*OF[n,AF]

where:

OF - matrix of optimal factors, which is modified in the course of repetitions

OF[1,L+1] - value of the OF matrix entry taken from the first row and the L+1 column

OF[n,AF] - value of the OF matrix entry that corresponds with the n-th repetition, and with item difficulty AF

L - number of times a given item has been forgotten (from "memory Lapses")

AF - number that reflects absolute difficulty of a given item (from "Absolute difficulty Factor")

I(n) - n-th inter-repetition interval for a given item

Advanced repetitions: Because of possible advancement in executing repetitions (e.g. forced review before an exam), the actual optimum factor (OF) used to compute the optimum interval is decremented by dOF using formulas that account for the spacing effect in learning:

dOF=dOF_max*a/(t_half+a)
dOF_max=(OF-1)*(OI+t_half-1)/(OI-1)

where:

dOF - decrement to OF resulting from the spacing effect

a - advancement of the repetition in days as compared with the optimum schedule (note that there is no change to OF if a=0, i.e. the repetition takes place at the optimum time)

dOF_max - asymptotic limit on dOF for infinite a (note that for a=OI-1 the decrement will be OF-1, which corresponds to no increase in the inter-repetition interval)

t_half - advancement at which there is half the expected increase to synaptic stability as a result of a repetition (presently this value corresponds roughly to 60% of the length of the optimum interval for well-structured material)

OF - optimum factor (i.e. OF[n,AF] for the n-th interval and a given value of AF)

OI - optimum interval (as derived from the OF matrix)

Delayed repetitions: Because of possible delays in executing repetitions, the OF matrix is not actually indexed with repetitions but with repetition categories. For example, if the 5-th repetition is delayed, the OF matrix is used to compute the repetition category, i.e. the theoretical value of the repetition number that corresponds with the interval used before the repetition. The repetition category may, for example, assume the value 5.3 and we will arrive at I(5)=I(4)*OF[5.3,AF], where OF[5.3,AF] has an intermediate value derived from OF[5,AF] and OF[6,AF].

Matrix of optimum intervals: SuperMemo does not store the matrix of optimum intervals as in some earlier versions. Instead it keeps a matrix of optimal factors that can be converted to the matrix of optimum intervals (as in the formula from Point 1). The matrix of optimal factors OF used in Point 1 has been derived from the mathematical model of forgetting and from similar matrices built on data collected in years of repetitions in collections created by a number of users. Its initial setting corresponds with values found for a less-than-average student. During repetitions, upon collecting more and more data about the student's memory, the matrix is gradually modified to make it approach closely the actual student's memory properties. After years of repetitions, new data can be fed back to generate a more accurate initial OF matrix. In SuperMemo 17, this matrix can be viewed in 3D with Tools : Statistics : Analysis : 3-D Graphs : O-Factor Matrix.

Item difficulty: The absolute item difficulty factor (A-Factor), denoted AF in Point 1, expresses the difficulty of an item (the higher it is, the easier the item). It is worth noting that AF=OF[2,AF]. In other words, AF denotes the optimum interval increase factor after the second repetition. This is also equivalent with the highest interval increase factor for a given item. Unlike E-Factors in Algorithm SM-6 employed in SuperMemo 6 and SuperMemo 7, A-Factors express absolute item difficulty and do not depend on the difficulty of other items in the same collection of study material.

Deriving OF matrix from RF matrix: Optimum values of the entries of the OF matrix are derived through a sequence of approximation procedures from the RF matrix, which is defined in the same way as the OF matrix (see Point 1), with the exception that its values are taken from the real learning process of the student for whom the optimization is run. Initially, matrices OF and RF are identical; however, entries of the RF matrix are modified with each repetition, and a new value of the OF matrix is computed from the RF matrix by using approximation procedures. This effectively produces the OF matrix as a smoothed up form of the RF matrix. In simple terms, the RF matrix at any given moment corresponds to its best-fit value derived from the learning process; however, each entry is considered a best-fit entry on its own, i.e. in abstraction from the values of other RF entries. At the same time, the OF matrix is considered a best-fit as a whole. In other words, the RF matrix is computed entry by entry during repetitions, while the OF matrix is a smoothed copy of the RF matrix.

Forgetting curves: Individual entries of the RF matrix are computed from forgetting curves approximated for each entry individually. Each forgetting curve corresponds with a different value of the repetition number and a different value of A-Factor (or memory lapses in the case of the first repetition).
The value of the RF matrix entry corresponds to the moment in time where the forgetting curve passes the knowledge retention point derived from the requested forgetting index. For example, for the first repetition of a new item, if the forgetting index equals 10%, and after four days the knowledge retention indicated by the forgetting curve drops below the 90% value, the value of RF[1,1] is taken as four. This means that all items entering the learning process will be repeated after four days (assuming that the matrices OF and RF do not differ at the first row of the first column). This satisfies the main premise of SuperMemo: the repetition should take place at the moment when the probability of forgetting equals the forgetting index, i.e. when retention drops to 100% minus the forgetting index stated as a percentage.

In SuperMemo 17, forgetting curves can be viewed with Tools : Statistics : Analysis : Forgetting Curves (or in 3-D with Tools : Statistics : Analysis : 3-D Curves):

Figure: Tools : Statistics : Analysis : Forgetting Curves for 20 repetition number categories multiplied by 20 A-Factor categories. In the picture, blue circles represent data collected during repetitions. The larger the circle, the greater the number of repetitions recorded. The red curve corresponds with the best-fit forgetting curve obtained by exponential regression. For ill-structured material the forgetting curve is crooked, i.e. not exactly exponential. The horizontal aqua line corresponds with the requested forgetting index, while the vertical green line shows the moment in time in which the approximated forgetting curve intersects with the requested forgetting index line. This moment in time determines the value of the relevant R-Factor, and indirectly, the value of the optimum interval. For the first repetition, the R-Factor corresponds with the first optimum interval. The values of O-Factor and R-Factor are displayed at the top of the graph. They are followed by the number of repetition cases used to plot the graph (i.e. 21,303). At the beginning of the learning process, there is no repetition history and no repetition data to compute R-Factors. It will take some time before your first forgetting curves are plotted. For that reason, the initial value of the RF matrix is taken from the model of a less-than-average student. The model of an average student is not used because the convergence from poorer student parameters upwards is faster than the convergence in the opposite direction. The Deviation parameter displayed at the top tells you how well the negatively exponential curve fits the data. The lesser the deviation, the better the fit. The deviation is computed as a square root of the average of squared differences (as used in the method of least squares).

Figure: 3D representation of the family of forgetting curves for a single item difficulty and varying memory stability levels (normalized for U-Factor).

Figure: Cumulative forgetting curve for learning material of mixed complexity, and mixed stability. The graph is obtained by superposition of 400 forgetting curves normalized for the decay constant of 0.003567, which corresponds with recall of 70% at 100% of the presented time span (i.e. R=70% on the right edge of the graph). 401,828 repetition cases have been included in the graph. Individual curves are represented by yellow data points. The cumulative curve is represented by blue data points that show the average recall for all 400 curves. The size of circles corresponds with the size of data samples.
Deriving OF matrix from the forgetting curves: The OF matrix is derived from the RF matrix by: (1) fixed-point power approximation of the R-Factor decline along the RF matrix columns (the fixed point corresponds to the second repetition, at which the approximation curve passes through the A-Factor value), (2) for all columns, computing the D-Factor, which expresses the decay constant of the power approximation, (3) linear regression of D-Factor change across the RF matrix columns, and (4) deriving the entire OF matrix from the slope and intercept of the straight line that makes up the best fit in the D-Factor graph. The exact formulas used in this final step go beyond the scope of this illustration. Note that the first row of the OF matrix is computed in a different way. It corresponds to the best-fit exponential curve obtained from the first row of the RF matrix. All the above steps are passed after each repetition. In other words, the theoretically optimum value of the OF matrix is updated as soon as new forgetting curve data is collected, i.e. at the moment, during the repetition, when the student, by providing a grade, states the correct recall or wrong recall (i.e. forgetting). (In Algorithm SM-6, a separate procedure Approximate had to be used to find the best-fit OF matrix, and the OF matrix used at repetitions might differ substantially from its best-fit value.)

Item difficulty: The initial value of A-Factor is derived from the first grade obtained by the item, and the correlation graph of the first grade and A-Factor (G-AF graph). This graph is updated after each repetition in which a new A-Factor value is estimated and correlated with the item's first grade. Subsequent approximations of the real A-Factor value are done after each repetition by using grades, the OF matrix, and a correlation graph that shows the correspondence of the grade with the expected forgetting index (FI-G graph). The grade used to compute the initial A-Factor is normalized, i.e. adjusted for the difference between the actually used interval and the optimum interval for the forgetting index equal 10%.

Grades vs. expected forgetting index correlation: The FI-G graph is updated after each repetition by using the expected forgetting index and actual grade scores. The expected forgetting index can easily be derived from the interval used between repetitions and the optimum interval computed from the OF matrix. The higher the value of the expected forgetting index, the lower the grade. From the grade and the FI-G graph (see: FI-G graph in Tools : Statistics : Analysis : Graphs), we can compute the estimated forgetting index which corresponds to the post-repetition estimation of the forgetting probability of the just-repeated item at the hypothetical pre-repetition stage. Because of the stochastic nature of forgetting and recall, the same item might or might not be recalled depending on the current overall cognitive status of the brain; even if the strength and retrievability of memories of all contributing synapses is/was identical! This way we can speak about the pre-repetition recall probability of an item that has just been recalled (or not). This probability is expressed by the estimated forgetting index.

Computing A-Factors: From (1) the estimated forgetting index, (2) the length of the interval, and (3) the OF matrix, we can easily compute the most accurate value of A-Factor.
Note that A-Factor serves as an index to the OF matrix, while the estimated forgetting index allows one to find the column of the OF matrix for which the optimum interval corresponds with the actually used interval corrected for the deviation of the estimated forgetting index from the requested forgetting index. At each repetition, a weighted average is taken of the old A-Factor and the new estimated value of the A-Factor. The newly obtained A-Factor is used in indexing the OF matrix when computing the new optimum inter-repetition interval.

To sum it up: repetitions result in computing a set of parameters characterizing the memory of the student: the RF matrix, the G-AF graph, and the FI-G graph. They are also used to compute A-Factors of individual items that characterize the difficulty of the learned material. The RF matrix is smoothed up to produce the OF matrix, which in turn is used in computing the optimum inter-repetition interval for items of different difficulty (A-Factor) and different number of repetitions (or memory lapses in the case of the first repetition). Initially, all the student's memory parameters are taken as for a less-than-average student (less-than-average yields faster convergence than average or more-than-average), while all A-Factors are assumed to be equal (unknown).
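To tie the pieces of this description together, here is a hedged Python sketch of the interval computation, including the spacing-effect correction for advanced repetitions. The OF value, the choice of t_half, and the example numbers are invented; only the formulas I(n)=I(n-1)*OF[n,AF], dOF=dOF_max*a/(t_half+a), and dOF_max=(OF-1)*(OI+t_half-1)/(OI-1) follow the archive text above:

```python
def next_interval(prev_interval: float, of: float, advance_days: float = 0.0,
                  t_half_fraction: float = 0.6) -> float:
    """Next inter-repetition interval: I(n) = I(n-1) * OF[n,AF].

    If the repetition is executed `advance_days` ahead of the optimum schedule,
    OF is decremented to account for the spacing effect:
        dOF     = dOF_max * a / (t_half + a)
        dOF_max = (OF-1) * (OI + t_half - 1) / (OI - 1)
    where OI is the optimum interval and t_half is taken here as 60% of OI
    (the fraction quoted in the text for well-structured material)."""
    optimum_interval = prev_interval * of
    if advance_days > 0:
        t_half = t_half_fraction * optimum_interval
        d_of_max = (of - 1) * (optimum_interval + t_half - 1) / (optimum_interval - 1)
        of = of - d_of_max * advance_days / (t_half + advance_days)
    return prev_interval * of

# On-schedule repetition vs. a repetition advanced by 10 days (illustrative numbers)
print(round(next_interval(prev_interval=30, of=1.5), 1))                   # 45.0
print(round(next_interval(prev_interval=30, of=1.5, advance_days=10), 1))  # shorter, ~38.5
```

Note how the correction behaves at the limit described in the text: for an advancement of OI-1 days, the decrement approaches OF-1, so the new interval collapses toward the previous one (no increase).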

1997: Employing neural networks

Neural Networks: Budding interest

In the mid-1980s, I read Michael Arbib's "Brains, Machines and Mathematics". It consolidated my view of the brain as an efficient computing machine.

For anyone with an interest in how the brain works, and this is almost everyone, neural networks are naturally fascinating. While studying computer science, I gained a new, computational perspective on the brain and neural networks. As neural networks have an uncanny capacity to do their own modelling, it may seem natural to employ them to study memory data and provide answers on how memory works. However, neural networks have one major shortcoming: they do not easily share their findings.