During my Data Structures course at the University at Buffalo, Jesse Hartloff, one of my favorite professors, gave us the opportunity near the end of the class to do our assignments in Java, Python, or C++. I took the chance to learn a new language, my first interpreted language: Python. Python quickly became my preferred programming language because it is easy to learn and fast to write. Professor Hartloff made a great point in class that as computers get faster, there is less importance on spending hours fine-tuning code to be perfectly optimal and more importance on quickly deploying correct code. I continued to code in Python and agreed with that sentiment until a discovery about a month ago changed my mind about interpreted languages. What I discovered raised the question,

Does it actually require more effort to write good code in interpreted languages than in compiled languages?









About a month ago, while doing a technical interview over the phone, I was posed a very simple question.

Given a file of 1 million numbers, ranging from 1 to 100,000, find the number that occurs the most

In technical interviews we want to present a naive solution first in order to get something down, just in case the ideal solution takes too long to figure out. This is where Python excels: its ability to produce working functionality quickly makes it an easy choice to interview in.

My naive solution was pretty simple. I would read the data from the file into a dictionary that holds each number's count, and then go over that dictionary to find the key with the highest value. The solution I presented was as follows:

def find_max(input_file):
    count = {}
    inf = open(input_file, 'r')
    for line in inf.readlines():
        if int(line) not in count:
            count[int(line)] = 1
        else:
            count[int(line)] = count[int(line)] + 1
    max_count = -1
    max_count_key = -1
    for key in count:
        if count[key] > max_count:
            max_count = count[key]
            max_count_key = key
    return max_count_key

When analyzing an algorithm, in interviews and in general, we talk about runtime (usually worst case) along with space efficiency. This specific algorithm runs in O(n) time, since the runtime grows linearly with the input. Because we know the file will always contain 1 million numbers, we could even call the runtime O(1), or constant, since the input size is locked at 1 million. The space efficiency is similarly O(n), since the dictionary grows with the input; but because the numbers only range from 1 to 100,000, the space is also bounded by a constant, as there will be at most 100,000 unique keys in the dictionary.
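As an aside, the counting itself can be written more compactly with Python's standard library. The following is just a sketch of an equivalent approach using collections.Counter, not something I presented in the interview:

from collections import Counter

def find_max_counter(input_file):
    # Build the same number -> count mapping in one pass, then ask
    # Counter for the entry with the highest count.
    with open(input_file, 'r') as inf:
        counts = Counter(int(line) for line in inf)
    return counts.most_common(1)[0][0]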

After presenting the naive solution, I mentioned that I think it is asymptotically optimal: I don't believe I can do better than O(n), since I will always have to read the full input. However, I said I believed I could at least reduce the constant, meaning the number of passes we perform. When discussing algorithms, whether we read the input once or 100 times, we still consider the algorithm O(n) as long as we are not nesting loops. The interviewer was happy with the two-loop solution, so we moved on, but I wanted to test the improvement of a single loop on my own time.

A couple of days after the interview I went ahead and wrote a single-loop implementation to test the runtime difference. If we adjust how we find the max, it is possible to do this as we read the input, in the following way:

def find_max_2(input_file):
    count = {}
    max_count = -1
    max_count_key = -1
    inf = open(input_file, 'r')
    for line in inf.readlines():
        if int(line) not in count:
            count[int(line)] = 1
        else:
            count[int(line)] = count[int(line)] + 1
        if count[int(line)] > max_count:
            max_count = count[int(line)]
            max_count_key = int(line)
    return max_count_key

This new implementation, while admittedly still asymptotically the same, reduces our number of iterations by at most 100,000, since we no longer have to walk the dictionary after the fact.

In order to test the run times, I extended my script using Python's time module and the following lines of code:

import time

start = time.time()
print(find_max_2('nums.txt'))
end = time.time()
print('Method two took ' + str(end - start))

start = time.time()
print(find_max('nums.txt'))
end = time.time()
print('Method one took ' + str(end - start))

When I ran the script, to my surprise, I found that method two, the one with fewer loops, took longer than method one. The output was:

58186
Method two took 1.3319640159606934
58186
Method one took 0.9964487552642822

Admittedly both times are still very fast; however, we reduced our iterations by up to 100,000, so surely we should have seen a faster runtime.

I began to hunt for the culprit of the slowdown. I learned in my Operating Systems class with Geoffrey Challen that I/O, or input/output, is very slow relative to other operations. So what in our program related to I/O could be causing the slowdown? If we look at method two, there is one thing we can single out as happening very frequently. Since our input comes from a file, we are reading strings, and since these strings are numbers, we want to convert them to integers in order to treat them as numbers. We convert the strings to integers with the int(line) call. Suspecting that this conversion could be an expensive operation, I wrote an alternate version of the function that simply holds the converted integer in a variable, eliminating the repeated conversion. The new function was as follows:

def find_max_2(input_file):
    count = {}
    max_count = -1
    max_count_key = -1
    inf = open(input_file, 'r')
    for line in inf.readlines():
        num = int(line)
        if num not in count:
            count[num] = 1
        else:
            count[num] = count[num] + 1
        if count[num] > max_count:
            max_count = count[num]
            max_count_key = num
    return max_count_key

Re-running our timing script with the new function produced the following output:

58186
Method two took 0.58432936668396
58186
Method one took 0.9925329685211182

It looks like we found our culprit. The repeated conversion to an integer slowed our script down significantly; even though we reduced the number of loops, the repeated conversion completely negated the improvement. The new time is significantly better, as it should be. I brushed the issue off, thinking I had learned a simple lesson: avoid repeated type conversions when possible. That was until I had a discussion the following week.
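To convince myself that it really was the conversion, a quick check like the following (a hypothetical sketch, not part of my original script) isolates the cost of calling int() on the same line several times versus once:

import timeit

line = '58186\n'

# Several conversions per loop iteration, as in the first find_max_2
repeated = timeit.timeit(lambda: (int(line), int(line), int(line), int(line)), number=1000000)

# One conversion stored in a local variable, as in the fixed version
single = timeit.timeit(lambda: int(line), number=1000000)

print('Repeated conversions took ' + str(repeated))
print('Single conversion took ' + str(single))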

I ended up going on-site with the company for further interviews the next week. During my second interview I had what I consider my favorite interview of all time, with an engineering manager at the company. The interview was very casual, and the manager seemed interested not only in my background but also in my personal projects. While discussing them, I touched on a project I did at the LinkedIn Bay Area Intern Hackathon: a typing-recognition program that learns how you type and locks your computer if it detects an intruder who does not type like you do. (source code for the project) I mentioned my decision to code the back end in the C programming language. C is a language I had become extremely fond of after taking Operating Systems with Geoffrey Challen and Compilers with another one of my favorite professors, Carl Alphonce.

To my surprise, my interviewer became very interested in the project and asked what my role on the team was. I told him my main focus was creating the data structures, something I find very interesting and also specialized in for my Compilers class. While talking about the project, my interviewer asked for a sample of the code I wrote. I pulled up the source code and discussed the main data structure for the project. In the project we hold onto key pairs: when you type, we record how long you press one key, how long you take to travel to a second key, and how long you press the second key. This was represented in a struct and held in a mass-store struct in the following format:

struct keyBundle {
    char k_firstPressed;
    char k_secondPressed;
    double k_dataTimes[NUM_TIME_ARRAYS][NUM_REMEMBERED];
    double k_dataTimesOldest[NUM_TIME_ARRAYS];
};

struct keyBundleStore {
    struct keyBundle *kbs_data[MASS_STORE_SIZE][MASS_STORE_SIZE];
    struct keyBundle *kbs_lastInserted;
};

The main idea for the mass store was to index a 2-D array using the ASCII value of the first character as the first index and the ASCII value of the second character as the second index. I admitted this was an inefficient and rushed implementation under the time constraints of the hackathon, and I mentioned that if I had time to do it again I would change the implementation.
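For illustration, the lookup boiled down to something like the following sketch (the function name is mine, not the hackathon code, and it assumes MASS_STORE_SIZE is large enough to cover the ASCII range of the key presses):

/* Index the 2-D array directly by the ASCII values of the two key presses. */
struct keyBundle *lookup_pair(struct keyBundleStore *store, char first, char second) {
    return store->kbs_data[(int)first][(int)second];
}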

What my interviewer said next shocked and excited me:

Okay, you mentioned you can do it better; let's make that your interview.

Even though I had previously mentioned that my main interview language was Python, I was about to have an interview in C. This sounds like a nightmare situation; however, since it was a project I had worked on and cared about, I was excited about the idea.

What I really disliked about the mass-store data structure was the 2-D array. If it were possible to flatten the array to one dimension, it would be significantly easier to use. However, one dimension means we need a unique key to represent each key pair in order to index on a single dimension. This is a challenge because, for example, simply adding the two keys' ASCII values can produce the same sum for two different pairs. For those unfamiliar, every character, or key press, is represented as a number in the computer; we refer to that number as the character's ASCII value. (More information here)

Take for example:

char 'A' has ASCII value 65
char 'O' has ASCII value 79
65 + 79 = 144 -> (use 144 as the index key)

char 'H' has ASCII value 72
char 'H' has ASCII value 72
72 + 72 = 144 -> (use 144 as the index key)

These values conflict and should not index the same slot.

Our issue is how to come up with a unique value to represent our key pair. Thinking about this further, and with a bit of help from my interviewer, we can do some bit manipulation to resolve the issue.

The basic algorithm for our new key is:

1. Convert char a to an int to get its ASCII value
2. Bit shift a to the left by 8 bits (chars are 8-bit values)
3. Convert char b to an int to get its ASCII value
4. Add b's ASCII value to the shifted a

Since we have shifted and combined the bits, we now have a unique key for each key pair. This can be accomplished with a simple bit of C code:

char a = 'a';
char b = 'b';

int asciiA = (int)a;
int asciiB = (int)b;

short key = asciiA << 8;
key += asciiB;

However, the interview did not stop there. The shift and add still cost instructions, and we can do better. A further refinement takes advantage of C's union construct and the fact that its members share the same memory location: we can hold the two chars and the short in the same memory, achieving the same outcome with even less work. In the interest of not explaining too much C, I won't go into much detail for the final solution. The refined solution is as follows:

struct charPair {
    char first;
    char second;
};

union data {
    short key;
    struct charPair pair;
};

char a = 'a';
char b = 'b';

union data example;
example.pair.first = b;
example.pair.second = a;
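Reading the overlapping short member then gives the combined key with no explicit shift. Here is a minimal, self-contained sketch of the idea (my own example, assuming a little-endian machine with 8-bit chars and a 16-bit short, so that pair.first lands in the low-order byte):

#include <stdio.h>

struct charPair { char first; char second; };
union data { short key; struct charPair pair; };

int main(void) {
    union data example;
    /* Store b in the low byte and a in the high byte; on a little-endian
       machine example.key then equals (a << 8) + b. */
    example.pair.first = 'b';
    example.pair.second = 'a';
    printf("combined key: %d\n", example.key);  /* prints 24930, the same as ('a' << 8) + 'b' */
    return 0;
}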

What I love about C programming is exactly what was just demonstrated: the refinement possible and the amount of efficiency you can squeeze out of the language is simply beautiful. However, looking at all of this bit manipulation and refinement raises the question,

Is it worth it?

What I purposely did not mention earlier is that during the hackathon the entire project had to be rewritten. Our front-end guru, Angus Lam, found an awesome library to hook my C code up to his code; however, the library itself was not functioning properly. Since we were calling the library on every single key press, it just did not work as intended. So Angus Lam and I, at 4 am, rewrote the entire project in JavaScript, an interpreted language. Over the course of the rewrite it became clear that what took 4-5 hours in C took about an hour in JavaScript. I mentioned this to my interviewer, stating:

I love the beauty of C, but I have come to realize that what can be done in a couple of hours of C can be done in half an hour of JavaScript or Python.

We both agreed on the sentiment, and at this point I thought it would be interesting to bring up my discovery about the Python oddity of an integer conversion destroying runtime. My interviewer pointed out that this is a cost of using an interpreted language: in a compiled language, the compiler could recognize the repeated conversion and eliminate it. I don't know how I had never made the connection before. Even after taking Carl Alphonce's wonderful Compilers class and learning so much, I was still blinded by the ease of Python.

So here I am at a very interesting crossroads. I used to believe that compiled languages are "old school" and that to be successful with a language such as C or Java you need to think very hard about how you structure your code. Compiled languages traditionally come with a lot of extra baggage that the "new school" languages don't, such as braces, explicit typing, and the sheer number of extra lines necessary to write a program. What has become apparent, however, is that a lazier coder can be more successful in a compiled language than in an interpreted one, since the compiler can clean up so much of the code.

Python and other interpreted languages are often thought to be slower than compiled languages. Some of this is inherent to interpretation versus compilation; however, the realizations I have had raise the question,

How much of interpreted languages' slowness is due to developer error?

It seems that when developing in an interpreted language, much more responsibility is placed on the developer to structure the code so that it performs well than when developing in a compiled language.

Looking Forward

I have been a teaching assistant for intro computer science classes at my university for three years now. Our department is currently undergoing many curriculum changes, about which there are many differing opinions. We have traditionally taught Java as the intro language for many years. I was previously strongly against any change to the language: it worked, I enjoyed it, why change anything? As the years went by, I became more accepting of changing our intro language to JavaScript or Python. My previous fear, that these languages hide so much, slowly went away as I realized how easy it is to get work done in them and how excited students get about what they can do so easily.

However, this new realization has made me wary. Should the simplicity of these interpreted languages be reason enough to usher in a new generation of developers using them? Sure, they are easy to learn, but if we do not focus on being thoughtful about our code, as shown in my first example, small mistakes can lead to significant runtime increases.

Retrospect

To be honest, as of writing this article I still choose Python as my main language for personal projects. I would like to stress that this is for personal projects; any large, scaling project I have worked on has been mostly in compiled languages. However, the ease of getting a project done quickly is too much to pass up. As of this writing I have written two projects in Python this week during my free time, and I do not believe it would have been as easy in a different language. Even with all of my realizations I have become more careful in my code, but I cannot escape the laziness when I want to write a new program.

Closing

If you have made it this far, thank you. I would be very interested in hearing your opinion on anything I have discussed. Discussing programming languages is always an enjoyable experience, even if at times it gets a bit redundant.





Thank you to those mentioned in this article:

Jesse Hartloff: A professor who has truly opened my mind up to ideas I would have never discovered without his encouragement.

Carl Alphonce: A professor who has guided and taught me so much over my college career. I would never be where I am without his guidance.

Angus Lam: A peer, co-worker, and teammate who has inspired me over the short time we have known each other.





As well as:

Sanjeevani Choudhery and Keith Carolus for taking the time to help edit this article.
