
At the time of this writing (2019), the University of California, Davis had recently developed a Data Science focus for the Statistics major and begun offering a series of courses dedicated to this interesting, up-and-coming field. I had just finished the newly created STA 141 series, which consists of three classes: STA 141A, STA 141B, and STA 141C. I wish to share my experience, along with the different concepts covered over the three quarters of taking these classes. It is with permission from the professors who taught these courses that I use their material to share with you all.

The courses gave us the opportunity to learn and practice programming in the freely available languages R and Python. I will avoid giving a complete introduction to these tools, as many such guides are already available online. Instead, I want to explain the concepts and methodologies that will give you a stronger grasp of the critical skills and techniques required to move beyond the beginning stages and toward the next level of Data Science.

Before going further, it is advised that you have at least some coding experience and are comfortable reading about technical issues that a novice would be unaware of or have trouble understanding. With at least a few months of experience writing code in a language such as Java, C++, R, or Python, it may seem that most programming goals can be reached with enough time and energy put into coding up a working solution. However, I feel that as a beginner it is too easy to fall into an ad-hoc way of thinking, where we code only up to the point of solving one specific case of an issue, without knowing how to approach the task from a broader perspective.

I do still think it is important for beginners to practice within their skill range so that they become familiar with common patterns and eventually learn new patterns they can adapt. This process helps them grow, evolve, and eventually become innovative problem solvers. To state the problem more clearly: when we code only within the small framework of homework assignments that teach the basic tools, we become trapped in a way of programming that prevents us from thinking outside the box. Simply pointing out that a beginner does not yet know how to think outside the box is not saying much; what matters is helping the beginner bridge the gap between the framework of small tasks and the goal of solving larger problems. (The beginner I am referring to is obviously not just myself, but anyone looking to become more skilled in the practice of data science.)

Let me be more explicit about what I am trying to express. When we first start out, it is typical to approach programming with the mindset that we are only solving tasks that are small in scale and never reach beyond a few megabytes of our computer's memory. Decades ago, such an approach was likely not a serious issue: it was almost unheard of to deal with datasets gigabytes in size, or customer data stored in exabytes within large data centers. An exabyte is one billion (1,000,000,000) gigabytes. Approaching programming at such a scale is daunting, but we must humble ourselves enough to understand the scope of what we are doing.

If beginners were to try to approach these larger datasets, they would quickly become overwhelmed by their inability to even begin programming on data so huge. The issue is that the techniques and methodologies they learned teach them the basics, but fail to reveal the narrowness of their thinking and methodology. (I don't want to say that what you have learned is wrong; it just likely has not prepared you to work with datasets larger than a few hundred rows.) We must then show the beginner where the issues lie and how to overcome them, in order to develop a mindset that recognizes the importance of thinking in terms of big data.

If we have a relatively small dataset, for instance only a few hundred observations, it is possible to scan through each observation, row by row, and extract interesting information that leads us toward our goal. To perform such a task, a beginner would likely write a series of for-loops and if-else statements, which are in general quite simple but require some time to get all the tiny nuances just right.

This is not an issue at first, as it helps beginners familiarize themselves with the programming language they are learning. It also teaches them to think critically about a problem and to work through the trial-and-error process that is integral to programming tasks.

NOTE: During this beginning stage it is also good to become comfortable searching online for solutions to common tasks. In particular, a resource such as Stack Overflow acts as a large database of questions and answers from people just like us who were stuck on an issue. In general, it is smart to avoid constantly reinventing the wheel: rather than using up precious time devising our own solutions to problems that others have already solved and shared for free online, we can adopt theirs (and these solutions are likely much more efficient than what we could think up on our own). It is customary to provide a link to the source of any solution you use. In the STA 141 courses, it was a surprise to learn that we could look anywhere online for solutions to certain steps of our assignments, as long as we cited where we found them in our reference section.

In a programming language such as R, it is possible to time such a for-loop over a dataset with a function like system.time(). On a small dataset the task may take only a fraction of a second, so optimizing this code is somewhat trivial. The problem lies in extending the same procedure to a significantly larger dataset. Not only is the data much larger, it may also be broken into many files or more complicated data structures. Sorting through these kinds of data requires the outside-the-box thinking that a beginner needs experience to develop. The simple for-loop approach could take hours, or might never finish at all due to issues encountered along the way.
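Here is a minimal sketch of what that timing looks like; the data is simulated purely for illustration:

```r
# Simulate a small dataset of a few hundred observations
set.seed(1)
small <- data.frame(x = rnorm(500))

# Time a beginner-style scan: a for-loop with an if statement
system.time({
  count <- 0
  for (i in seq_len(nrow(small))) {
    if (small$x[i] > 0) count <- count + 1
  }
})
```

At this scale the elapsed time is negligible; the point is that system.time() gives us a concrete number to watch grow as the data does.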

In R and Python, it makes sense at first to think of problems mainly in terms of for-loops and if-else statements. A beginning data scientist may use these methods to scan through datasets, and that is not an issue at first. However, I will try to explain some of the steps needed to go beyond thinking within such a small scope:

Functional Programming
apply/map
Vectorize
Parallelization/Cluster Computing

Functional Programming: We will begin by discussing the style of functional programming. I have given a brief example of using some functional programming in R (see the link in the Resources section below). The point is not necessarily that I have written a function and am therefore a functional programmer. It is that I am framing the problem I would like to solve and finding a solution built from customized functions that are capable of accomplishing the task.
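As a minimal sketch of that mindset (with made-up data, not the example from the link):

```r
# Frame one piece of the problem as a small, reusable function:
# summarize a numeric column, ignoring NA values
summarize_column <- function(x) {
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}

# Apply the same function to every column of a data frame
df <- data.frame(a = c(1, 2, NA, 4), b = c(10, 20, 30, 40))
lapply(df, summarize_column)
```

The function captures what we want done; applying it to each column is then a one-liner rather than a hand-written loop.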

If you have been programming for a while and don't immediately see the benefits of doing your data science work in this manner, that is fine. The difference between writing code this way versus the more Object-Oriented Programming (OOP) style we learned when first starting Java or C++ may not be apparent at the beginning. It takes practice to write code in this style, but we want to break the habit of thinking in that OOP style if we want to become better data scientists.

To put functional programming in the context of this big data discussion, let us consider a real-world example from the final project of my STA 141C class. We used a large dataset from usaspending.gov containing information about federal spending transactions. Each observation in the dataset has 61 columns, although of course there are numerous 'NA' values. Say we would like to examine the spending patterns of each of the 50 states, dividing their spending into annual values. We would go from a rather immense data frame, with 61 columns and over a million observations, down to a compact summary of annual spending per state.

As a beginner, it could make sense to first split the data by state or by year, and then somehow loop through from state to year (or year to state) to start calculating values. This is quite logical, and it is in fact exactly what I did. However, the beginner's way of carrying it out can be somewhat flawed: they may reach for for-loops and if-else statements to accomplish the task. Generally, when dealing with such large datasets we do not want to rely on for-loops or if-else statements at all (for a relatively simple task that does not require optimization, they are perhaps acceptable in certain cases). Admittedly, I am a more proficient R programmer, but the same issue exists in Python, and similar solutions are available there. What I want to discuss is how to think about problems in a way that avoids the usual OOP style of approach. Ideally, this brings speed and efficiency.

apply/map: In the previous example, what I did was use the lapply() function. There is a family of apply() functions in R that are not too different from a for-loop: the logic of using them is quite similar, but they are optimized in a way that fits well into the process of functional programming. I will not go into the smaller details, such as the difference between lapply() and sapply(). The difference is meaningful, but on this page I would like to cover the concepts without drilling too deep into the finer points.
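To make that concrete, here is a sketch of the state-by-year idea. The column names (state, year, amount) are simplified placeholders, not the actual usaspending.gov columns:

```r
# Simplified stand-in for the spending data
spending <- data.frame(
  state  = c("CA", "CA", "NY", "NY"),
  year   = c(2017, 2018, 2017, 2018),
  amount = c(100, 250, 300, 150)
)

# Split the rows by state, then lapply a custom function that
# totals the amounts within each year for that state
by_state <- split(spending, spending$state)
lapply(by_state, function(df) tapply(df$amount, df$year, sum))
```

No explicit for-loop or if-else appears; split() does the grouping and lapply() handles the iteration.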

In Python, the counterpart to the apply functions is the map() function. The thinking is similar: we are 'mapping' some function, such as sum(), onto a list of numbers. And since we are talking about functional programming, we would be writing our own customized functions to be 'mapped' onto other data structures.
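To keep all the examples here in one language: base R happens to have a Map() function as well, so the same 'mapping' idea can be sketched in R like this:

```r
# A list of numeric vectors to 'map' over
scores <- list(first = c(1, 2, 3), second = c(10, 20))

Map(sum, scores)      # returns a list, much like Python's map()
sapply(scores, sum)   # the same idea, simplified to a named vector
```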

Vectorize: As we start writing functions of our own, it is also necessary to become familiar with the functions that already exist, whether in the base packages of R or in third-party packages.

The key to vectorizing your code is this: when we get caught up writing all these functions to loop through with lapply(), it is easy to forget that R already has functions that can simplify a task we are over-complicating. A fundamental example is the sum() function, which adds together all the elements of a vector. Say we have a column in a data frame that we would like to total. It would be technically possible to loop through each row of the column, accumulating a running total. This would produce the desired result, but it would be far too time-consuming. Calling sum() on the entire column produces the same result in much less time.
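A rough sketch of the difference, timed with system.time() on simulated data:

```r
x <- runif(1e7)  # a large simulated column

# Accumulating a running total by hand: correct, but slow
system.time({
  total <- 0
  for (value in x) total <- total + value
})

# The vectorized built-in gives the same answer far faster
system.time(sum(x))
```

Both produce the same total; the vectorized call simply pushes the looping down into optimized compiled code.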

Admittedly, it is difficult to imagine a beginner knowing the vast number of functions available for the vectorizing process. To develop this skill, envision the problem you are trying to solve and think of how you would ask the question to someone else. Then type that into a search engine. Hopefully, someone has already asked a similar question on Stack Overflow, and you can simply go there to see what those with much more experience have said. This is how you become familiar with the variety of functions that help in the vectorizing process. It may sound strange, but it is one of the simplest ways to pick up the tricks that become commonplace as we become more skilled and proficient programmers.

Parallelization/Cluster Computing: Now that you are becoming more aware of this methodology for approaching and solving problems, it is worth getting a better grasp of how the entire process can be handled beyond just writing efficient code. There are packages like 'parallel' in R that help with this task (I have not done this in Python, so I cannot recommend a specific package there).

Parallel computing is when we take a task, break it up, and send the pieces to workers to do for us. The workers then send their results back to the manager, where they are reassembled. The workers are the different cores in your computer; for instance, with an 8-core laptop I can use all eight cores as workers for some parallel job. There are also more sophisticated parallel functions that balance the job among the workers so that they finish at roughly the same time when possible, avoiding the situation where most workers sit idle while the last one or two are stuck on larger-than-average files.
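Here is a minimal sketch using R's 'parallel' package; slow_task() is a made-up stand-in for real work:

```r
library(parallel)

# A made-up task: each call is independent of the others
slow_task <- function(i) {
  Sys.sleep(0.01)  # stand-in for real computation
  i^2
}

# Recruit the cores as workers, leaving one free for the system
cl <- makeCluster(detectCores() - 1)

# parLapply splits the jobs among the workers and reassembles
# the results in order; parLapplyLB() is the load-balanced
# variant for unevenly sized jobs
results <- parLapply(cl, 1:100, slow_task)

stopCluster(cl)  # always shut the workers down when finished
```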

To understand when this technique pays off, there is a term: 'embarrassingly parallel.' If we are writing a program that repeats some task many times, we are likely dealing with a problem that can benefit from parallel computing. More specifically, it is when we have a list of data and the order in which the pieces are processed does not matter. However, I am not sure about the benefit of running such jobs on our own machines, as sustained full load does seem to wear on the hardware. There is an alternative worth discussing.

Using a cluster of computers to perform a task is not something many of us readily have access to. I was able to do this in my STA 141C class using the university's servers, and I believe many companies offer a similar option for their employees. Essentially, we run the parallel process over a network of connected computers: we send our files to the server and tell it how many cores we would like to assign to our job, allowing many more cores to complete a rather large task. This is an interesting option for those with access to such a cluster, but for most of us doing data science on our own it is not feasible. Although it is unfortunate that we may not be able to take advantage of parallel or cluster computing on our own, I do not feel this invalidates the methodology discussed above.
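For what it is worth, the same 'parallel' package can drive workers on other machines. The hostnames below are hypothetical, and a real university or company cluster will often put a job scheduler in front of this step instead:

```r
library(parallel)

# Hypothetical hostnames: whether you can reach nodes directly
# like this depends entirely on how the cluster is administered
cl <- makeCluster(c("node01", "node02"), type = "PSOCK")

# The workers are separate R sessions, possibly on other machines,
# so the function they run must carry everything it needs with it
parLapply(cl, 1:10, function(i) i^2)

stopCluster(cl)
```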

Now that we have gone over some of the main ideas for dealing with the big data side of things, consider one more trick. When working in this functional programming style, it is easy to end up writing functions nested in functions that are buried in still other functions. The problem is that when we are, in essence, looping through many files with such a series of functions, things can slow down considerably, because we are applying some process to a subset of the data buried deep within the nesting. The result is a program that runs far too long, when it would have been possible to apply that step to the entire dataset once, at the beginning, before sending it through the series of functions that calculate the result.
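A sketch of the difference, reusing the hypothetical spending data from earlier (with an NA added):

```r
spending <- data.frame(
  state  = c("CA", "CA", "NY", "NY"),
  year   = c(2017, 2018, 2017, 2018),
  amount = c(100, NA, 300, 150)
)

# Slower pattern: the cleaning step is buried inside the applied
# function, so it runs once for every group
lapply(split(spending, spending$state), function(df) {
  df <- df[!is.na(df$amount), ]          # repeated per group
  tapply(df$amount, df$year, sum)
})

# Better pattern: clean the entire dataset once, up front, then
# send it through the same series of functions
clean <- spending[!is.na(spending$amount), ]
lapply(split(clean, clean$state), function(df) {
  tapply(df$amount, df$year, sum)
})
```

With four rows the difference is invisible, but with millions of rows and thousands of groups, hoisting the work out of the inner function can save an enormous amount of time.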

Finally, a last bit of advice, which comes by way of the professor of the graduate student who taught my last STA 141C class: it is good practice to write code as neatly as possible at all times. Even with neat code, we will always encounter errors; they are unavoidable. But if code is written neatly, we can always go back and easily diagnose the issue. Otherwise, the code becomes too difficult to read, and we cannot solve the problem because we cannot easily trace back our steps to fix it.

Resources:

https://stackoverflow.com/questions/28580591/apply-a-function-over-a-list-of-lists-of-dataframes-in-r?rq=1

https://blog.revolutionanalytics.com/2009/02/how-to-choose-a-random-number-in-r.html