Machine learning is a branch of computer science that studies the design of algorithms that can learn.

Typical tasks are concept learning, function learning or “predictive modeling”, clustering and finding predictive patterns. These tasks are learned from available data that were gathered through, for example, experience or instruction.

The hope that comes with this discipline is that incorporating experience into these tasks will eventually improve the learning. The ultimate goal is for this improvement to happen automatically, so that humans like ourselves no longer need to intervene.

Today’s scikit-learn tutorial will introduce you to the basics of Python machine learning.

If you’re more interested in an R tutorial, take a look at our Machine Learning with R for Beginners tutorial.

Alternatively, check out DataCamp's Supervised Learning with scikit-learn and Unsupervised Learning in Python courses!

Loading Your Data Set

The first step in just about anything in data science is loading your data. This is also the starting point of this scikit-learn tutorial.

Machine learning typically works with observed data. You can collect this data yourself, or you can browse through other sources to find data sets. If you’re not a researcher or otherwise involved in experiments, you’ll probably do the latter.

If you’re new to this and you want to start working on problems on your own, finding these data sets might prove to be a challenge. However, you can typically find good data sets at the UCI Machine Learning Repository or on the Kaggle website. Also, check out this KDnuggets list with resources.

For now, you should warm up, not worry about finding any data by yourself and just load in the digits data set that comes with a Python library called scikit-learn.

Fun fact: did you know the name originates from the fact that this library is a scientific toolbox built around SciPy? By the way, there is more than just one scikit out there. This scikit contains modules specifically for machine learning and data mining, which explains the second component of the library name. :)

To load in the data, you import the module datasets from sklearn. Then, you can use the load_digits() function from datasets to load in the data and print the result to see what you are working with.
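The loading step described above can be sketched as follows. This is a minimal example that only assumes scikit-learn is installed; the variable name digits is the one used throughout the rest of this tutorial.

```python
# Import `datasets` from `sklearn`
from sklearn import datasets

# Load in the `digits` data
digits = datasets.load_digits()

# Print the `digits` data
print(digits)
```

The printed object is dictionary-like, so expect a fairly long printout containing arrays and a text description.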



Note that the datasets module contains other functions to load and fetch popular reference data sets, and you can also count on this module in case you need artificial data generators.

This data set is also available through the UCI Repository that was mentioned above: you can find the data here. If you had decided to pull the data from the latter page, you would have imported it with the read_csv() function from pandas, passing it the URL of the optdigits.tra file together with header=None.



Note that if you download the data like this, the data is already split up into a training and a test set, indicated by the extensions .tra and .tes. You’ll need to load in both files to work on your project. A single read_csv() call on the .tra file only loads in the training set.

Tip: if you want to know more about importing data with the Python data manipulation library Pandas, consider taking DataCamp’s Importing Data in Python course.
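If you do go the UCI route, a small helper can load the training and the test file in the same way. The sketch below is only an illustration: the helper name load_optdigits is hypothetical, the .tes URL is assumed to sit next to the .tra URL, and the file layout (64 pixel features followed by the label in the last column) is an assumption that you should check against the data description. The demonstration runs on a tiny in-memory sample so that it works offline.

```python
import io

import pandas as pd

# The .tra URL is the one used for the training set; the .tes URL is an
# assumption based on the file extensions mentioned in the text
TRAIN_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra"
TEST_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes"


def load_optdigits(source):
    """Read one header-less CSV file and split it into features and labels.

    Assumes that the last column holds the label (hypothetical helper).
    """
    frame = pd.read_csv(source, header=None)
    features = frame.iloc[:, :-1]
    labels = frame.iloc[:, -1]
    return features, labels


# Offline demonstration on a two-row sample with three features and one label
sample = io.StringIO("0,1,6,3\n5,2,0,7\n")
X_demo, y_demo = load_optdigits(sample)
print(X_demo.shape, y_demo.tolist())

# With network access, you would load both files like this:
# X_train, y_train = load_optdigits(TRAIN_URL)
# X_test, y_test = load_optdigits(TEST_URL)
```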

Explore Your Data

When first starting out with a data set, it’s always a good idea to go through the data description and see what you can already learn. With scikit-learn, this information isn’t shown to you up front, but when you import data from another source, there's usually a data description present, which already gives you enough information to gather some first insights into your data.

However, these insights alone are not deep enough for the analysis that you are going to perform. You really need to have a good working knowledge of the data set.

Performing an exploratory data analysis (EDA) on a data set like the one that this tutorial now has might seem difficult.

Where do you start exploring these handwritten digits?

Gathering Basic Information on Your Data

Let’s say that you haven’t checked any data description folder (or maybe you want to double-check the information that has been given to you). Then you should start by gathering the necessary information.

When you printed out the digits data after having loaded it with the help of the scikit-learn datasets module, you will have noticed that there is already a lot of information available. You already know things such as the target values and the description of your data. You can access the digits data through the attribute data. Similarly, you can also access the target values or labels through the target attribute and the description through the DESCR attribute.

To see which keys you have available to get to know your data, you can just run digits.keys().
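The attributes mentioned above can be inspected as in the following minimal sketch, which reloads the data so that it runs on its own.

```python
from sklearn import datasets

digits = datasets.load_digits()

# Get the keys of the `digits` data
print(digits.keys())

# Print out the data
print(digits.data)

# Print out the target values
print(digits.target)

# Print out the description of the `digits` data
print(digits.DESCR)
```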



The next thing that you can (double)check is the type of your data.

If you used read_csv() to import the data, you would have had a data frame that contains just the data. There wouldn’t be any description component, but you would be able to resort to, for example, head() or tail() to inspect your data. In these cases, it’s always wise to read up on the data description folder!

However, this tutorial assumes that you make use of the library's data, and the type of the digits variable is not that straightforward if you’re not familiar with the library. Look at the printout of the digits data that you loaded first. You’ll see that digits actually contains NumPy arrays!

This is already quite vital information. But how do you access these arrays?

It’s straightforward, actually: you use attributes to access the relevant arrays. Remember that you have already seen which attributes are available when you printed digits.keys(). For instance, you have the data attribute to isolate the data, target to see the target values and DESCR for the description, …

But what then?

The first thing that you should know of an array is its shape, that is, the number of dimensions and the number of items along each of them. The array’s shape is a tuple of integers that specify the size of each dimension. In other words, if you have a 3D array like y = np.zeros((2, 3, 4)), the shape of your array will be (2, 3, 4).

Now let’s try to see what the shape is of the arrays that you have distinguished. First use the data attribute to isolate the NumPy array from the digits data and then use the shape attribute to find out more. You can do the same for target. Note that DESCR is a plain string rather than an array, so it has no shape. You can also count the unique labels with len(np.unique(digits.target)).

There’s also the images attribute, which holds the same data arranged as images. You’re also going to inspect its shape.
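The shape checks described above can be sketched as follows. The variable names digits_data, digits_target, digits_images and number_digits are just illustrative choices.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# A 3D example: the shape lists the size of each dimension
y = np.zeros((2, 3, 4))
print(y.shape)

# Isolate the `digits` data and inspect the shape
digits_data = digits.data
print(digits_data.shape)

# Isolate the target values with `target` and inspect the shape
digits_target = digits.target
print(digits_target.shape)

# Count the number of unique labels
number_digits = len(np.unique(digits.target))
print(number_digits)

# Isolate the `images` and inspect the shape
digits_images = digits.images
print(digits_images.shape)
```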



To recap: by inspecting digits.data, you see that there are 1797 samples and that there are 64 features. Because you have 1797 samples, you also have 1797 target values.

Those target values take only 10 unique values, namely the integers from 0 to 9. This means that the digits that your model will need to recognize are numbers from 0 to 9.

Lastly, you see that the images data contains three dimensions: there are 1797 instances that are 8 by 8 pixels big. You can check that the images and the data are related by reshaping the images array to two dimensions: digits.images.reshape((1797, 64)).

But if you want to be entirely sure, better to check with print(np.all(digits.images.reshape((1797,64)) == digits.data)).

With the NumPy function all(), you test whether all array elements evaluate to True. In this case, you evaluate if it’s true that the reshaped images array equals digits.data. You’ll see that the result will be True in this case.
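As a small sketch of that last check, you can let NumPy infer the second dimension with -1 instead of hard-coding 64. The names n_samples, flattened and same are only illustrative.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten every 8x8 image into a row of 64 values
n_samples = len(digits.images)
flattened = digits.images.reshape((n_samples, -1))

# Compare the flattened images with `digits.data`, element by element
same = np.all(flattened == digits.data)
print(flattened.shape, same)
```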

Visualize Your Data Images With matplotlib

Then, you can take your exploration up a notch by visualizing the images that you’ll be working with. You can use one of Python’s data visualization libraries, such as matplotlib, for this purpose:

```python
# Import matplotlib
import matplotlib.pyplot as plt

# Note: `digits` is the data set loaded earlier with `datasets.load_digits()`

# Figure size (width, height) in inches
fig = plt.figure(figsize=(6, 6))

# Adjust the subplots
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# For each of the first 64 images
for i in range(64):
    # Initialize the subplots: add a subplot in the grid of 8 by 8, at the i+1-th position
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    # Display an image at the i-th position
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # Label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

# Show the plot
plt.show()
```

The code chunk seems quite lengthy at first sight, and this might be overwhelming. But what happens in the code chunk above is actually pretty easy once you break it down into parts:

You import matplotlib.pyplot.

Next, you set up a figure with a figure size of 6 inches wide and 6 inches high. This is your blank canvas where all the subplots with the images will appear.

Then you go to the level of the subplots to adjust some parameters: you set the left side of the subplots of the figure to 0, the right side to 1, the bottom to 0 and the top to 1. The height of the blank space between the subplots is set at 0.05 and the width is also set at 0.05. These are merely layout adjustments.

After that, you start filling up the figure that you have made with the help of a for loop.

You initialize the subplots one by one, adding one at each position in the grid that is 8 by 8 images big.

Each time, you display one of the images at the current position in the grid. As a color map, you take binary colors, which in this case will result in black, gray values and white colors. The interpolation method that you use is 'nearest', which means that each pixel is drawn as a solid block instead of being smoothed. You can see the effect of the different interpolation methods here.

The finishing touch is the addition of text to your subplots. The target labels are printed at coordinates (0,7) of each subplot, which in practice means that they will appear in the bottom-left of each of the subplots.

Don’t forget to show the plot with plt.show()!

In the end, you’ll get to see the following:



On a simpler note, you can also visualize the target labels with an image, just like this:

```python
# Import matplotlib
import matplotlib.pyplot as plt

# Join the images and target labels in a list
images_and_labels = list(zip(digits.images, digits.target))

# For every element in the list
for index, (image, label) in enumerate(images_and_labels[:8]):
    # Initialize a subplot of 2x4 at the (index + 1)-th position
    plt.subplot(2, 4, index + 1)
    # Don't plot any axes
    plt.axis('off')
    # Display images in all subplots
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Add a title to each subplot
    plt.title('Training: ' + str(label))

# Show the plot
plt.show()
```

Which will render the following visualization:



Note that in this case, after you have imported matplotlib.pyplot, you zip the two NumPy arrays together and save the result into a variable called images_and_labels. You’ll see now that this list contains tuples that each pair an instance of digits.images with the corresponding digits.target value.

Then, you say that for the first eight elements of images_and_labels (note that the index starts at 0!), you initialize subplots in a grid of 2 by 4 at each position. You turn off the plotting of the axes and you display images in all the subplots with the color map plt.cm.gray_r (a reversed grayscale map) and the interpolation method nearest. You give a title to each subplot, and you show it.

Not too hard, huh?

And now you have an excellent idea of the data that you’ll be working with!

Visualizing Your Data: Principal Component Analysis (PCA)

But is there no other way to visualize the data?

As the digits data set contains 64 features, this might prove to be a challenging task. You can imagine that it’s tough to understand the structure and keep the overview of the digits data. In such cases, it is said that you’re working with a high dimensional data set.

High dimensionality of data is a direct result of trying to describe the objects via a collection of features. Other examples of high dimensional data are, for example, financial data, climate data, neuroimaging, …

But, as you might have gathered already, this is not always easy. In some cases, high dimensionality can be problematic, as your algorithms will need to take into account too many features. In such cases, you speak of the curse of dimensionality: having a lot of dimensions can also mean that your data points are far away from virtually every other point, which makes the distances between the data points uninformative.

Don’t worry, though, because the curse of dimensionality is not merely a matter of counting the number of features. There are also cases in which the effective dimensionality might be much smaller than the number of the features, such as in data sets where some features are irrelevant.

In addition, you can also understand that data with only two or three dimensions are easier to grasp and can also be visualized easily.

That all explains why you’re going to visualize the data with the help of one of the dimensionality reduction techniques, namely Principal Component Analysis (PCA). The idea in PCA is to find linear combinations of the original variables that contain most of the information. In the simplest case of two correlated variables, one such new variable or “principal component” can replace the two original variables.

In short, it’s a linear transformation method that yields the directions (principal components) that maximize the variance of the data. Remember that the variance indicates how far a set of data points lie apart. If you want to know more, go to this page.

You can easily apply PCA to your data with the help of scikit-learn. You create a randomized PCA model and a regular PCA model that each keep two components, fit and transform digits.data with fit_transform(), and store the results in reduced_data_rpca and reduced_data_pca.
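A minimal sketch of this PCA step is shown below. It assumes a recent scikit-learn version, in which the separate RandomizedPCA class is no longer available and PCA(svd_solver='randomized') is used instead. The random_state value is an arbitrary choice added here for reproducibility.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()

# Create a PCA model with the randomized solver that keeps two components
# (assumption: this stands in for the older `RandomizedPCA` class)
randomized_pca = PCA(n_components=2, svd_solver='randomized', random_state=42)

# Fit the model to the data and transform the data
reduced_data_rpca = randomized_pca.fit_transform(digits.data)

# Create a regular PCA model that keeps two components
pca = PCA(n_components=2)

# Fit the model to the data and transform the data
reduced_data_pca = pca.fit_transform(digits.data)

# Inspect the shape: one row per sample, one column per component
print(reduced_data_pca.shape)

# Share of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
```

The explained variance ratio gives you a rough feeling for how much information is lost when you go from 64 features to two.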



Tip: you have used the RandomizedPCA() here because it performs better when there’s a high number of dimensions. Note that RandomizedPCA was deprecated in scikit-learn 0.18 and removed in 0.20; in current versions, PCA(svd_solver='randomized') gives you the randomized solver. Try replacing the randomized PCA model or estimator object with a regular PCA model and see what the difference is.

Note how you explicitly tell the model only to keep two components. This is to make sure that you have two-dimensional data to plot. Also, note that you don’t pass the target class with the labels to the PCA transformation because you want to investigate if the PCA reveals the distribution of the different labels and if you can clearly separate the instances from each other.

You can now build a scatterplot to visualize the data:

```python
import matplotlib.pyplot as plt

# Note: `digits` and `reduced_data_rpca` come from the previous steps

# One color per digit label
colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']

for i in range(len(colors)):
    # Select the two principal components of the points with label i
    x = reduced_data_rpca[:, 0][digits.target == i]
    y = reduced_data_rpca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])

# Place the legend outside of the axes, to the right
plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()
```

Which looks like this:



Again you use matplotlib to visualize the data. It’s useful for a quick visualization of what you’re working with, but you might have to consider something a little bit fancier if you’re working on making this part of your data science portfolio.

Also note that the last call to show the plot (plt.show()) is not necessary if you’re working in Jupyter Notebook, as you’ll want to put the images inline. When in doubt, you can always check out our Definitive Guide to Jupyter Notebook.

What happens in the code chunk above is the following:

You put your colors together in a list. Note that you list ten colors, which is equal to the number of labels that you have. This way, you make sure that your data points can be colored in according to the labels.

Then, you set up a range that goes from 0 to 10. Mind you that this range is not inclusive, so it covers 0 up to and including 9! Remember that this is the same for indices of a list, for example.

You set up your x and y coordinates. You take the first or the second column of reduced_data_rpca, and you select only those data points for which the label equals the index that you’re considering. That means that in the first run, you’ll consider the data points with label 0, then label 1, … and so on.

You construct the scatter plot. Fill in the x and y coordinates and assign a color to the batch that you’re processing. In the first run, you’ll give the color black to all data points, in the next run blue, … and so on.

You add a legend to your scatter plot. Use the target_names key to get the right labels for your data points.

You add labels to your x and y axes that are meaningful.

You reveal the resulting plot.

Where To Go Now?

Now that you have even more information about your data and you have a visualization ready, it does seem a bit like the data points sort of group together, but you also see there is quite some overlap.

This might be interesting to investigate further.

Do you think that, in a case where you knew that there are 10 possible digits labels to assign to the data points, but you have no access to the labels, the observations would group or “cluster” together by some criterion in such a way that you could infer the labels?

Now, this is a research question!

In general, when you have acquired a good understanding of your data, you have to decide on the use cases that would be relevant to your data set. In other words, you think about what your data set might teach you or what you think you can learn from your data. From there on, you can think about what kind of algorithms you would be able to apply to your data set in order to get the results that you think you can obtain.

Tip: the more familiar you are with your data, the easier it will be to assess the use cases for your specific data set. The same also holds for finding the appropriate machine learning algorithm.

However, when you’re first getting started with scikit-learn, you’ll see that the number of algorithms that the library contains is pretty vast and that you might still want additional help when you’re assessing your data set. That’s why this scikit-learn machine learning map will come in handy.

Note that this map does require you to have some knowledge about the algorithms that are included in the scikit-learn library. This, by the way, also holds some truth for taking this next step in your project: if you have no idea what is possible, it will be tough to decide on what your use case will be for the data.

As your use case was one for clustering, you can follow the path on the map towards “KMeans”. You’ll see the use case that you have just thought about requires you to have more than 50 samples (“check!”), to work without labeled data (“check!”, since you act as if you have no access to the labels), to know the number of categories that you want to predict (“check!”) and to have less than 10K samples (“check!”).

But what exactly is the K-Means algorithm?

It is one of the simplest and most widely used unsupervised learning algorithms to solve clustering problems. The procedure partitions a given data set into a certain number of clusters that you have configured before you run the algorithm. This number of clusters is called k; you choose it yourself, and the initial cluster centers are then typically selected at random.

Then, the k-means algorithm will find the nearest cluster center for each data point and assign the data point to that cluster.

Once all data points have been assigned to clusters, the cluster centers will be recomputed. In other words, new cluster centers will emerge from the average of the values of the cluster data points. This process is repeated until most data points stick to the same cluster. The cluster membership should stabilize.

You can already see that, because the k-means algorithm works the way it does, the initial set of cluster centers that you provide can have a significant effect on the clusters that are eventually found. You can, of course, deal with this effect, as you will see further on.

However, before you can go into making a model for your data, you should definitely take a look into preparing your data for this purpose.
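To make the assign-and-recompute loop described above more concrete, here is a bare-bones NumPy sketch. It is an illustration only and not scikit-learn's KMeans implementation: the function name k_means, the parameters n_iter and seed, and the toy points are all assumptions made for this example.

```python
import numpy as np


def k_means(points, k, n_iter=100, seed=0):
    """Toy k-means: returns the cluster centers and a label per point."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points at random as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Assignment step: find the nearest cluster center for each point
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the cluster membership no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each center to the mean of its assigned points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels


# Two well-separated groups of three points each
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = k_means(points, k=2)
print(labels)
print(centers)
```

Because the initial centers are drawn at random, which group ends up with label 0 and which with label 1 depends on the seed. On less clearly separated data, the final clusters themselves can depend on that starting point too.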

Preprocessing Your Data

As you have read in the previous section, before modeling your data, you’ll do well by preparing it first. This preparation step is called “preprocessing”.

Data Normalization

The first thing that you’re going to do is preprocess the data. You can standardize the digits data by, for example, making use of the scale() function:

    # Import
    from sklearn.preprocessing import scale

    # Apply `scale()` to the `digits` data
    data = scale(digits.data)



By scaling the data, you shift the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).
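If you want to convince yourself of this, a quick check along the following lines may help. It is a sketch that reloads the digits data so that it can run on its own; the variable name `stds` is only used for this illustration. One caveat is worth knowing: a feature that has the same value for every sample has no spread to rescale, so such a column ends up with a standard deviation of zero rather than one.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)

# Column means of the scaled data should be zero up to rounding error
print(np.allclose(data.mean(axis=0), 0))

# Standard deviation per column: 1 for ordinary features,
# 0 for features that are constant across all samples
stds = data.std(axis=0)
print(np.unique(np.round(stds, 6)))
```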

Splitting Your Data Into Training And Test Sets

To assess your model’s performance later, you will also need to divide the data set into two parts: a training set and a test set. The first is used to train the system, while the second is used to evaluate the learned or trained system.

In practice, the training and test sets are disjoint. A common splitting choice is to take 2/3 of your original data set as the training set, while the 1/3 that remains composes the test set. The code chunk below uses a slightly different split: in the arguments of the train_test_split() function, test_size is set to 0.25, so a quarter of the data is held out for testing and three quarters are used for training.

You’ll also note that the argument random_state has the value 42 assigned to it. With this argument, you can guarantee that your split will always be the same. That is particularly handy if you want reproducible results.

    # Import `train_test_split`
    from sklearn.model_selection import train_test_split

    # Split the `digits` data into training and test sets
    X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)



After you have split up your data set into training and test sets, you can quickly inspect the numbers before you go and model the data:

    import numpy as np

    # Number of training samples and features
    n_samples, n_features = X_train.shape

    # Print out `n_samples`
    print(n_samples)

    # Print out `n_features`
    print(n_features)

    # Number of training labels
    n_digits = len(np.unique(y_train))

    # Inspect `y_train`
    print(len(y_train))



You’ll see that the training set X_train now contains 1347 samples, which is about three quarters of the samples in the original data set, and 64 features, which hasn’t changed. The y_train training set likewise contains the labels for three quarters of the original data set. This means that the test sets X_test and y_test each contain the remaining 450 samples.
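The arithmetic behind these numbers can be reproduced by hand. The sketch below assumes that the digits data set has 1797 samples (1347 + 450) and that the test share is rounded up to a whole number of samples, which is how train_test_split() behaves as far as I know; the variable names are chosen for this illustration only.

```python
import math

n_total = 1797     # samples in the digits data set (1347 + 450)
test_size = 0.25   # the value passed to train_test_split()

# Round the test share up, give the rest to the training set
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

print(n_train, n_test)              # 1347 450
print(round(n_train / n_total, 3))  # 0.75
```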

Clustering The digits Data

After all these preparation steps, you have made sure that all your known (training) data is stored. No actual model or learning was performed up until this moment.

Now, it’s finally time to find those clusters of your training set. Use KMeans() from the cluster module to set up your model. You’ll see that there are three arguments that are passed to this method: init , n_clusters and the random_state .

You might still remember this last argument from before when you split the data into training and test sets. This argument basically guaranteed that you got reproducible results.

    # Import the `cluster` module
    from sklearn import cluster

    # Create the KMeans model
    clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)

    # Fit the training data to the model
    clf.fit(X_train)

The init argument indicates the method for initialization. It defaults to ‘k-means++’, so even though you see it passed explicitly in the code, you can leave it out if you want. Try it out in the code chunk above!

Next, you also see that the n_clusters argument is set to 10 . This number not only indicates the number of clusters or groups you want your data to form, but also the number of centroids to generate. Remember that a cluster centroid is the middle of a cluster.

Do you still remember the possible disadvantage of the K-Means algorithm that the previous section described? The initial set of cluster centers that you provide can have a significant effect on the clusters that are eventually found.

Usually, you try to deal with this effect by trying several initial sets in multiple runs and by selecting the set of clusters with the minimum sum of the squared errors (SSE). In other words, you want to minimize the distance of each point in the cluster to the mean or centroid of that cluster.

By adding the n_init argument to KMeans(), you can determine how many different centroid configurations the algorithm will try.
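To get a feel for this argument, you could compare a model that tries a single initialization with one that keeps the best of ten. The sketch below is an illustration, not part of the tutorial’s own code: it reloads and scales the digits data so that it runs on its own, the names `single` and `best_of_ten` are made up, and it relies on the fitted attribute inertia_, which holds the sum of squared distances of the samples to their closest centroid.

```python
from sklearn import cluster, datasets
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)

# One initialization versus the best of ten initializations
single = cluster.KMeans(n_clusters=10, n_init=1, random_state=42).fit(data)
best_of_ten = cluster.KMeans(n_clusters=10, n_init=10, random_state=42).fit(data)

# `inertia_` is the sum of squared distances to the closest centroid (the SSE)
print(single.inertia_)
print(best_of_ten.inertia_)
```

You would typically expect the second value to be no larger than the first, because the model keeps the run with the lowest SSE, but the exact numbers depend on your scikit-learn version and the random seed.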

Note again that you don’t want to insert the test labels when you fit the model to your data: these will be used to see if your model is good at predicting the actual classes of your instances!

You can also visualize the images that make up the cluster centers as follows:

    # Import matplotlib
    import matplotlib.pyplot as plt

    # Figure size in inches
    fig = plt.figure(figsize=(8, 3))

    # Add title
    fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

    # For all labels (0-9)
    for i in range(10):
        # Initialize subplots in a grid of 2X5, at i+1th position
        ax = fig.add_subplot(2, 5, 1 + i)
        # Display images
        ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
        # Don't show the axes
        plt.axis('off')

    # Show the plot
    plt.show()

If you want to see another example that visualizes the data clusters and their centers, go here.

The next step is to predict the labels of the test set:

    # Predict the labels for `X_test`
    y_pred = clf.predict(X_test)

    # Print out the first 100 instances of `y_pred`
    print(y_pred[:100])

    # Print out the first 100 instances of `y_test`
    print(y_test[:100])

    # Study the shape of the cluster centers
    clf.cluster_centers_.shape

In the code chunk above, you predict the values for the test set, which contains 450 samples. You store the result in y_pred . You also print out the first 100 instances of y_pred and y_test , and you immediately see some results.
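Keep in mind when comparing the two print-outs that the values in y_pred are cluster indices, not digits: cluster 3 does not have to contain the digit 3. One possible way to make the comparison more meaningful, sketched below as an illustration rather than as the tutorial’s method, is to give every cluster the digit that occurs most often among its training samples. The names `cluster_to_digit` and `y_guess` are made up for this example, and the code repeats the earlier steps so that it runs on its own.

```python
import numpy as np
from sklearn import cluster, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.25, random_state=42)

clf = cluster.KMeans(init='k-means++', n_clusters=10, n_init=10, random_state=42)
train_clusters = clf.fit_predict(X_train)

# For every cluster, look up the digit that occurs most often among its training members
cluster_to_digit = np.zeros(10, dtype=int)
for c in range(10):
    members = y_train[train_clusters == c]
    cluster_to_digit[c] = np.bincount(members, minlength=10).argmax()

# Translate the test-set cluster indices into digit guesses and compare with `y_test`
y_pred = clf.predict(X_test)
y_guess = cluster_to_digit[y_pred]
print(cluster_to_digit)
print((y_guess == y_test).mean())
```

Note that this mapping uses the training labels after the fact, purely to interpret the clusters; the clustering itself was still fitted without any labels.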

In addition, you can study the shape of the cluster centers: you immediately see that there are 10 clusters, each with 64 features.

But this doesn’t tell you much because we set the number of clusters to 10 and you already knew that there were 64 features.

Maybe a visualization would be more helpful.

Let’s visualize the predicted labels:

    # Import `Isomap()`
    from sklearn.manifold import Isomap

    # Create an isomap and fit the `digits` data to it
    X_iso = Isomap(n_neighbors=10).fit_transform(X_train)

    # Compute cluster centers and predict cluster index for each sample
    clusters = clf.fit_predict(X_train)

    # Create a plot with subplots in a grid of 1X2
    fig, ax = plt.subplots(1, 2, figsize=(8, 4))

    # Adjust layout
    fig.suptitle('Predicted Versus Training Labels', fontsize=14, fontweight='bold')
    fig.subplots_adjust(top=0.85)

    # Add scatterplots to the subplots
    ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)
    ax[0].set_title('Predicted Training Labels')
    ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
    ax[1].set_title('Actual Training Labels')

    # Show the plots
    plt.show()

You use Isomap() as a way to reduce the dimensions of your high-dimensional digits data set. The difference with the PCA method is that Isomap is a non-linear reduction method.

Tip: run the code from above again, but use the PCA reduction method instead of the Isomap to study the effect of reduction methods yourself.

You will find the solution here:

    # Import `PCA()`
    from sklearn.decomposition import PCA

    # Model and fit the `digits` data to the PCA model
    X_pca = PCA(n_components=2).fit_transform(X_train)

    # Compute cluster centers and predict cluster index for each sample
    clusters = clf.fit_predict(X_train)

    # Create a plot with subplots in a grid of 1X2
    fig, ax = plt.subplots(1, 2, figsize=(8, 4))

    # Adjust layout
    fig.suptitle('Predicted Versus Training Labels', fontsize=14, fontweight='bold')
    fig.subplots_adjust(top=0.85)

    # Add scatterplots to the subplots
    ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
    ax[0].set_title('Predicted Training Labels')
    ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
    ax[1].set_title('Actual Training Labels')

    # Show the plots
    plt.show()

At first sight, the visualization doesn’t seem to indicate that the model works well.

But this needs some further investigation.