
Introduction to Machine Learning with Scikit Learn Video Transcript

Evi: Okay, good day everyone. Today we will be looking at an introduction to machine learning with scikit-learn. For this session of RC Days, we will be presenting together with my colleague Ibrahim Alabi. So what are we going to be doing? We are going to be using the Software Carpentry introduction to machine learning lesson, and we will have the links for it in the chat if you don't already have them. We'll start with an introduction to what machine learning is, and then go into the two major types of machine learning problems: classification problems and regression problems. Mainly, we'll be looking at how we can use scikit-learn to accomplish machine learning. For this tutorial, we'll need a basic setup that we can easily follow. If you are new to data science with Python, some of the tools you need include a Python environment where you run the labs, and one option you can use here is to run the labs using Google Colab. So with that, you're…

Kyle: You’re not sharing your screen.

Evi: Oh sorry, let me share my screen. Okay, so we'll be running the labs using Google Colab. All you need to do is go to your search bar and select Google Colab, and with any Google account you'll have access to a notebook; you can just click on File and then New notebook.

Then sign in with your Google account and you'll be set to follow the labs we have in this session. I already have my code here on the left side of the screen. To get the datasets, I'm going to be putting some code in the chat. This code is very similar to the code you have on the lesson page, that's this code here, but we slightly modified it to fit our Colab notebook and make it easier for us to retrieve the data. Alternatively, maybe you are using your own Anaconda setup; by the way, Anaconda is easy to install, it gives you a Python environment for data science, and it comes with a Jupyter notebook, which is similar to this Colab notebook. I'm going to paste this code in the chat; put it in the first cell, and once you put it in the first cell, run that cell with the play button here. You'll then have your data downloaded into a folder called data. Before we start looking at the notebooks, I think we should do a quick overview of Colab just so that we know where each button is. Each of these cells is basically a cell that accepts Python code. Alternatively, you can click to add a text cell and put a title here; you could put some title here and you can also format that title. If you want to use a code cell instead, you can write some Python code and run it using these controls. We'll be using this as we proceed, so before we get into the labs, let's look at a basic introduction to machine learning. Recently, you might have heard the term 'machine learning'. It's becoming a buzzword due to its wide-ranging applications.

These applications span across fields like science, engineering, and even marketing. Machine learning can be applied to many fields. So, what is machine learning?

Machine learning is simply a collection of tools and techniques that help us discover patterns in data. Usually, if you want to make predictions, or find a dependent value from a set of independent values, you would use a mathematical function or formula to connect those values; this is essentially classical mathematics. But with machine learning, we use data to find the pattern that links it to an outcome: we start with some data and look through it to find a pattern that connects that data to some kind of output.

So what we are going to be doing is looking at a couple of techniques that can help us achieve this. The two major categories are regression problems and classification problems.

So, we will be looking at predictive and classification models. The aim of predictive models is to receive some inputs and predict a continuous variable as the output, like the cost of an item. For instance, we could predict stock prices, housing values, or even a country's GDP.

Meanwhile, classifiers are used to categorize things into different groups. Consider a dataset of handwritten digits: our goal would be to classify which digit each image represents. That's a classification problem. Or perhaps we have a dataset containing images of dogs and cats, and we want to build a model that can classify these images correctly. Another example is classifying an incoming email as spam or not. These are the two main categories in machine learning. There are other types too, but we'll be focusing on these two.

So, next, we'll talk about data. In machine learning, we're trying to find patterns in data. What is this data, and what kind can we use? Well, data can take many forms, depending on our goals. For instance, if we're predicting stock prices, we would consider past stock prices and other influencing factors. If we're predicting life expectancy, we might consider health metrics.

Examples of data we can use for various machine learning projects include heights and weights to predict a person’s weight based on their height, traffic conditions to predict commute times, or stock market trends to predict house prices. For classification problems, we might classify emails as spam or not, or classify images as containing a person or not.

Typically, we need a lot of data to train an effective model: at least hundreds or thousands of instances, sometimes even millions, to make highly accurate predictions or classifications.


Even though it’s possible to train models with less data, these models might not generalize well because of the limited size of their training sets. We’ll talk more about this later.

Next, let’s discuss the types of outputs.

For regression problems, outputs are usually continuous values. For instance, if we're trying to predict a person's weight based on their height, the output would be a continuous value, like 120.6 pounds, as the predicted weight. Similarly, if we're trying to predict commute times, the output could be, say, 30 or 15 minutes. These are essentially continuous values.

However, for classifiers, we’re trying to distinguish whether something is in class A or class B, like spam or not spam, human or not human, dog or cat, etc. This is the basic concept of classification.

Now, you might have heard the terms machine learning and artificial intelligence used interchangeably. While they are closely related, there are some subtle differences between them. When we talk about artificial intelligence, we’re considering systems that can adapt to a broad range of tasks or problems. In other words, AI is a more generalized concept. On the other hand, machine learning models are usually trained to solve specific problems. For instance, a company called OpenAI recently released a tool called ChatGPT, which was built using machine learning. This model is excellent for language processing, but it is specifically trained for that task and can’t be applied broadly to other tasks. That’s one of the key differences between machine learning and artificial intelligence.

While the model is excellent at language processing, it can't, for instance, predict the amount of water present based on an image. Machine learning (ML) models are typically trained for specific tasks or systems. So that's one of the major differences between AI and machine learning: AI is a broader, more generalized concept, while ML models are focused on specific, task-oriented applications.

Let’s look at some applications of machine learning.

Machine learning models can be applied across various domains.


You can use machine learning models for a wide range of topics. For example, they can be used in image recognition or object detection, to identify objects in videos or images. They can be used to recognize images of cats or dogs or even distinguish between different breeds. We can also use ML for character recognition, where we aim to identify handwritten digits or alphabets.

There's also a wide range of classification and regression problems that can be solved using ML. For instance, it can be used in the healthcare sector for disease detection and diagnosis. In financial contexts, machine learning can be used to predict insurance payouts and housing prices, or even to forecast crime rates. These are just a few examples of the applications of machine learning in our everyday lives.

Machine learning is, indeed, a transformative tool that has numerous applications across various sectors. Machine learning is also making waves in research, where it can be used for classification or prediction tasks. For instance, given remote sensing images, we can predict the presence of water in certain areas. If we have a remote sensing instrument that scans a region and captures images, we can use those images to determine whether water is present in that specific location.

This technology has already been utilized in various fields. For example, machine learning has been used to predict or identify breast tumors using X-ray images. That’s another compelling application of machine learning.

Moreover, we can use machine learning to predict the behavior of animals by analyzing GPS data. While there is much excitement about machine learning and its predictive capabilities, it’s important to note that it is not a panacea. Machine learning models can also produce inaccurate results.

Next, let’s discuss some of the limitations of machine learning. Every machine learning model we build is only as good as the data we provide it. So, there are two major limitations or shortcomings when it comes to machine learning. Firstly, the performance of every machine learning model is closely tied to the quality of the input data.

This leads us to the first major problem with machine learning. For a machine learning model to perform optimally, we need both high-quality data and a well-designed model. Let's focus on the data aspect first. If we try to build a machine learning model with a small dataset, the model might struggle to learn effectively due to the limited amount of data, and it might not generalize well. That is one problem related to the size of the dataset.

Another issue is bias in the data. Bias can be introduced in numerous ways during data collection or preprocessing. For example, consider developing a model to determine which operating system is better, Android or iOS. If we gather data predominantly from Android users, our results will likely be biased toward Android; if our data is mainly from iPhone users, the results will likely favor iOS. Therefore, when we're building a machine-learning model, it's critical to ensure a fair representation of all classes in our data: the representation of each class should ideally reflect the real-world distribution, because any bias in data representation can lead to biased predictions. Besides sampling bias, other types of bias can creep into your data, so being mindful of these is critical.

Now, moving on to problems that can arise from the model itself, one of the key issues is overfitting. Overfitting occurs when a model learns the training data too well: it aligns so closely with that data that when you try to generalize the model to unseen data, it may not perform well. Essentially, overfitting is like memorizing the answers to an exam; it works well for the questions you've seen, but it doesn't help with new questions. This can happen if you train a model too much on a subset of data: it ends up learning the specific details of that data and then struggles to make good predictions on new, unseen data. When deployed in real-world scenarios, an overfitted model might not perform well, because it's too closely tuned to the specifics of the training data.

One way to overcome this issue of overfitting is by applying regularization techniques. There are various ways to prevent overfitting in machine learning models.

Typically, when creating machine learning models, there are various techniques and methods to prevent overfitting for every type of problem. When creating a machine learning model with any dataset, it's beneficial to partition your data into at least two sets: a training set and a test set. The training set is what the model learns from; after the model has learned from the training data, you then use the test set to evaluate the model's performance and see how well it generalizes to unseen data. After you've trained and tested the model, you'll have performance metrics from both sets. If the model's performance is significantly worse on the test set than on the training set, it's likely overfitting the data: performing well on the training data but failing to do so on the test data.
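As a rough illustration of this train/test idea, here is a minimal sketch on made-up data, not code from the lesson itself, using scikit-learn's train_test_split:

```python
# Minimal sketch of a train/test split; the data here is synthetic and illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # 100 points, one feature
y = 1.5 * X[:, 0] + rng.normal(0, 1, size=100)     # roughly linear relationship plus noise

# Hold out 25% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# If the test error is much worse than the training error, the model is likely overfitting
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```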

However, there’s another problem that can occur, which is essentially the opposite of overfitting, called underfitting. Underfitting occurs when the model doesn’t learn enough from the training data, resulting in poor performance on both the training and test sets. In some rare cases, underfitting might result in the model performing worse on the training set than on the test set. However, there are methods available to address both overfitting and underfitting when developing machine learning models.

The key takeaway here is that machine learning is a set of tools we can use to find patterns in data. In this section, we’ve discussed two main challenges in machine learning: overfitting and underfitting. But remember, there are solutions to these problems, and an understanding of these issues is an important step toward creating effective models.

To summarize, machine learning is a powerful tool, but it’s critical to use it correctly. The process involves finding patterns in data, and we’ve talked about two key aspects of this process: overfitting and underfitting. We’ve discussed the two major types of machine learning models – those used for predicting values and those for classifying data. We’ve also noted that there are various techniques that can be used for data classification. Next, we’ve drawn a distinction between machine learning (ML) and artificial intelligence (AI), with AI generally referring to the ability of computers to mimic human-like intelligence. On the other hand, ML is a more specific model within the larger AI field, although the two terms are sometimes used interchangeably. We also discussed some limitations of machine learning, particularly when it comes to issues of bias in the systems.

Now, we’re transitioning to the next section which focuses on regression problems. Here, we’re aiming to predict a continuous value using machine learning techniques. In the first part of this regression section, we’re going to manually explore how this can be done using Python.

We'll look at how we can manually implement this using Python and try to provide a general explanation of what's happening. In the following section, we'll use Scikit-learn, a Python machine-learning library, to show how these problems can be solved. But first, let's take a look at what linear regression is.

When you start to think about or visualize linear regression, it’s essentially an attempt to draw a line that can best fit your data. Consider a set of data points.

Linear regression is essentially about finding a line that best fits or represents these points. So, the question is: given this set of data points, what sort of line can we draw that will best fit this data?

On the left-hand side, we have these discrete data points. When I mention linear regression, it essentially means finding an optimal line that fits these points as closely as possible. So, before we delve deeper into linear regression, we're going to perform a small exercise to better understand this concept. In this first cell, I'm importing some Python libraries. The first one is the numpy library, which is essentially used for array manipulation in Python. The next library is matplotlib, which will help us with data visualization, or plotting. Moving on, I have defined a function here. This function computes the equation of a straight line given 'm' as the gradient, 'c' as a constant, and 'x' as the value. Now, we're going to attempt to fit a straight line to this dataset manually, using the function we just defined, and see what we come up with. Next, let's see how this line-fitting process can be used for prediction. To do this, I'm going to guess values for the gradient 'm' and the constant 'c': I will set the y-intercept of the straight-line equation to zero and the gradient to one. Then, I will use the equation of a straight line to generate some predictions to match this data. After entering these values and running the cell, you can see the result of my first guess.
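A minimal sketch of this manual exercise might look like the following; the x and y values here are illustrative and may not match the lesson's exact data:

```python
# Sketch of the manual line-fitting exercise: guess m and c, then eyeball the fit.
import numpy as np
import matplotlib.pyplot as plt

def make_line(x, m, c):
    """Equation of a straight line: y = m*x + c."""
    return m * x + c

x_data = np.array([2, 3, 5, 7, 9])      # illustrative data
y_data = np.array([4, 5, 7, 10, 15])

m, c = 1.0, 0.0                          # first guess for gradient and intercept
y_guess = make_line(x_data, m, c)

plt.scatter(x_data, y_data, label="data")
plt.plot(x_data, y_guess, label=f"guess: m={m}, c={c}")
plt.legend()
plt.show()
```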

Let's execute this and see what happens. As you can see, the line follows the general trend of the data, but it's noticeably far from the actual data points. To improve this, I will try adjusting the gradient. Let's change it to, say, 0.2 and see what happens.

After adjusting the gradient and running the cell again, we can see that the line is getting closer to the actual data points. Next, I will try adjusting the gradient to 0.9. The line continues to shift, but it’s still not close enough to the actual data.

Even when I adjust the gradient to 75, the line still falls far away from the actual data points. In order to find the best-fitting line, we need to fine-tune our parameters further: I still need to do a bunch of trials, trying different values and hoping that at some point I find a value that roughly fits this data. Now, when I put 1.5 here, I get a line that is closer to those data points. I don't have to adjust only one value, either; say this is 0.5, I can also try adjusting the intercept value, for example to three.

So now we can see the line being plotted. I'm going to stick with 1.5 for the gradient because that gives me something very close, and you can see how the position of the line shifts as I change this value. I'm going to stick with 0.0 for the intercept. From all the trials I did, this is the line that's closest to my data points. In a nutshell, this is what linear regression does: it's a model that tries to fit a straight line through our data, so that the average distance from every point in the dataset to that line is minimized. We're just trying to find the best values of m and c that give us a line that best fits this data. There are a couple of ways to find these ideal values, and one of those methods is called least squares. The idea of least squares is to consider the distance between each data point and the line: we take those differences, square them, and sum them up. We do that for one choice of m and c, then for another, and so on; whichever choice gives the smallest sum of squares is the one that best fits our data. So what we're going to do here is use the gapminder dataset that we downloaded previously. I'm going to open it to see what it is: this dataset contains life expectancy in different regions or countries of the world over different years, and I'm going to use it to make some predictions. Here we have more comprehensive code for linear regression; the earlier example was just a simple demonstration of how this works behind the scenes. This block of code defines a least-squares function. First, we initialize the sum of x and the sum of y to zero; we also initialize the sum of x squared and the sum of x times y to zero. Basically, we go over each value in the data and use these sums to determine which values of m and c produce the smallest sum of squares. That's essentially what this function is doing.
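For reference, a least-squares fit along these lines can be sketched as below; the function name and the exact sums are my own reading of what's being described, using the standard closed-form formulas for m and c:

```python
# Hedged sketch of a least-squares fit, not necessarily the lesson's exact function.
def least_squares(x_values, y_values):
    """Return the gradient m and intercept c that minimise the sum of squared errors."""
    n = len(x_values)
    x_sum = sum(x_values)
    y_sum = sum(y_values)
    x_sq_sum = sum(x * x for x in x_values)
    xy_sum = sum(x * y for x, y in zip(x_values, y_values))

    # Closed-form solution for simple linear regression
    m = (n * xy_sum - x_sum * y_sum) / (n * x_sq_sum - x_sum ** 2)
    c = (y_sum - m * x_sum) / n
    print("x sum:", x_sum, "y sum:", y_sum)
    print("m:", m, "c:", c)
    return m, c

# Same illustrative data as before
m, c = least_squares([2, 3, 5, 7, 9], [4, 5, 7, 10, 15])
```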

So, I've already copied this block of code into my environment, and I'm just going to run it, so I now have the function for this task. I'm going to test it out using some specific values: given these values of x and these values of y, I'm going to run least squares on them. When I run that cell, we see the sum of the x values come out to 26, and the values that minimize the error, the m and c values, come out to roughly 1.5 and 0.3. Interestingly, this is the exact same example we used for our manual computation earlier.

We can see that just by guessing we got a result that's not that far off; our guessed m value of 1.5 is somewhat close to the computed value of 1.54.

So, essentially, what this function has done for us is save us the effort of manually trying these values: this method helps us find the straight line that minimizes the error. If I input this value of 1.51 and the value of 0.3 here, I'll get a line that's much closer to the optimal line, the one that actually minimizes the error between the predicted and the true values.

So yeah, this is the line we get from using the least squares approach.

Now, how do we test the accuracy of our model, how do we validate the approach? To do that, like I said earlier, we just want to measure the distance between each of these points and the predicted value. The larger the distance, the less accurate our model is; the smaller the distance, the more accurate it is. One popular method used to calculate this difference is the root mean square error, or simply the mean square error.

So, here we basically define a function for measuring error in our data, where we have this first dataset and the second dataset. So what this function does is, you can give it the original data, which in this case will be our y values, and we can give it the predicted values. Because if you look at this line here, you can clearly see that when X is three, we pretty much got the value of y, but when X was five, we didn’t quite get the value of y, right?

So we want to use this goodness of fit measure to calculate this difference, and that’s what this function is for. We’ll use this function to try to minimize or find the error.
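A minimal sketch of such an error measure is shown below; the names and values here are illustrative, not the lesson's exact code:

```python
# Root mean squared error between observed and predicted values.
import math

def measure_error(y_true, y_pred):
    """Average size of the prediction error, in the same units as y."""
    squared_diffs = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))

# Illustrative values: observed y compared against predictions from a line y = 1.52*x + 0.3
x_vals = [2, 3, 5, 7, 9]
y_true = [4, 5, 7, 10, 15]
y_pred = [1.52 * x + 0.3 for x in x_vals]
print("RMSE:", measure_error(y_true, y_pred))
```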

And if we run this line of code, we can measure the difference and find the error. We can see that this is basically the error we get on average when we use these particular values; if we were to use some other values, we would likely get a much larger error. This is the plot of the original data and the line of best fit, showing the difference between them. The original data is the one in blue, and the line of best fit is the smooth blue line. We can see that they intersect at some points, but we can also see that there are differences between this line and the data, and that brings us to an important point in machine learning.

When we do machine learning, it’s almost impossible to get a model that will give you 100% accuracy. It’s not something that is easily achievable. What we’re trying to do is, we’re trying to get a model that will closely approximate the true values as much as possible. So when we’re building machine learning models, we’re not trying to get the exact values, but we’re trying to get values that are as close to the true values as possible.

Now, I'm going to try this technique on a different dataset: the life expectancy dataset. Before we start, we'll load the life expectancy dataset, but first let's define our goal and understand what we're trying to achieve here.

We're given the life expectancy dataset, and our objective is to use it to build a machine-learning model that predicts life expectancy for different countries or regions at different years. The code we have here is essentially a preprocessing step for the dataset: since the dataset comprises different countries and different years, this code helps us process the data accordingly and handle any adjustments that might be needed. Now, let's go through this code. It's primarily for preprocessing the data, and it's been defined as a function. The first thing it does is read in the data frame.

This function takes a filename, a country, and a minimum and maximum year as input arguments. It helps us preprocess the data and get a subset of the data that is relevant to our predictive model. The first thing we do is use a function from pandas, a library for manipulating data frames in Python, to read the data file with the index column we care about.

Then, this line of code helps us get a subset of the data. We do that by selecting a range of rows and columns from the data frame, specifying the minimum and maximum year for the subset; this type of indexing helps us target a specific part of the data. With this subset selected, we're ready to apply our model. Now we can apply the least squares function that we defined earlier, and perform some calculations to get the mean squared error (MSE) values, which help us measure the error. I may have made a minor mistake here, but that's fine. Basically, this is just a hands-on example of using least squares to predict life expectancy; we'll go into more detail when we use scikit-learn to do the prediction. That's an overview of linear regression and an attempt to manually implement it. Now we're going to perform this example using scikit-learn.

Scikit-learn is a machine-learning library that includes a collection of functions implementing a variety of machine learning algorithms. We'll be using the linear model module from scikit-learn, and specifically a model called LinearRegression, to perform our task. The first thing we need to do is import the required libraries: numpy for array manipulation and scikit-learn's linear model. Then, within the data-processing function we defined in the previous section, we're going to use the LinearRegression model from scikit-learn. The function takes the file name as input and reads the file we provide; in this case, we're looking at the life expectancy dataset. The next step is to select a subset of the data, and then we invoke the LinearRegression model from scikit-learn and fit it to our dataset. Once we've done that, we can extract both the coefficients and the intercept from the regression model, which correspond to the 'm' and 'c' in our linear equation. With these values, we can use the model to make predictions on our x data, using the inputs we provided. That's how we can use scikit-learn for our linear regression tasks.
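A hedged sketch of this workflow is shown below; the file path, index column, and column layout (years as columns, countries as the index) are assumptions based on the description, not the notebook's exact code:

```python
# Sketch: fit a LinearRegression model to year vs. life expectancy for one country.
import numpy as np
import pandas as pd
import sklearn.linear_model as skl_lin

def fit_life_expectancy(filename, country, min_year, max_year):
    # Index column name is an assumption about the file's layout
    df = pd.read_csv(filename, index_col="Life expectancy")

    # Select this country and the requested year range (year columns assumed to be strings)
    years = [str(y) for y in range(min_year, max_year + 1)]
    life_expectancy = df.loc[country, years].to_numpy(dtype=float)

    x = np.array([int(y) for y in years]).reshape(-1, 1)   # scikit-learn expects 2D inputs
    regression = skl_lin.LinearRegression().fit(x, life_expectancy)

    print("m (coefficient):", regression.coef_[0])
    print("c (intercept):", regression.intercept_)
    return regression.predict(x)

# Hypothetical usage; adjust the path and country to match your data folder:
# predictions = fit_life_expectancy("data/gapminder-life-expectancy.csv", "Afghanistan", 1960, 2010)
```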

We can also calculate the error between the predictions we made and the actual values. So in this case, let's go ahead and run this cell: we're going to use the linear regression from scikit-learn to fit the data and then do some computations. You can simply copy this cell and try it out yourself to see how it works.

Upon running that, we're able to see the error it produces when we perform linear regression on this dataset. We can also experiment by changing the country parameter; we have several other countries to choose from, such as Afghanistan, Albania, Algeria, or Angola. For instance, you can change this value here to Angola to see the life expectancy for it. Let's examine that. So, I think I have some issues with my notebook, so I believe it's a good time for us to take a break. We'll take a short break of about 10 minutes and resume around 16 minutes past the hour.

Ibrahim: So, in this tutorial, we have seen that when you're trying to predict a continuous variable, one approach is to simply draw a line through the scatter plot to fit the data as best as possible. But in reality, this is a very simplistic approach, and there are other kinds of regression methods that may fit better. In cases where the scatter plot does not show a linear relationship between the data points, relying on linear regression doesn't make sense, because we would end up with a model that is very simple and doesn't capture the variability of the data. This is where other types of regression come in. It should be noted here that when we say linear regression, in most cases it's the parameters that are linear, not necessarily the variables themselves. We can transform the variables into non-linear forms by taking their square or cube, and that's where polynomial regression comes into play.

When the relationship between the outcome and the predictors cannot be explained using a straight line, we can resort to other forms of regression, such as polynomial regression. The idea behind polynomial regression is that instead of assuming the relationship is y = mx + c, we can transform the x variable in a non-linear way, for example by taking the square or cube of x, or even raising e to the power of x, whatever fits the data better. The caveat here is that making your model too complex can lead to a problem known as overfitting, where your model fits the training data extremely well but fails to generalize to the test data; the model has been fitted so closely to the training set that it doesn't perform well on the test set. There's also a phenomenon called underfitting, where the model doesn't perform well on either the training set or the test set. In reality, the challenge is in finding a sweet spot between a very simple model and a very complex one.

Now, moving on to polynomial regression: we use the same function as before, but this time, instead of simple linear regression, we apply polynomial features to our x variable and then fit the same linear regression model. The polynomial features transform, by default, takes the square of each feature variable. We fit the linear regression to this transformed data and observe how well the model fits. If we plot this fitted line against the original data, we see that it fits the data points quite well; compared to the initial straight-line fit, this model performs better. In reality, you would want to examine the shape of your data before deciding whether to apply a linear regression or a polynomial regression. There are other kinds of analysis to consider as well, such as cluster analysis. Now, let's shift our focus a bit and talk about the broader categories of machine learning.
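Before moving on, here is a rough sketch of the polynomial-regression step just described, on made-up data rather than the notebook's code: PolynomialFeatures transforms x (adding x squared by default), and an ordinary LinearRegression is then fit to the transformed features.

```python
# Polynomial regression sketch: transform x, then fit a plain linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
x = np.linspace(0, 5, 50).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 2 - x[:, 0] + rng.normal(0, 0.5, size=50)   # curved relationship

poly = PolynomialFeatures(degree=2)      # degree=2 adds x^2 (and a bias column)
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)
print("R^2 on the training data:", model.score(x_poly, y))
```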

As mentioned before, machine learning often involves predicting outcomes, but sometimes we might not want to make predictions. Broadly speaking, we categorize machine learning into two types; there are more, but generally we classify them as supervised and unsupervised. The linear regression we've done so far, or any task that involves making some kind of prediction, either numeric or non-numeric, is called supervised learning: we have some target variable that we're trying to predict. However, a lot of the time we're not interested in prediction; we're interested in grouping things together, or in reducing the number of features we have, and we can also achieve that with machine learning. These types of tasks don't involve prediction, just finding relationships between the features or the observations; that's called unsupervised learning.

There are different algorithms for this, one of which is cluster analysis. At a very basic level, clustering means grouping things together. For example, if you have a photo album and you want to group photos of the same person together into one portion of it, that would be a clustering problem. Any analysis that involves grouping things with similar characteristics together is called cluster analysis: the idea is that items similar to each other should be in one group, and items dissimilar to each other should be in separate groups. Cluster analysis usually doesn't require any prior training and is an unsupervised technique; unsupervised learning, as I've mentioned earlier, is fundamentally about finding relationships, as opposed to trying to predict something from the features. The lack of a need for training means it can be applied quite readily, depending on the size of the dataset at hand. For instance, you could group similar kinds of objects together instead of trying to predict something, putting swans in one folder and tools in another based on their images. These are popular examples of cluster analysis.

There are different algorithms for cluster analysis. The most common one is called K-means clustering, but there are also others like spectral clustering, which we are going to talk about in a moment. You also have others like DBSCAN or hierarchical clustering, but they are beyond the scope of this tutorial. At a very basic level, what K-means tries to do is find a centroid for each cluster, and then data points that are close to that centroid are put in that group. The challenge here is that you have to specify the number of clusters you want beforehand before K-means can produce those clusters; it doesn't know the number of clusters automatically, even though there are some approaches in the literature for estimating it. You specify the number of clusters you want, and then it tries its best to form those clusters for you. It's a simple clustering algorithm that tries to identify the cluster centers.

So in this case, the "means" in K-means refers to the centroids, which are the averages of each cluster, and "K" is the number of centroids to look for, which the user provides. The algorithm works by searching for centroids that minimize the distance between each centroid and the points assigned to it. While implementing K-means, you try different numbers of clusters and compare the results to decide what works best; this is often based on domain knowledge, so based on your knowledge you specify the optimum number of clusters.

Scikit-learn is a very broad package and contains a lot of machine learning algorithms. We have the cluster module in the Scikit-learn library and then in the cluster module, we have different clustering algorithms. To perform K-means in Scikit-learn, the natural step is to import the KMeans function, which we are going to use. For this example, we are going to use Scikit-learn’s random sampling technique. So we will import the necessary modules and then generate some random clusters.

A useful feature of scikit-learn is the ability to generate data. We’ll use the random data blob generator for testing purposes. The blob generator in scikit-learn can generate synthetic data sets, ideal for testing machine learning algorithms. We’ll generate blobs of points that will serve as a test dataset for our algorithm. On the left, I’ve started by importing necessary components. The first one is the KMeans function and the second one is a module for generating datasets.

This module includes a function called make_blobs for creating datasets with random values, which will be useful in this demonstration. The make_blobs function takes several parameters, which we can use to create our synthetic datasets.

When using a function you're not familiar with in Python, you can always ask for its documentation to understand how it works. You can access a function's documentation by appending a question mark to the end of the function name. So, if we need to learn more about make_blobs, we can add a question mark at the end, and Python will display the function's docstring, detailing its usage and the parameters that make_blobs accepts.
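For instance, a notebook cell along these lines shows the documentation (the question-mark form is Jupyter/Colab syntax; plain Python's built-in help() does much the same):

```python
from sklearn.datasets import make_blobs

# In a Jupyter/Colab cell, appending ? displays the docstring:
# make_blobs?

# The portable equivalent in plain Python:
help(make_blobs)
```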

The function “make_blobs” accepts a number of parameters, starting with the number of samples, which specifies the number of data points you want to generate. Next, it takes the number of features for each data point. These two parameters dictate the size and complexity of your dataset. Each parameter has a default value. For instance, the default number of samples is set to 100. This means that if you call the function without providing an argument for the number of samples, it will automatically generate 100 data points. By default, it also gives you two features or variables for each data point.

Another parameter is “centers” which denotes the number of centers or clusters you want to generate. Its default value is None. You can specify the number of clusters you wish to generate using this parameter. There’s also the “cluster_std” parameter which sets the standard deviation of the clusters. It defaults to 1.0. The standard deviation controls the spread of data points around the center of the cluster. A smaller standard deviation will lead to tightly packed clusters. On the other hand, a larger standard deviation results in clusters that are more spread out. The “shuffle” parameter, if set to True (which is the default), shuffles the samples.

Finally, there’s the “random_state” parameter. Since we’re dealing with random number generation, results can vary each time you run the code. To get consistent results and ensure reproducibility, you can set a specific random state. By setting the random state to a specific integer, you ensure that the generated dataset will be the same every time you run the code. This is especially useful in machine learning, where we want our experiments to be reproducible.

Now, what does the function return? It returns an “X” value which represents the generated data points. Specifically, it returns an n-dimensional array, where n is defined by the number of samples and features you specified. In addition to the “X” value, the function also returns a “Y” value. This is an integer array associated with each of the data points. These integers essentially label the cluster that each data point belongs to. They range from 0 to n-1, where n is the number of centers. For instance, if you specify three centers, you’ll have labels 0, 1, and 2. This helps us know which data points belong to which center. Data points labeled 0 belong to center 0, those labeled 1 belong to center 1, and those labeled 2 belong to center 2. The function documentation even provides an example of how to use it.

Now, returning to our code, we specified 400 samples. We set the cluster standard deviation to 0.75, requested four centers, and defined a specific random state. When we run this, we get a dataset with four distinct clusters. The clusters are labeled from 0 to 3. You can visually distinguish between these clusters in a plot. If you modify the cluster standard deviation to a smaller value, like 0.5, you’ll see that the clusters become tighter. In other words, the smaller the standard deviation, the closer together the data points within each cluster. This demonstrates the influence of the standard deviation parameter on the dispersion of clusters.
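A rough sketch of that call, using the parameter values mentioned above, is shown below; the exact random_state used in the session isn't specified, so the value here is an assumption:

```python
# Generate four synthetic clusters and plot them, coloured by their labels.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=400,      # number of data points
    n_features=2,       # two features, so we can plot them on the x/y axes
    centers=4,          # four clusters
    cluster_std=0.75,   # smaller values give tighter clusters
    random_state=1,     # fixed seed for reproducibility (exact value is an assumption)
)

plt.scatter(X[:, 0], X[:, 1], c=y, s=10)   # labels y run from 0 to 3
plt.show()
```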

So, now that we have our data…

We can proceed to the next step, which is applying clustering algorithms to this dataset. The goal is to identify the distinct clusters within the data. Remember, this dataset is randomly generated, so we’re using it to test and visualize the performance of our clustering algorithm. To visualize the dataset without any clustering, you can use a simple scatter plot. This will allow you to see the raw data before we apply any algorithms. This is the unprocessed dataset right here.

Our next goal is to apply the clustering algorithm to see if it can accurately group these data points into their respective clusters, thereby demonstrating its effectiveness. We’ll use K-means clustering in this demonstration. It’s important to remember that this is a simplified example and real-world data would likely be more complex. Our current dataset is clearly divided into clusters, each represented by a color in the scatter plot, but real-world data won’t be this neatly organized. Real-world datasets present much more complicated problems for clustering.

After this overview, let’s proceed with applying the K-means function to our dataset. To understand the function better, let’s delve into its documentation and study the parameters it takes. We can access the documentation using the question mark operator. Running this will give us the function’s docstring, essentially the built-in documentation. This helps us understand what the function does and the kind of parameters it expects.

The K-means function requires some parameters to work, the first one being “n_clusters”. As discussed earlier, K-means requires us to specify the number of clusters beforehand. If not specified, it defaults to eight clusters. This means if you don’t specify the number of clusters, the function will automatically generate eight clusters. So “n_clusters” is a crucial parameter that you need to set.

Next, let's look at the "init" parameter. The initial cluster centers for K-means are vital as they influence the final output: K-means clustering begins by initializing cluster centers and iteratively optimizing their positions, and the "init" parameter controls how these initial centers are determined. There are two methods for initializing the cluster centers, namely "random" and "k-means++". The "random" method is self-explanatory, but "k-means++" takes a more sophisticated approach: rather than picking cluster centers entirely at random, it chooses initial centers based on an empirical probability distribution of the data points, in a way that accelerates convergence. As a result, "k-means++" can make the algorithm converge faster and potentially provide better results.

The "n_init" parameter controls the number of times the algorithm will be run with different centroid seeds. This matters because the final clusters depend on the initial positions of the centroids, so running the algorithm from several randomly chosen initial centroids increases the likelihood of finding a solution close to the global optimum, at the cost of extra runs. The final result is the best output in terms of inertia. Inertia, in this context, is a way of measuring the error in the clustering; just like how we use mean squared error in regression analysis, inertia helps us evaluate the clustering.

Another parameter is “max_iter” which represents the maximum number of iterations for a single run of the algorithm. After initializing the centers, K-means iteratively assigns each data point to the nearest cluster. This process is repeated until the clusters stabilize, meaning the points no longer switch clusters. Thus, “max_iter” is essentially the number of iterations performed until stable clusters are found. “Max_iter” is set to 300 by default.

Another parameter to note is “random_state”. For simpler datasets with clearly defined clusters, setting a “random_state” isn’t necessary. However, with more complex datasets where cluster outcomes might differ, setting a “random_state” can be helpful. This is because it ensures the reproducibility of the clustering results.

The K-means function also has other parameters, but most of them can be left at their default settings. The primary parameter that you’d need to specify is the number of clusters.

So, when we invoke this function, we must specify the number of clusters before proceeding. Next, we create the K-means model, which will be applied to our data. In scikit-learn, you "fit" a model to your data, a process also known as training the model: you first initialize the model and then "fit" it to your data. After fitting, we use the model to identify the cluster each data point belongs to. Note, however, that this isn't technically "prediction" in the traditional sense; rather, we're categorizing each data point into the cluster where it best fits, based on its distance from the cluster centers. After fitting the model, K-means provides the center value for each cluster, known as the centroid.
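As a hedged sketch of this fit-and-assign step (regenerating data similar to the earlier blobs so the snippet stands on its own; parameter values mirror the discussion rather than the notebook's exact cell):

```python
# Fit K-means to synthetic blobs, then plot the points and the learned centroids.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.75, random_state=1)

# n_clusters must be chosen by us; the other parameters use sensible values
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=1)
labels = kmeans.fit_predict(X)            # fit the model, then assign each point to a cluster

centroids = kmeans.cluster_centers_        # the center (average) of each cluster
print("inertia (clustering error):", kmeans.inertia_)

plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", color="red", s=100)
plt.show()
```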

These centroids, which can be considered the "average" of each cluster, play a crucial role in the clustering process and should ideally sit at the center of the clusters. However, in reality, perfect clustering doesn't always occur. In this example, we've only used two variables, allowing for a simpler visualization of the clusters: one variable corresponds to the x-axis and the other to the y-axis. In machine learning, we often refer to these as "features" or "dimensions", so here we have two dimensions or two features. Real-world datasets often involve many more variables, which complicates visualization; the most we can visually represent is three dimensions, and once we exceed three, visualization becomes nearly impossible. However, it's common in machine learning to work with higher-dimensional data, and the clustering algorithm still works as intended in these higher-dimensional spaces. In other words, K-means isn't limited to only two or three dimensions; it can function with many more. One of the challenges with K-means, though, is that the number of clusters needs to be specified in advance.

In real-world applications, knowing the ideal number of clusters beforehand can be difficult. Whether the data is generated synthetically or collected from surveys, it's often challenging to determine the right number of clusters, so this requirement is a common limitation of K-means. K-means may also struggle to perform optimally when the clusters are hard to distinguish. Although our current example is simple, with clearly defined clusters, this may not be the case in real-world data; in scenarios where clusters aren't well separated, K-means can struggle to identify them correctly. Another issue is that K-means will always produce an answer, finding the required number of clusters even if the data isn't well-clustered. This can be problematic, as it may produce misleading results if the data is not suitable for clustering. In essence, it's the classic 'garbage in, garbage out' problem: you will always get a result, but the quality of that result depends on the input data. Another limitation is that K-means assumes the boundaries between clusters are linear, that is, the separation between clusters must be a straight line. If this isn't the case, K-means may not produce meaningful clusters.

For example, consider a dataset with a circular pattern of clusters. If the clusters are separated by non-linear boundaries, K-means may struggle: the algorithm draws a straight-line separation between clusters, which may not reflect the true structure of the data. These are the main limitations of K-means clustering. If your data doesn't follow these assumptions (linear boundaries, known cluster number), then K-means may not be the best choice, and there are alternative methods to use instead. We're going to discuss one such alternative, namely spectral clustering.

One advantage of K-means, though, is its simplicity and computational efficiency: it quickly calculates clusters based on the distance of data points to the initialized centroids. Given its speed and simplicity, K-means is often a good starting point for cluster analysis, similar to how linear regression is a common starting point for regression analysis. It has low computational requirements, making it suitable for large datasets; in situations where the dataset is extremely large, other algorithms might be computationally expensive, making K-means a better choice. But again, it depends on the specific requirements of your data.

We’ve discussed the limitations and advantages of K-means, but other approaches exist such as spectral clustering, DBSCAN, or agglomerative clustering. There are numerous clustering methods to choose from based on your data characteristics. Now, let’s focus on spectral clustering, a technique designed to overcome the linear boundary limitation of K-means. Spectral clustering doesn’t require linear boundaries between clusters. This technique, rooted in graph theory, can also be applied to non-graph data. Spectral clustering is a flexible and robust clustering method that can handle a variety of data structures. It operates by creating a similarity graph from the data and identifying clusters based on this graph. Spectral clustering treats clustering as a graph partitioning problem, focusing on nodes with small distances between them in a graph. While spectral clustering is rooted in graph theory, it doesn’t strictly adhere to it. Its most significant characteristic is its independence from the linear boundary constraint of K-means. Another relevant concept is the kernel trick, which plays a crucial role in spectral clustering. The kernel trick allows higher dimensional interactions within the data to be handled implicitly.

This becomes especially useful when dealing with large, high-dimensional datasets.
Handling these interactions explicitly could require significant computational resources. Instead, the kernel trick allows these high-dimensional interactions to be computed in a lower-dimensional space. Thus, we can work in a lower-dimensional space while effectively operating in higher dimensions. The kernel trick has been widely used in various machine-learning algorithms. For example, it has been extensively used in Support Vector Machines. A common scenario that benefits from this is when dealing with concentric circles of data points. K-means struggles to handle such data, but the kernel trick can help resolve this problem. It can effectively draw a boundary between these circles by mapping them into a higher-dimensional space. In the higher-dimensional space, these circles can be separated more effectively. In this transformed space, one of the circles appears to move away from the other. In effect, what appears as a circular arrangement in lower dimensions can appear as a straight line in higher dimensions. This ability to manipulate dimensional spaces is a core advantage of the kernel trick.

The kernel trick’s primary benefit is its ability to transform and divide data into higher dimensions, aiding in more efficient clustering. This principle is the foundation of spectral clustering.

As an example, consider a dataset containing circular clusters of data points.

Applying K-means to this would not provide satisfactory results, but spectral clustering is better suited for such scenarios. The example dataset consists of concentric circles, akin to a target. The ‘make_circles’ function from the Scikit-learn library is used to generate the dataset, with a total of 400 samples. A little noise is added, along with a random state seed so that results are consistent across multiple runs. In the spectral clustering implementation, we specify two clusters and set the affinity to ‘nearest_neighbors’. This differs from K-means, which assumes a roughly spherical cluster shape and doesn’t work well with complex cluster shapes.

Several parameters, like ‘n_components’ and ‘affinity’, are set in the spectral clustering function. These parameters have theoretical underpinnings, but for this example we won’t delve deeply into them. Running the spectral clustering algorithm yields the clustered dataset, and the results should demonstrate spectral clustering’s effectiveness over K-means for this specific dataset. Spectral clustering thus provides more flexibility when dealing with complex data structures. Again, some parameters are set to handle the clustering process more efficiently; these include the number of clusters (‘n_clusters’) and the method used to construct the affinity matrix (‘affinity’). Such settings are essentially tunable knobs to enhance the performance of the spectral clustering algorithm.

The spectral clustering algorithm proceeds by constructing the affinity matrix and assigning labels to the clusters. These labels represent different clusters in the higher-dimensional space created by the kernel trick. Essentially, spectral clustering treats clustering as a graph partitioning problem and solves it in that high-dimensional space, assigning the data points to clusters (or ‘communities’). There are several methods for assigning data points to clusters, and K-means is one of the commonly used ones. The result is a better clustering that can reveal more meaningful patterns in the data than traditional methods. Now, to evaluate the performance of spectral clustering, it’s necessary to have a comparison framework.
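A minimal sketch of the workflow just described; the exact noise level, factor, and random_state values below are illustrative assumptions rather than the instructor’s code:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, SpectralClustering

# 400 samples arranged as two concentric circles, with a little noise
# and a fixed random_state for reproducible results.
X, _ = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=42)

# K-means assumes roughly spherical clusters with linear boundaries,
# so it splits the circles incorrectly.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Spectral clustering with a nearest-neighbours affinity graph can
# separate the inner circle from the outer one.
spectral = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                              random_state=42)
spectral_labels = spectral.fit_predict(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap="viridis", s=10)
axes[0].set_title("K-means")
axes[1].scatter(X[:, 0], X[:, 1], c=spectral_labels, cmap="viridis", s=10)
axes[1].set_title("Spectral clustering")
plt.show()
```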

Comparisons could be made with other clustering methods, like K-means, visually or numerically.

Visual comparisons are based on graphical representation of clusters, while numerical comparisons involve the use of performance metrics. One such performance metric is the silhouette coefficient. This coefficient provides a measure of how similar a point is to its own cluster compared to other clusters. The value of the silhouette coefficient ranges from -1 to 1, where a high value indicates that the point is well-matched to its own cluster and poorly matched to neighboring clusters. In other words, a high silhouette coefficient suggests good clustering. This coefficient can be used to compare the performance of different clustering algorithms on the same dataset. To summarize, spectral clustering allows for more sophisticated and potentially more meaningful clustering of data, especially when the clusters in the data are not spherical or have complex structures. This clustering can be evaluated using methods like silhouette coefficients. Unsupervised learning, such as clustering, doesn’t require labeled data. It’s about finding patterns or structures in the data.
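As an illustration, assuming the X, kmeans_labels, and spectral_labels arrays from the earlier sketch, the silhouette coefficient can be computed with scikit-learn’s silhouette_score:

```python
from sklearn.metrics import silhouette_score

# Higher values indicate points that sit well inside their own cluster.
print("K-means silhouette:  ", silhouette_score(X, kmeans_labels))
print("Spectral silhouette: ", silhouette_score(X, spectral_labels))
```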

The primary goal here is to group similar data points together without any preconceived notions of what these groups should be. Unlike supervised learning, you don’t divide your data into training and testing sets. Instead, the entire dataset is used for both learning the clusters and evaluating the model.

One notable aspect of spectral clustering is that it can handle non-linear boundaries, something that is challenging for many other algorithms. However, due to the high computational cost, especially when dealing with larger datasets, spectral clustering can be slower than some other methods. Therefore, depending on the context and requirements of your problem, you might have to decide whether you prefer speed or the ability to accurately handle complex structures in your data.

Moreover, Scikit-learn, a popular machine learning library in Python, provides built-in functions for spectral clustering. It also offers functions to generate synthetic data which can be useful for testing algorithms.

So in summary, unsupervised learning and specifically clustering, including methods like spectral clustering, are powerful tools in machine learning for discovering hidden structures in data. The key is to understand the trade-offs involved and select the method that best fits your specific problem and computational resources.

If you are testing an algorithm, Scikit-learn also provides the ability to generate sample data to give you a starting point.

That essentially covers the topic of cluster analysis, which is a type of unsupervised learning. Next, we’ll move to another type of unsupervised learning, which is dimensionality reduction. Dimensionality Reduction.

At the start of the discussion on unsupervised learning, I mentioned that we’re not predicting anything. One task could be grouping data, and another could be reducing the number of features in your dataset. When your interest lies in decreasing the feature space, that’s where dimensionality reduction comes in. There can be several reasons for wanting to reduce the number of features. For example, having too many features can lead to increased complexity in your model. So, one way to mitigate overfitting, which happens when your model is too complex, is to reduce the feature space. By focusing on the most useful features, you can simplify your model without losing too much information. Oftentimes, the concept of reducing dimensionality plays a crucial role in machine learning.

Reducing the number of input features, or variables, is a key part of many unsupervised learning approaches. This isn’t about predicting outcomes but about simplifying the dataset we’re working with, for example by finding a smaller set of variables that captures the majority of the variation in the data. Dimensionality reduction can be a valuable preprocessing step before other operations such as classification or visualization. Earlier, I mentioned that it can be challenging to visualize data with a high number of dimensions: when you have many variables, it’s practically impossible to visualize them beyond three dimensions. So, one approach to help with visualizing data is to reduce the number of dimensions. Reducing dimensions isn’t about just picking a few variables and visualizing those. Instead, you want to condense the information contained in all the variables into two or three dimensions while preserving as much of the original information as possible; that is what dimensionality reduction means.

There are various ways to perform dimensionality reduction in Scikit-learn, one of the simplest of which is known as Principal Component Analysis, or PCA. This is a technique that transforms a dataset with many variables into a smaller set of new variables. These new variables, or ‘principal components’, are ordered by the amount of the original information they retain. In simple terms, PCA takes a high-dimensional dataset and reduces it to fewer dimensions while retaining as much information as possible, creating each new variable by combining the original variables.

It’s important to note that by default, PCA will create as many components as there were variables in the original data. However, the first component will retain more information than the second, and so on. This continues down to the last component, which retains the least information.

Another term you might come across when discussing PCA is ‘variance’. The components produced by PCA are ranked based on the amount of variance they explain in the original data. So, in essence, PCA transforms correlated variables into a set of uncorrelated ones.

To implement PCA in Scikit-learn, you’ll need to import NumPy, which is used for manipulating arrays and creating new ones, as well as Matplotlib’s pyplot for plotting the dataset. You’ll also need the decomposition module from Scikit-learn, which includes the PCA function.

We will use principal component analysis (PCA) on a specific dataset of handwritten digits, known as the MNIST dataset; the manifold submodule will be used for another type of dimensionality reduction that we will discuss later. The dataset contains images of the digits 0 to 9, and the objective of this analysis is to detect these digits automatically.

We could use classification for this, but we can also use unsupervised learning. In this context, each digit image is simply a series of pixel intensity values, which form an array of numbers. For example, an MNIST image is 28 by 28 pixels, meaning it has over 700 features per image, and it’s impossible to visualize such high-dimensional data directly. Our only option here is to use dimensionality reduction techniques.

So we’ll start by implementing PCA. Let’s look at the parameters of the PCA function from the decomposition module. The PCA function in Scikit-learn performs linear dimensionality reduction using Singular Value Decomposition (SVD) of the data. The data is projected onto a lower-dimensional space. Even though some of the terms might seem complex, for now, understand that PCA is a type of linear dimensionality reduction technique.

One of the parameters it takes is ‘n_components’, which specifies the number of components to keep. By default, it gives you the same number of components as your original features, but you can specify a different number. For example, if you specify ‘2’, it will return the first two components. These two components carry the most information from the entire dataset. If ‘n_components’ is not specified, it returns all the components. The rest of the parameters are optional and you can also specify a ‘random_state’ to reproduce results.

Once you have defined your PCA, you need to fit it on your data, and then you can use it to transform your dataset. This ‘fit’ and ‘transform’ process allows you to reduce the dimensionality of the data you’re working with to achieve more manageable and insightful datasets. When performing transformations on your dataset, you can conveniently use the ‘fit_transform’ function in a single step. This will execute both ‘fit’ and ‘transform’ and reduce your data to two dimensions. Specifying ‘n_components’ as two implies that we are reducing more than 700 variables to just two.

Now we are ready to visualize this two-dimensional data. In our plot, we assign different colors to each data point based on the associated digit. The black dots represent one digit, the blue dots represent another, and so on. However, this visualization isn’t perfect. It’s difficult to clearly see each digit separated, which is one of the limitations of this approach. PCA assumes that the data follows a linear structure, but if that’s not the case in reality, the results might not be accurate. This plot demonstrates this particular limitation of PCA. This is an example of the complications that can occur in a real-world dataset.
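A minimal sketch of this PCA step, assuming the digit images have already been loaded into a feature array X and their labels into y (the variable names are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn import decomposition

pca = decomposition.PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X)   # fit and transform in a single step

colors = [int(label) for label in y]   # digit labels used for colouring
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=colors, cmap="tab10", s=5)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.colorbar(label="digit")
plt.show()
```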

The next method, t-SNE (t-distributed Stochastic Neighbor Embedding, from the manifold submodule), aims to address this problem with PCA. It doesn’t rely solely on linear decomposition but adds a degree of nonlinearity, which gives it significant power. It may not be perfect, but it offers better visualization than Principal Component Analysis. In the resulting plot, you can see clear separations: for example, the digit ‘0’ is distinctly apart, while ‘9’ is somewhat close to ‘3’. Typically, this method is utilized as a preprocessing step. There is also another frequently considered, newer method referred to as UMAP. Though we won’t delve into it here, it also addresses certain limitations and, like t-SNE, is worth exploring.
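A minimal t-SNE sketch using scikit-learn’s manifold submodule, with the same X and y assumptions as the PCA sketch (t-SNE can be slow on very large datasets, so you may want to run it on a subset):

```python
import matplotlib.pyplot as plt
from sklearn import manifold

tsne = manifold.TSNE(n_components=2, random_state=0)
X_tsne = tsne.fit_transform(X)

colors = [int(label) for label in y]
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=colors, cmap="tab10", s=5)
plt.title("t-SNE projection of the digits data")
plt.show()
```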

The key takeaways here are that PCA is a linear dimensionality reduction technique best suited for tabular data, while t-SNE provides a more general approach. It doesn’t strictly adhere to linear transformations, adding a level of complexity.

These techniques can be considered preprocessing steps in many data analysis workflows. Moving on to our next topic: neural networks, a concept that forms the foundation of deep learning. You encounter deep learning in everyday life; an example is the renowned GPT model, a type of neural network. Neural networks are a broad topic, and here we can only get a taste of their capabilities. They’re a machine learning method inspired by the way the human brain functions. However, it’s important to clarify that neural networks don’t exactly replicate brain function. Contrary to some misconceptions, neural networks don’t operate as human brains do; there’s no evidence suggesting that human learning mirrors the optimization processes in these networks. But their design was indeed inspired by how the human brain processes information.

Often utilized in image recognition tasks, they are well-established machine learning techniques that have been around since the 1950s, evolving through several iterations. They’ve significantly advanced from their earlier versions, leading to the development of deep learning, a sophisticated form of neural network.

Their breakthrough came in the 2010s when a neural network model won a notable image recognition competition. This victory established neural networks as a powerful tool for machine learning, particularly effective with unstructured datasets.

Until now, we have been discussing techniques, such as linear regression, that work with structured data: data that can be stored in a table. By contrast, an image does not constitute a structured dataset; even though we can extract numerical data from it, it’s not structured in the conventional sense. Similarly, voice recordings are unstructured datasets.

For such datasets, neural networks can outperform traditional machine learning models, which might struggle to handle them effectively. At the heart of neural networks, we find units called perceptrons. Perceptrons serve as the building blocks of neural networks, analogous to single neurons in the brain.

Each perceptron, or neuron, receives multiple inputs and produces a single output. This is how a neuron is structured. When inputs are received, a computation is performed within the neuron, which can be thought of as linear regression, as it involves adjusting the weights and adding a bias term. So, we’re essentially conducting a linear equation within the neuron. This is the foundational mechanism.

However, instead of directly providing the output of this regression, it is further sent through a transformation called an activation function. This introduces a level of non-linearity to the process. Among the different activation functions, the Rectified Linear Unit (ReLU) is commonly used. It returns the maximum of zero and the input value. So, if the regression yields a negative number, the function simply outputs zero.

There are other activation functions as well, including the sigmoid function, which squashes the output between zero and one, ensuring that the output is within this range regardless of the input value. There is also the hyperbolic tangent function. In the early days of neural networks, this was frequently used as it maps the output between negative one and positive one.
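As a small illustration of the activation functions just mentioned, here is a minimal NumPy sketch (not taken from the lesson code):

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: returns the maximum of zero and the input
    return np.maximum(0, x)

def sigmoid(x):
    # Squashes any input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent: maps any input into the range (-1, 1)
    return np.tanh(x)

print(relu(-2.0), sigmoid(0.0), tanh(1.0))  # 0.0 0.5 0.761...
```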

The choice of activation function depends on the specific problem. Typically, you would experiment with different activation functions, evaluating their performance on your dataset, with ReLU being a common starting point.

Moving on, we will now delve into coding a perceptron.

The perceptron function takes as inputs the data, the weights, and a threshold.

Firstly, we ensure that the length of the input matches the length of the weights. Otherwise, an exception is thrown. We then perform a weighted sum of the inputs, similar to linear regression. This sum is then passed through an activation function. In this example, we use a binary threshold activation function. If the sum is less than the threshold, the function outputs zero, otherwise, it outputs one. However, there are numerous other types of activation functions that can be used, and the choice of which to use is often problem-specific and requires testing. Once we have computed the activation, we have the output of our perceptron which can then be used as input to another neuron in the network, and so on. In real-life applications, we would typically have more complex activation functions. However, for simplicity, we are sticking with a binary threshold here. So, in summary, this function processes the inputs, produces a weighted sum, applies an activation function, and outputs the result.
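A sketch of the perceptron just described: it checks that the inputs and weights have the same length, computes a weighted sum, and applies a binary threshold activation. The weights, threshold, and data values below are illustrative, not the instructor’s exact numbers.

```python
import numpy as np

def perceptron(inputs, weights, threshold):
    # the number of inputs must match the number of weights
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # weighted sum of the inputs, similar to a linear regression
    total = np.dot(inputs, weights)
    # binary threshold activation: 0 below the threshold, 1 otherwise
    return 0 if total < threshold else 1

# A tiny dataset with two variables and four observations.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
weights = (0.5, 0.5)
for observation in data:
    print(observation, "->", perceptron(observation, weights, threshold=1.0))
```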

Here, we have a perceptron that receives inputs, applies weights to them, and outputs a result; its output can, in turn, feed into another layer of perceptrons, and so on. The example input is akin to a dataset with two variables and four observations, and each observation is passed through the perceptron. For example, the first observation gives us an output of 0, because the weighted sum of its inputs is less than the threshold; it’s essentially a binary operation, which is why the perceptron outputs 0. This is what happens inside the perceptron. Essentially, a neural network can be thought of as a series of linear regressions with added activation functions, making the process nonlinear.

So, let’s move on to multilayer perceptrons. A single perceptron can only classify linearly separable data, meaning we can draw a straight line to separate different classes of data, much as a linear regression would. A multilayer perceptron, however, can handle more complex data that is not linearly separable. Instead of just having one layer of perceptrons, we can have multiple layers, starting with the input layer. In neural network terminology, we would say such a network has three layers: the input layer, the hidden layer, and the output layer.

The input layer simply receives the input variables, or the independent variables, if you will. Then, for every input, it’s connected to the hidden layer where computations are made, which includes linear regression computation along with activation functions. The result of these computations is then sent to the output layer.

The output layer holds the result of the computations made by the neural network. The number of nodes in the output layer should be equal to the number of classes you’re predicting for. For example, in the MNIST dataset where we classify digits from 0 to 9, the output layer should have 10 neurons, each representing a digit. If you’re predicting a numerical value, such as the price of an Airbnb, you would typically have a single neuron in the output layer that holds the predicted value.

The hidden layer controls the complexity of your network: the more nodes it has, the more complex the network is and the more difficult it is to train. The optimal number of nodes is problem-dependent, and you may need to experiment before you find a configuration that works well.

When your network only has a single hidden layer, it’s known as a shallow neural network. If you have multiple hidden layers, then it becomes a deep neural network, and this is where the term “deep learning” comes from. The depth of your network refers to the number of hidden layers it contains.

In a neural network, every interneural connection has a weight associated with it. When we say we’re training a neural network, what we’re really doing is trying to find the optimal weights such that when we input data, the network produces outputs that are as close as possible to the real values. However, the specifics of how this training process works is a bit beyond the scope of this tutorial.

To quickly cover some important concepts, the training process typically starts by initializing the weights of your neural network randomly. Then, the network tries to make predictions, and you assess how close those predictions are to the true values. This process is sometimes referred to as forward propagation.

However, not all datasets can be modeled this way. For instance, in some cases, the sequence of the data matters, such as with time-series data like stock prices, where today’s price may depend on yesterday’s price. For such cases, simply forward propagating information is not sufficient. This type of model, where past information is taken into account, is sometimes referred to as a recurrent model.

In recurrent neural networks, there’s some degree of looping or feedback involved, and the information flow is not purely forward – it can also go backward. Recurrent neural networks have been used to successfully model complex problems in various domains, including natural language processing, where they have shown strong performance.

In summary, the architecture and training process of a neural network can vary greatly depending on the specific problem at hand. Transformers are also a significant advance in neural network architectures but beyond the scope of this tutorial.

To train a basic multilayer perceptron, scikit-learn offers a simple model to use, but note that for more optimal or complex neural network training, scikit-learn might not be the best package. Other frameworks, such as TensorFlow or PyTorch, offer more capabilities. Whether to use TensorFlow or PyTorch depends on your preference and needs. If you’re just starting, Keras, a high-level API for TensorFlow, can be an easier entry point. However, once you grasp the fundamental concepts, either framework can be effective.

Moving on to the dataset we’re going to use: the MNIST database, which is conveniently available through scikit-learn. This database consists of 70,000 images of handwritten digits, each of which is a 28×28 pixel image. This means each image has 28×28, or 784, pixels, which can be flattened into a single vector. Each of these pixels becomes a variable in the dataset, turning each image into a point in a 784-dimensional space. So, in this dataset, we have 70,000 rows (each representing an image), 784 features (the pixels of each image), and an outcome variable, which is the digit we’re trying to predict.
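A minimal sketch of loading this dataset; the lesson may load it differently, so the fetch_openml call and the ‘mnist_784’ name here are assumptions:

```python
from sklearn.datasets import fetch_openml

# Download the 70,000 MNIST images as flat arrays of 784 pixel features
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
print(X.shape)  # (70000, 784): one row per image, one column per pixel
print(y.shape)  # (70000,): the digit labels we want to predict
```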

In machine learning, preprocessing of data is often an essential step. Here, we standardize the data. This is crucial in machine learning, especially when the variables are measured on different scales: variables with higher magnitudes can appear more important to a model, even if they’re not. To avoid this bias, we level the playing field by putting all variables on the same scale. One common approach is standardization, which adjusts variables to have a mean of zero and a standard deviation of one, making them more comparable for machine learning algorithms. In this case, we use a slightly different form of scaling. We know that pixel values in a grayscale image range from 0 to 255, so instead of using StandardScaler-style standardization (mean of zero, standard deviation of one), we can use min-max scaling, which scales all the pixel values to lie between 0 and 1. This is sometimes referred to as normalizing the image data. Since the highest pixel value is 255, dividing all pixel values by 255 scales them to be between 0 and 1.
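In code, this min-max style scaling amounts to a single division (assuming X holds the raw pixel values):

```python
# Pixel values range from 0 to 255, so dividing by 255 rescales them
# to lie between 0 and 1.
X = X / 255.0
```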

After applying this min-max scaling, we make use of the neural network module from the scikit-learn package. Here, we construct the neural network model, specifying that each hidden layer should have 50 nodes. It’s important to remember that 50 is not necessarily the optimal number of nodes; it depends on the specific problem, and you might need to experiment with different numbers to find a good configuration. Because the hidden layer size is given as a sequence, we could also add more hidden layers if needed.

The maximum iteration parameter is set to 15, indicating that we’ll limit the training process to 15 epochs. Epochs are cycles through the entire training dataset in which the model learns by adjusting its weights, and the max_iter parameter determines how many of these the model goes through. Here, with max_iter set to 15, the model will train for at most 15 epochs, stopping early if it converges (i.e., if the error becomes small enough) before then.

Finally, we set the verbose parameter to True, so the training process outputs updates for each epoch, which is helpful for monitoring progress. If you’d prefer not to see these updates, you can set verbose to False and the model will train silently, returning only the final result. We also set the random state to ensure that the results of our model are reproducible: the initial weights of the neural network are randomly assigned, so each rerun would otherwise initialize them differently. Setting a random state guarantees consistent initializations and thus the same result on each run. After setting up the model, we then explore the dataset to understand its structure.
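A sketch of this model definition, assuming scikit-learn’s MLPClassifier (the random_state value here is illustrative):

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(50,),  # one hidden layer with 50 nodes
    max_iter=15,               # train for at most 15 epochs
    verbose=True,              # print an update after each epoch
    random_state=1,            # reproducible weight initialization
)
```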

Next, we split the dataset into a training set and a testing set.

The training set is used to teach the model how to recognize handwritten digits, while the testing set is used to evaluate the model’s performance; before deploying a model in a production setting, it’s critical to evaluate it on data it hasn’t seen. Here, out of 70,000 images, we use 63,000 for training and the remaining 7,000 as the testing set. Finally, we evaluate the model’s accuracy.
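A sketch of this 90/10 split using scikit-learn’s train_test_split (the random_state is illustrative):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1
)
print(X_train.shape, X_test.shape)  # roughly (63000, 784) and (7000, 784)
```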

One of the benefits of scikit-learn is that it makes this evaluation straightforward. Scikit-learn has a built-in method to compute the accuracy of a model directly, making the evaluation process simple. Depending on the type of problem, whether it’s classification or regression, it gives you the appropriate metric, such as accuracy or R-squared. Here, for our classification problem, it computes the accuracy of the model.

Next, the model begins the training process. The training process can take some time as the model is trying to get its parameters right on the training set. We’re interested in obtaining the model’s accuracy on both the training set and the testing set. The reason for this is as follows. If we train too much on the training set and then test on the same data, the accuracy measure might be misleading.

If the accuracy on the training and testing sets is far apart, it indicates that we might be overfitting the model – meaning the model is performing well on the training set but not on unseen data. So, assessing the model’s accuracy on both sets gives us a better understanding of its performance.
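Continuing the sketch, fitting the model and comparing the two accuracies looks like this:

```python
# Train on the training images, then compare accuracy on the training
# and testing sets to check for overfitting.
mlp.fit(X_train, y_train)
print("Training accuracy:", mlp.score(X_train, y_train))
print("Testing accuracy: ", mlp.score(X_test, y_test))
```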

Finally, we use the trained model (now optimized with a set of weights) to make predictions. In other words, we use the trained weights to predict the labels of unseen images. We want to check how well these optimized weights can be used to classify unseen data. And this forms the basis for transfer learning in neural networks – using a pre-trained model to understand and make predictions on new data.

Given that neural networks take a significant amount of time to train, once training is complete, it’s beneficial to save the model weights. These weights can then be used by anyone else who wants to make predictions with the model, instead of training it from scratch. This means they can load the weights of the trained model and make their own predictions, which saves a lot of time and computational resources.

To make a prediction, the model takes an input, applies the operations specified during the training process, and outputs a prediction. This is essentially how the prediction process works. Scikit-learn also has a built-in function for making predictions.
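For example, continuing the sketch above:

```python
# Use the trained weights to predict labels for unseen test images.
predictions = mlp.predict(X_test)
print(predictions[:10])  # predicted digits for the first ten test images
```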

Once you have your model and predictions, you can evaluate how well the model is performing. This is typically done by comparing the model’s predictions to the actual labels in the test set.

If the model’s accuracy is high, it means it has been able to adjust its weights effectively to produce good outputs. It’s not unusual for the accuracy to decrease and then increase again during training. In such cases, you can save the model at the point where it achieved the best results. In other words, you can create a checkpoint at the point of highest accuracy.
This means you’ll have a saved model that performs at its best.

In this case, we’ve set up a neural network with 50 hidden neurons, but we’re not certain if this is optimal. So we may want to tune these parameters, either with a random approach or an optimization approach. One common technique for this is called Grid Search, where different parameter combinations are tested systematically. In a Grid Search, you specify different parameter values for your model to try. The model makes predictions using each set of parameters and the one that gives the best accuracy is used. This is how Grid Search works. However, this method can be computationally expensive. Moreover, there’s a chance that you didn’t specify the best hyperparameters, so even though you might get good results, there’s no guarantee. As long as the hyperparameters are manually specified, there’s a risk that they might not be optimal. To mitigate this, other approaches exist that involve different ways of tuning hyperparameters. These can include more advanced techniques that automatically adjust hyperparameters based on performance. When measuring model performance, accuracy is a common metric used.
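A sketch of a grid search over the hidden-layer size with scikit-learn’s GridSearchCV; the candidate values and the cv=3 setting below are illustrative assumptions, not the instructor’s configuration:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Try a few candidate hidden-layer sizes and keep the best one.
param_grid = {"hidden_layer_sizes": [(25,), (50,), (100,)]}
search = GridSearchCV(
    MLPClassifier(max_iter=15, random_state=1),
    param_grid,
    cv=3,                # 3-fold cross-validation for each candidate
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```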

Here, the training set has 99% accuracy and the test set has 97% accuracy, which is quite good. This suggests that the model is not overfitting to the training data, because the accuracies are close to each other. Overfitting is when a model performs very well on the training data but poorly on the test data, and it’s a common problem in machine learning. If the model was overfitting, we would expect a high accuracy on the training set and a significantly lower one on the test set. The way accuracy is calculated is by dividing the number of correct predictions by the total number of predictions made. This gives a percentage that represents how often the model makes correct predictions.

There is a concept in machine learning called the confusion matrix, which we’ll discuss next. A confusion matrix is a way to visualize the performance of your model by comparing its predictions to the ground truth. It is a grid where each row represents the instances of an actual class and each column represents the instances of a predicted class.

For instance, if you are working on a multi-class problem with 10 classes (0 to 9), the confusion matrix would be a 10×10 grid. The intersection of a row and a column indicates how often an instance of the actual class (represented by the row) was predicted as the class represented by the column. So the diagonal of the matrix (from top left to bottom right) represents the correct predictions.

In the case of the MNIST dataset (a dataset of handwritten digits), the actual labels are the digits from 0 to 9, and so are the predicted labels. The cell at the intersection of an actual-label row and a predicted-label column holds the number of predictions for that combination. For instance, if the cell in the row for actual label 2 and the column for predicted label 2 contains the value 2, the model made two correct predictions for the digit “2”. A non-zero value in any off-diagonal cell indicates a misclassification: for example, a “1” in the row for actual label 2 and the column for predicted label 3 means the model incorrectly predicted the digit “3” for one instance of the digit “2”. In this way, a confusion matrix gives you a detailed view of how well your model is performing for each class. It can reveal if your model is particularly weak at predicting a certain class, or if it consistently misclassifies one class as another; for example, if your model consistently misclassified traffic signs as buildings, this would be evident in the confusion matrix.
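Continuing the earlier sketch, the confusion matrix can be computed directly from the true test labels and the model’s predictions:

```python
from sklearn.metrics import confusion_matrix

matrix = confusion_matrix(y_test, predictions)
print(matrix)  # 10x10 grid: rows are actual digits, columns are predicted digits
```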

A confusion matrix offers more granularity than simply measuring overall accuracy, as it displays how predictions are distributed across all classes. This allows us to identify if our model is particularly strong or weak in predicting certain classes. Accuracy can sometimes be misleading, especially in cases where our data is imbalanced. Beyond just accuracy, we can calculate other metrics from the confusion matrix such as precision and recall. Precision asks the question: when our model predicts a class, how often is that prediction correct? Recall, on the other hand, asks: out of the actual instances of a class, how many did the model correctly identify?
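These metrics are also available in scikit-learn; a small sketch, reusing the predictions from above (average=None returns one value per digit class):

```python
from sklearn.metrics import precision_score, recall_score

print(precision_score(y_test, predictions, average=None))
print(recall_score(y_test, predictions, average=None))
```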

For example, in a model that predicts whether someone is likely to go to jail, precision would represent how often those predicted to go to jail actually end up going, and recall would represent how many people who actually went to jail were correctly predicted by the model. In such cases, you may not just care about accuracy, but also about precision. This is because a false positive (wrongly predicting that someone will go to jail) can have serious consequences. Thus, you would want your model to have high precision.

When tuning hyperparameters, we don’t evaluate the model on the test set right away. The test set is only used for the final evaluation, once we’re done with fitting and tuning the model. This helps prevent overfitting to the test set.

While developing a model, we commonly split our data into two sets: training and testing. However, there’s also a third set called the validation set. Often, we carve out a portion of our training set for validation, perhaps around 20%. We train our model on the remaining 80% and use the validation set for hyperparameter tuning and to get a sense of the model’s generalization performance during training. This way, we can make adjustments to our model and its parameters without touching the test set until the very end.

However, by setting aside a portion of our data for validation, we reduce the amount of data available for training. This is where cross-validation comes in handy. In cross-validation, we divide our data into several partitions, or “folds”. We train our model on all but one of these folds, and use the left-out fold for validation. In k-fold cross-validation, we repeat this process k times, each time using a different fold for validation.

For example, in 4-fold cross-validation, we might divide our data into four parts, training on three and validating on one. We then repeat this process four times, each time using a different part for validation. In the second iteration, we might train on parts 1, 2, and 4, and validate on part 3. And so on. This way, every example in our data gets to be part of the validation set exactly once.

This process is known as cross-validation, and the number of folds (in this case, 4) is a parameter we can tune. So in practice, you don’t just pick a validation set and stick with it – you use cross-validation to ensure every sample is used for validation at some point. This method gives a more robust measure of the model’s performance, as it reduces the variance associated with the choice of a specific train/test split.
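A sketch of 4-fold cross-validation with scikit-learn’s cross_val_score, reusing the model and training data from earlier (this can be slow for a dataset as large as MNIST):

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(mlp, X_train, y_train, cv=4, scoring="accuracy")
print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the four folds
```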

Cross-validation is particularly useful in cases of limited data, as it maximizes both the training and validation data. After this process, the model can then be finally evaluated on the untouched test set to estimate how it will perform on unseen data. The concept I’ve explained so far pertains to a basic neural network with a single hidden layer. But what if we want to model more complex relationships? In that case, we might turn to deep learning. In a deep learning model, we have more than one hidden layer. Deep learning has proven very effective for a wide range of problems, including image and speech recognition, natural language processing, and more. Deep learning involves the use of complex architectures, often involving specific types of neural networks. For instance, convolutional neural networks (CNNs) have been a key development in this area, particularly for image-based tasks.

CNNs are often trained on Graphics Processing Units (GPUs) because they can perform the heavy matrix computations needed for deep learning far more efficiently than regular CPUs. As the field advances, new types of networks and architectures continue to be developed.

When we talk about the computational operations involved in deep learning, we often use the term “tensor”. A tensor is like a generalization of matrices that can have more than two dimensions. Tensorflow, a popular library for deep learning developed by Google, gets its name from this concept. The name suggests that the library is designed to handle these tensor operations efficiently.

But TensorFlow isn’t the only library out there for deep learning. PyTorch, developed by Facebook’s AI Research lab, is another popular choice that has a more dynamic computation graph compared to TensorFlow.

These are some of the key tools and concepts in deep learning. The choice of tools can depend on your specific use case and personal preference. Deep learning represents the state of the art in many areas of machine learning, but it’s not a panacea. There are simpler models that might be more appropriate for certain tasks, especially when interpretability is a key requirement.

In conclusion, deep learning is a powerful tool, but it’s only one of the many tools in a data scientist’s toolbox. Keras is another popular library for deep learning that acts as an interface to Tensorflow, making it more user-friendly. PyTorch, as mentioned earlier, is an open-source machine learning library developed by Facebook’s AI Research lab.

Theano is another library, though its use has declined in recent years with the rise of Tensorflow and PyTorch. Cloud APIs have also become popular, with major tech companies like Google, Microsoft, and Amazon providing cloud-based deep learning services. These platforms allow users to upload data, such as images, and obtain predictions without having to train their own models.

Many of these services use proprietary neural networks trained on vast datasets. There are also models that have been pre-trained on large public datasets, which can be used for tasks such as image recognition. You can feed your own data into these models to generate predictions.

The key concept behind all of this is the neural network, and more specifically, the perceptron, which is the building block of these networks.

A perceptron takes multiple inputs, each multiplied by a weight. It then sums these weighted inputs and applies an activation function. This is similar to performing a linear regression at each node, and then applying a nonlinear function. A single perceptron can solve simple problems that are linearly separable, where a line can be drawn to separate the classes.

However, for more complex problems where the classes cannot be separated by a single line, we need a network of perceptrons, also known as a multilayer perceptron or a neural network. This allows us to model more complex, nonlinear relationships in our data.

In conclusion, whether you’re using a single perceptron or a deep neural network, the key idea is the same: learn from the data by adjusting the weights, and make predictions based on the learned model. Recall that problems which aren’t linearly separable can’t be solved by a single perceptron, because its decision boundary is a straight line.

The process begins by initializing the weights randomly and then making a prediction. This prediction is compared to the actual output, and the weights are adjusted based on the difference, a process called backpropagation. This is the core of the learning process in a neural network.

Training a neural network, like any other supervised learning model, requires a training dataset to provide examples for learning. To evaluate the performance of our model, we use separate training and testing sets. After training, we can use the testing set to assess how well the model generalizes to unseen data.

One common practice is to use cross-validation, which allows for the entire dataset to be used for both training and testing by partitioning it into different subsets. We then train the model multiple times, each time using a different subset as the testing set. This process can help optimize the model’s performance.

Another method is to use mini-batches of data for training, rather than the entire dataset at once. This approach, often called mini-batch training, can speed up the learning process.

Deep learning and large neural networks are powerful techniques for machine learning, but beyond the basic multilayer perceptron we used earlier, they’re not well supported by scikit-learn. Instead, you’ll need to use other libraries, like TensorFlow or PyTorch, to implement these models. These libraries provide more advanced features and are specifically designed for neural networks and deep learning tasks.

Cloud platforms, like Google Cloud, also offer machine learning services and can handle these more complex models. These platforms also allow for the use of powerful hardware, like GPUs, which can significantly speed up the training process.

Overall, neural networks and deep learning offer advanced capabilities for solving complex machine-learning tasks and are widely used in the industry today. Other platforms like Microsoft Azure and Amazon also offer similar cloud-based machine learning services. These platforms provide tools to train and predict models using large amounts of data.

But as we reach the end of this tutorial, we must address the ethical considerations in machine learning.

Machine learning is a powerful tool, but it must be applied responsibly, with consideration for its ethical implications.

From mismanagement of data to unfair impacts on different aspects of daily life, there are numerous ethical issues that need to be considered. Autonomous vehicles, for instance, bring up numerous ethical questions. Other ethical concerns arise from the potential misuse of AI in areas such as personalized advertising, political manipulation, or military applications. There’s also the issue of bias in machine learning models. These models learn from data, and if the data reflects biases, the model will too. For example, if a model is trained on data where people prefer Android over Apple, the model will reflect that preference, regardless of the objective merits of either platform. These biases can lead to unfair or discriminatory outcomes, which is a major ethical concern in machine learning. Another concern is the use of machine learning in areas like predictive policing or military applications, where incorrect predictions could have serious consequences. So, while machine learning offers many benefits, we need to carefully consider these ethical implications when deploying these technologies.

When applying machine learning, it often involves complexities or decisions that are not immediately clear. The training data used in production environments is of crucial importance. Let me illustrate with an example. Suppose we are gauging people’s preferences between Android and Apple phones. If the sample population is predominantly Android users, that’s what the machine learning algorithm will learn and adapt to. Thus, when deployed into production, it will naturally show a bias towards Android users. These are the issues we should consider during the training phase.

Ultimately, good data is more valuable than a sophisticated model.

Explaining the decisions made by machine learning systems can be challenging, as these decisions can be intricate and hard to interpret. In the early days of machine learning, even the experts struggled to understand why some models worked exceptionally well, and this lack of transparency in the decision-making process is the subject of ongoing work. In places like China, efforts are being made to develop methods that provide clearer explanations for machine learning predictions. Some of these strategies, such as SHAP-style explanations, form a field called interpretable machine learning. While there are ongoing efforts in this area, it remains a challenging task.

Another issue to consider is accuracy. No machine learning system guarantees 100% accuracy; indeed, striving for very high accuracy can sometimes lead to model overfitting. Evaluating a model often requires a balance between different measures, such as precision and recall, and aiming for high accuracy might conflict with the goal of achieving high precision or recall. This becomes a matter of preference and depends on the specific requirements of the project. Therefore, it’s crucial to consider the implications of inaccurate predictions and decisions.