Machine learning has revolutionized the way we approach data analysis, and Scikit-learn has emerged as one of the most popular and versatile libraries for implementing machine learning algorithms. In this comprehensive guide, we’ll explore Scikit-learn and its many features, including how to use it for classification, regression, clustering, and more. Join VinLab as we begin the journey to mastering machine learning with this library.
What is Scikit Learn?
Scikit-learn, also known as Sklearn, is a popular open-source machine learning library for Python. It provides a wide range of efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction via a consistent interface. Scikit-learn is built on top of NumPy, SciPy, and Matplotlib, and is designed to be easy to use and accessible to both beginners and experts.
The library includes a number of popular algorithms, such as support vector machines, random forests, k-nearest neighbors, and gradient boosting, among others. It also provides tools for preprocessing data, feature selection, and model evaluation, as well as utilities for working with text data and image data.
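The "consistent interface" mentioned above can be illustrated with a minimal sketch. It assumes the built-in Iris dataset (introduced in the next section) and picks two classifiers purely for illustration; the `max_iter=1000` setting and the `scores` dictionary are choices made for this example, not part of the original tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Two very different algorithms, driven through the same fit/predict/score calls.
models = [KNeighborsClassifier(), LogisticRegression(max_iter=1000)]
scores = {}
for model in models:
    name = type(model).__name__
    model.fit(X, y)                              # learn from the data
    first_prediction = model.predict(X[:1])[0]   # predict the class of one sample
    scores[name] = model.score(X, y)             # mean accuracy on the given data
    print(name, first_prediction, round(scores[name], 3))
```

Because every estimator exposes the same methods, swapping one algorithm for another usually means changing a single line.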
Understanding the Iris Dataset
The Iris dataset is a classic and widely used dataset in machine learning. Introduced by the British statistician and biologist Ronald Fisher in 1936, the dataset contains measurements of the physical characteristics of 3 different species of iris flowers: Iris Setosa, Iris Versicolor, and Iris Virginica. The data consists of 150 samples (50 Iris Setosa, 50 Iris Virginica, and 50 Iris Versicolor) with 4 features measured from each sample: sepal length, sepal width, petal length, and petal width.
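The sample and feature counts described above can be checked directly. This is a small sketch using Scikit-learn's bundled copy of the dataset; the variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Rows are samples, columns are features.
n_samples, n_features = iris.data.shape
print(n_samples, n_features)

# Count how many samples belong to each species (targets are integer codes).
class_counts = np.bincount(iris.target)
print({str(name): int(count) for name, count in zip(iris.target_names, class_counts)})
```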
There are a number of reasons why the Iris Dataset is a suitable choice for this tutorial.
- It ships with Scikit-learn, so you don’t have to find and download it from an external source.
- Scikit-learn estimators expect numeric input. The built-in Iris dataset is already stored as numeric arrays, so it can be used for modeling and visualization without extra preprocessing.
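For datasets that are not already numeric, text labels have to be converted to numbers first. The sketch below shows one way to do this with `LabelEncoder`; the `species` list is a made-up example, not data from the tutorial.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, as they might appear in a raw CSV file.
species = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)  # classes are sorted, then numbered from 0

print(codes)
print(encoder.classes_)
print(encoder.inverse_transform(codes))  # map the codes back to the original names
```

The built-in Iris dataset already stores its target in this integer-coded form, which is one reason it needs no extra preparation.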
Getting Started with Scikit-Learn
In this section, we will cover the foundation of Scikit Learn, including installation, import, and how to train a basic machine learning model using the famous Iris Dataset.
Step 1: Installing Sklearn
The only prerequisites that you need for installing Sklearn are:
- Python: Download the latest version of Python from the official website and follow the instructions
- Pip: GeeksforGeeks has a simple but informative guide on installing Pip (see guide for Windows, macOS, and Linux)
After having both prerequisites installed on your computer, you can immediately install Sklearn by running the simple command:
pip install scikit-learn
Once the installation is complete, pip prints a confirmation message that starts with "Successfully installed" and lists the installed packages.
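To double-check the installation from Python itself, a quick sketch is to import the package and print its version. Note that the import name is `sklearn`, even though the package is installed as `scikit-learn`.

```python
import sklearn

# If this import succeeds, the package is installed in the active environment.
version = sklearn.__version__
print("scikit-learn version:", version)
```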
Step 2: Import
To use Sklearn in your Python code, add an import statement such as import sklearn. Note that the package is imported under the name sklearn, not scikit-learn.
In practice, you usually import only the specific modules or classes you need, using a statement of the form from sklearn.<module> import <name>. For example:
from sklearn.linear_model import LogisticRegression
Step 3: Load a dataset
Since the Iris dataset is already pre-built, you can directly import it using 2 simple lines of code:
from sklearn import datasets
iris = datasets.load_iris()
The data array (iris.data) comprises 4 numerical columns which represent 4 features of an Iris flower: sepal length, sepal width, petal length, and petal width. The target (iris.target) is stored separately as an array of integer codes (0, 1, and 2), and iris.target_names maps each code to the name of the flower species.
# Print the feature names
print(iris.feature_names)
# Print the target names
print(iris.target_names)
Now we assign the features and the target to separate variables:
X = iris.data
y = iris.target
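A quick look at these two variables helps confirm what they contain. The sketch below repeats the loading step so that it runs on its own; `first_labels` is a helper name introduced only for this example.

```python
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# X holds one row per flower and one column per feature; y holds one label per flower.
print(X.shape, y.shape)
print(X[:3])

# The labels are integer codes; target_names translates them to species names.
first_labels = [str(iris.target_names[code]) for code in y[:3]]
print(y[:3], first_labels)
```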
Step 4: Visualizing the Dataset
We might not see any obvious relationships between data points if we only look at the raw numeric values. One way to observe trends and correlations is to visualize the dataset by plotting features on a graph. We often perform this task using Matplotlib’s pyplot module (matplotlib.pyplot).
Suppose we want to know if there exists a correlation between the sepal lengths and sepal widths of Iris flowers:
# import library
import matplotlib.pyplot as plt
# separate the features (after transposing, each row holds one feature)
features = iris.data.T
sepal_length = features[0]
sepal_width = features[1]
petal_length = features[2]
petal_width = features[3]
# setting labels for the plot figure
sepal_length_label = iris.feature_names[0]
sepal_width_label = iris.feature_names[1]
petal_length_label = iris.feature_names[2]
petal_width_label = iris.feature_names[3]
# here we choose a scatter graph to visualize sepal width and length
plt.scatter(sepal_width, sepal_length, c=iris.target)
plt.xlabel(sepal_width_label)
plt.ylabel(sepal_length_label)
# displays the graph
plt.show()
The code block above produces a scatter plot in which each point is colored by species. In the plot, the different species of Iris flowers fall into almost distinct clusters. Two of the species overlap slightly, which shows why we need a good machine learning model to classify these data points into their true categories. We will see how this can be accomplished in the following steps.
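The correlation question raised earlier can also be answered numerically. The following sketch computes the Pearson correlation coefficient between sepal length and sepal width with NumPy, first over all samples and then per species; the variable names `overall_r` and `per_species_r` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
sepal_length = iris.data[:, 0]
sepal_width = iris.data[:, 1]

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
overall_r = np.corrcoef(sepal_length, sepal_width)[0, 1]
print(f"all species: r = {overall_r:.2f}")

# Repeat the calculation separately for each species.
per_species_r = {}
for class_index, name in enumerate(iris.target_names):
    mask = iris.target == class_index
    r = np.corrcoef(sepal_length[mask], sepal_width[mask])[0, 1]
    per_species_r[str(name)] = r
    print(f"{name}: r = {r:.2f}")
```

The overall coefficient and the per-species coefficients can differ noticeably, which is one reason coloring the scatter plot by species is informative.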
Step 5: Splitting the Dataset
Before getting hands-on with the actual machine learning model, it is important to split the dataset into a train set and a test set. This is a standard practice in machine learning to evaluate the performance of a model on new, unseen data.
The training set is used to train the model and adjust its parameters to minimize the error or loss. The test set is then used to evaluate the model’s performance on unseen data.
By using a separate test set, we can get an unbiased estimate of how well the model is likely to perform on new data. If we don’t use a test set and simply evaluate the model on the same data it was trained on, the model may perform well on the training data but may not generalize well to new data.
We split the dataset by executing the following code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=0)
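To see what the split produced, it helps to print the shapes of the resulting arrays. The sketch below repeats the split so that it runs on its own, then shows a variant with an explicit `test_size` and `stratify` argument. The variant is an optional illustration rather than part of the tutorial's pipeline, and the `Xs_*`/`ys_*` names are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Default behavior: 25% of the samples are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)
print(X_train.shape, X_test.shape)

# Variant: hold out 20% and keep the species proportions equal in both sets.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=0
)
test_class_counts = np.bincount(ys_test)
print(test_class_counts)
```

Fixing `random_state` makes the shuffle reproducible, so the same samples land in the test set on every run.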
Step 6: Build a Model with K-Nearest Neighbors (KNN)
The K-Nearest Neighbors (KNN) algorithm works well on datasets that are small, labeled, and relatively free of noise. Therefore, the Iris dataset is an excellent choice for trying out the KNN algorithm.
KNN is a type of machine learning algorithm that can be used to classify or predict the value of a target variable based on its similarity to other examples in a training set. To do this, it finds the K closest examples (or neighbors) in the training set to the new example and assigns the class held by the majority of those K neighbors (or, for regression, an average of their values).
For example, if you were trying to classify whether an image showed a cat or a dog, you could use KNN to compare the new image to other images in the training set and assign the same label as the majority of the K most similar images.
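The majority-vote idea described above can be sketched in a few lines of NumPy. This is a simplified illustration, not Scikit-learn's implementation: it assumes Euclidean distance, and the function `knn_predict` and the toy arrays `X_toy` and `y_toy` are hypothetical names made up for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training example.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels among the neighbors; ties go to the lowest class index.
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Toy data: class 0 clusters near (1, 1), class 1 clusters near (5, 5).
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

near_first_cluster = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]))
near_second_cluster = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]))
print(near_first_cluster, near_second_cluster)
```

Note that nothing is "learned" ahead of time here: the training data itself is the model, and all the work happens at prediction time.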
We create a KNN model by running the code block below. Note that calling KNeighborsClassifier() only creates an empty, unfitted model, which cannot yet predict which Iris species a given sample belongs to. To give our model the ability to predict, we must fit it to the training data that we prepared in the previous step (the last line of the block).
from sklearn import neighbors
# empty model
classifier = neighbors.KNeighborsClassifier()
# fit to actual training data
classifier.fit(X_train, y_train)
Step 7: Predicting
Now our model is ready to make predictions on new, unseen data. Let’s use it to predict the categories of the samples in the test set that we created in Step 5:
predictions = classifier.predict(X_test)
Step 8: Model Evaluation
You have trained a machine learning model and let it predict the test dataset. The last task is to evaluate the model to see how well it generalizes to new data. Scikit-learn provides functions for evaluating your model’s performance, such as accuracy score.
An accuracy score is a metric used to evaluate the performance of a model by measuring how often the model’s predictions match the actual outcomes. It helps us assess how well a model is performing and whether it is suitable for the intended purpose. Hence, we can determine whether the predictions are reliable, and identify potential issues in the model’s predictions, allowing for improvements to be made.
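As a worked example of this definition, accuracy is simply the fraction of predictions that equal the true labels. The arrays below are made-up values chosen for illustration, not output from the Iris model.

```python
import numpy as np

# Hypothetical true labels and predictions for 8 samples.
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 1, 2])

# Comparing element-wise gives True/False; the mean of that is the accuracy.
n_correct = int(np.sum(y_true == y_pred))
manual_accuracy = n_correct / len(y_true)
print(n_correct, "of", len(y_true), "correct ->", manual_accuracy)
```

Here 7 of the 8 predictions match, so the accuracy is 7 / 8 = 0.875.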
In Sklearn, the accuracy score is calculated as follows:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))
The accuracy for our model is about 0.97 or 97%, which is pretty high. You can continue playing with this dataset, testing other machine learning algorithms, or changing the hyperparameters of KNN.
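As a starting point for experimenting with KNN's hyperparameters, the sketch below refits the classifier with several values of `n_neighbors` and records the test accuracy for each. The candidate values and the `results` dictionary are arbitrary choices made for this example.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

# Try several neighborhood sizes and compare the resulting test accuracy.
results = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    results[k] = accuracy_score(y_test, model.predict(X_test))
    print(f"n_neighbors={k}: accuracy={results[k]:.3f}")
```

With a test set this small, differences between settings may come down to one or two samples, so a comparison like this is only a rough guide. Techniques such as cross-validation give a more reliable basis for choosing hyperparameters.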
Congratulations! You have successfully created a machine learning model with an effective prediction capability. This is a splendid effort for a beginner, so keep it up!
See you in future blog posts in which we will show you how to level up your machine learning skills with new knowledge and technical skills!
Thanks for reading!
If you are looking for information about artificial intelligence, machine learning, general data concepts, or medical data science applications, follow us to acquire more useful knowledge about these topics.
Open source project: https://github.com/vinbigdata-medical/vindr-lab