Intro to Machine Learning via the Abalone Age Prediction Problem

The best way to dive into ML is to see it in action. Here it is!

Kunj Mehta
14 min read · Jul 23, 2020

Anyone who has studied Machine Learning (ML) can attest to being overwhelmed, at some point or other, by the sheer amount of math, equations and symbols that comprise it. As interesting as ML is, it is very hard for a beginner to digest all the concepts being thrown at them. I believe that to understand any ML algorithm, we have to first understand the principle or intuition behind it, then see it in action to understand how it works, and then understand the math behind it (if you want to master the algorithm, you can go a step further and implement it from scratch yourself). This article addresses the first two aspects for some of the most fundamental algorithms and, in the process, explains the pipeline for building an ML model. I have assumed the reader knows a bit of ML terminology and Python. Knowledge of numpy, pandas and scikit-learn is beneficial.

Photo by Kevin Ku on Unsplash


What is Machine Learning?

We have more often than not heard Machine Learning and Artificial Intelligence (AI) being uttered in one breath. While we may see them being used interchangeably, they are not the same, but they are related. Put simply, the field of AI is aimed at devising techniques to make computational machines intelligent enough to perform tasks that, at the moment, humans are better at. Now, knowledge is a prerequisite of intelligence. This brings us to ML, which can be described as a sub-field of AI that aims to make machines intelligent by enabling them to acquire knowledge on their own from experience.

Now, the question is: what is this knowledge, and how does a machine acquire it from experience? Broadly, there are three approaches (and all of them require data):

  1. We provide the computational agent (called a model in ML terms) with a set of input data along with the corresponding output (training data). The model then extracts patterns from the data by learning the relationship between the input data and the output data (knowledge). As and when the model acquires knowledge from the data, the performance of the model in predicting the output of training data is also calculated and used to refine the acquired knowledge (thus the model gains experience). In fact, this is an iterative process called training. The refined knowledge is then used by the model to predict outputs for previously unseen input data (test data). This approach is called supervised learning.
  2. We provide the model with the input data but do not provide it any output data to go with the input. The model extracts relationships inherent in the data as knowledge on its own. It self-evaluates its performance based on its understanding of the data and uses that as experience to refine its knowledge. This knowledge is then used to return an output for the corresponding input. If we are not satisfied with the output, we may execute the model repeatedly till we are okay with the result. This approach is called unsupervised learning.
  3. We provide the model with the input data and an objective to complete. Again, no output data is provided. The model is tasked with finding answers (or steps) that can complete the objective optimally in the long run. If an answer is optimal, the model is rewarded. Otherwise, it is punished. Knowledge acquisition is through self exploration and as such the reward/punishment acts as everything: knowledge that the answer is wrong, evaluation that the model is not moving towards the optimal solution and experience that the answer should not be repeated. This approach is called reinforcement learning.

We have now seen the different ways a model can learn from experience. But how does it actually learn in the first place? That is where algorithms come in! You may ask: doesn't math come in there too? Not to worry; let's understand the why and what of algorithms by seeing some of the supervised algorithms in action, and in the process get to know the steps needed to build and train a model on a dataset!

Building an End-to-End Machine Learning Model

These are the steps to follow to train a model on data:

  1. Collect and know the data
  2. Clean and analyze the data
  3. Choose the best algorithm for your data
  4. Train the model using the algorithm
  5. Evaluate the model
  6. Retrain using different configurations of the model to get the best performance.
  7. Make predictions using the model (inference)

Let’s see these step-by-step.

Collect and Know the Data

We will be using the already collected Abalone dataset to see the algorithms in action. The first step in knowing the data is to know what it contains. This means understanding the type (continuous numeric, discrete numeric or categorical) and meaning of each feature, and noting down the number of instances and features in the dataset. (For readers familiar with Excel, features correspond to columns and instances correspond to rows.)

A brief aside on the motivation behind collecting the dataset. Abalone is a type of edible sea snail whose price varies with its age. As mentioned in the dataset description here: The aim is to predict the age of abalone from physical measurements. The age of abalone is traditionally determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope — a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.

Let’s now see the type and name of the features:

  1. Sex : This is the sex of the abalone and has categorical value (M = male, F = female or I = infant).
  2. Length : The longest measurement of the abalone shell in mm. Continuous numeric value.
  3. Diameter : The measurement of the abalone shell perpendicular to length in mm. Continuous numeric value.
  4. Height : Height of the shell in mm. Continuous numeric value.
  5. Whole Weight : Weight of the abalone in grams. Continuous numeric value.
  6. Shucked Weight : Weight of just the meat in the abalone in grams. Continuous numeric value.
  7. Viscera Weight : Weight of the abalone's gut, after bleeding, in grams. Continuous numeric value.
  8. Shell Weight : Weight of the shell, after being dried, in grams. Continuous numeric value.
  9. Rings : This is the target, that is, the feature we will train the model to predict. As mentioned earlier, we are interested in the age of the abalone, and it has been established that the number of rings + 1.5 gives the age in years (a quick sketch of this conversion follows the list). Discrete numeric value.
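
As a quick, hedged illustration of that conversion (the helper name below is my own, not part of the dataset or the article's code):

# hypothetical helper: convert a ring count to an age in years (age = rings + 1.5)
def rings_to_age(rings):
    return rings + 1.5

print(rings_to_age(8))  # an abalone with 8 rings is roughly 9.5 years old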

Let’s take a look at the actual data with the code for it.

# we use pandas to read the dataset and assign its feature names
import pandas as pd

target_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
abalone = pd.read_csv(target_url, header=None)
abalone.columns = ['Sex', 'Length', 'Diameter', 'Height',
                   'Whole weight', 'Shucked weight',
                   'Viscera weight', 'Shell weight', 'Rings']
# display the first five rows of the dataset
abalone.head()
First Five Rows of the Abalone Dataset

Clean and Analyze the Data

Now that we can make sense of the features, the next step is to clean and analyze the data. The two go hand-in-hand, and we may need to do one after the other repeatedly. But first, cleaning. Why do we clean the data? There are many reasons why data cleaning is important. Sometimes, due to erroneous entries, the data contains stray characters or values the model cannot interpret. Sometimes, fields have been unintentionally left blank. The model then needs to know how to treat these blank values: whether to ignore them or fill in a default value.

Next, let’s make sure that there are no missing values in the dataset. We can do this with: abalone.isnull().sum(axis = 0) where (axis = 0) specifies that we are summing null values over each feature. We can see from the output there are no null values in the dataset.

Checking for Null Values in the Dataset

Now, let’s look at some statistics of the dataset possible by: abalone.describe()

Statistical Description of the Dataset

As we can see, the feature Sex is missing. This is because the values of Sex are categorical, and categorical values do not have means and percentiles. A point to note here: ML models find it difficult to work with values of different types (such as both categorical and numeric, as is the case here) at the same time. This is why we will convert Sex using something called one-hot encoding, which converts a categorical feature into one binary numeric feature per category, indicating the presence or absence of each value that was originally in the categorical feature. This is done by: abalone = pd.get_dummies(abalone)

First five rows of the converted dataset
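
To make the effect concrete, here is a minimal standalone sketch of what pd.get_dummies does to a made-up Sex column (the Sex_F, Sex_I and Sex_M column names are what pandas generates by default; depending on the pandas version the indicators display as 0/1 or True/False):

# a tiny standalone illustration of one-hot encoding
demo = pd.DataFrame({"Sex": ["M", "F", "I", "M"]})
# the categorical column is replaced by indicator columns Sex_F, Sex_I, Sex_M
print(pd.get_dummies(demo))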

Now, let’s move on to analyzing the dataset. Why do we analyze the dataset? Quite simply, to decide on the best algorithm and the best features to use to train a model on the dataset.

One way to approximate what might be the best features for prediction is to use a correlation heatmap. For those who do not know what correlation is, it is simply a measure of the degree to which two variables move in relation to one another. For instance, a positive correlation implies that variable A tends to increase as B increases and decrease as B decreases. Now, what does this have to do with feature selection? Well, we can find out which features are strongly correlated with the one we are trying to predict (the target), as these are the ones that have the most effect on it. We do this in code:

# calculate and round off the correlation matrix of the numeric features
import numpy as np

corMat = abalone.iloc[:, :8].corr().values
corMat = np.around(corMat, decimals=3)
# print the features in order of their correlation with the 'Rings' target
feature_importance = abalone.iloc[:, :8].corr().iloc[:-1, -1].sort_values(ascending=False)
print('Features in Descending Order of Importance', list(feature_importance.index))
Features in descending order of importance
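
The snippet above only prints the ranking; if you want the heatmap itself, a minimal sketch using seaborn (an extra dependency not used elsewhere in this article) might look like this:

# visualize the rounded correlation matrix as a heatmap
import matplotlib.pyplot as plt
import seaborn as sns

corr = abalone.iloc[:, :8].corr().round(3)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation between the numeric features and Rings")
plt.show()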

In this output too, Sex is missing: the snippets only consider the first eight (numeric) columns, and correlation is in any case a statistical measure designed for continuous numeric values. Not to say that it isn't useful, but it doesn't cover all the features in this case. Now, if we look at the distribution (or even the values) of the target in the data, we will find that even the target is not continuous but discrete valued.

The set of discrete target values
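
A quick way to verify this yourself is to list the distinct target values, for example:

# the target takes a limited set of integer values rather than a continuum
print(sorted(abalone["Rings"].unique()))
print(abalone["Rings"].nunique(), "distinct ring counts")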

You may ask, is that a problem? No, it's not a problem, but it is a good place to mention that supervised learning problems can be solved in two ways: regression and classification. Regression is when we train the model to output a value belonging to the real number set. Because both input and output are continuous-valued, correlation is better suited there. On the other hand, classification is when we train the model to categorize an input into one of two (binary classification) or more (multi-class classification) categorical or discrete target values. For instance, in our case any input can be said to belong to an abalone whose ring count lies in the range [1, 29]. Hence, ours is a multi-class classification problem. Now that we don't have a method for finding the best features, we are left in a bit of a quandary, aren't we? Not quite. There are many other methods, but I do not cover them here because they are highly math-based and related to correlation, which gives a good approximation anyhow.

Finally, because we are training the model to predict Rings, we will remove it from our dataset and pass it separately. We will do this by:

y = abalone["Rings"]
X = abalone.drop(columns="Rings")

Since we are finally done with the cleaning and analyzing, let's dive right into the algorithms! Take note here that training, evaluation and inference are done iteratively, together (even in practice).

In the previous code snippet, we saw the features and the target being separated. By convention, we assign the input features to X and the target to y for training. Now, if you remember, after training we also evaluate the model. On what data do we evaluate the model if the whole dataset is used for training? The answer is that we split the dataset into two: one part for training, known as the training data, and one for evaluation, known as the test data. The typical split is an 80:20 ratio. We do this by: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
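
For that line to run, train_test_split has to be imported from scikit-learn. A minimal sketch, with an explicit random_state added by me so that the split is reproducible (the accuracies reported below came from an unseeded split, so your numbers may differ slightly):

from sklearn.model_selection import train_test_split

# hold out 20% of the data for evaluation; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)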

Now, let’s cover the intuition and working for some algorithms.

Algorithm 1: K-nearest Neighbours (KNN)

Intuition: The intuition behind KNN is really simple. Given a point, and asked to predict which category/class it belongs to, KNN simply checks the k nearest points to the given point and outputs the category represented by the majority of those k nearest points. How are the nearest points found? The most common way is the Euclidean distance, the distance formula we learned in school.
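
For reference, the Euclidean distance between two feature vectors p and q is the square root of the sum of squared coordinate differences; a minimal numpy sketch:

import numpy as np

def euclidean_distance(p, q):
    # sqrt((p1 - q1)^2 + ... + (pn - qn)^2)
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0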

K Nearest Neighbours

For instance, in the above image, we have to find which class the point represented by the question mark belongs to. If k=3, the 3 nearest neighbours are found (given by the inner circle); the majority class among the neighbouring points is green, so the given point is assigned to the green class. Similarly, if k=7, the point is assigned to the red class. So we can see that on changing k, the output changes. k is called a hyperparameter, a parameter supplied by us to train different configurations of the model in order to obtain the best possible performance (stage 6 of the model building pipeline).

Performance. How is that measured? The most common way of measuring performance for classifiers is to calculate the ratio of correctly classified instances to the total number of instances (here, just those in the test dataset). This ratio is called accuracy.

Let’s see the code:

# Initializing the classifier with hyperparameter k=3
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
# training the classifier
knn.fit(X_train, y_train)
# evaluating the classifier
print(knn.score(X_test, y_test))

# try changing the hyperparameter
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))

We get accuracies of 0.2021 and 0.2212 for k=3 and k=5, respectively.
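
Rather than re-running the cell by hand for every k, stage 6 of the pipeline can be automated with a small loop; a minimal sketch (the range of k values is my own choice, and in practice you would tune k on a separate validation set or with cross-validation rather than on the test set):

# try several values of the hyperparameter k and keep the best-scoring one
best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    print(f"k={k}: accuracy={score:.4f}")
    if score > best_score:
        best_k, best_score = k, score
print("best k:", best_k)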

Algorithm 2: Logistic Regression

Intuition: The intuition behind logistic regression is to find the most optimal equation of a line a₁x₁ + a₂x₂ + … + aₙxₙ = y that separates the dataset points cleanly into two categories, where the xᵢ are the feature values of an instance. The algorithm starts with a random line whose answer y (calculated for each individual data point) is passed through a function called the sigmoid. If the result is closer to 1 (above 0.5), the data point is assigned to one class; if closer to 0 (below 0.5), to the other class. The algorithm checks its classification performance with this line and adjusts the coefficients aᵢ in the line equation accordingly to get another line. This keeps repeating until a maximum number of iterations specified by us is reached, or until the line stops changing.

Logistic Regression

For instance, in the above image, the algorithm starts with a random line a₁x₁ + a₂x₂, which classifies all points below it as Class 1 and all points above it as Class 2, which is not correct. After many iterations, it updates the coefficients a₁ and a₂ both to 1 to get the line x₁ + x₂, which gives the correct classification (you can verify this by assuming approximate values for a data point, calculating y and then passing it through the sigmoid). For our problem, the model follows a one-versus-rest approach. We know we have 29 classes. Now, suppose we have a data point. The algorithm assumes there are only two classes in the data instead of 29: say, class 0 (the real class 0) and class 1 (all the other classes combined). It then checks whether the data point belongs to class 0 or class 1 (by passing the feature values into the equation of the line and then through the sigmoid). It does this for each of the 29 classes in the dataset and thus gets 29 lines. Finally, it assigns the point to the class whose line produced the highest score.
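
The sigmoid mentioned above squashes any real-valued score y into the range (0, 1) via 1 / (1 + e^(-y)); a minimal sketch of how the thresholding works:

import numpy as np

def sigmoid(y):
    # maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-y))

# a score of 0 sits exactly on the line, giving 0.5;
# large positive scores approach 1, large negative scores approach 0
for y in [-5, 0, 5]:
    label = 1 if sigmoid(y) >= 0.5 else 0
    print(y, round(sigmoid(y), 3), "-> class", label)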

Let's see the code:

# Initializing the classifier with the one-vs-rest approach.
# random_state ensures the same results in every execution.
from sklearn.linear_model import LogisticRegression

logr = LogisticRegression(multi_class='ovr', random_state=3)
# training the classifier
logr.fit(X_train, y_train)
# evaluating the classifier
print(logr.score(X_test, y_test))

We get an accuracy of 0.2308.

Algorithm 3: Decision Trees

Intuition: Again, a very simple algorithm. We can visualize a decision tree as a flowchart. Given a dataset, a decision tree tries to partition it on a certain feature, at a certain value of that feature, such that the resulting partitions contain as many instances from a single class as possible. It keeps splitting like this until a specified limit is reached, until there are no features left to partition on, or until it perfectly partitions the dataset.

Visualizing Decision Tree for Abalone Dataset

We can see the decision tree constructed for our dataset above, shown for two levels of splits. In the root node, we can see the algorithm has decided to split the dataset of 3341 instances on Shell Weight at the value 0.14. This means all instances with Shell Weight less than or equal to 0.14 form one partition, and all other instances the other. The class variable in the figure is the majority class among the instances in that partition. For instance, the data at the root has a majority of instances of class 8, i.e. abalones having 8 rings.

Let’s see the code:

# Initializing the classifier. random_state ensures the same results in every
# execution; max_depth limits how many levels of splits the tree can have.
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=0, max_depth=3)
# training the classifier
dt.fit(X_train, y_train)
# evaluating the classifier
print(dt.score(X_test, y_test))

We get an accuracy of 0.2547.
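
The tree figure shown earlier can be reproduced (approximately) from the fitted classifier; a minimal sketch using scikit-learn's own helpers:

# text summary of the learnt splits
from sklearn.tree import export_text, plot_tree
import matplotlib.pyplot as plt

print(export_text(dt, feature_names=list(X.columns)))

# graphical version, similar to the figure above
plt.figure(figsize=(12, 6))
plot_tree(dt, feature_names=list(X.columns), filled=True)
plt.show()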

Conclusion

In this article, we covered the basics of Machine Learning, learnt about the model building pipeline and used it to see a few supervised classification algorithms in action on the Abalone dataset. At the end of it, we could see that the accuracy of the models was not good. This is because the number of instances per class in the dataset is too small for the model to properly learn the patterns between the features and the target. Moreover, since this was an introductory article, we have not used the algorithms most appropriate for this specific dataset. I will leave that to you!

Linked Articles for Reference

  1. Code for this article
  2. One-hot Encoding
  3. Correlation
  4. Feature Selection #1
  5. Feature Selection #2
  6. Sigmoid Function
  7. One-versus-rest Approach

See the full code on GitHub. I would love to connect with you on LinkedIn!




Kunj Mehta

MS @ Rutgers 2023 | Writing on AI transformation, AI in finance, climate and logistics. linkedin.com/in/kunjmehta