In my opinion, K-NN is the simplest algorithm in the world of machine learning. Why? Because it doesn't use any probability or statistics concepts to compute the class label.
In democratic countries like India, the United States, Canada, etc., a government is formed by conducting an election, and the party with the majority of votes gets to form the government.
K-NN works in a similar way: it conducts a majority vote among the points nearest to the query point (Xi), and whichever class gets the majority of votes is declared to be Xi's class label.
Regional people who vote - the nearest neighbors of Xi.
Party which wins - the class label that gets the most votes.
Government formed - the class with the most votes is declared to be Xi's class label.
How the K-NN algorithm works:
Step 1: Find the K nearest points to the query point (Xi).
Step 2: Find the class labels those K nearest points belong to.
Step 3: Count the votes to find which class label gets the most votes.
Step 4: Whichever class gets the most votes, declare that Xi belongs to that class.
Example: Consider a query point Xi surrounded by n data points.
Since our K = 7, we consider only the 7 points nearest to Xi, which are found to be X1, X2, X3, X4, X5, X6 and X7.
Since the majority of these 7 neighbors belong to the ▲ class, we can conclude that our query point (Xi) belongs to class label ▲.
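To make the four steps concrete, here is a minimal from-scratch sketch in Python (the helper name knn_predict and the toy data are my own, purely for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=7):
    """Classify x_query by a majority vote among its k nearest training points."""
    # Step 1: distance from the query point to every training point (Euclidean here)
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 2: class labels of the k nearest points
    nearest_idx = np.argsort(distances)[:k]
    nearest_labels = y_train[nearest_idx]
    # Steps 3 & 4: count the votes and return the winning class
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy example: two classes, query point lying close to the "triangle" cluster
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [9, 9]])
y_train = np.array(["circle", "circle", "circle",
                    "triangle", "triangle", "triangle", "triangle"])
print(knn_predict(X_train, y_train, np.array([8, 7]), k=7))  # -> "triangle"
```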
How do we find the K nearest points? The distance between points can be calculated using:
1.1 Euclidean distance.
In 2-D space, the Euclidean distance between two points X1 = (x11, x12) and X2 = (x21, x22) is d(X1, X2) = sqrt((x11 - x21)^2 + (x12 - x22)^2).
1.2 Manhattan Distance
Manhattan distance is simply the block distance between two points: d(X1, X2) = |x11 - x21| + |x12 - x22|.
1.3 Cosine Similarity / Cosine Distance:
Cosine similarity is cos θ, where θ is the angle between the two vectors.
Cosine Distance = 1 - Cosine Similarity
Cosine similarity ranges from -1 to 1.
1.4 Hamming Distance:
The number of positions at which two binary/Boolean vectors differ.
For example, consider two data points X1 and X2 that are binary vectors: if X1 = [1, 0, 1, 1] and X2 = [1, 1, 0, 1], they differ in 2 positions, so the Hamming distance is 2.
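Here is a rough NumPy sketch of the four distance functions described above (the function names are my own, just for illustration):

```python
import numpy as np

def euclidean(x1, x2):
    # Straight-line distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x1 - x2) ** 2))

def manhattan(x1, x2):
    # Block distance: sum of absolute coordinate differences
    return np.sum(np.abs(x1 - x2))

def cosine_distance(x1, x2):
    # 1 - cos(theta), where theta is the angle between the two vectors
    cos_sim = np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
    return 1 - cos_sim

def hamming(x1, x2):
    # Number of positions at which two binary vectors differ
    return np.sum(x1 != x2)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))   # 5.0
print(manhattan(a, b))   # 7.0
print(cosine_distance(a, b))
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))  # 2
```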
How do we choose the best K value:
For different values of K, the accuracy on the cross-validation dataset (Dcv) is calculated. A graph of K value vs. accuracy is then plotted, as shown below:
Whichever K gives the highest accuracy on Dcv is considered the best K.
We can see that when K = 10 we get the maximum accuracy of 95%, so we choose K = 10 and then evaluate our model on the test dataset.
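As a sketch of this procedure, assuming scikit-learn and its built-in iris dataset, we could loop over candidate K values and keep the one with the best cross-validation accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_acc = None, 0.0
for k in range(1, 31, 2):                             # try odd K values to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X, y, cv=5).mean()     # mean accuracy over the CV folds
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"best K = {best_k}, cross-validation accuracy = {best_acc:.3f}")
```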
We have seen how to compute the class label for a classification problem; now let us see how K-NN handles a regression problem.
K-NN Algorithm on Regression problem:
K-NN can be used for regression with a simple alteration to the classification version: instead of taking a majority vote among the nearest neighbors, we take the mean (or median) of the nearest neighbors' target values, as sketched below.
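A minimal sketch of K-NN regression, reusing the same nearest-neighbor idea (the function name and toy data are my own):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3, use_median=False):
    """Predict a continuous target as the mean (or median) of the k nearest targets."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest_idx = np.argsort(distances)[:k]
    neighbours = y_train[nearest_idx]
    return np.median(neighbours) if use_median else np.mean(neighbours)

X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_train = np.array([1.1, 2.0, 2.9, 10.2, 11.1])
print(knn_regress(X_train, y_train, np.array([2.5]), k=3))  # ~2.0 (mean of 1.1, 2.0, 2.9)
```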
How the curse of dimensionality affects K-NN:
As the dimensionality increases, the time and space complexity of K-NN increases, since it is O(nd), where n is the number of data points and d is the number of dimensions.
Our intuition about the distance between points is not valid in higher dimensions: all points become almost equally distant from each other, so distance functions like Euclidean distance stop being informative. The small simulation below illustrates this.
In higher dimensions, the probability of our model overfitting is also higher.
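This quick simulation (my own, with random points in the unit cube; exact numbers depend on the seed) shows the effect: as the dimension grows, the farthest point is barely farther from the query than the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    points = rng.random((500, d))            # 500 random points in the d-dimensional unit cube
    query = rng.random(d)
    dist = np.linalg.norm(points - query, axis=1)
    # As d grows, the max/min distance ratio approaches 1: all points look equally far away
    print(f"d={d:5d}  max/min distance ratio = {dist.max() / dist.min():.2f}")
```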
Outliers in K-NN:
If a query point is an outlier, K-NN accuracy drops drastically, since it becomes difficult to find a nearby cluster of points to vote.
Imbalanced dataset:
Say the dataset is imbalanced with a 90-10% ratio between the two class labels. Since K-NN is voting based, there is a high probability of the model predicting that every query point belongs to the majority class.
Interpretability of K-NN:
It is easy to interpret why a K-NN model predicted that a query point belongs to a specific class label, i.e., we can explain the model's behavior by pointing to the neighbors that voted. This is very useful in the medical industry.
Pros & Cons of K-NN:
We can use K-NN when D is small. When D is large, don't use plain K-NN; if you must, consider LSH or a kd-tree.
The interpretability of the model decreases in higher dimensions.
The run-time complexity increases in higher dimensions.
Don't use K-NN for low-latency systems. If it must be used in a low-latency system, use LSH or a kd-tree (see the sketch after this list).
Easy to interpret.
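For completeness, here is how a kd-tree index could be requested when using scikit-learn's K-NN classifier (LSH would typically require a separate library, so it is not shown):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Brute-force search compares the query against every training point (O(n*d) per query);
# a kd-tree index can answer neighbor queries faster when the dimensionality is modest.
knn = KNeighborsClassifier(n_neighbors=10, algorithm="kd_tree")
knn.fit(X, y)
print(knn.predict(X[:5]))
```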
Feel free to connect:
www.linkedin.com/in/kailash-sukumaran
— Kailash Sukumaran