Jenks Natural Breaks – Best Range Finder Algorithm
This post tries to give a clear picture of what this obscure handy tool is all about.
The Jenks optimization method, also called the Jenks natural breaks classification method, is one of the data clustering methods designed to determine the best arrangement of values into different classes. But before going any further, let’s look at what “Natural Breaks” means.
Natural Breaks: “Natural breaks” are the best way to split up ranges. Best ranges imply the ranges where like areas are grouped together. This method minimizes the variation within each range, so the areas within each range are as close as possible in value to each other.
Intuition: The Jenks natural breaks algorithm, just like K-means, assigns data to one of K groups such that the within-group distances are minimized. Also just like K-means, one must select K prior to running the algorithm.
Why is it not a good idea to do it manually: It’s usually impractical as there will be an overwhelming number of different ways to set ranges and inaccurate as it destroys the objective display of data. Of the few patterns the user can test, the “prettiest” pattern will almost certainly be selected, but that has nothing to do with the correct display of the data.
Algorithm under the hood:
Let’s look at an example to understand how the algorithm works. Say our list of values is [4, 5, 9, 10] and we need to find the best ranges from that.
Step1: Calculate the “sum of squared deviations for array mean” (SDAM).
list = [4, 5, 9, 10] mean = 7 #(4 + 5 + 9 + 10) / 4 SDAM = (4-7)^2 + (5-7)^2 + (9-7)^2 + (10-7)^2 = 9 + 4 + 4 + 9 = 26
Step2: For each range combination, calculate “sum of squared deviations for class means” (SDCM_ALL), and find the smallest one.SDCM_ALL is similar to SDAM but uses class means and deviations.
""" For [5,9,10] SDCM_ALL = (4-4)^2+(5-8)^2+(9-8)^2+(10-8)^2 = 0 + 9 + 1 + 4 = 14 For [4,5][9,10] SDCM_ALL = (4-4.5)^2+(5-4.5)^2+(9-9.5)^2+(10-9.5)^2 = 0.25 + 0.25 + 0.25 + 0.25 = 1. For [4,5,9] SDCM_ALL = (4-6)^2+(5-6)^2+(9-6)^2+(10-10)^2 = 4 + 1 + 9 + 0 = 14. """
Observe that the middle one is having the lowest SDCM implying minimum variance.
Step3: As a final summary measure, calculate a “goodness of variance fit” (GVF), defined as (SDAM — SCDM) / SDAM. GVF ranges from 1 (perfect fit) to 0 (awful fit).
""" GVF for [4,5][9,10] is (26 - 1) / 26 = 25 / 26 = 0.96 GVF for the other 2 ranges is (26 - 14) / 26 = 12 / 26 = 0.46 """
GVF for [4,5][9,10] is the highest indicating that this combination is the best ranges for the list[4, 5, 9, 10] which makes sense intuitively.
Things to know: It’s a data-intensive algorithm. Take an example of splitting 254 items into 6 ranges. there are 8,301,429,675 possible range combinations. It might take the computer a little while to test so many combinations. So it’s usually a better practice to start with a low number of ranges, and only increase the number to a large number if needed.
Usage in code:
pip install jenks
Below is a function to calculate the Goodness of Variance Fit given an array of values to classify and the number of classes selected:
from jenks import jenks import numpy as np def goodness_of_variance_fit(array, classes): # get the break points classes = jenks(array, classes) # do the actual classification classified = np.array([classify(i, classes) for i in array]) # max value of zones maxz = max(classified) # nested list of zone indices zone_indices = [[idx for idx, val in enumerate(classified) if zone + 1 == val] for zone in range(maxz)] # sum of squared deviations from array mean sdam = np.sum((array - array.mean()) ** 2) # sorted polygon stats array_sort = [np.array([array[index] for index in zone]) for zone in zone_indices] # sum of squared deviations of class means sdcm = sum([np.sum((classified - classified.mean()) ** 2) for classified in array_sort]) # goodness of variance fit gvf = (sdam - sdcm) / sdam return gvf def classify(value, breaks): for i in range(1, len(breaks)): if value < breaks[i]: return i return len(breaks) - 1
For example, consider you decide the GVF should be at least .8, then you could increment the number of classes until the GVF is satisfied:
gvf = 0.0 nclasses = 2 while gvf < .8: gvf = goodness_of_variance_fit(array, nclasses) nclasses += 1 gvf = goodness_of_variance_fit(array, nclasses) nclasses += 1
This brings us to the end of the post. Jenks Natural Breaks could be a handy statical technique in your toolset, designed to optimise the arrangement of a set of values into “natural” classes.