Kernel Density Estimation
FEBRUARY 18, 2020
Kernel density estimation with python
Before starting, let’s get some background on estimators. They are classified into two classes.
Parametric estimators make assumptions about the population from which a sample of data is drawn. Often the assumption is that the population is normally distributed, i.e. bell-shaped. This assumption supports a theory that lets us draw inferences about the population based on a sample taken from it.
The other family is non-parametric estimators. These make no distributional assumptions and have no fixed structure; the estimate depends on all the data points. Kernel density estimators belong to this class.
So why kernel density estimation? Let us first see why histograms are not sufficient.
Histograms are not smooth, and they depend on the width of the bins and on the endpoints of the bins. This is the problem that kernel density estimators alleviate.
Let’s see how histograms are affected by the number of bins:
# Importing libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats.distributions import norm

# Plotting a normal sample with different bin counts
mu, sigma = 0, 0.1  # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)

plt.figure(figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
plt.hist(s, bins=10, label="10")
plt.hist(s, bins=50, label="50", color="green")
plt.hist(s, bins=300, label="300", color="orange")
plt.hist(s, bins=500, label="500", color="white")
plt.legend()
plt.show()
Comparison of bins on Histogram
In the visualization above we see how the choice of bins changes the look of the very same normal sample.
So how do we overcome this?
To remove the dependence on the endpoints of the bins, kernel estimators center a kernel function at each data point. We place a kernel function on every data point and sum their contributions to get the density estimate, just like in high school, where we evaluated a function at a given point x:
y = f(x)
Kernel Density Estimate
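To make the idea concrete, here is a minimal hand-rolled sketch (not from the original post): place a standard Gaussian at each data point, scale by the bandwidth h, and average. The function and variable names are my own.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x_grid, data, h):
    """Evaluate f_hat(x) = (1 / (n*h)) * sum_i K((x - x_i) / h) on x_grid."""
    data = np.asarray(data)
    # Shape (len(x_grid), len(data)): one scaled kernel per data point
    u = (x_grid[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(data) * h)

data = np.array([-1.0, 0.0, 0.5, 2.0])
grid = np.linspace(-4, 5, 200)
density = kde(grid, data, h=0.5)
```

Because each kernel integrates to 1 and we average n of them, the resulting estimate is itself a valid density (non-negative, integrating to 1).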
A kernel function typically has the following properties:
Non-negative everywhere: K(x) ≥ 0 ∀ x ∈ X
Symmetric: K(x) = K(−x) ∀ x ∈ X
Decreasing: K′(x) ≤ 0 ∀ x > 0
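As a quick sanity check (a sketch of mine, not part of the original post), the Gaussian kernel satisfies all three properties, which we can verify numerically on a grid:

```python
import numpy as np

def K(x):
    # Gaussian kernel
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

xs = np.linspace(-5, 5, 1001)
assert np.all(K(xs) >= 0)            # non-negative everywhere
assert np.allclose(K(xs), K(-xs))    # symmetric: K(x) == K(-x)
pos = xs[xs > 0]
assert np.all(np.diff(K(pos)) <= 0)  # decreasing for x > 0
```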
Different Kernel Functions
# x is a 1-D sample (defined below); the kernel/bw arguments are as in seaborn < 0.11
sns.kdeplot(x, bw=.4, color="yellow", label="gaussian", kernel="gau")
sns.kdeplot(x, bw=.4, color="black", label="biw", kernel="biw")
sns.kdeplot(x, bw=.4, color="red", label="cos", kernel="cos")
sns.kdeplot(x, bw=.4, color="green", label="epa", kernel="epa")
sns.kdeplot(x, bw=.4, color="blue", label="tri", kernel="tri")
sns.kdeplot(x, bw=.4, color="purple", label="triw", kernel="triw")
plt.legend()
The quality of a kernel estimate depends less on the shape of K than on the value of its bandwidth h. Choosing an appropriate bandwidth is important, since a value that is too small or too large is not useful.
x = np.concatenate([norm(-1, 1.).rvs(400), norm(1, 0.3).rvs(100)])

sns.kdeplot(x, bw=2, color="yellow", label="bw: 2")
sns.kdeplot(x, bw=.5, color="blue", label="bw: 0.5")
sns.kdeplot(x, bw=.3, color="green", label="bw: 0.3")
sns.kdeplot(x, bw=.2, color="red", label="bw: 0.2")
sns.kdeplot(x, bw=.1, color="grey", label="bw: 0.1")
sns.kdeplot(x, bw=.05, color="grey", label="bw: 0.05")
plt.legend()
The smoothing bandwidth h plays a key role in the quality of the KDE. Here is an example of applying different values of h to the same dataset. When h is too small (the grey curves), many wiggly structures appear on the density curve; this is undersmoothing. On the other hand, when h is too large (the yellow curve), the two bumps are smoothed out. This is oversmoothing: important structures are obscured by the large amount of smoothing.
Bandwidth selection methods, univariate case
The natural way to choose h is to plot out several curves and choose the estimate that best matches one’s prior (subjective) ideas. However, this method is not practical for high-dimensional data. Two common automatic selectors are:
Maximum likelihood cross-validation
Reference to a standard distribution
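The post names these two selectors without code, so here is an illustrative sketch of both (variable names are mine): Silverman’s rule of thumb, which references the normal distribution, and a likelihood cross-validation via scikit-learn’s KernelDensity, which picks the bandwidth maximizing the held-out log-likelihood.

```python
import numpy as np
from scipy.stats import iqr
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1.0, 400), rng.normal(1, 0.3, 100)])
n = len(x)

# Reference to a standard (normal) distribution: Silverman's rule of thumb
h_silverman = 0.9 * min(x.std(ddof=1), iqr(x) / 1.34) * n ** (-1 / 5)

# Cross-validation: maximize the held-out log-likelihood over a bandwidth grid
search = GridSearchCV(KernelDensity(kernel="gaussian"),
                      {"bandwidth": np.linspace(0.05, 1.0, 20)},
                      cv=5)
search.fit(x[:, None])
h_cv = search.best_params_["bandwidth"]
```

For this bimodal sample, both selectors land on a moderate bandwidth; the rule of thumb tends to oversmooth multimodal data slightly, since it assumes a single normal bump.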
The idea of kernel density estimators, in short, is to give you a smooth, assumption-free picture of the distribution of your data.