In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import sympy as sy
from scipy import interpolate, stats

Regression and Least Squares

Earlier, we encountered interpolation as a means to determine a curve given only a few points. Interpolation produces a polynomial that passes through each point; however, this is not always sensible.

Let’s consider the following example, where the data represent cigarette consumption and death rates for the countries given.

Country Cigarette Consumption Deaths per Million
Norway 250 95
Sweden 300 120
Denmark 350 165
Australia 470 170

If we use interpolation as we did earlier, we get the following picture.

In [2]:
cigs = [250, 300, 350, 470]
death = [95, 120, 165, 170]

xs = np.linspace(240, 480, 1000)
i1 = interpolate.splrep(cigs, death, s=0)
ynews = interpolate.splev(xs, i1, der = 0)


plt.plot(cigs, death, 'o')
plt.plot(xs, ynews, '--k')
plt.title("Interpolation Polynomial of Cigarette Data")
plt.xlabel("Cigarette Consumption")
plt.ylabel("Deaths per million")
Out[2]:
<matplotlib.text.Text at 0x10fdfd400>
_images/1.9_calc_regression_3_1.png

Now suppose that we wanted to use our polynomial to make a prediction for a country with higher cigarette consumption than Australia. Notice that, beyond the last data point, the interpolating polynomial would provide an estimate that is lower, even though consumption is higher.

Alternatively, we can fit a straight line to the data with the NumPy function polyfit, to which we pass the data and the degree of the polynomial we want to fit. Here we want a straight line, so we choose degree 1.

In [3]:
a, b = np.polyfit(cigs, death, 1)  # np.polyfit returns the highest-degree coefficient first: slope a, then intercept b
In [4]:
def l(x):
    return a*x + b
In [5]:
xs = np.linspace(240, 480, 1000)
i1 = interpolate.splrep(cigs, death, s=0)
ynews = interpolate.splev(xs, i1, der = 0)
plt.plot(cigs, death, 'o')
plt.plot(xs, l(xs), label = 'Linear Regression')
plt.plot(xs, ynews, '--k', label = 'Polynomial Interpolation')
plt.title("Interpolation vs. Regression")
plt.xlabel("Cigarette Consumption")
plt.ylabel("Deaths per million")
plt.legend(loc = 'best', frameon = False)
Out[5]:
<matplotlib.legend.Legend at 0x110456d30>
_images/1.9_calc_regression_7_1.png

Determining the Line of Best Fit

The idea behind our line of best fit is that it minimizes the distances between itself and all the data points. These distances are called residuals and are shown in the plot below.

In [6]:
plt.plot(cigs, death, 'o')
plt.plot(xs, l(xs), label = 'Linear Regression')
# draw each residual as a vertical segment from the data point to the fitted line
for x_i, y_i in zip(cigs, death):
    plt.plot((x_i, x_i), (y_i, l(x_i)), c = 'red')
plt.title("Residuals")
Out[6]:
<matplotlib.text.Text at 0x1104bc710>
_images/1.9_calc_regression_9_1.png

The line of best fit minimizes these distances using the familiar techniques of differentiation that we have studied. First, we investigate the criterion of least squares, which says that the best line is the one with the smallest ROOT MEAN SQUARE ERROR, or RMSE.

In general, a residual is the vertical distance between an actual data point \((x_i, y_i)\) and the corresponding point on the fitted line, \((x_i, l(x_i))\); in other words, the difference \(y_i - l(x_i)\).
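For a candidate line \(l\), the RMSE collects these residuals into a single number:

\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n \big(y_i - l(x_i)\big)^2}\]

The smaller the RMSE, the better the line fits the data.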

Suppose we were deciding between the lines

\[y_1 = .3x + 34.75 \quad y_2 = .4x + .5\]

We want to compare the typical size of the difference between the actual and predicted values. We can find the residuals by creating a list of the absolute differences between actual and predicted values. For example, for Norway the first line predicts \(y_1(250) = 109.75\), so the residual has magnitude \(|95 - 109.75| = 14.75\).

In [7]:
def y1(x):
    return 0.3*x + 34.75

def y2(x):
    return 0.4*x + 0.5
In [8]:
plt.plot(cigs, death, 'o')
plt.plot(xs, y1(xs))
plt.plot(xs, y2(xs))
plt.title("Which is the Better Fit?")
Out[8]:
<matplotlib.text.Text at 0x110609e48>
_images/1.9_calc_regression_12_1.png
In [9]:
# absolute residuals |actual - predicted| for each candidate line
resid_1 = [np.sqrt((death[i] - y1(cigs[i]))**2) for i in range(len(cigs))]
resid_2 = [np.sqrt((death[i] - y2(cigs[i]))**2) for i in range(len(cigs))]
In [10]:
resid_1
Out[10]:
[14.75, 4.75, 25.25, 5.75]
In [11]:
resid_2
Out[11]:
[5.5, 0.5, 24.5, 18.5]
In [12]:
# square each residual and divide by n = 4 to get mean squared error terms
resid_1_sq = [(r**2)/4 for r in resid_1]
resid_2_sq = [(r**2)/4 for r in resid_2]
In [13]:
np.sqrt(sum(resid_1_sq))
Out[13]:
15.089317413322579
In [14]:
np.sqrt(sum(resid_2_sq))
Out[14]:
15.596473960482221

Thus, the first line \(y_1\) has the smaller RMSE and is the better fit for the data.

Deriving the Equation of the Line

In general, we have some line of best fit \(y\) given by:

\[y = a + bx\]

Given a set of points \((x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)\), we need to minimize the sum of the squared residuals, which is given by:

\[[y_1 - (a + bx_1)]^2 + [y_2 - (a + bx_2)]^2 + [y_3 - (a + bx_3)]^2 + ...\]

which we can rewrite in summation notation as

\[\sum_{i=1}^n[y_i - (a + bx_i)]^2\]

We can consider this as a function in terms of the variable \(a\) that we are seeking to minimize.

\[g(a) = \sum_{i=1}^n[y_i - (a + bx_i)]^2\]

From here, we can apply our familiar strategy of differentiating the function and locating the critical values. Since the derivative of a sum is the sum of the derivatives, we have

\[g'(a) = \sum_{i=1}^n \frac{d}{da}[y_i - (a + bx_i)]^2\]
\[g'(a) = \sum_{i=1}^n 2[y_i -a - bx_i](-1)\]
\[g'(a) = -2 \left[\sum_{i = 1}^n y_i - na - b\sum_{i=1}^n x_i\right]\]

Setting this equal to zero and solving for \(a\) we get

\[a = \frac{1}{n} \sum_{i=1}^n y_i - b\frac{1}{n} \sum_{i=1}^n x_i\]

The terms should be familiar as averages, and we can rewrite our equation as

\[a = \bar{y} - b \bar{x}\]

We now use this to investigate a similar function in terms of \(b\) to complete our solution.

\[f(b) = \sum_{i=1}^n[y_i - (\bar{y} + b(x_i - \bar{x}))]^2\]

Differentiating with respect to \(b\), setting the result equal to zero, and solving, we end up with

\[b = \frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\]
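As a quick check on this derivation, we can have SymPy set both partial derivatives of the sum of squared residuals to zero and solve the resulting system. This is only a sketch using the cigarette data from above (a_sym and b_sym are symbols introduced just for this check), and the values it returns should agree with the manual computation in the cells that follow.

a_sym, b_sym = sy.symbols('a b')
cigs = [250, 300, 350, 470]
death = [95, 120, 165, 170]

# the sum of squared residuals as a symbolic function of the intercept and slope
S = sum((y_i - (a_sym + b_sym*x_i))**2 for x_i, y_i in zip(cigs, death))

# set both partial derivatives to zero and solve the resulting normal equations
sy.solve([sy.diff(S, a_sym), sy.diff(S, b_sym)], (a_sym, b_sym))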

Let’s return to the problem of cigarette consumption and test our work out by manually computing \(a\) and \(b\).

In [15]:
cigs = [250, 300, 350, 470]
death = [95, 120, 165, 170]

plt.scatter(cigs, death)
plt.title("Cigarette Consumption vs. Deaths")
Out[15]:
<matplotlib.text.Text at 0x110677710>
_images/1.9_calc_regression_22_1.png
In [16]:
ybar = np.mean(death)
xbar = np.mean(cigs)
ydiff = np.array(death) - ybar  # use arrays so the differences broadcast elementwise
xdiff = np.array(cigs) - xbar

b = np.sum(ydiff*xdiff)/np.sum(xdiff**2)
a = ybar - b*xbar
a, b
Out[16]:
(21.621368322399249, 0.33833177132146203)
In [17]:
l = [a + b*i for i in cigs]
plt.scatter(cigs, death)
plt.plot(cigs, l, '--k', label = 'Line of Best Fit')
plt.title("Manual Fit of Data")
plt.legend(loc = 'best', frameon = False)
Out[17]:
<matplotlib.legend.Legend at 0x110898860>
_images/1.9_calc_regression_24_1.png

We can check with numpy and see if our values for \(a\) and \(b\) agree with the computer model.

In [18]:
# np.polyfit returns the slope first, then the intercept
b2, a2 = np.polyfit(cigs, death, 1)
a2, b2
Out[18]:
(21.621368322399242, 0.33833177132146203)

Finally, we can compute the RMSE for our fitted line. The cells below carry out the computation step by step, and a small helper function follows them.

In [19]:
l
Out[19]:
[106.20431115276476, 123.12089971883786, 140.03748828491098, 180.6373008434864]
In [20]:
death
Out[20]:
[95, 120, 165, 170]
In [21]:
diff = [(l[i] - death[i])**2 for i in range(len(l))]
diff
Out[21]:
[125.53658840796878,
 9.7400150550422442,
 623.12699112595669,
 113.15216923483649]
In [22]:
np.sqrt(np.sum(diff)/len(l))
Out[22]:
14.761061647318972
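The cells above carry out the RMSE calculation step by step. A small helper that wraps the same steps might look like the following; rmse is a name introduced here for illustration rather than a function used elsewhere in this notebook.

def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual)**2))

rmse(l, death)  # should reproduce the value computed above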

Other Situations

Our goal with regression is to identify situations where regression makes sense, fit models, and discuss how reasonable the model is for describing the data. Data do not always follow a linear pattern, however.

We can easily generate sample data for familiar curves. First, we can make some values from a polynomial, then add some noise to these, fit a model with np.polyfit(), and plot the results, as sketched below.

Non-Linear Functions

Plotting and fitting non-linear functions follows a similar pattern; however, we need to take into consideration the nature of the function. First, if we see something following a polynomial pattern, we can use whatever degree polynomial fit we believe is appropriate. The derivation of these formulas follows the same structure as the linear case, except the line \(a + bx_i\) is replaced with a polynomial \(a + bx_i + cx_i^2 + \cdots\).
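For example, here is a small sketch along the lines described above: we sample a quadratic, add normal noise, and recover a degree 2 fit with np.polyfit. The coefficients and noise level are chosen arbitrarily for illustration.

np.random.seed(0)  # make the noise reproducible

# sample a known quadratic and add some noise
x = np.linspace(-3, 3, 40)
y_true = 2*x**2 - x + 1
y_noisy = y_true + np.random.normal(scale=2.0, size=x.shape)

# fit a degree 2 polynomial to the noisy data and evaluate it
coeffs = np.polyfit(x, y_noisy, 2)
y_fit = np.polyval(coeffs, x)

plt.scatter(x, y_noisy, alpha=0.6, label='Noisy data')
plt.plot(x, y_fit, '--k', label='Degree 2 fit')
plt.legend(loc='best', frameon=False)
plt.title("Fitting a Polynomial to Noisy Data")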

If we believe there to be an exponential fit, we can transform this into a linear situation using the logarithm. For example, suppose we have the following population data.

Decade \(t\) Year Population (millions)
0 1780 2.8
1 1790 3.9
2 1800 5.3
3 1810 7.2

If we examine the data, we see an exponential-like trend. If we use NumPy to take the logarithm of the population values and plot the result, we note the transformed data's similarity to a linear function.

In [23]:
t = np.arange(0,13)
year = np.arange(1780,1910,10)
P = [2.8, 3.9, 5.3, 7.2, 9.6, 12.9, 17.1, 23.2, 31.4, 39.8, 50.2, 62.9, 76.0]
In [24]:
plt.figure(figsize = (8,5))
plt.subplot(1, 2, 1)
plt.scatter(year, P,color = 'red', alpha = 0.7)
plt.xlabel('Year')
plt.ylabel('Population')
plt.title("Original Data")

plt.subplot(1, 2, 2)
lnP = np.log(P)
plt.scatter(year, lnP, color = 'green', alpha = 0.4)
plt.xlabel('Year')
plt.ylabel('Logarithm of Population')
plt.title("Transformed Data")
Out[24]:
<matplotlib.text.Text at 0x110a026d8>
_images/1.9_calc_regression_35_1.png

Symbolically, we would imagine the original function as an exponential of the form

\[y = ae^{bx}\]

The expression can be explored in a similar manner; we use SymPy to see the effect of taking the logarithm of both sides.

In [25]:
y, a, b, x = sy.symbols('y a b x')
In [26]:
eq = sy.Eq(y, a*sy.exp(b*x))
In [27]:
sy.expand_log(sy.log(b**x))
Out[27]:
log(b**x)
In [28]:
sy.expand_log(sy.log(a*sy.exp(b*x)), force = True)
Out[28]:
b*x + log(a)

Hence, we have that

\[\log(y) = bx + \log(a)\]

which should look like our familiar linear equations. Here, we can find \(a\) and \(b\), then convert the equation back to its original form by undoing the logarithm with the exponential.
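Concretely, once we fit a line to the transformed data, the slope is \(b\) and the intercept is \(\log(a)\), so exponentiating the intercept recovers \(a\). The sketch below uses the year and P values defined above; the names b_exp, a_exp, and P_model are introduced here just for illustration.

# fit the transformed data: slope is b, intercept is log(a)
b_exp, log_a = np.polyfit(year, np.log(P), 1)
a_exp = np.exp(log_a)  # undo the logarithm to recover a

def P_model(x):
    """Exponential model y = a * e^(b x) recovered from the linear fit."""
    return a_exp * np.exp(b_exp * x)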

For kicks, we introduce the SciPy linregress function. Feel free to examine the help documentation for the function. This gives a little more information about the model than the polyfit function. Further, we add text to the plot to display information about the model.

In [29]:
line = np.polyfit(year, lnP, 1)
fit = np.polyval(line, year)
alpha, beta, r_value, p_value, std_err = stats.linregress(year, lnP)  # slope, intercept, correlation, p-value, standard error
alpha, beta, r_value
Out[29]:
(0.027906119028040688, -48.55083021891685, 0.99815691149842678)
In [30]:
fig = plt.figure(figsize = (10,5))

# Left panel: the linear fit on the transformed (log) data
ax1 = fig.add_subplot(121)
ax1.plot(year, fit)
ax1.plot(year, lnP, 'o', markersize = 7, alpha = 0.8)
text_string = "\nCorrelation coefficient: %f" % (r_value)
ax1.text(0.022, 0.972, text_string, transform=ax1.transAxes, verticalalignment='top',  fontsize=8)
ax1.set_title("Results of Linear Regression on ln P", fontsize = 9)
ax1.set_xlabel("Year", fontsize = 8)
ax1.set_ylabel("ln Population")

# Right panel: the same fit returned to exponential form against the original data
ax2 = fig.add_subplot(122)
ax2.plot(year, np.exp(fit))
ax2.plot(year, P, 'o', markersize = 7, alpha = 0.8)
ax2.text(0.022, 0.972, text_string, transform=ax2.transAxes, verticalalignment='top',  fontsize=8)
ax2.set_title("Results after returning to exponential form", fontsize = 9)
ax2.set_xlabel("Year", fontsize = 8)
ax2.set_ylabel("Population")
Out[30]:
<matplotlib.text.Text at 0x110c1c748>
_images/1.9_calc_regression_44_1.png

Logistic Example

As you see above, toward the end of the data the actual and predicted values begin to diverge. Considering the context, this makes sense: a population should level off at some maximum due to limited physical resources. A more appropriate S-shaped curve is the logistic function, which is given by

\[y = \frac{L}{1 + e^{a+bx}}\]
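To see the S shape, we can plot a generic logistic curve for arbitrarily chosen values of \(L\), \(a\), and \(b\) (with \(b < 0\) so the curve rises toward \(L\)); the values below are placeholders, not fitted to any data.

xs_logistic = np.linspace(-10, 10, 200)
L_demo, a_demo, b_demo = 1.0, 0.0, -1.0  # arbitrary values for illustration
plt.plot(xs_logistic, L_demo / (1 + np.exp(a_demo + b_demo*xs_logistic)))
plt.title("A Generic Logistic Curve")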

As an example, consider the Intercontinental Ballistic Missile (ICBM) data for 1960-1969.

Year Number of ICBMs
1960 18
1961 63
1962 294
1963 424
1964 834
1965 854
1966 904
1967 1054
1968 1054
1969 1054
In [31]:
year = [i for i in np.arange(1960, 1970, 1)]
icbm = [18, 63, 294, 424, 834, 854, 904, 1054, 1054, 1054]
In [32]:
plt.scatter(year, icbm)
Out[32]:
<matplotlib.collections.PathCollection at 0x1107ab748>
_images/1.9_calc_regression_47_1.png
In [33]:
L, y, a, b, x = sy.symbols('L y a b x')
In [34]:
exp = sy.Eq(y, L/(1 + sy.exp(a + b*x)))
In [35]:
sy.solve(exp, (a + b*x),  force = True)
Out[35]:
[log((L - y)/y)]

This means that the transformation that linearizes our data is

\[\log\left(\frac{L - y}{y}\right)\]

The value \(L\) is defined as the carrying capacity of the model. Here, it seems something like \(L = 1060\) would be a reasonable value to try.

In [36]:
t_icbm = [np.log((1060 - i)/i) for i in icbm]
In [37]:
plt.scatter(year, t_icbm)
Out[37]:
<matplotlib.collections.PathCollection at 0x111183cf8>
_images/1.9_calc_regression_53_1.png
In [38]:
b, a = np.polyfit(year, t_icbm, 1)
In [39]:
a, b
Out[39]:
(2091.7866057847828, -1.0653944179379069)
In [40]:
def l(x):
    return b*x + a

l(1960), l(1969)
Out[40]:
(3.6135466264850038, -5.9750031349558412)
In [41]:
fit = [l(i) for i in year]
In [42]:
plt.scatter(year, t_icbm)
plt.plot(year, fit)
plt.title("Fitting ICBM Data")
plt.xlabel("Year")
plt.ylabel("Transformed ICMB Data")
Out[42]:
<matplotlib.text.Text at 0x1111ed320>
_images/1.9_calc_regression_58_1.png

Much like the last example, we can return everything to its original form with the exponential. We arrive at the equation

\[y = \frac{1060}{1 + e^{2092 - 1.0654x}}\]
In [43]:
def y(x):
    return 1060/(1 + np.exp(2092 - 1.0654*x))

o_fit = [y(i) for i in year]
In [44]:
plt.scatter(year, icbm)
plt.plot(year, o_fit, '--k')
plt.title("ICBM Data and Logistic Fit")
plt.xlabel("Year")
plt.ylabel("ICMB's")
Out[44]:
<matplotlib.text.Text at 0x1111c0d30>
_images/1.9_calc_regression_61_1.png

Machine Learning Example

In [46]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
In [61]:
diabetes_X[:5]
Out[61]:
array([[ 0.06169621],
       [-0.05147406],
       [ 0.04445121],
       [-0.01159501],
       [-0.03638469]])
In [64]:
datasets?
In [53]:
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
In [60]:
plt.figure(figsize = (10,5))
plt.scatter(diabetes_X_train, diabetes_y_train, color = 'blue', alpha = 0.4, label = "Training Data")
plt.scatter(diabetes_X_test, diabetes_y_test, color = 'red', alpha = 0.5, label = "Testing Data")
plt.legend(frameon = False)
Out[60]:
<matplotlib.legend.Legend at 0x1123e16a0>
_images/1.9_calc_regression_67_1.png
In [65]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
In [66]:
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination (r2_score): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
Coefficients:
 [ 938.23786125]
Mean squared error: 2548.07
Variance score: 0.47
In [75]:
# Plot outputs
plt.figure(figsize = (7,4))
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black', alpha = 0.5, s = 9)
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue')
plt.title("The fit against the Testing Data")
Out[75]:
<matplotlib.text.Text at 0x112c6b748>
_images/1.9_calc_regression_70_1.png

Problems

In [123]:
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
column_names = iris.feature_names
In [134]:
print(iris.DESCR)
Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

In [125]:
data
Out[125]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5.4,  3.7,  1.5,  0.2],
       [ 4.8,  3.4,  1.6,  0.2],
       [ 4.8,  3. ,  1.4,  0.1],
       [ 4.3,  3. ,  1.1,  0.1],
       [ 5.8,  4. ,  1.2,  0.2],
       [ 5.7,  4.4,  1.5,  0.4],
       [ 5.4,  3.9,  1.3,  0.4],
       [ 5.1,  3.5,  1.4,  0.3],
       [ 5.7,  3.8,  1.7,  0.3],
       [ 5.1,  3.8,  1.5,  0.3],
       [ 5.4,  3.4,  1.7,  0.2],
       [ 5.1,  3.7,  1.5,  0.4],
       [ 4.6,  3.6,  1. ,  0.2],
       [ 5.1,  3.3,  1.7,  0.5],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5.2,  3.5,  1.5,  0.2],
       [ 5.2,  3.4,  1.4,  0.2],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 4.8,  3.1,  1.6,  0.2],
       [ 5.4,  3.4,  1.5,  0.4],
       [ 5.2,  4.1,  1.5,  0.1],
       [ 5.5,  4.2,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5. ,  3.2,  1.2,  0.2],
       [ 5.5,  3.5,  1.3,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 4.4,  3. ,  1.3,  0.2],
       [ 5.1,  3.4,  1.5,  0.2],
       [ 5. ,  3.5,  1.3,  0.3],
       [ 4.5,  2.3,  1.3,  0.3],
       [ 4.4,  3.2,  1.3,  0.2],
       [ 5. ,  3.5,  1.6,  0.6],
       [ 5.1,  3.8,  1.9,  0.4],
       [ 4.8,  3. ,  1.4,  0.3],
       [ 5.1,  3.8,  1.6,  0.2],
       [ 4.6,  3.2,  1.4,  0.2],
       [ 5.3,  3.7,  1.5,  0.2],
       [ 5. ,  3.3,  1.4,  0.2],
       [ 7. ,  3.2,  4.7,  1.4],
       [ 6.4,  3.2,  4.5,  1.5],
       [ 6.9,  3.1,  4.9,  1.5],
       [ 5.5,  2.3,  4. ,  1.3],
       [ 6.5,  2.8,  4.6,  1.5],
       [ 5.7,  2.8,  4.5,  1.3],
       [ 6.3,  3.3,  4.7,  1.6],
       [ 4.9,  2.4,  3.3,  1. ],
       [ 6.6,  2.9,  4.6,  1.3],
       [ 5.2,  2.7,  3.9,  1.4],
       [ 5. ,  2. ,  3.5,  1. ],
       [ 5.9,  3. ,  4.2,  1.5],
       [ 6. ,  2.2,  4. ,  1. ],
       [ 6.1,  2.9,  4.7,  1.4],
       [ 5.6,  2.9,  3.6,  1.3],
       [ 6.7,  3.1,  4.4,  1.4],
       [ 5.6,  3. ,  4.5,  1.5],
       [ 5.8,  2.7,  4.1,  1. ],
       [ 6.2,  2.2,  4.5,  1.5],
       [ 5.6,  2.5,  3.9,  1.1],
       [ 5.9,  3.2,  4.8,  1.8],
       [ 6.1,  2.8,  4. ,  1.3],
       [ 6.3,  2.5,  4.9,  1.5],
       [ 6.1,  2.8,  4.7,  1.2],
       [ 6.4,  2.9,  4.3,  1.3],
       [ 6.6,  3. ,  4.4,  1.4],
       [ 6.8,  2.8,  4.8,  1.4],
       [ 6.7,  3. ,  5. ,  1.7],
       [ 6. ,  2.9,  4.5,  1.5],
       [ 5.7,  2.6,  3.5,  1. ],
       [ 5.5,  2.4,  3.8,  1.1],
       [ 5.5,  2.4,  3.7,  1. ],
       [ 5.8,  2.7,  3.9,  1.2],
       [ 6. ,  2.7,  5.1,  1.6],
       [ 5.4,  3. ,  4.5,  1.5],
       [ 6. ,  3.4,  4.5,  1.6],
       [ 6.7,  3.1,  4.7,  1.5],
       [ 6.3,  2.3,  4.4,  1.3],
       [ 5.6,  3. ,  4.1,  1.3],
       [ 5.5,  2.5,  4. ,  1.3],
       [ 5.5,  2.6,  4.4,  1.2],
       [ 6.1,  3. ,  4.6,  1.4],
       [ 5.8,  2.6,  4. ,  1.2],
       [ 5. ,  2.3,  3.3,  1. ],
       [ 5.6,  2.7,  4.2,  1.3],
       [ 5.7,  3. ,  4.2,  1.2],
       [ 5.7,  2.9,  4.2,  1.3],
       [ 6.2,  2.9,  4.3,  1.3],
       [ 5.1,  2.5,  3. ,  1.1],
       [ 5.7,  2.8,  4.1,  1.3],
       [ 6.3,  3.3,  6. ,  2.5],
       [ 5.8,  2.7,  5.1,  1.9],
       [ 7.1,  3. ,  5.9,  2.1],
       [ 6.3,  2.9,  5.6,  1.8],
       [ 6.5,  3. ,  5.8,  2.2],
       [ 7.6,  3. ,  6.6,  2.1],
       [ 4.9,  2.5,  4.5,  1.7],
       [ 7.3,  2.9,  6.3,  1.8],
       [ 6.7,  2.5,  5.8,  1.8],
       [ 7.2,  3.6,  6.1,  2.5],
       [ 6.5,  3.2,  5.1,  2. ],
       [ 6.4,  2.7,  5.3,  1.9],
       [ 6.8,  3. ,  5.5,  2.1],
       [ 5.7,  2.5,  5. ,  2. ],
       [ 5.8,  2.8,  5.1,  2.4],
       [ 6.4,  3.2,  5.3,  2.3],
       [ 6.5,  3. ,  5.5,  1.8],
       [ 7.7,  3.8,  6.7,  2.2],
       [ 7.7,  2.6,  6.9,  2.3],
       [ 6. ,  2.2,  5. ,  1.5],
       [ 6.9,  3.2,  5.7,  2.3],
       [ 5.6,  2.8,  4.9,  2. ],
       [ 7.7,  2.8,  6.7,  2. ],
       [ 6.3,  2.7,  4.9,  1.8],
       [ 6.7,  3.3,  5.7,  2.1],
       [ 7.2,  3.2,  6. ,  1.8],
       [ 6.2,  2.8,  4.8,  1.8],
       [ 6.1,  3. ,  4.9,  1.8],
       [ 6.4,  2.8,  5.6,  2.1],
       [ 7.2,  3. ,  5.8,  1.6],
       [ 7.4,  2.8,  6.1,  1.9],
       [ 7.9,  3.8,  6.4,  2. ],
       [ 6.4,  2.8,  5.6,  2.2],
       [ 6.3,  2.8,  5.1,  1.5],
       [ 6.1,  2.6,  5.6,  1.4],
       [ 7.7,  3. ,  6.1,  2.3],
       [ 6.3,  3.4,  5.6,  2.4],
       [ 6.4,  3.1,  5.5,  1.8],
       [ 6. ,  3. ,  4.8,  1.8],
       [ 6.9,  3.1,  5.4,  2.1],
       [ 6.7,  3.1,  5.6,  2.4],
       [ 6.9,  3.1,  5.1,  2.3],
       [ 5.8,  2.7,  5.1,  1.9],
       [ 6.8,  3.2,  5.9,  2.3],
       [ 6.7,  3.3,  5.7,  2.5],
       [ 6.7,  3. ,  5.2,  2.3],
       [ 6.3,  2.5,  5. ,  1.9],
       [ 6.5,  3. ,  5.2,  2. ],
       [ 6.2,  3.4,  5.4,  2.3],
       [ 5.9,  3. ,  5.1,  1.8]])
In [126]:
import pandas as pd

df = pd.DataFrame(data)
In [127]:
df.columns = column_names
In [128]:
df.head()
Out[128]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [129]:
df.iloc[:, [1]].head()
Out[129]:
sepal width (cm)
0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
In [130]:
plt.figure()
plt.scatter(df.iloc[:,1], df.iloc[:,0])  # sepal width (column 1) on x, sepal length (column 0) on y, matching the labels below
plt.xlabel("Sepal Width")
plt.ylabel("Sepal Length")
plt.title("Relationship between Sepal Length and Width \nFrom the sklearn Iris Data", loc = 'right', fontsize = 12)

Out[130]:
<matplotlib.text.Text at 0x11b63c080>
_images/1.9_calc_regression_79_1.png
In [131]:
df.describe()
Out[131]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
In [132]:
from sklearn.datasets import load_boston
boston = load_boston()
bdata = boston.data
bcolumn_names = boston.feature_names
In [133]:
print(boston.DESCR)
Boston House Prices dataset

Notes
------
Data Set Characteristics:

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive

    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

In [117]:
df2 = pd.DataFrame(bdata)
In [118]:
df2.columns = bcolumn_names
In [119]:
df2.head()
Out[119]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33