Data Mining

While working on the assignment:

# Import required packages for this chapter
from pathlib import Path
import pandas as pd
import numpy as np
from pandas.plotting import parallel_coordinates
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pylab as plt
%matplotlib inline

Problem 15.1: University Rankings

 

The dataset on American College and University Rankings contains information on 1302 American colleges and universities offering an undergraduate program. For each university, there are 17 measurements, including continuous measurements (such as tuition and graduation rate) and categorical measurements (such as location by state and whether it is a private or public school).

Note that many records are missing some measurements. Our first goal is to estimate these missing values from "similar" records. This will be done by clustering the complete records and then finding the closest cluster for each of the partial records. The missing values will be imputed from the information in that cluster.
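The imputation idea described above can be sketched on toy numbers. This is only an illustration under stated assumptions, not the assignment's solution: each cluster is summarized by its centroid (mean of the normalized features), closeness is Euclidean distance over the features a partial record actually has, and the function name `impute_from_nearest_centroid` and the data are hypothetical.

```python
import numpy as np

def impute_from_nearest_centroid(record, centroids):
    """Fill NaNs in `record` with the values of the closest centroid.

    Assumption: distance is Euclidean, computed only over the
    features that are present in the record.
    """
    record = np.asarray(record, dtype=float)
    observed = ~np.isnan(record)
    # Distance from the record to every centroid, observed features only
    diffs = centroids[:, observed] - record[observed]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = int(np.argmin(dists))
    # Copy the record and fill only the missing positions
    filled = record.copy()
    filled[~observed] = centroids[nearest, ~observed]
    return filled, nearest

# Hypothetical toy data: two cluster centroids over three normalized features
centroids = np.array([[0.0, 0.0, 0.0],
                      [5.0, 5.0, 5.0]])
partial = [4.5, np.nan, 5.5]

filled, cluster_id = impute_from_nearest_centroid(partial, centroids)
print(filled, cluster_id)
```

Using the centroid is one choice among several; a median, or the values of the single nearest complete record within the cluster, would follow the same pattern.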

 

15.1.a

Remove all records with missing measurements from the dataset.

 

# Load data
data = pd.read_csv('Universities.csv')
shape = data.shape
print("Before:", shape)
# Drop every record that has at least one missing measurement
data = data.dropna()
print("After:", data.shape)
data.head()
Before: (1302, 20)
After: (471, 20)
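Before dropping rows, it can help to see which columns account for the missing values. The sketch below does this on a small made-up frame; the column names `tuition` and `grad_rate` and all values are hypothetical and are not taken from Universities.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    'College Name': ['A', 'B', 'C', 'D'],
    'tuition': [10000.0, np.nan, 12000.0, 9000.0],
    'grad_rate': [60.0, 70.0, np.nan, 80.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Keep only the rows with no missing value in any column
complete = toy.dropna()

print(missing_per_column)
print(len(toy), '->', len(complete))
```

The same two calls, `isna().sum()` and `dropna()`, apply unchanged to the `data` frame loaded above.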

 


15.1.b

For all the continuous measurements, run hierarchical clustering using complete linkage and Euclidean distance. Make sure to normalize the measurements. From the dendrogram: How many clusters seem reasonable for describing these data?

 

 

# Reduce to continuous measurements and normalize data
data.drop("State", axis=1, inplace=True)
data.drop("Public (1)/ Private (2)", axis=1, inplace=True)
data.set_index('College Name', inplace=True)
# z-score each column: subtract its mean, divide by its standard deviation
data_norm = (data - data.mean()) / data.std()
data_norm.head()


# Complete-linkage hierarchical clustering on Euclidean distances
Z = linkage(data_norm, method='complete', metric='euclidean')
fig = plt.figure(figsize=(15, 10))
fig.subplots_adjust(bottom=0.23)
plt.title('Hierarchical Clustering Dendrogram (Complete linkage)')
plt.xlabel('College Name')
dendrogram(Z, labels=data_norm.index, color_threshold=2)
# Dashed horizontal reference line at height 20
plt.axhline(y=20, color='black', linewidth=0.5, linestyle='dashed')
plt.show()

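One way to turn a visual reading of a dendrogram into a number is to cut the tree at a chosen height with `fcluster` and count the resulting labels. The sketch below uses made-up 2-D points rather than the university data; `Z_toy`, `cut_height`, and the three group centers are hypothetical values picked so the answer is easy to check.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three tight, well-separated groups of 2-D points
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# Same linkage settings as in the text: complete linkage, Euclidean distance
Z_toy = linkage(points, method='complete', metric='euclidean')

# Cut the tree at a chosen height: merges above this distance are undone
cut_height = 5.0
labels = fcluster(Z_toy, t=cut_height, criterion='distance')

# fcluster labels start at 1, so drop the unused zero bin
n_clusters = len(np.unique(labels))
sizes = np.bincount(labels)[1:]
print(n_clusters, sizes)
```

For the university data, the same call on `Z` with `t` set to a height read off the dendrogram would give the cluster count and sizes at that cut; which height is reasonable remains a judgment call.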