This example shows how to visualize multivariate data using various statistical plots. Many statistical analyses involve only two variables: a predictor variable and a response variable. Such data are easy to visualize using 2D scatter plots, bivariate histograms, boxplots, etc. It's also possible to visualize trivariate data with 3D scatter plots, or 2D scatter plots with a third variable encoded with, for example color. However, many datasets involve a larger number of variables, making direct visualization more difficult. This example explores some of the ways to visualize high-dimensional data in MATLAB®, using Statistics and Machine Learning Toolbox™.
In this example, we'll use the carbig
dataset, a dataset that contains various measured variables for about 400 automobiles from the 1970's and 1980's. We'll illustrate multivariate visualization using the values for fuel efficiency (in miles per gallon, MPG), acceleration (time from 0-60MPH in sec), engine displacement (in cubic inches), weight, and horsepower. We'll use the number of cylinders to group observations.
load carbig X = [MPG,Acceleration,Displacement,Weight,Horsepower]; varNames = {'MPG'; 'Acceleration'; 'Displacement'; 'Weight'; 'Horsepower'};
Viewing slices through lower dimensional subspaces is one way to partially work around the limitation of two or three dimensions. For example, we can use the gplotmatrix
function to display an array of all the bivariate scatter plots between our five variables, along with a univariate histogram for each variable.
figure gplotmatrix(X,[],Cylinders,['c' 'b' 'm' 'g' 'r'],[],[],false); text([.08 .24 .43 .66 .83], repmat(-.1,1,5), varNames, 'FontSize',8); text(repmat(-.12,1,5), [.86 .62 .41 .25 .02], varNames, 'FontSize',8, 'Rotation',90);
The points in each scatter plot are color-coded by the number of cylinders: blue for 4 cylinders, green for 6, and red for 8. There is also a handful of 5 cylinder cars, and rotary-engined cars are listed as having 3 cylinders. This array of plots makes it easy to pick out patterns in the relationships between pairs of variables. However, there may be important patterns in higher dimensions, and those are not easy to recognize in this plot.
The scatter plot matrix only displays bivariate relationships. However, there are other alternatives that display all the variables together, allowing you to investigate higher-dimensional relationships among variables. The most straight-forward multivariate plot is the parallel coordinates plot. In this plot, the coordinate axes are all laid out horizontally, instead of using orthogonal axes as in the usual Cartesian graph. Each observation is represented in the plot as a series of connected line segments. For example, we can make a plot of all the cars with 4, 6, or 8 cylinders, and color observations by group.
Cyl468 = ismember(Cylinders,[4 6 8]); parallelcoords(X(Cyl468,:), 'group',Cylinders(Cyl468), ... 'standardize','on', 'labels',varNames)
The horizontal direction in this plot represents the coordinate axes, and the vertical direction represents the data. Each observation consists of measurements on five variables, and each measurement is represented as the height at which the corresponding line crosses each coordinate axis. Because the five variables have widely different ranges, this plot was made with standardized values, where each variable has been standardized to have zero mean and unit variance. With the color coding, the graph shows, for example, that 8 cylinder cars typically have low values for MPG and acceleration, and high values for displacement, weight, and horsepower.
Even with color coding by group, a parallel coordinates plot with a large number of observations can be difficult to read. We can also make a parallel coordinates plot where only the median and quartiles (25% and 75% points) for each group are shown. This makes the typical differences and similarities among groups easier to distinguish. On the other hand, it may be the outliers for each group that are most interesting, and this plot does not show them at all.
parallelcoords(X(Cyl468,:), 'group',Cylinders(Cyl468), ... 'standardize','on', 'labels',varNames, 'quantile',.25)
Another similar type of multivariate visualization is the Andrews plot. This plot represents each observation as a smooth function over the interval [0,1].
andrewsplot(X(Cyl468,:), 'group',Cylinders(Cyl468), 'standardize','on')
Each function is a Fourier series, with coefficients equal to the corresponding observation's values. In this example, the series has five terms: a constant, two sine terms with periods 1 and 1/2, and two similar cosine terms. Effects on the functions' shapes due to the three leading terms are the most apparent in an Andrews plot, so patterns in the first three variables tend to be the ones most easily recognized.
There's a distinct difference between groups at t = 0, indicating that the first variable, MPG, is one of the distinguishing features between 4, 6, and 8 cylinder cars. More interesting is the difference between the three groups at around t = 1/3. Plugging this value into the formula for the Andrews plot functions, we get a set of coefficients that define a linear combination of the variables that distinguishes between groups.
t1 = 1/3; [1/sqrt(2) sin(2*pi*t1) cos(2*pi*t1) sin(4*pi*t1) cos(4*pi*t1)]
ans = 0.7071 0.8660 -0.5000 -0.8660 -0.5000
From these coefficients, we can see that one way to distinguish 4 cylinder cars from 8 cylinder cars is that the former have higher values of MPG and acceleration, and lower values of displacement, horsepower, and particularly weight, while the latter have the opposite. That's the same conclusion we drew from the parallel coordinates plot.
Another way to visualize multivariate data is to use "glyphs" to represent the dimensions. The function glyphplot
supports two types of glyphs: stars, and Chernoff faces. For example, here is a star plot of the first 9 models in the car data. Each spoke in a star represents one variable, and the spoke length is proportional to the value of that variable for that observation.
h = glyphplot(X(1:9,:), 'glyph','star', 'varLabels',varNames, 'obslabels',Model(1:9,:)); set(h(:,3),'FontSize',8);
In a live MATLAB figure window, this plot would allow interactive exploration of the data values, using data cursors. For example, clicking on the right-hand point of the star for the Ford Torino would show that it has an MPG value of 17.
Plotting stars on a grid, with no particular order, can lead to a figure that is confusing, because adjacent stars can end up quite different-looking. Thus, there may be no smooth pattern for the eye to catch. It's often useful to combine multidimensional scaling (MDS) with a glyph plot. To illustrate, we'll first select all cars from 1977, and use the zscore
function to standardize each of the five variables to have zero mean and unit variance. Then we'll compute the Euclidean distances among those standardized observations as a measure of dissimilarity. This choice might be too simplistic in a real application, but serves here for purposes of illustration.
models77 = find((Model_Year==77)); dissimilarity = pdist(zscore(X(models77,:)));
Finally, we use mdscale
to create a set of locations in two dimensions whose interpoint distances approximate the dissimilarities among the original high-dimensional data, and plot the glyphs using those locations. The distances in this 2D plot may only roughly reproduce the data, but for this type of plot, that's good enough.
Y = mdscale(dissimilarity,2); glyphplot(X(models77,:), 'glyph','star', 'centers',Y, ... 'varLabels',varNames, 'obslabels',Model(models77,:), 'radius',.5); title('1977 Model Year');
In this plot, we've used MDS as dimension reduction method, to create a 2D plot. Normally that would mean a loss of information, but by plotting the glyphs, we have incorporated all of the high-dimensional information in the data. The purpose of using MDS is to impose some regularity to the variation in the data, so that patterns among the glyphs are easier to see.
Just as with the previous plot, interactive exploration would be possible in a live figure window.
Another type of glyph is the Chernoff face. This glyph encodes the data values for each observation into facial features, such as the size of the face, the shape of the face, position of the eyes, etc.
glyphplot(X(models77,:), 'glyph','face', 'centers',Y, ... 'varLabels',varNames, 'obslabels',Model(models77,:)); title('1977 Model Year');
Here, the two most apparent features, face size and relative forehead/jaw size, encode MPG and acceleration, while the forehead and jaw shape encode displacement and weight. Width between eyes encodes horsepower. It's notable that there are few faces with wide foreheads and narrow jaws, or vice-versa, indicating positive linear correlation between the variables displacement and weight. That's also what we saw in the scatter plot matrix.
The correspondence of features to variables determines what relationships are easiest to see, and glyphplot
allows the choice to be changed easily.
close
Matlabsolutions.com provides guaranteed satisfaction with a
commitment to complete the work within time. Combined with our meticulous work ethics and extensive domain
experience, We are the ideal partner for all your homework/assignment needs. We pledge to provide 24*7 support
to dissolve all your academic doubts. We are composed of 300+ esteemed Matlab and other experts who have been
empanelled after extensive research and quality check.
Matlabsolutions.com provides undivided attention to each Matlab
assignment order with a methodical approach to solution. Our network span is not restricted to US, UK and Australia rather extends to countries like Singapore, Canada and UAE. Our Matlab assignment help services
include Image Processing Assignments, Electrical Engineering Assignments, Matlab homework help, Matlab Research Paper help, Matlab Simulink help. Get your work
done at the best price in industry.
Desktop Basics - MATLAB & Simulink
Array Indexing - MATLAB & Simulink
Workspace Variables - MATLAB & Simulink
Text and Characters - MATLAB & Simulink
Calling Functions - MATLAB & Simulink
2-D and 3-D Plots - MATLAB & Simulink
Programming and Scripts - MATLAB & Simulink
Help and Documentation - MATLAB & Simulink
Creating, Concatenating, and Expanding Matrices - MATLAB & Simulink
Removing Rows or Columns from a Matrix
Reshaping and Rearranging Arrays
Add Title and Axis Labels to Chart
Change Color Scheme Using a Colormap
How Surface Plot Data Relates to a Colormap
How Image Data Relates to a Colormap
Time-Domain Response Data and Plots
Time-Domain Responses of Discrete-Time Model
Time-Domain Responses of MIMO Model
Time-Domain Responses of Multiple Models
Introduction: PID Controller Design
Introduction: Root Locus Controller Design
Introduction: Frequency Domain Methods for Controller Design
DC Motor Speed: PID Controller Design
DC Motor Position: PID Controller Design
Cruise Control: PID Controller Design
Suspension: Root Locus Controller Design
Aircraft Pitch: Root Locus Controller Design
Inverted Pendulum: Root Locus Controller Design
Get Started with Deep Network Designer
Create Simple Image Classification Network Using Deep Network Designer
Build Networks with Deep Network Designer
Classify Image Using GoogLeNet
Classify Webcam Images Using Deep Learning
Transfer Learning with Deep Network Designer
Train Deep Learning Network to Classify New Images
Deep Learning Processor Customization and IP Generation
Prototype Deep Learning Networks on FPGA
Deep Learning Processor Architecture
Deep Learning INT8 Quantization
Quantization of Deep Neural Networks
Custom Processor Configuration Workflow
Estimate Performance of Deep Learning Network by Using Custom Processor Configuration
Preprocess Images for Deep Learning
Preprocess Volumes for Deep Learning
Transfer Learning Using AlexNet
Time Series Forecasting Using Deep Learning
Create Simple Sequence Classification Network Using Deep Network Designer
Train Classification Models in Classification Learner App
Train Regression Models in Regression Learner App
Explore the Random Number Generation UI
Logistic regression create generalized linear regression model - MATLAB fitglm 2
Support Vector Machines for Binary Classification
Support Vector Machines for Binary Classification 2
Support Vector Machines for Binary Classification 3
Support Vector Machines for Binary Classification 4
Support Vector Machines for Binary Classification 5
Assess Neural Network Classifier Performance
Discriminant Analysis Classification
Train Generalized Additive Model for Binary Classification
Train Generalized Additive Model for Binary Classification 2
Classification Using Nearest Neighbors
Classification Using Nearest Neighbors 2
Classification Using Nearest Neighbors 3
Classification Using Nearest Neighbors 4
Classification Using Nearest Neighbors 5
Gaussian Process Regression Models
Gaussian Process Regression Models 2
Understanding Support Vector Machine Regression
Extract Voices from Music Signal
Align Signals with Different Start Times
Find a Signal in a Measurement
Extract Features of a Clock Signal
Filtering Data With Signal Processing Toolbox Software
Find Periodicity Using Frequency Analysis
Find and Track Ridges Using Reassigned Spectrogram
Classify ECG Signals Using Long Short-Term Memory Networks
Waveform Segmentation Using Deep Learning
Label Signal Attributes, Regions of Interest, and Points
Introduction to Streaming Signal Processing in MATLAB
Filter Frames of a Noisy Sine Wave Signal in MATLAB
Filter Frames of a Noisy Sine Wave Signal in Simulink
Lowpass Filter Design in MATLAB
Tunable Lowpass Filtering of Noisy Input in Simulink
Signal Processing Acceleration Through Code Generation
Signal Visualization and Measurements in MATLAB
Estimate the Power Spectrum in MATLAB
Design of Decimators and Interpolators
Multirate Filtering in MATLAB and Simulink