So the other day I was making a class project in MATLAB for a "group" and I dumbly called my file group.
BAD KS. Never name your files after a MATLAB command. Little did I know I would need that command the next day.
What does "group" do?
It's great.
Recently I've been working on a carbon project where I've been asked to subset 10332 points (10077 of these points have consistent data) into two groups, one with about 1480 members and the other with the remainder. The smaller group is representative of a certain carbon pattern and the larger group representative of another pattern.
Well, the other day I needed to perform PCA on my subsets. Performing the PCAs separately would have defeated the point, because I wanted to know about whole-site characteristics, but I still wanted to identify the members of each group within the whole site.
In short, the output I wanted was a single plot of the whole-site PCA with the two groups in different colors:
Look closely and you'll see the red peeking out-- that's the smaller set, and what's important is that it's peeking out in that left corner there where the blue is not, indicating that it is driven by low values of components 1 and 2, whereas the blue is driven primarily by component 2 which is somewhat insensitive to component 1 (blue bar across middle). I think...I'm still interpreting this one-- unfortunately PCA is very useful for visualization but hard to interpret without more data than we have re. other factors.
Anyway, how is this done?
It's not too hard...
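First, import your data and name your columns. Standardize each variable by dividing by its standard deviation. princomp subtracts the column means for you, so this step just puts every variable on the same scale, which matters here because elevation, wind, and slope are measured in different units.
Then call PCA as
[coeff, scores, latent] = princomp(your matrix)
Also label your variables!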
See below to start:
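% Scale each variable (column 2 of each sorted table) by its standard deviation
avgdem = sortdem(:,2)./std(sortdem(:,2));
avgwind = sortwind(:,2)./std(sortwind(:,2));
avgslope = sortslope(:,2)./std(sortslope(:,2));
% Collect the three variables as the columns of one matrix
x = [avgdem, avgwind, avgslope];
% Run the PCA and keep the variable names for labeling the plot later
[coeff, scores, latent] = princomp(x);
vbls = {'elevation','wind','slope'};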
Okay, you got this.
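A side note on princomp: as far as I know, newer releases of the Statistics and Machine Learning Toolbox point you to pca instead, and its first three outputs line up with the ones used here. Below is a minimal sketch of the same steps with pca. The random matrix is only a made-up stand-in for the real elevation, wind, and slope columns, so treat it as an illustration rather than my actual workflow.

```matlab
% Hypothetical stand-in data: 200 points, 3 variables on very different scales
rng(0);
x = bsxfun(@times, randn(200, 3), [1 5 20]);

% Divide each column by its standard deviation (pca centers the columns itself)
x = bsxfun(@rdivide, x, std(x));

% Same three outputs as the princomp call
[coeff, scores, latent] = pca(x);
vbls = {'elevation','wind','slope'};
```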
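Now what you have is a nice PCA. The columns of coeff are your eigenvectors, that is, the coefficients (loadings) of each original variable on each principal component. Scores is your data projected onto those components, one row per point. Latent is the variance explained by each component (the eigenvalues), not a cumulative total. If you want to know the percent variation explained, divide latent by sum(latent); taking cumsum of that gives the running total.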
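The percent-explained arithmetic is short enough to show. This is a small sketch with a made-up latent vector (the numbers are hypothetical, not from my data); with a real PCA you would use the latent that comes back from the call above.

```matlab
% Hypothetical eigenvalues for three components
latent = [6; 3; 1];

% Percent of total variance explained by each component
pct = 100 * latent ./ sum(latent);   % 60, 30, 10

% Running total across components
cumpct = cumsum(pct);                % 60, 90, 100
```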
The code for making the plot is below.
biplot by itself is not very good for this scenario, because I want the two groups in different colors. So I plot the scores with scatter3, which gives more control, and then use biplot only to overlay the variable vectors. Note that my groups are taken from "scores", since those are what I am plotting here.
The 0.3 makes the dots really little. 'r.' and 'b.' make them red and blue.
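figure(3)
clf
% Split the scores by row: the first 1480 rows are the smaller group
% (this relies on the data having been sorted by group beforehand)
group1 = scores(1:1480,:);
group2 = scores(1481:10077,:);
hold on
% Number of points in each group (not used below, handy for checking)
N1 = size(group1,1);
N2 = size(group2,1);
% Plot each group's scores on the first three components
scatter3(group1(:,1),group1(:,2),group1(:,3),0.3,'r.')
scatter3(group2(:,1),group2(:,2),group2(:,3),0.3,'b.')
% Overlay the variable vectors, scaled by 4 so they show up against the scores
biplot(coeff(:,1:3)*4,'varlabels',vbls,'LineWidth',2,'Color',[0 0 0]);
hold off
title('LB and ND subsets')
xlabel('component1')
ylabel('component2')
zlabel('component3')
% Look straight down the component 3 axis (2-D view of components 1 and 2)
view(2)
legend('LB','ND')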
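One thing to watch: splitting by row range only works if the rows are already sorted by group. If they are not, a logical label vector does the same job. Here is a rough sketch, where isLB is a hypothetical label vector and the scores are random stand-ins, not my real data.

```matlab
% Hypothetical stand-ins: 20 points with scores on 3 components
rng(1);
scores = randn(20, 3);

% Hypothetical group labels: true for the LB points, false for the ND points
isLB = false(20, 1);
isLB([2 5 7 11 19]) = true;

% Pull each group's rows out of scores, wherever they sit in the matrix
group1 = scores(isLB, :);    % LB
group2 = scores(~isLB, :);   % ND
```

After that, group1 and group2 go into the scatter3 calls exactly as before.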
I hope that this helps you all out! It took me a while to figure out how to present this okay, but I'm happy with the results.
Well, happy with the graph itself, at least. The data, seriously... that data is mystery data.