Principal Component Analysis
By- Rounak Choudhary
Principal Component Analysis (PCA) is a statistical method that simplifies complex data by reducing the number of variables while preserving as much of the variation as possible. It works by identifying new variables, called principal components, which are linear combinations of the original traits. The components are uncorrelated with each other and are ordered by the amount of variation they capture, so the first few usually explain most of the variation in the dataset. For example, if you are studying raptors and have data on several traits such as wing length, body weight, and beak size, PCA can often reduce these to two or three components that summarize most of the differences between species. This makes it easier to visualize patterns, detect clusters or groups, and understand relationships in the data while minimizing redundancy. PCA is especially useful for large, multidimensional ecological or biological datasets.
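To make the idea concrete, here is a minimal sketch in base R. It uses prcomp() rather than the ade4 workflow used later in this document, and the trait values and species names are made up for illustration only.

```r
# Toy trait table: values and species names are invented for illustration
traits <- data.frame(
  wing_length = c(38, 41, 55, 60, 29, 33),
  body_weight = c(0.9, 1.1, 3.2, 4.0, 0.3, 0.4),
  beak_size   = c(2.1, 2.3, 3.8, 4.1, 1.4, 1.6),
  row.names   = paste0("species_", 1:6)
)

# Centre and scale each trait so that differences in units do not dominate
fit <- prcomp(traits, center = TRUE, scale. = TRUE)

# Proportion of the total variance captured by each component,
# in decreasing order
var_explained <- fit$sdev^2 / sum(fit$sdev^2)
print(round(var_explained, 3))

# Correlations between component scores: off-diagonal entries are ~0
print(round(cor(fit$x), 3))
```

The proportions always sum to 1, so reading them from left to right shows how many components are needed to cover most of the variation.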
Input data: a CSV file named 'raptor1'.
Prerequisites: PCA requires numeric data.
If your dataset includes categorical variables (such as habitat type, diet, or color), you have two options:
1. Exclude them from the PCA and use only the numeric traits.
2. Handle them with a technique suited to categorical or mixed data, such as:
   - One-hot encoding, which creates one binary column per category (e.g. "Forest = 1, Grassland = 0")
   - Gower distance combined with a PCA alternative such as PCoA (Principal Coordinates Analysis), which works on a distance matrix and therefore accepts mixed data types
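The two conversion routes can be sketched as follows. This is an illustration under assumptions, not part of the raptor workflow: the table and its 'habitat' column are hypothetical, one-hot encoding is done with base R's model.matrix(), and the Gower/PCoA step uses daisy() from the 'cluster' package together with base R's cmdscale(), and is skipped if 'cluster' is not installed.

```r
# Hypothetical mixed-type trait table (values are made up)
traits <- data.frame(
  wing_length = c(38, 41, 55, 60),
  habitat     = factor(c("Forest", "Grassland", "Forest", "Wetland")),
  row.names   = paste0("species_", 1:4)
)

# One-hot encoding: "~ habitat - 1" drops the intercept so that every
# habitat level gets its own 0/1 column
habitat_onehot <- model.matrix(~ habitat - 1, data = traits)

# Numeric-only table that could be passed on to a PCA function
traits_numeric <- cbind(traits["wing_length"], habitat_onehot)
print(traits_numeric)

# Alternative: Gower distance on the mixed table, followed by PCoA
# (classical multidimensional scaling) instead of PCA
if (requireNamespace("cluster", quietly = TRUE)) {
  gower_dist <- cluster::daisy(traits, metric = "gower")
  pcoa <- cmdscale(as.matrix(gower_dist), k = 2, eig = TRUE)
  print(pcoa$points)  # species coordinates on the first two PCoA axes
}
```

One-hot columns are binary, so they behave differently from continuous traits in a PCA; the Gower/PCoA route is often preferred when a large share of the traits is categorical.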
# Load the CSV file containing functional traits of raptors
# Set stringsAsFactors = FALSE to avoid automatic factor conversion
# Set row.names = 1 to use the first column as species names
PCA <- read.csv("C:/Users/Rounak Choudhary/Desktop/Raptor Trial mFD/raptor1.csv",
                header = TRUE, sep = ",", stringsAsFactors = FALSE, row.names = 1)
# Display the first 6 rows of the data to verify the structure
head(PCA)
# Load required libraries (uncomment install.packages() lines if not installed)
# install.packages("ade4") # For PCA and other multivariate methods
library(ade4)
# install.packages("vegan") # Tools for ecological analysis
library(vegan)
# install.packages("factoextra") # For visualizing PCA and clustering
library(factoextra)
# install.packages("FD") # For functional diversity metrics
library(FD)
# install.packages("tidyverse") # For data manipulation and visualization
library(tidyverse)
# install.packages("NbClust") # For determining optimal number of clusters
library(NbClust)
# install.packages("gtools") # General R programming utilities
library(gtools)
# Perform Principal Component Analysis (PCA) on the first 10 columns (traits) of the first 43 species
# By default, dudi.pca centers and scales each trait (center = TRUE, scale = TRUE)
# scannf = FALSE skips the interactive scree plot prompt
# nf = 2 means retain the first two principal components
mutlispca <- dudi.pca(PCA[1:43, c(1:10)], scannf = FALSE, nf = 2)
# Extract the PCA scores for each species (individuals) from the 'li' component
pca_scores <- mutlispca$li
# Set a random seed to make k-means clustering results reproducible
set.seed(123)
# Perform k-means clustering on the PCA scores with 3 clusters
clusters <- kmeans(pca_scores, centers = 3)$cluster
# Convert the numeric cluster labels to a factor for plotting
clusters <- as.factor(clusters)
# Create a PCA biplot with individuals colored by their cluster
# repel = TRUE prevents overlapping text labels
# col.var = "black" colors variable arrows black
# col.ind = clusters colors individuals by their cluster
# palette = "Dark2" specifies the color palette
# labelsize = 3 sets the size of the text labels
# addEllipses = TRUE draws ellipses around each cluster
# theme() adjusts the font size of the entire plot
fviz_pca_biplot(mutlispca, repel = TRUE,
                col.var = "black",
                col.ind = clusters,
                palette = "Dark2",
                labelsize = 3,
                addEllipses = TRUE) +
  theme(text = element_text(size = 15))
# Note:
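# If dudi.pca returns an error, first check that the selected columns are all numeric and contain no missing values (NA)
# If the problem persists, try reinstalling the 'ade4' and 'factoextra' packages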
# To change the input size, modify:
# - 'c(1:10)' to select different columns (traits)
# - '1:43' to include a different number of species
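The script fixes nf = 2 components and centers = 3 clusters. A minimal way to sanity-check both choices is sketched below in base R. It is an assumption-laden stand-in, not the author's procedure: it uses prcomp() instead of dudi.pca, and a random 43 x 10 matrix in place of the raptor traits, so the printed numbers say nothing about the real data.

```r
# Stand-in numeric trait matrix: random values, for illustration only
set.seed(123)
toy_traits <- matrix(rnorm(43 * 10), nrow = 43, ncol = 10)

# PCA with centering and scaling
fit <- prcomp(toy_traits, center = TRUE, scale. = TRUE)

# Cumulative share of variance captured by the leading components;
# the first two entries show what retaining two components preserves
cum_var <- cumsum(fit$sdev^2) / sum(fit$sdev^2)
print(round(cum_var[1:2], 3))

# Elbow check: total within-cluster sum of squares for k = 1..6,
# computed on the scores of the first two components
scores <- fit$x[, 1:2]
wss <- sapply(1:6, function(k) {
  kmeans(scores, centers = k, nstart = 10)$tot.withinss
})
print(round(wss, 2))
```

With the ade4 object from the script, the eigenvalues are stored in mutlispca$eig, so the same cumulative share can be computed as cumsum(mutlispca$eig) / sum(mutlispca$eig). The factoextra functions fviz_eig() and fviz_nbclust() give graphical versions of both checks.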