1 Introduction

Humans have been fascinated by the cosmos since the dawn of time. The ancient Greeks already proposed the existence of life on other planets, and 19th-century scientists (Carl Friedrich Gauss among others) contemplated ways of communicating with Martians and Venusians. For years, and until this very month, SETI (Search for Extraterrestrial Intelligence) has been monitoring radio signals for signs of transmissions from civilizations on other planets.

Since the early 1990s, though, scientists have embarked on an alternative, more fascinating and fruitful journey: that of detecting planets outside our Solar System, that is, exoplanets. The force driving these efforts is not to find little green (wo)men, but to find planets where life in its broadest sense could thrive, to delineate the planetary taxonomy of our galaxy, and to understand the origins and fate of our own planetary system. This research topic has become so relevant that Michel Mayor and Didier Queloz were awarded half of the Nobel Prize in Physics 2019 “for the discovery of an exoplanet orbiting a solar-type star”. To date, more than 4100 exoplanets have been detected and confirmed, and more than 2000 candidates are awaiting confirmation.

Exoplanets are usually categorized according to their mass and their phase state (rocky or gassy planets). In this project, I have tried to categorize exoplanets using Gaussian Mixture Models following two different approaches. First, I applied multivariate mixtures to describe the joint distribution of exoplanet masses and radii, quantities that are measured independently. Then, I found the best combination of univariate mixtures to describe the distribution of exoplanet densities, which are not measured but calculated from the mass and radius. These two approaches open the door to characterizing the distribution of exoplanet densities, a quantity that helps categorize them as rocky exoplanets (like the Earth or Mars) or gaseous/gassy ones (like Jupiter or Neptune). In a final step, I checked whether the Gaussian mixtures found with these approaches have physical meaning and whether they are consistent with existing exoplanet classifications.

2 Dataset description and preprocessing

The dataset was downloaded from Caltech’s IPAC website, which hosts the NASA Exoplanet Archive. This repository contains an up-to-date list of all the exoplanets that have been discovered and confirmed, together with the available information on each exoplanet, its host star, and the discovery method. The full dataset has more than 100 columns, but I was interested only in the mass and radius. Using the archive’s interactive querying system, I selected these two columns plus the one corresponding to the mass provenance. This categorical variable states whether the value of the mass is a lower bound or not, as some detection methods cannot provide the actual value. To be as accurate as possible, I filtered this column so that only exoplanets with an actual mass value were kept (pl_bmassprov == Mass). The values of the mass and radius are in “Earth units”, that is, they are normalized by the Earth’s mass and radius. Finally, I downloaded the CSV file containing the mass, radius, and mass provenance as columns (pl_masse, pl_rade, and pl_bmassprov, respectively).

When loading the dataset, the first 11 lines were skipped, as they just contained information about the applied filters and the downloaded columns. The column pl_bmassprov was removed, as it will not be used. In addition, I removed the rows with missing values (around 20% of the pl_rade values were missing).

# Load the data set as a data frame
exoplanets <- read.csv("exo_mass_radius_last.csv",sep = ",", dec = ".", skip=11)

# remove column with mass provenance label
df <- subset(exoplanets, select = -c(pl_bmassprov))

# Remove missing values
df <- na.omit(df)
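
Had the provenance filter not been applied in the web query, an equivalent filter could have been applied here instead; a minimal sketch (df_alt is just an illustrative name):

# keep only exoplanets whose mass is an actual value, not a lower bound,
# then drop the provenance column and the rows with missing values
df_alt <- subset(exoplanets, pl_bmassprov == "Mass", select = -pl_bmassprov)
df_alt <- na.omit(df_alt)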

This dataset contains information on exoplanets, i.e., planets outside our Solar System. But, from the point of view of any of those exoplanets, we live on an exoplanet too. Thus, I decided to augment the dataset by incorporating the mass and radius (in Earth units) of all the planets of the Solar System, including the Earth (obtained from the NASA SSDC Archive).

# add extra data from Solar System
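# planet order: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune
# (values in Earth units, from the NASA SSDC Archive)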
df_solar <- data.frame(
  pl_masse = c(0.055, 0.815, 1, 0.107, 317.8, 95.2, 14.5, 17.1),
  pl_rade  = c(0.383, 0.949, 1, 0.532, 11.21, 9.45, 4.01, 3.88)
)

# add rows to original dataset
df <- rbind(df,df_solar)

This way, the total number of exoplanets (observations) in the dataset amounted to:

nrow(df)
## [1] 750

Remarkably, of the more than 4100 confirmed exoplanets, fewer than 800 have information on both the mass and the radius. This highlights the difficulties astrophysicists face when analysing these objects. After that, I applied a decimal log transformation to both pl_masse and pl_rade. I could have used any other non-linear transformation of the data (such as a Box-Cox transform), but applying the decimal log transform is standard procedure in astrophysics for two reasons: it makes the dataset more normally distributed, and it converts the rational, exponential, or power-law functions usually encountered in astrophysical laws into linear relations (see the example below). In addition, I changed the column names to reflect the log-transformed character of the variables.

# log transform
df_log <- log10(df)

# change column names to reflect log-transformed character
names(df_log) <- c("log_masse","log_rade")

Finally, I added a new column to the dataset containing the [log-transformed] exoplanet density in Earth units. By definition, the absolute density of the exoplanet is given by \(\rho_p=M_p/V_p\), where \(M_p\) is the absolute mass of the exoplanet and \(V_p=\frac{4}{3}\pi r_p^3\) is its absolute volume. To express it in Earth units (\(\oplus\)) we have to normalize each magnitude by that of the Earth, that is:

\(\rho_\oplus=\frac{\rho_p}{\rho_E}=\frac{M_p/M_E}{V_p/V_E}=\frac{M_p/M_E}{r_p^3/r_E^3}=\frac{M_\oplus}{r_\oplus^3}\)

Taking the decimal logarithm at both sides of the equation, we get:

\(\log_{10}(\rho_\oplus)=\log_{10}(M_\oplus)-3\log_{10}(r_\oplus)\)

Identifying \(\log_{10}(M_\oplus)\) with log_masse and \(\log_{10}(r_\oplus)\) with log_rade, we can compute the log-transformed normalized density \(\log(\rho_\oplus)\) (log_dens).

# add column with calculated densities
log_dens <- df_log$log_masse-3*df_log$log_rade
df_p <- cbind(df_log,log_dens)

As a final comment, note that in these units the Earth has log_masse = log_rade = log_dens = 0.
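
As a quick sanity check of this formula (a small addition, using the Jupiter values from the Solar System rows added above):

# Jupiter: 317.8 Earth masses, 11.21 Earth radii -> log-density well below 0
# (roughly -0.65), i.e. Jupiter is considerably less dense than the Earth
log10(317.8) - 3 * log10(11.21)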

3 User-defined functions

To avoid an excess of code, I found it very handy to define a number of functions that are used more than once throughout the report. The first one computes the average values of the mixtures’ posterior means \(\mu\) from the MCMC simulation traces.

# function to compute the average values for the mixtures' posterior means (log_masse, log_rade, log_dens) from the MCMC simulations
mixtures_means <- function(x,k) {
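  # x is an NMixMCMC fit and k its number of components; columns (2*i-1) and (2*i)
  # of x$mu are assumed to hold the posterior draws of component i's means for
  # log_masse and log_rade, respectively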
  log_masse <- rep(0,k)
  log_rade <- rep(0,k)
  log_dens <- rep(0,k)
  for(i in 1:k){
    log_masse[i] <- mean(x$mu[,(2*i-1)])
    log_rade[i] <- mean(x$mu[,(2*i)])
    log_dens[i] <- log_masse[i]-3*log_rade[i]
  }
  # bind together
  mean_df <- cbind(log_masse,log_rade,log_dens)
  return(as.data.frame(mean_df))
}

Then I defined a function to plot the kernel densities of log_masse, log_rade and log_dens for the subsamples corresponding to each mixture, together with the corresponding posterior mean vectors \(\mu\).

# Function to represent the different mixtures densities together with their posterior means
plot_mixtures <- function(x,k,mean,main){
  # set color palette
  col.clust <- rainbow(k)
  
  # get y-axis max for log_masse, log_rade and log_dens
  y_masse_max <- c() # empty vector to store values
  y_rade_max <- c() # empty vector to store values
  y_dens_max <- c() # empty vector to store values
  for(i in 1:k){
    d <- density(x$log_masse[x$Class == i])
    y_masse_max[i] <- max(d$y)
    d <- density(x$log_rade[x$Class == i])
    y_rade_max[i] <- max(d$y)
    d <- density(x$log_dens[x$Class == i])
    y_dens_max[i] <- max(d$y)
  }
  y_masse_max <- 1.05*max(y_masse_max)
  y_rade_max <- 1.05*max(y_rade_max)
  y_dens_max <- 1.05*max(y_dens_max)
  
  # plot mass distributions for each mixture
  plot(density(x$log_masse[x$Class == 1]),col=col.clust[1],lwd=2,
       xlim=c(min(x$log_masse),max(x$log_masse)),
       ylim=c(0,y_masse_max), xlab="log_masse",main="")  # density plot for cluster 1
  abline(v=mean$log_masse[1],col=col.clust[1],lty=2,lwd=1)   # mean for cluster 1
  for(i in 2:k){
    lines(density(x$log_masse[x$Class == i]),col=col.clust[i],lwd=2) # density plot for cluster i
    abline(v=mean$log_masse[i],col=col.clust[i],lty=2,lwd=1)   # mean for cluster i
  }
  abline(v=0, lty = 1, lwd=1) # Earth's mass
  abline(v=log10(17.1), lty = 2, lwd = 1) # Neptune's mass
  abline(v=log10(95.2), lty = 3, lwd = 1) # Saturn's mass
  abline(v=log10(318), lty = 4, lwd = 1) # Jupiter's mass
  
  # plot radius distributions for each mixture
  plot(density(x$log_rade[x$Class == 1]),col=col.clust[1],lwd=2,
       xlim=c(min(x$log_rade),max(x$log_rade)),
       ylim=c(0,y_rade_max),xlab="log_rade", main = "")  # density plot for cluster 1
  abline(v=mean$log_rade[1],col=col.clust[1],lty=2,lwd=1) # mean for cluster 1
  for(i in 2:k){
    lines(density(x$log_rade[x$Class == i]),col=col.clust[i],lwd=2) # density plot for cluster i
    abline(v=mean$log_rade[i],col=col.clust[i],lty=2,lwd=1)  # mean for cluster i
  }
  abline(v=0, lty = 1,lwd=1) # Earth's radius
  abline(v=log10(3.9), lty = 2, lwd = 1) # Neptune's radius
  abline(v=log10(9.45), lty = 3, lwd = 1) # Saturn's radius
  abline(v=log10(11.2), lty = 4, lwd = 1) # Jupiter's radius
  # add title
  title(paste(main, "for K = ", k), line = -1, outer = TRUE)
  
  # plot density distributions for each mixture
  plot(density(x$log_dens[x$Class == 1]),col=col.clust[1],lwd=2,
       xlim=c(min(x$log_dens),max(x$log_dens)),
       ylim=c(0,y_dens_max), xlab="log_dens",main="")  # density plot for cluster 1
  abline(v=mean$log_dens[1],col=col.clust[1],lty=2,lwd=1)  # mean for cluster 1
  for(i in 2:k){
    lines(density(x$log_dens[x$Class == i]),col=col.clust[i],lwd=2)# density plot for cluster i
    abline(v=mean$log_dens[i],col=col.clust[i],lty=2,lwd=1)  # mean for cluster i
  }
  abline(v=0, lty = 1, lwd=1) # Earth's density
  abline(v=(log10(17.1)-3*log10(3.9)), lty = 2, lwd=1) # Neptune's density
  abline(v=(log10(95.2)-3*log10(9.45)), lty = 3, lwd=1) # Saturn's density
  abline(v=(log10(318)-3*log10(11.2)), lty = 4, lwd=1) # Jupiter's density
}

Finally, to check for the presence of label-switching, I defined a function to plot the traces of the posterior means in the MCMC simulation.

# Label switching plot function
plot_label_switching <- function(x,k,main){
  # set color palette
  col.clust <- rainbow(k)
  
  # get y-axis max and min for log_masse
  y_max_1 <- c()
  y_min_1 <- c()
  for(i in 1:k){
    y_max_1[i] <- max(x$mu[,2*i-1])
    y_min_1[i] <- min(x$mu[,2*i-1])
  }
  y_max_1 <- max(y_max_1)
  y_min_1 <- min(y_min_1)
  
  # get y-axis max and min for log_rade
  y_max_2 <- c()
  y_min_2 <- c()
  for(i in 1:k){
    y_max_2[i] <- max(x$mu[,2*i])
    y_min_2[i] <- min(x$mu[,2*i])
  }
  y_max_2 <- max(y_max_2)
  y_min_2 <- min(y_min_2)
  
  # plot evolution of log_masse means
  plot(x$mu[,1],type="l",ylim=c(y_min_1,y_max_1),col=col.clust[1],
       xlab = "Iteration", ylab="mu_masse",main = main)
  for(i in 2:k){
    lines(x$mu[,2*i-1],col=col.clust[i])
  }
  # plot evolution of log_rade means
  plot(x$mu[,2],type="l",ylim=c(y_min_2,y_max_2),col=col.clust[1],
       xlab = "Iteration", ylab="mu_rade")
  for(i in 2:k){
    lines(x$mu[,2*i],col=col.clust[i])
  }
}

4 Clustering in terms of exoplanet mass and radius

For this section, only the mass and radius (log_masse and log_rade) were used to find the Gaussian mixtures.

# set variables for clustering
VARS <- c("log_masse","log_rade")

The first step was to check the relationship between log_masse and log_rade and their histograms.

# load and install libraries
library(ggplot2)

if (!require(patchwork)){ 
  install.packages("patchwork") 
}
library(patchwork)

xy <- df_p[,1:2]
x <- df_p$log_masse
y <- df_p$log_rade
nbins <- round(sqrt(nrow(df_p))) # number of bins set to square root of number of observations (standard convention)

plot1 <- ggplot(xy, aes(x = x, y = y)) + geom_point()+ xlab("log_masse") + ylab("log_rade")
dens1 <- ggplot(xy, aes(x = x)) + geom_histogram(color = "black", fill = "white", bins = nbins) +
  theme_void()
dens2 <- ggplot(xy, aes(x = y)) + geom_histogram(color = "black", fill = "white", bins = nbins) + 
  theme_void() + coord_flip()

dens1 + plot_spacer() + plot1 + dens2 + 
  plot_layout(ncol = 2, nrow = 2, widths = c(4, 1), heights = c(1, 4))

Clearly there were at least 3 or 4 groups.

4.1 Frequentist clustering

To assess the performance of the Bayesian clustering algorithm on the exoplanet dataset, I found it convenient to first try an alternative method as a baseline. Thus, I tried the frequentist (or semi-Bayesian) Gaussian mixture clustering algorithm Mclust. Although this clustering method is also based on finite Gaussian mixture models, the Maximum Likelihood Estimate (MLE) is obtained “classically” using the Expectation-Maximization (EM) algorithm. In addition, Mclust assumes certain constraints on the covariance matrices. The optimal model (number and “shape” of the mixtures) is then selected according to the BIC.

################## MCLUST #################
# Compute the value of the BIC for all the possible models
if (!require(mclust)){ 
  install.packages("mclust") 
}
library(mclust)
n_clust <- mclustBIC(df_p[,VARS],G=1:10)
plot(n_clust)
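
A compact textual summary of the top-ranked models can also be obtained directly from the BIC object (a one-line addition; its output is not reproduced here):

# print the best BIC values together with the corresponding model names and numbers of mixtures
summary(n_clust)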

According to the EM algorithm, the distribution of log_masse and log_rade can be described with 4 mixtures having “ellipsoidal” covariance matrices (correlated variables with different variances) and equal shapes (the covariance matrices of all mixtures share the same eigenvalues). With this covariance-matrix configuration, the final Mclust mixture model was run, and the optimal number of mixtures and the mixture mean vectors were stored for later analysis. In addition, and for visualization purposes, a new dataframe df_exo_EM was defined by adding to the original dataset a column with the predicted mixture membership of each observation.

# Run Mclust for the optimal solution (4 clusters with ellipsoidal, equal shape)
exo_Mclust <- Mclust(df_p[,VARS],x=n_clust)
# Store optimal number of mixtures
k_opt_EM <- exo_Mclust$G

# means for different mixtures
mean_EM <- t(exo_Mclust$parameters$mean)
# add column with mean density
log_dens <- mean_EM[,1]-3*mean_EM[,2]
mean_EM <- as.data.frame(cbind(mean_EM,log_dens))

# add a column with the predicted cluster membership
class_EM <- data.frame(seq.int(1,k_opt_EM,1)[exo_Mclust$classification])
names(class_EM) <- c('Class')
class_EM$Class <- as.factor(class_EM$Class)
df_exo_EM <- cbind(class_EM,df_p)

Finally, the predicted joint marginal distribution was computed and stored.

# predict joint marginal distribution
densMarg_EM<-densityMclust(df_p[,VARS],G=k_opt_EM,model=exo_Mclust$modelName)

4.2 Bayesian clustering

For this section, I could have used the number of mixtures found by the EM algorithm. Nevertheless, it is better to estimate it using Reversible-Jump Markov Chain Monte Carlo (RJMCMC), an extension of the MCMC methodology “[…] that allows simulation of the posterior distribution on spaces of varying dimensions” (ref). In this methodology, the algorithm “jumps” between the parameter subspaces corresponding to different numbers of mixtures. For RJMCMC, the prior distribution for the number of mixtures is k ~ Uniform{1, k\(_{max}\)}, where k\(_{max}\) is fixed beforehand (15 in this case). The random seed was set (to my NIA, for example) for the sake of reproducibility. In addition, as I did not want the data to be scaled before running MCMC (so that the mean vectors \(\mu\) are obtained directly on the original scale), I specified scale=list(shift=0, scale=1).

####### RJMCMC
library(mixAK)
# Define the properties of the prior distribution for the number of mixtures
kprior<- list(priorK="uniform", Kmax=15, delta=1)
# Define parameters for RJMCMC simulation
nMCMC <- c(burn=5000, keep=10000, thin=5, info=FALSE)
# Set random seed
set.seed(100417976)
# Run RJMCMC simulation
exo_clust <- NMixMCMC(y0=df_p[,VARS], prior=kprior, scale=list(shift=0, scale=1),
                      nMCMC=nMCMC,PED=F)
## 
## Chain number 1
## ==============
## MCMC sampling started on Sat Mar 21 19:19:23 2020.
## Burn-in iteration 5000
## Iteration 10000
## Iteration 15000
## MCMC sampling finished on Sat Mar 21 19:20:30 2020.

Once the simulation was run, the posterior probabilities for each mixture size were plotted…

# plot posterior distribution of K
barplot(prop.table(table(exo_clust$K)),
        xlab="Number of mixtures", ylab="Posterior probability", main="Posterior distribution of K's")

…and the optimal number of mixtures retrieved and stored.

# find optimal number of clusters (most frequent one)
K_prob <- table(exo_clust$K) # table of number of clusters and absolute frequency of appearance
k_s_bayes <- as.numeric(names(K_prob)) # get column names as numbers
ind_max <- which.max(K_prob) # find index of K with maximum frequency
k_opt_MCMC <- k_s_bayes[ind_max] # store optimal K
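
As a small follow-up (not in the original output), the posterior probability mass assigned to the selected number of mixtures can be read directly from the same table:

# posterior probability of the most frequently visited K
prop.table(K_prob)[ind_max]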

Notice that the RJMCMC simulation found that the optimal number of mixtures is 6, in contrast with the EM algorithm, which suggested 4. Once the optimal number of mixtures was obtained, the standard MCMC simulation was run again with that number fixed. This step would not be needed if I only wanted the predicted probability density distribution, but there is relevant information that cannot be accessed from the RJMCMC output.

# run clustering algorithm again with fixed Kmax = k_opt
kprior <- list(priorK="fixed", Kmax=k_opt_MCMC, delta=1)
nMCMC <- c(burn=5000, keep=10000, thin=5, info=FALSE)
exo_clust_1 <- NMixMCMC(y0=df_p[,VARS], prior=kprior, scale=list(shift=0, scale=1),
                        nMCMC=nMCMC,PED=F)
## 
## Chain number 1
## ==============
## MCMC sampling started on Sat Mar 21 19:20:31 2020.
## Burn-in iteration 5000
## Iteration 10000
## Iteration 15000
## MCMC sampling finished on Sat Mar 21 19:21:25 2020.

Then, the mean vectors of the mixtures were obtained with the user-defined function mixtures_means and stored for future analysis. In addition, and for visualization purposes, a new dataframe df_exo_MCMC_1 was defined by adding to the original dataset a column with the predicted mixture membership for each observation.

# Get mixture mean vector (log_masse and log_rade) and corresponding mean of log_dens
mean_MCMC_1 <- mixtures_means(exo_clust_1,k_opt_MCMC)

# add a column with the predicted cluster membership
class <- data.frame(seq.int(1,k_opt_MCMC,1)[exo_clust_1$state.last$r])
names(class) <- c('Class')
class$Class <- as.factor(class$Class)
df_exo_MCMC_1 <- cbind(class,df_p)

Finally, the predicted marginal densities and the joint marginal distribution were computed and stored.

# predict density marginals
pdens_1 <- NMixPredDensMarg(exo_clust_1, lgrid=300)
# predict joint marginal distributions
pdens2_1 <- NMixPredDensJoint2(exo_clust_1)

4.3 Frequentist vs. Bayesian

Once all these calculations were performed, I could carry out a comparison of their performances. I started by comparing the predicted marginal densities for each of the covariates.

par(mfrow=c(1,2))
# histogram of log_masse
hist(df_p$log_masse, freq=F, ylim=c(0,1), xlab = "log_masse",
     main = "Histogram of log_masse")
lines(density(densMarg_EM$data[,1]),col=3,lwd=2) # predicted EM density 
lines(pdens_1$x$x1,pdens_1$dens$`1`,col=2,lwd=2) # predicted MCMC density
legend("top",c(paste("EM Mixtures for K =", k_opt_EM),
               paste("MCMC Mixtures for K =", k_opt_MCMC)),col = c(3,2),lty=c(1,1),lwd=c(2,2))
# histogram of log_rade
hist(df_p$log_rade, freq=F, ylim=c(0,3), xlab = "log_rade",
     main = "Histogram of log_rade")
lines(density(densMarg_EM$data[,2]),col=3,lwd=2) # predicted EM density 
lines(pdens_1$x$x2,pdens_1$dens$`2`,col=2,lwd=2) # predicted MCMC density

They provide similar density distributions, even though the number of mixtures is different. It seems that both fail to capture the true densities at the peaks and valleys but, in general, the result is satisfactory. Then, the joint marginal distributions were checked.

# EM algorithm
plot(densMarg_EM,what="density", nlevels= 50, drawlabels = FALSE, col = 2, lwd = 1,
     xlim = c(min(df_p$log_masse),max(df_p$log_masse)), 
     ylim = c(min(df_p$log_rade),max(df_p$log_rade)))
points(df_p$log_masse,df_p$log_rade, pch=20, cex=0.5)
title(main=paste("EM Mixtures for K =",k_opt_EM))

# MCMC algorithm
plot(pdens2_1, xylab=VARS, contour = TRUE, nlevels= 50, drawlabels = FALSE, col = 2, lwd = 1, 
     xlim = c(min(df_p$log_masse),max(df_p$log_masse)), 
     ylim = c(min(df_p$log_rade),max(df_p$log_rade)), 
     main=paste("MCMC Mixtures for K =",k_opt_MCMC))
points(df_p$log_masse,df_p$log_rade, pch=20, cex=0.5)

The differences between the two methods are now more noticeable, and it seems that the Bayesian approach does a better job capturing the joint probability distribution of log_masse and log_rade in the current dataset. It can also be seen that the contour lines of the EM joint distribution are smoother, which is consistent with the restrictions imposed on the mixture covariance matrices by the package Mclust. One could argue that this smoothness comes from the smaller number of mixtures used to generate the joint distribution, but Mclust was run again forcing 6 mixtures and the results (not shown) display the same smoothness. Hence, the Bayesian approach, while more computationally expensive, is a better alternative for finding the correct mixtures.
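
For reference, a minimal sketch of that check (the object name exo_Mclust_6 is illustrative; the exact call used for the unshown results is not reproduced):

# re-run Mclust forcing as many mixtures as the Bayesian solution found
exo_Mclust_6 <- Mclust(df_p[,VARS], G = k_opt_MCMC)
plot(exo_Mclust_6, what = "density") # contour plot of the implied joint density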

In any case, the differences in how the two methods cluster this dataset can be explored visually. To this end, scatter plots of log_masse vs. log_rade were displayed with each observation colored according to its mixture membership.

# set colour palette for EM
colors.EM <- rainbow(k_opt_EM)[exo_Mclust$classification]
# set colour palette for MCMC
color_class <- exo_clust_1$state.last$r
color_ind_1 <- order(color_class)
colors.MCMC_1 <- rainbow(k_opt_MCMC)[color_class[color_ind_1]]

par(mfrow=c(1,2))
plot(df_p$log_masse,df_p$log_rade,col=colors.EM,bg=1,pch=21,
     xlab="log_masse",ylab="log_rade",main=paste("EM Mixtures for K =",k_opt_EM))
points(mean_EM$log_masse,mean_EM$log_rade,col=1,bg=rainbow(k_opt_EM),
       lwd=2,pch=24,cex=2)
plot(df_p$log_masse[color_ind_1],df_p$log_rade[color_ind_1],col=colors.MCMC_1,bg=1,pch=21,
     xlab="log_masse",ylab="log_rade",main=paste("MCMC Mixtures for K =",k_opt_MCMC))
points(mean_MCMC_1$log_masse,mean_MCMC_1$log_rade,col=1,bg=rainbow(k_opt_MCMC),
       lwd=2,pch=24,cex=2)

Again, it can be seen that both methods rendered sensible clusterings but, in my opinion, the fully Bayesian approach made a better selection. Nevertheless, the mixture means from the MCMC simulation (large triangles) were not quite in agreement with what one would expect, at least for some mixtures. What was happening here?

4.4 Label-switching in MCMC simulations

While the MCMC analysis of mixtures is quite powerful and convenient, it suffers pathologically from the non-identifiability of the mixtures under symmetric priors, leading to the so-called label-switching. In this situation, the actual values of the posterior mixture mean vectors \(\mu\) are correctly found, but the labels assigned to each mixture can randomly switch between components. In practice, what one observes when assessing the evolution of the posterior \(\mu\) along the MCMC simulation is a series of jumps in the values of \(\mu\) for each mixture. As the final posterior mixture mean vector \(\mu\) is the average of the MCMC draws of \(\mu\), these jumps lead to incorrect estimates of the posterior mixture mean vectors, as the toy sketch below illustrates.
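
As a minimal, deterministic toy example (unrelated to the exoplanet data; all names are illustrative): a chain whose two component means are -1 and +1, but whose labels swap halfway through, yields naive column averages near 0, whereas relabelling (here, simply sorting each draw) recovers the true means.

# toy chain: component means are -1 and +1, but their labels swap after iteration 500
mu_toy <- cbind(rep(-1, 1000), rep(1, 1000))
mu_toy[501:1000, ] <- mu_toy[501:1000, 2:1]
colMeans(mu_toy)                    # naive averages collapse towards 0
colMeans(t(apply(mu_toy, 1, sort))) # relabelling by sorting recovers -1 and +1

Could this be happening in the current experiment? To check, the traceplots of the mixture posterior means in the previous MCMC simulation were plotted.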

# to check label switching
tracePlots(exo_clust_1, param=c("mu"))

BINGO! The MCMC algorithm did suffer from label-switching for this dataset and this particular number of mixtures. To solve the problem, I assumed a more informative prior distribution for the mixture weights, as described by Cavalcante and Gonçalves. Mixture models usually assume that the weights follow a Dirichlet prior, \(\omega\) ~ D(\(\delta_1, \dots, \delta_K\)), such that \(\sum_{k=1}^K\omega_k=1\). By default, \(\delta_k=1\) for all \(k\) which, citing the course notes, “is a uniform prior in the unit simplex”. Cavalcante and Gonçalves showed that increasing \(\delta_k\) to 4 or 10 solved the label-switching problem in both homogeneous and heterogeneous populations.
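
To get some intuition for the role of \(\delta_k\) (a small side computation, not part of the original analysis): under a symmetric Dirichlet prior each individual weight has a Beta(\(\delta\), (K-1)\(\delta\)) marginal, so larger \(\delta\) pulls prior mass away from zero, i.e. away from (near-)empty components, which is one intuitive reading of why the more informative prior helps.

# prior quantiles of a single mixture weight under a symmetric Dirichlet(delta,...,delta)
# with K = 6 components; its marginal distribution is Beta(delta, (K-1)*delta)
K <- 6
for (delta in c(1, 4, 10)) {
  q <- qbeta(c(0.05, 0.5, 0.95), delta, (K - 1) * delta)
  cat("delta =", delta, "-> 5%/50%/95% quantiles:", round(q, 3), "\n")
}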

Thus, following their example, the MCMC simulation was run again but using these two values of \(\delta_k\). Luckily for me, this functionality is already implemented in the function NMixMCMC, in the code snippet to define the priors options (kprior <- list(priorK="fixed", Kmax=k_opt_MCMC, delta=4)). For this simulation, the optimal number of mixtures found by RJMCMC (k_opt_MCMC) was kept, as the label-switching does not affect this parameter.

##### MCMC simulation with delta = 4
kprior <- list(priorK="fixed", Kmax=k_opt_MCMC, delta=4)
nMCMC <- c(burn=5000, keep=10000, thin=5, info=FALSE)
exo_clust_4 <- NMixMCMC(y0=df_p[,VARS], prior=kprior, scale=list(shift=0, scale=1),
                        nMCMC=nMCMC,PED=F)
## 
## Chain number 1
## ==============
## MCMC sampling started on Sat Mar 21 19:22:24 2020.
## Burn-in iteration 5000
## Iteration 10000
## Iteration 15000
## MCMC sampling finished on Sat Mar 21 19:23:20 2020.
##### MCMC simulation with delta = 10
kprior <- list(priorK="fixed", Kmax=k_opt_MCMC, delta=10)
nMCMC <- c(burn=5000, keep=10000, thin=5, info=FALSE)
exo_clust_10 <- NMixMCMC(y0=df_p[,VARS], prior=kprior, scale=list(shift=0, scale=1),
                        nMCMC=nMCMC,PED=F)
## 
## Chain number 1
## ==============
## MCMC sampling started on Sat Mar 21 19:23:20 2020.
## Burn-in iteration 5000
## Iteration 10000
## Iteration 15000
## MCMC sampling finished on Sat Mar 21 19:24:14 2020.

Again, the mixture mean vectors were retrieved, a column with the predicted mixture membership was added to the original dataset, and the density marginals were predicted.

# Get mixture mean vector (log_masse and log_rade) and corresponding mean of log_dens
mean_MCMC_4 <- mixtures_means(exo_clust_4,k_opt_MCMC)

# add a column with the predicted cluster membership
class <- data.frame(seq.int(1,k_opt_MCMC,1)[exo_clust_4$state.last$r])
names(class) <- c('Class')
class$Class <- as.factor(class$Class)
df_exo_MCMC_4 <- cbind(class,df_p)

# predict density marginals
pdens_4 <- NMixPredDensMarg(exo_clust_4, lgrid=300)

# Get mixture mean vector (log_masse and log_rade) and corresponding mean of log_dens
mean_MCMC_10 <- mixtures_means(exo_clust_10,k_opt_MCMC)

# add a column with the predicted cluster membership (for the last state)
class <- data.frame(seq.int(1,k_opt_MCMC,1)[exo_clust_10$state.last$r])
names(class) <- c('Class')
class$Class <- as.factor(class$Class)
df_exo_MCMC_10 <- cbind(class,df_p)

# predict density marginals
pdens_10 <- NMixPredDensMarg(exo_clust_10, lgrid=300)

To check the presence of label-switching in the new simulations, and for the sake of space reduction, the user-defined function plot_label_switching was used instead of the built-in function tracePlots.

# let's check the label switching effects
par(mfrow=c(3,2))
plot_label_switching(exo_clust_1,k_opt_MCMC,"Traceplots for delta = 1")
plot_label_switching(exo_clust_4,k_opt_MCMC,"Traceplots for delta = 4")
plot_label_switching(exo_clust_10,k_opt_MCMC,"Traceplots for delta = 10")

As can clearly be seen in the previous plots, increasing \(\delta_k\) to 4 or 10 mitigated the label-switching problem to different degrees. Although the mixing was not perfect, the outcome was very promising and much better than when an uninformative prior for the weights was used (\(\delta_k=1\)). Interestingly, just like the optimal number of mixtures, which is not affected by the label-switching problem, the predicted densities of log_masse and log_rade are very similar irrespective of the value of \(\delta_k\), since they do not depend on how the components are labeled (Cavalcante and Gonçalves).

par(mfrow=c(1,2))
hist(df_p$log_masse, freq=F, ylim=c(0,0.8), xlab = "log_masse", main = "Histogram of log_masse")
lines(pdens_1$x$x1,pdens_1$dens$`1`,col=2,lwd=2)
lines(pdens_4$x$x1,pdens_4$dens$`1`,col=3,lwd=2)
lines(pdens_10$x$x1,pdens_10$dens$`1`,col=4,lwd=2)
legend("topleft",c("Delta = 1","Delta = 4","Delta = 10"),col = c(2,3,4),lty=c(1,1,1),lwd=c(2,2,2))
hist(df_p$log_rade, freq=F, ylim=c(0,3), xlab = "log_rade", main = "Histogram of log_rade")
lines(pdens_1$x$x2,pdens_1$dens$`2`,col=2,lwd=2)
lines(pdens_4$x$x2,pdens_4$dens$`2`,col=3,lwd=2)
lines(pdens_10$x$x2,pdens_10$dens$`2`,col=4,lwd=2)

Judging by the predicted marginal densities and the traceplots of the posterior means, it seemed that \(\delta_k=4\) handled the label-switching problem better and resulted in better mixing of the Markov chain when exploring the whole state space. An alternative way to test this is to compute the DIC of each model:

DIC <- c(exo_clust_1$DIC$DIC,exo_clust_4$DIC$DIC,exo_clust_10$DIC$DIC)
names(DIC) <- c("Delta = 1","Delta = 4","Delta = 10")
DIC <- as.data.frame(DIC)
DIC
##                 DIC
## Delta = 1  1097.268
## Delta = 4  1094.651
## Delta = 10 1095.200

In view of this analysis, the best model was the one using \(\delta_k=4\). Accordingly, this model was the one used from now on. Finally, it was visually checked that the mixture posterior means were now closer to the expected ones.

# set colour palette for new MCMC
color_class <- exo_clust_4$state.last$r
color_ind_4 <- order(color_class)
colors.MCMC_4 <- rainbow(k_opt_MCMC)[color_class[color_ind_4]]

par(mfrow=c(1,2))
plot(df_p$log_masse[color_ind_1],df_p$log_rade[color_ind_1],col=colors.MCMC_1,bg=1,pch=21,
     xlab="log_masse",ylab="log_rade",
     main=paste("MCMC Mixtures for delta = 1 and K =",k_opt_MCMC))
points(mean_MCMC_1$log_masse,mean_MCMC_1$log_rade,col=1,bg=rainbow(k_opt_MCMC),
       lwd=2,pch=24,cex=2)
plot(df_p$log_masse[color_ind_4],df_p$log_rade[color_ind_4],col=colors.MCMC_4,bg=1,pch=21,
     xlab="log_masse",ylab="log_rade",
     main=paste("MCMC Mixtures for delta = 4 and K =",k_opt_MCMC))
points(mean_MCMC_4$log_masse,mean_MCMC_4$log_rade,col=1,bg=rainbow(k_opt_MCMC),
       lwd=2,pch=24,cex=2)

5 Clustering in terms of exoplanet density

For this section, only the exoplanet density log_dens was used to find the Gaussian mixtures.

VARS <- c("log_dens")

The first step was to check the distribution of exoplanets densities.

# Histogram with density plot
nbins <- round(sqrt(nrow(df_p))) # number of bins set to square root of number of observations (standard convention)
ggplot(df_p, aes(x=log_dens)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white", bins = nbins)

It is not easy to tell, but at least 3 or 4 mixtures might be needed to capture this density distribution adequately.

5.1 Bayesian clustering

As shown before, Bayesian clustering performs better than the frequentist EM mixture model once the label-switching problem is handled. Therefore, in this section only the Bayesian clustering method was applied. The procedure is analogous to the previous one and, thus, some of the steps will not be explained again.

In short, RJMCMC was used first to find the optimal number of mixtures (k_opt_MCMC_d). Then the standard MCMC simulation was run with this number, assuming \(\delta_k=10\) for the prior distribution of the weights. I used \(\delta_k=10\) instead of 4 because, although the DIC was marginally worse, the Markov chain mixing was better. After that, the mixture mean vectors were retrieved, a column with the predicted mixture membership was added to the original dataset, and the marginal densities were predicted.

# Define the properties of the prior distribution for the number of mixtures
kprior<- list(priorK="uniform", Kmax=15, delta=1)
# Define parameters for RJMCMC simulation
nMCMC <- c(burn=5000, keep=10000, thin=5, info=FALSE)
# Set random seed
set.seed(100417976)
# Run RJMCMC simulation
exo_clust_0 <- NMixMCMC(y0=df_p[,VARS], prior=kprior, scale=list(shift=0, scale=1),
                        nMCMC=nMCMC,PED=F)
## 
## Chain number 1
## ==============
## MCMC sampling started on Sat Mar 21 19:24:33 2020.
## Burn-in iteration 5000
## Iteration 10000
## Iteration 15000
## MCMC sampling finished on Sat Mar 21 19:25:10 2020.
# find optimal number of clusters (most frequent one)
K_prob <- table(exo_clust_0$K) # table of number of clusters and absolute frequency of appearance
k_s_bayes <- as.numeric(names(K_prob)) # get column names as numbers
ind_max <- which.max(K_prob) # find index of K with maximum frequency
k_opt_MCMC_d <- k_s_bayes[ind_max] # store optimal K

# run clustering algorithm again with fixed Kmax = k_opt and delta = 10
kprior <- list(priorK="fixed", Kmax=k_opt_MCMC_d, delta=10)
nMCMC <- c(burn=5000, keep=10000, thin=5, info=FALSE)
exo_clust_d <- NMixMCMC(y0=df_p[,VARS], prior=kprior, scale=list(shift=0, scale=1),
                        nMCMC=nMCMC,PED=F)
## 
## Chain number 1
## ==============
## MCMC sampling started on Sat Mar 21 19:25:10 2020.
## Burn-in iteration 5000
## Iteration 10000
## Iteration 15000
## MCMC sampling finished on Sat Mar 21 19:25:38 2020.
# add a column with the predicted cluster membership
class <- data.frame(seq.int(1,k_opt_MCMC_d,1)[exo_clust_d$state.last$r])
names(class) <- c('Class')
class$Class <- as.factor(class$Class)
df_exo_MCMC_d <- cbind(class,df_p)

# Get mixture mean vector (log_dens) and corresponding log_masse and log_rade obtained from the subsamples means
log_dens <- c()
log_masse <- c()
log_rade <- c()
for(i in 1:k_opt_MCMC_d){
  log_dens[i] <- mean(exo_clust_d$mu[,i])
  log_masse[i] <- mean(df_exo_MCMC_d$log_masse[df_exo_MCMC_d$Class == i])
  log_rade[i] <- mean(df_exo_MCMC_d$log_rade[df_exo_MCMC_d$Class == i])}

mean_dens <- as.data.frame(cbind(log_masse,log_rade,log_dens))

# Predict density marginals
grid_d <- seq(1.05*min(df_p$log_dens),1.05*max(df_p$log_dens),length.out = 300) 
pdens_d <- NMixPredDensMarg(exo_clust_d, grid = grid_d)

In the next figure the results of the previous procedure are summarized.

par(mfrow=c(1,3))
# plot posterior distribution of K
barplot(prop.table(table(exo_clust_0$K)), 
        xlab="Number of mixtures", ylab="Posterior probability",main="Posterior distribution of K's")

# plot label switching
# set color palette
col.clust <- rainbow(k_opt_MCMC_d)
# get y-axis max and min for log_dens
y_max <- c()
y_min <- c()
for(i in 1:k_opt_MCMC_d){
  y_max[i] <- max(exo_clust_d$mu[,i])
  y_min[i] <- min(exo_clust_d$mu[,i])
}
y_max <- max(y_max)
y_min <- min(y_min)
# plot evolution of log_dens mixture means mu
plot(exo_clust_d$mu[,1],type="l",ylim=c(y_min,y_max),col=col.clust[1], 
     xlab = "Iteration", ylab="mu_dens", main = "Traceplot for mixture means")
for(i in 2:k_opt_MCMC_d){
  lines(exo_clust_d$mu[,i],col=col.clust[i])
}

# plot histogram and predicted density distribution
hist(df_p$log_dens, freq=F, ylim=c(0,0.7), xlab = "log_dens", main = "Histogram of log_dens")
lines(pdens_d$x$x1,pdens_d$dens$`1`,col=2,lwd=2)

The optimal number of mixtures turned out to be 3, very close to what one could foresee. The traceplot of the mixture means showed no evidence of label-switching and displayed good mixing of the Markov chain. In addition, the predicted density distribution captured the subtleties of the log_dens histogram very well. Finally, the different groups that the Bayesian clustering found according to the exoplanet densities were explored visually by means of a scatter plot of log_masse vs. log_rade, with each observation colored according to its mixture membership.

# set color palette
color_class <- exo_clust_d$state.last$r
colors.MCMC <- rainbow(k_opt_MCMC_d)[color_class]

plot(df_p$log_masse,df_p$log_rade,col=colors.MCMC,bg=1,pch=21,
     xlab="log_masse",ylab="log_rade",main=paste("MCMC Mixtures for delta = 10 and K =",k_opt_MCMC_d))
points(mean_dens$log_masse,mean_dens$log_rade,col=1,bg=rainbow(k_opt_MCMC_d),lwd=2,pch=24,cex=2)

This clustering visualization makes sense. In the traceplot above, the blue trace corresponds to the largest mean density, whereas the red one corresponds to the smallest. On the other hand, for a fixed log_masse, the denser an exoplanet is, the smaller its log_rade. As the color codes in the trace and scatter plots are the same, the “blue” exoplanets are the ones with the largest mean densities and should therefore have the smallest mean radii, as indeed happens. The same goes for the “red” exoplanets, which have the smallest mean density and thus display the largest mean radii.

6 Physical meaning of mixtures parameters

There is no unique way of categorizing exoplanets, but most schemes take into account their mass and their phase state. In this report I have followed the one depicted in the cartoon below (based on actual classifications), which is meaningful and funny. The main distinction is made between rocky planets, with densities similar to or larger than that of the Earth, and gassy planets, with Neptune/Jupiter-like densities or smaller. Then, based on their masses, they are usually subdivided into: Subterrans, Terrans and Superterrans (or mini-Neptunes), and Gas Dwarfs, Gas Giants and SuperJupiters (a.k.a. Superjovians), i.e., a total of 6 classes. Truth be told, the existing classifications have no strong theoretical basis and lean more to the empirical side. In any case, I focused mainly on the distribution of exoplanet densities.

So, how does the analysis performed in the previous sections reflect this classification? To make the comparison easier, I checked, with the help of the user-defined function plot_mixtures, the marginal distributions of log_masse, log_rade, and log_dens for each of the mixtures found in the previous sections, that is, when clustering according to log_masse and log_rade, and when clustering according to log_dens exclusively.

par(mfrow=c(1,3))
plot_mixtures(df_exo_MCMC_4,k_opt_MCMC,mean_MCMC_4, "MCMC Mixtures according to log_masse and log_rade")

plot_mixtures(df_exo_MCMC_d,k_opt_MCMC_d,mean_dens, "MCMC Mixtures according to log_dens")

In these plots, I have included the mixture means \(\mu\) of log_masse, log_rade, and log_dens found by MCMC as vertical lines with the same color as their corresponding mixture. For comparison purposes, I have also included vertical black lines marking the values of log_masse, log_rade, and log_dens for the Earth (solid line), Neptune (dashed line), Saturn (dotted line) and Jupiter (dash-dotted line).

The Gaussian mixtures obtained according to log_masse and log_rade suggest that there would be 4 types of exoplanets in terms of density. There would be one group of super-dense exoplanets, at least one order of magnitude denser than the Earth (Type I, red), two groups of exoplanets with Earth-like densities (Type II, green and yellow), one group with a density close to that of Neptune or Jupiter (Type III, blue), and, finally, a couple of groups of Saturn-like exoplanets (Type IV, cyan and magenta). Type I exoplanets, which could be identified with the Superterrans above, would pack a mass similar to that of Jupiter into a size smaller than Neptune. Type II exoplanets would have masses and radii smaller than those of Neptune; they could be matched with the Subterrans and Terrans above. Type III exoplanets would have masses and radii larger than those of Jupiter and could thus be identified with the Superjovians. Type IV exoplanets would come in two sizes, with masses between those of Saturn and Jupiter, and would represent the Gas Dwarfs and Gas Giants of the adopted classification.

In contrast, the Gaussian mixtures obtained according to log_dens are not so easy to compare with existing classifications. In terms of densities, there would be 3 degenerate types of exoplanets, each of them displaying two different ranges of masses and radii. This amounts to 6 different groups, as in the previous case, but a new classification would be needed to identify each of them. In addition, one of the mixtures shows a very large posterior variance in log_dens, in such a way that the subsample corresponding to that mixture contains observations across the whole range of log_masse, log_rade, and log_dens. Thus, it does not serve as a good indicator to identify exoplanet types. Does this mean that the Gaussian Mixture Model (GMM) is incorrect? Not really: the Gaussian mixtures found using MCMC capture the distribution of log_dens very well (see the histogram above), and that is the main goal of GMMs. Nevertheless, that does not mean that their posterior parameters must always have a physical meaning.

Hence, if one wants to use Gaussian mixtures to classify exoplanets, it is better to use log_masse and log_rade.
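
As an illustration of how the stored mixture means could then be used (a naive sketch; the input values and the nearest-mean rule are mine, not a result from this report), a new exoplanet could be tentatively assigned to the component whose posterior mean is closest in the (log_masse, log_rade) plane:

# hypothetical exoplanet: 5 Earth masses, 1.8 Earth radii
new_lm <- log10(5)
new_lr <- log10(1.8)
# Euclidean distance to each posterior mixture mean of the delta = 4 model
dists <- sqrt((mean_MCMC_4$log_masse - new_lm)^2 + (mean_MCMC_4$log_rade - new_lr)^2)
which.min(dists) # index of the closest mixture component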

7 Summary and conclusion

In this project, I have tried to categorize exoplanets based on their masses and radii using Gaussian Mixture Models (GMM). The dataset was downloaded from the NASA Exoplanet Archive and augmented with the data corresponding to the Solar System planets. The dataset was then preprocessed: removal of missing values, application of a decimal logarithmic transformation, and creation of a new column with the calculated exoplanet densities. The resulting dataset contained three variables (log_masse, log_rade, and log_dens) and 750 observations.

Classical (EM algorithm Mclust) and Bayesian ([RJ]MCMC algorithm NMixMCMC) multivariate GMMs were then applied to the sample formed by log_masse and log_rade. The Bayesian approach, which found that 6 mixtures were needed to capture the joint probability distribution, was shown to perform better than the classical version. Nevertheless, the Bayesian clustering algorithm suffered from label-switching, so that the predicted mixture posterior mean vectors were incorrectly determined. Using a more informative prior for the mixture weights distribution largely solved this problem.

Next, a univariate Bayesian GMM was applied to the distribution of exoplanets densities log_dens. In this case, the Reversible Jump MCMC formalism detected that 3 mixtures were needed to describe the probability distribution of log_dens. To avoid the label-switching problem when retrieving the mixture posterior parameters, a more informative prior for the weights was directly assumed.

Finally, it was checked whether the mixture parameters had any physical meaning and could reflect existing empirical exoplanet classifications. The mixtures obtained by clustering on log_dens did not seem to serve as a good indicator of exoplanet types. Nevertheless, the mixture parameters obtained by clustering on the exoplanet masses and radii (log_masse and log_rade) were quite in agreement with the mean properties of the exoplanet types that have been classified according to both their phase state (rocky and gassy planets) and their mass.

In conclusion, Bayesian GMMs based on the MCMC methodology not only do an excellent job capturing the probability distribution of exoplanet properties but also, if the label-switching problem is managed, serve as a tool to identify exoplanet types.