Visualise your fitted
non-linear dimension reduction model
in the high-dimensional data space

Jayani P. G. Lakshika

Joint work with Prof Dianne Cook, Dr Paul Harrison, Dr Michael Lydeamore, Dr Thiyanga S. Talagala

Motivation

Single-cell gene expression: same data, different NLDR + hyper-parameters

Which is the most reasonable representation of the structure(s) present in the
high-dimensional data?

How do you decide which is the most reasonable representation?

This is the published figure.

Peripheral Blood Mononuclear Cells (PBMC)

Here is the \(9\text{-}D\) data viewed using a grand tour: a sequence of linear projections into \(2\text{-}D\).

Software: langevitour

Show “model-in-the-data-space”

data-in-the-model-space

model-in-the-data-space

What is the model?

data-in-the-model-space

model-in-the-data-space

Overview of method

1. Construct the \(2\text{-}D\) model

2. Lift the model into high dimensions

Steps of the algorithm

1. Construct the \(2\text{-}D\) model

  a. NLDR layout
  b. Hexagon bins (hex_binning() and geom_hexgrid())
  c. Bin centroids (merge_hexbin_centroids())
  d. Triangulated centroids (tri_bin_centroids(), gen_edges(), update_trimesh_index(), and geom_trimesh())
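To make the hexagon binning step concrete, here is a minimal sketch in Python of assigning 2-D layout points to a regular hexagonal grid and counting the points per hexagon. It is not the quollr implementation of hex_binning(): the function names, the `width` parameter, and the fixed grid origin at (0, 0) are assumptions made only for illustration.

```python
import math
from collections import Counter


def hex_bin_centre(x, y, width):
    """Return the centre of the regular hexagon containing (x, y).

    Assumed grid: hexagon centres are `width` apart within a row, rows are
    width * sqrt(3) / 2 apart, and alternate rows are shifted by width / 2.
    Such a grid is the union of two rectangular lattices, so the nearest
    centre is the closer of the nearest point in each lattice.
    """
    row_gap = width * math.sqrt(3) / 2
    # Lattice A: centres at (i * width, j * 2 * row_gap)
    ax = round(x / width) * width
    ay = round(y / (2 * row_gap)) * 2 * row_gap
    # Lattice B: the same lattice shifted by (width / 2, row_gap)
    bx = (math.floor(x / width) + 0.5) * width
    by = (math.floor(y / (2 * row_gap)) + 0.5) * 2 * row_gap
    dist_a = (x - ax) ** 2 + (y - ay) ** 2
    dist_b = (x - bx) ** 2 + (y - by) ** 2
    return (ax, ay) if dist_a <= dist_b else (bx, by)


def hex_bin_counts(points, width):
    """Count how many 2-D points fall in each hexagon, keyed by hexagon centre."""
    return Counter(hex_bin_centre(x, y, width) for x, y in points)
```

The per-hexagon counts are the kind of quantity a low-density filter would threshold on, and the set of non-empty hexagons gives the points that are then triangulated.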

Steps of the algorithm

2. Lift the model into high dimensions

avg_highd_data()

show_langevitour()
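The lifting step can be sketched as averaging the high-dimensional observations that share a hexagon, which gives each 2-D bin centroid a counterpart in the data space. The Python below is an illustrative sketch only, not the avg_highd_data() implementation; the function name and the input format (one bin identifier per observation) are assumptions.

```python
def average_highd_by_bin(data, bin_ids):
    """Return {bin_id: mean vector} for the rows of `data` grouped by bin.

    `data` is a list of equal-length numeric rows (the high-dimensional
    observations) and `bin_ids` gives the hexagon each row was assigned to
    in the 2-D layout. Both names are hypothetical.
    """
    if len(data) != len(bin_ids):
        raise ValueError("data and bin_ids must have the same length")
    sums = {}
    counts = {}
    for row, bin_id in zip(data, bin_ids):
        if bin_id not in sums:
            sums[bin_id] = [0.0] * len(row)
            counts[bin_id] = 0
        for j, value in enumerate(row):
            sums[bin_id][j] += value
        counts[bin_id] += 1
    # Divide each coordinate sum by the number of observations in the bin
    return {b: [s / counts[b] for s in sums[b]] for b in sums}
```

Connecting these high-dimensional means with the same edges as the 2-D triangulation gives a wireframe that can be shown alongside the data in a tour.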

Factors for fitting and measuring fit

  • NLDR layout, different methods and different hyper-parameters
  • Number of bins
  • Bin start position
  • Low density removal (find_low_dens_hex())
  • HBE in high dimensions: the square root of the mean squared difference between observed and fitted values (glance())

\[\sqrt{\frac{1}{n}\sum_{h = 1}^{b}\sum_{i = 1}^{n_h}\sum_{j = 1}^{p} ({x}_{hij} - C^{(p)}_{hj})^2}\]

where

\(n =\) the number of observations,

\(b =\) the number of bins,

\(n_h =\) the number of observations in the \(h^{th}\) bin,

\(p =\) the number of variables,

\({x}_{hij} =\) the value of the \(j^{th}\) variable for the \(i^{th}\) observation in the \(h^{th}\) hexagon,

\(C^{(p)}_{hj} =\) the \(j^{th}\) coordinate of the \(p\text{-}D\) centroid of the \(h^{th}\) hexagon (the fitted value).
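As a worked illustration of the formula above, the following Python sketch computes the error from the observations, their bin assignments, and the high-dimensional bin centroids. It is not the glance() implementation; the function name and input format are assumptions.

```python
import math


def hbe(data, bin_ids, highd_centroids):
    """Root of the mean (over n observations) squared distance to the bin centroid.

    `data[i]` is the i-th observation (p values), `bin_ids[i]` is its bin, and
    `highd_centroids[bin_id]` is that bin's centroid in the data space.
    All three names are hypothetical.
    """
    n = len(data)
    if n == 0:
        raise ValueError("data must contain at least one observation")
    total = 0.0
    for row, bin_id in zip(data, bin_ids):
        centroid = highd_centroids[bin_id]
        # Inner sum over the p variables for one observation
        total += sum((x - c) ** 2 for x, c in zip(row, centroid))
    return math.sqrt(total / n)
```

For two observations (0, 0) and (2, 0) in one bin with centroid (1, 0), each squared distance is 1, so the error is sqrt(2 / 2) = 1.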

HBE of candidates

Chosen fit for PBMC data set

tSNE with perplexity: 30

Clusters with small separations, non-linear clusters

Dense points, filled-out clusters

Prediction into \(2\text{-}D\)

  • Predict a new observation’s value in the NLDR, for any method (predict_emb())

  • For a new observation

    • Determine the closest bin centroid in high dimensions using the fitted model
    • Predict it to be the centroid of this bin in \(2\text{-}D\)
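The two prediction steps above can be sketched as a nearest-centroid lookup. This Python is an illustration under assumed inputs, not the predict_emb() implementation; the function name and the two dictionaries keyed by bin identifier are hypothetical.

```python
def predict_2d(new_obs, highd_centroids, lowd_centroids):
    """Map a new high-dimensional observation to a 2-D position.

    `highd_centroids[bin_id]` is a bin's centroid in the data space and
    `lowd_centroids[bin_id]` is the same bin's centroid in the 2-D layout.
    """
    def squared_distance(bin_id):
        centroid = highd_centroids[bin_id]
        return sum((x - c) ** 2 for x, c in zip(new_obs, centroid))

    # Step 1: closest bin centroid in high dimensions
    nearest_bin = min(highd_centroids, key=squared_distance)
    # Step 2: predict the 2-D centroid of that bin
    return lowd_centroids[nearest_bin]
```

Because the lookup only needs the fitted centroids, it works the same way whichever NLDR method produced the layout.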



quollr





questioning how a high-dimensional object looks in low-dimensions using r

Interactivity

Summary

  • Provides a method to create a model from an NLDR layout that
    can be displayed with the data to assess the fit.
  • Makes it easier for researchers to decide which
    NLDR layout is best for their work.
  • Has the additional benefit that, for any NLDR method, you can now
    predict where new data points will be positioned in the NLDR layout.

R package

Draft paper

Jayani P.G. Lakshika


Collaborators: Prof Dianne Cook, Dr Paul Harrison, Dr Michael Lydeamore, Dr Thiyanga S. Talagala