# Modeling PTR ratios using ESM-2 embeddings

## Background
While transcript abundance is a major determinant of protein abundance, post-transcriptional processes can substantially modulate the relationship between RNA and protein expression [1]. This discrepancy is evident from the weak correlations between mRNA and protein abundance observed for important gene classes across tissues and cell types [2]. Translation initiation, elongation, and termination, as well as protein degradation, all contribute to these differences. Understanding the regulatory elements governing these processes is essential to delineate gene regulatory programs and their implications in disease predisposition [3].

In prior work, we predicted protein-to-mRNA ratios (PTR ratios) from sequence using matched proteome and transcriptome expression levels for 11,575 genes across 29 human tissues [4]. We incorporated known post-transcriptional regulatory elements as well as de novo motifs identified using a k-mer approach across the 5’ untranslated region (UTR), coding sequence, and 3’ UTR.

A key limitation of this approach is that it relies on local sequence context, as captured by the k-mer method. Language models have emerged as a new way to encapsulate rich contextual information across different biological modalities. Protein language models, such as ESM-2 [5], have demonstrated state-of-the-art performance in several downstream tasks, suggesting their potential to encode information about protein stability.

### References

[1]	Y. Liu, A. Beyer, and R. Aebersold, "On the Dependency of Cellular Protein Levels on mRNA Abundance," Cell, vol. 165, no. 3, pp. 535–550, Apr. 2016, doi: 10.1016/j.cell.2016.03.014.

[2]	N. Fortelny, C. M. Overall, P. Pavlidis, and G. V. C. Freue, "Can we predict protein from mRNA levels?," Nature, vol. 547, no. 7664, pp. E19–E20, Jul. 2017, doi: 10.1038/nature22293.

[3]	I. Karbassi et al., "A Standardized DNA Variant Scoring System for Pathogenicity Assessments in Mendelian Disorders," Hum. Mutat., vol. 37, no. 1, pp. 127–134, Jan. 2016, doi: 10.1002/humu.22918.

[4]	B. Eraslan et al., "Quantification and discovery of sequence determinants of protein‐per‐mRNA amount in 29 human tissues," Mol. Syst. Biol., vol. 15, no. 2, p. e8513, Feb. 2019, doi: 10.15252/msb.20188513.

[5]	Z. Lin et al., "Evolutionary-scale prediction of atomic-level protein structure with a language model," Science, vol. 379, no. 6637, pp. 1123–1130, Mar. 2023, doi: 10.1126/science.ade2574.

## Case Study
### Presentation
Please prepare a ~10-minute presentation about the work done by _Eraslan et al._ [4].  
  
Use a 5-slide presentation structure:  
  
- Motivation: why is this paper important? What question does it address?
- Highlighted abstract: copy the paper title and abstract, and highlight the main claims.
- First figure or figure panel of your choice: the one supporting what you consider the most important claim. Explain the figure. For a methods paper, this could be the figure showing the model or the algorithm.
- Second figure of your choice: the one supporting what you consider the next most important claim. Explain the figure.
- How could protein or DNA language models complement the work done in this paper?  

### Task
You are provided with pre-computed ESM-2 embeddings for a subset of the PTR dataset described above. Choose either a linear model (e.g., ridge regression) or a non-linear model (e.g., a neural network) to predict the PTR ratios across the 29 human tissues. Evaluate your model's performance using the **R²** metric and visualize your results.
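
As a rough illustration of the ridge regression option, the sketch below fits a single multi-output ridge model and reports one R² value per tissue on a held-out split. It runs on synthetic data only; the helper name `fit_and_score`, the 80/20 split, and `alpha=1.0` are assumptions chosen for illustration, not settings prescribed by this task.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(X, Y, alpha=1.0, seed=0):
    """Fit ridge regression on a train split and return per-target R² on the test split.

    X: (n_genes, n_features) fixed-length features (e.g. pooled embeddings).
    Y: (n_genes, n_tissues) targets (e.g. PTR ratios per tissue).
    """
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.2, random_state=seed
    )
    model = Ridge(alpha=alpha)  # alpha is a hypothetical default; tune it in practice
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    # multioutput="raw_values" returns one R² score per target column (tissue)
    return r2_score(Y_test, Y_pred, multioutput="raw_values")


# Synthetic stand-in data: 200 "genes", 16 features, 29 "tissues"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
W = rng.normal(size=(16, 29))
Y = X @ W + 0.1 * rng.normal(size=(200, 29))

r2_per_tissue = fit_and_score(X, Y)
print(r2_per_tissue.round(3))
```

Reporting R² per tissue, rather than a single pooled number, makes it easier to see whether some tissues are predicted markedly better than others.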

> **Hint:** Consider applying a pooling operation over the sequence length (e.g., mean or max) to obtain a fixed-length representation for downstream prediction.
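
The pooling step mentioned in the hint can be sketched as follows. This is a minimal example on a toy array; the function name `pool_embedding` is hypothetical, and mean and max are just the two options named in the hint.

```python
import numpy as np


def pool_embedding(emb, method="mean"):
    """Collapse a per-residue embedding of shape (L, D) into one vector of shape (D,)."""
    emb = np.asarray(emb)
    if emb.ndim != 2:
        raise ValueError(f"expected a 2D array (L, D), got shape {emb.shape}")
    if method == "mean":
        return emb.mean(axis=0)  # average over the sequence-length axis
    if method == "max":
        return emb.max(axis=0)  # element-wise maximum over the sequence-length axis
    raise ValueError(f"unknown pooling method: {method!r}")


# Toy embedding with L=3 residues and D=4 features
toy = np.arange(12, dtype=float).reshape(3, 4)
mean_vec = pool_embedding(toy, "mean")  # [4. 5. 6. 7.]
max_vec = pool_embedding(toy, "max")    # [ 8.  9. 10. 11.]
```

Either variant yields a vector whose length does not depend on `L`, so proteins of different lengths can be stacked into one feature matrix.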

### Dataset

This dataset is based on the [_Eraslan et al._ publication](https://doi.org/10.15252/msb.20188513). It contains PTR ratios for **1,097 genes** in **29 human tissues**. The dataset has been filtered to include only proteins of at most 300 residues and genes with no missing values.

#### `ptr_subset.csv`

- **Type:** CSV file containing protein-to-mRNA ratio (PTR) data
- **Size:** 1,098 rows (including header)
- **Structure:** 
  - **First column:** `GeneName` - contains gene identifiers
  - **Remaining 29 columns:** PTR values for different human tissues/organs
  - **Tissues include:** Adrenal, Appendices, Brain, Colon, Duodenum, Endometrium, Esophagus, Fallopian tube, Fat, Gallbladder, Heart, Kidney, Liver, Lung, Lymph node, Ovary, Pancreas, Placenta, Prostate, Rectum, Salivary gland, Small intestine, Smooth muscle, Spleen, Stomach, Testis, Thyroid, Tonsil, and Urinary bladder
- **Content:** Each row represents a gene and its PTR values across different tissues
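
One possible way to read a table with this layout is sketched below using pandas. The example parses a small in-memory CSV with made-up values so that it runs without the real file; the function name `load_ptr_table` is hypothetical, and the exact tissue column headers in `ptr_subset.csv` are assumed rather than confirmed.

```python
import io

import pandas as pd


def load_ptr_table(path_or_buffer):
    """Read a PTR table with a `GeneName` column followed by one column per tissue."""
    df = pd.read_csv(path_or_buffer, index_col="GeneName")
    # The dataset is described as having no missing values; fail loudly if that is not true
    if df.isna().any().any():
        raise ValueError("PTR table contains missing values")
    return df


# Toy stand-in for the real file (values are invented for illustration only)
toy_csv = io.StringIO(
    "GeneName,Adrenal,Brain,Liver\n"
    "GENE_A,4.1,3.9,4.5\n"
    "GENE_B,5.0,5.2,4.8\n"
)
ptr = load_ptr_table(toy_csv)
gene_names = ptr.index.tolist()   # row order, reusable to align the embeddings
Y = ptr.to_numpy(dtype=float)     # target matrix of shape (n_genes, n_tissues)
```

For the real data, the same function would be called as `load_ptr_table("ptr_subset.csv")`, assuming the file sits in the working directory.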

#### `esm2_subset.h5`

- **Type:** HDF5 file containing ESM-2 protein embeddings
- **Size:** 1,097 gene entries
- **Structure:** 
  - Each gene name serves as a key in the HDF5 file
  - Each dataset contains embeddings with shape `(L, 1280)`
  - `L` is the length of the respective protein sequence in amino acids
  - The second dimension (**1,280**) holds the ESM-2 embedding features for each amino acid, extracted from the last layer of ESM-2
- **Content:** Pre-computed protein sequence embeddings generated using the ESM-2 (Evolutionary Scale Modeling 2) transformer model, which captures evolutionary and structural information from protein sequences
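
Given the structure described above, the embeddings could be read with `h5py` and pooled into a feature matrix whose rows follow a given gene order. The sketch below writes a tiny toy HDF5 file first so that it runs without the real data; the function name `load_pooled_embeddings`, the toy gene names, and the choice of mean pooling are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def load_pooled_embeddings(h5_path, gene_names):
    """Return an (n_genes, D) matrix of mean-pooled embeddings, ordered like gene_names."""
    rows = []
    with h5py.File(h5_path, "r") as f:
        for gene in gene_names:
            emb = f[gene][:]                # per-residue embedding, shape (L, D)
            rows.append(emb.mean(axis=0))   # mean over residues gives shape (D,)
    return np.vstack(rows)


# Build a toy file that mimics the layout: one dataset per gene, shape (L, D)
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_esm2.h5")
rng = np.random.default_rng(0)
with h5py.File(toy_path, "w") as f:
    f.create_dataset("GENE_A", data=rng.normal(size=(5, 8)))
    f.create_dataset("GENE_B", data=rng.normal(size=(9, 8)))

# Rows come back in the requested order, not in file order
X = load_pooled_embeddings(toy_path, ["GENE_B", "GENE_A"])
print(X.shape)
```

Passing the gene order taken from the PTR table keeps the rows of the feature matrix aligned with the rows of the target matrix.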