High-throughput identification of MHC class I binding peptides using an ultradense peptide array

Data files (in the data subfolder):

File extensions: .txt and .tsv are tab-separated files, .csv is a comma-separated file, and .gz indicates gzip compression.
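As a rough illustration of how these extensions map to delimiters and compression, the sketch below reads any of the data files using only the Python standard library. The function name read_table is hypothetical and is not part of the notebooks, which may load the files differently (for example with pandas).

```python
import csv
import gzip
from pathlib import Path


def read_table(path):
    """Read a delimited text file (optionally gzip-compressed) into a list of dicts.

    Hypothetical helper: the delimiter is inferred from the extension,
    following the conventions listed above.
    """
    path = Path(path)
    suffixes = [s.lower() for s in path.suffixes]
    # .csv is comma separated; .txt and .tsv are tab separated.
    delimiter = "," if ".csv" in suffixes else "\t"
    # A trailing .gz means the file must be decompressed while reading.
    opener = gzip.open if suffixes and suffixes[-1] == ".gz" else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

Each row comes back as a dictionary keyed by the column names in the file's header line, with all values left as strings.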

File name
    Description

Mamu_MHC_Class_1_SIV_Peptide_Intensities_From_Vendor.tsv.gz
    Raw peptide data for the SIVMAC239 peptides as received from the vendor. Also includes peptides from other virus strains. Includes MHC A001, A002, B002, and B017.

GagCM9_TatSL8_Substitution_Array_From_Vendor.txt.gz
    Raw peptide data as received from the vendor for the maturation plot for MHC A001. Includes substitution data for Gag CM9 and Tat SL8.

Mamu_MHC_Class_1_SIV_Peptide_metadata.tsv
    Lookup table for vendor-to-MHC name translation, with additional information about the samples.

SIVMAC239_CORR_KEY.csv.gz
    Maps each SIV MAC239 probe sequence to its sequence ID, virus name, protein name, and position in the sequence.

IC50_log2_binding_score.csv.gz
    Manually parsed table based on the results of SIV_MHC_Peptide_Array_Ranking_Pipeline.ipynb and publicly available IC50 values. Used as an input for SIV_MHC_IC50_vs_Peptide_Array_ROC_Wilcoxin_Signed_Test.ipynb.

 

Code:

Three Jupyter notebooks (referred to below as scripts) are used to aggregate, plot, and perform statistical tests on the data. They are in the CODE subfolder.

 

File name
    Description

SIV_MHC_Peptide_Array_Ranking_Pipeline.ipynb
    This script transforms and aggregates the raw peptide data by log-transforming the intensities and taking the median of replicate peptide values.
    Next, the sample columns are renamed from the vendor naming convention to the corresponding MHC name.
    Next, the data is merged with the corresponding key to join each probe sequence to its sequence data.
    Finally, the data is ranked by aggregated intensity value, grouped by MHC name.

SIV_MHC_IC50_vs_Peptide_Array_ROC_Wilcoxin_Signed_Test.ipynb
    This script takes the aggregated data and performs the statistical tests.
    It tests the data for normality and chooses the statistical test based on the inputs.
    The final pipeline used the non-parametric, unpaired Wilcoxon rank-sum test.
    Quantile-quantile plots, histograms, and box plots are generated.

GagCM9_TatSL8_Substitution_Analysis.ipynb
    This script transforms and aggregates the raw peptide data as received from the vendor into a form that is suitable for statistical analysis and plotting.
    The script performs the substitutions based on the vendor's standard. Finally, line plots and heatmaps of the data are generated.
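The steps of the ranking pipeline (log transform, median of replicates, renaming, merging with the key, and ranking within each MHC) can be sketched on toy data as follows. This is a minimal standard-library illustration, not the notebook's code: the column names, sample names, and values are hypothetical, base 2 is assumed for the log transform, and rank 1 is assumed to mean the highest aggregated intensity.

```python
import math
from collections import defaultdict
from statistics import median

# Toy stand-ins for the vendor file, the metadata table and the sequence key.
# All names and numbers below are hypothetical.
raw_rows = [
    {"probe": "PEPTIDE_A", "vendor_1": 1024.0, "vendor_2": 64.0},
    {"probe": "PEPTIDE_A", "vendor_1": 4096.0, "vendor_2": 16.0},  # replicate
    {"probe": "PEPTIDE_B", "vendor_1": 256.0, "vendor_2": 512.0},
]
vendor_to_mhc = {"vendor_1": "A001", "vendor_2": "B017"}
sequence_key = {
    "PEPTIDE_A": {"protein": "protein_1", "position": 1},
    "PEPTIDE_B": {"protein": "protein_2", "position": 1},
}


def rank_peptides(raw_rows, vendor_to_mhc, sequence_key):
    # Steps 1 and 2: log2-transform each intensity and collect replicates
    # per (MHC, probe), renaming vendor sample columns to MHC names.
    replicates = defaultdict(list)
    for row in raw_rows:
        for vendor_name, mhc_name in vendor_to_mhc.items():
            replicates[(mhc_name, row["probe"])].append(math.log2(row[vendor_name]))

    # Step 3: take the median of the replicates and join each probe
    # to its annotation from the sequence key.
    by_mhc = defaultdict(list)
    for (mhc_name, probe), values in replicates.items():
        record = {"mhc": mhc_name, "probe": probe, "log2_median": median(values)}
        record.update(sequence_key[probe])
        by_mhc[mhc_name].append(record)

    # Step 4: rank within each MHC (ties keep their input order here).
    ranked = []
    for mhc_name in sorted(by_mhc):
        ordered = sorted(by_mhc[mhc_name], key=lambda r: r["log2_median"], reverse=True)
        for rank, record in enumerate(ordered, start=1):
            ranked.append(dict(record, rank=rank))
    return ranked


for record in rank_peptides(raw_rows, vendor_to_mhc, sequence_key):
    print(record["mhc"], record["probe"], record["log2_median"], record["rank"])
```

Grouping by MHC before ranking means each MHC gets its own ranking from 1 upward, so a peptide can rank first for one MHC and last for another.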

 
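For readers unfamiliar with the unpaired Wilcoxon rank-sum test named above, the following is a small standard-library sketch of the two-sided test using the normal approximation without a tie correction. The function names and the two input groups are hypothetical, and how the notebook defines its groups is not described here. In an environment with scipy installed, scipy.stats.ranksums computes the same statistic; whether the notebook calls that function is an assumption.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p_value). A positive z means x tends to be larger than y.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * P(Z > |z|)
    return z, p_value


# Hypothetical log2 intensities for two unpaired groups of peptides.
group_1 = [10.0, 11.0, 12.0]
group_2 = [1.0, 2.0, 3.0]
print(rank_sum_test(group_1, group_2))
```

Because the test only uses ranks, it makes no assumption that the intensities are normally distributed, which is the usual reason for preferring it when a normality check fails.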

System Requirements

  • The code was written and tested on macOS and Linux-based operating systems.
  • The code has not been tested on Windows and may require Windows-specific modifications to run there.
  • The code is compatible with the Google Colab environment.
  • Though each system is different, the tested systems had at least a dual-core processor, 8 GB of RAM, and 10 GB of free disk space.
  • The packages were installed from within the Jupyter notebooks using !pip install.
  • The files were written in Jupyter Notebook using a Python 3.6 environment and the following packages: scipy, numpy, pandas, matplotlib.
  • Not all of the defined functions were used in the final pipeline.
  • You will need to update the folder paths in the notebooks to reflect the location of the files on your file system.
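
One way to keep the folder paths in a single place is sketched below with pathlib. The variable names, the helper missing_files, and the use of the current working directory as the project root are assumptions for illustration; the notebooks may define their paths differently.

```python
from pathlib import Path

# Hypothetical layout: set PROJECT_ROOT to wherever the archive was unpacked.
PROJECT_ROOT = Path.cwd()
DATA_DIR = PROJECT_ROOT / "data"
CODE_DIR = PROJECT_ROOT / "CODE"

# File names as listed in the data file table of this README.
DATA_FILES = {
    "intensities": DATA_DIR / "Mamu_MHC_Class_1_SIV_Peptide_Intensities_From_Vendor.tsv.gz",
    "substitution": DATA_DIR / "GagCM9_TatSL8_Substitution_Array_From_Vendor.txt.gz",
    "metadata": DATA_DIR / "Mamu_MHC_Class_1_SIV_Peptide_metadata.tsv",
    "sequence_key": DATA_DIR / "SIVMAC239_CORR_KEY.csv.gz",
    "ic50": DATA_DIR / "IC50_log2_binding_score.csv.gz",
}


def missing_files(paths):
    """Return the paths that do not exist, as sorted strings."""
    return sorted(str(p) for p in paths if not Path(p).exists())


# Report anything that still needs its path adjusted before running a notebook.
for path in missing_files(DATA_FILES.values()):
    print("Not found:", path)
```

Building the paths with the / operator avoids hard-coding separators, which may also reduce the changes needed on Windows.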