Profile Picture

My previous paper pointed out that ancestry is important for understanding the genetic diversity of cancer cell lines and their implications for therapeutic responses. This tutorial outlines the computational pipeline that I derived to reliably classify cancer cell lines into ancestries based on genotype data. The pipeline consists of three key stages: data processing, quality control, genotype imputation, and ancestry inference, each explained in detail to guide its implementation.

Disclaimer: The paper was written in 2019 as part of a project whose main goal was to evaluate the impact of ancestry to drug response in cancer cell lines. Today there might be more advanced ancestry inference methods, applied not only to cancer cell lines. It would be interesting to see the performance of our method in comparison to the more recent methods.

Preparation

I first created a few folders in the working directory as following, and put the raw data into the data folder:

|--- working dir 
   |--- data # Raw data
   |--- transformed # Plink text files (PED, MAP, FAM)
   |--- plink # Plink binary files (BED & BIM)
   |--- qc # QCed data
   |--- input # Input for ancestry reference
   |--- reference dir # All reference files
        |--- legend/1000GP_Phase3_combined.legend
        |--- vcf
        |--- m3vcf
        |--- bcf
        |--- map/genetic_map_hg19_withX.txt

I also set up a few things before the main process. First I downloaded 1000 Genome Project’s reference VCF files for all populations and put to reference dir/vcf/, legend file and put to ref dir/legend/, genetic map file and put to ref dir/map.

I also installed the following software in advance: R, R packages ggplot2, dplyr, biomaRt and stringr, Perl, Ensembl API, Eagle2, minimac3, minimac4. You can check out this Dockerfile that I used to install these softwares.

Finally, I put the following utility R scripts in the working directory which will assist the subsequent analyses:

check_position.R:


    #!/usr/bin/env Rscript

    args <- commandArgs(trailingOnly=TRUE)

    sample <- read.table(args[1], header = F, sep = "\t")
    ref <- read.table(args[2], header = F)
    index <- match(sample$V4, ref$V1)
    index_unmatched <- is.na(index)  
    unmatched_position <- sample[index_unmatched,] 
    write.table(unmatched_position, file = "qc/unmatched_position.txt", row.names = F, col.names = F, quote = F)

check_refalt.R:


    #!/usr/bin/env Rscript

    args <- commandArgs(trailingOnly=TRUE)

    sample <- read.table(args[1], header = F, sep = "\t", stringsAsFactors = F)
    ref <- read.table(args[2], header = F, stringsAsFactors = F)
    ref <- ref[match(sample$V4, ref$V1),]

    for (i in 1:nrow(sample)){
        sam_ref <- sample$V6[i]
        sam_alt <- sample$V5[i]
        ref_ref <- ref$V2[i]
        ref_alt <- ref$V3[i]
        
        if (sam_ref == ref_ref && sam_alt == ref_alt){
            next
        } else if ( sam_ref == ref_alt && sam_alt == ref_ref){
            next
        } else {
            cat(paste(sample[i,2]), file="./qc/unmatched_alleles.txt", append=TRUE, sep = "\n")
        }
    }

Data preprocessing

The first step in the pipeline involves data processing to prepare genotype data for subsequent analysis. In our case, the input data were in CSV format of the Affymetrix SNP6.0 arrays containing 1007 cancer cell lines and 884,148 SNPs ranging from chromosome 1 to chromosome 22, which are deposited in the European Genome-Phenome Archive (EGAS00001000978). Particularly, there was one CSV file per cell line, of which rows were SNPs and columns were he SNPs’ features. Below is an example of an input CSV (not a real cell line and only 5 rows are shown):

Figure 1

Example CSV file for genotypes of cancer cell lines

The data was first transformed into PLINK file formats (PED, MAP, BED, BIM, and FAM), which are suitble for subsequent genomic analyses. During preprocessing, CSV text files were manipulated through multiple steps to output appropriate PLINK formats. Run the following code for data processing:


    WDIR=[working directory]
    DDIR=[data directory]
    REFDIR=[reference directory]
    LEGFILE=${REFDIR}/legend/1000GP_Phase3_combined.legend
    VCFFILE=${REFDIR}/vcf
    MAPFILE=${REFDIR}/map/genetic_map_hg19_withX.txt

    ###Transform genotyping files into .ped and .map files and arrange in chromosomes
    for chr in {1..21}
    do
        for file in  ${filename[@]} 
        do
            #Filter for chromosome and add whitespace within geneotypes
            awk -F "\"*,\"*" -v chr=$chr '$1==chr{print $12}' $DDIR/$file | sed 's/./& /g' > tmp.txt 
            
            #Add '0' to missing geneotypes
            awk '!$1 { $1="0 0 " }1 ' tmp.txt > tmp_filled.txt 
            
            #Flip column -> row
            paste -s -d "" tmp_filled.txt > tmp_flipped.txt
            echo "$file" > cell_name_tmp.txt
            
            #Remove unneccesary part in cell name
            perl -pi -e 's/_complexGenotypes.csv//g' cell_name_tmp.txt 
            paste -d " " cell_name_tmp.txt tmp_flipped.txt >> chr"$chr"_tmp.txt
        done

        awk '{print $1}' chr"$chr"_tmp.txt > cell_name.txt
        cut -f1 -d " " --complement chr"$chr"_tmp.txt > chr"$chr"_tmp_trimmed.txt

        while IFS= read -r line; do
            #Create essential fields for the .ped file
            echo "1 $line 0 0 0 0" >> chr"$chr"_prefix.txt 
        done < cell_name.txt

        #Make .ped file
        paste -d " " chr"$chr"_prefix.txt chr"$chr"_tmp_trimmed.txt > ./transformed/chr"$chr".ped 

        #Make .map file
        awk -F "\"*,\"*" -v chr=$chr '$1==chr{print $1, $6, $2}' $DDIR/8-MG-BA_complexGenotypes.csv > ./transformed/chr"$chr".map

        #Remove rows (cell lines) with fewer columns than expected (unknown technical error)
        cp ./transformed/chr"$chr".ped ./transformed/"$chr".ped 
        max=$(awk '{print NF}' ./transformed/"$chr".ped | sort -n | tail -n 1 )
        awk -v MAX=$max 'NF==MAX{print}' ./transformed/"$chr".ped > ./transformed/chr"$chr".ped 
        rm ./transformed/"$chr".ped

        #Make plink binary files
        ./plink --file ./transformed/chr"$chr" --make-bed --out ./plink/chr"$chr"
        ./plink --freq --bfile ./transformed/chr"$chr" --out ./plink/chr"$chr" 
    done

Quality control

The transformed data was then checked for its quality before imputation. During quality control, SNPs with more than 10% missing data or minor allele frequencies below 0.05 were excluded. The positions of retained SNPs were compared against a legend file of the same reference build from the 1000 Genomes Project, and any mismatched SNPs were removed. Additional steps were taken to identify and exclude SNPs with inconsistencies such as reference strand swapping, alternate allele mismatches, or strand flipping, ensuring only high-confidence data remained for downstream analyses. Run the following code to execute quality control:


    for chr in {1..21}
    do
        # Checking genotyping rate, MAF and HWE integrity
        plink --bfile ./plinkd/chr$chr --silent --missing --out ./qc/chr$chr 
        sed 1d ./qc/chr${chr}.lmiss | awk '{if ($5 > 0.1) print $2,$5}' > ./qc/chr${chr}_missing_snps.txt
        if [[ -s ./qc/chr${chr}_missing_snps.txt ]] 
        then 
            echo "These SNPs have larger than 10% missing rate (see complete list in ./qc/chr${chr}_missing_snps.txt):"
            cat ./qc/chr${chr}_missing_snps.txt | head 
            echo '...' | sed G
        else 
            echo "All SNPs have higher than 90% typing rate" | sed G
        fi
        plink --bfile ./plinkd/chr${chr} \
        --silent \
        --geno 0.1 --recode --out ./qc/chr${chr}_freqfiltered

        #SNPs with unmatched positions
        #Checking positions compared to reference legend file
        cat ${LEGFILE} | awk -v CHR=${chr} '{if ($2==CHR && $6=="Biallelic_SNP") print}' > ./qc/ref_chr${chr}.txt
        awk '{print $3}' ./qc/ref_chr${chr}.txt > ./qc/ref_chr${chr}_coord.txt
        echo "Reference legend written to ./qc/ref_chr${chr}.txt and ./qc/ref_chr${chr}_coord.txt" | sed G
        Rscript --vanilla check_position.R ./qc/chr${chr}_freqfiltered.map ./qc/ref_chr${chr}_coord.txt
        awk '{print $2}' ./qc/unmatched_position.txt > ./qc/chr${chr}_unmatched_position.txt
        if [[ -s ./qc/chr${chr}_unmatched_position.txt ]] 
        then 
            echo "SNPs having unmatched positions (see complete list at ./qc/chr${chr}_unmatched_position.txt):"
            cat ./qc/chr${chr}_unmatched_position.txt | head
            echo '...' | sed G
            echo Removing these SNPs... | sed G
            plink --file ./qc/chr${chr}_freqfiltered --silent --exclude ./qc/chr${chr}_unmatched_position.txt --recode --out ./qc/chr${chr}_posfiltered
        else 
            echo All SNPs have correct positions | sed G
            plink --file ./qc/chr${chr}_freqfiltered --silent --recode --out ./qc/chr${chr}_posfiltered
        fi 

        #Check REF/ALT and flip strands
        awk '{print $3,$4,$5}' ./qc/ref_chr${chr}.txt > ./qc/ref_chr${chr}_alleles.txt
        echo "Legend alleles are written to ./qc/ref_chr${chr}_alleles.txt" | sed G
        
        ##Checking unmatching and flipping...
        rm -f ./qc/swapped_alleles.txt ./qc/unmatched_alleles.txt
        plink --file ./qc/chr${chr}_posfiltered --silent --make-bed --out ./qc/chr${chr}_posfiltered
        Rscript --vanilla check_refalt.R ./qc/chr${chr}_posfiltered.bim ./qc/ref_chr${chr}_alleles.txt
        cp ./qc/unmatched_alleles.txt ./qc/chr${chr}_unmatched_alleles.txt
        if [[ -s ./qc/chr${chr}_unmatched_alleles.txt ]]
        then
            echo "These SNPs have invalid REF/ALT alleles (see complete list at ./qc/chr${chr}_unmatched_alleles.txt):"
            cat ./qc/chr${chr}_unmatched_alleles.txt | head
            echo '...' | sed G
            echo Removing them... | sed G
            plink --bfile ./qc/chr${chr}_posfiltered --silent --exclude ./qc/chr${chr}_unmatched_alleles.txt --make-bed --out ./qc/chr${chr}_QCed
        else 
            echo No invalid alleles found | sed G
            plink --bfile ./qc/chr${chr}_posfiltered --silent --make-bed --out ./qc/chr${chr}_QCed
        fi
        echo "Finish QC check for chromosome ${chr}. Final QCed files at ./qc/chr${chr}_QCed" | sed G | sed G
    done

Imputation

After quality control, we needed to impute the missing genotypes that were not included in the Affymetrix detection. This is an important step since the SNPs of our interest (involved in the ancestry inference) were not detected in the original dataset. To this end, the binary file sets were converted into VCF files using PLINK. A reference genome in the form of phased VCF files was obtained from the 1000 Genomes Project at the beginning and subsequently converted to BCF format using BCFtools and to M3VCF format using Minimac3. The input VCF files were phased using Eagle2, utilizing the BCF reference files and the genetic map from the 1000 Genomes Project. Once phased, the VCF files were imputed with Minimac4 to fill in missing SNPs on each chromosome, relying on the M3VCF reference genome for accurate predictions.


    WDIR=[working directory]
    DDIR=[data directory]
    REFDIR=[reference directory]
    LEGFILE=${REFDIR}/legend/1000GP_Phase3_combined.legend
    VCFFILE=${REFDIR}/vcf
    M3VCFFILE=${REFDIR}/m3vcf
    BCFFILE=${REFDIR}/bcf
    MAPFILE=${REFDIR}/map/genetic_map_hg19_withX.txt

    echo Creating BCF reference and target files for phasing... | sed G
    for chr in {1..21} 
    do
        echo Creating indexed BCF files for chromosome ${chr}...
        bcftools view ${VCFFILE}/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -O b -o ${BCFFILE}/ref_chr${chr}.bcf
        bcftools index ${BCFFILE}/ref_chr${chr}.bcf
        plink --silent --bfile ./qc/chr${chr}_QCed --recode vcf-iid bgz --out ./input/chr${chr}
        bcftools view ./input/chr${chr}.vcf.gz -O b -o ./input/chr${chr}.bcf
        bcftools index ./input/chr${chr}.bcf
    done

    echo Phasing... | sed G 
    for chr in {1..21}
    do
        echo Phasing chromosome ${chr}...
        eagle --vcfRef ${BCFFILE}/ref_chr${chr}.bcf \
                --vcfTarget ./input/chr${chr}.bcf  \
                --geneticMapFile ${MAPFILE} \
                --outPrefix ./phased/chr${chr}.phased \
                --allowRefAltSwap \
                --numThreads 5 \
                --vcfOutFormat z 
    done

    echo Imputing...
    for chr in {1..21}
    do
        echo Imputing chromosome ${chr}...
        minimac4 --refHaps ${M3VCFFILE}/${chr}.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz \
                --haps ./phased/chr${chr}.phased.vcf.gz \
                --prefix ./imputed/chr${chr}.imputed \
                --cpus 5  \
                --noPhoneHome \
                --format GT,DS,GP \
                --allTypedSites --minRatio 0.00001
    done

After the imputation, post processing needed to be done to ensure the quality of the imputed SNPs and extract the geneotypes of the SNPs of interest. For inferring ancestry, I leveraged the set of 100 SNPs that were shown to be associated with ancestry (genetic markers) by Sampson et al.. The raw genotype data from GDSC only included 26 from the 100 SNPs, so I did the imputation as shown above to retrive the genotypes of the rest 74 SNPs. Following imputation, I got the complete genotype information of the 100 SNPs for each cell line. In order to extract the SNPs from the imputed data, I first need the names of the SNPs in a text file, saved in the working directory as 100snps.txt:

rs194014
rs1356733
rs4751629
rs1281340
rs11164559
...

The I used the text file to get the SNPs’ coordinates and other properties using the R package biomaRt, which retrived the information from the Ensembl database. The retrived information will be used to post process the imputed data. To this end I run the R script:


    #!/usr/bin/env Rscript

    library(ggplot2)
    library(dplyr)
    library(biomaRt)
    library(stringr)

    #Load 100 SNPs
    data_snp <- read.csv("100snps.txt", sep = "\t", header = F)

    #Get info from biomart
    snp_mart <- useMart(biomart="ENSEMBL_MART_SNP", 
                    host="grch37.ensembl.org", path="/biomart/martservice", 
                    dataset="hsapiens_snp")
    data_snp_mart <- getBM(attributes=c(
    "refsnp_id", "chr_name","allele", "allele_1", "minor_allele",
    "minor_allele_freq", "synonym_name", "variation_names"),
    filters="snp_filter", values=data_snp$V1,
    mart=snp_mart, uniqueRows=TRUE)
    
    #Get coordinates for 100 SNPs (for genotype imputation using other tools)
    data_snp_coord <- getBM(attributes=c(
    "refsnp_id", "chr_name", "chrom_start", "chrom_end"),
    filters="snp_filter", values=rsID_ethc,
    mart=snp_mart, uniqueRows=TRUE)
    data_snp_coord <- data_snp_coord[-grep("H",data_snp_coord$chr_name),]
    write.table(data_snp_coord, file = "snp.txt", sep = "\t", row.names = F, quote = F)

After getting the ancestral SNP coordinates, I run the following script to process the imputed results the extract the relevant SNPs for subsequenct ancestry inference:


    WDIR=[working directory]
    DDIR=[data directory]
    REFDIR=[reference directory]
    LEGFILE=${REFDIR}/legend/1000GP_Phase3_combined.legend
    VCFFILE=${REFDIR}/vcf
    M3VCFFILE=${REFDIR}/m3vcf
    BCFFILE=${REFDIR}/bcf
    MAPFILE=${REFDIR}/map/genetic_map_hg19_withX.txt
    SNP=${WDIR}/snp.txt

    for chr in {1..21}
    do
        echo Processing imputed chromosome ${chr}
        
        plink --vcf ./imputed/chr${chr}.imputed.dose.vcf.gz \
            --snps-only \
            --silent \
            --recode --out ./scratch/chr${chr}_filtered
        if [[ ! -s ./scratch/chr${chr}_filtered.ped ]]; then 
            echo No snp passing filtering | sed G
            continue 
        fi
        perl -pi -e 's/"//g' $SNP
        awk -v CHR=${chr} '$2==CHR{print}' $SNP > ./scratch/snp_${chr}.txt
        Rscript --vanilla match_ids.R ./scratch/chr${chr}_filtered.map ./scratch/snp_${chr}.txt ${chr}
        
        awk '{print $5}' ./scratch/snp_${chr}_withIDs.txt > ./scratch/extracted_ids_${chr}.txt 
        plink --file ./scratch/chr${chr}_filtered \
            --silent \
            --extract ./scratch/extracted_ids_${chr}.txt \
            --recode vcf-iid --out ./scratch/chr${chr}_filtered_extracted
        awk '!/#|##/ {print}' ./scratch/chr${chr}_filtered_extracted.vcf > ./scratch/chr${chr}_filtered_extracted_trimmed.vcf
        awk '/#CHROM/{print}' ./scratch/chr${chr}_filtered_extracted.vcf > ./ancestral_snp/snp_imputed_${chr}.txt
        #perl -pi -e 's/0_1_//g' ./ancestral_snp/snp_imputed_${chr}.txt
        while IFS= read -r line; do
            echo $line > ./scratch/tmp.txt
            REF=$(awk '{print $4}' ./scratch/tmp.txt)
            ALT=$(awk '{print $5}' ./scratch/tmp.txt)
            perl -pi -e 's/0\/0/'${REF}''${REF}'/g' ./scratch/tmp.txt
            perl -pi -e 's/0\/1/'${REF}''${ALT}'/g' ./scratch/tmp.txt
            perl -pi -e 's/1\/1/'${ALT}''${ALT}'/g' ./scratch/tmp.txt
            cat ./scratch/tmp.txt >> ./ancestral_snp/snp_imputed_${chr}.txt
        done < ./scratch/chr${chr}_filtered_extracted_trimmed.vcf
        perl -pi -e 's/ /\t/g' ./ancestral_snp/snp_imputed_${chr}.txt 
    done

Ancestry inference

The observed and imputed ancestral SNPs were leveraged to infer ancestry of the cell lines. To this end, Bayesian inference was employed to estimate the likelihood of a cell line belonging to a specific population given its observed genotypes, the genotype frequencies of the population, and their weights in the 1000 Genomes Project.

First, the population genotype frequencies were retrieved using the Ensembl API. After installing the API, I run the following Perl script to retrieve the general frequencies for the SNPs of interest:


    #One needs to install Ensembl API before using this script
    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    # Load the registry from db
    $registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
    );

    my $va_adaptor = $registry->get_adaptor('human','variation','variation');
    $va_adaptor->db->use_vcf(1);

    my $filename = '100snps.txt';
    open(my $fh, '<:encoding(UTF-8)', $filename)
    or die "Could not open file '$filename' $!";
    
    my @snps = <$fh>;
    chomp @snps;

    foreach my $snps (@snps) {
    print "$snps\n";
    }

    foreach my $snps (@snps) {
    my $va = $va_adaptor->fetch_by_name($snps); #get the Variation from the database using the name
    my $genotypes = $va->get_all_PopulationGenotypes();
    foreach my $genotype (@{$genotypes}) {
    my $pop_name = $genotype->population->name;
    my $gt_name = $genotype->genotype_string;
    my $freq = $genotype->frequency;
        open (FH, '>>', 'genotypes.txt') or die "Could not open file $!";
    print FH "$snps $pop_name $gt_name $freq\n";
    close (FH);
    }
    }

    print('Write to genotype successfully!')

After running the Perl script, I got the result stored in genotype.txt in the working directory, which is a text file of four columns for SNP names, populations, genotypes and frequency, respectively, as example below:

rs194014 AFR GG 0.142208774583964 # Frequencies of the 3 genotypes of the SNP for Africans
rs194014 AFR AG 0.453857791225416
rs194014 AFR AA 0.40393343419062
rs194014 AMR GG 0.786743515850144 # Frequencies of the 3 genotypes of the SNP for Americans
rs194014 AMR AG 0.198847262247839
rs194014 AMR AA 0.0144092219020173
...

I also needed the population sizes which I obtained from the 1000 Genomes Project and use them as the population weights in my inference pipeline. You can get it here and put it in the working directory.

For a cell line \(Y_i\) with the genotype \(G_i\) derived from ancestral SNPs that occur independently, the probability \(P(G_i)\) that \(Y_i\) belongs to a specific population 𝑘 (where 𝑘 represents one of the 25 subpopulations in the 1000 Genomes Project) is calculated as follows:

\[P(\vec{G}_i) = \frac{\prod_{j=1}^{100} \hat{P}(Y_i = k) \times {P}(Y_i = k)}{\hat{P}(G_{ij})}\]

The subpopulation with the highest probability was assigned to the cell line. To implement this, I run the following script:


    #!/usr/bin/env Rscript

    library(ggplot2)
    library(dplyr)
    library(biomaRt)
    library(stringr)

    #Import imputed data
    for (chr in seq(1:21)){
    file <- paste0("./ancestral_snps/snp_imputed_",chr,".txt")
    if (file.exists(file)){
        assign(paste0("snp_",chr,"_imputed"),read.delim(file, header = T, sep = "\t", row.names = NULL))
    }
    }
    cname <- colnames(snp_1_imputed)
    for (chr in seq(1:21)){
    name <- paste0('snp_',chr,'_imputed')
    if (exists(name)) {
        obj=get(name)
        obj=obj[,cname]
        assign(name,obj)
    }
    }

    snp_imputed <- do.call(rbind, lapply( paste0("snp_",c(1:21),"_imputed") , get) )
    snp_imputed$ID <- as.character(snp_imputed$ID)
    index <- match(snp_imputed$POS,data_snp_coord$chrom_start)
    for (i in 1:length(index)){
    snp_imputed$ID[i] <- data_snp_coord$refsnp_id[index[i]]
    }

    snp_imputed <- snp_imputed[,-c(1,2,4:9)]
    rownames(snp_imputed) <- snp_imputed$ID
    data_final <- as.matrix(snp_imputed[,-1])

    #Label cell lines with estimated ethnicity probability using frequencies from 1000Gen
    size_1kg <- read.csv("sample_size_1kg.csv", header = T)
    data_1kg <- read.delim("genotypes2.txt", header = F, sep = "",
                            col.names = c("rsID", "population", "genotype", "frequency"))
    data_1kg$sup_pop <- vector(mode = "character", length = nrow(data_1kg))
    data_1kg$size <- vector(mode = "numeric", length = nrow(data_1kg))
    index <- match(data_1kg$population, size_1kg$Population)

    for (i in 1:nrow(data_1kg)){
    data_1kg$size[i]<- size_1kg$Number.of.genotypes[index[i]]
    if (data_1kg$population[i]=="ALL"){
        data_1kg$sup_pop[i] <- "ALL"
    } else if (data_1kg$population[i] %in% c("ACB","AFR","ASW","ESN","GWD","LWK","MSL","YRI")){
        data_1kg$sup_pop[i] <- "AFR"
    } else if (data_1kg$population[i] %in% c("AMR","CLM","MXL","PEL","PUR")){
        data_1kg$sup_pop[i] <- "AMR"
    } else if (data_1kg$population[i] %in% c("EAS","CDX","CHB","CHX","JPT","KHV")){
        data_1kg$sup_pop[i] <- "EAS"
    } else if (data_1kg$population[i] %in% c("EUR","CEU","FIN","GBR","IBS","TSI")){
        data_1kg$sup_pop[i] <- "EUR"
    } else if (data_1kg$population[i] %in% c("SAS","BEB","GIH","ITU","PJL","STU")){
        data_1kg$sup_pop[i] <- "SAS"
    } else {
        data_1kg$sup_pop[i] <- "NA"
    }
    }
    data_1kg <- data_1kg[-grep("NA",data_1kg$sup_pop),] 
    data_1kg <- data_1kg[!data_1kg$population %in% c("ALL","AFR","AMR","EAS","EUR","SAS"),] %>%
    droplevels()

    prob <- vector(mode = "list", length = ncol(data_final))
    names(prob) <- colnames(data_final)
    prob_mat <- matrix(nrow = ncol(data_final), ncol = nlevels(data_1kg$population),
                    dimnames = list(colnames(data_final),levels(data_1kg$population)))

    for (n in 1:ncol(data_final)){
    prob[[n]] <- matrix(nrow = nrow(data_final), ncol = nlevels(data_1kg$population),
                        dimnames = list(rownames(data_final),levels(data_1kg$population)))
    for (i in 1:nrow(data_final)) {
        for (j in 1:ncol(prob[[n]])) {
        sub_data <- data_1kg[as.character(data_1kg$rsID) == rownames(data_final)[i] &
                                as.character(data_1kg$population) == colnames(prob[[n]])[j], ]
        temp_index <- match(data_final[i,n], sub_data$genotype)
        if (is.na(temp_index)){
            tmp <- strsplit(data_final[i,n],NULL)[[1]]
            tmp_rev <- rev(tmp)
            data_final[i,n] <- paste(tmp_rev,collapse = '')
            temp_index <- match(data_final[i,n], sub_data$genotype)
        }
        temp_snp_freq <- sub_data$frequency[temp_index]
        temp_pop_size <- sub_data$size[temp_index]
        prob[[n]][i,j] <- as.numeric(temp_snp_freq) * as.numeric(temp_pop_size)  
        } 
    }
    prob_mat[n,] <- apply(prob[[n]], 2, prod, na.rm=T) 
    }

    prob_vec <- prob_vec = apply(prob_mat,1,function(x) names(which.max(x)))
    cell_ancestry <- data.frame(cell=names(prob_vec), ancestry=prob_vec)
    cell_ancestry$sup_pop <- match(cell_ancestry$ancestry,data_1kg$population)
    cell_ancestry$sup_pop <- data_1kg$sup_pop[cell_ancestry$sup_pop]
    cell_ancestry$cell <- str_replace_all(cell_ancestry$cell, c("\\."="-", "^X"="")) 

    saveRDS(prob_mat, file = "prob_mat.rds")
    write.table(cell_ancestry, file = "cell_ancestry.csv", row.names = F, quote = F, sep = ",")

The inferred ancestry for all cell lines in our study looks like this:

Figure 1

Ancestry distribution among cell lines and cancer types

Above is the step by step guide of how to infer ancestry from raw genotype data of cancer cell lines. The comprehensive pipeline could be found in my github repository for this project.