Hi,

We are biology students at Avans Breda trying to run a machine learning script in R on a dataset of human genome sequences. We ran into some errors and hope that one of you can help us.

The scripts we are trying to run are located at https://github.com/cancer-genomics/delfi_scripts.

Our error comes up while running the 04 script, which is part of the scripts in the GitHub repository mentioned above, on a single bam file from the original dataset.

As far as we understand, this script joins the output of the previous scripts (an .rds file) with the sample_reference.csv file from the GitHub repository, and then splits the data into 5mb bins, which are later used in a stochastic gradient boosted algorithm. The problem is in this bit of code:

df.fr <- readRDS("../.../.../ourfilespecification_frags_bin_100kb.rds")
master <- read_csv("sample_reference.csv")
df.fr2 <- inner_join(df.fr, master, by=c("sample"="WGS ID"))
hic.eigen <- (df.fr2 %>% filter(sample=="PGDX10346P1"))$hic.eigen

But while joining, it gives the following error message:

Error in UseMethod("inner_join") :
  no applicable method for 'inner_join' applied to an object of class "c('GRanges', 'GenomicRanges', 'Ranges', 'GenomicRanges_OR_missing', 'GenomicRanges_OR_GenomicRangesList', 'GenomicRanges_OR_GRangesList', 'List', 'Vector', 'list_OR_List', 'Annotated', 'vector_OR_Vector')"
Calls: inner_join

We assumed this meant that the inner_join function is not compatible with the GRanges class, so we tried converting the GRanges object to a data frame first:

df.fr <- data.frame(readRDS("../.../.../ourfilespecification_frags_bin_100kb.rds"))

When we ran the script again, the error message changed to this:

Error: by can't contain join column sample which is missing from LHS

Backtrace:

█

├─dplyr::inner_join(df.fr, master, by = c(sample = "WGS ID"))

└─dplyr:::inner_join.data.frame(df.fr, master, by = c(sample = "WGS ID"))

├─base::as.data.frame(...)

├─dplyr::inner_join(tbl_df(x), y, by = by, copy = copy, ...)

└─dplyr:::inner_join.tbl_df(...)

├─dplyr::common_by(by, x, y)

└─dplyr:::common_by.character(by, x, y)

└─dplyr:::common_by.list(by, x, y)

└─dplyr:::bad_args(...)

└─dplyr:::glubort(fmt_args(args), ..., .envir = .envir)

Execution halted

It seems to us that the two data frames have no matching key columns to join on. When we looked at the input file created by the previous scripts, there was no column called sample. The sample_reference.csv file does have a "WGS ID" column.
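To illustrate what we mean, here is a minimal sketch on a toy GRanges object rather than our real file. The metadata column `short` and all values are made up for illustration; only the GRanges class itself comes from our error message. As far as we understand, converting to a data frame keeps the range columns plus whatever metadata columns exist, so a sample column only appears if it was already stored in the object.

```r
library(GenomicRanges)

# Toy GRanges with one hypothetical metadata column ("short").
gr <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(1, 100001), width = 100000),
  short    = c(10, 12)
)

# Metadata columns stored on the object; "sample" is not among them.
meta_cols <- names(mcols(gr))

# Conversion keeps seqnames/start/end/width/strand plus the metadata columns.
df <- as.data.frame(gr)
df_cols <- colnames(df)

# FALSE here, which is the situation that triggers the
# "missing from LHS" error in inner_join.
has_sample <- "sample" %in% df_cols
print(df_cols)
print(has_sample)
```

Running the same `colnames()` check on the real .rds file should show which columns are actually available as join keys.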

Is it possible to join these files and continue with the scripts?
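One workaround we can imagine is sketched below on toy data. It rests on an assumption we are not sure about: that all rows in our .rds file come from a single bam file, so a constant sample column could be added by hand before joining. The columns `short` and `type`, the ID "OTHER_SAMPLE", and all values are hypothetical; only "sample", "WGS ID" and PGDX10346P1 come from the original script. We do not know whether this is what the authors intended.

```r
library(dplyr)

# Toy stand-in for the binned fragment table (hypothetical columns and values).
df.fr <- data.frame(
  seqnames = c("chr1", "chr1"),
  start    = c(1, 100001),
  end      = c(100000, 200000),
  short    = c(10, 12),
  stringsAsFactors = FALSE
)

# Toy stand-in for sample_reference.csv; check.names = FALSE keeps the
# space in "WGS ID", as read_csv would.
master <- data.frame(
  `WGS ID` = c("PGDX10346P1", "OTHER_SAMPLE"),
  type     = c("A", "B"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assumption: every row belongs to this one sample, so the key can be
# added as a constant column when it is missing.
sample_id <- "PGDX10346P1"
if (!("sample" %in% colnames(df.fr))) {
  df.fr$sample <- sample_id
}

# With the key present on both sides, the join from the script can run.
df.fr2 <- inner_join(df.fr, master, by = c("sample" = "WGS ID"))
print(df.fr2)
```

In this sketch the ID has to match an entry in the "WGS ID" column exactly, otherwise the inner join returns zero rows.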

Another thing that bothers us is that PGDX10346P1 in the code is the name of a bam file. Do we have to change it to the name of the bam file we are running the scripts on? But first, the inner_join problem.

Could anyone help us fix this error? The authors of the paper told us that the man who wrote the script is currently unavailable for medical reasons, so we can't ask him.

Please keep in mind that we are biology students, not informatics or mathematics students, so we are not very good at coding.

With kind regards, School of Life Sciences, Avans University of Applied Sciences, Breda, The Netherlands.