Sunday, September 26, 2021

Running multiple parallel scripts from a bash script using GNU parallel

Suppose we want to run a Perl program (or a program in any other language) from bash on multiple input files, say 1000 of them, on a machine that has only 48 or 64 threads. Launching everything at once can overload the machine and lead to a "Resource temporarily unavailable" error.

To circumvent this problem, we want to run at most 48 scripts/commands in parallel at a time, without overloading the resources. This can be done using GNU parallel. Its advantage is that, by default, it runs as many jobs in parallel as there are CPU cores.

To do this, we create a file with one command per line and pass it to GNU parallel.

time for d in */*fq; do
    echo "perl /storage/apps/SNP_Validation_Scripts/tools/Q20_Q30_Stats_wo_functions.pl $d"
done > parallel_script_Q20_Q30_stats.sh


parallel < parallel_script_Q20_Q30_stats.sh
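The number of simultaneous jobs can also be capped explicitly with the -j option instead of relying on the default. Below is a minimal sketch of the same workflow, assuming GNU parallel is installed. The demo directories, the demo_commands.sh file name and the use of wc -l as the per-file command are hypothetical stand-ins for the real fastq files and the Perl script.

```shell
#!/bin/sh
# Sketch only: the demo_* inputs and "wc -l" stand in for real data and the real script.
workdir=$(mktemp -d)
mkdir -p "$workdir/demo_A" "$workdir/demo_B"
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/demo_A/reads.fq"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > "$workdir/demo_B/reads.fq"

# Write one command per line. The glob is expanded by the shell itself (no ls),
# and each file name is wrapped in double quotes so paths with spaces survive.
for f in "$workdir"/demo_*/*.fq; do
    printf 'wc -l "%s"\n' "$f"
done > "$workdir/demo_commands.sh"

# -j 48 caps the number of simultaneous jobs at 48.
# Without -j, GNU parallel defaults to one job per CPU core.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel -j 48 < "$workdir/demo_commands.sh"
else
    echo "GNU parallel not found; commands were written to $workdir/demo_commands.sh" >&2
fi
```

If the intermediate command file is not needed, GNU parallel can also take the inputs directly, along the lines of: parallel -j 48 perl script.pl {} ::: */*fq, where {} is replaced by each input file in turn.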

And that's it!

Tuesday, September 21, 2021

Upgrade R version 3.5 to 4.1 in CentOS 7

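# Try updating R from the enabled repositories, then check what is installed
$ sudo yum update R
$ yum list installed | grep R

# Enable the EPEL and optional repositories before installing the R package
$ sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install yum-utils
$ sudo yum-config-manager --enable "rhel-*-optional-rpms"
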

$ export R_VERSION=4.1.1
$ curl -O https://cdn.rstudio.com/r/centos-7/pkgs/R-${R_VERSION}-1-1.x86_64.rpm
$ sudo yum install R-${R_VERSION}-1-1.x86_64.rpm

$ /opt/R/${R_VERSION}/bin/R --version
R version 4.1.1 (2021-08-10) -- "Kick Things"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

$ sudo ln -s /opt/R/${R_VERSION}/bin/R /usr/local/bin/R

$ sudo ln -s /opt/R/${R_VERSION}/bin/Rscript /usr/local/bin/Rscript


# Let's check whether the R version has changed

$ R

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-conda_cos6-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

# The old R 3.5 from Anaconda is still found first, so the R 4.1 softlink has to replace it. The old R is at:


$ which R

/storage/apps/anaconda3/bin/R

# Rename the R 3.5 executable so that it stays available as R3.5
$ sudo mv /storage/apps/anaconda3/bin/R /storage/apps/anaconda3/bin/R3.5

# Softlink R 4.1 in its place
$ sudo ln -s /opt/R/${R_VERSION}/bin/R /storage/apps/anaconda3/bin/R

# Now R points to the new version
$ R --version
R version 4.1.1 (2021-08-10) -- "Kick Things"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
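
The reason the first check still reported R 3.5.1 is that the shell runs the first R it finds on the PATH, and the Anaconda directory came before /usr/local/bin. The sketch below illustrates this with two fake R scripts in a temporary directory, so it is safe to run anywhere. The directory layout only mimics the paths in this post, and the scripts are hypothetical stand-ins, not real R installations.

```shell
#!/bin/sh
# Sketch only: two fake "R" executables that just print a version string.
demo=$(mktemp -d)
mkdir -p "$demo/anaconda3/bin" "$demo/opt/R/4.1.1/bin"
printf '#!/bin/sh\necho "R version 3.5.1"\n' > "$demo/anaconda3/bin/R"
printf '#!/bin/sh\necho "R version 4.1.1"\n' > "$demo/opt/R/4.1.1/bin/R"
chmod +x "$demo/anaconda3/bin/R" "$demo/opt/R/4.1.1/bin/R"

# Anaconda's directory listed first: the old R is found, as in the check above.
before=$(PATH="$demo/anaconda3/bin:$demo/opt/R/4.1.1/bin:$PATH"; R --version)

# New directory listed first: the shell now finds R 4.1.1.
after=$(PATH="$demo/opt/R/4.1.1/bin:$demo/anaconda3/bin:$PATH"; R --version)

echo "before: $before"
echo "after:  $after"
```

On a real system, putting the R 4.1.1 bin directory ahead of Anaconda on the PATH (for example in the shell startup file) would be an alternative to renaming the file inside the Anaconda directory; which approach fits better depends on whether other tools rely on Anaconda's R.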


Wednesday, September 8, 2021

BEAST - ESS of combined logs vs ESS of individual run logs

Sometimes, when running BEAST, the ESS values of the individual runs are not > 200, but the combined ESS of all the independent runs is > 200 and the traces of the three independent runs converge in the combined trace. Is it then acceptable to use the combined estimate of parameters such as the mutation rate?


Here is an answer from a BEAST author:

https://groups.google.com/g/beast-users/c/b11vV19nqkI

User:

Is it acceptable to use an estimate of say, TMRCA, from the combined analysis of three identical runs (from different random seeds) if the ESS is >200 for all parameters in the combined file (but not in each individual file) in Tracer?

Or do I need to do that three times (for a total of 9 runs) and use log combiner to combine the nine files?

alexei....@gmail.com

3 runs combined to give an ESS of >200 is probably safe but you need to be a little bit careful. 10 runs to get an ESS of >200 is probably not safe because if individual runs are given ESS estimates of about 20 then there is a question of whether those ESS estimates are valid at all. You should use Tracer to visually inspect that your 3 runs are giving essentially the same answers for all the parameters and the likelihood and prior. The traces should substantially overlap and give basically the same mean and variance. If your three runs give traces that don't overlap then you can't combine them no matter what the combined ESS says!

Cheers

Alexei