Thursday, December 29, 2022

Resolved - scenic+ pip install error

  pip install -e .

Obtaining file:///home/prakki/sw/scenicplus

  Preparing metadata (setup.py) ... error

  ERROR: Command errored out with exit status 1:

   command: /home/prakki/sw/pyenv/versions/3.6.4/bin/python3.6 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/prakki/sw/scenicplus/setup.py'"'"'; __file__='"'"'/home/prakki/sw/scenicplus/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-rn24et6_

       cwd: /home/prakki/sw/scenicplus/

  Complete output (22 lines):

  Traceback (most recent call last):

    File "<string>", line 1, in <module>

    File "/home/prakki/sw/scenicplus/setup.py", line 48, in <module>

      "Operating System :: OS Independent",

    File "/home/prakki/.local/lib/python3.6/site-packages/setuptools/__init__.py", line 145, in setup

      return distutils.core.setup(**attrs)

    File "/home/prakki/sw/pyenv/versions/3.6.4/lib/python3.6/distutils/core.py", line 108, in setup

      _setup_distribution = dist = klass(attrs)

    File "/home/prakki/.local/lib/python3.6/site-packages/setuptools/dist.py", line 444, in __init__

      k: v for k, v in attrs.items()

    File "/home/prakki/sw/pyenv/versions/3.6.4/lib/python3.6/distutils/dist.py", line 281, in __init__

      self.finalize_options()

    File "/home/prakki/.local/lib/python3.6/site-packages/setuptools/dist.py", line 732, in finalize_options

      ep.load()(self, ep.name, value)

    File "/home/prakki/.local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2434, in load

      return self.resolve()

    File "/home/prakki/.local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2440, in resolve

      module = __import__(self.module_name, fromlist=['__name__'], level=0)

    File "/home/prakki/sw/scenicplus/.eggs/setuptools_scm-7.1.0-py3.6.egg/setuptools_scm/__init__.py", line 5

      from __future__ import annotations

      ^

  SyntaxError: future feature annotations is not defined

  ----------------------------------------

WARNING: Discarding file:///home/prakki/sw/scenicplus. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.


Solution:

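The traceback shows the cause: setuptools_scm 7.1.0 uses "from __future__ import annotations", which requires Python 3.7 or later, but pip was running under Python 3.6.4. Although I had installed Python 3.8 for scenic+, the default python on my PATH was still 3.6: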

$ python --version

Python 3.6.4


However, Python 3.8 was installed in the scenicplus conda environment:

$ which python3.8

/home/prakki/anaconda3/envs/scenicplus/bin/python3.8

Suspecting that I was also using an older pip, I checked that the same bin directory contained an up-to-date pip.

Running the installation with that pip, called by its full path, finally installed scenic+:

/home/prakki/anaconda3/envs/scenicplus/bin/pip install .
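
A quick way to catch this kind of mismatch before installing is to ask the interpreter itself which binary is running and whether it is new enough. The snippet below is only a rough sketch: the helper name supports_future_annotations is my own, and the 3.7 threshold comes from the fact that "from __future__ import annotations" was introduced in Python 3.7.

```python
import sys


def supports_future_annotations(version_info=sys.version_info):
    """Return True if this Python can run 'from __future__ import annotations'.

    The feature was added in Python 3.7, so anything older raises the
    SyntaxError seen in the traceback above.
    """
    return tuple(version_info[:2]) >= (3, 7)


if __name__ == "__main__":
    # Shows which interpreter is really being used, e.g. the pyenv 3.6.4 one
    # instead of the one inside the conda environment.
    print("Interpreter:", sys.executable)
    print("Version:", ".".join(str(part) for part in sys.version_info[:3]))
    print("Supports future annotations:", supports_future_annotations())
```

Running it with the same python that pip belongs to shows immediately whether the install is about to happen under the wrong interpreter.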





Wednesday, November 30, 2022

devtools installation error

 installing *source* package ‘textshaping’ ...

** package ‘textshaping’ successfully unpacked and MD5 sums checked

** using staged installation

Package harfbuzz was not found in the pkg-config search path.

Perhaps you should add the directory containing `harfbuzz.pc'

to the PKG_CONFIG_PATH environment variable

No package 'harfbuzz' found

Package fribidi was not found in the pkg-config search path.

Perhaps you should add the directory containing `fribidi.pc'

to the PKG_CONFIG_PATH environment variable

No package 'fribidi' found

Using PKG_CFLAGS=

Using PKG_LIBS=-lfreetype -lharfbuzz -lfribidi -lpng

--------------------------- [ANTICONF] --------------------------------

Configuration failed to find the harfbuzz freetype2 fribidi library. Try installing:

 * deb: libharfbuzz-dev libfribidi-dev (Debian, Ubuntu, etc)

 * rpm: harfbuzz-devel fribidi-devel (Fedora, EPEL)

 * csw: libharfbuzz_dev libfribidi_dev (Solaris)

 * brew: harfbuzz fribidi (OSX)

If harfbuzz freetype2 fribidi is already installed, check that 'pkg-config' is in your

PATH and PKG_CONFIG_PATH contains a harfbuzz freetype2 fribidi.pc file. If pkg-config

is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:

R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'

-------------------------- [ERROR MESSAGE] ---------------------------

<stdin>:1:10: fatal error: hb-ft.h: No such file or directory

compilation terminated.

--------------------------------------------------------------------

ERROR: configuration failed for package ‘textshaping’

* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/textshaping’

Warning in install.packages :

  installation of package ‘textshaping’ had non-zero exit status

ERROR: dependency ‘textshaping’ is not available for package ‘ragg’

* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/ragg’

Warning in install.packages :

  installation of package ‘ragg’ had non-zero exit status

ERROR: dependency ‘ragg’ is not available for package ‘pkgdown’

* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/pkgdown’

Warning in install.packages :

  installation of package ‘pkgdown’ had non-zero exit status

ERROR: dependency ‘pkgdown’ is not available for package ‘devtools’

* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/devtools’

Warning in install.packages :

  installation of package ‘devtools’ had non-zero exit status


The downloaded source packages are in

‘/tmp/RtmpbqMThv/downloaded_packages’



* using staged installation

Package libtiff-4 was not found in the pkg-config search path.

Perhaps you should add the directory containing `libtiff-4.pc'

to the PKG_CONFIG_PATH environment variable

No package 'libtiff-4' found

Package libtiff-4 was not found in the pkg-config search path.

Perhaps you should add the directory containing `libtiff-4.pc'

to the PKG_CONFIG_PATH environment variable

No package 'libtiff-4' found

Using PKG_CFLAGS=

Using PKG_LIBS=-lfreetype -lpng16 -ltiff -lz -ljpeg -lbz2

-----------------------------[ ANTICONF ]-------------------------------

Configuration failed to find one of freetype2 libpng libtiff-4. Try installing:

 * deb: libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev (Debian, Ubuntu, etc)

 * rpm: freetype-devel libpng-devel libtiff-devel libjpeg-turbo-devel (Fedora, CentOS, RHEL)

 * csw: libfreetype_dev libpng16_dev libtiff_dev libjpeg_dev (Solaris)

If freetype2 libpng libtiff-4 is already installed, check that 'pkg-config' is in your

PATH and PKG_CONFIG_PATH contains a freetype2 libpng libtiff-4.pc file. If pkg-config

is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:

R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'

-------------------------- [ERROR MESSAGE] ---------------------------

<stdin>:1:10: fatal error: ft2build.h: No such file or directory

compilation terminated.

--------------------------------------------------------------------

ERROR: configuration failed for package ‘ragg’

* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/ragg’

Warning in install.packages :

  installation of package ‘ragg’ had non-zero exit status

ERROR: dependency ‘ragg’ is not available for package ‘pkgdown’

* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/pkgdown’

Warning in install.packages :

  installation of package ‘pkgdown’ had non-zero exit status

ERROR: dependency ‘pkgdown’ is not available for package ‘devtools’

* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/devtools’

Warning in install.packages :

  installation of package ‘devtools’ had non-zero exit status


The downloaded source packages are in

‘/tmp/RtmpmOvwEJ/downloaded_packages’

To fix this, I needed to install the following system libraries from the Linux terminal:

sudo apt install build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev

sudo apt-get -y build-dep libcurl4-gnutls-dev

sudo apt-get -y install libcurl4-gnutls-dev

sudo apt-get install libcurl4-openssl-dev libssl-dev

sudo apt install libharfbuzz-dev libfribidi-dev

sudo apt-get install libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev



Tuesday, November 29, 2022

Resolved- Error in scale.default: length of 'scale' must equal the number of columns of 'x'

I encountered the following error when running the prcomp function in my Shiny app:

Error in scale.default: length of 'scale' must equal the number of columns of 'x'

My command was:

prcomp(Annotated_Gene_PA_df_t, scale = pca_scale)

pca_scale was a radio button with code:

radioButtons("pca_scale", label = "Scale", inline = TRUE, choices = c("TRUE", "FALSE"), selected = "FALSE"),

Choices selected in the GUI are passed to the server as strings, so input$pca_scale arrives as "TRUE" or "FALSE" (with quotes) rather than as a logical value. prcomp expects its scale argument to be a logical (or a numeric vector with one value per column), so a string triggers the error above.

What we need to pass is a real boolean, TRUE or FALSE, without any quotes.

One fix is to have the radio button return 1 or 0 for TRUE/FALSE and then convert that value into a logical on the server side.

So I changed the above line in ui.R like this:

radioButtons("pca_scale", label = "Scale", inline = TRUE, choices = c("TRUE" = 1, "FALSE" = 0), selected = 0)

and in server.R, I replaced input$pca_scale with:

as.logical(as.numeric(input$pca_scale))



Resolved - UnsatisfiableError: The following specifications were found to be incompatible with each other

I encountered an error creating a conda environment while following the single-cell best practices tutorial available at:

https://www.sc-best-practices.org/introduction/raw_data_processing.html#a-real-world-example  

Below is the issue:

conda create -n af_xmpl -y -c bioconda python=3.9 salmon alevin-fry pyroe

Collecting package metadata (current_repodata.json): done

Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.

Collecting package metadata (repodata.json): done

Solving environment: | 

Found conflicts! Looking for incompatible packages.

This can take several minutes.  Press CTRL-C to abort.

failed                                                                                                                                                      

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

The commands below did not work:

conda update --all --yes
conda config --set channel_priority flexible
conda config --set channel_priority false

conda create -n af_xmpl 
conda activate af_xmpl
conda install -c bioconda python=3.9 salmon alevin-fry pyroe

The following commands worked once I added -c conda-forge:

conda create -n af_xmpl 
conda activate af_xmpl
conda install -c conda-forge -c bioconda python=3.9 salmon alevin-fry pyroe

Friday, October 7, 2022

Resolved - dist function - NAs introduced by coercion

# Get euclidean dist

euclideanDist <- dist(simMatrix, method = "euclidean")

Warning message:

In dist(simMatrix, method = "euclidean") : NAs introduced by coercion


This warning appears because the first column is a factor (non-numeric labels), and dist() tries to coerce it to numbers. Converting the first column into row names takes care of it.

simMatrix <- simMatrix %>% tibble::column_to_rownames("Column1Header")

  # Get euclidean dist

  euclideanDist <- dist(simMatrix, method = "euclidean")

Now the warning is gone, since all the remaining columns in the matrix hold only numerical values.

Thursday, September 15, 2022

Installing Seurat Package in Rstudio

While installing Seurat, I had to install several system dependencies on my Linux machine. I am keeping a log of them here.

install.packages('Seurat')

Had an issue with curl:

ERROR: configuration failed for package ‘curl’

* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/curl’

Warning in install.packages :

  installation of package ‘curl’ had non-zero exit status

sudo apt-get install libcurl4-openssl-dev

configure: error: geos-config not found or not executable.

ERROR: configuration failed for package ‘rgeos’

* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/rgeos’

 sudo apt install libgeos-dev

 install.packages('Seurat')

ERROR: configuration failed for package ‘nloptr’
* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/nloptr’
Warning in install.packages :
  installation of package ‘nloptr’ had non-zero exit status

sudo apt-get install libnlopt-dev

ERROR: configuration failed for package ‘xml2’
* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/xml2’
Warning in install.packages :
  installation of package ‘xml2’ had non-zero exit status

sudo apt-get install libxml2-dev 

install.packages("xml2")

ERROR: configuration failed for package ‘hdf5r’
* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/hdf5r’

sudo apt-get install libhdf5-dev

---------------------------

The ones below are not necessary for Seurat, but I needed them for other packages:

ERROR: configuration failed for package ‘systemfonts’
* removing ‘/home/ramadatta/R/x86_64-pc-linux-gnu-library/4.1/systemfonts’

sudo apt -y install libfontconfig1-dev

Thursday, September 8, 2022

Bash - partially resolved - cannot create temp file for here-document: no space left on device


First, I deleted some of the files from /var/log.

Docker had not been used for some time, so I then deleted all the unused Docker images, which reclaimed some space!

cd /var/log
du -sh *

sudo du -ahx / | sort -rh | head -20
sudo docker image prune --all
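
The "cannot create temp file for here-document" message usually points to a full filesystem under bash's temporary directory, so it helps to see how full each location is before and after cleaning up. Here is a small sketch using only the Python standard library; the helper name percent_used and the list of directories checked are my own choices, not anything required by bash or Docker.

```python
import os
import shutil


def percent_used(path="/"):
    """Return how full the filesystem holding 'path' is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # Directories that commonly fill up; adjust the list for your machine.
    for directory in ("/", "/tmp", "/var"):
        if os.path.exists(directory):
            print(f"{directory}: {percent_used(directory):.1f}% used")
```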

Wednesday, August 31, 2022

Useful conda commands list

List the conda environments

conda info --envs

Create a conda environment

conda create --name PlasClass

Activate a conda environment

conda activate PlasClass

Install a package in conda environment

conda install -c bioconda plasclass

Install multiple packages of specific version while creating a conda environment

conda create -n PlasClass python=3.7 scikit-learn=0.21.3 plasclass

Install specific version of package in a conda environment

conda install spades=3.9.0

Search for the package versions available to install in a conda environment

conda search spades

Remove an environment

conda env remove -n PlasClass

Export an environment to a yml file

conda env export > singlecell_env.yml

Create an environment from a yml file

conda env create -f singlecell_env.yml

working with mamba  fast installations
useful tips


Monday, August 8, 2022

Difference between PCA and PCoA

PCA and PCoA are really similar. In fact, PCA is just a type of PCoA that uses Euclidean distances! So we could say:

Type of Ordination:
- MDS
- CCA
- PCoA
  - PCoA of Jaccard distances
  - PCoA of Bray-Curtis dissimilarities
  - PCoA of Euclidean distances (this is also called PCA)
  - PCoA of UniFrac distances

Source: https://forum.qiime2.org/t/pca-vs-pcoa-which-is-the-appropriate-one-for-microbiome-data/5974/2
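
To convince myself of the "PCoA of Euclidean distances is PCA" statement, here is a small pure-Python check. It is only a sketch on made-up numbers: the sample matrix and all function names are my own. PCoA eigen-decomposes the double-centered matrix of squared distances, while the PCA scores come from the eigen-decomposition of the product of the column-centered data with its transpose. If the two matrices are identical, both methods must give the same sample coordinates (up to sign).

```python
def center_columns(rows):
    """Subtract each column mean, as PCA does before decomposition."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def squared_euclidean_distances(rows):
    """Matrix of squared Euclidean distances between all pairs of samples."""
    return [[sum((a - b) ** 2 for a, b in zip(r1, r2)) for r2 in rows]
            for r1 in rows]


def double_center(d2):
    """PCoA step: B = -1/2 * J * D^2 * J, written out element by element."""
    n = len(d2)
    row_means = [sum(row) / n for row in d2]
    col_means = [sum(column) / n for column in zip(*d2)]
    grand_mean = sum(row_means) / n
    return [[-0.5 * (d2[i][j] - row_means[i] - col_means[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def gram_matrix(rows):
    """Inner products between samples (X times X transposed)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


# Made-up data: 4 samples (rows) x 3 features (columns).
samples = [[2.0, 0.0, 1.0], [0.0, 1.0, 3.0], [4.0, 2.0, 0.0], [1.0, 5.0, 2.0]]

pcoa_matrix = double_center(squared_euclidean_distances(samples))
pca_matrix = gram_matrix(center_columns(samples))

max_difference = max(abs(p - q)
                     for p_row, q_row in zip(pcoa_matrix, pca_matrix)
                     for p, q in zip(p_row, q_row))
print("Largest difference between the two matrices:", max_difference)
```

With any other distance (Jaccard, Bray-Curtis, UniFrac) the double-centered matrix no longer equals this product, which is where PCoA and PCA part ways.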

Tuesday, July 19, 2022

Calculate the frequency for a range of nucleotides aligned to reference from a BAM file

I wanted to find the frequency of a range of nucleotides across multiple reads aligned to a reference sequence in a BAM file.

As shown in the figure below, I want to count how often each combination of nucleotides occurs at positions 13 to 15.


For this, I used the samtools mpileup command together with a few bash commands:

samtools mpileup snps.bam -Q 0 -r CY116646.1:13-15 | # Extract aligned region
sed 's/\^\]//g' |  # "^]" signifies a base that was at the start of a read 
awk '{print $5}' | # Read bases
sed -e 's/./&,/g' -e 's/,$//g' >infile && # temp file
ruby -F, -lane 'BEGIN{$a=[]};$a.concat([$F]);END{$a.transpose.each{|x|puts x.join(",")}}' infile | # pass temp file to ruby oneliner
sed 's/,//g' | #remove delimiter
sort | 
uniq -c # Count


This results in:
[mpileup] 1 samples in 1 input files
     10 GGT
      1 GTT
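
The transpose-and-count part of this pipeline can also be sketched in a few lines of Python. This is only an illustration of the idea: the function name count_base_combinations is my own, and the three input strings are made up so that they reproduce the counts shown above (one string of read bases per position, as in column 5 of the mpileup output after cleaning).

```python
from collections import Counter


def count_base_combinations(pileup_columns):
    """Count the base combination each read carries across the positions.

    pileup_columns holds one string per reference position; character i of
    every string is assumed to come from the same read.
    """
    lengths = {len(column) for column in pileup_columns}
    if len(lengths) > 1:
        raise ValueError("columns differ in length; reads cannot be paired up")
    # zip(*...) transposes positions x reads into reads x positions.
    return Counter("".join(bases) for bases in zip(*pileup_columns))


# Made-up read bases for positions 13, 14 and 15 (11 reads each).
columns = ["GGGGGGGGGGG", "GGGGGGGGGGT", "TTTTTTTTTTT"]
counts = count_base_combinations(columns)
for combination, count in counts.most_common():
    print(count, combination)
```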


EDIT 1:

However, this approach does not work if the mpileup base strings are not all of the same length.

The same result can be achieved with the GenomicAlignments R package. All we need is the input BAM file and the coordinates of the region.

library(GenomicAlignments)
# Stack the read sequences that overlap the region of interest
stacked <- stackStringsFromBam("snps.bam", param = GRanges("CY116646.1:13-15"))
write.table(stacked, "Aligned_Reads_at_Region_Of_Interest.txt", quote = FALSE, col.names = FALSE, row.names = FALSE)


If you have multiple positions, then you can run the following R script:

  fileName <- "regions.list"
  conn <- file(fileName, open = "r")
  linn <- readLines(conn)

  for (i in 1:length(linn)) {
    stacked <- stackStringsFromBam("snps.bam", param = GRanges(linn[i]))
    write.table(
      stacked,
      paste0("Region", linn[i], ".txt"),
      quote = FALSE,
      col.names = F,
      row.names = F
    )
  }
  close(conn)

Some reference posts:

https://www.biostars.org/p/259690/

Sunday, July 17, 2022

Useful Blast Commands and one-liners

 # Blast command with additional details such as query length and subject length

blastn -db /storage/data/Storage/nt/nt/nt -query query.fa -outfmt '6 qseqid sseqid pident nident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen' -out query.fa_vs_nt_with_stitle_blastresults.txt -num_threads 48

# To get title of subject, For example: Enterobacter_aerogenes_strain_G7,_complete_genome 

blastn -db /storage/data/Storage/nt/nt/nt -query query.fa -outfmt '6 qseqid sseqid pident nident length mismatch gapopen qstart qend sstart send evalue bitscore stitle qlen slen' -out query.fa_vs_nt_with_stitle_blastresults.txt -num_threads 48

# Tophits

$ awk '!x[$1]++' query.fa_vs_nt_with_stitle_blastresults.txt

# Tophits with plasmid assignment

$ awk '!x[$1]++' query.fa_vs_nt_with_stitle_blastresults.txt | sed 's/ /_/g' | awk '{print $0 "\t" $5*100/$(NF-1)}' | column -t | grep 'plasmid'
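
For anyone who prefers Python over awk, the same top-hit logic can be sketched as below. It assumes the 16-column tab-separated layout from the stitle command above, and that BLAST lists the best hit for each query first (which is what the awk one-liner relies on too). The function names and the example rows are made up for illustration.

```python
def top_hits(lines):
    """Keep only the first hit per query, like awk '!x[$1]++'."""
    seen = set()
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")  # stitle may contain spaces, so split on tabs
        if fields[0] not in seen:
            seen.add(fields[0])
            hits.append(fields)
    return hits


def query_coverage(fields):
    """Alignment length as a percentage of query length ($5*100/$(NF-1))."""
    return int(fields[4]) * 100 / int(fields[-2])


# Made-up rows in the column order:
# qseqid sseqid pident nident length mismatch gapopen qstart qend
# sstart send evalue bitscore stitle qlen slen
example_rows = [
    "q1\ts1\t99.0\t990\t1000\t10\t0\t1\t1000\t5\t1004\t0.0\t1800\tExample plasmid pX\t2000\t50000",
    "q1\ts2\t95.0\t475\t500\t25\t0\t1\t500\t9\t508\t1e-90\t800\tExample chromosome\t2000\t4000000",
    "q2\ts3\t98.0\t294\t300\t6\t0\t1\t300\t7\t306\t1e-80\t500\tExample chromosome\t300\t4000000",
]

hits = top_hits(example_rows)
for fields in hits:
    if "plasmid" in fields[13]:
        print(fields[0], fields[13], f"{query_coverage(fields):.1f}")
```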

Thursday, July 14, 2022

Running MLST tool on individual contigs in fasta vs whole genome assembly fasta

This is more of an observation than a fix.

When the MLST tool was run on the assembled contigs as a single file, it assigned the species as ecoli and the sequence type as ST 448.

But when the assembly was split into one file per contig, only one contig was assigned a species ("senterica"), with no ST. The remaining contigs received neither a species nor an ST.

It appears that running all the contigs together gives the MLST tool much more to work with than any single contig. This makes sense if the loci of the typing scheme are spread across different contigs, since an ST can only be called when all of them are found.




Tuesday, June 28, 2022

Contigs with no read coverage (zero reads mapped) after mapping using Bowtie2

I recently mapped reads using Bowtie2 against viral genome segments to find the read counts for each of the segments. To my surprise, I found contigs with no reads mapped on them.

That is when I learned that Bowtie2 performs end-to-end read alignment by default, meaning that every base of a read has to take part in the alignment. Assemblers, however, typically build contigs by breaking reads into k-mers and stitching those back together. In this case I used the MEGAHIT metagenome assembler, which by default uses k-mer sizes of 21, 41, 61, 81 and 99. So aligning reads in end-to-end mode may sometimes fail, resulting in zero read counts, that is, no read coverage, for a particular contig.

In the figure below, PB2 and NP (colored red) have no reads mapped in end-to-end mode, even though the assembler produced those contigs. This most likely happens because of the k-mer approach that assemblers use to build contigs. In such cases, changing the setting from end-to-end to local makes more sense.

In local alignment, bases at either end of a read that do not take part in the alignment are omitted ("soft trimmed" or "soft clipped"). That is why PB2 and NP now have read counts. For the other segments, the read counts appeared to increase as well.

So, in this particular case, local alignment of the reads with Bowtie2 makes more sense.
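
Switching modes only takes one flag: Bowtie2 runs in --end-to-end mode unless --local is given. Below is a small sketch that builds the command so both modes can be compared on the same data. The index prefix and file names are placeholders I made up, and the helper name bowtie2_command is my own.

```python
import subprocess


def bowtie2_command(index_prefix, reads_1, reads_2, output_sam,
                    local=True, threads=4):
    """Build a paired-end Bowtie2 command in local or end-to-end mode."""
    mode = "--local" if local else "--end-to-end"
    return ["bowtie2", mode, "-p", str(threads),
            "-x", index_prefix,
            "-1", reads_1, "-2", reads_2,
            "-S", output_sam]


if __name__ == "__main__":
    # Placeholder paths; replace them with your own index and fastq files.
    command = bowtie2_command("viral_segments", "sample_1.fastq",
                              "sample_2.fastq", "sample_local.sam")
    print(" ".join(command))
    # Uncomment to actually run the alignment (requires bowtie2 on PATH):
    # subprocess.run(command, check=True)
```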




Tuesday, June 21, 2022

Downloading viral genome database for viral read classification - Metagenomics

  1.  Kraken2 Metagenomic Virus Database - redirects to globus - does not have an option to download

  2. Default kraken2 command:
    • kraken2-build --download-library viral --db $DBNAME

      Results in the error - adding --use-ftp did not help
      rsync: getaddrinfo: ftp.ncbi.nlm.nih.gov 873: Name or service not known
      rsync error: error in socket IO (code 10) at clientserver.c(127) [Receiver=3.1.3]
      Error downloading assembly summary file for viral, exiting.


    • Tried changing code in specific scripts based on this error thread - still no luck! :(

  3. A python script that helps with updating the kraken databases :
    • error - FileNotFoundError: [Errno 2] No such file or directory: 'assembly_summary_refseq.txt'

  4. Downloaded the NCBI Viral database from here - have not tested it - but the fasta sequences would need to be converted to the kraken2 database format.

  5. Downloaded Viral RefSeq database from a PeerJ paper - worked!
    • Pre-compiled databases
    • Looks comprehensive! Size is big (6.6 gb)
    • Could classify the viral reads using kraken2 command

Thursday, June 16, 2022

Resolved-How to load an HTML page on Github as a normal HTML page we see in browser?

GitHub is all about code, and there is no built-in way to render a .html file uploaded to a GitHub repo as a web page. Clicking on the .html document only displays its HTML source.

A simple way around this is to prepend the following prefix to the file's actual link.

For example, 

my actual html file is located at:

https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.html

To render it as an HTML page, we should prepend:

https://htmlpreview.github.io/?

Finally, my link in the browser should look like this:

https://htmlpreview.github.io/?https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.html
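
If there are many such links to convert, the prefixing can be scripted. A tiny sketch is below; the function name preview_url is my own, and it simply puts the htmlpreview prefix in front of the GitHub link.

```python
PREVIEW_PREFIX = "https://htmlpreview.github.io/?"


def preview_url(github_url):
    """Return the htmlpreview link for an HTML file hosted on GitHub."""
    if github_url.startswith(PREVIEW_PREFIX):
        return github_url  # already a preview link, leave it alone
    return PREVIEW_PREFIX + github_url


print(preview_url("https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.html"))
```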



Monday, April 25, 2022

BRIG genome visualization missing rings - Solved!

Problem: 

Missing rings for the query sequences and BRIG doesn't give any error message.

Solution that worked for me:

This is an issue with the Java version. I was running Java 1.8 and had the same problem as described in the question linked at the end of this post. I followed the answers there, and changing the Java version finally worked for me.

I ran the following steps

1) Go to link: https://www.oracle.com/java/technologies/javase-java-archive-javase6-downloads.html 

2) mkdir Java1.6

3) Download "jdk-6u45-linux-x64.bin" and save it in Java1.6

4) cd Java1.6 && chmod +x jdk-6u45-linux-x64.bin && ./jdk-6u45-linux-x64.bin && cd .. (running the .bin file unpacks the JDK into Java1.6/jdk1.6.0_45)

5) You should now see the java executable in Java1.6/jdk1.6.0_45/bin

6) Java1.6/jdk1.6.0_45/bin/java -version - should give "1.6.0_45"

7) Open BRIG using Java1.6 as follows:

Java1.6/jdk1.6.0_45/bin/java -jar BRIG.jar

Note: I did not uninstall my original Java 1.8. Instead, I called the downloaded Java 1.6 executable by its path. That way I keep both versions without any problem.

This gave me all the rings without any issue!! Hope this helps someone!


https://www.researchgate.net/post/Why_does_BRIG_fail_to_create_rings_for_some_of_the_genomes

I posted my answer on ResearchGate; click the link above to see the discussion.

Sunday, April 24, 2022

SRST2 requires a specific naming convention for input fastq files

 SRST2 Trouble installing in server

 SRST2 has been around for a while (it was last updated in 2015), and I had trouble running it on the server because of dependency issues with the bowtie2 tool (libstd, libtbb).

 

SRST2 Trouble with specific versions of bowtie

 

Also, SRST2 only works with specific versions of bowtie2 (between 2.1.0 and 2.2.9, whereas the latest is 2.4.1), which caused problems on the server.

 

SRST2 installation

 

Finally, I installed SRST2 using conda: https://anaconda.org/bioconda/srst2

 

SRST2 requires a specific naming convention

 

Another issue with SRST2 is file naming: it expects input names of the form _S1_L001_R1_001.fastq.gz.


For samples whose files do not follow the "_S1_L001_R1_001.fastq" convention, I created a dummy symlink whose name does.


For example, "filename_L001_R1_001.fastq" does not follow the SRST2 input convention, so I created a symlink named filename_S1_L001_R1_001.fastq that points to it.
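
With many samples, creating these symlinks by hand gets tedious, so here is a rough Python sketch of the renaming rule. The function names and regular expressions are my own guess at the pattern described above (insert "_S1" before "_L001" when the sample-number part is missing); check them against your own file names before linking anything.

```python
import os
import re

# Names that already look like sample_S1_L001_R1_001.fastq(.gz)
HAS_SAMPLE_NUMBER = re.compile(r"_S\d+_L\d{3}_R[12]_001\.fastq(\.gz)?$")
# Names that only lack the _S<number> part
LACKS_SAMPLE_NUMBER = re.compile(r"^(.+?)(_L\d{3}_R[12]_001\.fastq(?:\.gz)?)$")


def srst2_style_name(filename, sample_number=1):
    """Return a file name in the _S1_L001_R1_001.fastq style."""
    if HAS_SAMPLE_NUMBER.search(filename):
        return filename
    match = LACKS_SAMPLE_NUMBER.match(filename)
    if match is None:
        raise ValueError(f"cannot derive an SRST2-style name for {filename}")
    return f"{match.group(1)}_S{sample_number}{match.group(2)}"


def link_with_srst2_name(path):
    """Create a symlink with the SRST2-style name next to the original file."""
    directory, filename = os.path.split(path)
    new_path = os.path.join(directory, srst2_style_name(filename))
    if new_path != path and not os.path.lexists(new_path):
        os.symlink(os.path.abspath(path), new_path)
    return new_path


print(srst2_style_name("filename_L001_R1_001.fastq"))
```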

 


Monday, March 21, 2022

Nextflow: Missing 'bind' declaration in input parameter - resolved

$ time nextflow test4_20200322.nf

N E X T F L O W  ~  version 21.04.3

Launching `test4_20200322.nf` [happy_leakey] - revision: d7d5d88709

reads: /data01/nextflow_test/2022_test/sample_data/*_{1,2}.fastq

Missing 'bind' declaration in input parameter

 -- Check script 'test4_20200322.nf' at line: 9 or see '.nextflow.log' file for more details


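Fix: the script was most likely written in DSL2 syntax, while this Nextflow version (21.04.3) defaults to DSL1. Pasting this line as the 2nd line of the script resolved the error: nextflow.enable.dsl=2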




Tuesday, February 15, 2022

Tips for uploading data to NCBI SRA

 ### Case: 

Say I have 20 bacterial samples. Then I have 20 x 2 = 40 fastq files of Illumina paired-end data and 20 x 1 = 20 fastq files of Nanopore data to upload.

1) Create a Bioproject (Get the Bioproject accession)

2) Create a Biosample (use the Bioproject accession to fill in the Microbe template; the template depends on the organism, bacterial in this case, and can only be downloaded from NCBI while going through the submission steps)

3) For Illumina data 

   - Create a SRA submission 1

   - Need to fill the metadata template using the SAMN numbers generated from Biosample submission.

   - Upload the data either using aspera or FTP.

4) For Nanopore data 

   - Create a SRA submission 2

   - Need to fill the metadata template using the SAMN numbers generated from Biosample submission.

   - Upload the data either using aspera or FTP.



# Pointers:

1. Always best to create a bioproject first and get PRJNA number

2. Then create a BioSample project - in the Microbe1.0 excel provide the Bioproject (PRJNA) number

3. Then do the SRA submission.

4. For the SRA submission, if you have two types of data for the same sample, such as Illumina and Nanopore sequences, list the Illumina information first in the SRA_meta_acc file, followed by the Nanopore details.

5. The SAMN number (Biosample number) will be the same for both the Illumina and the Nanopore data.

6. If you have more than 1000 samples, a single submission is not possible in NCBI, so you have to split the data accordingly.

7. The aspera command line seems a bit confusing to use; FTP is more straightforward. You just need the username and password provided during the SRA submission.

8. An advantage of FTP: even if the internet connection drops, we can resume the transfer from where it stopped.

9. We cannot upload half of our data with FTP and the other half with aspera. It is either aspera or FTP.

Wednesday, January 12, 2022

Extract accessory genes as fasta files from the roary output for downstream functional characterization


1. First generate the locus_tag of the accessory genes (output file: pan_genome_results)
query_pan_genome -a complement -g clustered_proteins *.gff
mv pan_genome_results accessory_genome_locus_tags.list
2. Use the locus_tags of the accessory genes to extract the sequences from prokka folder
fgrep 'locus_tag=' *.gff >Combined_local_gffs.txt # search 'locus_tag' in all gff file and combine - for speeding purpose

time for locus_tag in $(awk '{print $2}' accessory_genome_locus_tags.list); do 
file=`fgrep -m1 "$locus_tag" Combined_local_gffs.txt | awk -F':' '{print $1}' | sed 's/.gff//g'`; # extract the samplename (foldername) where the gene is present
awk -v var="$locus_tag" 'BEGIN {RS=">"} $0~var {print ">"$0}' Prokka/MySampleFolders/"$file"_prokka_out/"$file".ffn >accessory_gene_representatives/"$locus_tag".fasta;  
done 

These genes can then be combined and functionally characterised in downstream analysis.

Extracting 5477 genes this way took me around 19 minutes.
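
The awk step that pulls one record out of a .ffn file (splitting on ">") can also be sketched in Python. Note one difference: the awk command matches the locus tag anywhere in the record, whereas this sketch matches the sequence ID exactly, which avoids picking up tags that merely contain the query as a prefix. I am assuming here that the first word of each header is the locus tag; the function name and the toy records are made up.

```python
def extract_records(fasta_text, locus_tag):
    """Return the FASTA records whose ID (first header word) is locus_tag."""
    records = []
    for chunk in fasta_text.split(">"):
        if not chunk.strip():
            continue  # skip the empty piece before the first '>'
        header = chunk.split("\n", 1)[0]
        if header.split()[0] == locus_tag:
            records.append(">" + chunk.rstrip("\n") + "\n")
    return records


# Toy input standing in for a Prokka .ffn file.
toy_ffn = (
    ">ABC_00001 hypothetical protein\nATGAAA\nTTTTAA\n"
    ">ABC_000010 transposase\nATGCCCTAA\n"
    ">ABC_00002 hypothetical protein\nATGGGGTAA\n"
)

matches = extract_records(toy_ffn, "ABC_00001")
print("".join(matches), end="")
```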