Sunday, December 19, 2021

Add the file name after ">" in a multi-FASTA file

More solutions here: https://www.biostars.org/p/435813/


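awk '/^>/{sub(/^>/,"&"FILENAME"_");sub(/\.fasta/,x)}1' assembly.fasta | sed 's/ /_/g' | grep '>'

Here x is an unset awk variable (an empty string), so the second sub() strips the .fasta extension from the header. The final grep prints only the renamed headers as a preview.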
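To keep complete records instead of a header preview, a variant along the following lines may be useful. It is only a sketch: the fasta_demo directory, the sample1.fasta test file and the .renamed.fa output suffix are made-up names for illustration, and the prefix is passed in with awk -v rather than FILENAME so that directory components do not end up in the headers.

```shell
# Hypothetical demo data: a tiny two-record FASTA file
mkdir -p fasta_demo
printf '>contig1 len=8\nACGTACGT\n>contig2 len=4\nGGCC\n' > fasta_demo/sample1.fasta

# For every .fasta file, prefix each header with the file name (without the
# .fasta extension), replace spaces in header lines only, and write the full
# records to a new file next to the input.
for f in fasta_demo/*.fasta; do
    prefix=$(basename "$f" .fasta)
    awk -v p="$prefix" '/^>/ { sub(/^>/, ">" p "_"); gsub(/ /, "_") } 1' "$f" \
        > "fasta_demo/${prefix}.renamed.fa"
done

cat fasta_demo/sample1.renamed.fa
```

With the demo file, the first header becomes >sample1_contig1_len=8 and the sequence lines are left untouched.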

Wednesday, December 15, 2021

Downloading raw FASTQ data from the European Nucleotide Archive (ENA) using ERR IDs

Step 1: Paste the comma-separated ERR IDs into the address bar like this:

https://www.ebi.ac.uk/ena/browser/view/ERR3417384,ERR3418576,ERR3418577,ERR3307235?show=reads
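
If the run IDs are kept in a text file, the URL can also be assembled on the command line. A small sketch, assuming a hypothetical file ids.txt with one ERR ID per line (the IDs are the ones from the example URL):

```shell
# Hypothetical input file: one run accession per line
printf '%s\n' ERR3417384 ERR3418576 ERR3418577 ERR3307235 > ids.txt

# Join the lines with commas and wrap them in the ENA browser URL
ids=$(paste -sd, ids.txt)
url="https://www.ebi.ac.uk/ena/browser/view/${ids}?show=reads"
echo "$url"
```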


Step 2: The page then lists the runs for these IDs.

Step 3: Click Download XML file. This saved ena_data_20211215-0958.xml to my desktop.

Step 4: In the Linux terminal, extract the file paths from the XML and turn them into download links:

$ grep -F 'filename="run' ena_data_20211215-0958.xml |
sed -e 's/.*run/ftp.sra.ebi.ac.uk\/vol1\/run/' -e 's/".*//' > downloadlinks.list
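
To see what the two sed expressions do, the pipeline can be tried on a single line. The sample line below is an assumption about what a matching XML line looks like (the FILE element and the filetype attribute are illustrative, not copied from the real download); only the filename="run/... part is implied by the grep pattern.

```shell
# Hypothetical XML line; only the filename="run/..." attribute matters here
line='<FILE filename="run/ERR329/ERR3297045/1_P_PA_1.fastq.gz" filetype="fastq"/>'

# 1st expression: replace everything up to "run" with the FTP host and path
# 2nd expression: cut the line at the closing double quote
link=$(printf '%s\n' "$line" |
    grep -F 'filename="run' |
    sed -e 's/.*run/ftp.sra.ebi.ac.uk\/vol1\/run/' -e 's/".*//')

echo "$link"
```

For this sample line the result is ftp.sra.ebi.ac.uk/vol1/run/ERR329/ERR3297045/1_P_PA_1.fastq.gz.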


Step 5: Loop through the links and download each file with wget:

$ time for d in $(cat downloadlinks.list); do echo "$d"; wget "$d"; done
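
A slightly more defensive form of this loop is sketched below. It reads the list line by line and passes wget's --continue, --tries and --timeout options, so that a rerun resumes partial files and retries are bounded. The function name download_list, the DRY_RUN switch, the demo_links.list file and the values 20 and 60 are arbitrary choices for illustration; the demo only prints the command and does not touch the network.

```shell
# download_list FILE: fetch every link listed in FILE (one per line).
# With DRY_RUN=1 the wget commands are printed instead of executed.
download_list() {
    while IFS= read -r link; do
        [ -n "$link" ] || continue    # skip empty lines
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "wget --continue --tries=20 --timeout=60 $link"
        else
            wget --continue --tries=20 --timeout=60 "$link"
        fi
    done < "$1"
}

# Demo on a hypothetical one-line list, printing the command only
printf '%s\n' 'ftp.sra.ebi.ac.uk/vol1/run/ERR329/ERR3297045/1_P_PA_1.fastq.gz' > demo_links.list
DRY_RUN=1 download_list demo_links.list
```

Calling download_list downloadlinks.list without DRY_RUN would run the real downloads.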

Step 6: However, wget was very slow and kept reconnecting, like this:

--2021-12-16 09:50:41--  (try:13)  http://ftp.sra.ebi.ac.uk/vol1/run/ERR329/ERR3297045/1_P_PA_1.fastq.gz
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 1842088178 (1.7G), 606664946 (579M) remaining [application/octet-stream]
Saving to: ‘1_P_PA_1.fastq.gz’

1_P_PA_1.fastq.gz                                     67%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++                                       ]   1.15G  --.-KB/s    in 15m 0s  

2021-12-16 10:05:43 (0.00 B/s) - Read error at byte 1235423232/1842088178 (Connection timed out). Retrying.

--2021-12-16 10:05:53--  (try:14)  http://ftp.sra.ebi.ac.uk/vol1/run/ERR329/ERR3297045/1_P_PA_1.fastq.gz
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 1842088178 (1.7G), 606664946 (579M) remaining [application/octet-stream]
Saving to: ‘1_P_PA_1.fastq.gz’

1_P_PA_1.fastq.gz                                     67%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++                                       ]   1.15G  --.-KB/s    in 15m 0s  

2021-12-16 10:20:54 (0.00 B/s) - Read error at byte 1235423232/1842088178 (Connection timed out). Retrying.

--2021-12-16 10:21:04--  (try:15)  http://ftp.sra.ebi.ac.uk/vol1/run/ERR329/ERR3297045/1_P_PA_1.fastq.gz
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 1842088178 (1.7G), 606664946 (579M) remaining [application/octet-stream]
Saving to: ‘1_P_PA_1.fastq.gz’

1_P_PA_1.fastq.gz                                     81%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++================>                      ]   1.40G  --.-KB/s    in 26m 54s 

So I used the Aspera plugin (ascp) to download at a better speed.

Step 7: Locate asperaweb_id_dsa.openssh on Ubuntu. Its absolute path is required in the next step.

$ locate asperaweb_id_dsa.openssh
/home/user/.aspera/connect/etc/asperaweb_id_dsa.openssh
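
If locate is not installed or its database is out of date, find can do the same job. A minimal sketch, where the helper name find_aspera_key is made up and the demo builds a throwaway directory tree with an empty placeholder file instead of touching a real Aspera installation:

```shell
# find_aspera_key [DIR]: print the first asperaweb_id_dsa.openssh under DIR
# (DIR defaults to ~/.aspera)
find_aspera_key() {
    find "${1:-$HOME/.aspera}" -type f -name 'asperaweb_id_dsa.openssh' 2>/dev/null | head -n 1
}

# Demo with a placeholder file in a scratch directory
mkdir -p aspera_demo/connect/etc
: > aspera_demo/connect/etc/asperaweb_id_dsa.openssh
key=$(find_aspera_key aspera_demo)
echo "$key"
```

On a real system, calling find_aspera_key with no argument searches under ~/.aspera and, because the starting directory is absolute, prints an absolute path.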

Step 8: Parse downloadlinks.list to create a new list for the Aspera download, then loop over it with ascp:


$ sed 's/ftp\.sra\.ebi\.ac\.uk\/vol1\/run\///' downloadlinks.list > downloadlinks_ascp.list

$ time for d in $(cat downloadlinks_ascp.list); do echo "$d"; ascp -v -l 300m -P33001 -i /home/user/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/run/"$d" /ena/download/folder/; done
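
Before starting a long transfer it can help to print the ascp commands and check the paths first. The sketch below reuses the options from the loop above; the key path stored in key, the dest directory and the demo_* file names are placeholders to adapt, and nothing is downloaded.

```shell
# Placeholders: adjust to your own key location and target folder
key=/home/user/.aspera/connect/etc/asperaweb_id_dsa.openssh
dest=/ena/download/folder/

# Hypothetical one-line link list, in the format produced in Step 4
printf '%s\n' 'ftp.sra.ebi.ac.uk/vol1/run/ERR329/ERR3297045/1_P_PA_1.fastq.gz' > demo_links.list

# Strip the host and vol1/run/ prefix, as in Step 8
sed 's/ftp\.sra\.ebi\.ac\.uk\/vol1\/run\///' demo_links.list > demo_links_ascp.list

# Write one ascp command per file to a text file instead of running it
while IFS= read -r d; do
    echo "ascp -v -l 300m -P33001 -i $key era-fasp@fasp.sra.ebi.ac.uk:vol1/run/$d $dest"
done < demo_links_ascp.list > ascp_commands.txt

cat ascp_commands.txt
```

Once the printed commands look right, the same loop with echo removed performs the actual transfers.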