Thursday, September 27, 2018

Downloading fasta using efetch - CentOS


$ efetch -db nucleotide -id KJ413946.1 -format fasta


501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=KJ413946.1&rettype=fasta&retmode=text&edirect_os=linux&edirect=9.90&tool=edirect&email=datta@ttshbio'
Result of do_post http request is
$VAR1 = bless( {
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
                 '_rc' => 501,
                 '_headers' => bless( {
                                        'client-warning' => 'Internal response',
                                        'client-date' => 'Thu, 27 Sep 2018 06:42:06 GMT',
                                        'content-type' => 'text/plain',
                                        '::std_case' => {
                                                          'client-warning' => 'Client-Warning',
                                                          'client-date' => 'Client-Date'
                                                        }
                                      }, 'HTTP::Headers' ),
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_request' => bless( {
                                        '_content' => 'db=nucleotide&id=KJ413946.1&rettype=fasta&retmode=text&edirect_os=linux&edirect=9.90&tool=edirect&email=datta@ttshbio',
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')}, 'URI::https' ),
                                        '_headers' => bless( {
                                                               'user-agent' => 'libwww-perl/6.05',
                                                               'content-type' => 'application/x-www-form-urlencoded'
                                                             }, 'HTTP::Headers' ),
                                        '_method' => 'POST'
                                      }, 'HTTP::Request' )
               }, 'HTTP::Response' );

Installing the "perl-LWP-Protocol-https" package solved the problem!


$ sudo yum install perl-LWP-Protocol-https
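To confirm that Perl can actually load the module before re-running efetch, a quick check along these lines may help. This is only a minimal sketch: the function name has_perl_module is my own invention and is not part of EDirect or LWP.

```shell
# has_perl_module is a hypothetical helper name, not part of EDirect.
# It asks perl to load the given module and reports success via exit status.
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if has_perl_module LWP::Protocol::https; then
    echo "LWP::Protocol::https is installed"
else
    echo "LWP::Protocol::https is missing; try: sudo yum install perl-LWP-Protocol-https"
fi
```

If perl itself is not installed, the check also reports the module as missing.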


$ efetch -db nucleotide -id KJ413946.1 -format fasta
>KJ413946.1 Escherichia coli strain ECS01 plasmid pNDM-ECS01, complete sequence
ATGGCAGAGGAAAGCAAACAGCTAACCAAACGGCAACAAAAAGCCATTGATACAGCGGCGTTAATCCGGC
AGGAGCCGCCGCAGGGTGAAGATATGGCATTCACCCACTCCATTCTGTGCCAGGTCGGTTTGCCCCGTTC
TAAGGTGGCAGGGCGTGAGTTTATGCGCCGTTCTGGTGATGCCTGGCTCGTCGTACAGGCAGGCTGGATT
GATGAAGGCAGTGGCCCGGTAGAGCAGCCTTTACCCTATGGCGCTATGCCGCGACTCACGTTCGCCTGGA
TTTCATCGTATGCACTGCGCAACAAAACGCGGGAAATCGCCATCGGCCACAGCGCTAATGAGTTTCTTCA
CCTTATGGGGATGGACTCACAGGGAACCCGTCATAAAACGCTGCGTACACAAATGCAGGCGCTGGCCGCG
TGTCGTTTGCAGCTGGGCTTTAAGGGCC

To download multiple records directly from NCBI with "efetch", given a list of accessions, and use each accession as the file name:

$ time for d in $(cat Plasmid.list); do echo "$d"; efetch -db nucleotide -id "$d" -format fasta > "$d.fasta"; done  ## takes considerably longer
 
real    6m47.812s
user    0m43.515s
sys    0m7.811s  
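The one-liner above keeps going silently if a download fails and leaves an empty file behind. A slightly more defensive version could look like the sketch below. The names fetch_one and fetch_list are hypothetical, efetch (EDirect) is assumed to be on PATH, and I am assuming that a failed download shows up as a non-zero exit status or an empty output file, which I have not verified for every EDirect version.

```shell
# fetch_one and fetch_list are hypothetical helper names for this sketch.
# Assumption: efetch (EDirect) is on PATH.
fetch_one() {
    # </dev/null keeps efetch from consuming the accession list on stdin
    efetch -db nucleotide -id "$1" -format fasta < /dev/null
}

fetch_list() {
    # $1: text file with one accession per line
    while IFS= read -r acc || [ -n "$acc" ]; do
        case $acc in ''|'#'*) continue ;; esac    # skip blank and comment lines
        if fetch_one "$acc" > "$acc.fasta" && [ -s "$acc.fasta" ]; then
            echo "ok: $acc"
        else
            echo "failed: $acc" >&2
            rm -f "$acc.fasta"                    # do not keep an empty file
        fi
    done < "$1"
}
```

After defining the functions, `fetch_list Plasmid.list` would write one accession.fasta file per line of the list into the current directory.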

To extract multiple records from a local NT/NR database with "blastdbcmd", given a list of accessions, and use each accession as the file name:

$ time for d in $(cat Plasmid.list); do echo "$d"; blastdbcmd -db nt -entry "$d" > "$d.fasta"; done

real    0m26.736s
user    0m7.004s
sys    0m4.384s  

Notice the advantage of having a local copy of the database: extraction took only about 27 seconds, compared with nearly 7 minutes for efetch!
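blastdbcmd can also read the whole accession list in one call through its -entry_batch option, which avoids starting the program once per record. The combined output then needs to be split into one file per record. The sketch below shows one way to do that with awk. The name split_fasta is hypothetical, and it assumes that the first word of each header is the accession and is safe to use as a file name (no "|" or "/" characters).

```shell
# split_fasta is a hypothetical helper name for this sketch.
# $1: multi-record FASTA file, $2: existing output directory
split_fasta() {
    awk -v dir="$2" '
        /^>/ {
            if (out != "") close(out)             # avoid running out of file handles
            out = dir "/" substr($1, 2) ".fasta"  # first word of header, minus ">"
        }
        out != "" { print > out }
    ' "$1"
}

# Intended use (assumes BLAST+ and a local nt database are available):
#   blastdbcmd -db nt -entry_batch Plasmid.list -out all.fasta
#   split_fasta all.fasta .
```

I have not timed this variant against the loop above, so treat any speed-up as something to measure rather than a given.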

Wednesday, September 26, 2018

Cannot find xml2-config - R Studio - CentOS

I came across this error while installing the CINNA package in RStudio on CentOS.


checking for xml2-config... no

Cannot find xml2-config
ERROR: configuration failed for package XML
* removing /home/datta/R/x86_64-redhat-linux-gnu-library/3.4/XML
Warning in install.packages :
  installation of package XML had non-zero exit status

In a terminal:


$ sudo yum install libxml2-devel

Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile 
 
 * base: centos.netonboard.com
 * epel: mirror.premi.st
 * extras: mirror.myren.net.my
 * ius: hkg.mirror.rackspace.com
 * nux-dextop: mirror.li.nux.ro
 * updates: centos.ipserverone.com

Resolving Dependencies
--> Running transaction check
---> Package libxml2-devel.x86_64 0:2.9.1-6.el7_2.3 will be installed
--> Finished Dependency Resolution

Dependencies Resolved
...................

Installed:
  libxml2-devel.x86_64 0:2.9.1-6.el7_2.3
Complete!

Then in R:

install.packages("XML")

This worked without error! 
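The XML package's configure step looks for the xml2-config program, which the libxml2-devel package provides. A quick check from the shell shows whether it is on PATH before retrying the install in R. This is a minimal sketch, and have_cmd is a hypothetical helper name.

```shell
# have_cmd is a hypothetical helper name for this sketch.
# It succeeds when the named program can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd xml2-config; then
    echo "xml2-config found: $(xml2-config --version)"
else
    echo "xml2-config not found; on CentOS try: sudo yum install libxml2-devel"
fi
```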

Sunday, September 23, 2018

makeblastdb - Segmentation fault (core dumped)

I had a set of sequences that I wanted to use as an NCBI BLAST database, but makeblastdb (version 2.7.1+) threw the following error:


$ grep '>' NCBI_PlasmidCluster_TophitsSeqs.fasta
 
>NZ_CM008903.1_Enterobacter_cloacae_strain_22ES_plasmid_p22ES-4970,_whole_genome_shotgun_sequence
>NZ_CM004622.1_Escherichia_coli_strain_Ec47VL_plasmid_pEC47a,_whole_genome_shotgun_sequence
>KC355363.1_Aeromonas_hydrophila_strain_AH1_plasmid_pN6,_partial_sequence
>HQ651092.1_Klebsiella_oxytoca_plasmid_pFP10-1_replication_protein_gene,_partial_cds;_hypothetical_protein_(pKP048_08),_antirestriction_protein_KlcA_(klcA),_hypothetical_protein_(pKP048_09),_transcriptional_repressor_protein_KorC_(korC),_putative_ISKpn6-like_transposase_(tnpA),_and_carbapenemase_KPC-2_(blaKPC-2)_genes,_complete_cds;_and_transposon_Tn3_truncated_TEM-beta-lactamase_(bla)_and_ISKpn8_transposase_(tnpA)_genes,_complete_cds,_insertion_sequence_ISKpn8,_complete_sequence,_and_TnpR_(tnpR)_gene,_partial_cds


$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids

Building a new DB, current time: 09/24/2018 11:02:13
New DB name:   NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title:  NCBI_PlasmidCluster_TophitsSeqs.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum  size: 1000000000B


CFastaReader: Near line 1, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.
CFastaReader: Near line 74, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.
CFastaReader: Near line 132, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.
CFastaReader: Near line 298, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.

Adding sequences from FASTA; added 4 sequences in 0.00256801 seconds.

Segmentation fault (core dumped)

I tried the following:

1) Replace the commas with underscores


$ sed 's/,/_/g' NCBI_PlasmidCluster_TophitsSeqs.fasta >NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta

$ grep '>' NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta


>NZ_CM008903.1_Enterobacter_cloacae_strain_22ES_plasmid_p22ES-4970__whole_genome_shotgun_sequence
>NZ_CM004622.1_Escherichia_coli_strain_Ec47VL_plasmid_pEC47a__whole_genome_shotgun_sequence
>KC355363.1_Aeromonas_hydrophila_strain_AH1_plasmid_pN6__partial_sequence
>HQ651092.1_Klebsiella_oxytoca_plasmid_pFP10-1_replication_protein_gene__partial_cds;_hypothetical_protein_(pKP048_08)__antirestriction_protein_KlcA_(klcA)__hypothetical_protein_(pKP048_09)__transcriptional_repressor_protein_KorC_(korC)__putative_ISKpn6-like_transposase_(tnpA)__and_carbapenemase_KPC-2_(blaKPC-2)_genes__complete_cds;_and_transposon_Tn3_truncated_TEM-beta-lactamase_(bla)_and_ISKpn8_transposase_(tnpA)_genes__complete_cds__insertion_sequence_ISKpn8__complete_sequence__and_TnpR_(tnpR)_gene__partial_cds


$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids


Building a new DB, current time: 09/24/2018 11:08:16
New DB name:   NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title:  NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta
Sequence type: Nucleotide
Keep MBits: T

Maximum  size: 1000000000B
Adding sequences from FASTA; added 4 sequences in 0.00210285 seconds.

Segmentation fault (core dumped)

2) Now let's try replacing the parentheses in the FASTA headers


$ sed 's/[()]/_/g' NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta >NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta


$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids


Building a new DB, current time: 09/24/2018 11:11:39

New DB name:   NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title:  NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum  size: 1000000000B 
Adding sequences from FASTA; added 4 sequences in 0.00201416 seconds.

Segmentation fault (core dumped)

3) Shorten the header. This worked!


$ cut -f1 -d ";" NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta >NCBI_PlasmidCluster_TophitsSeqs_3_Shortened.fasta


$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs_3_Shortened.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids



Building a new DB, current time: 09/24/2018 11:19:48

New DB name:   NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title:  NCBI_PlasmidCluster_TophitsSeqs_3_Shortened.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum  size: 1000000000B

Adding sequences from FASTA; added 4 sequences in 0.00199008 seconds.


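It appears that makeblastdb (at least version 2.7.1+ with -parse_seqids) has a limit on the length of the sequence ID in a FASTA header. Because these headers contain no spaces, the whole header is treated as the ID, so the very long fourth header was the likely culprit. Worth taking note of!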
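To spot overly long IDs before running makeblastdb, and to shorten them without throwing away the description, something like the following sketch could be used. Both function names are hypothetical. shorten_ids assumes every header starts with an accession of the form NAME.VERSION followed by an underscore, as in the file above, and I have not tested its output against makeblastdb 2.7.1.

```shell
# id_lengths and shorten_ids are hypothetical helper names for this sketch.

# Print "<length><TAB><id>" for every header, longest first.
# The id is the header text up to the first whitespace, without ">".
id_lengths() {
    awk '/^>/ { id = substr($1, 2); print length(id) "\t" id }' "$1" | sort -rn
}

# Turn the underscore after the accession.version into a space, so the id
# becomes just the accession and the rest becomes the description.
# Assumes headers look like ">NAME.VERSION_rest_of_header".
shorten_ids() {
    sed -E 's/^(>[^.]+\.[0-9]+)_/\1 /' "$1"
}
```

For example, `id_lengths NCBI_PlasmidCluster_TophitsSeqs.fasta | head` would list the longest IDs first, and `shorten_ids in.fasta > out.fasta` would write the file with accession-only IDs.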

Wednesday, September 19, 2018

BLASTN Error: NCBI C++ Exception

I ran into this error this morning.


Error: NCBI C++ Exception:
    T0 "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_380670_130.14.18.6_9008__PrepareRelease_Linux64-Centos_1508370803/c++/compilers/unix/../../src/serial/objistrasnb.cpp", line 179: Error: ncbi::CObjectIStreamAsnBinary::UnexpectedTagClassByte() - byte 0: unexpected tag: application/constructed/11 (107), should be constructed/None (32) ( at Blast-def-line-set[])


I found that an incomplete or corrupted database causes this problem. I had two copies of the same NT database, so I ran md5sum on each file in both copies and found that the checksum of one file differed between them.

In the md5sum output, each line holds the checksum (the long string) followed by the file name. Note that the checksums of the two copies of the same file are not the same.

So I just copied the nt.00.nhr file over from the archive copy, and BLAST worked without trouble. If you do not have a good copy, you have to download the database again or run makeblastdb again on the FASTA sequences.
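The comparison of the two copies can be scripted. The sketch below is one way to do it, assuming both copies sit in flat directories with identical file names and that md5sum (GNU coreutils) is available. The function name compare_copies is hypothetical.

```shell
# compare_copies is a hypothetical helper name for this sketch.
# $1, $2: directories holding the two copies (assumed flat, same file names)
# Prints the checksum lines that differ; exit status 0 means every file matches.
compare_copies() {
    list_a=$(mktemp) || return 2
    list_b=$(mktemp) || return 2
    (cd "$1" && md5sum ./*) > "$list_a"
    (cd "$2" && md5sum ./*) > "$list_b"
    diff "$list_a" "$list_b"
    status=$?
    rm -f "$list_a" "$list_b"
    return "$status"
}
```

Calling `compare_copies /path/to/copy1 /path/to/copy2` (both paths are placeholders) prints only the files whose checksums disagree, which points straight at the file that needs to be replaced.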