Sunday, September 23, 2018

makeblastdb - Segmentation fault (core dumped)

I had a set of sequences that I wanted to turn into an NCBI BLAST database, but makeblastdb (version 2.7.1+) crashed with the following error:


$ grep '>' NCBI_PlasmidCluster_TophitsSeqs.fasta
 
>NZ_CM008903.1_Enterobacter_cloacae_strain_22ES_plasmid_p22ES-4970,_whole_genome_shotgun_sequence
>NZ_CM004622.1_Escherichia_coli_strain_Ec47VL_plasmid_pEC47a,_whole_genome_shotgun_sequence
>KC355363.1_Aeromonas_hydrophila_strain_AH1_plasmid_pN6,_partial_sequence
>HQ651092.1_Klebsiella_oxytoca_plasmid_pFP10-1_replication_protein_gene,_partial_cds;_hypothetical_protein_(pKP048_08),_antirestriction_protein_KlcA_(klcA),_hypothetical_protein_(pKP048_09),_transcriptional_repressor_protein_KorC_(korC),_putative_ISKpn6-like_transposase_(tnpA),_and_carbapenemase_KPC-2_(blaKPC-2)_genes,_complete_cds;_and_transposon_Tn3_truncated_TEM-beta-lactamase_(bla)_and_ISKpn8_transposase_(tnpA)_genes,_complete_cds,_insertion_sequence_ISKpn8,_complete_sequence,_and_TnpR_(tnpR)_gene,_partial_cds


$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids

Building a new DB, current time: 09/24/2018 11:02:13
New DB name:   NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title:  NCBI_PlasmidCluster_TophitsSeqs.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum  size: 1000000000B


CFastaReader: Near line 1, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.
CFastaReader: Near line 74, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.
CFastaReader: Near line 132, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.
CFastaReader: Near line 298, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.

Adding sequences from FASTA; added 4 sequences in 0.00256801 seconds.

Segmentation fault (core dumped)

I tried the following:

1) Let's remove the commas


$ sed 's/,/_/g' NCBI_PlasmidCluster_TophitsSeqs.fasta >NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta

$ grep '>' NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta


>NZ_CM008903.1_Enterobacter_cloacae_strain_22ES_plasmid_p22ES-4970__whole_genome_shotgun_sequence
>NZ_CM004622.1_Escherichia_coli_strain_Ec47VL_plasmid_pEC47a__whole_genome_shotgun_sequence
>KC355363.1_Aeromonas_hydrophila_strain_AH1_plasmid_pN6__partial_sequence
>HQ651092.1_Klebsiella_oxytoca_plasmid_pFP10-1_replication_protein_gene__partial_cds;_hypothetical_protein_(pKP048_08)__antirestriction_protein_KlcA_(klcA)__hypothetical_protein_(pKP048_09)__transcriptional_repressor_protein_KorC_(korC)__putative_ISKpn6-like_transposase_(tnpA)__and_carbapenemase_KPC-2_(blaKPC-2)_genes__complete_cds;_and_transposon_Tn3_truncated_TEM-beta-lactamase_(bla)_and_ISKpn8_transposase_(tnpA)_genes__complete_cds__insertion_sequence_ISKpn8__complete_sequence__and_TnpR_(tnpR)_gene__partial_cds


$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids


Building a new DB, current time: 09/24/2018 11:08:16
New DB name:   NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title:  NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta
Sequence type: Nucleotide
Keep MBits: T

Maximum  size: 1000000000B
Adding sequences from FASTA; added 4 sequences in 0.00210285 seconds.

Segmentation fault (core dumped)

2) Now let's try removing the parentheses in the FASTA headers


$ sed 's/[()]/_/g' NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta >NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta


$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids


Building a new DB, current time: 09/24/2018 11:11:39

New DB name:   NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title:  NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum  size: 1000000000B 
Adding sequences from FASTA; added 4 sequences in 0.00201416 seconds.

Segmentation fault (core dumped)

3) Shorten the headers by cutting each one at the first ';'. This worked!


$ cut -f1 -d ";" NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta >NCBI_PlasmidCluster_TophitsSeqs_3_Shortened.fasta


$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs_3_Shortened.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids



Building a new DB, current time: 09/24/2018 11:19:48

New DB name:   NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title:  NCBI_PlasmidCluster_TophitsSeqs_3_Shortened.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum  size: 1000000000B

Adding sequences from FASTA; added 4 sequences in 0.00199008 seconds.
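
To see what actually changed between the failing and the working file, it may help to print the length of each header. The snippet below is only a rough sketch: header_lengths is a helper name I made up for illustration, and it simply counts the characters after the '>' on every header line.

```shell
# Hypothetical helper (not part of BLAST): print "<length><TAB><header>"
# for every FASTA header in the file given as $1, longest header first.
header_lengths() {
    grep '^>' "$1" \
        | sed 's/^>//' \
        | awk '{ print length($0) "\t" $0 }' \
        | sort -k1,1nr
}

# Example usage with the file names from this post:
# header_lengths NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta
# header_lengths NCBI_PlasmidCluster_TophitsSeqs_3_Shortened.fasta
```

Running it on both files should make the one very long Klebsiella oxytoca header stand out in the file that crashes.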


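It appears that makeblastdb (at least version 2.7.1+ with -parse_seqids) cannot cope with very long FASTA headers. These headers contain no spaces, so the whole header is read as the sequence ID, which is also why the comma warnings above complain about the "sequence id string". Worth taking note of!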
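
A possible alternative to cutting the headers is to keep only the accession as the sequence ID and leave the rest of the header as a description after a space. This is a sketch under assumptions: accession_as_id is a hypothetical helper, the pattern only covers accessions shaped like the ones in this file (NZ_CM008903.1, KC355363.1), and I have not tested whether it avoids the segmentation fault in 2.7.1+.

```shell
# Hypothetical helper: on header lines, replace the underscore that
# directly follows a versioned accession with a space, so that only
# the accession is the first (ID) token. Sequence lines are unchanged.
accession_as_id() {
    sed -E 's/^>(([A-Z]{2}_)?[A-Z]+[0-9]+\.[0-9]+)_/>\1 /' "$1"
}

# Example usage (the output file name is made up for illustration):
# accession_as_id NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta >NCBI_PlasmidCluster_TophitsSeqs_4_AccessionID.fasta
```

The resulting file could then be passed to the same makeblastdb command as in the steps above.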
