$ grep '>' NCBI_PlasmidCluster_TophitsSeqs.fasta
>NZ_CM008903.1_Enterobacter_cloacae_strain_22ES_plasmid_p22ES-4970,_whole_genome_shotgun_sequence
>NZ_CM004622.1_Escherichia_coli_strain_Ec47VL_plasmid_pEC47a,_whole_genome_shotgun_sequence
>KC355363.1_Aeromonas_hydrophila_strain_AH1_plasmid_pN6,_partial_sequence
>HQ651092.1_Klebsiella_oxytoca_plasmid_pFP10-1_replication_protein_gene,_partial_cds;_hypothetical_protein_(pKP048_08),_antirestriction_protein_KlcA_(klcA),_hypothetical_protein_(pKP048_09),_transcriptional_repressor_protein_KorC_(korC),_putative_ISKpn6-like_transposase_(tnpA),_and_carbapenemase_KPC-2_(blaKPC-2)_genes,_complete_cds;_and_transposon_Tn3_truncated_TEM-beta-lactamase_(bla)_and_ISKpn8_transposase_(tnpA)_genes,_complete_cds,_insertion_sequence_ISKpn8,_complete_sequence,_and_TnpR_(tnpR)_gene,_partial_cds

$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids

Building a new DB, current time: 09/24/2018 11:02:13
New DB name: NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title: NCBI_PlasmidCluster_TophitsSeqs.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum size: 1000000000B
CFastaReader: Near line 1, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.
CFastaReader: Near line 74, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.
CFastaReader: Near line 132, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.
CFastaReader: Near line 298, the sequence id string contains 'comma' symbol, which has been replaced with 'underscore' symbol. Please correct the sequence id string.
Adding sequences from FASTA; added 4 sequences in 0.00256801 seconds.
Segmentation fault (core dumped)

I tried the following:
1) Remove the commas
$ sed 's/,/_/g' NCBI_PlasmidCluster_TophitsSeqs.fasta >NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta
$ grep '>' NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta
>NZ_CM008903.1_Enterobacter_cloacae_strain_22ES_plasmid_p22ES-4970__whole_genome_shotgun_sequence
>NZ_CM004622.1_Escherichia_coli_strain_Ec47VL_plasmid_pEC47a__whole_genome_shotgun_sequence
>KC355363.1_Aeromonas_hydrophila_strain_AH1_plasmid_pN6__partial_sequence
>HQ651092.1_Klebsiella_oxytoca_plasmid_pFP10-1_replication_protein_gene__partial_cds;_hypothetical_protein_(pKP048_08)__antirestriction_protein_KlcA_(klcA)__hypothetical_protein_(pKP048_09)__transcriptional_repressor_protein_KorC_(korC)__putative_ISKpn6-like_transposase_(tnpA)__and_carbapenemase_KPC-2_(blaKPC-2)_genes__complete_cds;_and_transposon_Tn3_truncated_TEM-beta-lactamase_(bla)_and_ISKpn8_transposase_(tnpA)_genes__complete_cds__insertion_sequence_ISKpn8__complete_sequence__and_TnpR_(tnpR)_gene__partial_cds

$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids

Building a new DB, current time: 09/24/2018 11:08:16
New DB name: NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title: NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum size: 1000000000B
Adding sequences from FASTA; added 4 sequences in 0.00210285 seconds.
Segmentation fault (core dumped)

2) Now let's try removing the brackets from the FASTA headers
$ sed 's/[()]/_/g' NCBI_PlasmidCluster_TophitsSeqs_1NoCommas.fasta >NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta
$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids

Building a new DB, current time: 09/24/2018 11:11:39
New DB name: NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title: NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum size: 1000000000B
Adding sequences from FASTA; added 4 sequences in 0.00201416 seconds.
Segmentation fault (core dumped)

3) Shorten the header. This worked!
$ cat NCBI_PlasmidCluster_TophitsSeqs_2NoBrackets.fasta | cut -f1 -d ";" >NCBI_PlasmidCluster_TophitsSeqs_3_Shortened.fasta
$ makeblastdb -in NCBI_PlasmidCluster_TophitsSeqs_3_Shortened.fasta -dbtype nucl -out NCBI_PlasmidCluster_TophitsSeqs.fasta.DB -parse_seqids

Building a new DB, current time: 09/24/2018 11:19:48
New DB name: NCBI_PlasmidCluster_TophitsSeqs.fasta.DB
New DB title: NCBI_PlasmidCluster_TophitsSeqs_3_Shortened.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum size: 1000000000B
Adding sequences from FASTA; added 4 sequences in 0.00199008 seconds.

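The three clean-up steps above could probably be folded into a single pass. The sketch below is one way to do that; the function name clean_headers and the output file name in the usage comment are made up for illustration, and the combined version has not been run against makeblastdb here. Unlike the commands above, it only touches lines starting with ">", so sequence lines are left alone.

```shell
#!/bin/sh
# Hypothetical helper: apply all three header fixes in one pass.
# Only lines starting with '>' are modified.
clean_headers() {
    sed -e '/^>/ s/[,()]/_/g' \
        -e '/^>/ s/;.*$//' "$1"
    # first expression: commas and brackets become underscores
    # second expression: everything from the first ';' onwards is dropped
}

# Usage (output file name is an assumption):
#   clean_headers NCBI_PlasmidCluster_TophitsSeqs.fasta > NCBI_PlasmidCluster_TophitsSeqs_clean.fasta
```

Restricting the edits to header lines is a small safety measure: nucleotide sequence lines should never contain these characters anyway, but this way they are guaranteed to pass through unchanged.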
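It appears that makeblastdb has a limit on the length of FASTA headers, at least when run with -parse_seqids as above. Since these headers contain no spaces, the whole header line is likely being read as the sequence ID. Worth keeping in mind!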
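To spot overly long headers before running makeblastdb, something like the following could help. It is only a sketch: the function name header_lengths is made up, and no particular length threshold is assumed, since the exact limit was not determined here. It prints the length of each header (without the leading ">") followed by the first 60 characters of the header, longest first.

```shell
#!/bin/sh
# Hypothetical helper: list header lengths in a FASTA file, longest first.
header_lengths() {
    awk '/^>/ { printf "%d\t%s\n", length($0) - 1, substr($0, 2, 60) }' "$1" |
        sort -rn
}

# Usage:
#   header_lengths NCBI_PlasmidCluster_TophitsSeqs.fasta
```

With the file above, the HQ651092.1 entry should stand out at the top of the list, as it is several times longer than the other three headers.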