Friday, August 3, 2018

Setting up AWS CLI and Downloading S3 Files using command line interface

1. Install the AWS Command Line Interface (CLI) using:

$ pip install awscli 


For more details, such as upgrading, see the AWS CLI installation documentation.

Note: if the installation was not successful, force a reinstall with pip's --force-reinstall option.

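2. Configuration

The next step is to configure the CLI. AWS requires the following items for this (the access key pair is created in the AWS console under your account's security credentials):
  1. Access key ID
  2. Secret access key
  3. Region name
  4. Valid output format (json, text, or table)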

Now, run the following command and answer the prompts:

$ aws configure
 
AWS Access Key ID [None]: <enter the access key>
AWS Secret Access Key [None]: <enter the secret access key>
Default region name [None]: <enter region>
Default output format [None]: <enter output format, e.g. text>


3. Copying files from AWS S3 folder to Local Desktop 

Once this is done, we can start using the AWS CLI for file operations. For example, to copy the file WBB3097.gz, which sits in the "qgen" bucket under the folder MIX7341, to my local desktop, I do the following:

$ aws s3 cp s3://<bucket>/<path to file>  <local destination>

$ aws s3 cp s3://qgen/MIX7341/WBB3097.gz ~/Desktop/AWS

4. List all files in AWS S3 folder 

Similarly, to list all the files in the MIX7341 folder, we can use the following command:


$ aws s3 ls s3://qgen/MIX7341/

2018-08-02 11:03:58  459407360 WBB3066.gz
2018-08-02 11:03:01  337848320 WBB3067.gz
2018-08-02 11:02:57  327219200 WBB3068.gz
2018-08-02 11:02:57  298680320 WBB3069.gz
2018-08-02 11:02:57  280514560 WBB3070.gz

5. Copy multiple files using BASH for loop

I had to download multiple files from my AWS account, and clicking on each file to download it was taking too long. So I listed the files (step 1 below) and used a bash "for loop" to download them (step 2 below).


$ aws s3 ls s3://qgen/MIX7341/ | awk '{print $NF}' > files_in_AWS.list ##step 1

$ for d in $(cat files_in_AWS.list); do echo "$d"; aws s3 cp "s3://qgen/MIX7341/$d" . ; done ##step 2
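One weakness of step 1 is that awk's $NF keeps only the last word of each line, so a file name containing spaces would be cut short. Below is a minimal sketch of a more careful parser. The function name s3_ls_names is my own invention, and it assumes the listing has the date, time, size, name layout shown in the output above, with sub-folders reported on "PRE" lines.

```shell
# Print only the object names from `aws s3 ls` output read on stdin.
# Expected line layout: "2018-08-02 11:03:58  459407360 WBB3066.gz".
# Sub-folder lines ("PRE name/") are skipped.
s3_ls_names() {
    awk '$1 != "PRE" && NF >= 4 {
        # Drop date, time and size; keep the rest so names with spaces survive.
        sub(/^[^ ]+ +[^ ]+ +[0-9]+ /, "")
        print
    }'
}

# Hypothetical usage (needs configured credentials, so it is commented out):
# aws s3 ls s3://qgen/MIX7341/ | s3_ls_names | while IFS= read -r f; do
#     aws s3 cp "s3://qgen/MIX7341/$f" .
# done
```

If every file in the folder is wanted, "aws s3 cp s3://qgen/MIX7341/ . --recursive" does the same job without a loop.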





Thursday, August 2, 2018

ImportError: No module named botocore.session - Solved

My AWS CLI had some installation problems and ended up giving the following error:


$ aws

Traceback (most recent call last):
  File "/usr/bin/aws", line 19, in <module>
    import awscli.clidriver
  File "/usr/lib/python2.7/site-packages/awscli/clidriver.py", line 17, in <module>
    import botocore.session
ImportError: No module named botocore.session

Running the following command fixed the problem:


$ sudo pip install awscli --force-reinstall --upgrade

Collecting awscli

  Using cached https://files.pythonhosted.org/packages/18/99/a1a1ea5a91161d5be5f434550ac1de800d79da7d4f68a5b5c8f265fcbd58/awscli-1.15.70-py2.py3-none-any.whl

Collecting s3transfer<0.2.0,>=0.1.12 (from awscli)

  Using cached https://files.pythonhosted.org/packages/d7/14/2a0004d487464d120c9fb85313a75cd3d71a7506955be458eebfe19a6b1d/s3transfer-0.1.13-py2.py3-none-any.whl

Collecting colorama<=0.3.9,>=0.2.5 (from awscli)

  Using cached https://files.pythonhosted.org/packages/db/c8/7dcf9dbcb22429512708fe3a547f8b6101c0d02137acbd892505aee57adf/colorama-0.3.9-py2.py3-none-any.whl

Collecting rsa<=3.5.0,>=3.1.2 (from awscli)

  Using cached https://files.pythonhosted.org/packages/e1/ae/baedc9cb175552e95f3395c43055a6a5e125ae4d48a1d7a924baca83e92e/rsa-3.4.2-py2.py3-none-any.whl

Collecting docutils>=0.10 (from awscli)

  Using cached https://files.pythonhosted.org/packages/50/09/c53398e0005b11f7ffb27b7aa720c617aba53be4fb4f4f3f06b9b5c60f28/docutils-0.14-py2-none-any.whl

Collecting PyYAML<=3.13,>=3.10 (from awscli)

  Downloading https://files.pythonhosted.org/packages/9e/a3/1d13970c3f36777c583f136c136f804d70f500168edc1edea6daa7200769/PyYAML-3.13.tar.gz (270kB)

    100% |████████████████████████████████| 276kB 2.8MB/s

Collecting botocore==1.10.69 (from awscli)

  Using cached https://files.pythonhosted.org/packages/45/f3/0e5006bc6198213b853ca3975c537335a91716ffa1ed768e2b2ffe12d7ed/botocore-1.10.69-py2.py3-none-any.whl

Collecting futures<4.0.0,>=2.2.0; python_version == "2.6" or python_version == "2.7" (from s3transfer<0.2.0,>=0.1.12->awscli)

  Using cached https://files.pythonhosted.org/packages/2d/99/b2c4e9d5a30f6471e410a146232b4118e697fa3ffc06d6a65efde84debd0/futures-3.2.0-py2-none-any.whl

Collecting pyasn1>=0.1.3 (from rsa<=3.5.0,>=3.1.2->awscli)

  Downloading https://files.pythonhosted.org/packages/d1/a1/7790cc85db38daa874f6a2e6308131b9953feb1367f2ae2d1123bb93a9f5/pyasn1-0.4.4-py2.py3-none-any.whl (72kB)

    100% |████████████████████████████████| 81kB 3.5MB/s

Collecting jmespath<1.0.0,>=0.7.1 (from botocore==1.10.69->awscli)

  Using cached https://files.pythonhosted.org/packages/b7/31/05c8d001f7f87f0f07289a5fc0fc3832e9a57f2dbd4d3b0fee70e0d51365/jmespath-0.9.3-py2.py3-none-any.whl

Collecting python-dateutil<3.0.0,>=2.1; python_version >= "2.7" (from botocore==1.10.69->awscli)

  Using cached https://files.pythonhosted.org/packages/cf/f5/af2b09c957ace60dcfac112b669c45c8c97e32f94aa8b56da4c6d1682825/python_dateutil-2.7.3-py2.py3-none-any.whl

Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1; python_version >= "2.7"->botocore==1.10.69->awscli)

  Using cached https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl

Building wheels for collected packages: PyYAML

  Running setup.py bdist_wheel for PyYAML ... done

  Stored in directory: /root/.cache/pip/wheels/ad/da/0c/74eb680767247273e2cf2723482cb9c924fe70af57c334513f

Successfully built PyYAML

ipaclient 4.5.4 requires jinja2, which is not installed.

rtslib-fb 2.1.63 has requirement pyudev>=0.16.1, but you'll have pyudev 0.15 which is incompatible.

ipapython 4.5.4 has requirement dnspython>=1.15, but you'll have dnspython 1.12.0 which is incompatible.

Installing collected packages: futures, jmespath, docutils, six, python-dateutil, botocore, s3transfer, colorama, pyasn1, rsa, PyYAML, awscli

  Found existing installation: futures 3.2.0

    Uninstalling futures-3.2.0:

      Successfully uninstalled futures-3.2.0

  Found existing installation: jmespath 0.9.3

    Uninstalling jmespath-0.9.3:

      Successfully uninstalled jmespath-0.9.3

  Found existing installation: docutils 0.14

    Uninstalling docutils-0.14:

      Successfully uninstalled docutils-0.14

  Found existing installation: six 1.9.0

    Uninstalling six-1.9.0:

      Successfully uninstalled six-1.9.0

  Found existing installation: python-dateutil 2.7.3

    Uninstalling python-dateutil-2.7.3:

      Successfully uninstalled python-dateutil-2.7.3

  Found existing installation: botocore 1.10.69

    Uninstalling botocore-1.10.69:

      Successfully uninstalled botocore-1.10.69

  Found existing installation: s3transfer 0.1.13

    Uninstalling s3transfer-0.1.13:

      Successfully uninstalled s3transfer-0.1.13

  Found existing installation: colorama 0.3.9

    Uninstalling colorama-0.3.9:

      Successfully uninstalled colorama-0.3.9

  Found existing installation: pyasn1 0.1.9

    Uninstalling pyasn1-0.1.9:

      Successfully uninstalled pyasn1-0.1.9

  Found existing installation: rsa 3.4.2

    Uninstalling rsa-3.4.2:

      Successfully uninstalled rsa-3.4.2

  Found existing installation: PyYAML 3.10

Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

You are using pip version 10.0.1, however version 18.0 is available.

You should consider upgrading via the 'pip install --upgrade pip' command.

 

$ aws
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]

To see help text, you can run:


  aws help

  aws <command> help

  aws <command> <subcommand> help

aws: error: too few arguments

 

We can check the AWS CLI version as follows:


$ aws --version

aws-cli/1.15.70 Python/2.7.5 Linux/3.10.0-693.17.1.el7.x86_64 botocore/1.10.69
 

That's it! We can now use the AWS CLI to access the buckets in our AWS account. :)


Tuesday, July 10, 2018

IndexError: string index out of range - Parsnp - Solved

$ parsnp_linux_ignoresize -r ecoli_NZ_CP027462.fasta -d . -o coregenome -x -c
|--Parsnp v1.2--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
***************************
SETTINGS:
|-refgenome:    /storage/data/DATA4/analysis/Datta/CRE_outbreak/Trees/Parnsp_Gubbins_Concordant_Samples/CRE_Paola-GIS/ecoli_gte1kbContigs/ecoli_NZ_CP027462.fasta
|-aligner:    libMUSCLE
|-seqdir:    .
|-outdir:    coregenome
|-OS:        Linux
|-threads:    32
***************************

<<Parsnp started>>

-->Reading Genome (asm, fasta) files from ...
  |->[OK]
-->Reading Genbank file(s) for reference (.gbk) ..
  |->[WARNING]: no genbank file provided for reference annotations, skipping..
Traceback (most recent call last):
  File "<string>", line 656, in <module>
IndexError: string index out of range

Solution: a previous failed run had generated an empty parsnp.fasta file, which led to this error. Deleting parsnp.fasta and running the command again fixed it.
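As a small safeguard, the clean-up can be restricted to the case where the leftover file really is empty. This is only a sketch: the helper name remove_if_empty is made up, and the path in the usage comment is an assumption, since where parsnp.fasta ends up depends on your run.

```shell
# Delete a file only if it exists and is empty (zero bytes).
# Prints a short message saying what was done.
remove_if_empty() {
    f=$1
    if [ -e "$f" ] && [ ! -s "$f" ]; then
        rm -- "$f" && echo "removed empty file: $f"
    else
        echo "left untouched: $f"
    fi
}

# Hypothetical usage before re-running Parsnp; adjust the path to
# wherever the failed run wrote the file:
# remove_if_empty ./parsnp.fasta
```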

Wednesday, June 20, 2018

Mutation rate to Substitution per site per genome per year

Converting date to fractional date for BEAST phylogenetic analysis software

To run BEAST, we need to generate an XML file with the parameters. When we import the alignment file into BEAUti, it requires the sampling times in fractional years. Here is one way to compute them in Excel. Whatever date format cell A1 uses, as long as Excel recognizes it as a date, the formula converts it into a fractional year.


=YEAR(A1)+((A1)-DATE(YEAR(A1),1,0))/(DATE(YEAR(A1)+1,1,0)-DATE(YEAR(A1),1,0))
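The formula boils down to year + (day of the year) / (number of days in that year), because DATE(YEAR(A1),1,0) is the last day of the previous year. For dates kept in a text file rather than a spreadsheet, the same arithmetic can be sketched in the shell. The function name fractional_year is my own, and the sketch assumes GNU date (the -d option) and ISO dates such as 2018-08-03.

```shell
# Convert an ISO date (YYYY-MM-DD) to a fractional year using the same
# arithmetic as the Excel formula: year + day_of_year / days_in_year.
# Assumes GNU date, which understands the -d option.
fractional_year() {
    d=$1
    year=$(date -d "$d" +%Y) || return 1
    doy=$(date -d "$d" +%j)               # day of the year, 001..366
    days=$(date -d "${year}-12-31" +%j)   # 365, or 366 in a leap year
    awk -v y="$year" -v n="$doy" -v t="$days" \
        'BEGIN { printf "%.4f\n", y + n / t }'
}

fractional_year 2018-08-03   # prints 2018.5890
```

Note that with this formula 31 December maps to the following whole year (for example 2016-12-31 gives 2017.0000). If you would rather have 1 January equal to year.0, use (day of the year - 1) in the numerator.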


Tuesday, February 13, 2018

Merged entries in the non-redundant Nucleotide (nt) database

I ran the following command to extract a sequence from the nt database:


$ blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KJ413946.1 | grep '>'

>KJ413946.1 Escherichia coli strain ECS01 plasmid pNDM-ECS01, complete sequence >KP900017.1 Raoultella ornithinolytica strain YNKP001 plasmid pYNKP001-NDM, complete sequence

I noticed that this is a merged entry in the 'non-redundant' Nucleotide (nt) database, where two nucleotide sequences which have the same sequence are combined under a single FASTA entry.

When I used the second identifier in the merged entry, blastdbcmd command gave me the same result in return:

$ blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KP900017.1 | grep '>'

>KJ413946.1 Escherichia coli strain ECS01 plasmid pNDM-ECS01, complete sequence >KP900017.1 Raoultella ornithinolytica strain YNKP001 plasmid pYNKP001-NDM, complete sequence

Out of curiosity, I checked whether they are the same length. Yes, both are 41190 bases long.


$ blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KP900017.1 | grep -v '>' | tr -d '\n' | wc 

      0       1   41190

$ blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KJ413946.1 | grep -v '>' | tr -d '\n' | wc

      0       1   41190
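Plain wc prints line, word and character counts together, so the length is the third number. A small sketch that prints only the length is shown here; the function name fasta_seq_length is hypothetical and it simply wraps the same grep and tr steps.

```shell
# Print the total number of sequence characters in FASTA text read on stdin.
# Header lines (starting with '>') and line breaks are not counted.
fasta_seq_length() {
    grep -v '^>' | tr -d '\n\r' | wc -c | tr -d ' '
}

# Hypothetical usage with the command from this post (needs the nt database):
# blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KJ413946.1 | fasta_seq_length

printf '>demo record\nACGT\nAC\n' | fasta_seq_length   # prints 6
```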

Curious again whether these two sequences are an exact match, I blasted them against each other. Not surprisingly, they share 100% identity. See the BLAST screenshots below:





So, how to sort this out? I need only one annotation per sequence for my analysis, and I do not want the second identifier to confuse my downstream scripts.

After searching around, I found that the problem can be solved by using the "-target_only" option when running the command. I believe this situation might arise in the nr database too, but switching this option on should help there as well.


$ blastdbcmd -db /mnt/Storage/nt/nt/nt -entry KJ413946.1 -target_only | grep '>'

>KJ413946.1 Escherichia coli strain ECS01 plasmid pNDM-ECS01, complete sequence
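For FASTA files that were already extracted without "-target_only", the extra identifiers can also be trimmed afterwards. This is only a sketch under one assumption: that the merged deflines are joined with " >" exactly as in the output shown above. The function name first_defline_only is made up for the example.

```shell
# Keep only the first defline of merged FASTA headers read on stdin.
# Assumes secondary deflines are appended as " >ID description".
# Sequence lines are passed through unchanged.
first_defline_only() {
    sed '/^>/ s/ >.*$//'
}

# Hypothetical usage on a previously extracted file (file names are examples):
# first_defline_only < merged_entries.fasta > single_header.fasta
```

Unlike "-target_only", this always keeps the first identifier in the header, which may not be the one you asked for.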

Wednesday, February 7, 2018

[bcf_sync] incorrect number of fields (0 != 5) at 0:0

The following samtools v0.1.18 mpileup command gave me an error:

$ samtools mpileup -E -M0 -Q25 -q30 -m2 -D -S test_SortSam_withReadGroup.bam | bcftools view -vcg - > test.vcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[bcf_sync] incorrect number of fields (0 != 5) at 0:0
[afs] 0:0.000

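Including the "-g" option helped to overcome the problem.

$ samtools mpileup -g -E -M0 -Q25 -q30 -m2 -D -S test_S8_SortSam_withReadGroup.bam | bcftools view -vcg - > test.vcf

"-g" computes genotype likelihoods and outputs them in the binary call format (BCF), which is the input that "bcftools view" expects. Without it, mpileup writes plain pileup text, which bcftools cannot parse.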



Thursday, February 1, 2018

One liner to convert a long fasta-file into many separate single fasta sequences

Taken from the Biostar post: https://www.biostars.org/p/105388/


while IFS= read -r line; do if [[ ${line:0:1} == '>' ]]; then outfile="${line#>}.fa"; echo "$line" > "$outfile"; else echo "$line" >> "$outfile"; fi; done < combined_kpneumoniae.tfa
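The one-liner uses the whole header as the file name, which gets awkward when headers contain spaces or slashes. A possible variant in awk is sketched below. It names each file after the first word of the header and replaces unusual characters with "_". The function name split_fasta and the output directory argument are my own additions, not part of the Biostar post.

```shell
# Split a multi-FASTA file into one file per record.
# Usage: split_fasta <input.fasta> [output_dir]
# Each output file is named after the first word of the record header.
split_fasta() {
    infile=$1
    outdir=${2:-.}
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        /^>/ {
            if (out) close(out)                  # finish the previous record
            name = substr($1, 2)                 # first word without ">"
            gsub(/[^A-Za-z0-9._-]/, "_", name)   # keep file names safe
            out = dir "/" name ".fa"
        }
        out { print > out }                      # header and sequence lines
    ' "$infile"
}

# Hypothetical usage with the file from this post:
# split_fasta combined_kpneumoniae.tfa split_records
```

Two records whose headers share the same first word would overwrite each other, so this assumes the identifiers are unique.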

Wednesday, January 10, 2018

Coloring specific tips labels of phylogenetic tree in R

Some time ago, I wanted to change the colors of specific branches and tip labels in a phylogenetic tree. I used R to change the colors, and I have posted a screenshot of what I did.




Tuesday, January 9, 2018

Converting FastQ to FastA - comparison of one-liners and toolkits

Just out of curiosity, I checked how long different tools and one-liners take to convert FastQ to FastA. I took the commands and one-liners from this Biostar post.

The input is a gzipped file containing 6 M reads of 301 bp.