Package: AnnotationHubData
Authors: Martin Morgan [ctb],
Marc Carlson [ctb],
Dan Tenenbaum [ctb],
Sonali Arora [ctb],
Paul Shannon [ctb],
Bioconductor Package Maintainer [cre]
Modified: 11 October, 2014
Compiled: Thu Feb 18 21:23:25 2016
The AnnotationHubData package provides tools to acquire, annotate, convert and store data for use in Bioconductor's AnnotationHub. BED files from the ENCODE project, GTF files from Ensembl, and annotation tracks from UCSC are examples of data that can be downloaded, described with metadata, transformed to standard Bioconductor data types, and stored so that they can be served on demand to users via the AnnotationHub client.
This document covers the nuts and bolts of how the back end of the AnnotationHub operates.
For the first part of the description of how to put new data into the AnnotationHub, please see the vignette titled “How to write recipes for new resources for the AnnotationHub” in the AnnotationHub package. It explains how to create a recipe and an AnnotationHubMetadata generating function.
After those initial two steps though, you will have to actually run the recipe. This will require that you follow the instructions described here.
Once you have a recipe and an AnnotationHubMetadata generating function, you will need to call the makeAnnotationHubResource function to do some automatic setup. This function has only two required arguments. The first is the name of a class that describes the kind of resource you are writing code to import. It just needs to be a unique class name, and it will be used internally to create a class for dispatch. The second argument is the name of your metadata processing function from the first step in the other vignette. If this function lives in the namespace of another package (in other words, if it is a contributed recipe), you might have to import it for the next step to work.
require(AnnotationHubData)
makeAnnotationHubResource("Inparanoid8ImportPreparer",
                          makeinparanoid8ToAHMs)
Once you have added this call to AnnotationHubData, the only step left is to export the class name in the NAMESPACE (remember that this is the string you provided as the first argument). Then reinstall AnnotationHubData and go to the next step.
Next, with the modified AnnotationHubData package installed, check that the recipe works correctly. You just need to call the function that tries to update the resources for your recipe, updateResources(). It takes several arguments, illustrated in the following call:
library(AnnotationHubData)
mdinp = updateResources(ahroot = getwd(),
                        BiocVersion = biocVersion(),
                        preparerClasses = "Inparanoid8ImportPreparer",
                        insert = FALSE,
                        metadataOnly = TRUE)
When you call updateResources() like this, a set of AnnotationHubMetadata objects should be returned. You should inspect those to make sure that they contain all the metadata that you intended to use. And if you run it again with metadataOnly set to FALSE, then it should generate the final objects for you in the ahroot directory that you specified. If both of these things happen to your satisfaction then your recipe is ready to be run.
When a recipe is run, metadata about the data it processes is also collected. This metadata is stored as a record in a MySQL database back end. When users run the client, this database is downloaded (if needed) as an SQLite dump of that MySQL back end. MySQL performs better whenever we need to upload new records into the back end, while an SQLite DB is more portable for local cached access, so both kinds of database are used in this implementation. The 'parent' database is the MySQL DB; the SQLite ones are just cached copies of it.
Normally when we work with this database we first test on a local instance on “MachineName” and then update the production machine later.
To actually push data to this local database you need to set an option in your .Rprofile for where the JSON should be posted. That should look like this:
options("AH_SERVER_POST_URL"="http://<MachineName>:9393/resource")
To log in to the database on “MachineName” use this (you will need the ahuser password):
mysql -p -u ahuser annotationhub
And for production you will need to connect here:
ssh ubuntu@annotationhub.bioconductor.org
And then connect to the database:
mysql -p -u ahuser annotationhub
Whenever you have updated the MySQL back end, a cron job should create the locally cached SQLite database from the MySQL source DB. This is done by a Ruby script called convert_db.rb. This file is part of the code that is checked in at the following URL on GitHub.
https://github.com/Bioconductor/AnnotationHubServer3.0
Updating the production server should follow these steps:
You can run the recipe by calling updateResources with the insert and metadataOnly arguments set up like this:
mdinp = updateResources(ahroot,
                        BiocVersion,
                        preparerClasses = "Inparanoid8ImportPreparer",
                        insert = TRUE,
                        metadataOnly = FALSE)
See the section below titled “Final test that it works on gamay”.
For this step you need to log in, and back up the database on production. So that should look like this (use the ahuser password for this). Instead of 2014_11_05, use the current date.
ssh ubuntu@annotationhub.bioconductor.org
cd ~/AnnotationHubDumps/prodDumps/
mysqldump -p -u ahuser annotationhub | gzip > dbdump_2014_11_05_fromProd.sql.gz
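Rather than typing the date by hand, the dump name can be derived from the current date. The following is only a minimal sketch and not part of the documented procedure: the helper name `dump_name` is hypothetical, and the `YYYY_MM_DD` layout is assumed from the example filename above.

```shell
# Hypothetical helper: build a dump filename following the
# dbdump_YYYY_MM_DD_from<Source>.sql.gz pattern used in this document.
# $1 is the source label, e.g. Prod or Gamay.
dump_name() {
    printf 'dbdump_%s_from%s.sql.gz\n' "$(date +%Y_%m_%d)" "$1"
}
```

With that in place, the backup command could be written as `mysqldump -p -u ahuser annotationhub | gzip > "$(dump_name Prod)"`.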
Then copy it down (from gamay):
cd /mnt/cpb_anno/mcarlson/AnnotationHubDumps/prodDumps/
scp ubuntu@annotationhub.bioconductor.org:/home/ubuntu/AnnotationHubDumps/prodDumps/dbdump_2014_11_05_fromProd.sql.gz .
And then unpack it on gamay (use the MySQL root password for this):
mysql -p -u root
Then in SQL drop and create the DB
drop database annotationhub;
create database annotationhub;
\q
At this point there is no data in the database, so AnnotationHub is not working! Be sure to do the next step right away.
Then unpack the dump into the DB
zcat dbdump_2014_11_05_fromProd.sql.gz | mysql -p -u root annotationhub
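Because AnnotationHub has no data between the drop and the restore, it may be worth checking the archive before dropping the database. The sketch below is one possible precaution, not part of the original procedure; the helper name `check_dump` is hypothetical. Note that `gzip -t` only verifies the integrity of the compressed file, not that the SQL inside is complete.

```shell
# Hypothetical helper: succeed only if the dump file exists, is non-empty,
# and passes gzip's own integrity test.
check_dump() {
    [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}
```

For example, `check_dump dbdump_2014_11_05_fromProd.sql.gz && echo "dump looks intact"` could be run before opening the MySQL prompt.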
mdinp = updateResources(ahroot,
                        BiocVersion,
                        preparerClasses = "Inparanoid8ImportPreparer",
                        insert = TRUE,
                        metadataOnly = FALSE)
The first part of this is that the server on gamay has to have produced the .sqlite version of the MySQL database we just modified. To make that happen, start by impersonating Dan and going to his directory with the Ruby scripts:
sudo su - dtenenba
cd AnnotationHubServer3.0/
From there you can look at his cron job to see how it runs every 10 mins…
crontab -l
But we don't want to wait ten minutes! So run the following ruby script to make the DB right now.
ruby convert_db.rb
exit
Sometimes if you are rolling back there may be an issue where the ruby conversion script doesn't run because it thinks the new DB has an older timestamp. If that happens remove the timestamp file and run it again:
rm dbtimestamp.cache
ruby convert_db.rb
exit
Once this is done, you have to test the client and make sure that your data is actually there. This means that you need to launch AnnotationHub so that it points to gamay for metadata. You can do that like this:
options(ANNOTATION_HUB_URL="http://gamay:9393")
library(AnnotationHub)
ah = AnnotationHub()
length(ah) ## should be larger than you started with
tail(ah) ## should have some new records
To manually test metadata you could just do this:
foo = mcols(ah)
And then look for your data in 'foo'. It's also a good idea to try to download that data with its “AHID”.
mysqldump -p -u ahuser annotationhub | gzip > dbdump_2014_11_05_fromGamay.sql.gz
Then copy it up from gamay back to production:
scp /mnt/cpb_anno/mcarlson/AnnotationHubDumps/gamayDumps/dbdump_2014_11_05_fromGamay.sql.gz ubuntu@annotationhub.bioconductor.org:/home/ubuntu/AnnotationHubDumps/gamayDumps/
And then unpack it (use the MySQL root password for this):
mysql -p -u root
Then in SQL drop and create the DB
drop database annotationhub;
create database annotationhub;
\q
Then unpack the dump into the DB
zcat dbdump_2014_11_05_fromGamay.sql.gz | mysql -p -u root annotationhub
Unlike the metadata, the data that is stored after a recipe is run is
kept in something that looks like a file system. When testing on …
To copy things up to the S3 buckets you need to use the AWS command line tools. You can see help like this:
aws s3 help
And you can get “copying” help like this:
aws s3 cp help
And you can copy something so that it is publicly readable by using the --acl argument like this:
aws s3 cp --acl public-read hello.txt s3://annotationhub/foo/bar/hello.txt
To sign in at the AWS console go here (use your own username and password; the account is bioconductor):
https://bioconductor.signin.aws.amazon.com/console
And then click 'S3', 'annotationhub' etc.
And to recursively copy back down from the S3 bucket, do this:
aws s3 cp --dryrun --recursive s3://annotationhub/ensembl/release-75/fasta/ .
And then to copy back up you could do it like this:
aws s3 cp --dryrun --recursive --acl public-read ./fasta/ s3://annotationhub/ensembl/release-75/fasta/
But that will copy everything, which is not very efficient since some files may be unchanged and already there. So to actually copy back up we should really use 'aws s3 sync' instead of 'aws s3 cp'.
aws s3 sync help
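As a sketch of what the sync-based upload might look like, mirroring the `cp` example above: `aws s3 sync` accepts the same `--dryrun` and `--acl` options and only transfers files that are new or changed. The wrapper name `sync_up` and the `AWS` override variable are hypothetical conveniences; the bucket path in the usage line is taken from the earlier example.

```shell
# Hypothetical wrapper around 'aws s3 sync'. It always passes --dryrun, so it
# only reports what would be uploaded; remove --dryrun for a real transfer.
# AWS can be overridden (e.g. AWS=echo) to print the command instead of running it.
# $1 is the local directory, $2 is the destination S3 URL.
sync_up() {
    ${AWS:-aws} s3 sync --dryrun --acl public-read "$1" "$2"
}
```

For example: `sync_up ./fasta/ s3://annotationhub/ensembl/release-75/fasta/`. Keeping the dry run as the default makes it harder to overwrite bucket contents by accident.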
The server is implemented in Ruby. Its code is here with a README that explains how to install it.
(more to come…)