Key Concepts of working with official AMP PD Workspaces

The AMP PD program provides curated workspaces to the AMP PD community. These are the official AMP PD workspaces. When you sign up for access to AMP PD, these become available to you under the AMP PD release v1 billing project. Today, these official workspaces are focused on the Getting Started workspaces. These are like the quickstarts that you walked through with Ali, but where those are maintained by the Broad for the Terra community, these Getting Started workspaces are developed and maintained by AMP PD, so they’re tied to the AMP PD community. They’re integrated with all of our clinical, RNA-Seq, and whole genome sequence data, and they provide a variety of essential tools for working with our AMP PD data.


Getting Started workspaces are organized for you depending on your tier of access. If you signed up for Tier 1 access during registration, you would only see the Getting Started workspace for clinical and summary data. If you signed up for Tier 2 full access during registration, you would also have access to the Getting Started workspace for clinical and omics data. Note that as a Tier 2 user, you only need to work with the Tier 2 workspace, because all of our Tier 1 resources are included in there. These workspaces provide clear examples of how to work with the AMP PD data, in Jupyter notebooks that are written in both Python and R.

So, whatever you’re more comfortable with, we’ve got you covered. They also contain content notebooks that simply describe what’s in the data release. And what we’re doing with these workspaces isn’t static: official AMP PD workspaces are continually being added. We have staff working on new analyses that become official AMP PD workspaces, and we also have partners working on their own analyses.

I have access to both the Clinical access tier and the Clinical and Omics access tier, so I see both of these. I would typically work within Tier 2, but for this demonstration, I’m going to walk through the Tier 1 Clinical Access workspace. Here you see a whole lot of front matter that describes how to get started and what the purpose of these workspaces is. I’ll come back to a piece of this later, but I’ll pause here at Data and Google Cloud Storage. You can see we’ve got references here to where the data are stored in our Google Cloud buckets and where they’re stored in our BigQuery datasets.

I think Dave mentioned earlier that all the data that we have in Google Cloud buckets is reflected in BigQuery and vice versa. Finally, in the notebook section, you can see a listing of the Getting Started workspace notebooks that we have for clinical access. The list is smaller for this one than it would be, of course, for Tier 2 access, but it gives you enough to get your feet wet. Here’s a Python 3 notebook on clinical data that shows how to load a table from BigQuery into a Jupyter notebook in your workspace. That’s what we’re going to look at here in a second. Okay, so before we do that, I want to show you a couple of general features of the workspace.

One of them, which I think Ali showed you earlier, is that if you click through Support and click into Contact Us, this is an incredibly simple way to get an answer to questions about the Terra user interface, or maybe about Google Cloud integration with Terra. But if you have specific questions related to the AMP PD workspaces, the AMP PD research data, or even creating your own analyses using AMP PD data, we have people on staff who can help with all of those.

We’d ask that you send an email to admin@amp-pd.org, and we’ll open a help desk ticket and get the right person to answer your questions. One other point before I move into the notebook is under the Data tab. Ali also showed you this, but specific to the AMP PD workspaces, I want to mention how we’re using these: basically, they are global variables.

We encourage you to use these also. All of our official AMP PD workspaces use global variables that are then referenced in all of the notebooks. They provide a convenient way for you to change behavior in all of your notebooks in one place. When you need to make an update, say there’s a new AMP PD data release, you only have to make it in one place. Without these, you would have to change every notebook, which could be expensive and could lead to errors. Okay. So, let’s get into one of these notebooks.
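The idea of keeping release-specific settings in one place can be sketched in Python. This is only an illustration under stated assumptions: the setting names (AMP_PD_RELEASE, AMP_PD_BQ_PROJECT) and their default values are hypothetical, and the sketch reads environment variables as a stand-in rather than Terra's actual workspace data.

```python
import os

# Hypothetical setting names and placeholder defaults, for illustration only.
DEFAULTS = {
    "AMP_PD_RELEASE": "release_v1",
    "AMP_PD_BQ_PROJECT": "example-project",
}


def load_settings(environ=None):
    """Return the shared settings, letting the environment override the defaults."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


def dataset_id(settings):
    """Build a fully qualified BigQuery dataset ID from the shared settings."""
    return f"{settings['AMP_PD_BQ_PROJECT']}.{settings['AMP_PD_RELEASE']}"
```

With this shape, moving every notebook to a new data release means changing one value rather than editing each notebook.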

This is the same list that you saw on the splash page. I’m going to choose “Py3-Clinical-load a table from BigQuery.” Now, I’m not actually going to start up a cluster and execute this. What I’ve done is open it in preview mode, which shows the saved output from the last run in the output window for each of the cells. Because these are curated by AMP PD, you’re seeing the output we curated, not the output of whoever executed it last.

So, just scrolling down through here, I want to show that a lot of the initial setup and initialization that Ali told you about has all been taken care of for you. We’ve also put some utility functions in here that will help you work with the AMP PD workspace data. Finally, getting down to something more interesting: when we list out the BigQuery dataset, it’s very convenient that all that upfront initialization is set up, because to get a listing of all the tables, you simply have to execute this one line of code. Let’s see.

I think I wanted to talk a little bit about some of these tables. We’ve got the release metadata down here also, and you can pick any one of those, substitute it for demographics if you choose, and just see what its contents are. So, in this example, you’ve got a way to load up the table, print out all the fields that are in that table, and then drop a preview to the screen.
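As a rough sketch of the table preview described above, the helper below only builds the query text. The project and dataset names are placeholders, not the real AMP PD identifiers, and the function name is made up for this example. The resulting string would be passed to whichever BigQuery client the notebook already has set up.

```python
def preview_query(project, dataset, table, limit=5):
    """Return a BigQuery Standard SQL query that fetches a few rows of a table."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM `{project}.{dataset}.{table}` LIMIT {int(limit)}"


# Placeholder project and dataset names; swap "demographics" for any other table.
print(preview_query("example-project", "release_v1", "demographics"))
```

Because the table name is a parameter, previewing a different table from the listing is a one-word change.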

What this would then serve as is possibly an input to your analysis. The last point I want to make on this particular notebook is about provenance. In all of our official AMP PD workspaces, we include a listing of dependencies and their versions. This is to help track the provenance of results and to help improve reproducibility, so we would encourage you to do the same in your notebooks. Okay. The next one I want to show you actually comes from the Tier 2 dataset.
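One minimal way to record dependencies and versions at the end of a Python notebook is sketched below. It uses only the standard library. The package names are examples, and this is not necessarily how the AMP PD notebooks produce their listing.

```python
import platform
from importlib import metadata


def dependency_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example package names; list whatever your own notebook imports.
print("python", platform.python_version())
for name, version in dependency_versions(["pandas", "numpy"]).items():
    print(name, version or "not installed")
```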

So, I’m not going to actually move through it, but I’ll pick a couple of spots, just to highlight that where I showed you something very simple before (querying a clinical data table is pretty simple stuff), we’ve also got more complex, sophisticated analyses to serve as building blocks. This one is called the Quick QC Check. We actually use this to determine sex concordance between clinical and RNA samples.


So, I’ve just bookmarked a couple of places in the same notebook. Further down here, it’s showing that we have Sex Checks with Salmon TPMs. After we ran Salmon in Terra, we pulled the results out and put them into a BigQuery table, and that can be combined in this query with our demographics data so we can determine whether the clinical data is concordant with the sample data. All those tables are available to you with Tier 2 access.
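The concordance idea can be sketched in plain Python. This is a stand-in for the BigQuery query in the notebook, not a copy of it: the participant IDs and values are made up, and how sex is inferred from the Salmon TPMs is not shown here.

```python
def discordant_samples(reported_sex, inferred_sex):
    """Return sorted IDs whose inferred sex disagrees with the reported sex.

    IDs that are missing from either input are skipped.
    """
    shared = reported_sex.keys() & inferred_sex.keys()
    return sorted(pid for pid in shared if reported_sex[pid] != inferred_sex[pid])


# Made-up example values, not AMP PD data.
reported = {"P1": "Male", "P2": "Female", "P3": "Female"}
inferred = {"P1": "Male", "P2": "Male", "P4": "Female"}
print(discordant_samples(reported, inferred))
```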

For this last piece, I just want to show briefly that it is possible to get some good visualizations in the Terra notebooks without ever having to download the data to your own site, and you can use these just to get started. Okay. I think I’m going to back out of the last one and just talk briefly about the ConOps for working with these projects.

So, the Getting Started workspaces in AMP PD provide read-only access, which means you can’t actually execute these in this billing project. You would need to create your own billing project or use the $300 free credit one that Ali told you about. Then you can clone the workspace entirely and start modifying it yourself, or you could copy a notebook into a workspace that you created under your billing account. In this way, the program helps distribute the cost associated with processing, which can get very heavy.

I’ll show you an example of that when I get into billing. But, you know, even with that distributed cost model, we are still able to retain a sense of community and data sharing that we might otherwise lose if we didn’t have these tools in the cloud and people simply downloaded the data to their own clusters. All right, so I’m going to stop this. Okay. Now that we’ve seen how to use Terra, let’s talk a little bit about billing: how we pay for processing in the cloud at AMP PD.

I think Ali showed you how you can get started with a $300 credit in Terra, but that’s only going to last so long. When those credits have expired, you’re going to need to create a Google billing account and then associate it with Terra. Google makes it very easy for you to drop in a credit card. When you have a brand-new Google Cloud account, it’s like all roads point to creating a billing account, so I’m not going to demonstrate that.

Another way that you can fund your Google Cloud account is through an institutional account, through a provider, or through a program like STRIDES, which you’ll hear about from Nick Weber, who is here with us today. Once you have a billing account in Google, you need to link it to Terra. I will show that. It doesn’t show very well, but if you get caught up on this part, you might be happy that you have this recording to refer to. So, I’ll move into a demonstration of the billing side of things. Here I’ve got a very simple view set up.

I want to point out that we’re assuming that we’ve already dropped a credit card into Google Cloud and created a billing account. That part is very easy. Now we need to give Terra permission to use our Google Cloud billing account and to create resources and projects on our behalf. I actually wanted to show where this is, because this is kind of the tricky part.

If you’re considering providing access to Terra, you would think that it would go under Identity and Access Management, but it doesn’t. Billing accounts are treated in a different way. If you want to provide access to your billing account, you need to go into the Billing menu right here. Then, at the bottom of that, click on account management, and you’ll see your projects, or your billing account, I should say. To provide access to it, you need to click on this.


It shows an info panel, and that’ll slide out a drawer. If you follow the instructions that are at the bottom of that last slide, you’ll find that we want to put in this address: terrabilling@terra.bio. I will add this as a member, and its role will be Billing Account User. And then I’ll quickly save. All right. So, my policy in Google Cloud has been updated. Now, this is interesting: Terra billing resolves to terrabilling@firecloud.org. That could be a legacy thing; I don’t know if that’s specific to my setup, but if you see that, it’s the same thing. What we have managed to accomplish is that if we now go into Terra, we can ask Terra to create projects on our behalf. So, I’ve got this kind of canned setup.

You saw this earlier briefly in Ali’s walk-through. What I would do is come in under my name and pick Billing, and that should come up with a box. And yet it’s not. Well, the interface is very simple. It’s a one-box interface. It asks you to type the name of your billing project into the first field, and then in the second field, basically, you hit go. That spins and whirs for about 10 to 30 seconds. When that’s done, another one of these labeled projects will appear, and then you can start using it to associate new workspaces in Terra.

You can clone things into it. You can run up a whole lot of charges. And then, finally, we want to go into Google Cloud cost reporting so that you can see how much you’re spending. Here I’d like to emphasize how well Terra integrates with the Google Cloud account. When you create a workspace in Terra and you run your notebooks and pipelines, Terra creates virtual machines in a project under your Google billing account.

So, now Terra’s managing your Google Cloud resources for you. This means that you only have to manage one billing account, which is great, but it also means that you can use all of the Google Cloud tools to manage your cloud budgets, review your audit logs, and review your account activity. Here, I’m showing two Terra projects. These are actual projects that we use in AMP PD. One of them is the AMP PD admin billing project, which is what we use to test the release. The other one is the same billing project that you see the Getting Started workspaces under; that’s AMP PD release v1. So, there are two billing projects I’m showing, and I can change the timeframe and a few other things. One view that’s very useful, I’ll show you, is the composition of the spending.

How are we spending the money within those two projects? I’d also like to group them by product. So, here I can see that everything in blue is consistently expensive, and here Compute Engine starts and stops as we’re doing some testing of our workflows and Jupyter notebooks. Still, looking at the cost of these ($31 a day, $31 a day, $20 a day), it seems relatively small for a program as large as AMP PD. That’s not a whole lot of money. But in this tab, I’m showing where we ran very expensive processing. Here you’re seeing AMP PD’s costs for the period from May to the end of June, when we did a whole lot of RNA-Seq processing through Salmon, featureCounts, and STAR alignment. That was a pretty expensive process, and we ran all of it in Terra.


We also did our joint genotyping for 4,000 whole genome sequence processed CRAMs. The grand total for that two-month stretch was $23K, which, even for a program as large as ours, is a significant amount of money. So, you can spend real money using Terra tools, and you’re going to want to track and monitor that. The last couple of points I want to make: one is about setting budgets and alerts. One of the tools that you have in Google Cloud lets you set a budget with alerts. I call mine the testing budget. I define a trigger at 1 percent, so when somebody actually uses this at all, I’ll get an email.

When it’s halfway spent, I’ll pay a lot more attention to it. And when it’s at 80 percent and 100 percent, I need to figure out what to do because it’s costing too much money. One thing I want to point out is that these are just alerts. We’re not actually shutting people down. These are not thresholds where Google just stops letting you process stuff. I don’t think they want to get into the business of shutting down your pipeline. But you can set up programmatic scripts that run based on these triggers.
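The alert tiers described here (1, 50, 80, and 100 percent) can be sketched as a small check. This only illustrates the logic a custom script might apply. It does not call any Google Cloud API, and the function name and dollar amounts are made up.

```python
# Alert tiers from the talk, expressed as fractions of the budget.
THRESHOLDS = (0.01, 0.50, 0.80, 1.00)


def crossed_thresholds(spend, budget, thresholds=THRESHOLDS):
    """Return the alert thresholds that the current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    fraction = spend / budget
    return [t for t in thresholds if fraction >= t]


# Made-up amounts: $600 spent against a $1,000 budget.
print(crossed_thresholds(600, 1000))
```

As the talk notes, reaching a threshold only raises an alert. Anything that stops the spending has to be scripted separately.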

And if you wanted to set up a custom script to perform that function, you could. Finally, and I’m not even going to click into it, there’s the billing export. All of the cost material that you saw driving the graphs is pretty extensive. You can ask Google to deposit all of that cost and billing information into a giant file and put it into a Google Cloud bucket, or you can ask it to put it into a BigQuery table so that you can query billing information in the same way you would query your research data.
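As a sketch of querying exported billing data, the helper below builds a cost-by-service query. The table name and dates are placeholders. The column names (service.description, cost, usage_start_time) are what I understand Google's standard billing export schema to use, so check them against your own export before relying on this.

```python
def cost_by_service_query(billing_table, start_date, end_date):
    """Return a BigQuery Standard SQL query that totals cost per service over a date range."""
    return (
        "SELECT service.description AS service_name, "
        "ROUND(SUM(cost), 2) AS total_cost "
        f"FROM `{billing_table}` "
        f"WHERE DATE(usage_start_time) BETWEEN '{start_date}' AND '{end_date}' "
        "GROUP BY service_name "
        "ORDER BY total_cost DESC"
    )


# Placeholder table name and dates, for illustration only.
print(cost_by_service_query("example-project.billing.export_table", "2024-05-01", "2024-06-30"))
```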