Dear developers,
I am new to Scout and have some questions.
We are developing a lab for a class where we aim to use Scout to trace a Mendelian disease in a pedigree. In the absence of real clinical data we are thinking of taking some chromosomes from the 1000 Genomes Project as founders and simulating matings and offspring. After that we will seed the resulting individuals with the particular mutation.
This all requires quite some work, and will result in some VCF, PED and MAP files with variants and some basic relationships between individuals.
My first question is, how is this data normally loaded into Scout? Is it normally side-loaded from command-line, or uploaded by a user via the web?
The examples available in the scout repo have a lot of metadata spread out across various yaml and variant files. Is there a way to easily generate the needed metadata or do the files need to be hand-crafted? What is the smallest viable set of metadata that can be used to load variation from VCF, a pedigree and encoding for whether someone is affected or not (or carriers/non-carriers)?
Lastly, as long as it is specified (somewhere?), does it matter whether we work with variants scored against the GRCh37 or GRCh38 human genome assemblies? The examples seem to use GRCh37.
Hi there! Great to hear that you are using Scout for teaching! We have done that a few times ourselves, mostly with adapted clinical cases or spiked GIAB reference samples. We might be able to share some of the latter publicly, but haven't got a good sharing system up for that yet. Tentatively looking at figshare now. It would be fun to look at your solution later! Are you running it on UPPMAX or somewhere else?
Cases are always loaded, deleted etc. via the CLI. The intention is that in a production facility tasks like this are usually automated, or at least performed by informatics personnel who strongly prefer the repeatability of a CLI over clicking. 😸
Yes, basically a pedigree and the filenames of the VCF files you actually want loaded. Some of the metadata files are redundant, reflecting different (versions of) pipelines and their approaches to encoding the same info. I would suggest looking at the Scout load YAML files of the demo cases, e.g. https://github.com/Clinical-Genomics/scout/blob/master/scout/demo/cancer.load_config.yaml or https://github.com/Clinical-Genomics/scout/blob/master/scout/demo/643594.config.yaml, and trimming away everything that you don't have or understand. It will likely work, or warn accordingly. We can try to help you from there!
In principle, GRCh37 and GRCh38 should work, but take care to specify for each case if you load 38 since 37 is default in some places (human_genome_build: 38 in the case load config yaml). Solna run mostly pure 37 instances, but e.g. Lund have mixed and mostly 38.
Have you tried the admin guide in the docs dir (mirrored on e.g. http://www.clinicalgenomics.se/scout/)? I realise it may not be clear on all counts. Please keep asking for what's missing!
(E.g. http://www.clinicalgenomics.se/scout/admin-guide/load-config/#minimal-config ☺️)
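To make the "pedigree plus VCF filenames" idea concrete, here is a rough sketch of turning PED rows into a minimal case load config. Only `human_genome_build` is taken from this thread; the other key names (`owner`, `family`, `samples`, `sample_id`, `father`, `mother`, `sex`, `phenotype`, `analysis_type`, `vcf_snv`) and the values `cust000`, `fam1` and `fam1.vcf.gz` are assumptions or placeholders, so compare the output against the demo config and the minimal-config docs before loading anything. How founders (parents given as `0` in PED) should be written is also worth checking there.

```python
from typing import Dict, List

# Standard PED codes: column 5 is sex, column 6 is phenotype.
SEX = {"1": "male", "2": "female"}
PHENOTYPE = {"1": "unaffected", "2": "affected"}


def ped_to_samples(ped_text: str) -> List[Dict[str, str]]:
    """Turn PED rows into sample dicts (key names are assumptions, not confirmed)."""
    samples = []
    for line in ped_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _family, ind, father, mother, sex, phenotype = line.split()[:6]
        samples.append({
            "sample_id": ind,
            "father": father,  # "0" marks a founder in PED
            "mother": mother,
            "sex": SEX.get(sex, "unknown"),
            "phenotype": PHENOTYPE.get(phenotype, "unknown"),
            "analysis_type": "wgs",  # assumption: simulated whole genomes
        })
    return samples


def render_config(owner: str, family: str, samples: List[Dict[str, str]],
                  vcf_snv: str, build: int = 37) -> str:
    """Render a small YAML document by hand to avoid extra dependencies."""
    lines = [
        "---",
        f"owner: {owner}",
        f"family: '{family}'",
        f"human_genome_build: {build}",
        "samples:",
    ]
    for sample in samples:
        for index, (key, value) in enumerate(sample.items()):
            prefix = "  - " if index == 0 else "    "
            lines.append(f"{prefix}{key}: {value}")
    lines.append(f"vcf_snv: {vcf_snv}")
    return "\n".join(lines) + "\n"


# Hypothetical trio: two unaffected founders and one affected child.
PED = """\
fam1 father 0 0 1 1
fam1 mother 0 0 2 1
fam1 child father mother 1 2
"""

samples = ped_to_samples(PED)
config_text = render_config("cust000", "fam1", samples, "fam1.vcf.gz", build=38)
print(config_text)
```

The printed text can be saved as a `.yaml` file and adjusted by hand; anything the loader complains about is a good pointer to what is still missing.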
Hi again!
I have no experience running web servers on UPPMAX, so have this set up on hardware that I control myself. I have a machine with 16 GB RAM and four (Celeron) cores and a new SSD. Do you think this machine has enough power?
With regard to loading example data, if you have something sitting around on disk from previous classes, we are interested. Our main issue and time sink is generating data that actually looks like it comes from related individuals. Any such data is welcome (we can put our own causative variants in), as long as there are no privacy concerns.
I will take a look at the examples! Many thanks!
Depends a bit on how big your class is... 😉 But sure, development on small laptops works fine, and as long as you don't plan to load thousands of whole genomes I doubt you need to worry much. Having a machine of your own is in many ways ideal. We have a slightly dated AWS image that we probably should keep updated for e.g. teaching purposes; see e.g. https://github.com/Clinical-Genomics/scout/issues/1799.
I'll see what I can do about public test data, but we have talked about that internally many times, so don't expect any miracles! 😊
Thanks again!
Two points:
Backing up a step: when running scout setup database, is it normal and expected that Scout reports that very many genes from various resources do not exist?
I wonder because downstream of this, I have problems running
scout load panel scout/demo/panel_1.txt
It also reports various missing genes.
I started all over, rebuilt the database, and loaded the panel and example case without issues. Maybe it was just a temporary external server hiccup?
An update: after sorting out some issues in #2040, Scout and the database are now up again.
I loaded the test example in two steps (note: with panel-id):
scout load panel --panel-id panel1 scout/demo/panel_1.txt
scout load case scout/demo/643594.config.yaml
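Since loading is meant to be scripted, the two steps above could be wrapped in a small helper like the sketch below. It only uses the exact commands and the `--panel-id` flag shown in this thread; the function names are made up for illustration, and the default is a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess
from typing import List


def build_load_commands(panel_id: str, panel_file: str, case_config: str) -> List[List[str]]:
    """Return the panel and case load commands as argument lists."""
    return [
        ["scout", "load", "panel", "--panel-id", panel_id, panel_file],
        ["scout", "load", "case", case_config],
    ]


def run_commands(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    shown = []
    for argv in commands:
        text = shlex.join(argv)
        print(text)
        shown.append(text)
        if not dry_run:
            # Stop at the first failing step, so a case is never loaded without its panel.
            subprocess.run(argv, check=True)
    return shown


commands = build_load_commands(
    "panel1", "scout/demo/panel_1.txt", "scout/demo/643594.config.yaml"
)
shown = run_commands(commands)  # dry run: nothing is executed
```

Calling `run_commands(commands, dry_run=False)` would actually invoke the CLI, which assumes `scout` is on the PATH and the database is reachable.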
I can now click my way in to the following screen:

If I click "Clinical SNV and INDELs" under Link, I am presented with some interesting genes and stuff, but if I click "643594" under inactive cases, I am not taken to the nice pedigree display that I saw in the demo, but I get an "Internal Server Error", with the following messages at the end of the log:
File "/home/scoutuser/.local/lib/python3.8/site-packages/flask/helpers.py", line 370, in url_for
return appctx.app.handle_url_build_error(error, endpoint, values)
File "/home/scoutuser/.local/lib/python3.8/site-packages/flask/app.py", line 2216, in handle_url_build_error
reraise(exc_type, exc_value, tb)
File "/home/scoutuser/.local/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/home/scoutuser/.local/lib/python3.8/site-packages/flask/helpers.py", line 357, in url_for
rv = url_adapter.build(
File "/home/scoutuser/.local/lib/python3.8/site-packages/werkzeug/routing.py", line 2179, in build
raise BuildError(endpoint, values, method, self)
werkzeug.routing.BuildError: Could not build url for endpoint 'report.report' with values ['level', 'panel_name', 'sample_id']. Did you mean 'cases.hpoterms' instead?
It seems like something is up with reporting and HPO terms.
Is some other resource meant to be loaded before this works?
Am I missing some set of terms (e.g. Phenomizer, which I do not have logins for)?
scout view hpo
reports that:
scout.commands.view.hpo[155551] INFO Found 15530 terms
What if you run the server without gunicorn?
Nah, you can do just fine without Phenomizer, Chanjo etc. as long as they are not enabled in your config. I wonder if we have tested Python 3.8 lately - I think the automatic test matrix is 3.6 and 3.7 only?
It works without gunicorn! I can load the broken link with:
scout serve -h 0.0.0.0 -p XXXXX
If the performance is expected to be acceptable with 15-20 users w/o gunicorn, I could settle with that. However, is there a way to load my key and cert w/o gunicorn, so that I can still use HTTPS (helps a lot for writing the lab instructions if this is consistent)?
> I wonder if we have tested 3.8 lately
I have 3.8 on my local instance and it works
By not enabled, should I comment them out, or keep them as they are in the example?
There are also unconfigured lines for SQLALCHEMY in my config.
Commented out, or removed - if I understood which file you copied, it would still contain lines for e.g. the Phenomizer username and password, as well as Chanjo etc.
The latter SQLALCHEMY lines are for Chanjo, and are likely what is acting up. I would again suggest removing or commenting out anything that you are not sure you have set up already! 😅
(And on our to-do we would add providing a minimal production setup server config!)
That is correct, with it set to this in the config:
SQLALCHEMY_DATABASE_URI = '???'
It crashes. With it commented out, it works :-)
So a minimal config works fine without Phenomizer or SQLALCHEMY. Is there a particular function/experience in Scout that becomes degraded without Phenomizer?
> By not enabled, should I comment them out, or keep them as they are in the example?
I don't get it, what lines should you comment out?
> scout serve -h 0.0.0.0 -p XXXXX
In general I just run scout serve
Our production server uses nginx to proxy the service. I'm copying from our setup manual:
The web service is run on http://localhost:8080. We let _NGINX_ handle the specifics of which domain the web service is available on, as well as encrypting traffic over SSL/HTTPS.
To make life easy for users we proxy each service behind _NGINX_ so we can use pretty URLs such as scout.scilifelab.se.
https:// is managed by _NGINX_ with SSL certificates that SciLifeLab IT provides; requests on port 80 are automatically forwarded to port 443.
clinical-db:/etc/nginx/conf.d/{SUBDOMAIN}.conf
Each subdomain config contains the following (abbreviated) setup:
server {
    server_name {SUBDOMAIN}.scilifelab.se;
    location / {
        proxy_pass http://localhost:{PORT}; # port where the server runs
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
The web service runs on http://localhost:XXXX. _NGINX_ will handle forwarding requests and setting the correct headers to make it all work. Since _NGINX_ also handles the HTTPS part, individual web services don't need to worry about that!
Whenever you update access rules or other settings you need to restart _NGINX_ for the changes to take effect. To restart _NGINX_, run:
sudo /sbin/service nginx restart
> SQLALCHEMY
Leave it out; it's for connecting to another database (Chanjo) that handles the coverage. You don't need it at this stage.
I used the example configuration over at https://github.com/Clinical-Genomics/scout to make my config.
This example contains a few lines that make the instructions inconsistent or confusing for a beginner trying to get going:
MONGO_DBNAME = 'scoutTest' <-- scoutTest is never created when initiating the database. It is called 'scout' in MongoDB.
To use simple login, the whole section with Google OAuth needs to be taken out or commented out.
It is unclear to an outsider whether these '???' specifications are interpreted as invalid and discarded by Scout, or set to real values internally.
# enable Phenomizer gene predictions from phenotype terms
PHENOMIZER_USERNAME = '???'
PHENOMIZER_PASSWORD = '???'
# enable Chanjo coverage integration
SQLALCHEMY_DATABASE_URI = '???'
They seem to be accepted and configure the server to use those strings. Having Phenomizer set to '???' has not triggered a crash for me (yet), but SQLALCHEMY_DATABASE_URI with '???' did.
By removing them or commenting them out, all the things that I have interacted with on the web have worked, so far.
So as @dnil said, a very basic config for gunicorn with these fixed would avoid those pitfalls for a beginner like me.
> By removing them or commenting them out, all the things that I have interacted with on the web have worked, so far.
Ah ok, I see. If you haven't set up the system for advanced settings then just omit them.
Yep. The issue, I have no idea what settings are advanced or not when setting this up for the first time :-D
> Yep. The issue, I have no idea what settings are advanced or not when setting this up for the first time :-D
You haven't set up Google OAuth, so omit it; the same goes for Phenomizer and SQLALCHEMY_DATABASE_URI. Just omit the lot of them.
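For anyone else copying the example config, a quick way to spot leftover '???' placeholders before starting the server could look like the sketch below. It is not part of Scout; the function name is hypothetical and it only assumes the config is a plain Python file with top-level assignments, as in the excerpt quoted earlier in this thread.

```python
import ast
from typing import List

PLACEHOLDER = "???"


def find_placeholders(config_source: str) -> List[str]:
    """Return names of top-level settings still assigned the '???' placeholder."""
    names = []
    for node in ast.parse(config_source).body:
        if (
            isinstance(node, ast.Assign)
            and isinstance(node.value, ast.Constant)
            and node.value.value == PLACEHOLDER
        ):
            names.extend(t.id for t in node.targets if isinstance(t, ast.Name))
    return names


# Settings copied from the excerpt in this thread; the last one is commented out.
example = """
MONGO_DBNAME = 'scout'
# enable Phenomizer gene predictions from phenotype terms
PHENOMIZER_USERNAME = '???'
PHENOMIZER_PASSWORD = '???'
# enable Chanjo coverage integration
# SQLALCHEMY_DATABASE_URI = '???'
"""

leftover = find_placeholders(example)
print(leftover)
```

Commented-out lines are ignored because the file is parsed rather than searched as text, so anything the function still reports is a setting that should be removed, commented out, or given a real value.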
For reference, I added a few lines on this in the docs (#2046). Since it seems you are up and running for the moment, I'll close the thread, but feel free to reopen it or open new ones! It's really useful to have other eyes look at this.