An intensive 4-day course (an optional 5th day is open for any discussion, if there is interest) covering the theory of HybSeq and hands-on analysis of HybSeq data: how to solve common problems and how to evaluate differences among gene trees. An important part of the course is ample time to discuss everything, including practical problems and the projects of individual participants.
Aims
Participants will become familiar with the theory of HybSeq: they will get an introduction to HybSeq and learn about the advantages and disadvantages of target enrichment vs. other genome sub-selection methods such as RADseq. We will discuss the proper selection of genomic targets (the so-called probe design) and the pros and cons of group-specific vs. universal probes. Additional datasets that can be obtained with HybSeq will be highlighted and, finally, a summary of the wet lab workflow of target enrichment will be given.
Participants will understand the principles of computational data analysis, i.e. how and why the raw FASTQ files are prepared (trimming, deduplication, quality checking), processed by HybPiper and aligned, and how gene and species trees are constructed.
Participants will have sufficient knowledge to run all steps on a cluster using the Linux command line and will be able to deal with all common technical issues.
Participants will be able to deal with all major problems of species tree construction, i.e. the main sources of gene tree incongruence and how to deal with them, how to evaluate differences among gene trees, etc.
Teachers
-
-
All theory about HybSeq, target enrichment, etc.: introduction to HybSeq, advantages and disadvantages of target enrichment vs. other genome sub-selection methods, selection of genomic targets (probe design), group-specific vs. universal probes, additional datasets that can be obtained with HybSeq, and the target enrichment wet lab workflow.
-
-
-
All scripting and data processing: trimming, deduplication, running HybPiper, aligning contigs, constructing gene and species trees, etc.
-
-
-
ASTRAL (theory, versions, multilocus bootstrapping/local posterior probability, accuracy, contraction of poorly supported branches, gene tree filtering, branch lengths, ...), supertree methods (MRL, MRP), quartet scoring and locus selection/filtering.
-
Prerequisites for participants
- The participant has a relevant project running or expects to run such a project in the near future.
- The participant is able to work in the Linux command line.
- The participant knows at least the basic usage of the command line (BASH). Participants who are not fluent in BASH scripting are requested to join Vojta's Linux course or something similar.
- The participant is able to connect from her/his notebook to a Linux server via SSH, i.e. she/he has some tool to connect (macOS, Linux installation (e.g. VM), PuTTY, Windows 10 Linux subsystem, …) and the knowledge to use it.
- The participant has access to a computing server, e.g. MetaCentrum, and knows how to use the needed software and launch tasks there.
- The participant should have at least minimal knowledge of R.
- Helpful, but not required, is at least minimal knowledge of Python.
- Participants who work exclusively in Windows must be able to work on a remote Linux server and must be able to install and run all software needed for local tasks. Alternatively, they can use e.g. Linux in a VM or do all the work on the remote server.
- The participant has good knowledge of molecular biology and (plant) evolution.
Required software
- ASTRAL
- BASH 4 or later; and bunzip2
- BBMap
- BLAST
- BWA
- Exonerate
- FastQC
- GNU Parallel
- HybPiper
- IQ-TREE
- MAFFT
- MJPythonNotebooks
- PhyloNet
- phyparts
- Python 2.7 or later; and Biopython 1.59 or later
- QuartetScores
- R and packages ade4, adegenet, ape, corrplot, distory, ggplot2, gplots, heatmap.plus, ips, kdetrees, pegas, phangorn, phytools, reshape2
- SAMtools
- SPAdes
- TreeShrink
- Trimmomatic
Outline
- All BASH scripts used during the whole course will be available in a dedicated Git repository. Participants will be asked to inspect all of them, explain what they do, and discuss alternative ways to do the tasks.
- There will be plenty of room to discuss everything.
- It should be enough to ensure everyone understands really well what is done, how, and why.
- The rest of the time is available to discuss the individual projects and issues of every participant.
- The outline might be subject to change; more people introducing various topics might be involved.
Day 1
- All needed theory: introduction to HybSeq, advantages and disadvantages of target enrichment vs. other genome sub-selection methods, selection of genomic targets (probe design), group-specific vs. universal probes, additional datasets that can be obtained with HybSeq, and the target enrichment wet lab workflow.
- Ensuring everyone can access a computing server and submit the needed tasks there. Explanation of advanced options for working with a task on a remote server.
- Pre-processing of raw FASTQ files: trimming, deduplication, quality checking. Ensuring everyone is able to submit the needed scripts and understands well what they do.
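To make the deduplication step more concrete, here is a minimal Python sketch that drops reads whose sequence is an exact copy of one already seen. It only illustrates the idea and is not the script used in the course, which works with the tools from the required software list. The function names `parse_fastq` and `deduplicate` and the three example reads are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: (header, sequence, quality string)
Record = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from plain 4-line FASTQ text (header, sequence, '+', quality)."""
    stripped = [line.rstrip("\n") for line in lines]
    stripped = [line for line in stripped if line]  # ignore blank lines
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"Sequence/quality length mismatch in {header}")
        yield header, seq, qual


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Keep only the first record for every distinct sequence."""
    seen = set()
    unique = []
    for header, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq, qual))
    return unique


# Hypothetical example data: read2 is an exact duplicate of read1.
EXAMPLE = """@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTACGT
+
HHHHHHHH
@read3
TTGACCAA
+
IIIIIIII
"""

unique_reads = deduplicate(parse_fastq(EXAMPLE.splitlines()))
print([header for header, _, _ in unique_reads])
```

Dedicated deduplication tools typically also handle paired reads and much larger files; this sketch does neither and keeps all sequences in memory.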
Day 2
- Preparing data for HybPiper and running its whole pipeline: blasting, SPAdes assembly, generating alignments against the target sequence, sorting sequences and creating the final list per gene, getting statistics, etc.
- Aligning all obtained sequence lists (exons, introns, as well as supercontigs).
Day 3
- Checking alignments and excluding unusable ones. Constructing gene trees with or without outgroups. Discussing various tree construction methods. Inspecting the produced gene trees.
- Evaluating gene trees with PCoA of their topological differences, kdetrees and TreeShrink.
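As an illustration of what "differences among gene trees" can mean numerically, the following Python sketch computes pairwise Robinson-Foulds (RF) distances between unrooted tree topologies. The choice of RF is an assumption made for this example; the announcement does not state which tree distance the course scripts use. The function names and the three five-taxon Newick trees are hypothetical. A distance matrix of this kind is the sort of input an ordination such as PCoA works on.

```python
from typing import FrozenSet, List, Set


def clades_of(newick: str) -> List[FrozenSet[str]]:
    """Return the set of leaf names below every node of a Newick tree.

    Branch lengths and internal node labels (e.g. support values) are ignored.
    """
    clades: List[FrozenSet[str]] = []
    stack: List[List[FrozenSet[str]]] = [[]]  # children collected per open '('
    label = ""
    skip_label = False  # True right after ')': the text is not a leaf name
    for ch in newick.strip():
        if ch in "(),;":
            name = label.split(":")[0].strip()
            if name and not skip_label:
                leaf = frozenset([name])
                clades.append(leaf)
                stack[-1].append(leaf)
            label = ""
            skip_label = False
            if ch == "(":
                stack.append([])
            elif ch == ")":
                children = stack.pop()
                clade = frozenset().union(*children)
                clades.append(clade)
                stack[-1].append(clade)
                skip_label = True
        else:
            label += ch
    return clades


def leaves_of(newick: str) -> FrozenSet[str]:
    """Return all leaf names of the tree."""
    return frozenset().union(*clades_of(newick))


def splits(newick: str) -> Set[FrozenSet[str]]:
    """Return the non-trivial bipartitions of the tree, treated as unrooted.

    Each bipartition is stored as the side that lacks a fixed reference leaf.
    """
    all_leaves = leaves_of(newick)
    reference = min(all_leaves)
    result: Set[FrozenSet[str]] = set()
    for clade in clades_of(newick):
        side = all_leaves - clade if reference in clade else clade
        if len(side) >= 2 and len(all_leaves) - len(side) >= 2:
            result.add(side)
    return result


def rf_distance(tree_a: str, tree_b: str) -> int:
    """Count bipartitions present in exactly one of the two trees."""
    if leaves_of(tree_a) != leaves_of(tree_b):
        raise ValueError("Both trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical gene trees on the same five taxa.
gene_trees = [
    "((A,B),(C,D),E);",
    "((A,C),(B,D),E);",
    "(((A:0.1,B:0.2)0.9:0.3,C),D,E);",
]
matrix = [[rf_distance(a, b) for b in gene_trees] for a in gene_trees]
for row in matrix:
    print(row)
```

The parser covers only simple Newick strings (no quoted names or comments) and is meant for reading the idea, not for production use.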
Day 4
- Constructing species trees using ASTRAL.
- Visualizing incongruences between gene trees and the species tree using phyparts.
- Constructing phylogenetic networks using PhyloNet.
- Locus selection/filtering.
- Supertree methods (MRP, MRL).
Day 5
- Not compulsory; open for any discussion if there is interest.
Participants should familiarize themselves with relevant literature
- Weitemier et al. (2014), Appl. Plant Sci.; introduction to HybSeq, https://doi.org/10.3732/apps.1400042
- Harvey et al. (2016), Syst. Biol.; target enrichment vs. RADseq, https://doi.org/10.1093/sysbio/syw036
- Chau et al. (2017), Appl. Plant Sci.; group-specific vs. universal probes, https://doi.org/10.1002/aps3.1032
- Gnirke et al. (2009), Nat. Biotechnol.; target enrichment wet lab procedure, https://doi.org/10.1038/nbt.1523
Date and place
January 20–23, 2020, Department of Botany, Faculty of Science, Charles University, Prague, Benátská 433/2, lecture hall "Seminárium Katedry botaniky (BB)", 2nd floor. The course will start daily at 9:00 AM and end at about 4–5 PM.
Working environment
- Most of the work will be performed using the Linux command line on a remote computing server. Participants should ensure they have access to MetaCentrum or another computing cluster. All tasks could be done locally, but most notebooks will probably be too slow for this.
- Participants will get training data, including all intermediate steps.
- A reduced data set of South African Oxalis will be used.
- Participants can work in any operating system with any software tools. Tasks done locally do not require much computing power, but all participants must know their computers well and must be able to install all needed (sometimes tricky) software.
Application
The number of participants is limited to 15. If you wish to join the course, please fill in the form at https://docs.google.com/forms/d/e/1FAIpQLSfv85vOWGJfgo304C5qTuM12QK5HQv7irLssNneDbZbc-mCGQ/viewform and we'll contact you soon.
Q & A
All questions should be addressed to Vojtěch Zeisek and Roswitha Schmickl, i.e. to the e-mails zeisek (at) natur (dot) cuni (dot) cz and roswitha (dot) schmickl (at) natur (dot) cuni (dot) cz.