Structure, R and Linux | Vojtěch Zeisek

Od vojta , 8 Říjen, 2014

Structure is very popular software for revealing population structure. Here I describe easy workflow how to run independent runs of Structure in parallel to speed up whole process using statistical software R and how to facilitate use of CLUMPP and distruct with easy BASH scripts. Of course, we work on Linux (although another UNIX systems like Mac OS X will be very similar).

Please, note, this is not step by step how to you could immediately repeat without any change. You will need to modify parameters etc. It is more inspiration. And it is not the only way how to do the job.

Let's have directory structure within working directory as follows:


structure_calculations
├── 0_results
├── 1_structure_sum
├── 2_clumpp
└── 3_distruct

How to run Structure in parallel

With modern computers having multiple cores we need to start to think about use of those resources efectively. If our processor has 8 threads/cores, we can launch up to 8 instances of Structure in parallel and to speed up whole process. I took inspiration from The Molecular Ecologist and used ParallelStructure R library (described in PLOS).

Of course, we need to prepare valid input file according to Structure manual. Then we need to prepare joblist - a file describing unique name for each run, populations involved for that particular run, K, burn-in and number of steps. See documentation of ParallelStructure for more details.

It looks like this:

run1 1,2,3,4,5,6,7,8,9,10 2 100000 1000000 run2 1,2,3,4,5,6,7,8,9,10 2 100000 1000000 run3 1,2,3,4,5,6,7,8,9,10 2 100000 1000000 run4 1,2,3,4,5,6,7,8,9,10 2 100000 1000000 run5 1,2,3,4,5,6,7,8,9,10 2 100000 1000000 ...

Now we can launch R (the authors recommend plain R console without any GUI):

# Install ParallelStructure
install.packages("ParallelStructure", repos="http://R-Forge.R-project.org")
# Set working directory
setwd("/path/to/directory/structure_calculations/")
# Verify the path
getwd()
# List directory content
dir()
# Load ParallelStructure package
library(ParallelStructure)
# See manual for the R package (and read Structure documentation)
?parallel_structure
# Run Structure
parallel_structure(joblist="joblist.txt", n_cpu=6, structure_path="/path/to/directory/with/structure/binary/", infile="data.in", outpath="0_results/", numinds=200, numloci=10, plot_output=1, label=1, popdata=1, popflag=1, phenotypes=0, markernames=1, mapdist=0, onerowperind=0, phaseinfo=0, extracol=0, missing=-9, ploidy=2, usepopinfo=0, revert_convert=1, printqhat=1, locdata=0, recessivealleles=0, phased=0, noadmix=0, linkage=0, locprior=0, inferalpha=1)

Note, that parameters for parallel_structure() function are practically same as those one would normaly use in mainparams and extraparams files (with some limitations). Read documentation of the function. In directory 0_results (set by outpath) we find results as from any other Structure run, and if we used plot_output=1, also plots, although not very beautiful.

Select the best K

To postprocess results and choose the best K we can use Structure_sum R script written by Dorothee Ehrich. See documentation of the script for all available methods and interpretations. We need to prepare TXT file with list of Ks and output files. It looks like this:

2 results_job_run1_f 2 results_job_run2_f 2 results_job_run3_f 2 results_job_run4_f 2 results_job_run5_f ...

Now we can continue in R:

# Load the scrips
source("/path/to/the/script/structure-sum-2009.r")
# Set working directory
setwd("/path/to/directory/1_structure_sum/")
# Verify the path
getwd()
# List directory content
dir()
# Launch the commands of the script to find the best K (see its documentation for details)
Structure.table("list_k.txt", 10)
Structure.simil("list_k.txt", 10)
Structure.deltaK("list_k.txt", 10)
graphics.off()
Structure.cluster("list_k.txt", 10)

Using CLUMPP

CLUMPP is used to align the results to prepare them for making figures with distruct. Prepare indfiles, popfiles and paramfiles as described in the documentation. I recommend to use PRINT_PERMUTED_DATA 2 to get separate file for every run. Copy binary of CLUMPP (named clumpp in this case) to the directory 2_clumpp. Following script will just launch needed instances of CLUPP for every choosen K and copy results to distruct folder for later use.

The paramfile can look like this:


# This is the file that sets the parameters for the program CLUMPP, version 1.1

# Everything after "#" will be ignored by the program. Parameters are:
# K, C, R, M, W, GREEDY_OPTION, REPEATS, PERMUTATIONFILE, PRINT_PERMUTED_DATA,
# PERMUTED_DATAFILE, PRINT_EVERY_PERM and EVERY_PERMFILE. 
# All parameter names shall be followed by at least one blank space and then
# the parameter-value.

# --------------- Main parameters ---------------------------------------------

DATATYPE 0						# The type of data to be read in.
							# 0 = individual data in the file
							# specified by INDFILE, 1 = population
							# data in the file specified by
							# POPFILE.

INDFILE k2.indfile   			# The name of the individual datafile.
							# Required if DATATYPE = 0.

POPFILE k2.popfile				# The name of the population datafile.
							# Required if DATATYPE = 1.

OUTFILE k2.outfile     			# The average cluster membership 
							# coefficients across the permuted runs
							# are printed here.

MISCFILE k2.miscfile			# The parameters used and a summary of
							# the results are printed here. 

K 2      						# Number of clusters.

C 200				      		# Number of individuals or populations.

R 10			            		# Number of runs.

M 1							# Method to be used (1 = FullSearch,
							# 2 = Greedy, 3 = LargeKGreedy).

W 1							# Weight by the number of individuals
							# in each population as specified in 
							# the datafile (1 if yes, 0 if no).

S 2                           		# Pairwise matrix similarity statistic 
                              		# to be used. 1 = G, 2 = G'. 

# - Additional options for the Greedy and LargeKGreedy algorithm (M = 2 or 3) -

GREEDY_OPTION 2					# 1 = All possible input orders,
							# 2 = random input orders,
							# 3 = pre-specified input orders.

REPEATS 1000000					# If GREEDY_OPTION = 2, then REPEATS
							# determines the number of random input
							# orders to be tested. If GREEDY_OPTION
							# = 3, then REPEATS is the number of 
							# input orders in PERMUTATIONFILE.  

PERMUTATIONFILE k2.permutationfile 	# The permutations of the runs in 
							# PERMUTATIONFILE will be used, if 
							# GREEDY_OPTION = 3.

# --------------- Optional outputs --------------------------------------------

PRINT_PERMUTED_DATA 2				# Print the permuted data (clusters) in
							# INDFILE or POPFILE to 
							# PERMUTED_DATAFILE (0 = don't print,
							# 1 = print into one file, 2 = print 
							# into separate files for each run).

PERMUTED_DATAFILE k2.perm_datafile	# The permuted data (clusters) will be 
							# printed to this file (if 
							# PRINT_PERMUTED_DATA = 2, several 
							# files with the extensions "_1" to 
							# "_R" will be created).

PRINT_EVERY_PERM 0				# Print every tested permutation of the
							# runs and the corresponding value of 
							# SSC to a file specified by 
							# EVERY_PERMFILE (0 = don't print, 
							# 1 = print). 
							# Note that printing may result in a 
							# very large file.

EVERY_PERMFILE k2.every_permfile	# Every tested permutation of the runs
							# and the corresponding SSC will be
							# printed here. 

PRINT_RANDOM_INPUTORDER 0			# Print random input orders of runs to
                                    	# RANDOM_INPUTORDER (0 = don't print,
                                       	# 1 = print). This option is only 
                                        	# available if GREEDY_OPTION = 2.

RANDOM_INPUTORDERFILE k2.random_inputorderfile # Every random input order 
                                        	# of the runs (generated by CLUMPP if 
                                        	# GREEDY_OPTION = 2) will be printed
                                        	# here. 

# --------------- Advanced options --------------------------------------------

OVERRIDE_WARNINGS 0				# This option allows the user to 
							# override non-crucial warnings from 
							# the program (0 allow warnings, 1 do 
							# not issue non-crucial warnings).

ORDER_BY_RUN 1					# Permute the clusters of the output 
							# files by the specified run. (0 to 
							# not specify a run, 1 to R specifies
							# a run in the INDFILE or POPFILE).   

# --------------- Additional comments -----------------------------------------

# The term ''permutation'' is used in two different contexts, permutations of 
# membership coefficients, or clusters, and permutations of runs.

# For example, if the datafile has has data A B C D E (each letter indicates a 
# column corresponding to a cluster), then permutation 3 2 5 1 4 of the
# clusters means C B E A D.

# Permutation 4 1 2 3 of runs 1-4 would mean start with run 4, then run 1, then
# run 2, and then run 3.

# --------------- Command line arguments --------------------------------------

# -i INDFILE
# -p POPFILE
# -o OUTFILE
# -j MISCFILE
# -k K
# -c C
# -r R
# -m M
# -w W
# -s S
# -----------------------------------------------------------------------------

The BASH script can look as follows. Save it with clumpp.sh, add in command line execution right by chmod +x clumpp.sh and launch from the directory 2_clumpp by ./clumpp.sh:

#!/bin/bash
# Selected K = 2, 3, 4
# Go to the directory
cd /path/to/directory/structure_calculations/2_clumpp/
# Just launch the CLUMPP as many times as needed and record the log files
./clumpp k2.paramfile | tee clumpp2.log &
./clumpp k3.paramfile | tee clumpp3.log &
./clumpp k4.paramfile | tee clumpp4.log
# Copy output files to distruct directory
for K in 12 13 14
do
  # In the expresion "{1..10}" use second number according to number of runs (10 in this case)
  for run in {1..10}
  do
      cp t$K.perm_datafile_$run ../3_distruct/t_$K.$run.indivq
  done
done

Draw the figure with distruct

Now we can go to the last directory 3_distruct. We already have here indivq files. We must manually prepare popq files for each K, and files with drawparams, names (poplation names) and perm (colors). Save distruct binary to the file. The input file can look like this:


PARAMETERS FOR THE PROGRAM distruct.  YOU WILL NEED TO SET THESE
IN ORDER TO RUN THE PROGRAM.  

"(int)" means that this takes an integer value.
"(B)"   means that this variable is Boolean 
        (1 for True, and 0 for False)
"(str)" means that this is a string (but not enclosed in quotes) 
"(d)"   means that this is a double (a real number).

Data settings

#define INFILE_POPQ        k#####.popq      // (str) input file of population q's
#define INFILE_INDIVQ      k#####.indivq    // (str) input file of individual q's
#define INFILE_LABEL_BELOW k.names2     // (str) input file of labels for below figure
#define INFILE_LABEL_ATOP  k.names1     // (str) input file of labels for atop figure
#define INFILE_CLUST_PERM  k.perm     // (str) input file of permutation of clusters to print 
#define OUTFILE            k#####.ps       //(str) name of output file

#define K	#####    // (int) number of clusters	
#define NUMPOPS 13    // (int) number of pre-defined populations
#define NUMINDS 200  // (int) number of individuals

Main usage options

#define PRINT_INDIVS      1  // (B) 1 if indiv q's are to be printed, 0 if only population q's
#define PRINT_LABEL_ATOP  1  // (B) print labels above figure
#define PRINT_LABEL_BELOW 1  // (B) print labels below figure
#define PRINT_SEP         1  // (B) print lines to separate populations

Figure appearance

#define FONTHEIGHT 10	// (d) size of font
#define DIST_ABOVE 5	// (d) distance above plot to place text
#define DIST_BELOW -13	// (d) distance below plot to place text
#define BOXHEIGHT  36	// (d) height of the figure
#define INDIVWIDTH 4	// (d) width of an individual

Extra options

#define ORIENTATION 0	     // (int) 0 for horizontal orientation (default)
			     //       1 for vertical orientation
			     //	      2 for reverse horizontal orientation
                             //       3 for reverse vertical orientation
#define XORIGIN 40		// (d) lower-left x-coordinate of figure
#define YORIGIN 400		// (d) lower-left y-coordinate of figure
#define XSCALE 1		// (d) scale for x direction
#define YSCALE 1		// (d) scale for y direction
#define ANGLE_LABEL_ATOP 45	// (d) angle for labels atop figure (in [0,180])
#define ANGLE_LABEL_BELOW 0    // (d) angle for labels below figure (in [0,180])
#define LINEWIDTH_RIM  3	// (d) width of "pen" for rim of box
#define LINEWIDTH_SEP 0.3	// (d) width of "pen" for separators between pops and for tics
#define LINEWIDTH_IND 0.3	// (d) width of "pen" used for individuals 
#define GRAYSCALE 0	        // (B) use grayscale instead of colors
#define ECHO_DATA 1             // (B) print some of the data to the screen
#define REPRINT_DATA 1          // (B) print the data as a comment in the ps file
#define PRINT_INFILE_NAME 1     // (B) print the name of INFILE_POPQ above the figure 
                                //     this option is meant for use only with ORIENTATION=0 
#define PRINT_COLOR_BREWER 1    // (B) print ColorBrewer settings in the output file 
                                //     this option adds 1689 lines and 104656 bytes to the output
                                //     and is required if using ColorBrewer colors

Command line options:

-d drawparams
-K K
-M NUMPOPS
-N NUMINDS
-p input file (population q's)
-i input file (individual q's)
-a input file (labels atop figure)
-b input file (labels below figure)
-c input file (cluster permutation)
-o output file

Values market by ##### will be overwritten by command line arguments in following script, which will draw PS figures and convert them to PNG. Of course, you can adjust the values. Save the following script as distruct.sh, add in command line execution right (chmod +x distruct.sh) and launch it (./distruct.sh).

#! /bin/bash
# drawparams, names and perm are still sam. indivq, popq and K do change.
cd /path/to/directory/structure_calculations/3_distruct/
# Selected K = 2, 3, 4
for K in 2 3 4
do
  for run in {1..10}
  do
    # In the expresion "{1..10}" use second number according to number of runs (10 in this case)
    ./distruct -d k.drawparams -K $K -p k_$K.$run.popq -i k_$K.$run.indivq -o k_$K.$run.ps
  done
done
# Convert PS to PNG
for psout in `ls -1 *.ps`; do
  convert -verbose -density 200 $psout -flatten ${psout%.*}.png;
done

We are done. Now, we have nice figures, which we can edit, for example, in GIMP.

Running Structure and CLUMPP on MetaCentrum NGI

Structure, R and CLUMPP are available also on MetaCentrum NGI, Czech National Grid Infrastructure by CESNET. I will not comment all the settings and workflows for work with MetaCentrum, see the wiki.

In your home directory you need script structure.sh (see further) and directory structure with data file, joblist, R script structure.r and structure binary. Script named structure.sh used to launch the structure task:

#!/bin/bash
# Set data directories
WORKDIR="structure"
DATADIR="/storage/praha1/home/$LOGNAME"
# Clean-up of SCRATCH
trap 'clean_scratch' TERM EXIT
trap 'cp -ar $SCRATCHDIR $DATADIR/ && clean_scratch' TERM
# Prepare the task
cp -ar $DATADIR/$WORKDIR/* $SCRATCHDIR/  || exit 1
# Change working directory
cd $SCRATCHDIR/ || exit 2
# Launch it
. /packages/run/modules-2.0/init/sh
module add R-3.1.0
R CMD BATCH structure.r
# Copy results back to home directory
cp -ar $SCRATCHDIR $DATADIR/ || export CLEAN_SCRATCH=false

The R script structure.r to launch perform the computations will be slightly modified:

dir.create("/storage/brno2/home/LOGIN/Rpackages/", showWarnings=FALSE)
install.packages("ParallelStructure", repos="http://R-Forge.R-project.org", lib="/storage/brno2/home/LOGIN/Rpackages/")
library(ParallelStructure, lib.loc="/storage/brno2/home/LOGIN/Rpackages/")
parallel_structure(joblist="joblist.txt", n_cpu=6, structure_path=".", infile="data.in", outpath="results/", numinds=200, numloci=10, plot_output=1, label=1, popdata=1, popflag=1, phenotypes=0, markernames=1, mapdist=0, onerowperind=0, phaseinfo=0, extracol=0, missing=-9, ploidy=2, usepopinfo=0, revert_convert=1, printqhat=1, locdata=0, recessivealleles=0, phased=0, noadmix=0, linkage=0, locprior=0, inferalpha=1)

Now, you can launch the computations by qsub command, for example qsub -l walltime=3d -l nodes=1:ppn=6 -l mem=12gb -l scratch=1gb -m abe structure.sh Modify it according your needs. This is not the only way how to launch Structure on MetaCentrum, see the wiki.

Later, when we select our K, we can launch CLUMPP in similar way. Directory clumpp will contain all files like in previous case, excluding CLUMPP binary. Script clump.sh to launch calculations can look like this (modify according your needs):

#!/bin/bash
# Set data directories
WORKDIR="clumpp"
DATADIR="/storage/praha1/home/$LOGNAME"
# nastaveni uklidu SCRATCHE pri chybe nebo ukonceni# (pokud nerekneme jinak, uklidime po sobe)
trap 'clean_scratch' TERM EXIT
trap 'cp -ar $SCRATCHDIR $DATADIR/ && clean_scratch' TERM
# Příprava vstupních dat úlohy: pokud se nepodaří, bude úloha ukončena s nenulovým návratovým kódem:
cp -ar $DATADIR/$WORKDIR/* $SCRATCHDIR/  || exit 1
# Změna pracovního adresáře: pokud se nepodaří, bude úloha ukončena s nenulovým návratovým kódem:
cd $SCRATCHDIR/ || exit 2
# spuštění úlohy
. /packages/run/modules-2.0/init/sh
module add clumpp
CLUMPP clumppinput.paramfile | tee clumpp.log
# Zkopírování výsledků
cp -ar $SCRATCHDIR $DATADIR/$WORKDIR || export CLEAN_SCRATCH=false

And launch it in similar way asi in previous case: qsub -l walltime=3d -l nodes=1:ppn=1 -l mem=1gb -l scratch=1gb -m abe clumpp.sh Modify it according to your needs.

Please, consult on-line manual how to use MetaCentrum and some textbook of Linux command line for details of all scripts and commands used in this section.

Články

Linux (a jiný software)

Populační genetika

Věda, výzkum, biologie