# Read me

**General Information**   [_printer friendly version_](readme.pdf)

PCAtag uses fastPHASE and the principal component analysis (PCA) methods in R to analyze genomic data.

### Requirements

* **Java 1.8** - Java 1.8 or a later version must be installed on your machine.
* **PCAtag software bundle** - [Download](download.html) the PCAtag software bundle. It includes PCAtag.jar, fastPHASE and R, which are required to run the PCAtag software.
* **Input data file** - The input data file contains one header line with the id, the marker name(s) and an optional subset column. The remaining lines hold the genotype data: a unique individual id, the marker data and an optional subset name. Marker data should be entered as two separate columns of numbers per marker, separated by a space or tab. See [32snp.txt](32snp.txt) for details.
* **Additional Requirement** - [R](http://www.r-project.org/). Linux users must download and install R separately.

### Download and Installation

Download and install either the Linux or the PC version of the PCAtag software. All download archives have been compressed with the zip command. After downloading the required file, decompress it to restore its original form before running anything. To decompress a .zip file, type _unzip filename_.

* **Linux** users should download the PCAtag\_Linux.zip file. This archive includes PCAtag.jar, fastPHASE (used for haplotype construction) and all example files. However, Linux users must download and install R separately.
* **PC** users should download the PCAtag\_PC.zip file. This archive includes PCAtag.jar, fastPHASE, R and all example files.

### Execution

* **Starting the GUI**
  * PC - Double-click the PCAtag.jar file.
  * Linux - From the PCAtag directory, type _"java -jar PCAtag.jar"_.

The display panel is divided into 2 sections. The top section is for entering optional variables. The bottom section displays output messages.

**Running Analysis**

1. 
   Enter an input data file name in the **_Input File Name_** text field, or use the **_Browse_** button to select a file.

2. Analysis Option
   1. **_Haplotype_** - The default option. It first phases the data and then performs the PCA analyses on the phased data. Phasing imputes missing data, so missing data is not an issue.
   2. **_Genotype_** - This option omits the phasing and performs the PCA analysis directly. It should only be considered if the missing data rate is very low.

3. **_Subset Analysis_** - Use this option to run the analyses separately for each subset. It requires the input data file to be set up with a subset column.

4. PCA Parameters
   1. **_EigenValue Threshold_**, default is 0.7. The eigenvalue threshold used to extract factors in the PCA analysis. Factors with lower eigenvalues are selected only if the percentage of variance explained has not yet reached the threshold set for the variance explained, in which case more factors are added until that threshold is reached. The value should be between 0.0 and 1.0.
   2. **_Variance Explained %_**, default value is 90%. An additional threshold used to extract additional factors in the PCA analysis (see the eigenvalue threshold above for more information).

5. Factor Loading
   1. **_Retain_**, default value is 0.4. The factor loading threshold for group membership. By default, a SNP is only considered to belong to a factor if its factor loading is >= 0.4 or <= -0.4. This value for a factor loading is a standard in the field (Stevens JP (1992) Applied Multivariate Statistics for the Social Sciences, 2nd Edition, Hillsdale, NJ: Erlbaum).
   2. **_Suppress Print Below_**, default value is 0.2. Suppression for printing: factor loadings are not printed in the output if they are <= 0.2 and >= -0.2.

6. Analysis Method
   1. **_two step_** - Run the 2-step PCA method.
   2. **_multi step_** - Run the multi-step PCA method. This is the default option.
   3. **_two step & multi step_** - Run both the 2-step and the multi-step PCA methods.

7. 
   Click on the **_Uploading file_** button to verify the input data file format and the subset option.

8. Output File
   1. Enter the output file name in the **_Output File Name_** text field, or use the **_Browse_** button either to select and overwrite an existing file, or to select a directory and enter the file name in the **_Selection_** text field of the Open File pop-up window.
   2. Print Option
      1. **_Long Output_** - The long format prints the eigenvalues and cumulative variance of the initial PCA analysis used to determine the main factors. It also prints all the details of the sub-factors, all the factor loadings, the final factors and the tagging SNPs selected.
      2. **_short Output_** - The short format prints the eigenvalues and cumulative variance of the initial PCA analysis used to determine the main factors, followed by the final factors and the tagging SNPs selected.
      3. **_Summary_** - A simple summary file that contains the final set of tagging SNPs.

* **Command Prompt:**

  From the PCAtag directory type

  **_java -jar PCAtag.jar_** \-i inputfilename -o outputfilename \[OPTION\]...

* **Options Description:**
  * **_\-i_** Name of the input file from which to read the genotype data.
  * **_\-o_** Name of the file to which output is directed.
  * **_\-g_** for genotype, default is haplotype. The haplotype option first phases the data and then performs the PCA analyses on the phased data. The genotype option omits the phasing and performs the PCA analysis directly. The haplotype option imputes missing data, so missing data is not an issue. The genotype option should only be considered if the missing data rate is very low.
  * **_\-s_** Use this option to execute analyses separately for each subset. Default is no subset analysis.
  * **_\-e_** <\[0.0..1.0\]>, default is 0.7. Eigenvalue threshold used to extract factors in the PCA analysis.
    Factors with lower eigenvalues are selected only if the percentage of variance explained has not yet reached the threshold set for the variance explained (-v), in which case more factors are added until that threshold is reached.
  * **_\-v_** <\[0..100%\]>, default value is 90%. Variance explained, an additional threshold used to extract additional factors in the PCA analysis (see the -e option above).
  * **_\-f_** <\[0.0-1.0\]>, default value is 0.4. Factor loading threshold for group membership. By default, a SNP is only considered to belong to a factor if its factor loading is >= 0.4 or <= -0.4. This value for a factor loading is a standard in the field (Stevens JP (1992) Applied Multivariate Statistics for the Social Sciences, 2nd Edition, Hillsdale, NJ: Erlbaum).
  * **_\-t_** <\[0.0-1.0\]>, default value is 0.2. Suppression for printing: factor loadings are not printed in the output if they are <= 0.2 and >= -0.2.
  * **_\-p_** Output print mode, default is "short". Results can be printed in long or short format. The short format prints the eigenvalues and cumulative variance of the initial PCA analysis used to determine the main factors, followed by the final factors and the tagging SNPs selected. The long format additionally contains all the details of the sub-factors and all the factor loadings. Only one of these formats can be selected.
  * **_\-summary_** - A simple summary file containing the names of the final set of tagging SNPs.
  * **_\-2_** for two-step PCA analysis, default is multi-step. Select this option to run the 2-step PCA method.
  * **_\-m_** for multi-step PCA analysis. Select this option to run the multi-step PCA method. This is the default.
  * **_\-b_** for both two-step and multi-step PCA analysis. Select this option to run both the 2-step and the multi-step PCA methods.

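
The interaction between the eigenvalue threshold (-e), the variance explained threshold (-v) and the factor loading threshold (-f) described above can be sketched in a few lines of Python. This is only an illustration of the rules as worded in this README, not PCAtag's actual code. The function names are hypothetical, the eigenvalues in the usage lines are made up, and the sketch assumes that the variance explained by a factor is its eigenvalue divided by the sum of all eigenvalues and that the thresholds are inclusive.

```python
def select_factors(eigenvalues, eigen_threshold=0.7, variance_pct=90.0):
    """Count the leading factors to keep and report their cumulative variance (%).

    Hypothetical helper. Assumption: variance explained by a factor is
    its eigenvalue divided by the sum of all eigenvalues.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    kept = 0
    cumulative = 0.0
    for value in ordered:
        passes_eigen = value >= eigen_threshold      # -e criterion
        variance_short = cumulative < variance_pct   # -v target not yet reached
        if not (passes_eigen or variance_short):
            break
        kept += 1
        cumulative += 100.0 * value / total
    return kept, cumulative


def factor_members(loadings, retain=0.4):
    """Return the SNPs whose loading is >= retain or <= -retain (the -f rule)."""
    return sorted(snp for snp, loading in loadings.items() if abs(loading) >= retain)


if __name__ == "__main__":
    # Made-up eigenvalues, chosen only to show the -e / -v interaction.
    print(select_factors([3.0, 1.0, 0.5, 0.5]))
    # Factor 1 loadings copied from the long example output in this README.
    factor_1 = {"M1": 0.803, "M2": 0.511, "M3": 0.821, "M4": 0.577, "M5": 0.821}
    print(factor_members(factor_1))
```

With the made-up eigenvalues, the first two factors pass the eigenvalue threshold but explain only 80% of the variance, so a third factor is added to reach 90%. All five example SNPs have loadings above 0.4 on factor 1, so all five are reported as members.
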
* **Example of command line syntax:**

  _java -jar PCAtag.jar -i "C:\\PCAtag\\Input.txt" -o "C:\\PCAtag\\Output.txt" -b -g_

### Example Input

* **Input Data File** - only the top 17 lines of the data file are shown

      ID M1 M2 M3 M4 M5
      85 2 1 2 1 2 1 1 2 1 2
      86 2 1 2 1 2 1 1 2 1 2
      87 1 1 1 1 1 1 1 1 1 1
      88 1 1 1 1 1 1 1 1 1 1
      89 2 1 2 1 2 2 2 2 2 2
      90 2 1 2 1 2 1 1 2 1 2
      91 2 1 2 1 2 1 1 2 1 2
      92 1 1 1 1 1 1 1 1 1 1
      93 1 1 1 1 1 1 1 1 1 1
      94 2 1 2 1 2 1 1 2 0 0
      95 2 1 2 1 2 1 1 2 1 2
      96 2 1 0 0 2 1 1 2 1 2
      97 2 1 2 1 2 1 1 2 1 2
      98 2 2 2 2 2 2 2 2 2 2
      99 2 2 2 2 2 2 2 2 2 2
      100 2 2 0 0 2 2 2 2 2 2

### Example Output

* **Short Output File** - with 2-step PCA and genotype analysis

      *********** PCAtag Report **********
      Created : Wed May 26 16:12:11 MDT 2010
      Input Data File : examplefiles/example.txt
      Output Format : short
      Options Summary :
      - Two-Step
      - Genotype
      - All
      - Variance Explained = 90.0
      - Eigenvalue Threshold = 0.7
      - Factor Loading Retain = 0.4
      - Factor Loading Suppress = 0.2
      Subset Overall:
      Number of Valid Markers 5
      Summary of PCA factors from the First-Level analysis:
      LD Group   Eigenvalue   Var(%)   CumVar(%)
      1          4.732097     94.64    94.64
      ===============================================
      2-Step Final Summary :
      LD Group   SNPs   tSNPs
      1   M5 M3 M1 M4 M2
      2   M4 M2 M1 M3 M5
      -------------------------------------------------
      Final Tagging SNPs   LD Group
      M2                   2
      M3                   1
      Total # Tagging SNPs : 2
      Elapse Time : 0 h : 0 m : 0 s

* **Long Output** - with multi-step PCA and haplotype analysis

      *********** PCAtag Report **********
      Created : Wed May 26 16:13:06 MDT 2010
      Input Data File : examplefiles/example.txt
      Output Format : long
      Options Summary :
      - Multi-Step
      - Haplotype
      - All
      - Variance Explained = 90.0
      - Eigenvalue Threshold = 0.7
      - Factor Loading Retain = 0.4
      - Factor Loading Suppress = 0.2
      Subset Overall:
      Number of Valid Markers 5
      Summary of PCA factors from the First-Level analysis:
      LD Group   Eigenvalue   Var(%)   CumVar(%)
      1          3.818998     95.47
                                       95.47

      PCA: primary analysis results:
      SNP                 Factor 1   Factor 2
      M1                  0.8030     0.5810
      M2                  0.5110     0.8520
      M3                  0.8210     0.5680
      M4                  0.5770     0.8050
      M5                  0.8210     0.5680
      Proportion Var(%)   51.80      47.10
      Cumulative Var(%)   51.80      98.90
      =================================================
      M-Step Final Summary :
      Terminal Factor   Membership   Loading
      1                 M3*          0.821
                        M5           0.821
                        M1           0.803
                        M4           0.577
                        M2           0.511
      Terminal Factor   Membership   Loading
      2                 M2*          0.852
                        M4           0.805
                        M1           0.581
                        M3           0.568
                        M5           0.568
      -------------------------------------------------
      Final Tagging SNPs   Terminal Factor
      M2                   2
      M3                   1
      Total # Tagging SNPs : 2
      Elapse Time : 0 h : 0 m : 6 s

* **Summary Output File**

      M2
      M3
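
The input layout described under Requirements (one header line, then an id followed by two allele columns per marker and an optional subset name) can be checked before a file is loaded into PCAtag. The following is a minimal Python sketch, not part of PCAtag. The function names are hypothetical, and treating `0` as the missing-allele code is an assumption based on the `0 0` entries in the example input, since this README does not state the missing-data code.

```python
def parse_genotypes(text, has_subset=False):
    """Parse whitespace-separated genotype text into (markers, records).

    Hypothetical helper. Each record is (individual_id, {marker: (a1, a2)}, subset).
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    # Header: id column, marker names, then an optional subset column.
    markers = header[1:-1] if has_subset else header[1:]
    expected = 1 + 2 * len(markers) + (1 if has_subset else 0)
    records, seen_ids = [], set()
    for line_no, line in enumerate(lines[1:], start=2):
        fields = line.split()  # accepts spaces or tabs
        if len(fields) != expected:
            raise ValueError(
                f"line {line_no}: expected {expected} fields, found {len(fields)}"
            )
        if fields[0] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate individual id {fields[0]}")
        seen_ids.add(fields[0])
        alleles = fields[1:1 + 2 * len(markers)]
        genotypes = {
            marker: (alleles[2 * i], alleles[2 * i + 1])
            for i, marker in enumerate(markers)
        }
        subset = fields[-1] if has_subset else None
        records.append((fields[0], genotypes, subset))
    return markers, records


def missing_rate(records, missing_code="0"):
    """Fraction of genotypes with a missing allele (assumes "0" means missing)."""
    total = missing = 0
    for _individual, genotypes, _subset in records:
        for pair in genotypes.values():
            total += 1
            if missing_code in pair:
                missing += 1
    return missing / total if total else 0.0


# Two rows taken from the example input above.
SAMPLE = "ID M1 M2 M3 M4 M5\n94 2 1 2 1 2 1 1 2 0 0\n95 2 1 2 1 2 1 1 2 1 2\n"

if __name__ == "__main__":
    markers, records = parse_genotypes(SAMPLE)
    print(markers, missing_rate(records))
```

Under the zero-means-missing assumption, a rate like this may help when deciding whether the missing data rate is low enough to consider the Genotype option.
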