Analysis
First, we clean up and format the SNP calls in order to run the ANOVAs.
We remove any ambiguous bases or bases that are segregating within a given line from the analysis for each individual SNP. We also only analyze SNPs with two alleles. We require that the minor allele be in a minimum of 4 lines and that the mean sequencing coverage (across lines) be between 2x and 30x (to avoid mis-called bases and mis-aligned repeats). Finally, we convert the base calls to a 0 for the major allele and a 2 for the minor allele.
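The filtering and recoding rules above can be sketched for a single SNP as follows. This is a minimal illustration, not the pipeline's own code: the function name `encode_snp`, the input layout (one base call and one coverage value per line), the treatment of every non-A/C/G/T call as uncallable, and the inclusive coverage bounds are all assumptions.

```python
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}  # assumption: anything else is ambiguous/segregating/missing


def encode_snp(calls, coverages, min_minor_lines=4, min_cov=2.0, max_cov=30.0):
    """Return per-line codes (0 = major, 2 = minor, None = uncallable),
    or None if the SNP fails any filter described in the text."""
    # Drop ambiguous or segregating calls for this SNP only.
    clean = [c if c in VALID_BASES else None for c in calls]
    counts = Counter(c for c in clean if c is not None)

    # Only SNPs with exactly two alleles are analyzed.
    if len(counts) != 2:
        return None

    # most_common() orders by count; ties keep first-seen order.
    (major, _), (_, minor_n) = counts.most_common(2)

    # The minor allele must be present in at least 4 lines.
    if minor_n < min_minor_lines:
        return None

    # Mean coverage across lines must lie between 2x and 30x.
    mean_cov = sum(coverages) / len(coverages)
    if not (min_cov <= mean_cov <= max_cov):
        return None

    return [None if c is None else (0 if c == major else 2) for c in clean]
```

A SNP that returns None here would be skipped entirely, while a None entry inside the returned list marks a single line that could not be called.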
Second, we run the ANOVAs for almost 2.5 million SNPs (fewer if fewer than 168 lines are submitted).
If data are given for both sexes, we run linear models for each SNP in two forms. For pooled sexes we use the model phenotype = mean + M + S + MxS + L(M) + E, where M is the Marker (SNP), S is Sex, MxS is the Marker x Sex interaction, L(M) is the random Line effect nested within Marker and E is Error. For each sex separately we use the model phenotype = mean + M + E, where M is the Marker (SNP) and E is Error. We keep the p-values for all SNPs for the pooled marker effect, the male marker effect and the female marker effect.
If data are given for only a single sex, or no sex is indicated, we run an ANOVA with the model phenotype = mean + M + E, where M is the Marker (SNP) and E is Error.
Currently we use 10^-5 as an arbitrary cutoff for significance for all tests.
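For the simplest case (one sex, marker effect only), the test reduces to a one-way ANOVA with two groups, major-allele lines and minor-allele lines. The sketch below computes the F statistic from 0/2 codes using only the standard library. The function name `one_way_f` and the input layout are assumptions, and the pooled-sex mixed model is not covered.

```python
def one_way_f(phenotypes, codes):
    """F statistic and degrees of freedom for a two-group one-way ANOVA.

    phenotypes: one phenotype value per line.
    codes: 0 (major allele), 2 (minor allele) or None (uncallable) per line.
    Assumes both alleles are present and within-group variance is non-zero.
    """
    groups = {0: [], 2: []}
    for y, g in zip(phenotypes, codes):
        if g is not None:  # lines without a call are not used
            groups[g].append(y)

    values = groups[0] + groups[2]
    n = len(values)
    grand_mean = sum(values) / n

    ss_between = 0.0  # variation explained by the marker
    ss_within = 0.0   # residual variation
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        ss_between += len(ys) * (group_mean - grand_mean) ** 2
        ss_within += sum((y - group_mean) ** 2 for y in ys)

    df_between = 1      # two alleles, so one degree of freedom
    df_within = n - 2
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

The p-value is the upper tail of the F distribution at `f_stat`. If SciPy is available, `scipy.stats.f.sf(f_stat, df_between, df_within)` would give it, and the result could then be compared against the 10^-5 cutoff.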
Third, we compute effect sizes (a) for each of the significant SNPs (for pooled sexes, males and females in the case where both sexes are present).
Finally, we run an MMC clustering analysis to look at LD among the significant SNPs. The primary reason for the LD analysis is that the small number of lines relative to the total number of SNPs can produce some long-distance LD, although this is not common. The result is summarized as a heat map of pairwise correlations (r-squared), where red represents strong correlation (high LD) and blue represents no correlation.
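The pairwise r-squared values behind such a heat map can be sketched as the squared Pearson correlation of the 0/2 codes of two SNPs across lines. This illustrates only the correlation step, not the MMC clustering itself. The function names and the decision to skip lines where either SNP is uncalled are assumptions.

```python
def r_squared(codes_a, codes_b):
    """Squared Pearson correlation between two SNPs coded 0/2 (None = no call).

    Returns None when no line has both calls or when either SNP is
    monomorphic among the shared lines.
    """
    pairs = [(a, b) for a, b in zip(codes_a, codes_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return None

    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return (cov * cov) / (var_a * var_b)


def r_squared_matrix(snps):
    """Symmetric matrix of pairwise r-squared values, in the order given."""
    return [[r_squared(a, b) for b in snps] for a in snps]
```

Passing the significant SNPs ordered by chromosomal arm and position would give a matrix laid out like the heat map's axes.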
Files generated
- The original phenotype input file submitted by the user.
- ANOVA_all_pvals.txt: Marker (SNP) effect (Female, Male and Pooled) and SNP x Sex p-values for all SNPs. In the case where there is only one sex, or no sex is indicated, there is only a single Marker (SNP) p-value for each SNP.
- R2.eps: r-squared heat map of LD. Both axes are the significant SNPs ordered by chromosomal arm and position along each arm. Red represents strong correlation between the two SNPs, blue represents no correlation between the two SNPs.
- top_snps.txt: This is a partially annotated list of all the top SNPs (where the p-value is less than 10^-5 for any of the marker effects: pooled, females or males). NOTE: This file has been recently modified; check the "Header key" section below for details.
- top_SNP_calls_by_line.txt: These are the major/minor allele calls for each of the significant SNPs. Each column is a SNP and each row is a specific line. 0 represents the major allele, 2 represents the minor allele and a blank means that SNP could not be called for that line.
Header key
Notes: The annotation has recently been updated to version 5.46 of the D. melanogaster genome sequence. Many errors in the old versions have also been fixed (primarily in the Site Class column for negative-sense genes in version 5.35 and truncated gene symbols in version 5.42).
Most annotated files will have 1-2 lines per SNP, depending on the genomic position of the SNP. Intergenic SNPs often produce two lines (the two closest genes within 5000 bp are listed). In rare cases where genes overlap, there may be a line for each gene. In most of these cases we use a ranking system to avoid overly long and complex annotations: we report only the most likely affected gene (CDS/splice site changes > UTR changes > intergenic changes). For example, if genes A and B overlap at a SNP that is a non-synonymous base change in gene A and an intergenic base change in gene B, we report only gene A. However, a SNP may still have the same ranking in multiple genes, in which case all of them are listed. Annotation information is no longer given for SNPs that are more than 5000 bp away from the nearest gene.
The annotation information is constantly changing, so it should only be used as a guide. Updated information for each SNP can be obtained via FlyBase.org.
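The ranking rule for overlapping genes described in the notes can be illustrated with a short sketch. The site-class labels, the `RANK` table and the function name `most_likely_genes` are hypothetical. Only the ordering (CDS/splice site > UTR > intergenic) and the keep-all-ties behaviour come from the text.

```python
# Hypothetical site-class labels; a lower rank means "more likely affected".
RANK = {
    "NON_SYNONYMOUS": 0,  # CDS change
    "SYNONYMOUS": 0,      # CDS change
    "SPLICE_SITE": 0,
    "UTR": 1,
    "INTERGENIC": 2,
}


def most_likely_genes(annotations):
    """Pick the gene(s) to report for one SNP.

    annotations: non-empty list of (gene_symbol, site_class) pairs, one per
    overlapping or nearby gene. Genes sharing the best rank are all kept.
    """
    best = min(RANK[site_class] for _, site_class in annotations)
    return [gene for gene, site_class in annotations
            if RANK[site_class] == best]
```

With the example from the notes, `most_likely_genes([("A", "NON_SYNONYMOUS"), ("B", "INTERGENIC")])` keeps only gene A.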
- Chromosome: the chromosomal arm the SNP is located on.
- Position: the base position of the SNP on the given chromosomal arm based on release 5.46 of the Drosophila melanogaster genome sequence.
- Gene Symbol: the gene symbol for the gene that the SNP is located in or near.
- FlyBase ID: the FlyBase ID number (a partial list, as we did not have ID numbers for all genes at the time of implementation).
- Site Class: the type of change produced by the polymorphism if it is in coding sequence (non-synonymous, synonymous, etc.); otherwise, the type of gene region it falls within.
- SNP Location in Gene: This has been removed due to the complexity of multiple transcripts.
- Bases from Gene: The number of bases between the SNP and the listed gene, if the SNP is intergenic.
- Reference Sequence Allele: The base call from the reference sequence found on FlyBase.
- Minor Allele Frequency: Minor Allele Count / Total Lines.
- SNP p-value: p-value for the Marker (SNP) effect (one each for Female, Male and Pooled when data are given for both sexes).
- Effect size: [(Major allele mean) - (Minor allele mean)]/2
- Major Allele Mean: the mean value of lines with the major allele.
- Minor Allele Mean: the mean value of lines with the minor allele.
- Major Allele Call: the base call of the more frequent polymorphism.
- Minor Allele Call: the base call of the less frequent polymorphism.
- Minor Allele Count: the number of lines with the minor allele.
- Total Count: the total number of lines used in the statistical test (missing, ambiguous or segregating SNPs are not used).
- Mean coverage: the mean sequencing coverage for the SNP across all lines.
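The per-SNP summary columns defined above (counts, minor allele frequency, allele means and effect size) can be computed from the 0/2 codes as in the sketch below. The function name `snp_summary` and the dictionary keys are illustrative. It also assumes that "Total Lines" in the Minor Allele Frequency definition means the Total Count of lines with a usable call.

```python
def snp_summary(phenotypes, codes):
    """Summary statistics for one SNP.

    phenotypes: one phenotype value per line.
    codes: 0 (major allele), 2 (minor allele) or None (not used) per line.
    Assumes at least one line carries each allele.
    """
    major = [y for y, g in zip(phenotypes, codes) if g == 0]
    minor = [y for y, g in zip(phenotypes, codes) if g == 2]

    total = len(major) + len(minor)  # missing/ambiguous/segregating excluded
    major_mean = sum(major) / len(major)
    minor_mean = sum(minor) / len(minor)

    return {
        "minor_allele_count": len(minor),
        "total_count": total,
        "minor_allele_frequency": len(minor) / total,
        "major_allele_mean": major_mean,
        "minor_allele_mean": minor_mean,
        # Effect size: [(major allele mean) - (minor allele mean)] / 2
        "effect_size": (major_mean - minor_mean) / 2,
    }
```

A negative effect size therefore means that lines carrying the minor allele have the higher mean phenotype.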

