Analysis

First, we clean up and format the SNP calls in order to run the ANOVAs. 

For each SNP, we remove any ambiguous base calls and any calls from lines in which the site is still segregating. We analyze only SNPs with exactly two alleles. We require that the minor allele be present in at least 4 lines and that the mean sequencing coverage (across lines) be between 2x and 30x (to avoid mis-called bases and mis-aligned repeats). Finally, we convert the base calls to 0 for the major allele and 2 for the minor allele.
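The filtering and recoding steps above can be sketched in Python as follows. This is only an illustration, not the pipeline's actual code: the function name, the dictionary input format, and the treatment of the coverage bounds as inclusive are all assumptions.

```python
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}

def encode_snp(calls, mean_coverage, min_minor_lines=4, min_cov=2.0, max_cov=30.0):
    """Return {line: 0 or 2} for a SNP that passes the filters, else None.

    calls: dict mapping line ID -> base call (hypothetical input format).
    Any call other than A/C/G/T (ambiguity codes, segregating sites, missing
    data) is dropped for this SNP only.
    """
    # Coverage filter; treating the bounds as inclusive is an assumption.
    if not (min_cov <= mean_coverage <= max_cov):
        return None
    clean = {line: base for line, base in calls.items() if base in VALID_BASES}
    counts = Counter(clean.values())
    # Keep only SNPs with exactly two alleles among the usable calls.
    if len(counts) != 2:
        return None
    (major, _), (_, n_minor) = counts.most_common(2)
    # The minor allele must be present in enough lines.
    if n_minor < min_minor_lines:
        return None
    # Major allele -> 0, minor allele -> 2.
    return {line: 0 if base == major else 2 for line, base in clean.items()}
```

If both alleles occur in the same number of lines, this sketch calls whichever allele it encounters first the major allele; how the real pipeline breaks such ties is not stated here.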

Second, we run the ANOVAs for almost 2.5 million SNPs (fewer if fewer than 168 lines are submitted).

If data are given for both sexes, we fit simple linear models for each SNP in two forms. For the sexes pooled, the model is phenotype = mean + M + S + MxS + L(M) + E, where M is the Marker (SNP), S is Sex, MxS is the Marker by Sex interaction, L(M) is the random Line effect nested within Marker and E is Error. For each sex separately, the model is phenotype = mean + M + E, where M is the Marker (SNP) and E is Error. We keep the p-values for all SNPs for the pooled marker effect, the male marker effect and the female marker effect.

If data are given for only a single sex, or no sex is indicated, we run an ANOVA with the model phenotype = mean + M + E, where M is the Marker (SNP) and E is Error.
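As a rough illustration of the single-marker model, the sketch below computes the one-way ANOVA F statistic for one SNP from line means. It covers only the marker-only model (not the pooled model with Sex and the nested Line effect), and the function name and dictionary inputs are hypothetical.

```python
def one_way_anova_f(phenotype_by_line, genotype_by_line):
    """F statistic for phenotype = mean + M + E with a two-level marker.

    phenotype_by_line: dict line ID -> phenotype value (hypothetical format)
    genotype_by_line:  dict line ID -> 0 (major allele) or 2 (minor allele)
    Returns (F, df_between, df_within).
    """
    groups = {0: [], 2: []}
    for line, value in phenotype_by_line.items():
        code = genotype_by_line.get(line)
        if code in groups:  # lines with no usable call for this SNP are skipped
            groups[code].append(value)
    n = len(groups[0]) + len(groups[2])
    if not groups[0] or not groups[2] or n < 3:
        raise ValueError("need both alleles and at least 3 lines")
    grand_mean = (sum(groups[0]) + sum(groups[2])) / n
    means = {code: sum(vals) / len(vals) for code, vals in groups.items()}
    # Between-allele and within-allele sums of squares.
    ss_between = sum(len(vals) * (means[code] - grand_mean) ** 2
                     for code, vals in groups.items())
    ss_within = sum((v - means[code]) ** 2
                    for code, vals in groups.items() for v in vals)
    if ss_within == 0:
        raise ValueError("no within-allele variation; F is undefined")
    df_between, df_within = 1, n - 2
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

The p-value is the upper tail of the F distribution with these degrees of freedom; one option, if SciPy is available, is `scipy.stats.f.sf(f_stat, df_between, df_within)`.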

Currently we use 10^-5 as an arbitrary cutoff for significance for all tests.

Third, we compute effect sizes (a) for each of the significant SNPs (for pooled sexes, males and females in the case where both sexes are present).
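The formula for the effect size is not spelled out here. A common definition of the additive effect a, and the one assumed in this sketch, is half the difference between the mean phenotypes of the two homozygous allele classes. The sign convention (major minus minor) and the function name are assumptions, not taken from the pipeline.

```python
from statistics import fmean

def additive_effect(phenotype_by_line, genotype_by_line):
    """Assumed definition: a = (mean of major-allele lines - mean of minor-allele lines) / 2.

    phenotype_by_line: dict line ID -> phenotype value (hypothetical format)
    genotype_by_line:  dict line ID -> 0 (major allele) or 2 (minor allele)
    """
    major = [v for line, v in phenotype_by_line.items()
             if genotype_by_line.get(line) == 0]
    minor = [v for line, v in phenotype_by_line.items()
             if genotype_by_line.get(line) == 2]
    if not major or not minor:
        raise ValueError("both allele classes must be present")
    return (fmean(major) - fmean(minor)) / 2
```

With both sexes present, the same calculation would be repeated on the pooled, male and female phenotypes.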

Finally, we run an MMC clustering analysis to look at LD amongst the significant SNPs.  The primary reason for the LD analysis is that we tend to have some long-distance LD due to the small number of lines relative to the total number of SNPs (although this isn't that common).  This is summarized as a heat map using pairwise correlations (r-squared) where red represents strong correlation (high LD) and blue represents no correlation.
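The pairwise r-squared values behind such a heat map can be sketched as below. This shows only the correlation step, not the MMC clustering, and it assumes each SNP is a list of 0/2 codes over the same lines in the same order with no missing calls. The function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # a monomorphic vector has no defined correlation
    return (sxy * sxy) / (sxx * syy)

def ld_matrix(genotypes):
    """Symmetric matrix of pairwise r-squared values for a list of SNP vectors."""
    k = len(genotypes)
    matrix = [[1.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            matrix[i][j] = matrix[j][i] = r_squared(genotypes[i], genotypes[j])
    return matrix
```

Values near 1 correspond to the red (high LD) cells and values near 0 to the blue cells.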

Files generated

Header key

Notes: The annotation has recently been updated and is now based on version 5.46 of the D. melanogaster genome sequence. Many errors in older versions have also been fixed (primarily in the Site Class column for negative-sense genes in version 5.35, and truncated gene symbols in version 5.42).

Most annotated files will have 1-2 lines per SNP, depending on the genomic position of the SNP. Intergenic SNPs often produce two lines (the two closest genes within 5000 bp are listed). In rare cases where genes overlap, there may be a line for each gene. In most of these cases we use a ranking system to avoid overly long and complex annotations: we report only the most likely affected gene (CDS/splice site changes > UTR changes > intergenic changes). For example, if genes A and B overlap at a SNP that is a non-synonymous base change in gene A and an intergenic base change in gene B, we report only gene A. However, a SNP may still have the same ranking in multiple genes, in which case all are listed. Annotation information is no longer given for SNPs that are more than 5000 bp away from the nearest gene.

The annotation information is constantly changing, so it should be used only as a guide. Updated information for each SNP can be obtained from FlyBase.org.
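The ranking rule for overlapping genes can be illustrated with a short Python sketch. The site class labels and the function name are hypothetical placeholders, not the values used in the annotation files. Only the ordering (CDS/splice site > UTR > intergenic) and the rule that ties are all listed come from the description above.

```python
# Lower number = higher priority. The labels are illustrative placeholders.
SITE_CLASS_RANK = {
    "CDS": 0,
    "SPLICE_SITE": 0,
    "UTR": 1,
    "INTERGENIC": 2,
}

def top_ranked_genes(annotations):
    """Keep only the gene(s) whose site class has the best rank for one SNP.

    annotations: list of (gene, site_class) tuples for a single SNP.
    Ties are all kept, in their original order.
    """
    if not annotations:
        return []
    best = min(SITE_CLASS_RANK[site_class] for _, site_class in annotations)
    return [(gene, site_class) for gene, site_class in annotations
            if SITE_CLASS_RANK[site_class] == best]
```

For the example in the notes, `top_ranked_genes([("A", "CDS"), ("B", "INTERGENIC")])` keeps only gene A, while two genes with the same site class are both kept.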