BENCHMARKS
Generator of benchmarks
We built a benchmark generator that simulates the biological problem we face [1]. Some examples of generated benchmarks are given below.
Specific features of the biological problem include, for example:
- correlation between loci
- several associations to find, each involving several genes
- ....
Generator (rar file, C version)

Files of benchmarks
A benchmark is composed of several files: a description file, a file containing only affected people, and a file containing only unaffected people.

Example 1

GAW 11 Challenge

There is a disease, spilkilizing myrmaedonia, or SM for short. It occurs in a severe form and a mild form. Epidemiological studies have indicated that the prevalence of the disease (severe and mild forms together) is on the order of 3-6%, though some studies have found somewhat higher figures, due to controversy about diagnostic criteria, misdiagnosis, etc. This disease is currently being studied by four scientific groups, in three distinct populations, as described below.
Each family contains at least two affected children (see below for further ascertainment details of the individual studies).

You are given the following information on family members: Affectedness status (severe; mild; unaffected, coded 3, 2, 1, respectively). Marker typing on the entire genome. Exposure to two environmental factors, E1 and E2, which have been suggested by some as possibly relating to disease etiology (coded 1 for present, 0 for absent). Epidemiological studies indicate that E1 occurs at a frequency of 30% in the population, and E2, at a frequency of 40%.
Both environmental factors occur at random, i.e., are not correlated within families.

The genome consists of 6 chromosomes, with exactly 50 markers per chromosome, with the average distance (recombination fraction) between the markers being 0.07.  You are given correct population allele frequencies at the marker loci. 
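As a rough orientation on the size of this genome, the figures above can be turned into a marker count and an approximate map length. The sketch below assumes the Haldane map function to convert the recombination fraction into centimorgans; the challenge text does not say which map function applies, so the resulting distances are only indicative. All names in the code are illustrative.

```python
import math

N_CHROMOSOMES = 6            # from the challenge description
MARKERS_PER_CHROMOSOME = 50  # from the challenge description
THETA = 0.07                 # average recombination fraction between adjacent markers


def haldane_cm(theta: float) -> float:
    """Map distance in centimorgans under the Haldane map function (an assumption)."""
    return -50.0 * math.log(1.0 - 2.0 * theta)


total_markers = N_CHROMOSOMES * MARKERS_PER_CHROMOSOME
interval_cm = haldane_cm(THETA)
# A chromosome with 50 markers has 49 intervals between adjacent markers.
chromosome_cm = (MARKERS_PER_CHROMOSOME - 1) * interval_cm

print(total_markers, round(interval_cm, 2), round(chromosome_cm, 1))
```

Under this assumption each interval is roughly 7.5 cM, and the whole genome carries 300 markers.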
Here are details of the individual studies:

Study 1.  This population lives in Mycenae and is being studied by a group at the Allgemeine Humanistische Erforscher, Mycenae (AHEM).
In previous studies, these scientists have collected extensive population data, and they have been advocates of the two
environmental risk factors, E1 and E2.  One characteristic of the Mycenaean population is that little of the severe form of SM is seen
here.  It is also rumored that the Mycenaean group has found a disease-marker association but that they are hoping to develop a test
kit and so are unwilling to publish this finding.  The AHEM study has ascertained all nuclear families in Mycenae with at least two
affected children. 
Study 2.  A second population lives in Mordor and is being studied by scientists at the Mordor Research Institute (MRI), located in the
northern part of Mordor.  These researchers believe that the severe and mild forms of disease represent a spectrum of symptoms from the same
diathesis.  They have collected all nuclear families in northern Mordor with at least two affected children (affected with either form of the
disease). 
Study 3.  The Mordor population is also being studied by a group from the neighboring People's Democracy of Ruritania (PDR), which
borders Mordor on the south.  The Ruritanians believe that families in which the severe form is found differ genetically from those families
exhibiting only the mild form.  (They believe this partly because of the published work from the Mycenaeans, above.)  Consequently, they
ascertain all families with at least two affected children but require that at least one affected child must have the severe form of SM.  (Under a cooperative agreement, the MRI and PDR groups have divided up the population of Mordor, with the MRI studying families in the northern part of Mordor, and the PDR studying families in the southern part, so there is no overlap.)

Study 4.  The third population lives in Erehwon and is being studied by researchers at the Spilkilizing Myrmaedonia of Erehwon Foundation (SMERF).  Like the PDR researchers, these investigators also believe that the severe and mild forms differ genetically, but their approach is to exclude families in which any member has the severe form of the disease.  They ascertain all Erehwon families in which at least two children are affected with the mild form, but discard a family if any
member, parent or child, has the severe form.  However, after they have ascertained a family, if they subsequently misdiagnose someone as having the severe form, they do not discard that family.
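
The four ascertainment rules can be summarized as simple filters on a nuclear family. The following is a minimal sketch, not the simulation code used for the challenge: the `Family` class and the `ascertain` function are hypothetical names, and the status codes are the ones given above (1 unaffected, 2 mild, 3 severe). The later misdiagnosis case mentioned for Study 4 is not modeled.

```python
from dataclasses import dataclass
from typing import List

UNAFFECTED, MILD, SEVERE = 1, 2, 3  # affectedness codes from the description


@dataclass
class Family:
    """Hypothetical container: one status code per parent and per child."""
    parents: List[int]
    children: List[int]


def ascertain(fam: Family, study: int) -> bool:
    """Return True if the family meets the ascertainment rule of the given study."""
    n_affected = sum(1 for s in fam.children if s in (MILD, SEVERE))
    if study in (1, 2):
        # AHEM and MRI: at least two affected children, either form.
        return n_affected >= 2
    if study == 3:
        # PDR: at least two affected children, at least one of them severe.
        return n_affected >= 2 and SEVERE in fam.children
    if study == 4:
        # SMERF: at least two mildly affected children, no severe member at all.
        n_mild = sum(1 for s in fam.children if s == MILD)
        any_severe = SEVERE in fam.parents or SEVERE in fam.children
        return n_mild >= 2 and not any_severe
    raise ValueError(f"unknown study: {study}")


mixed = Family(parents=[UNAFFECTED, UNAFFECTED], children=[MILD, SEVERE, UNAFFECTED])
mild_only = Family(parents=[UNAFFECTED, UNAFFECTED], children=[MILD, MILD])
print(ascertain(mixed, 3), ascertain(mixed, 4), ascertain(mild_only, 4))
```

Studies 1 and 2 share the same rule in this sketch; they differ only in the population sampled (Mycenae versus northern Mordor).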

INFORMATION FOR GAW PARTICIPANTS:

The basic disease model is the same for all populations, but the parameters may have different values in the three different populations.

You are given 100 datasets (25 replicates for each population) consisting of 100 nuclear families each.  The replicates are labeled, so you will know which population they came from. You may choose to study only one replicate from one study; or analyze one replicate from each study or population, in order to compare results across studies or populations; or analyze multiple replicates either  from one study or from multiple studies.

The first line of the data is the country of origin and family info, as well as environmental risk factor information. 
Affectedness status can be 1, 2 or 3 (unaffected, mild, severe).

There are two environmental risk factors. "1" means the risk was present, "0" not present.

There follow 6 lines of marker data. Each contains the alleles for 50 markers, allele 1 followed by allele 2. A blank line separates one individual's data from the next. A blank is the first character of each line of marker data.
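
The marker layout described above can be read with a few lines of code. This is a sketch under stated assumptions: allele codes are taken to be whitespace-separated integers, and "allele 1 followed by allele 2" is read as the two alleles of each marker appearing next to each other. Neither point is confirmed by the description, and `parse_marker_block` is a hypothetical helper name. The header line of each individual is not parsed, because its exact field layout is not given here.

```python
from typing import List, Tuple

Genotype = Tuple[int, int]


def parse_marker_block(lines: List[str], markers_per_line: int = 50) -> List[List[Genotype]]:
    """Parse one individual's marker lines into a list of genotypes per chromosome.

    Each line is expected to hold 2 * markers_per_line allele codes
    (assumed to be whitespace-separated integers).
    """
    chromosomes = []
    for line in lines:
        tokens = line.split()  # also drops the leading blank character
        if len(tokens) != 2 * markers_per_line:
            raise ValueError(
                f"expected {2 * markers_per_line} alleles, found {len(tokens)}"
            )
        alleles = [int(t) for t in tokens]
        # Pair consecutive values: (allele 1, allele 2) for each marker.
        chromosomes.append(list(zip(alleles[0::2], alleles[1::2])))
    return chromosomes


# Synthetic example: 6 lines, 50 markers each, every genotype set to (1, 2).
example_lines = [" " + " ".join(["1 2"] * 50) for _ in range(6)]
genome = parse_marker_block(example_lines)
print(len(genome), len(genome[0]), genome[0][0])
```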

2-locus type   p      q      f1    f2     f3    f4
Mycenae        0.2    0.2    0.9   0.05   0.5   0.05
Mordor         0.5    0.1    0.9   0.90   0.2   0.20
Erehwon        0.2    0.2    0.4   0.20   0.1   0.20

3-allele type  r1     r2     f5    f6     f7    f8
Mycenae        0.03   0.02   0.9   0.01   1     0.3
Mordor         0.02   0.05   0.7   0.40   1.0   0.5
Erehwon        0.01   0.03   0.7   0.30   0.7   0.3

Parameter values for the 3 populations: p = frequency of allele "A" at locus A, q = allele frequency at locus B, r1 = frequency of the dominant allele C1 at locus C, r2 = frequency of the recessive allele C2 at locus C, f1 to f8 are the penetrances in the population.



Test file                                             Attributes   Loci   Affected/affected pairs
Two-point test file, "erehwon" population             1149         380    230
Multipoint 1/5 test file, "erehwon" population        1461         491    230
Two-point test file, "mordor" population              1149         380    202
Multipoint 1/5 test file, "mordor" population         1461         491    202
Two-point test file, "Mycenae" population             1149         380    202
Multipoint 1/5 test file, "Mycenae" population        1461         491    202
Two-point test file, "Ruritania" population           1149         380    165
Multipoint 1/5 test file, "Ruritania" population      1461         491    165

Size of the different files for each population.

FILES

Erehwon 1 1/5
Mordor 1 1/5
Mycenae 1 1/5
Ruritania 1 1/5