Microarray data processing techniques for genome-scale network inference from large public repositories

Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips.

On this page we provide the datasets and software used in our paper.


Sriram P. Chockalingam, Maneesha Aluru, and Srinivas Aluru. Microarray data processing techniques for genome-scale network inference from large public repositories. Under preparation.

Files for Download


All twelve classified datasets (7 tissues; 5 conditions) can be downloaded from here. The dataset files are in the "exp" format, a plain text format with (No. of experiments + 2) columns and (No. of genes + 3) rows. The first two columns contain the probe set name and the locus id (Arabidopsis Genome Identifier, or AGI). From the third column onwards, each column contains the expression values corresponding to one experiment.

The rows are organized as follows: the first row is a header; the second and third rows are descriptions. Starting from the fourth row, each row is a vector of the expression values corresponding to a gene. The first two entries in each row are the probe id and the AGI (of the form ATXGXXXXX, for example AT1G01010), respectively. The locus id value can be used to select the rows corresponding to the genes of interest.
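The layout described above can be read with a few lines of Python. The following is only a minimal sketch, not part of the released software: it assumes the columns are tab-separated (the delimiter is not stated on this page), and the names read_exp, select_genes and ExpRecord are hypothetical helpers introduced for illustration.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type: one data row of an exp file.
ExpRecord = namedtuple("ExpRecord", ["probe_id", "agi", "values"])


def read_exp(handle, delimiter="\t"):
    """Parse an exp-format stream.

    ASSUMPTION: columns are tab-separated; pass another delimiter if the
    files turn out to use a different one.
    Returns (header, descriptions, records).
    """
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)                        # row 1: header
    descriptions = [next(reader), next(reader)]  # rows 2-3: descriptions
    records = []
    for fields in reader:                        # row 4 onwards: one gene per row
        if not fields:
            continue                             # skip blank lines
        probe_id, agi = fields[0], fields[1]
        values = [float(v) for v in fields[2:]]  # one value per experiment
        records.append(ExpRecord(probe_id, agi, values))
    return header, descriptions, records


def select_genes(records, loci):
    """Keep only the rows whose locus id (AGI) is in `loci` (case-insensitive)."""
    wanted = {locus.upper() for locus in loci}
    return [r for r in records if r.agi.upper() in wanted]


if __name__ == "__main__":
    # Toy data for illustration only; these are not values from the datasets.
    toy = (
        "probe\tlocus\texp1\texp2\n"
        "description\tdescription\ta\tb\n"
        "description\tdescription\tc\td\n"
        "probe_a\tAT1G01010\t1.5\t2.5\n"
        "probe_b\tAT2G02020\t3.0\t4.0\n"
    )
    header, descriptions, records = read_exp(io.StringIO(toy))
    for rec in select_genes(records, ["AT1G01010"]):
        print(rec.probe_id, rec.agi, rec.values)
```

For a real dataset file, the same functions would be applied to an opened file handle instead of the in-memory toy string. Records are kept in a list rather than a dictionary keyed by AGI, since more than one probe set may map to the same locus.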

Software and Data

Attachment Size
datasets.tar.gz 2.1 GB
pre-processing.tar.gz 1.7 MB