Preprocessing

There are two interesting papers on the preprocessing of ChEMBL data:
The Experimental Uncertainty of Heterogeneous Public Ki Data
Comparability of Mixed IC50 Data - A Statistical Analysis
Both papers are by the same authors.

About the first paper

The first paper checks how well Ki measurements from different institutes agree with each other for the same drug-target pairs.
In this paper, the Ki values, originally given in nM, are transformed into log space (pKi, the negative log10 of the molar Ki), such that:
1 nM -> 9 pKi
10 nM -> 8 pKi
100 nM -> 7 pKi
1000 nM -> 6 pKi
...
They estimate that the experimental uncertainty (median unsigned error) of Ki measurements in ChEMBL is between 0.44 and 0.48 pKi units.
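The log transform in the table above can be written as a small helper. This is a minimal sketch, not code from the paper; the function name nm_to_p is made up for illustration, and the only assumption is that the input is a concentration in nM.

```python
import math


def nm_to_p(value_nm: float) -> float:
    """Convert a concentration in nM to its negative log10 molar value.

    Hypothetical helper: works for Ki -> pKi and IC50 -> pIC50 alike.
    """
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


# Reproduce the table above
for ki_nm in (1, 10, 100, 1000):
    print(f"{ki_nm} nM -> {nm_to_p(ki_nm):.1f} pKi")
```

Because the scale is logarithmic, an uncertainty of about 0.45 pKi units corresponds to roughly a factor of 10^0.45, i.e. close to 3, in the raw Ki value.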

About the second paper

The second paper checks how well IC50 measurements from different institutes agree with each other for the same drug-target pairs.
The same log transform is applied to the measurements as for the Ki values. The paper concludes that the experimental uncertainty (median unsigned error) of IC50 measurements in ChEMBL is around 0.55 pIC50 units.
The paper also concludes that Ki and IC50 measurements can be mixed into one dataset if 0.35 log units are subtracted from the pKi values. (There is a section in this paper titled "Can ChEMBL Ki and IC50 Data be Mixed?")
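The mixing rule can be sketched as follows. This is an illustration under assumptions, not the actual preprocessing code: the function names, the dict-based input format and the choice to keep duplicate measurements as a list per pair are all hypothetical. Only the 0.35 offset comes from the notes above.

```python
import math

KI_OFFSET = 0.35  # log units subtracted from pKi, as stated in the notes above


def to_p(value_nm):
    """Negative log10 of a concentration given in nM (hypothetical helper)."""
    return 9.0 - math.log10(value_nm)


def mix_measurements(ki_nm, ic50_nm, offset=KI_OFFSET):
    """Put Ki and IC50 values on one common pIC50-like scale.

    ki_nm, ic50_nm: dicts mapping (drug, target) -> value in nM (assumed format).
    Returns a dict mapping (drug, target) -> list of values on the mixed scale.
    """
    mixed = {}
    for pair, value in ic50_nm.items():
        mixed.setdefault(pair, []).append(to_p(value))
    for pair, value in ki_nm.items():
        # Shift pKi down by the offset before merging it with the pIC50 values
        mixed.setdefault(pair, []).append(to_p(value) - offset)
    return mixed


ki = {("drugA", "AR"): 10.0}
ic50 = {("drugA", "AR"): 100.0, ("drugB", "ER"): 1000.0}
mixed = mix_measurements(ki, ic50)
for pair, values in mixed.items():
    print(pair, [round(v, 2) for v in values])
```

How several values for the same pair are then combined (for example by taking the median) is left open here, since the notes do not say.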

Preprocessing

I first generated a dataset with only Ki measurements with the following preprocessing steps:
Then I generated a dataset with only IC50 measurements using the same preprocessing steps (except that the second step takes the IC50 column instead of the Ki column).
Finally, I generated a third dataset that integrates Ki and IC50 values as explained in the second paper above.
For the third dataset I removed all drugs and targets with fewer than 5 entries (instead of 4); otherwise the dimension becomes too large (~8100 x 950). (Remember the problem with the web interface that can compute drug similarities for at most 4000 drugs?)
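A minimal sketch of such a sparsity filter is shown below. It is not the actual preprocessing code. It assumes that observations are (drug, target, value) triples, that an "entry" is one observation, and that the filter is repeated until every remaining drug and target meets the threshold; a single pass would be another valid reading of the step. The name filter_sparse is made up for illustration.

```python
from collections import Counter


def filter_sparse(observations, min_entries=5):
    """Drop drugs and targets with fewer than min_entries observations.

    observations: iterable of (drug, target, value) triples (assumed format).
    Removing a drug can push a target below the threshold (and vice versa),
    so the filter is repeated until nothing changes.
    """
    obs = list(observations)
    while True:
        drug_counts = Counter(d for d, _, _ in obs)
        target_counts = Counter(t for _, t, _ in obs)
        kept = [
            (d, t, v)
            for d, t, v in obs
            if drug_counts[d] >= min_entries and target_counts[t] >= min_entries
        ]
        if len(kept) == len(obs):
            return kept
        obs = kept


# Toy example with a threshold of 2: drug "d2" has a single entry and is dropped
toy = [
    ("d0", "t0", 7.0), ("d0", "t1", 6.5),
    ("d1", "t0", 8.0), ("d1", "t1", 5.5),
    ("d2", "t0", 6.0),
]
print(filter_sparse(toy, min_entries=2))
```

Raising min_entries from 4 to 5 shrinks both dimensions of the drug x target matrix at once, which is why it is a convenient knob for staying under a size limit.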
All three datasets contain AR and ER. The datasets are visualized here (the two highlighted green columns are AR and ER):
The Ki dataset (dimension 2776 drugs x 320 targets, ~2% filled)
The IC50 dataset (dimension 4852 drugs x 766 targets, ~1% filled)
The mixed dataset (dimension 4153 drugs x 698 targets, ~1.2% filled)
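The "filled" percentages above are the share of drug-target cells that have at least one measurement. A small sketch of that calculation, assuming the same hypothetical (drug, target, value) triples as input:

```python
def fill_ratio(observations):
    """Fraction of cells in the drug x target matrix that hold a measurement.

    observations: iterable of (drug, target, value) triples (assumed format).
    Repeated measurements of the same pair count as one filled cell.
    """
    pairs = {(d, t) for d, t, _ in observations}
    if not pairs:
        return 0.0
    drugs = {d for d, _ in pairs}
    targets = {t for _, t in pairs}
    return len(pairs) / (len(drugs) * len(targets))


# Toy example: 3 of the 4 cells in a 2 x 2 matrix are filled
toy = [("d0", "t0", 7.0), ("d0", "t1", 6.5), ("d1", "t0", 8.0)]
print(f"{fill_ratio(toy):.0%} filled")
```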
Note that I discarded Kd measurements completely. They are not mentioned in the papers at all, and they make up the smallest subset of ChEMBL (850,000 observations -> 60,000 Kd, 240,000 Ki, 460,000 IC50). The code for the preprocessing can be found here: Link.