Code
::use_python('/usr/local/bin/python3') reticulate
January 9, 2024
We simulate binary \(X\) under a markov model. The first column of \(X\) are iid \(Bin(1, p)\) to generate \(x_{k+1} | x_{k}\) we sample \(\pi\) indices and flip the bits. By setting \(\pi\) to a small value we get highly correlated \(X\), or by setting \(\pi =0.5\) we get uncorrelated \(X\).
To simulate our phenotypes, we specify \(h^2\) of liability for the trait, and \(\rho\) the fraction of \(h^2\) explained by the causal variant in the locus. We fix \(h^2 = 0.2\) and vary \(\rho\) from \(0.02\) to \(1.0\) (corresponding to odds ratios of \(\approx 1.1 - 2\)). For each simulation we set the intercept to achieve the desired fraction of cases \(k=0.1\). Note the effect sizes in these simulations are not random, they are computed deterministically.
These simulation settings are meant to be a reasonable approximation to what we might see in GWAS. But admitidely, the way we generate \(X\) is a little odd.
We can improve these simulations by: 1. Simulating more realistic genotype type data 1. Being a bit more thoughtful about how we pick effect sizes e.g. have an MAF dependent prior OR.
We see good coverage of the \(95%\) CSs across the different simulation scenarios. This is in contrast to earlier simulations, which I’ve been having trouble replicating, where we see a dramatic loss of coverage. It could be that choosing a more realistic genotype would change the results here, and we should follow up on that.
import numpy as np
def flip_bits(x, flip):
"""
Flips the values of x based on the flip array.
Parameters:
x (ndarray): Input array
flip (ndarray): Array of 0s and 1s indicating which values to flip
Returns:
ndarray: Flipped array
"""
return np.where(flip == 1, 1 - x, x)
def sim_binary_X(n: int=1000, p: int=100, pi1: float = 0.1, rho=0.1):
"""
n : number of rows
p: number of columns
pi1: fraction of ones in first column
rho: probability of flipping bits
"""
X = np.zeros((n, p))
X[:, 0] = np.random.choice([0, 1], size=n, p=[1 -pi1, pi1])
for j in range(1, p):
flip = np.random.binomial(1, rho, size=n)
X[:, j] = flip_bits(X[:, j-1], flip)
return X
def augment(X, y):
yaug = np.append(y, [1., 1., 0., 0.])
Xaug = np.column_stack([np.append(X[:, i], [1., 0., 1., 0.]) for i in range(X.shape[1])])
Xaug = (Xaug - Xaug.mean(0)[None])/Xaug.std(0)[None]
return Xaug, yaug
def fit_ser_aug(X, y):
Xaug, yaug = augment(X, y)
ser = logistic_ser(Xaug, yaug)
return ser
def compute_cs(alpha, target=0.95):
sorted_indices = np.argsort(alpha)[::-1]
cumulative_sum = np.cumsum(alpha[sorted_indices])
top_indices = sorted_indices[:(cumulative_sum < target).sum() + 1]
return top_indices
def simulate_case_control(x, k=0.1, h2=0.2, rho=0.1):
nu = np.pi**2/3 * h2/(1-h2)
sigma = np.sqrt(rho * nu)
s = np.sqrt(nu - sigma**2)
xnorm = (x - x.mean()) / x.std()
b = sigma
# b = np.random.normal(scale=sigma)
logit = xnorm * b + np.random.normal(scale=s, size=n)
b0 = np.quantile(logit + np.random.logistic(size=n), k)
logit = b0 + logit
y = np.random.binomial(1, 1/(1 + np.exp(-logit)))
sim = dict(k=k, h2=h2, rho=rho, nu=nu, sigma=sigma, s=s, b0=b0, b=b, logit=logit, y=y)
return sim
# simulate X
n, p = 5000, 100
X = sim_binary_X(n=n, p=p, pi1 = 0.1, rho=0.005)
x = X[:, 0] # first one is causal
rep = np.arange(50)
rho = np.concatenate([np.linspace(0.02, 0.1, 5), np.linspace(0.2, 1.0, 5)])
df = pd.DataFrame(list(itertools.product(rep, rho)), columns=['rep', 'rho'])
df['k'] = 0.1
df['h2'] = 0.2
df['sim'] = df.apply(lambda row: simulate_case_control(x, k=row.k, h2=row.h2, rho=row.rho), axis=1)
df['ybar'] = df.sim.apply(lambda sim: sim['y'].mean())
df['logit_var'] = df.sim.apply(lambda sim: np.var(sim['logit']))
/usr/local/lib/python3.11/site-packages/jax/_src/numpy/lax_numpy.py:3527: UserWarning: 'kind' argument to argsort is ignored; only 'stable' sorts are supported.
warnings.warn("'kind' argument to argsort is ignored; only 'stable' sorts "
df['cs_size'] = df.cs.apply(lambda cs: len(cs))
df['covered'] = df.cs.apply(lambda cs: 0 in cs)
df['max_z'] = df.ser.apply(lambda fit: np.abs(fit['betahat']/np.sqrt(fit['shat2'])).max())
df['lbf_ser'] = df.ser.apply(lambda fit: np.array(fit['lbf_ser'].flatten())[0])
df['prior_variance'] = df.ser.apply(lambda fit: np.array(fit['prior_variance'].flatten())[0])
Across all CSs
rho mean_covered median_cs_size median_max_z median_lbf_ser
0 0.02 0.82 82.5 3.200346 0.730564
1 0.04 0.94 28.5 4.354254 3.307323
2 0.06 0.96 9.0 5.859319 9.163818
3 0.08 1.00 8.0 6.158600 10.446674
4 0.10 0.98 5.0 7.288703 16.028008
5 0.20 1.00 2.0 11.153625 43.252552
6 0.40 0.98 1.0 17.539280 111.461807
7 0.60 1.00 1.0 22.606627 186.874664
8 0.80 1.00 1.0 26.861099 277.643341
9 1.00 1.00 1.0 30.352369 381.481323
That contains CSs for all simulations. But in some cases there isn’t strong evidence for an effect at all, and we would not report the CSs. If we filter to SERs with and \(\log BF_{SER} > 1\), we see improved coverage in the smallest effect size simulation.
rho mean_covered median_cs_size median_max_z median_lbf_ser
0 0.02 0.913043 59.0 3.975982 3.440657
1 0.04 0.976190 25.0 4.472989 4.269767
2 0.06 0.959184 9.0 5.883980 9.222936
3 0.08 1.000000 8.0 6.158600 10.446674
4 0.10 0.980000 5.0 7.288703 16.028008
5 0.20 1.000000 2.0 11.153625 43.252552
6 0.40 0.980000 1.0 17.539280 111.461807
7 0.60 1.000000 1.0 22.606627 186.874664
8 0.80 1.000000 1.0 26.861099 277.643341
9 1.00 1.000000 1.0 30.352369 381.481323