Logistic SER coverage under polygenic model

We are interested in knowing if the inferences under the logistic SER are biased when phenotypes are simulated under a polygenic model. In these simulations we simulate one causal variant in the locus, and model unlinked genetic effects with a random intercept. The SER fits the ordinary/marginal regression model at each step, without accounting for this extra variability in liability. Nonetheless, we see good coverage of the credible sets.
Published

January 9, 2024

Simulation

We simulate binary \(X\) under a markov model. The first column of \(X\) are iid \(Bin(1, p)\) to generate \(x_{k+1} | x_{k}\) we sample \(\pi\) indices and flip the bits. By setting \(\pi\) to a small value we get highly correlated \(X\), or by setting \(\pi =0.5\) we get uncorrelated \(X\).

To simulate our phenotypes, we specify \(h^2\) of liability for the trait, and \(\rho\) the fraction of \(h^2\) explained by the causal variant in the locus. We fix \(h^2 = 0.2\) and vary \(\rho\) from \(0.02\) to \(1.0\) (corresponding to odds ratios of \(\approx 1.1 - 2\)). For each simulation we set the intercept to achieve the desired fraction of cases \(k=0.1\). Note the effect sizes in these simulations are not random, they are computed deterministically.

These simulation settings are meant to be a reasonable approximation to what we might see in GWAS. But admitidely, the way we generate \(X\) is a little odd.

We can improve these simulations by: 1. Simulating more realistic genotype type data 1. Being a bit more thoughtful about how we pick effect sizes e.g. have an MAF dependent prior OR.

Results

We see good coverage of the \(95%\) CSs across the different simulation scenarios. This is in contrast to earlier simulations, which I’ve been having trouble replicating, where we see a dramatic loss of coverage. It could be that choosing a more realistic genotype would change the results here, and we should follow up on that.

Code
reticulate::use_python('/usr/local/bin/python3')
Code
import numpy as np
import pandas as pd
from susiepy.logistic_susie import logistic_ser
import itertools
Code
import numpy as np

def flip_bits(x, flip):
    """
    Flips the values of x based on the flip array.
    
    Parameters:
    x (ndarray): Input array
    flip (ndarray): Array of 0s and 1s indicating which values to flip
    
    Returns:
    ndarray: Flipped array
    """
    return np.where(flip == 1, 1 - x, x)
    
def sim_binary_X(n: int=1000, p: int=100, pi1: float = 0.1, rho=0.1):
    """
    n : number of rows
    p: number of columns
    pi1: fraction of ones in first column
    rho: probability of flipping bits
    """
    X = np.zeros((n, p))
    X[:, 0] = np.random.choice([0, 1], size=n, p=[1 -pi1, pi1])
    
    for j in range(1, p):
        flip = np.random.binomial(1, rho, size=n)
        X[:, j] = flip_bits(X[:, j-1], flip)

    return X

def augment(X, y):
    yaug = np.append(y, [1., 1., 0., 0.])
    Xaug = np.column_stack([np.append(X[:, i], [1., 0., 1., 0.]) for i in range(X.shape[1])])
    Xaug = (Xaug - Xaug.mean(0)[None])/Xaug.std(0)[None]
    return Xaug, yaug

def fit_ser_aug(X, y):
    Xaug, yaug = augment(X, y)
    ser = logistic_ser(Xaug, yaug)
    return ser

def compute_cs(alpha, target=0.95):
    sorted_indices = np.argsort(alpha)[::-1]
    cumulative_sum = np.cumsum(alpha[sorted_indices])
    top_indices = sorted_indices[:(cumulative_sum < target).sum() + 1]
    return top_indices

def simulate_case_control(x, k=0.1, h2=0.2, rho=0.1):
    nu = np.pi**2/3 * h2/(1-h2)
    sigma = np.sqrt(rho * nu)
    s = np.sqrt(nu - sigma**2)
    xnorm = (x - x.mean()) / x.std()
    b = sigma
    # b = np.random.normal(scale=sigma) 
    logit = xnorm * b + np.random.normal(scale=s, size=n)
    b0 = np.quantile(logit + np.random.logistic(size=n), k)
    logit = b0 + logit
    y = np.random.binomial(1, 1/(1 + np.exp(-logit)))
    sim = dict(k=k, h2=h2, rho=rho, nu=nu, sigma=sigma, s=s, b0=b0, b=b, logit=logit, y=y)
    return sim

Correlated X

Code
def pluck(series, val):
    return series.apply(lambda x: x[val])
Code
# simulate X
n, p = 5000, 100
X = sim_binary_X(n=n, p=p, pi1 = 0.1, rho=0.005)
x = X[:, 0] # first one is causal

rep = np.arange(50)
rho = np.concatenate([np.linspace(0.02, 0.1, 5), np.linspace(0.2, 1.0, 5)])
df = pd.DataFrame(list(itertools.product(rep, rho)), columns=['rep', 'rho'])
df['k'] = 0.1
df['h2'] = 0.2
df['sim'] = df.apply(lambda row: simulate_case_control(x, k=row.k, h2=row.h2, rho=row.rho), axis=1)
df['ybar'] = df.sim.apply(lambda sim: sim['y'].mean())
df['logit_var'] = df.sim.apply(lambda sim: np.var(sim['logit']))
Code
# run simulation
df['ser'] = df.sim.apply(lambda sim: fit_ser_aug(X, sim['y']))
Code
# process simulation
df['cs'] = df.ser.apply(lambda fit: compute_cs(fit['alpha'], 0.95))
/usr/local/lib/python3.11/site-packages/jax/_src/numpy/lax_numpy.py:3527: UserWarning: 'kind' argument to argsort is ignored; only 'stable' sorts are supported.
  warnings.warn("'kind' argument to argsort is ignored; only 'stable' sorts "
Code
df['cs_size'] = df.cs.apply(lambda cs: len(cs))
df['covered'] = df.cs.apply(lambda cs: 0 in cs)
df['max_z'] = df.ser.apply(lambda fit: np.abs(fit['betahat']/np.sqrt(fit['shat2'])).max())
df['lbf_ser'] = df.ser.apply(lambda fit: np.array(fit['lbf_ser'].flatten())[0])
df['prior_variance'] = df.ser.apply(lambda fit: np.array(fit['prior_variance'].flatten())[0])

Across all CSs

Code
df2 = df.groupby('rho').agg({'covered': 'mean', 'cs_size': 'median', 'max_z': 'median', 'lbf_ser': 'median'}).rename(columns={'covered': 'mean_covered', 'cs_size': 'median_cs_size', 'max_z': 'median_max_z', 'lbf_ser': 'median_lbf_ser'}).reset_index()
df2
    rho  mean_covered  median_cs_size  median_max_z  median_lbf_ser
0  0.02          0.82            82.5      3.200346        0.730564
1  0.04          0.94            28.5      4.354254        3.307323
2  0.06          0.96             9.0      5.859319        9.163818
3  0.08          1.00             8.0      6.158600       10.446674
4  0.10          0.98             5.0      7.288703       16.028008
5  0.20          1.00             2.0     11.153625       43.252552
6  0.40          0.98             1.0     17.539280      111.461807
7  0.60          1.00             1.0     22.606627      186.874664
8  0.80          1.00             1.0     26.861099      277.643341
9  1.00          1.00             1.0     30.352369      381.481323

That contains CSs for all simulations. But in some cases there isn’t strong evidence for an effect at all, and we would not report the CSs. If we filter to SERs with and \(\log BF_{SER} > 1\), we see improved coverage in the smallest effect size simulation.

Code
df2 = df[df.lbf_ser > 1].groupby('rho').agg({'covered': 'mean', 'cs_size': 'median', 'max_z': 'median', 'lbf_ser': 'median'}).rename(columns={'covered': 'mean_covered', 'cs_size': 'median_cs_size', 'max_z': 'median_max_z', 'lbf_ser': 'median_lbf_ser'}).reset_index()
df2
    rho  mean_covered  median_cs_size  median_max_z  median_lbf_ser
0  0.02      0.913043            59.0      3.975982        3.440657
1  0.04      0.976190            25.0      4.472989        4.269767
2  0.06      0.959184             9.0      5.883980        9.222936
3  0.08      1.000000             8.0      6.158600       10.446674
4  0.10      0.980000             5.0      7.288703       16.028008
5  0.20      1.000000             2.0     11.153625       43.252552
6  0.40      0.980000             1.0     17.539280      111.461807
7  0.60      1.000000             1.0     22.606627      186.874664
8  0.80      1.000000             1.0     26.861099      277.643341
9  1.00      1.000000             1.0     30.352369      381.481323