Data input

Parse command-line arguments

limmbo.io.parser.getGWASargs()[source]

A detailed list of the command-line options for getGWASargs can be found in runGWAS.

limmbo.io.parser.getVarianceEstimationArgs()[source]

A detailed list of the command-line options for getVarianceEstimationArgs can be found in Variance decomposition.

Read data

class limmbo.io.reader.ReadData(verbose=True)[source]

Generate object containing all datasets relevant for the analysis. For variance decomposition, at least phenotypes and relatedness estimates need to be specified. For association testing with LMM, at least phenotype, relatedness estimates and genotypes need to be read.

getCovariates(file_covariates=None, delim=', ')[source]

Reads [N x K] covariate matrix with [N] samples and [K] covariates.

Parameters:
  • file_covariates (string) – [N x (K +1)] covariates file with [N] sample IDs in the first column
  • delim (string) – delimiter of covariates file, one of " ", ",", "\t"
Returns:

updated the following attributes of the ReadData instance:

  • self.covariates (np.array): [N x K] covariates matrix

Return type:

None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io.reader import ReadData
>>> from limmbo.io.utils import file_type
>>> data = ReadData(verbose=False)
>>> file_covs = resource_filename('limmbo',
...                               'io/test/data/covs.csv')
>>> data.getCovariates(file_covariates=file_covs)
>>> data.covariates.index[:3]
Index([u'ID_1', u'ID_2', u'ID_3'], dtype='object')
>>> data.covariates.values[:3,:3]
array([[ 0.92734699,  1.59767659, -0.67263682],
       [ 0.57061985, -0.84679736, -1.11037123],
       [ 0.44201204, -1.61499228,  0.23302345]])
getGenotypes(file_genotypes=None, delim=', ')[source]

Reads genotype file in the following formats: plink (.bed, .bim, .fam), gen (.gen, .sample) or comma-separated values (.csv) file.

Parameters:
  • file_genotypes (string) –

    path to genotype file in plink or .csv format

    • plink format: as specified in the plink user manual, binary plink format with .bed, .fam and .bim file
    • .csv format:
      • [(NrSNP + 1) x (N + 1)] .csv file with: [N] sample IDs in the first row and [NrSNP] genotype IDs in the first column
      • genotype IDs should be of type: chrom-pos-rsID for instance 22-50714616-rs6010226
  • delim (string) – delimiter of genotype file (when text format), one of " ", ",", "\t"
Returns:

updated the following attributes of the ReadData instance:

  • self.genotypes (np.array): [N x NrSNPs] genotype matrix
  • self.genotypes_info (pd.dataframe): [NrSNPs x 2] dataframe with columns ‘chrom’ and ‘pos’, and rsIDs as index

Return type:

None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io import reader
>>> from limmbo.io.utils import file_type
>>> data = reader.ReadData(verbose=False)
>>> # Read genotypes in delim-format
>>> file_geno = resource_filename('limmbo',
...     'io/test/data/genotypes.csv')
>>> data.getGenotypes(file_genotypes=file_geno)
>>> data.genotypes.index[:4]
Index([u'ID_1', u'ID_2', u'ID_3', u'ID_4'], dtype='object')
>>> data.genotypes.shape
(1000, 20)
>>> data.genotypes.values[:5,:5]
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [2., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.]])
>>> data.genotypes_info[:5]
           chrom       pos
rs1601111      3  88905003
rs13270638     8  20286021
rs75132935     8  76564608
rs72668606     8  79733124
rs55770986     7   2087823
>>> ### read genotypes in plink format
>>> file_geno = resource_filename('limmbo',
...     'io/test/data/genotypes')
>>> data.getGenotypes(file_genotypes=file_geno)
>>> data.genotypes.values[:5,:5]
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [2., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.]])
>>> data.genotypes_info[:5]
           chrom       pos
rs1601111      3  88905003
rs13270638     8  20286021
rs75132935     8  76564608
rs72668606     8  79733124
rs55770986     7   2087823
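
The chrom-pos-rsID convention for genotype IDs in delimited genotype files can be illustrated with a small sketch. The helper `split_genotype_id` is hypothetical and not part of limmbo; it only shows how one such ID could map onto the 'chrom' and 'pos' columns of genotypes_info, with the rsID serving as the index. Keeping the chromosome as a string is an assumption made here so that non-numeric labels would also work.

```python
def split_genotype_id(genotype_id):
    """Split a 'chrom-pos-rsID' string into (chrom, pos, rsid).

    Hypothetical helper for illustration only, not limmbo's own parser.
    """
    # Split on the first two hyphens only, so the rsID stays intact.
    chrom, pos, rsid = genotype_id.split("-", 2)
    return chrom, int(pos), rsid


chrom, pos, rsid = split_genotype_id("22-50714616-rs6010226")
print(chrom, pos, rsid)
```
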
getPCs(file_pcs=None, nrpcs=None, delim=', ')[source]

Reads file with [N x PC] matrix of [PC] principal components from the genotypes of [N] samples.

Parameters:
  • file_pcs (string) – [N x (PC +1)] PCA file with [N] sample IDs in the first column
  • delim (string) – delimiter of PCA file, one of " ", ",", "\t"
  • nrpcs (integer) – Number of PCs to use (uses first nrpcs principal components)
Returns:

updated the following attributes of the ReadData instance:

  • self.pcs (np.array): [N x PC] principal component matrix

Return type:

None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io.reader import ReadData
>>> from limmbo.io.utils import file_type
>>> data = ReadData(verbose=False)
>>> file_pcs = resource_filename('limmbo',
...                     'io/test/data/pcs.csv')
>>> data.getPCs(file_pcs=file_pcs, nrpcs=10, delim=" ")
>>> data.pcs.index[:3]
Index([u'ID_1', u'ID_2', u'ID_3'], dtype='object', name=0)
>>> data.pcs.values[:3,:3]
array([[-0.02632738, -0.05791269, -0.03619099],
       [ 0.00761785,  0.00538956,  0.00624196],
       [ 0.01069307, -0.0205066 ,  0.02299996]])
getPhenotypes(file_pheno=None, delim=', ')[source]

Reads [N x P] phenotype file; file ending must be either .txt or .csv

Parameters:
  • file_pheno (string) – path to [(N + 1) x (P + 1)] phenotype file with: [N] sample IDs in the first column and [P] phenotype IDs in the first row
  • delim (string) – delimiter of phenotype file, one of " ", ",", "\t"
Returns:

updated the following attributes of the ReadData instance:

  • self.phenotypes (np.array): [N x P] phenotype matrix

Return type:

None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io.reader import ReadData
>>> from limmbo.io.utils import file_type
>>> data = ReadData(verbose=False)
>>> file_pheno = resource_filename('limmbo',
...                                'io/test/data/pheno.csv')
>>> data.getPhenotypes(file_pheno=file_pheno)
>>> data.phenotypes.index[:3]
Index([u'ID_1', u'ID_2', u'ID_3'], dtype='object')
>>> data.phenotypes.columns[:3]
Index([u'trait_1', u'trait_2', u'trait_3'], dtype='object')
>>> data.phenotypes.values[:3,:3]
array([[-1.56760036, -1.5324513 ,  1.17789321],
       [-0.85655034,  0.48358151,  1.35664966],
       [ 0.10772832, -0.02262884, -0.27963328]])
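
The [(N + 1) x (P + 1)] phenotype file layout (sample IDs in the first column, phenotype IDs in the first row) can be sketched with the standard library alone. The function `read_labelled_matrix` and the toy file content are hypothetical and written for Python 3; this is not how limmbo reads files internally. The content of the top-left header cell is not specified in the description above, so the sketch simply ignores it.

```python
import csv
import io


def read_labelled_matrix(handle, delim=","):
    """Read a delimited table with column IDs in the first row and
    sample IDs in the first column (illustrative helper only)."""
    rows = list(csv.reader(handle, delimiter=delim))
    header, body = rows[0], rows[1:]
    column_ids = header[1:]                # skip the top-left corner cell
    sample_ids = [row[0] for row in body]  # first column holds sample IDs
    values = [[float(v) for v in row[1:]] for row in body]
    return sample_ids, column_ids, values


# Toy stand-in for a phenotype file with N = 3 samples and P = 2 traits.
toy_file = io.StringIO(
    "sample,trait_1,trait_2\n"
    "ID_1,0.5,-1.2\n"
    "ID_2,1.5,0.3\n"
    "ID_3,-0.7,2.0\n"
)
sample_ids, trait_ids, values = read_labelled_matrix(toy_file)
print(sample_ids)
print(trait_ids)
print(len(values), len(values[0]))
```
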
getRelatedness(file_relatedness, delim=', ')[source]

Read file of [N x N] pairwise relatedness estimates of [N] samples.

Parameters:
  • file_relatedness (string) – [(N + 1) x N] .csv file with: [N] sample IDs in the first row
  • delim (string) – delimiter of relatedness file, one of " ", ",", "\t"
Returns:

updated the following attributes of the ReadData instance:

  • self.relatedness (np.array): [N x N] relatedness matrix

Return type:

None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io.reader import ReadData
>>> from limmbo.io.utils import file_type
>>> data = ReadData(verbose=False)
>>> file_relatedness = resource_filename('limmbo',
...                     'io/test/data/relatedness.csv')
>>> data.getRelatedness(file_relatedness=file_relatedness)
>>> data.relatedness.index[:3]
Index([u'ID_1', u'ID_2', u'ID_3'], dtype='object')
>>> data.relatedness.columns[:3]
Index([u'ID_1', u'ID_2', u'ID_3'], dtype='object')
>>> data.relatedness.values[:3,:3]
array([[1.00892922e+00, 2.00758504e-04, 4.30499103e-03],
       [2.00758504e-04, 9.98944885e-01, 4.86487318e-03],
       [4.30499103e-03, 4.86487318e-03, 9.85787665e-01]])
getSampleSubset(file_samplelist=None, samplelist=None)[source]

Read file or string with subset of sample IDs to analyse.

Parameters:
  • file_samplelist (string) – “path/to/file_samplelist”: file contains subset sample IDs with one ID per line, no header.
  • samplelist (string) – comma-separated list of sample IDs e.g. “ID1,ID2,ID5,ID10”.
Returns:

(numpy array)

array containing list of sample IDs
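
The two accepted input forms (a file with one sample ID per line and no header, or a comma-separated string of IDs) can be sketched as follows. `parse_sample_subset` is a hypothetical Python 3 stand-in that takes an open file handle rather than a path and returns a plain list; limmbo's method returns a numpy array and may behave differently in edge cases.

```python
import io


def parse_sample_subset(samplelist=None, handle=None):
    """Return sample IDs from a comma-separated string or from a
    file-like object with one ID per line (illustrative only)."""
    if samplelist is not None:
        return [s.strip() for s in samplelist.split(",") if s.strip()]
    if handle is not None:
        # One ID per line, no header; blank lines are skipped.
        return [line.strip() for line in handle if line.strip()]
    return []


from_string = parse_sample_subset(samplelist="ID1,ID2,ID5,ID10")
from_file = parse_sample_subset(handle=io.StringIO("ID1\nID2\nID5\n"))
print(from_string)
print(from_file)
```
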

getTraitSubset(traitstring=None)[source]

Limit analysis to specific subset of traits

Parameters:traitstring (string) – comma-separated trait numbers (for single traits) or hyphen-separated trait numbers (for trait ranges) or combination of both for trait selection (1-based)
Returns:
(numpy array)
array containing list of trait IDs

Examples

>>> from limmbo.io import reader
>>> data = reader.ReadData(verbose=False)
>>> traitlist = data.getTraitSubset("1,3,5,7-10")
>>> print traitlist
[0 2 4 6 7 8 9]
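
The mapping from a 1-based trait string to 0-based trait indices can be sketched in a few lines. `parse_trait_string` is a hypothetical re-implementation for illustration, assuming that ranges are inclusive at both ends, as the output above suggests; it returns a list rather than a numpy array.

```python
def parse_trait_string(traitstring):
    """Convert e.g. '1,3,5,7-10' (1-based) into 0-based indices.

    Illustrative sketch only, not limmbo's implementation.
    """
    indices = []
    for token in traitstring.split(","):
        token = token.strip()
        if "-" in token:
            # Inclusive 1-based range 'a-b' becomes 0-based a-1 .. b-1.
            start, end = (int(x) for x in token.split("-"))
            indices.extend(range(start - 1, end))
        else:
            indices.append(int(token) - 1)
    return indices


print(parse_trait_string("1,3,5,7-10"))
```
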
getVarianceComponents(file_Cg=None, file_Cn=None, delim_cg=', ', delim_cn=', ')[source]

Reads comma-separated files with [P x P] matrices of [P] trait covariance estimates.

Parameters:
  • file_Cg (string) – [P x P] .csv file with [P] trait covariance estimates of the genetic component
  • file_Cn (string) – [P x P] .csv file with [P] trait covariance estimates of the non-genetic (noise) component
Returns:

updated the following attributes of the ReadData instance:

  • self.Cg (np.array): [P x P] matrix with trait covariance of the genetic component
  • self.Cn (np.array): [P x P] matrix with trait covariance of the non-genetic (noise) component

Return type:

None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io.reader import ReadData
>>> data = ReadData()
>>> file_Cg = resource_filename('limmbo',
...                     'io/test/data/Cg.csv')
>>> file_Cn = resource_filename('limmbo',
...                     'io/test/data/Cn.csv')
>>> data.getVarianceComponents(file_Cg=file_Cg, file_Cn=file_Cn)
>>> data.Cg.shape
(10, 10)
>>> data.Cn.shape
(10, 10)
>>> data.Cg[:3,:3]
array([[ 0.45446454, -0.21084613,  0.01440468],
       [-0.21084613,  0.11443656,  0.01250233],
       [ 0.01440468,  0.01250233,  0.02347906]])
>>> data.Cn[:3,:3]
array([[ 0.53654803, -0.14392748, -0.45483001],
       [-0.14392748,  0.88793093,  0.30539822],
       [-0.45483001,  0.30539822,  0.97785614]])

Check data

class limmbo.io.input.InputData(verbose=True)[source]

Generate object containing all datasets relevant for variance decomposition (phenotypes, relatedness estimates) and pre-processing steps (check for common samples and sample order, covariates regression and phenotype transformation)

Parameters:verbose (bool) – initialise verbose: should progress messages be printed to stdout
addCovariates(covariates, covs_samples=None)[source]

Add [N x K] covariate data with [N] samples and [K] covariates to InputData instance.

Parameters:
  • covariates (array-like) – [N x K] covariate matrix of N individuals and K covariates; if pandas.DataFrame with covs_samples as index, covs_samples do not have to be specified separately.
  • covs_samples (array-like) – [N] sample ID
Returns:

updated the following attributes of the InputData instance:

  • self.covariates (pd.DataFrame): [N x K] covariates matrix
  • self.covs_samples (np.array): [N] sample IDs

Return type:

None

Examples

>>> from limmbo.io import input
>>> import numpy as np
>>> import pandas as pd
>>> covariates = [(1,2,4),(1,1,6),(0,4,8)]
>>> covs_samples = ['S1','S2', 'S3']
>>> covariates = pd.DataFrame(covariates, index=covs_samples)
>>> indata = input.InputData(verbose=False)
>>> indata.addCovariates(covariates = covariates,
...     covs_samples = covs_samples)
>>> print indata.covariates.shape
(3, 3)
>>> print indata.covs_samples.shape
(3,)
addGenotypes(genotypes, geno_samples=None, genotypes_info=None)[source]

Add [N x NrSNP] genotype array of [N] samples and [NrSNP] genotypes, [N] array of sample IDs and [NrSNP x 2] dataframe of genotype description to InputData instance.

Parameters:
  • genotypes (array-like) – [N x NrSNP] genotype array of [N] samples and [NrSNP] genotypes; if pandas.DataFrame with geno_samples as index, geno_samples do not have to be specified separately.
  • geno_samples (array-like) – [N] vector of N sample IDs
  • genotypes_info (dataframe) – [NrSNPs x 2] dataframe with columns ‘chrom’ and ‘pos’, and rsIDs as index
Returns:

updated the following attributes of the InputData instance:

  • self.genotypes (pd.DataFrame): [N x NrSNPs] genotype matrix
  • self.geno_samples (np.array): [N] sample IDs
  • self.genotypes_info (pd.DataFrame): [NrSNPs x 2] dataframe with columns ‘chrom’ and ‘pos’, and rsIDs as index

Return type:

None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io import reader
>>> from limmbo.io import input
>>> data = reader.ReadData(verbose=False)
>>> file_geno = resource_filename('limmbo',
...                                'io/test/data/genotypes.csv')
>>> data.getGenotypes(file_genotypes=file_geno)
>>> indata = input.InputData(verbose=False)
>>> indata.addGenotypes(genotypes=data.genotypes,
...                     genotypes_info=data.genotypes_info)
>>> indata.geno_samples[:5]
array(['ID_1', 'ID_2', 'ID_3', 'ID_4', 'ID_5'], dtype=object)
>>> indata.genotypes.shape
(1000, 20)
>>> indata.genotypes.values[:5,:5]
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [2., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.]])
>>> indata.genotypes_info[:5]
           chrom       pos
rs1601111      3  88905003
rs13270638     8  20286021
rs75132935     8  76564608
rs72668606     8  79733124
rs55770986     7   2087823
addPCs(pcs, pc_samples=None)[source]

Add [N x PC] matrix of [PC] principal components from the genotypes of [N] samples to InputData instance.

Parameters:
  • pcs (array-like) – [N x PCs] principal component matrix of N individuals and PCs principal components; if pandas.DataFrame with pc_samples as index, pc_samples do not have to be specified separately.
  • pc_samples (array-like) – [N] sample IDs
Returns:

updated the following attributes of the InputData instance:

  • self.pcs (pd.DataFrame): [N x PCs] principal component matrix
  • self.pc_samples (np.array): [N] sample IDs

Return type:

None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io import reader
>>> from limmbo.io import input
>>> data = reader.ReadData(verbose=False)
>>> file_pcs = resource_filename('limmbo',
...                     'io/test/data/pcs.csv')
>>> data.getPCs(file_pcs=file_pcs, nrpcs=10, delim=" ")
>>> indata = input.InputData(verbose=False)
>>> indata.addPCs(pcs = data.pcs)
>>> print indata.pcs.shape
(1000, 10)
>>> print indata.pc_samples.shape
(1000,)
addPhenotypes(phenotypes, pheno_samples=None, phenotype_ID=None)[source]

Add phenotypes, their phenotype ID and their sample IDs to InputData instance

Parameters:
  • phenotypes (array-like) – [N x P] phenotype matrix of N individuals and P phenotypes; if pandas.DataFrame with pheno_samples as index and phenotype_ID as columns, pheno_samples and phenotype_ID do not have to be specified separately.
  • pheno_samples (array-like) – [N] sample ID
  • phenotype_ID (array-like) – [P] phenotype IDs
Returns:

updated the following attributes of the InputData instance:

  • self.phenotypes (pd.DataFrame): [N x P] phenotype array
  • self.pheno_samples (np.array): [N] sample IDs
  • self.phenotype_ID (np.array): [P] phenotype IDs

Return type:

None

Examples

>>> from limmbo.io import input
>>> import numpy as np
>>> import pandas as pd
>>> pheno = np.array(((1,2),(7,1),(3,4)))
>>> pheno_samples = ['S1','S2', 'S3']
>>> phenotype_ID = ['ID1','ID2']
>>> phenotypes = pd.DataFrame(pheno, index=pheno_samples,
...     columns = phenotype_ID)
>>> indata = input.InputData(verbose=False)
>>> indata.addPhenotypes(phenotypes = phenotypes)
>>> print indata.phenotypes.shape
(3, 2)
>>> print indata.pheno_samples.shape
(3,)
>>> print indata.phenotype_ID.shape
(2,)
addRelatedness(relatedness, relatedness_samples=None)[source]

Add [N x N] pairwise relatedness estimates of [N] samples to the InputData instance

Parameters:
  • relatedness (array-like) – [N x N] relatedness matrix of N individuals; if pandas.DataFrame with relatedness_samples as index, relatedness_samples do not have to be specified separately.
  • relatedness_samples (array-like) – [N] sample IDs
Returns:

updated the following attributes of the InputData instance:

  • self.relatedness (pd.DataFrame): [N x N] relatedness matrix
  • self.relatedness_samples (np.array): [N] sample IDs

Return type:

None

Examples

>>> from limmbo.io import input
>>> import numpy
>>> import pandas as pd
>>> from numpy.random import RandomState
>>> from numpy.linalg import cholesky as chol
>>> random = RandomState(5)
>>> N = 100
>>> SNP = 1000
>>> X = (random.rand(N, SNP) < 0.3).astype(float)
>>> relatedness = numpy.dot(X, X.T)/float(SNP)
>>> relatedness_samples = numpy.array(
...     ['S{}'.format(x+1) for x in range(N)])
>>> relatedness = pd.DataFrame(relatedness,
...     index=relatedness_samples)
>>> indata = input.InputData(verbose=False)
>>> indata.addRelatedness(relatedness = relatedness)
>>> print indata.relatedness.shape
(100, 100)
>>> print indata.relatedness_samples.shape
(100,)
addVarianceComponents(Cg, Cn)[source]

Add [P x P] matrices of [P] trait covariance estimates of the genetic trait variance component (Cg) and the non-genetic (noise) variance component (Cn) to InputData instance

Parameters:
  • Cg (array-like) – [P x P] matrix of P trait covariance estimates of the genetic trait covariance component
  • Cn (array-like) – [P x P] matrix of P trait covariance estimates of the non-genetic (noise) trait covariance component
Returns:

updated the following attributes of the InputData instance:

  • self.Cg (np.array): [P x P] matrix of P trait covariance estimates of the genetic trait covariance component
  • self.Cn (np.array): [P x P] matrix of P trait covariance estimates of the non-genetic trait covariance component

Return type:

None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io import reader
>>> from limmbo.io import input
>>> import numpy as np
>>> from numpy.random import RandomState
>>> from numpy.linalg import cholesky as chol
>>> data = reader.ReadData(verbose=False)
>>> file_pheno = resource_filename('limmbo',
...                     'io/test/data/pheno.csv')
>>> data.getPhenotypes(file_pheno=file_pheno)
>>> file_Cg = resource_filename('limmbo',
...                     'io/test/data/Cg.csv')
>>> file_Cn = resource_filename('limmbo',
...                     'io/test/data/Cn.csv')
>>> data.getVarianceComponents(file_Cg=file_Cg,
...                            file_Cn=file_Cn)
>>> indata = input.InputData(verbose=False)
>>> indata.addPhenotypes(phenotypes = data.phenotypes)
>>> indata.addVarianceComponents(Cg = data.Cg, Cn=data.Cn)
>>> print indata.Cg.shape
(10, 10)
>>> print indata.Cn.shape
(10, 10)
commonSamples(samplelist=None)[source]

Get [M] common samples out of phenotype, relatedness and optional covariates with [N] samples (if all samples present in all datasets [M] = [N]) and ensure that samples are in same order.

Parameters:samplelist (array-like) – array of sample IDs to select from data
Returns:updated the following attributes of the InputData instance:
  • self.phenotypes (pd.DataFrame): [M x P] phenotype matrix
  • self.pheno_samples (np.array): [M] sample IDs
  • self.relatedness (pd.DataFrame): [M x M] relatedness matrix
  • self.relatedness_samples (np.array): [M] sample IDs of relatedness matrix
  • self.covariates (pd.DataFrame): [M x K] covariates matrix
  • self.covs_samples (np.array): [M] sample IDs
  • self.genotypes (pd.DataFrame): [M x NrSNPs] genotypes matrix
  • self.geno_samples (np.array): [M] sample IDs
  • self.pcs (pd.DataFrame): [M x PCs] principal component matrix
  • self.pc_samples (np.array): [M] sample IDs
Return type:None

Examples

>>> from limmbo.io import input
>>> import numpy as np
>>> import pandas as pd
>>> from numpy.random import RandomState
>>> from numpy.linalg import cholesky as chol
>>> random = RandomState(5)
>>> P = 2
>>> K = 4
>>> N = 10
>>> SNP = 1000
>>> pheno = random.normal(0,1, (N, P))
>>> pheno_samples = np.array(['S{}'.format(x+4)
...     for x in range(N)])
>>> phenotype_ID = np.array(['ID{}'.format(x+1)
...     for x in range(P)])
>>> phenotypes = pd.DataFrame(pheno, index=pheno_samples,
...     columns=phenotype_ID)
>>> X = (random.rand(N, SNP) < 0.3).astype(float)
>>> relatedness = np.dot(X, X.T)/float(SNP)
>>> relatedness_samples = np.array(['S{}'.format(x+1)
...     for x in range(N)])
>>> covariates = random.normal(0,1, (N-2, K))
>>> covs_samples = np.array(['S{}'.format(x+1)
...     for x in range(N-2)])
>>> indata = input.InputData(verbose=False)
>>> indata.addPhenotypes(phenotypes = pheno,
...                      pheno_samples = pheno_samples,
...                      phenotype_ID = phenotype_ID)
>>> indata.addRelatedness(relatedness = relatedness,
...                  relatedness_samples = relatedness_samples)
>>> indata.addCovariates(covariates = covariates,
...                      covs_samples = covs_samples)
>>> indata.covariates.shape
(8, 4)
>>> indata.phenotypes.shape
(10, 2)
>>> indata.relatedness.shape
(10, 10)
>>> indata.commonSamples(samplelist=["S4", "S6", "S5"])
>>> indata.covariates.shape
(3, 4)
>>> indata.phenotypes.shape
(3, 2)
>>> indata.relatedness.shape
(3, 3)
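
The sample matching in the example above can be traced with a plain-Python sketch: phenotypes cover S4 to S13, relatedness S1 to S10 and covariates S1 to S8, so five samples are shared, and the samplelist narrows these to three. `common_samples` is a hypothetical Python 3 helper; ordering the result by the first ID list is an assumption made here and not necessarily the order limmbo uses.

```python
def common_samples(*id_lists, samplelist=None):
    """Return IDs present in every list (and in samplelist, if given),
    ordered as in the first list. Illustrative sketch only."""
    common = set(id_lists[0]).intersection(*id_lists[1:])
    if samplelist is not None:
        common &= set(samplelist)
    return [s for s in id_lists[0] if s in common]


pheno_samples = ["S{}".format(i + 4) for i in range(10)]        # S4..S13
relatedness_samples = ["S{}".format(i + 1) for i in range(10)]  # S1..S10
covs_samples = ["S{}".format(i + 1) for i in range(8)]          # S1..S8

shared = common_samples(pheno_samples, relatedness_samples, covs_samples)
subset = common_samples(pheno_samples, relatedness_samples, covs_samples,
                        samplelist=["S4", "S6", "S5"])
print(shared)
print(subset)
```

Once every dataset is reindexed by the same ordered ID list, rows of phenotypes, covariates and relatedness refer to the same individuals.
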
getAlleleFrequencies()[source]

Compute allele frequencies of genotypes.

Returns:updated the following attributes of the InputData instance:
  • self.freqs (pandas DataFrame): [NrSNP x 2] matrix of alt and ref allele frequencies; index: snp IDs
Return type:None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io import reader
>>> from limmbo.io import input
>>> from limmbo.utils.utils import makeHardCalledGenotypes
>>> from limmbo.utils.utils import AlleleFrequencies
>>> data = reader.ReadData(verbose=False)
>>> file_geno = resource_filename('limmbo',
...                                'io/test/data/genotypes.csv')
>>> data.getGenotypes(file_genotypes=file_geno)
>>> indata = input.InputData(verbose=False)
>>> indata.addGenotypes(genotypes=data.genotypes,
...                     genotypes_info=data.genotypes_info,
...                     geno_samples=data.geno_samples)
>>> freqs = indata.getAlleleFrequencies()
>>> freqs.iloc[:5,:]
                   p         q
rs1601111   0.292186  0.707814
rs13270638  0.303581  0.696419
rs75132935  0.024295  0.975705
rs72668606  0.119091  0.880909
rs55770986  0.169338  0.830662
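
For genotypes coded as 0/1/2 allele counts, a pair of complementary frequencies per SNP can be computed as the mean count divided by two. The sketch below is a minimal Python 3 illustration under that assumption; `allele_frequencies` is hypothetical, and which allele the coding counts (and hence which column corresponds to p or q in the output above) is not stated in this documentation.

```python
def allele_frequencies(genotypes):
    """genotypes: list of per-sample rows, one 0/1/2 count per SNP.

    Returns one (counted, other) frequency pair per SNP.
    Illustrative sketch only, not limmbo's implementation.
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    freqs = []
    for j in range(n_snps):
        # Each sample carries two alleles, hence the factor 2.
        counted = sum(row[j] for row in genotypes) / (2.0 * n_samples)
        freqs.append((counted, 1.0 - counted))
    return freqs


toy_genotypes = [[0, 2], [1, 2], [2, 1], [1, 1]]  # 4 samples x 2 SNPs
toy_freqs = allele_frequencies(toy_genotypes)
print(toy_freqs)
```
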
regress()[source]

Regress out covariates (optional).

Returns:updated the following attributes of the InputData instance:
  • self.phenotypes (np.array): [M x P] phenotype matrix of residuals of linear model
  • self.covariates: None
Return type:None

Examples

>>> from limmbo.io import input
>>> import numpy as np
>>> from numpy.random import RandomState
>>> from numpy.linalg import cholesky as chol
>>> random = RandomState(5)
>>> P = 5
>>> K = 4
>>> N = 100
>>> pheno = random.normal(0,1, (N, P))
>>> pheno_samples = np.array(['S{}'.format(x+1)
...     for x in range(N)])
>>> phenotype_ID = np.array(['ID{}'.format(x+1)
...     for x in range(P)])
>>> covariates = random.normal(0,1, (N, K))
>>> covs_samples = np.array(['S{}'.format(x+1)
...     for x in range(N)])
>>> indata = input.InputData(verbose=False)
>>> indata.addPhenotypes(phenotypes = pheno,
...                      pheno_samples = pheno_samples,
...                      phenotype_ID = phenotype_ID)
>>> indata.addCovariates(covariates = covariates,
...                      covs_samples = covs_samples)
>>> indata.phenotypes.values[:3, :3]
array([[ 0.44122749, -0.33087015,  2.43077119],
       [ 1.58248112, -0.9092324 , -0.59163666],
       [-1.19276461, -0.20487651, -0.35882895]])
>>> indata.regress()
>>> indata.phenotypes.values[:3, :3]
array([[ 0.34421705, -0.01470998,  2.25710966],
       [ 1.69886647, -1.41756814, -0.55614649],
       [-1.10700674, -0.66017713, -0.22201814]])
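
Regressing out covariates means replacing each phenotype by the residuals of a linear model fitted on the covariates. The sketch below shows the idea for a single covariate using ordinary least squares in plain Python 3. `residualise` is hypothetical, and fitting an intercept is an assumption made here; limmbo handles [N x K] covariate matrices and its exact model is not described in this section.

```python
def residualise(y, x):
    """Residuals of y after least-squares regression on x with an
    intercept (single covariate; illustrative sketch only)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


covariate = [1.0, 2.0, 3.0, 4.0]
phenotype = [2.1, 3.9, 6.2, 7.8]
residuals = residualise(phenotype, covariate)
print(residuals)
```

By construction the residuals sum to zero and are uncorrelated with the covariate, which is what removing its linear effect amounts to.
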
standardiseGenotypes()[source]

Standardise genotypes:

\[w_{ij} = \frac{x_{ij} -2p_i}{\sqrt{2p_i (1-p_i)}}\]

where \(x_{ij}\) is the number of copies of the reference allele for the \(i\) th SNP of the \(j\) th individual and \(p_i\) is the frequency of the reference allele (as described in (Yang et al 2011)).

Returns:updated the following attributes of the InputData instance:
  • self.genotypes_sd (numpy array): [N x NrSNP] matrix of NrSNP standardised genotypes for N samples.
Return type:None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io import reader
>>> from limmbo.io import input
>>> from limmbo.utils.utils import makeHardCalledGenotypes
>>> from limmbo.utils.utils import AlleleFrequencies
>>> data = reader.ReadData(verbose=False)
>>> file_geno = resource_filename('limmbo',
...                                'io/test/data/genotypes.csv')
>>> data.getGenotypes(file_genotypes=file_geno)
>>> indata = input.InputData(verbose=False)
>>> indata.addGenotypes(genotypes=data.genotypes,
...                     genotypes_info=data.genotypes_info)
>>> geno_sd = indata.standardiseGenotypes()
>>> geno_sd.iloc[:5,:3]
             0         1       2
ID_1 -2.201123 -2.141970 -8.9622
ID_2 -2.201123 -2.141970 -8.9622
ID_3 -2.201123 -2.141970 -8.9622
ID_4  0.908627 -0.604125 -8.9622
ID_5 -0.646248 -2.141970 -8.9622
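
The standardisation formula can be checked by hand against the output above. In the sketch below, `standardise` is a hypothetical helper that applies the formula to a single genotype. The frequency 0.707814 is the q value reported for rs1601111 in the getAlleleFrequencies example; treating it as the reference allele frequency is an assumption, but with it the results for 0, 1 and 2 copies agree with the first column shown above to within rounding.

```python
import math


def standardise(x, p):
    """Apply w = (x - 2p) / sqrt(2p(1 - p)) to one genotype value x
    with reference allele frequency p (illustrative sketch only)."""
    return (x - 2.0 * p) / math.sqrt(2.0 * p * (1.0 - p))


p_ref = 0.707814  # assumed reference allele frequency for rs1601111
standardised = [standardise(copies, p_ref) for copies in (0, 1, 2)]
print(standardised)
```
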
subsetTraits(traitlist=None)[source]

Limit analysis to specific subset of traits

Parameters:traitlist (array-like) – array of trait numbers to select from phenotypes
Returns:updated the following attributes of the InputData instance:
  • self.traitlist (list): of [t] trait numbers (int) to choose for analysis
  • self.phenotypes (pd.DataFrame): reduced set of [N x t] phenotypes
  • self.phenotype_ID (np.array): reduced set of [t] phenotype IDs
Return type:None

Examples

>>> from pkg_resources import resource_filename
>>> from limmbo.io.reader import ReadData
>>> from limmbo.io.input import InputData
>>> from limmbo.io.utils import file_type
>>> data = ReadData(verbose=False)
>>> file_pheno = resource_filename('limmbo',
...                                'io/test/data/pheno.csv')
>>> data.getPhenotypes(file_pheno=file_pheno)
>>> traitlist = data.getTraitSubset(traitstring="1-3,5")
>>> indata = InputData(verbose=False)
>>> indata.addPhenotypes(phenotypes = data.phenotypes)
>>> print indata.phenotypes.shape
(1000, 10)
>>> print indata.phenotype_ID.shape
(10,)
>>> indata.subsetTraits(traitlist=traitlist)
>>> print indata.phenotypes.shape
(1000, 4)
>>> print indata.phenotype_ID.shape
(4,)
transform(transform)[source]

Transform phenotypes

Parameters:transform (string) –

transformation method for phenotype data:

  • scale: mean center, divide by sd
  • gaussian: inverse normalisation
Returns:updated the following attributes of the InputData instance:
  • self.phenotypes (np.array): [N x P] (transformed) phenotype matrix
Return type:None

Examples

>>> from limmbo.io import input
>>> import numpy as np
>>> from numpy.random import RandomState
>>> from numpy.linalg import cholesky as chol
>>> random = RandomState(5)
>>> P = 5
>>> K = 4
>>> N = 100
>>> pheno = random.normal(0,1, (N, P))
>>> pheno_samples = np.array(['S{}'.format(x+1)
...     for x in range(N)])
>>> phenotype_ID = np.array(['ID{}'.format(x+1)
...     for x in range(P)])
>>> SNP = 1000
>>> X = (random.rand(N, SNP) < 0.3).astype(float)
>>> relatedness = np.dot(X, X.T)/float(SNP)
>>> relatedness_samples = np.array(['S{}'.format(x+1)
...     for x in range(N)])
>>> indata = input.InputData(verbose=False)
>>> indata.addPhenotypes(phenotypes = pheno,
...                      pheno_samples = pheno_samples,
...                      phenotype_ID = phenotype_ID)
>>> indata.addRelatedness(relatedness = relatedness,
...                  relatedness_samples = relatedness_samples)
>>> indata.phenotypes.values[:3, :3]
array([[ 0.44122749, -0.33087015,  2.43077119],
       [ 1.58248112, -0.9092324 , -0.59163666],
       [-1.19276461, -0.20487651, -0.35882895]])
>>> indata.transform(transform='gaussian')
>>> indata.phenotypes.values[:3, :3]
array([[ 0.23799988, -0.11191464,  2.05785598],
       [ 1.41041953, -0.81365681, -0.92217818],
       [-1.55977999,  0.01240937, -0.62091817]])
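
The two transformation options can be sketched for a single trait with the Python 3 standard library (3.8 or later for `NormalDist`). Both functions are hypothetical illustrations: `scale` uses the population standard deviation, and `inverse_normal` maps ranks r to normal quantiles at r / (N + 1) without special handling of ties. These are common choices but assumptions here, so the values need not match limmbo's output exactly.

```python
from statistics import NormalDist, mean, pstdev


def scale(values):
    """Mean-centre and divide by the standard deviation."""
    m = mean(values)
    sd = pstdev(values)  # population sd; an assumption of this sketch
    return [(v - m) / sd for v in values]


def inverse_normal(values):
    """Rank-based inverse normal transformation (ties not averaged)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    nd = NormalDist()
    # Rank r is mapped to the standard normal quantile at r / (n + 1).
    return [nd.inv_cdf(r / (n + 1.0)) for r in ranks]


trait = [3.0, 1.0, 2.0]
scaled = scale(trait)
gaussianised = inverse_normal(trait)
print(scaled)
print(gaussianised)
```
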