MyGene.info moved to GRCh38/hg38 for human genes, with hg19 still supported


GRCh38 (also known as hg38) is the latest human genome assembly, released almost a year ago. MyGene.info now uses GRCh38 by default for human genes, but data (including queries) based on the previous assembly version (GRCh37/hg19) are still supported. Here are some details about this change:

  • genomic_pos field is now based on hg38

This field contains the genomic location of the given gene. The start/end positions are now based on hg38:

In [1]: mg.getgene('1017', fields='genomic_pos')
{'_id': '1017',
 'genomic_pos': {'chr': '12',
  'end': 55972784,
  'start': 55966769,
  'strand': 1}}

  • exons field is now based on hg38

This field contains the genomic locations of exons, as well as cdsstart/cdsend, txstart/txend data. All these values are now based on hg38:

In [2]: mg.getgene('1017', fields='exons')
{'_id': '1017',
 'exons': {'NM_001290230': {'cdsend': 55971625,
   'cdsstart': 55967008,
   'chr': '12',
   'exons': [[55966768, 55967124],
    [55968048, 55968169],
    [55968777, 55968948],
    [55971043, 55971247],
    [55971520, 55972789]],
   'strand': 1,
   'txend': 55972789,
   'txstart': 55966768},
  'NM_001798': {'cdsend': 55971625,
   'cdsstart': 55967008,
   'chr': '12',
   'exons': [[55966768, 55967124],
    [55967856, 55967934],
    [55968048, 55968169],
    [55968777, 55968948],
    [55969474, 55969576],
    [55971043, 55971247],
    [55971520, 55972789]],
   'strand': 1,
   'txend': 55972789,
   'txstart': 55966768},
  'NM_052827': {'cdsend': 55971625,
   'cdsstart': 55967008,
   'chr': '12',
   'exons': [[55966768, 55967124],
    [55967856, 55967934],
    [55968048, 55968169],
    [55968777, 55968948],
    [55971043, 55971247],
    [55971520, 55972789]],
   'strand': 1,
   'txend': 55972789,
   'txstart': 55966768}}}
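
  • genomic interval queries are now based on hg38

A query by genomic interval with no assembly prefix is now interpreted in hg38 coordinates:

In [3]: mg.query('chrX:151,073,054-151,383,976', species='human')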
{'hits': [{'_id': '100422930',
   '_score': 5.1352987,
   'entrezgene': 100422930,
   'name': 'microRNA 4330',
   'symbol': 'MIR4330',
   'taxid': 9606},
  {'_id': 'ENSG00000228717',
   '_score': 5.1352987,
   'symbol': 'AF013593.1',
   'taxid': 9606},
  {'_id': 'ENSG00000278724',
   '_score': 5.1352987,
   'name': 'Metazoan signal recognition particle RNA',
   'symbol': 'Metazoa_SRP',
   'taxid': 9606},
  {'_id': '9248',
   '_score': 5.1264653,
   'entrezgene': 9248,
   'name': 'G protein-coupled receptor 50',
   'symbol': 'GPR50',
   'taxid': 9606},
  {'_id': 'ENSG00000234696',
   '_score': 5.1264653,
   'name': 'GPR50 antisense RNA 1',
   'symbol': 'GPR50-AS1',
   'taxid': 9606},
  {'_id': 'ENSG00000269993',
   '_score': 5.111959,
   'symbol': 'AF003625.3',
   'taxid': 9606}],
 'max_score': 5.1352987,
 'took': 1202,
 'total': 6}
  • A new field genomic_pos_hg19 is added to hold the genomic location data based on hg19:
In [4]: mg.getgene('1017', fields='genomic_pos_hg19')
{'_id': '1017',
 'genomic_pos_hg19': {'chr': '12',
  'end': 56366568,
  'start': 56360553,
  'strand': 1}}
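
Because the two assemblies now live in separate fields, code that needs to work with either one can select the field by name. The following is a minimal sketch of that idea, not part of the service or its client: the helper name `genomic_pos_for` and the `POS_FIELDS` mapping are hypothetical, and the sample document is merged by hand from the outputs shown above.

```python
# Hypothetical helper; only the two field names come from the announcement.
POS_FIELDS = {"hg38": "genomic_pos", "hg19": "genomic_pos_hg19"}


def genomic_pos_for(gene_doc, assembly="hg38"):
    """Return the position dict for the requested assembly, or None if absent."""
    if assembly not in POS_FIELDS:
        raise ValueError(f"unsupported assembly: {assembly!r}")
    return gene_doc.get(POS_FIELDS[assembly])


# Gene document for '1017', combining the hg38 and hg19 outputs by hand.
doc = {
    "_id": "1017",
    "genomic_pos": {"chr": "12", "end": 55972784, "start": 55966769, "strand": 1},
    "genomic_pos_hg19": {"chr": "12", "end": 56366568, "start": 56360553, "strand": 1},
}

hg38_pos = genomic_pos_for(doc)           # default assembly is hg38
hg19_pos = genomic_pos_for(doc, "hg19")   # explicit request for hg19
print(hg38_pos["start"], hg19_pos["start"])
```

A document fetched with only one of the two fields simply yields `None` for the other assembly, so callers should request both fields if they intend to compare them.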

  • A new field exons_hg19 is added to hold the exons data based on hg19:
In [5]: mg.getgene('1017', fields='exons_hg19')
{'_id': '1017',
 'exons_hg19': {'NM_001290230': {'cdsend': 56365409,
   'cdsstart': 56360792,
   'chr': '12',
   'exons': [[56360552, 56360908],
    [56361832, 56361953],
    [56362561, 56362732],
    [56364827, 56365031],
    [56365304, 56366573]],
   'strand': 1,
   'txend': 56366573,
   'txstart': 56360552},
  'NM_001798': {'cdsend': 56365409,
   'cdsstart': 56360792,
   'chr': '12',
   'exons': [[56360552, 56360908],
    [56361640, 56361718],
    [56361832, 56361953],
    [56362561, 56362732],
    [56363258, 56363360],
    [56364827, 56365031],
    [56365304, 56366573]],
   'strand': 1,
   'txend': 56366573,
   'txstart': 56360552},
  'NM_052827': {'cdsend': 56365409,
   'cdsstart': 56360792,
   'chr': '12',
   'exons': [[56360552, 56360908],
    [56361640, 56361718],
    [56361832, 56361953],
    [56362561, 56362732],
    [56364827, 56365031],
    [56365304, 56366573]],
   'strand': 1,
   'txend': 56366573,
   'txstart': 56360552}}}
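
As a quick sanity check on the two exon fields, the exon lengths of a transcript should agree between assemblies even though the coordinates differ. The sketch below compares NM_001290230 using the values shown above. It assumes each `[start, end]` pair is a half-open interval (so the length is `end - start`), which the announcement does not state; `exon_lengths` is a hypothetical helper name.

```python
def exon_lengths(exons):
    """Length of each [start, end] pair, ASSUMING half-open intervals."""
    return [end - start for start, end in exons]


# Exon coordinates for NM_001290230, copied from the outputs above.
exons_hg38 = [[55966768, 55967124], [55968048, 55968169], [55968777, 55968948],
              [55971043, 55971247], [55971520, 55972789]]
exons_hg19 = [[56360552, 56360908], [56361832, 56361953], [56362561, 56362732],
              [56364827, 56365031], [56365304, 56366573]]

lengths_hg38 = exon_lengths(exons_hg38)
lengths_hg19 = exon_lengths(exons_hg19)

# Set of per-exon start offsets between the two assemblies for this transcript.
shifts = {h19[0] - h38[0] for h38, h19 in zip(exons_hg38, exons_hg19)}

print(lengths_hg38 == lengths_hg19, shifts)
```

For this transcript every exon is shifted by the same amount, but that is a property of this particular region. A fixed offset is not a general way to convert between assemblies.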

  • Genomic interval queries based on hg19 are still supported by adding the "hg19." prefix:

In [6]: mg.query('hg19.chrX:151,073,054-151,383,976', species='human')
{'hits': [{'_id': 'ENSG00000231937',
   '_score': 6.9943757,
   'symbol': 'RP11-329E24.6',
   'taxid': 9606},
  {'_id': '574412',
   '_score': 6.9943757,
   'entrezgene': 574412,
   'name': 'microRNA 452',
   'symbol': 'MIR452',
   'taxid': 9606},
  {'_id': 'ENSG00000228965',
   '_score': 6.9943757,
   'symbol': 'RP11-1007I13.2',
   'taxid': 9606},
  {'_id': '2564',
   '_score': 6.9620624,
   'entrezgene': 2564,
   'name': 'gamma-aminobutyric acid (GABA) A receptor, epsilon',
   'symbol': 'GABRE',
   'taxid': 9606},
  {'_id': '4109',
   '_score': 6.9620624,
   'entrezgene': 4109,
   'name': 'melanoma antigen family A, 10',
   'symbol': 'MAGEA10',
   'taxid': 9606},
  {'_id': '407009',
   '_score': 6.9609237,
   'entrezgene': 407009,
   'name': 'microRNA 224',
   'symbol': 'MIR224',
   'taxid': 9606},
  {'_id': 'ENSG00000229967',
   '_score': 6.9609237,
   'symbol': 'RP11-366F6.2',
   'taxid': 9606},
  {'_id': 'ENSG00000266560',
   '_score': 6.9609237,
   'symbol': 'RP11-1007I13.4',
   'taxid': 9606},
  {'_id': '2556',
   '_score': 6.8982496,
   'entrezgene': 2556,
   'name': 'gamma-aminobutyric acid (GABA) A receptor, alpha 3',
   'symbol': 'GABRA3',
   'taxid': 9606},
  {'_id': '4103',
   '_score': 6.8982496,
   'entrezgene': 4103,
   'name': 'melanoma antigen family A, 4',
   'symbol': 'MAGEA4',
   'taxid': 9606}],
 'max_score': 6.9943757,
 'took': 1095,
 'total': 12}
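
The two interval queries above differ only in the "hg19." prefix, so the query string can be built by a small helper. This is a sketch under that assumption: `interval_query` is a hypothetical function, not part of any client library, and it simply reproduces the string format used in the examples above.

```python
def interval_query(chrom, start, end, assembly="hg38"):
    """Build an interval query string like 'chrX:151,073,054-151,383,976'.

    hg38 is the default and takes no prefix; hg19 is selected with 'hg19.'.
    """
    prefixes = {"hg38": "", "hg19": "hg19."}
    if assembly not in prefixes:
        raise ValueError(f"unsupported assembly: {assembly!r}")
    if start > end:
        raise ValueError("start must not exceed end")
    # The ',' format spec inserts thousands separators, matching the examples.
    return f"{prefixes[assembly]}chr{chrom}:{start:,}-{end:,}"


q_hg38 = interval_query("X", 151073054, 151383976)
q_hg19 = interval_query("X", 151073054, 151383976, assembly="hg19")
print(q_hg38)
print(q_hg19)
```

The resulting string would then be passed to the query call as in the transcripts above, for example `mg.query(q_hg19, species='human')`.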

As a final note, this change affects human genes only, of course.