Skip Navigation
  Print Page

Biostatistics and Bioinformatics Branch (BBB)

Skip sharing on social media links
Share this:

Association analysis of complex diseases using triads, parent-child dyads and singleton monads

count_triad_dyad_monad.sas

%let pgm=cnt_triad_dyad_monad;
%*delvars; * delete all user macro variable. Need this for SAS reruns;

**************************************************************************************************************;
* *;
* PGM : cnt_triad_dyad_monad *;
* Path : C:\NICHD\Research\software\Fan\Joint_association_analysis\cnt_triad_dyad_monad.sas *;
* *;
* R Level : R 3.0.1 *;
* Operating System : Win 7 64 Enterprise *;
* SAS : SAS 9.3M1 64bit *;
* *;
**************************************************************************************************************;
* *;
%let Purpose : SAS and R program to count MONAD, DYAD and TRIAD biomarkers; *;
* Both SAS and R code are provided *;
* Although R code is run within SAS you can copy and paste the R code *;
* into the R Gui or run R *;
* *;
* Run Dependencies : Need Paper:Association Analysis of Complex Diseases Using Triads, for mapping *;
* The R code is completely independent of the SAS code and vice versa *;
* *;
* Functions/Macros/Packages *;
* *;
* Internal : *;
* *;
* External : R packages sqldf qnd reshape *;
* *;
* Notes : This codes handles an arbitrary number of SNP pairs without any changes. *;
* : All that is required is snp pairs be next to each other and the name begin with snp *;
* : ie snp11 snp12 snp21 snp22 snp31 snp32 .. snp9991 snp9992 or snpxxx snpyyy *;
* : DYAD,MONAD or TRIAD is determined by examining the number of family rows *;
* : Ranges fgor faily type is not used. *;
* : The R code at the end of this code is completely independent of SAS and can be *;
* : cut and pasted into the R console. *;
* : The only inputs are the simulation data and mappings from Fans paper. *;
* *;
* Program Flow R and SAS *;
* *;
* 1. Import simped.csv into a SAS dataset or R dataframe -edit missing col1 name to x *;
* 2. Create format lookups or 'fomat' dataframes in R *;
* 3. Determine the number of pairs (example has 2 but SAS program will hadle up to 999) *;
* 4. Drop unneeded columns and transpose *;
* 5. Determine if family group is TRIAD MONAD or DYAD (no need for ranges) *;
* 6. Create PARENT1, PARENT2 and CHILD from ID. ID=3 is CHILD *;
* 7. Stack SNP21 and SNP22 below SNP11 and SNP12 *;
* 8. Let SNP = 10*SNP#1 + SNP#2 where # is 1,2, create a grouping factor SNP 1 .. SNP999. *;
* 9. Transpose so PARENT1, PARENT2 and CHILD are on the same row *;
* 10. Concatenate PARENT1, PARENT2 and CHILD into up to six chars ie AeAaAA *;
* 11. Add lvl is s01, s02 and s03 by looking up concatenation *;
* 12. Tabulate frequency couts for each level ie SNP 1 DYAD n01 11 *;
* 13. Create all possibilites for SNP### Family type n## and poulate with 0s *;
* We need to do this to fill in combinations missing in the data with 0s *;
* 14. Fill in missing combinations with 0s ie SNP 1 DYAD m01 0 *;
* 15. Transpose by family tpe and ANP group using lvl to get names *;
* 16. Output TRIAD MONAD or DYAD csv files *;
* *;
* Files *;
* *;
%let Input1 = C:\NICHD\Research\software\Fan\Joint_association_analysis\Count_Triad_Dyad_Monad\simped.csv; *;
* Column one in the csv did not have a heading so I added the label x *;
%let Input2 = Mapping from Fans Paper: Association Analysis of Complex Diseastes Using Triads..; *;
* This mapping is hardcoded in both SAS and R *;
* 'AAAAAA' = 'n1' *;
* 'AAAaAA' = 'n2' *;
* 'AAAaAa' = 'n3' *;
* *;
%let Output1 = C:\NICHD\Research\software\Fan\Joint_association_analysis\Count_Triad_Dyad_Monad\sas_out\dyad.csv; *;
%let Output2 = C:\NICHD\Research\software\Fan\Joint_association_analysis\Count_Triad_Dyad_Monad\sas_out\monad.csv; *;
%let Output3 = C:\NICHD\Research\software\Fan\Joint_association_analysis\Count_Triad_Dyad_Monad\sas_out\triad.csv; *;
%let Output4 = C:\NICHD\Research\software\Fan\Joint_association_analysis\Count_Triad_Dyad_Monad\cnt_triad_dyad_monad.R; *;
* *;
* *;
* *;
* Example Output MONAD case: *;
* *;
* s0 s1 s2 S *;
* *;
* SNP1 2 0 0 2 *;
* SNP2 0 1 1 2 *;
* *;
* *;
* *;
* Functions/Macros Assumptions : *;
* *;
* Functions/Macros Parameters : *;
* *;
**************************************************************************************************************;
* *;
* Version History : *;
* *;
* Version Date Programmer Description *;
* ------- --------- ---------- ----------- *;
* 1.0 19SEP2013 deangelisrj creation *;
* 2.0 26SEP2013 deangelisrj R is no longer dependent on SAS (programs are independent) *;
* 3.0 08NOV2013 deangelisrj Changes SNP001 and SNP002 to SNP1 and SNP2 *;
* *;
**************************************************************************************************************;

options validvarname=upcase;

proc datasets library=work kill;
run;quit;

/*--------------------------------------------------------------*\
| |
| Convert csv new simulation data into a SAS dataset |
| |
\*--------------------------------------------------------------*/

* note I have manually edited input to label the R rownames column X;
dm "dimport '&input1' &pgm._simped replace";

/*
Up to 40 obs from WORK.SIMPED

Obs X FAMID ID DADID MOMID GENDER SNP11 SNP12 SNP21 SNP22

1 1 1.1 1 0 0 1 2 2 1 2
2 2 1.1 2 0 0 2 2 2 1 1
3 3 1.1 3 1 2 1 2 2 1 2
4 4 2.1 1 0 0 1 2 2 2 1
5 5 2.1 2 0 0 2 2 2 1 1
6 6 2.1 3 1 2 1 2 2 1 2
7 7 3.1 1 0 0 1 1 1 2 2
8 8 3.1 2 0 0 2 1 2 1 1
9 9 3.1 3 1 2 1 1 1 1 2
10 10 4.1 1 0 0 1 2 2 1 1
11 11 4.1 2 0 0 2 1 1 1 1
*/

*In simped.csv, we have;
*(1) 200 triad families: Famid is between 1.1 and 200.1;
*(2) 12 dyad families with missing dad: Famid is between 201.1 and 212.1;
*(3) 13 dyad families with missing mom: Famid is between 213.1 and 225.1;
*(4) 25 monads: Famid is between 226.1 and 250.1;

/*--------------------------------------------------------------*\
| |
| Handy dataframe or SAS datset to recode family markers |
| |
\*--------------------------------------------------------------*/

proc sql;
create
table &pgm._map ( str char(6), lvl char(3) );
insert into &pgm._map
values('AAAAAA','n01')
values('AAAaAA','n02')
values('AAAaAa','n03')
values('AAaaAa','n04')
values('AaAaAA','n05')
values('AaAaAa','n06')
values('AaAaaa','n07')
values('AaaaAa','n08')
values('Aaaaaa','n09')
values('aaaaaa','n10')
values('AaAAAA','n02')
values('AaAAAa','n03')
values('aaAAAa','n04')
values('aaAaAa','n08')
values('aaAaaa','n09')
values('AAAA' ,'m01')
values('AAAa' ,'m02')
values('AaAA' ,'m03')
values('AaAa' ,'m04')
values('Aaaa' ,'m05')
values('aaAa' ,'m06')
values('aaaa' ,'m07')
values('AA' ,'s01')
values('Aa' ,'s02')
values('aa' ,'s03')
;quit;

proc format;
value num2par
1='Parent1'
2='Parent2'
3='Child'
;
value num2two
11 = 'AA'
12,21 = 'Aa'
22 = 'aa'
;
run;

/*--------------------------------------------------------------*\
| |
| Determine the number of pairs (example has two) |
| |
\*--------------------------------------------------------------*/

proc sql;
select
count(*)/2 into :numduo separated by ''
from
sashelp.vcolumn
where
libname = 'WORK' and
memname = upcase("&pgm._simped") and
substr(name,1,3) = 'SNP'
;quit;

%put RESULT: The number of pairs are numduo = &numduo;

* RESULT: The number of pairs are numduo = 2;

/*--------------------------------------------------------------*\
| |
| Determine if family group is TRIAD MONAD or DYAD |
| This is a more general solution then using ranges |
| I can use ranges if that is what is needed |
| |
| The subselecet adds the family size to the input dataset |
| |
\*--------------------------------------------------------------*/

proc sql;
create
table &pgm._famtyp (keep=famid famtyp fammem snp:) as
select
case (r.cntrec)
when ( 1 ) then 'MONAD'
when ( 2 ) then 'DYAD'
when ( 3 ) then 'TRIAD'
else 'ERROR'
end as famtyp
,put(l.id,num2par.) as fammem
,l.*
from
&pgm._simped as l left join /* selecet below adds family size to each family */
(select famid, count(famid) as cntrec from &pgm._simped group by famid) as r
on
l.famid = r.famid
order
by famid, id
;quit;

/*
Up to 40 obs from WORK.cnt_triad_dyad_monad_FAMTYP

Obs FAMTYP FAMMEM FAMID SNP11 SNP12 SNP21 SNP22

1 TRIAD Parent1 1.1 2 2 1 2
2 TRIAD Parent2 1.1 2 2 1 1
3 TRIAD Child 1.1 2 2 1 2
4 TRIAD Parent1 2.1 2 2 2 1
5 TRIAD Parent2 2.1 2 2 1 1
6 TRIAD Child 2.1 2 2 1 2
7 TRIAD Parent1 3.1 1 1 2 2
*/

/*--------------------------------------------------------------*\
| |
| Generlalize to handle many SNPs not just two like the example.|
| Cycle through pairs until all pairs are processed. |
| |
| An arary is used to hnle as all the snp pairs. |
| |
| Stack the pairs |
| Create SNP group name SNP 1 SNP 2 SNP 3 ... SNP999 |
| |
| Handles up to 999 pairs |
| |
\*--------------------------------------------------------------*/

data &pgm._datfix (keep=famid famtyp fammem snpgrp snp );
set &pgm._famtyp (keep= famid famtyp fammem snp:);
* snp: is all variables that begin with snp;
* put them in an array SAS arrays can be up to 4gb;
array arysnp[*] snp: ; * snp11 snp12 snp21 snp22 snp31 snp32..;
do idx=0 to &numduo by 2;
snp = put(arysnp[idx+1]*10+ arysnp[idx+2],num2two.); * one number from two 1 2 = 12;
snpgrp=cats(substr(vname(arysnp[idx+1]),1,3),put((1+idx/2),3.)); * creates SNP1-SNP2E9;
output;
end;
run;

/*
Up to 40 obs from WORK.cnt_triad_dyad_monad_DATFIX

Obs FAMTYP FAMMEM FAMID SNP SNPGRP

1 TRIAD Parent1 1.1 aa SNP 1
2 TRIAD Parent1 1.1 Aa SNP 2
3 TRIAD Parent2 1.1 aa SNP 1
4 TRIAD Parent2 1.1 AA SNP 2
5 TRIAD Child 1.1 aa SNP 1
6 TRIAD Child 1.1 Aa SNP 2
7 TRIAD Parent1 2.1 aa SNP 1
8 TRIAD Parent1 2.1 Aa SNP 2
*/

proc sort data=&pgm._datfix out=&pgm._datsrt;
by snpgrp famtyp famid;
run;

/*--------------------------------------------------------------*\
| |
| Transpose so PARENTS and CHILD are across for easy concat |
| of genetic codes ie TRIAD may yeild AaAAaa |
| |
\*--------------------------------------------------------------*/

proc transpose data=&pgm._datsrt out=&pgm._datxpo(drop=_name_);
by snpgrp famtyp famid;
id fammem;
var snp;
run;

/*
Up to 40 obs from WORK.CLP_PGMSIM_DATXPO

Up to 40 obs from WORK.cnt_triad_dyad_monad_DATXPO

Obs SNPGRP FAMTYP FAMID PARENT2 CHILD PARENT1

1 SNP 1 DYAD 201.1 Aa Aa
2 SNP 1 DYAD 202.1 Aa AA
3 SNP 1 DYAD 203.1 Aa Aa
4 SNP 1 DYAD 204.1 aa aa
5 SNP 1 DYAD 205.1 Aa Aa
6 SNP 1 DYAD 206.1 aa aa
7 SNP 1 DYAD 207.1 aa Aa
8 SNP 1 DYAD 208.1 aa Aa
9 SNP 1 DYAD 209.1 Aa aa
10 SNP 1 DYAD 210.1 Aa Aa
11 SNP 1 DYAD 211.1 aa aa
12 SNP 1 DYAD 212.1 aa aa
13 SNP 1 DYAD 213.1 Aa aa
14 SNP 1 DYAD 214.1 aa aa
15 SNP 1 DYAD 215.1 aa aa
16 SNP 1 DYAD 216.1 aa aa

*/

/*--------------------------------------------------------------*\
| |
| Concatenate the pairs |
| Join with mapping to get levels ie s1, s2 or s3 |
| |
\*--------------------------------------------------------------*/

proc sql;
create
table &pgm._mkelvl as
select
r.snpgrp
,r.famtyp
,l.lvl
from
&pgm._map as l full outer join &pgm._datxpo as r
on
cats(r.parent1,parent2,child) = str
order
by snpgrp, famtyp, lvl
;quit;

/*
Up to 40 obs from WORK.cnt_triad_dyad_monad_MKELVL

Obs SNPGRP FAMTYP LVL

1 SNP 1 DYAD m03
2 SNP 1 DYAD m04
3 SNP 1 DYAD m04
4 SNP 1 DYAD m04
5 SNP 1 DYAD m04
6 SNP 1 DYAD m04
7 SNP 1 DYAD m04
*/

/*--------------------------------------------------------------*\
| |
| Tabulate the frequencies by snpgrp and level |
| |
\*--------------------------------------------------------------*/

proc freq data=&pgm._mkelvl;
tables snpgrp*famtyp*lvl/missing list out=&pgm._mkefrq(drop=percent);
run;

/*
Up to 40 obs from WORK.cnt_triad_dyad_monad_MKEFRQ
Obs SNPGRP FAMTYP LVL COUNT

1 SNP 1 DYAD m03 1
2 SNP 1 DYAD m04 8
3 SNP 1 DYAD m05 3
4 SNP 1 DYAD m06 5
5 SNP 1 DYAD m07 8
6 SNP 1 MONAD s01 3
7 SNP 1 MONAD s02 6
*/

/*--------------------------------------------------------------*\
| |
| Create a shell dataset the has all possible snpgrp x lvl cases|
| So we can fill in the missing combinations with 0 |
| |
\*--------------------------------------------------------------*/

data &pgm._zro(keep=snpgrp famtyp lvl cnt);
set &pgm._map;
do idx=1 to 2;
snpgrp=cats('SNP',put(idx,3.));
select (substr(lvl,1,1));
when ('n') famtyp='TRIAD';
when ('m') famtyp='DYAD';
when ('s') famtyp='MONAD';
end;
cnt=0;
output;
end;
run;

/*
* note all counts are 0
Up to 40 obs from WORK.cnt_triad_dyad_monad_ZRO

Obs LVL SNPGRP FAMTYP CNT

1 n01 SNP 1 TRIAD 0
2 n01 SNP 2 TRIAD 0
3 n02 SNP 1 TRIAD 0
4 n02 SNP 2 TRIAD 0
5 n03 SNP 1 TRIAD 0
6 n03 SNP 2 TRIAD 0
7 n04 SNP 1 TRIAD 0
8 n04 SNP 2 TRIAD 0
9 n05 SNP 1 TRIAD 0
10 n05 SNP 2 TRIAD 0
*/

/*--------------------------------------------------------------*\
| |
| Add in the zero counts to fill in all possibilities |
| if count is missing in freq output then 0 is used |
| |
| distinct is needed because TRIAD has multiples ie n02 |
| |
\*--------------------------------------------------------------*/

proc sql;
create
table &pgm._lftzro as
select
distinct
l.snpgrp
,l.famtyp
,l.lvl
,coalesce(r.count,l.cnt) as cntfin
from
&pgm._zro as l left join &pgm._mkefrq as r
on
l.snpgrp = r.snpgrp and
l.famtyp = r.famtyp and
l.lvl = r.lvl
order
by snpgrp, famtyp, lvl
;quit;

/*
Up to 40 obs from WORK.cnt_triad_dyad_monad_LFTZRO

Obs SNPGRP FAMTYP LVL CNTFIN

1 SNP 1 DYAD m01 0
2 SNP 1 DYAD m02 0
3 SNP 1 DYAD m03 1
4 SNP 1 DYAD m04 8
5 SNP 1 DYAD m05 3
6 SNP 1 DYAD m06 5
7 SNP 1 DYAD m07 8
8 SNP 1 TRIAD n01 1
*/

/*--------------------------------------------------------------*\
| |
| Sum the counts by snp group and family type |
| |
\*--------------------------------------------------------------*/

data &pgm._sum(drop=sum);
retain sum 0;
set &pgm._lftzro;
by snpgrp famtyp;
output;
sum=sum+cntfin;
if last.famtyp then do;
lvl='sum';
cntfin=sum;
output;
sum=0;
end;
run;

/*--------------------------------------------------------------*\
| |
| Transpose each family type to create final SAS datasets |
| MONAD, DYAD and TRIAD datasets |
| |
| Output CSV files |
| |
\*--------------------------------------------------------------*/

%macro famtyp(typ=MONAD,out=1)
/des="Create separate MONAD, DYAD and TRIAD datasets";

proc transpose data=&pgm._sum(where=(famtyp="&typ"))
out=&pgm._&typ(drop=_name_);
by famtyp snpgrp;
id lvl;
var cntfin;
run;

/*
All obs from cnt_triad_dyad_monad_monad

Obs FAMTYP SNPGRP S01 S02 S03

1 MONAD SNP 1 3 6 16
2 MONAD SNP 2 19 5 1
*/

dm "dexport &pgm._&typ '&&output&out' replace";

%mend famtyp;

%famtyp(typ=DYAD ,out=1);
%famtyp(typ=MONAD,out=2);
%famtyp(typ=TRIAD,out=3);

Last Updated Date: 11/21/2013
Last Reviewed Date: 11/21/2013

Contact Information

Name: Dr Paul Albert
Chief and Senior Investigator
Biostatistics and Bioinformatics Branch
Phone: 301-496-5582
E-mail: albertp@mail.nih.gov

Staff Directory
Vision National Institutes of Health Home BOND National Institues of Health Home Home Storz Lab: Section on Environmental Gene Regulation Home Machner Lab: Unit on Microbial Pathogenesis Home Division of Intramural Population Health Research Home Bonifacino Lab: Section on Intracellular Protein Trafficking Home Lilly Lab: Section on Gamete Development Home Lippincott-Schwartz Lab: Section on Organelle Biology