HyPhy message board
HYPHY Package >> HyPhy feedback >> Maximum Sequence Length
http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1115162813
Message started by adk on May 3^rd, 2005 at 4:26pm

Title: Maximum Sequence Length
Post by adk on May 3^rd, 2005 at 4:26pm

Hi there,

I'm trying to estimate parameters on an extremely big dataset (~154 megabases), but I get a memory full error:

** malloc: vm_allocate(size=160923648) failed (error code=3)
*** malloc[18470]: error: Can't allocate region

the errors.log file looks like this:

Error:
Memory Full Exiting...
Current BL Command:Read Data Set myData from file "../../../data/simulans/syntenic/pairwise/melw501.auto.nonCDS.cat.fa"

I am guessing that this dataset is too big for the default compilation of HYPHY. Is this true? If so, how could I compile it to handle this big a dataset?

cheers,
Andy

Title: Re: Maximum Sequence Length
Post by Sergei on May 3^rd, 2005 at 4:37pm

Dear Andy,

adk wrote on May 3^rd, 2005 at 4:26pm:

I am guessing that this dataset is too big for the default compilation of HYPHY. Is this true? If so, how could I compile it to handle this big a dataset?

Never tried something quite this large. On Mac OS X there is nothing one can do to adjust the maximum memory size for a process; 4GB is obviously the limit for a 32-bit system (non-G5). If I recall correctly, OS X also limits per-process memory allocation to something like 1.5 GB. What is your computer configuration like?

How many sequences do you have? I can try to generate some random data with the same length and same data format and see where the memory allocation error happens.
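A minimal sketch of how such random test data could be generated, assuming plain FASTA input like the `.fa` file in the error log. The function names, the `seqN` labels, and the 60-character line width are illustrative choices, not anything HyPhy requires.

```python
import random


def write_random_fasta(path, n_seqs, length, seed=0, line_width=60):
    """Write n_seqs random nucleotide sequences of the given length to a FASTA file."""
    rng = random.Random(seed)
    with open(path, "w") as out:
        for i in range(n_seqs):
            out.write(f">seq{i + 1}\n")
            remaining = length
            # Emit the sequence in fixed-width lines so memory use stays small.
            while remaining > 0:
                chunk = min(line_width, remaining)
                out.write("".join(rng.choices("ACGT", k=chunk)) + "\n")
                remaining -= chunk


def read_fasta_lengths(path):
    """Return {name: sequence length} without keeping the sequences in memory."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

Calling `write_random_fasta("test.fa", 2, 77_000_000)` would approximate the two-sequence case described in this thread; starting with a much smaller `length` and scaling up is a cheaper way to find the point where reading fails.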

To be fair, I never really optimized the data reading module to use the least amount of memory; perhaps I should make a code revision to improve the memory footprint.

Cheers,
Sergei

Title: Re: Maximum Sequence Length
Post by adk on May 3^rd, 2005 at 4:44pm

Hey,

So the dataset is only two sequences, each about 77 megabases long. The computer setup is a Dual G5 Xserve. How much memory should I need to open the data set? I'm taking it there is no hard coded limit in HYPHY then?

cheers,

Andy

Title: Re: Maximum Sequence Length
Post by adk on May 3^rd, 2005 at 4:50pm

btw, our sysadmin tells me that that server should be able to use 4 GB per process. If the data file is 149 MB, is it possible that it is taking up over 4 GB of RAM?
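A back-of-the-envelope sketch for this question. The per-character cost of HyPhy's reader is not stated anywhere in the thread, so `bytes_per_char` is a hypothetical parameter; only the sequence count, the approximate length, and the failed allocation size come from the posts.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def estimated_bytes(n_seqs, sites, bytes_per_char):
    """Memory needed if every loaded character costs bytes_per_char bytes."""
    return n_seqs * sites * bytes_per_char


def bytes_per_char_to_reach(limit_bytes, n_seqs, sites):
    """Per-character cost at which the loaded data would reach limit_bytes."""
    return limit_bytes / (n_seqs * sites)


# Size of the single request that failed, from the malloc message in the first post.
failed_request = 160923648
print(failed_request / MIB)  # 153.46875

# Two sequences of roughly 77 megabases each, at one byte per character.
print(estimated_bytes(2, 77_000_000, 1) / MIB)

# Overhead per character that would be needed to hit a 4 GiB ceiling.
print(bytes_per_char_to_reach(4 * GIB, 2, 77_000_000))
```

The failed request alone is about 153.5 MiB, close to the size of the file. Reaching 4 GiB in total would take a little under 28 bytes per character, so an inefficient in-memory representation could plausibly get there, but the thread does not establish that it did.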

cheers,
Andy

Title: Re: Maximum Sequence Length
Post by Sergei on May 3^rd, 2005 at 4:53pm

Dear Andy,

adk wrote on May 3^rd, 2005 at 4:44pm:

So the dataset is only two sequences, each about 77 megabases long. The computer setup is a Dual G5 Xserve. How much memory should I need to open the data set? I'm taking it there is no hard coded limit in HYPHY then?

There is no hard coded limit. I just looked at the data reader code; my latest revision was optimized for speed of reading MSA (more than 2 sequences); the current incarnation is actually very memory inefficient for 2 long sequences.

I am actually doing something that will require me to read genome-size pairs of sequences, so I'm probably going to optimize memory usage for your scenario; stay tuned - should have more for you in a couple of days.

Cheers,
Sergei

Title: Re: Maximum Sequence Length
Post by adk on May 3^rd, 2005 at 5:07pm

Excellent, Sergei! HYPHY is at the core of the phylogenetic analysis we are currently performing for an upcoming Drosophila genomics paper, so this addition would be extremely helpful!
cheers,
Andy

Title: Re: Maximum Sequence Length
Post by Simon on May 4^th, 2005 at 11:25am

Dear Andy,

Are you looking at comparisons between large numbers of orthologous genes? If so, what kinds of things are you looking at? I ask, as we're trying to develop various methods for looking at dN/dS comparisons at a genome-wide level, similar to the work that Andy Clark and Rasmus Nielsen did on the Celera human/chimp data, which may come in handy. Feel free to email me or Sergei if you'd like to be a beta-tester.

Best,
Simon

Title: Re: Maximum Sequence Length
Post by Sergei on May 4^th, 2005 at 5:51pm

Quote:

I'm probably going to optimize memory usage for your scenario; stay tuned - should have more for you in a couple of days.

I have rewritten some old parts of the code which were never designed with very long data sets in mind :-[. I was able to read in a 30 megabase human-chimp CDS with the new code (May 4th, 2005 build) in about 30 seconds and run nucleotide and codon model fits on it quickly, with about 250 MB peak memory consumption. Give today's build a spin and let me know if it works...

Cheers,
Sergei

Title: Re: Maximum Sequence Length
Post by avilella on Nov 30^th, 2005 at 10:20am

(In reply to Simon's post of May 4^th, 2005 at 11:25am.)

Hi Simon and Sergei,

I stumbled upon this post today and wanted to say that I am
certainly doing that kind of comparison in my research, and am
enormously interested in being a beta-tester of your methods.

The timing is perfect for my research, so I am really
interested in getting into this as soon as possible.

Looking forward to hearing from you,

Bests,

Albert.

Title: Re: Maximum Sequence Length
Post by avilella on Dec 22^nd, 2005 at 2:38pm

Hi all,

I have an input file of 10 seqs x 4090497 sites, on which I am trying to run a free-ratios/local model codon analysis.

After spending some hours consuming ~1 GB of RAM, it stops without giving any results and without much indication in the logs of what happened.

Is that kind of file meant to be analysed without problem? Any idea of what might be happening?


Title: Re: Maximum Sequence Length
Post by Sergei on Dec 22^nd, 2005 at 3:51pm

Dear Albert,

Good catch. Some old experimental code related to sorting the order of alignment columns (which should have been de-activated) was choking on a large codon data set (with loads of unique codon patterns). I'll fix it in the next build; in the meantime change line 7496 in likefunc.cpp from

[code]
checkParameter (useFullMST,kp,1.0);
[/code]

to

[code]
checkParameter (useFullMST,kp,0.0);
[/code]

recompile and try again.
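To gauge how many unique codon patterns a data set has before running an analysis, a minimal sketch like the following may help. It treats each codon column of the alignment as one pattern and counts the distinct ones; this illustrates the general idea only and is not HyPhy's own code, and the function name and toy alignment are made up for the example.

```python
from collections import Counter


def codon_column_patterns(seqs):
    """Count each distinct codon column across an alignment.

    seqs is a list of equal-length, in-frame nucleotide strings. Column i
    is the tuple holding the i-th codon of every sequence.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")
    return Counter(
        tuple(s[i:i + 3] for s in seqs) for i in range(0, length, 3)
    )


# Toy alignment: 3 sequences, 4 codon columns, two of which are identical.
alignment = ["ATGAAAATGCCC", "ATGAAGATGCCC", "ATGAAAATGCCA"]
patterns = codon_column_patterns(alignment)
print(len(patterns), "unique patterns in", sum(patterns.values()), "codon columns")
# prints: 3 unique patterns in 4 codon columns
```

For the file described above, 4090497 nucleotides per sequence correspond to 1363499 codon columns, which is the upper bound on the number of unique patterns.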

Cheers,
Sergei

P.S. Yes, I am still putting together multi-gene analyses. They were originally written for 2 sequences only and the modification to more than 2 is a bit tedious :(