codinghood

Wednesday, April 9, 2008

Sometimes

How many times have you needed to back away from assumptions you have made?
This is not really the moment to take that story any further. But while browsing bash I found something that made me laugh:

<Khassaki> HI EVERYBODY!!!!!!!!!!
<Judge-Mental> try pressing the the Caps Lock key
<Khassaki> O THANKS!!! ITS SO MUCH EASIER TO WRITE NOW!!!!!!!
<Judge-Mental> fuck me

Wednesday, April 2, 2008

Tracing the BEAST

I have spent some time analyzing the BEAST code. I started by describing the packages in UML, then moved on to analyzing the business logic.
Let me introduce you to BEAST's configurator. It can be found in dr.xml.
The configuration of a BEAST simulation is given by an XML file. Each XML node has its realization in Java code.
BEAST uses the builder pattern (or maybe it's better to say a builder of builders) for the initial software setup. However, this is a data-driven, distributed variation on the pattern ;).
As you can see in the picture, two classes are marked yellow.

XMLParser is responsible for parsing the XML file with the configuration. It takes the DOM model of that file as input. The DOM tree is traversed in post-order. To parse each node of the description, BEAST uses instances of multiple AbstractXMLObjectParser specializations (these builders can set up instances of the classes holding the business logic).
And at this point I can explain what I mean by distributed: most of the specializations of AbstractXMLObjectParser are anonymous classes which can be found in the model classes.
To see the whole set of object parsers you can always look into the BeastParser.setup() method.

The Reference class is the realization of the idref concept from the setup file.
Each parsed node is stored (as a new node realization instance) in an object registry and/or added to its parent node. When such an object is referred to in other parts of the configuration, it is wrapped in a Reference and returned to the calling node. That means we can refer to objects which were built earlier.
E.g. when the parsing process reaches the mcmc node, it looks for the builder responsible for building the class which implements the MCMC process (I am skipping the building of the mcmc node's child nodes). At this point every referred object is already constructed, so it is taken from the registry and used to construct the MCMC implementation.
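The scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not BEAST's code: the registry `PARSERS`, the decorator `parser_for`, the builder functions and the attribute values in `CONFIG` are all made up for the example, and referenced objects are returned directly instead of being wrapped in a Reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: element name -> builder function.
# In BEAST the builders are AbstractXMLObjectParser specializations;
# here they are plain functions taking (attributes, built children).
PARSERS = {}


def parser_for(name):
    """Register a builder for the given element name."""
    def register(func):
        PARSERS[name] = func
        return func
    return register


@parser_for("beast")
def build_root(attrs, children):
    return children


@parser_for("parameter")
def build_parameter(attrs, children):
    return {"kind": "parameter", "value": float(attrs["value"])}


@parser_for("mcmc")
def build_mcmc(attrs, children):
    return {"kind": "mcmc",
            "chainLength": int(attrs["chainLength"]),
            "parts": children}


def parse(element, store):
    """Build the object for one XML element, children first (post-order)."""
    # A node carrying idref is not built again; it is looked up in the store.
    if "idref" in element.attrib:
        return store[element.attrib["idref"]]
    children = [parse(child, store) for child in element]
    obj = PARSERS[element.tag](element.attrib, children)
    # A node carrying id is remembered so that later nodes can refer to it.
    if "id" in element.attrib:
        store[element.attrib["id"]] = obj
    return obj


# Made-up configuration, just enough to show id/idref resolution.
CONFIG = """
<beast>
  <parameter id="kappa" value="2.0"/>
  <mcmc id="mcmc" chainLength="1000">
    <parameter idref="kappa"/>
  </mcmc>
</beast>
"""

store = {}
parse(ET.fromstring(CONFIG), store)
# The mcmc object received the very same parameter object built earlier.
print(store["mcmc"]["parts"][0] is store["kappa"])
```

Because the traversal is post-order, the builder for mcmc only runs once everything below it has been built or fetched from the store, which is why an idref can only point at something that appears earlier in the file.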

So the formula for extending BEAST is simple.

define the XML for your contribution

provide a builder for your implementation

implement your contribution

I don't want to go into much detail about this engine (for example, how the simulation is started after construction, or how the XML description is validated), so if you have any questions or suggestions (as I only interpret the code, I could have misunderstood some concepts), leave a comment.
Next time I will show you a simple recipe for adding a new tree mutator to BEAST.

cheers,

Friday, March 28, 2008

Articulate - kids don't use it at home ;)

Well, sometimes there is a need to share some knowledge with others. Sometimes you want to get some knowledge from others. And finally, sometimes it is time to do both at once.
As we are interested in how non-functional requirements influence software architecture, we decided to ask some specialists. As it should go nice and smooth, we decided to use M$ Office and the Articulate & Quizmaker tooling. In theory it enables people to build a nice multimedia Flash presentation from a PowerPoint one.
Well only in theory :P .
Problems you can face using PowerPoint 2007 and Articulate:

clip art doesn't scale well

relative distances between some elements of a slide differ from the original ones

there are some additional signs at the end of random lines of list elements

Well, you can of course live with it, but what can happen when you decide to add quiz support (Quizmaker)?

each new quiz slide destroys your animation synchronizations.

And as Quizmaker is intended for building surveys, how should collecting user e-mail addresses look? I thought it would be nice and easy.
There are two options to get text from the user:

short text - and I mean it's really short - 24 characters, so some addresses won't fit

or essay - you can limit the number of characters, but the text box takes up the whole page, which is ridiculous for e-mail inputs

Perhaps all these problems are my fault. Can anybody explain how to get rid of at least some of them?

Wednesday, March 19, 2008

BEAST code reverse engineering - in progress

As I said recently, there is no perfect tool (and this is only my private opinion) which can do all the class reverse engineering for me. But having tried a dozen of them, I found one which works quite well. Omondo can reverse engineer Java 1.5 code and, as stated on its features page, can do much more. Anyway, I found it useful. Don't get me wrong, I really like the idea of using EMP and MoDisco, but I don't have time right now.
What is funny, not only people here at Frankfurt Uni are interested in BEAST's architecture (or rather design) (just look at this thread), so maybe it is worth publishing my findings after all.

Wednesday, March 12, 2008

PERL for fast and ugly scripts :)

Yup. Example below: a data generator for the dishonest casino example from "Biological Sequence Analysis". Yes, I know the states are hard to change (they are hard-coded) :). I am also putting this code here in the hope of an answer to this question: how do you produce nice colored HTML output from Eclipse? Eclipse -> Open office -> blogger is quite a disappointing solution.



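# Data generator for the dishonest casino example.
# Arguments: output file name, number of rolls to generate.
use strict;
use warnings;

my ($outputName, $rolls) = @ARGV;
die "usage: $0 OUTPUT_FILE NUMBER_OF_ROLLS\n" unless defined $rolls;

# Faces of each die; the loaded die shows a six half of the time.
my @fair   = (1, 2, 3, 4, 5, 6);
my @loaded = (1, 2, 3, 4, 5, 6, 6, 6, 6, 6);

my %model = (
    "f" => \@fair,
    "l" => \@loaded,
);

# Probability of switching to the other die after a roll.
my %change = (
    "f" => .05,
    "l" => .1,
);

# Pick the starting die at random; $next is the die we would switch to.
my ($actual, $next) = rand() > .5 ? ("l", "f") : ("f", "l");

open(my $out, '>', $outputName) or die "file $outputName cannot be opened: $!\n";

for (my $i = 0; $i < $rolls; ++$i) {
    my @faces = @{ $model{$actual} };
    # Each line: the rolled face, then the hidden state that produced it.
    print {$out} $faces[int(rand(@faces))], " ", $actual, "\n";
    if (rand() <= $change{$actual}) {
        ($actual, $next) = ($next, $actual);
    }
}

close($out);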

Reversing BEAST

I am exhausted :) Have you ever tried to reverse engineer Java code using open source software (and also community editions of popular commercial tools)? Under Linux? :D
Well, sometimes it works. And sometimes not necessarily.

BEAST is a cross-platform program for Bayesian MCMC analysis of molecular sequences.
This software package contains almost 800 Java classes (not counting classes from the jar libraries it uses). It is about 60000 LoC, all written in Java 1.5 (OK, compiling against Java 1.5 shows more than 1500 warnings).

There are a number of problems with freeware Java reverse engineering tools that make them useless for analysing BEAST. In many cases the problem is that they work only for Java <= 1.4. None of the tested tools was able to import the whole BEAST source folder and analyze it. ArgoUML (which is still, in my opinion, the best open RE tool for Java 1.5) failed after using all the memory available to the VM (yes, I changed the default values). Now that I have changed my working model to a so-called "step-by-step" one, it works with only a few complaints.

So, what is the alternative? A commercial license. And believe me, I am actually close to making this choice. But who will guarantee that the money spent on a license solves the problem?

Later on I will give Eclipse + UML2 a chance. I know I should have started with Eclipse (as always), but nobody's perfect :).

Monday, March 10, 2008

Multilingua :D

To live in Germany you must know German :) Well, that is not true, although the language is very useful ;) Recently I met one of my neighbors. Georgi is studying law and comes from Georgia. The problem was that I speak some English and a bit of German, whereas he speaks some German and acceptable Russian. I thought I understood Russian until that day ;). So we communicated in a mixture of these languages for half an hour. It was fun, but now I know that I must work harder on my German.
The situation in bioinformatics is very similar. To understand some concepts, or simply to use them, you need to know a multilingual mixture.
Recently I worked on the Viterbi algorithm (the code on the wiki is a bit different from the one I am using). It is a simple dynamic programming approach to decoding HMM output. Anyway, to compare my buggy Java code with the code referenced on that page, I picked up a bit of Python. As C++ is the basic language in my team and I want to become more fluent in C++ + boost, I decided to code it in C++. To generate input for the tests I simply use Perl :).
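For the curious, here is a minimal Python sketch of Viterbi decoding. It is neither the wiki code nor my Java/C++ version. It assumes the dishonest casino model with the same numbers as the Perl generator from the March 12 post: a fair and a loaded die, a 0.05 chance of switching from fair to loaded, 0.1 from loaded to fair, and an equal chance of starting with either die. The names `START`, `TRANS`, `EMIT` and `viterbi` are chosen only for this example.

```python
import math

STATES = ("f", "l")  # fair die, loaded die
START = {"f": 0.5, "l": 0.5}
TRANS = {
    "f": {"f": 0.95, "l": 0.05},
    "l": {"f": 0.10, "l": 0.90},
}
EMIT = {
    "f": {face: 1 / 6 for face in range(1, 7)},
    "l": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(rolls):
    """Return the most probable state path for a list of die faces."""
    if not rolls:
        return ""
    # score[s]: log-probability of the best path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]])
             for s in STATES}
    back = []  # one dict of back-pointers per roll after the first
    for face in rolls[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            # Best predecessor of s, judged by path score plus transition.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][face]))
        back.append(pointers)
    # Start from the best final state and follow the back-pointers.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi([1, 3, 2, 6, 6, 6, 6, 6, 6, 6, 6, 4, 2, 5, 1, 3]))
```

Log-probabilities are used instead of raw products so that long roll sequences do not underflow to zero. The returned string has one state letter per roll, in the same "f"/"l" notation the generator writes next to each face.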
I started with "Bioinformatics sequence analysis" and I must say I really enjoy this trip.

And if you want to know what I do at weekends (maybe later on I'll describe some of them), you can always visit my Picasa web album.