[Cs254f11] My plan

Lee Spector lspector at hampshire.edu
Mon Oct 17 20:15:40 EDT 2011


[Re-adding the list to the cc: -- hope that's okay! I do think that others will also benefit from this.]

Comments below...


On Oct 17, 2011, at 4:22 PM, Wm. Josiah Erikson wrote:
>    OK, so each critic is made up of a number of "scoring genomes", which are either generated randomly or pulled randomly from a large predetermined soup of things you could do to get a score. Add all the scoring genomes together and you get a critic, which will then evaluate each song, and the total deviation of each individual score from my own will be that critic's fitness. Then I will breed together the x best critics in each generation of y members (adjustable) and run it again.

It sounds from the below like each "scoring genome" can itself be a pretty complicated beast, doing math, comparisons, etc. So why couldn't *one* of these do all of the things that you're thinking a bunch of them could do summed together? You're saying that a critic will be the sum of a bunch of things, each of which can do a bunch of things, including making sums of bunches of things... right? If I've got that right, then I think it'll be simpler, and probably just as good, to make each critic *be* just *one* of these scoring genomes.
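To make this concrete, here's a rough sketch of what I mean (the representation and names here are just mine, not something we've agreed on): if a genome is a nested expression over scoring primitives, then one genome can already *be* a sum of sub-scores, so there's no need for a separate "critic = sum of genomes" layer.

```clojure
;; Sketch: a genome is either a number or a vector [op & args].
;; :+ and :* combine sub-scores, so a single genome can be a sum of scores.
(defn score-genome
  "Evaluate a genome against a seq of lines; always returns a number."
  [genome lines]
  (if (number? genome)
    genome
    (let [[op & args] genome]
      (case op
        :+          (apply + (map #(score-genome % lines) args))
        :*          (apply * (map #(score-genome % lines) args))
        :line-count (count lines)
        ;; character code at a fixed position, stored in the genome itself
        :char-at    (let [[line-idx col] args
                          line (nth lines (mod line-idx (count lines)))]
                      (if (< col (count line)) (int (nth line col)) 0))))))

;; One genome that is itself a sum of sub-scores:
(score-genome [:+ [:line-count] [:* 2 [:char-at 0 3]]]
              ["11, 6864, Note_on_c, 9, 42, 0"])
```

Crossover and mutation then just swap or perturb subexpressions of one tree, rather than shuffling members of a collection of genomes.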

>    Now each scoring genome could both DO random things (like pull two random characters out from two random lines and compare them to each other, or pull a random character from a random line and compare it to the most common character in the whole file, or a million other possibilities) and could also assign scores to those things, positive, negative, who knows. It could also be that having a particular genome hard-coded with the actual positions in the files that it pulls from, once it's been randomly generated, or whatever, would be helpful and create more consistent results when breeding the resulting critic with another one.

My advice is to avoid any randomness in the actions or calculations of the scoring genomes. In other words, the same genome, if evaluated twice on the same file, should produce exactly the same score. 

It's not that I can't imagine it being useful to do random stuff (or, as you suggest below, grab random data from a file) as part of calculating a score, but rather that I predict that randomness here will make evolution impossible.

If your scores are nondeterministic (giving different scores on different applications to the same data) then your fitnesses (the differences between the scores and your own personal scores) will also be nondeterministic. This means that the "selection" performed in the evolutionary loop will be basing its selections on luck to a fairly large extent, and that this luck factor may well be as important as actual quality in determining who gets selected. This would mean that the evolutionary loop will not be able to amplify quality over generations.

If you really want to have scoring genomes that involve randomness then I think you'd have to take pretty serious countermeasures, like testing each one by running it a large number of times and averaging the results. But I think it'd be simpler and better just to leave out the randomness altogether.
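Just to show what that countermeasure costs, here's a sketch (the function names are made up, and `noisy-score` just stands in for a random genome):

```clojure
;; A stand-in for a nondeterministic scoring genome: same song, different score.
(defn noisy-score [song]
  (+ 50 (rand-int 11)))  ; anywhere from 50 to 60 on each call

(defn averaged-score
  "Average n independent evaluations to damp out the noise."
  [score-fn song n]
  (/ (reduce + (repeatedly n #(score-fn song)))
     (double n)))

;; A single evaluation can be off by +/-5 from the true mean of 55;
;; averaging 1000 runs gets a stable estimate, but at 1000x the cost --
;; which is why deterministic genomes are the better deal.
(averaged-score noisy-score "some-song.csv" 1000)
```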

Of course, if you're not going to be using a random number you'll have to have a non-random number and that will have to come from somewhere. Presumably there could just be numbers in the genome itself, or some functions could operate on *all* lines in a file or all lines of a particular type, etc.


> It will be interesting to see what tools I can come up with to try to help this evolve towards something useful, i.e. what to put in the soup. The ideas I have so far involve
> 
> Fetching:
>    -set of characters, same random position on each line of the file (that position would be in the genome and would get passed on from generation to generation)
>    -getting the most common character in the file
>    -random number of random characters from the file on different lines
>    -same as above except same line
>    -random initial number determines starting number of the position on the line, increase for each line until out of characters
> 
> Operators:
>    -modulus division, addition, subtraction, multiplication, division, mode, mean, median, and mixtures of all of these with comparison
> 
>    Maybe this is all too heterogeneous and I need to make everything take two operators or something. I'll see as I start actually implementing this. I'm going to start with a few simple genomes :)

All of this looks good except for the randomness... Each of these fetchers has a natural deterministic version: instead of a random position, store the position in the genome; instead of random characters, take all of them (or every nth one, with n in the genome).

> 
>    I figured out the slurp problem - for some reason when I stick it in a vector of lists, it works fine:
> 
> (def rita (vec (string/split-lines (slurp "/Volumes/cs254/group_storage/josiah/LovelyRita.csv"))))
> 
> (get rita (rand-int (count rita)))
> ;"11, 6864, Note_on_c, 9, 42, 0"
> 


Well... I don't know, but this does have half of the smell of a laziness problem, since calling vec on something lazy forces it to be fully realized. But the doc string on split-lines says it's not lazy, and neither should slurp be. So this doesn't fully make sense. Another thing that doesn't make sense is that you said you had the same problem with the slurp example in clojinc with Jabberwocky.txt, and I haven't experienced any such problem... it returns instantly for me.
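Here's a quick way to check those non-laziness claims without your CSV (this just writes and reads a temp file, so the path isn't the issue):

```clojure
(require '[clojure.string :as string])

(let [f (java.io.File/createTempFile "lazy-check" ".txt")]
  (spit f "line one\nline two\nline three")
  (let [lines (string/split-lines (slurp f))]
    ;; slurp returns a fully realized String, and split-lines returns a
    ;; vector, so wrapping the result in vec shouldn't change anything
    ;; about when the work happens.
    [(class (slurp f)) (vector? lines) (count lines)]))
;; => [java.lang.String true 3]
```

If this runs instantly for you but the CSV version doesn't without vec, the difference is probably somewhere else (file size, the REPL trying to print the whole result, etc.), not laziness in slurp or split-lines.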

 -Lee





--
Lee Spector, Professor of Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438


