I am probably thinking about this wrong but here goes.
A computer starts spitting out a gazillion random numbers between 11111111111111111111 and 99999999999999999999, in a linear row:
- Sometimes the computer adds a number to one end of the line.
- Sometimes the computer adds a number to the other end of the line.
- Each number has a number that comes, or will come, before.
- Each number has a number that comes, or will come, after.
- Not all numbers are unique, many, but not most, are repeated.
- The computer never stops spitting out numbers.
As I record all of these numbers, I need to be able to make an educated guess, at any given time:
-
If this is the second time I have seen a number I must know what number preceded it in line last time.
-
If it has appeared more than two times, I must know the probability/frequency of numbers preceding it.
-
If this is the second time I have seen a number, I must also know what number came after it in line last time.
-
If it has appeared more than two times, I must know the probability/frequency of numbers coming after it.
How the heck do I structure the tables in a MySQL database to store all these numbers? Which engine do I use and why? How do I formulate my queries? I need to know fast, but capacity is also important because when will the thing stop spitting them out?
My ill-conceived plan:
2 Tables:
1. Unique ID/#
2. #/ID/#
My thoughts:
Unique ID’s are almost always going to be shorter than the number = faster match.
Numbers repeat = fewer ID rows = faster match initially.
Select * in table2 where id=(select id in table1 where #=?)
OR:
3 Tables:
1. Unique ID/#
2. #/ID
3. ID/#
My thoughts:
If I only need left/before, or only need after/right, im shrinking the size of the second query.
SELECT # IN table2(or 3) WHERE id=(SELECT id IN table1 WHERE #=?)
OR
1 Table:
1. #/#/#
Thoughts:
Less queries = less time.
SELECT * IN table WHERE col2=#.
I’m lost…. 🙁 Each number has four attributes, that which comes before+frequency and that which comes after+frequency.
Would I be better off thinking of it in that way? If I store and increment frequency in the table, I do away with repetition and thus speed up my queries? I was initially thinking that if I store every occurrence, it would be faster to figure the frequency programmatically…….
Such simple data, but I just don’t have the knowledge of how databases function to know which is more efficient.
In light of a recent comment, I would like to add a bit of information about the actual problem: I have a string of indefinite length. I am trying to store a Markov chain frequency table of the various characters, or chunks of characters, in this string.
Given any point in the string I need to know the probability of the next state, and the probability of the previous state.
I am anticipating user input, based on a corpus of text and past user input. A major difference compared to other applications I have seen is that I am going farther down the chain, more states, at a given time and I need the frequency data to provide multiple possibilities.
I hope that clarifies the picture a lot more. I didn’t want to get into the nitty gritty of the problem, because in the past I have created questions that are not specific enough to get a specific answer.
This seems maybe a bit better. My primary question with this solution is: Would providing the “key” (first few characters of the state) increase the speed of the system? i.e query for state_key, then query only the results of that query for the full state?
Table 1:
name: state
col1:state_id - unique, auto incrementing
col2:state_key - the first X characters of the state
col3:state - fixed length string or state
Table 2:
name: occurence
col1:state_id_left - non unique key from table 1
col2:state_id_right - non unique key from table 1
col3:frequency - int, incremented every time the two states occur next to each other.
QUERY TO FIND PREVIOUS STATES:
SELECT * IN occurence WHERE state_id_right=(SELECT state_id IN state WHERE state_key=? AND state=?)
QUERY TO FIND NEXT STATES:
SELECT * IN occurence WHERE state_id_left=(SELECT state_id IN state WHERE state_key=? AND state=?)
I’m not familiar with Markov Chains but here is an attempt to answer the question. Note: To simplify things, let’s call each string of numbers a ‘state’.
First of all I imagine a table like this
order: just use a sequential number (1,2,3,…,n) this will make it easy to search for the previous or next state.
state_id: a unique number associated to the state. As an example, you can use the number 1 to represent the state ‘1111111111…1’ (whatever the length of the sequence is). What’s important is that a reoccurrence of a state needs to use the same state_id that was used before. You may be able to formulate the state_id based on the string (maybe substracting a number). Of course a state_id only makes sense if the number of possible states fits in a MySQL int field.
state: that is the string of numbers ‘11111111…1’ to ‘99999999…9’ … I’m guessing this can only be stored as a string but if it fits in an integer/number column you should try it as it may well be that you don’t need the state_id
The point of state_id is that searching number is quicker than searching texts, but there will always be trade-offs when it comes to performance … profile and identify your bottlenecks to make better design decisions.
So, how do you look for a previous occurrence of the state S_i ?
“SELECT order, state_id, state FROM states WHERE state_id = ” and then attach get_state_id(S_i) where get_state_id ideally uses a formula to generate a unique id for the state.
Now, with order – 1 or order + 1 you can access the neighboring states issuing an additional query.
Next we need to track the frequency of different occurrences. You can do that in a different table that could look like this:
And only add records as you get the numbers.
Finally, you can have tables to track frequency for the neighboring states:
You will be able to infer probabilities (i guess this is what you are trying to do) by looking at the number of occurrences of a state (in state_frequencies) vs the number of occurrences of it’s predecessor state (in prev_state_frequencies).
I’m not sure if I got your problem right but if this makes sense I’m guessing I have.
Hope it helps,
AH