My master's thesis is about discovering bad database design by analyzing both the metadata and the stored data. We do this by extracting a metadata model from a given DBMS and then running a set of rules on this metadata.
To extend this process with data analysis, we need to allow rules to query the database directly, but we must retain DBMS independence, so that the same queries can be applied to PostgreSQL, MSSQL and MySQL.
We have discussed a sort of functional construction of queries such as:
new Query(new Select(columnID), new From(tableID), new Where(new Equality(columnID1, columnID2)))
And then using a DBMS-specific serializer.
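To make the serializer idea concrete, here is a minimal sketch of how that could look. All type names (CountQuery, ISqlDialect, QuerySerializer and the three dialect classes) are made up for illustration and are not from any existing library. It only handles one query shape and only varies identifier quoting, which is one of the places where the three systems differ.

```csharp
using System;

// Hypothetical query node: SELECT COUNT(<column>) FROM <table>.
public sealed class CountQuery
{
    public string Table { get; }
    public string Column { get; }
    public CountQuery(string table, string column) { Table = table; Column = column; }
}

// One implementation per DBMS; rules only build query objects
// and never write dialect-specific SQL themselves.
public interface ISqlDialect
{
    string QuoteIdentifier(string name);
}

public sealed class PostgreSqlDialect : ISqlDialect
{
    // PostgreSQL quotes identifiers with double quotes.
    public string QuoteIdentifier(string name) => "\"" + name.Replace("\"", "\"\"") + "\"";
}

public sealed class MsSqlDialect : ISqlDialect
{
    // SQL Server quotes identifiers with square brackets.
    public string QuoteIdentifier(string name) => "[" + name.Replace("]", "]]") + "]";
}

public sealed class MySqlDialect : ISqlDialect
{
    // MySQL quotes identifiers with backticks by default.
    public string QuoteIdentifier(string name) => "`" + name.Replace("`", "``") + "`";
}

public static class QuerySerializer
{
    // Turns the dialect-neutral query object into SQL text for one DBMS.
    public static string Serialize(CountQuery query, ISqlDialect dialect) =>
        "SELECT COUNT(" + dialect.QuoteIdentifier(query.Column) + ") FROM "
        + dialect.QuoteIdentifier(query.Table);
}
```

With this, QuerySerializer.Serialize(new CountQuery("Users", "Name"), new MsSqlDialect()) gives SELECT COUNT([Name]) FROM [Users], and swapping in PostgreSqlDialect gives the same query with double-quoted identifiers. A real version would need a node type per SQL construct, which is where the effort goes.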
Another approach is to let rules handle it all by themselves:
public string QueryDatabase(DBMS dbms)
{
    // Each rule returns the SQL text for the DBMS it is given.
    if (dbms == PostgreSQL) { return "select count(1) from Users"; }
    if (dbms == MSSQL) { return .... }
}
Are we missing something? Does all this in fact exist in a nice library somewhere? And yes, we have looked at entity frameworks, but they seem to rely on a statically typed model of the database, which for obvious reasons cannot be created here.
I should mention that we maintain an extensible rule architecture, allowing end users to implement their own rules.
To clarify what we want to achieve, look at the following query (MSSQL). It needs two parameters: the name of the table (@table) and the name of the column (@column):
DECLARE @TotalCount FLOAT;
SELECT @TotalCount = COUNT(1) FROM [@table];
SELECT SUM(pcount * LOG10(@TotalCount / pcount)) / (LOG10(2) * @TotalCount)
FROM (SELECT (Count([@column])) as pcount
FROM [@table]
GROUP BY [@column]) as exp1
The query measures the amount of information stored in a given attribute by estimating its entropy. It needs to access all rows in the table. To avoid extracting all rows from the database and transferring them over a slow network connection, it is better to express the computation in SQL and transfer only a single number.
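For readers wondering why the query above is an entropy estimate, here is the short derivation. The symbols are my own notation: N stands for @TotalCount, c_i for the pcount of the i-th distinct value, and p_i = c_i / N for its relative frequency. It assumes the column contains no NULLs, so that the c_i add up to N.

```latex
\begin{align*}
H &= -\sum_i p_i \log_2 p_i, \qquad p_i = \frac{c_i}{N} \\
  &= \sum_i \frac{c_i}{N} \log_2 \frac{N}{c_i} \\
  &= \frac{1}{N \log_{10} 2} \sum_i c_i \log_{10} \frac{N}{c_i}
\end{align*}
```

The last line is what the SQL computes: SUM(pcount * LOG10(@TotalCount / pcount)) divided by (LOG10(2) * @TotalCount). The LOG10(2) factor converts from base 10 to base 2, so the result is in bits.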
NOTE: We DO have all the metadata we need. This question is only for accessing data!
I was not sure whether to add this to my already long question, edit an existing answer, or something else. Please feel free to advise. 😉
Building on mrnye's answer:
new Query()
    .Variable(varname => FLOAT)
    .Set(varname => new Query().Count(1).From(table))
    .Select(new Aggregate().Sum(varname => "pcount * LOG10(varname / pcount)"))
    .From(
        new Query()
            .Select(pcount => new Aggregate().Count(column))
            .From(table)
            .GroupBy(column)
    )
Syntax errors and misuse of lambda expressions aside, I played with the idea of using some extension methods for building queries. It does seem like a fairly complex approach. What do you think of it?
Building on the LINQ answer:
let totalCount = Table.Count
from uv in from r in Table
group r by r["attr"]
select r.Count
select r.Count * Log2((totalCount / r.Count))
Seems fairly nice, but a helluva lot to implement…
You could achieve the same by implementing a custom LINQ provider infrastructure. The queries are generic, but the AST visitors that generate the SQL queries can be made pluggable. You can even mock a database using an in-memory data store and translate your custom LINQ query to a LINQ to Objects query!
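As a rough illustration of the in-memory side, this is what the entropy rule from the question could look like when run directly as a LINQ to Objects query. It is only a sketch: the InMemoryEntropy class is a made-up name, and modelling a row as a dictionary from column name to value is my assumption, not something a provider forces on you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InMemoryEntropy
{
    // Estimates the entropy (in bits) of one column over in-memory rows.
    // Each row is assumed to be a column-name -> value dictionary.
    public static double Entropy(
        IEnumerable<IReadOnlyDictionary<string, object>> rows, string column)
    {
        // Equivalent of: SELECT COUNT(*) ... GROUP BY column
        List<double> counts = rows
            .GroupBy(row => row[column])
            .Select(group => (double)group.Count())
            .ToList();

        double total = counts.Sum();

        // Sum of p * log2(1 / p) with p = count / total; an empty input yields 0.
        return counts.Sum(c => c / total * Math.Log(total / c, 2.0));
    }
}
```

Something like this is handy as a reference implementation: a rule can be checked against a small in-memory table before its SQL translation is run against a real DBMS. For example, a column holding two distinct values equally often should come out at about 1 bit.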
You would need to create a provider that would know how to extract the column name from the object’s indexer. Here is a basic framework that you can extend: