My masters thesis is about discovering bad database design by analyzing metadata and the

Question

0

Asked: May 20, 20262026-05-20T07:48:11+00:00 2026-05-20T07:48:11+00:00

My masters thesis is about discovering bad database design by analyzing metadata and the

0

My masters thesis is about discovering bad database design by analyzing metadata and the data stored. We do this by extracting a metadata model from a given DBMS and then running a set of rules on this metadata.

To extend this process with data analysis, we need to allow rules to query the database directly, but we must retain DBMS independence, such that queries can be applied to PostgreSQL, MSSQL and MySQL.

We have discussed a sort of functional construction of queries such as:

new Query(new Select(columnID), new From(tableID), new Where(new Equality(columnID1, columnID2)))

And then using a DBMS-specific serializer.

Another approach is to let rules handle it all by themselves:

public Query QueryDatabase(DBMS dbms)
{
 if (dbms == PostgreSQL) { return "select count(1) from Users"}
 if (dbms == MSSQL) {return ....}
}

Are we missing something? Does all this in fact exist in a nice library somewhere? And yes, we have looked at Entity frameworks, but they seem to rely on a statically types model of the database, which for obvious reasons cannot be created.

I should mention that we maintain an extensible rule architecture, allowing end users to implement their own rules.

To clarify what we want to achieve, look at the following query (mssql), it needs two parameters, the name of the table (@table) and the name of the column (@column):

DECLARE @TotalCount FLOAT;
SELECT @TotalCount = COUNT(1) FROM [@table];
SELECT SUM(pcount * LOG10(@TotalCount / pcount)) / (LOG10(2) * @TotalCount)  
FROM (SELECT (Count([@column])) as pcount 
      FROM [@table]
      GROUP BY [@column])  as exp1

The query measures the amount of information stored in a given attribute, by estimating the entropy. It needs to access all rows in the table. To avoid extracting all rows from the database and transferring them over a slow network connection it is better to express them in SQL an only transfer a single number.

NOTE: We DO have all the metadata we need. This question is only for accessing data!

I was not very sure of whether to add this to my already long question, edit an existing answer or what todo. Please feel free to advise. 😉

Building on mrnye answer:

new Query()
.Variable(varname => FLOAT)
.Set(varname => new Query().Count(1).From(table) )
.Select(new Aggregate().Sum(varname => "pcount * LOG10(varname / pcount)"))
.From(
  new Query()
  .Select(pcount => new Aggregate().Count(column)
  .From(table)
  .GroupBy(column)
)

Syntax errors and misuse of lambda statements aside, i played with the idea of using some extension methods for building queries. It does seem as a fairly complex approach. How would you think about such an approach?

Building on the LINQ answer:

let totalCount = Table.Count
from uv un from r in Table
           group r by r["attr"]
           select r.Count
select r.Count * Log2((totalCount / r.Count))

Seems fairly nice, but a helluva lot to implement…

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-20T07:48:12+00:00

You could achieve the same by implementing a custom LINQ provider infrastructure. The queries are generic, but the AST tree visitors that generate the SQL queries can be made pluggable. You can even mock a database using a in memory data store and translating your custom LINQ query to a LINQ to objects query!

You would need to create a provider that would know how to extract the column name from the object’s indexer. Here is a basic framework that you can extend:

// Runs in LinqPad!
public class TableQueryObject
{
    private readonly Dictionary<string, object> _data = new Dictionary<string, object>();
    public string TableName { get; set; }
    public object this[string column]
    {
        get { return _data.ContainsKey(column) ? _data[column] : null; }
        set { if (_data.ContainsKey(column)) _data[column] = value; else _data.Add(column, value); }
    }
}

public interface ITableQuery : IEnumerable<TableQueryObject>
{
    string TableName { get; }
    string ConnectionString { get; }
    Expression Expression { get; }
    ITableQueryProvider Provider { get; }
}

public interface ITableQueryProvider
{
    ITableQuery Query { get; }
    IEnumerable<TableQueryObject> Execute(Expression expression);
}

public interface ITableQueryFactory
{
    ITableQuery Query(string tableName);
}


public static class ExtensionMethods
{
    class TableQueryContext : ITableQuery
    {
        private readonly ITableQueryProvider _queryProvider;
        private readonly Expression _expression;

        public TableQueryContext(ITableQueryProvider queryProvider, Expression expression)
        {
            _queryProvider = queryProvider;
            _expression = expression;
        }

        public string TableName { get { return _queryProvider.Query.TableName; } }
        public string ConnectionString { get { return _queryProvider.Query.ConnectionString; } }
        public Expression Expression { get { return _expression; } }
        public ITableQueryProvider Provider { get { return _queryProvider; } }
        public IEnumerator<TableQueryObject> GetEnumerator() { return Provider.Execute(Expression).GetEnumerator(); }
        IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
    }

    public static MethodInfo MakeGeneric(MethodBase method, params Type[] parameters)
    {
        return ((MethodInfo)method).MakeGenericMethod(parameters);
    }

    public static Expression StaticCall(MethodInfo method, params Expression[] expressions)
    {
        return Expression.Call(null, method, expressions);
    }

    public static ITableQuery CreateQuery(this ITableQueryProvider source, Expression expression)
    {
        return new TableQueryContext(source, expression);
    }

    public static IEnumerable<TableQueryObject> Select<TSource>(this ITableQuery source, Expression<Func<TSource, TableQueryObject>> selector)
    {
        return source.Provider.CreateQuery(StaticCall(MakeGeneric(MethodBase.GetCurrentMethod(), typeof(TSource)), source.Expression, Expression.Quote(selector)));
    }

    public static ITableQuery Where(this ITableQuery source, Expression<Func<TableQueryObject, bool>> predicate)
    {
        return source.Provider.CreateQuery(StaticCall((MethodInfo)MethodBase.GetCurrentMethod(), source.Expression, Expression.Quote(predicate)));
    }
}

class SqlTableQueryFactory : ITableQueryFactory
{

    class SqlTableQuery : ITableQuery
    {
        private readonly string _tableName;
        private readonly string _connectionString;
        private readonly ITableQueryProvider _provider;
        private readonly Expression _expression;

        public SqlTableQuery(string tableName, string connectionString)
        {
            _connectionString = connectionString;
            _tableName = tableName;
            _provider = new SqlTableQueryProvider(this);
            _expression = Expression.Constant(this);
        }

        public IEnumerator<TableQueryObject> GetEnumerator() { return Provider.Execute(Expression).GetEnumerator(); }
        IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
        public string TableName { get { return _tableName; } }
        public string ConnectionString { get { return _connectionString; } }
        public Expression Expression { get { return _expression; } }
        public ITableQueryProvider Provider { get { return _provider; } }
    }

    class SqlTableQueryProvider : ITableQueryProvider
    {
        private readonly ITableQuery _query;
        public ITableQuery Query { get { return _query; } }
        public SqlTableQueryProvider(ITableQuery query) { _query = query; }

        public IEnumerable<TableQueryObject> Execute(Expression expression)
        {
            //var connecitonString = _query.ConnectionString;
            //var tableName = _query.TableName;
            // TODO visit expression AST (generate any sql dialect you want) and execute resulting sql
                    // NOTE of course the query can be easily parameterized!
            // NOTE here the fun begins, just return some dummy data for now :)
            for (int i = 0; i < 100; i++)
            {
                var obj = new TableQueryObject();
                obj["a"] = i;
                obj["b"] = "blah " + i;
                yield return obj;
            }
        }
    }

    private readonly string _connectionString;
    public SqlTableQueryFactory(string connectionString) { _connectionString = connectionString; }
    public ITableQuery Query(string tableName)
    {
        return new SqlTableQuery(tableName, _connectionString);
    }
}

static void Main()
{
    ITableQueryFactory database = new SqlTableQueryFactory("SomeConnectionString");
    var result = from row in database.Query("myTbl")
                 where row["someColumn"] == "1" && row["otherColumn"] == "2"
                 where row["thirdColumn"] == "2" && row["otherColumn"] == "4"
                 select row["a"]; // NOTE select executes as linq to objects! FTW
    foreach(var a in result) 
    {
        Console.WriteLine(a);
    }   
}

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

My masters thesis is about discovering bad database design by analyzing metadata and the

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply