The Archive Base

Editorial Team
Asked: May 28, 2026

Can anybody point me to some PHP or Perl code which will create a MySQL table from an arbitrary TSV file?

Based on the data found and some parameters, it would use its own logic to work out an appropriate type for each field, create the database table, and upload the data (i.e., the table structure isn’t known in advance).

(Alternatively, I could imagine it creating an initial table with a general text type, then running SQL queries to analyse the data, then altering the table structure to match the data.)

1 Answer

  1. Editorial Team
    Added an answer on May 28, 2026 at 1:32 pm

    The only solution I found was in the MySQL Cookbook:
    http://www.kitebird.com/mysql-cookbook/

    Code samples are at: http://www.kitebird.com/mysql-cookbook/downloads-2ed.php

    This does the first piece: it analyses the TSV data and generates a “CREATE TABLE” statement appropriate for it.
    The second piece, loading the TSV into the resulting table structure, is simple and common.
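
    As a rough illustration of that second piece, here is a minimal sketch that builds a LOAD DATA statement for a tab-delimited, linefeed-terminated file. The helper name build_load_statement and the file name data.txt are made up for this example and do not come from the cookbook; the sketch only assumes MySQL's standard LOAD DATA LOCAL INFILE syntax.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (not part of guess_table.pl): build a LOAD DATA
    # statement for a tab-delimited, linefeed-terminated data file.
    sub build_load_statement
    {
    my ($tbl_name, $file_name, $has_labels) = @_;

        # double any backticks in the table name before quoting it
        (my $quoted_tbl = $tbl_name) =~ s/`/``/g;
        # escape backslashes and single quotes in the file name
        (my $quoted_file = $file_name) =~ s/([\\'])/\\$1/g;

        my $stmt = "LOAD DATA LOCAL INFILE '$quoted_file'"
                 . " INTO TABLE `$quoted_tbl`"
                 . " FIELDS TERMINATED BY '\\t'"
                 . " LINES TERMINATED BY '\\n'";
        # skip the first line if it holds column labels rather than data
        $stmt .= " IGNORE 1 LINES" if $has_labels;
        return $stmt;
    }

    print build_load_statement ("t", "data.txt", 1), ";\n";
    ```

    The resulting statement can be fed to the mysql client or executed through DBI. Note that LOCAL loading has to be enabled on both the client and the server for it to work.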

    guess_table.pl

    #!/usr/bin/perl
    # guess_table.pl - characterize the contents of a data file and use the
    # information to guess a CREATE TABLE statement for the file
    
    # Usage: guess_table.pl [options] [data_file]
    
    # To do:
    # - Use value range information for something.  It's collected but not yet
    #   used.  For example, suggest better INT types.
    # - Get rid of nonnegative attribute; it can be assessed now from the range.
    
    # Load a data file and read column names and data values.
    # Guess the declaration for each of the columns based on what the data
    # values look like, and then generate an SQL CREATE TABLE statement for the
    # table.  Because the column declarations are just guesses, you'll likely
    # want to edit the output, for example, to change a data type or
    # length.  You may also want to add indexes.  Nevertheless, using this
    # script can be easier than writing the CREATE TABLE statement by hand.
    
    # Some assumptions:
    # - Lines are tab-delimited, linefeed-terminated
    # - Dates consist of 3 numeric parts, separated by - or /, in y/m/d order
    
    # Here are some ways that guess_table.pl could be improved.  Each of
    # them would make it smarter, albeit at the cost of increased processing
    # requirements. Some of the suggestions are likely impractical for really
    # huge files.
    
    # - For numeric columns, use min/max values to better guess the type.
    # - Keep track of the number of unique values in a column.  If there
    #   aren't many, the column might be a good candidate for being an ENUM.
    #   Testing should not be case sensitive, because ENUM columns are not
    #   case sensitive.
    # - Make the date guessing code smarter.  Have it recognize non-ISO format
    #   and attempt to make suggestions that a column needs to be reformatted.
    #   (This actually needs to see entire column, because that would help
    #   it distinguish U.S. from British formats WRT order of month and day.)
    #   This would need to track min/max for each of the three date parts.
    # - If all values in a column are unique, suggest that it should be a PRIMARY
    #   KEY or a UNIQUE index.
    # - For DATETIME columns, allow some times to be missing without flagging
    #   column as neither DATE nor TIME.
    
    # Paul DuBois
    # paul@kitebird.com
    # 2002-01-31
    
    # 2002-01-31
    # - Created.
    # 2002-02-19
    # - Add code to track ranges for numeric columns and for the three date
    #   subparts of columns that look like they contain dates.
    # 2002-02-20
    # - Added --lower and --upper options to force column labels to lowercase
    #   or uppercase.
    # 2002-03-01
    # - For character columns longer than 255 characters, choose TEXT type based
    #   on maximum length.
    # 2002-04-04
    # - Add --quote-names option to quote table and column names `like this`.
    #   The resulting statement requires MySQL 3.23.6 or higher.
    # 2002-07-16
    # - Fix "uninitialized value" warnings resulting from missing columns in
    #   data lines.
    # - Don't attempt to assess date characteristics for columns that are always
    #   empty.
    # 2005-12-28
    # - Make --quote-names the default, add --skip-quote-names option so that
    #   identifier quoting can be turned off.
    # - Default data type now is VARCHAR, not CHAR.
    # 2006-06-10
    # - Emit UNSIGNED for double/decimal columns if they're unsigned.
    
    use strict;
    use warnings;
    use Getopt::Long;
    $Getopt::Long::ignorecase = 0; # options are case sensitive
    $Getopt::Long::bundling = 1;   # allow short options to be bundled
    
    # ----------------------------------------------------------------------
    
    # Create information structures to use for characterizing each column
    # in the data file.  We need to know whether any nonnumeric values are
    # found, whether numeric values are always integers, and the maximum length
    # of column values.
    
    # Argument is the array of column labels.
    # Creates an array of hash references and returns a reference to that array.
    
    sub init_col_info
    {
    my @labels = @_;
    my @col_info;
    
        for my $i (0 .. @labels - 1)
        {
            my $info = { };
            $info->{label} = $labels[$i];
            $info->{max_length} = 0;
    
            # these can be tested directly, so they're set false until found
            # to be true
            $info->{hasempty} = 0;    # has empty values
            $info->{hasnonempty} = 0; # has nonempty values
    
            # these can be assessed only by seeing all the values in the
            # column, so they're set true until discovered by counterexample
            # to be false
            $info->{numeric} = 1;     # used to detect general numeric types
            $info->{integer} = 1;     # used to detect INT
            $info->{nonnegative} = 1; # used to detect UNSIGNED
            $info->{temporal} = 1;    # used to detect general temporal types
            $info->{date} = 1;        # used to detect DATE
            $info->{datetime} = 1;    # used to detect DATETIME
            $info->{time} = 1;        # used to detect TIME
    
            # track min/max value for numeric columns
            $info->{min_val} = undef;
            $info->{max_val} = undef;
    
            # track min/max for each of three date parts
            $info->{date_range} = [ undef, undef, undef];
    
            push (@col_info, $info);
        }
        return (\@col_info);
    }
    
    sub print_create_table
    {
    my ($tbl_name, $col_info_list, $quote) = @_;
    my $ncols = @{$col_info_list};
    my $s;
    my $extra = "";
    
        $quote = ($quote ? "`" : "");     # quote names?
        for my $i (0 .. $ncols - 1)
        {
            my $info = $col_info_list->[$i];
    
            $s .= ",\n" if $i > 0;
            $s .= $extra if $extra ne "";
            $extra = "";
    
            $s .= "  $quote$info->{label}$quote ";
    
            if (!$info->{hasnonempty})  # column is always empty, make wild guess
            {
                $s .= "CHAR(10)    /* NOTE: column is always empty */";
                next;
            }
    
            # if the column has nonempty values but one of
            # these hasn't been ruled out, that's a problem
            if ($info->{numeric} && $info->{temporal})
            {
                die "Logic error: $info->{label} was characterized as both"
                    . " numeric and temporal\n";
            }
    
            if ($info->{numeric})
            {
                if ($info->{integer})
                {
                    $s .= "INT";
                    ## TO DO: use range to make guess about type
                    # Print "might be YEAR" if in range...(0, 1901-2155)
                }
                else
                {
                    $s .= "DOUBLE";
                }
                $s .= " UNSIGNED" if $info->{nonnegative};
            }
            elsif ($info->{temporal})
            {
                # if a date column looks more like a U.S. or British
                # date, add some comments to that effect
                if (exists ($info->{date_type}))
                {
                    my $ref = $info->{date_type};
                    $extra .= "  # $info->{label} might be a U.S. date\n"
                        if $ref->{us};
                    $extra .= "  # $info->{label} might be a British date\n"
                        if $ref->{br};
                }
                if ($info->{date})
                {
                    $s .= "DATE";
                }
                elsif ($info->{datetime})
                {
                    $s .= "DATETIME";
                }
                elsif ($info->{time})
                {
                    $s .= "TIME";
                }
                else
                {
                    die "Logic error: $info->{label} is flagged as temporal, but"
                        . " not as any of the temporal types\n";
                }
            }
            else
            {
                if ($info->{max_length} < 256)
                {
                    $s .= "VARCHAR($info->{max_length})";
                }
                elsif ($info->{max_length} < 65536)
                {
                    $s .= "TEXT";
                }
                elsif ($info->{max_length} < 16777216)
                {
                    $s .= "MEDIUMTEXT";
                }
                else
                {
                    $s .= "LONGTEXT";
                }
            }
            # if a column doesn't have empty values, guess that it cannot be NULL
            $s .= " " . ($info->{hasempty} ? "NULL" : "NOT NULL");
        }
    
        $s = "CREATE TABLE $quote$tbl_name$quote\n(\n$s\n);\n";
        print $s;
    }
    
    sub print_report
    {
    my $col_info_list = shift;
    my $ncols = @{$col_info_list};
    my $s;
    
        for my $i (0 .. $ncols - 1)
        {
            my $info = $col_info_list->[$i];
    
            printf "Column %d: %s\n", $i+1, $info->{label};
            if (!$info->{hasnonempty})  # column is always empty
            {
                print " column is always empty\n";
                next;
            }
    
            # if the column has nonempty values but one of
            # these hasn't been ruled out, that's a problem
            if ($info->{numeric} && $info->{temporal})
            {
                die "Logic error: $info->{label} was characterized as both"
                    . " numeric and temporal\n";
            }
    
            print " column has empty values: "
                . ($info->{hasempty} ? "yes" : "no") . "\n";
            printf " column value maximum length = %d\n", $info->{max_length};
    
            if ($info->{numeric})
            {
                printf " column is numeric (range: %g - %g)\n",
                    $info->{min_val}, $info->{max_val};
                if ($info->{integer})
                {
                    print " column is integer\n";
                    if ($info->{nonnegative})
                    {
                        print " column is nonnegative\n";
                    }
                }
            }
            elsif ($info->{temporal})
            {
                if ($info->{date})
                {
                    my $ref = $info->{date_range};
                    print " column contains date values";
                    printf " (part ranges: %d - %d, %d - %d, %d - %d)\n",
                        $ref->[0]->{min}, $ref->[0]->{max},
                        $ref->[1]->{min}, $ref->[1]->{max},
                        $ref->[2]->{min}, $ref->[2]->{max};
                    $ref = $info->{date_type};
                    printf " most likely date types: ISO: %s; U.S.: %s; British: %s\n",
                        ($ref->{iso} ? "yes" : "no"),
                        ($ref->{us} ? "yes" : "no"),
                        ($ref->{br} ? "yes" : "no");
                }
                elsif ($info->{datetime})
                {
                    my $ref = $info->{date_range};
                    print " column contains date+time values";
                    printf " (part ranges: %d - %d, %d - %d, %d - %d)\n",
                        $ref->[0]->{min}, $ref->[0]->{max},
                        $ref->[1]->{min}, $ref->[1]->{max},
                        $ref->[2]->{min}, $ref->[2]->{max};
                    $ref = $info->{date_type};
                    printf " most likely date types: ISO: %s; U.S.: %s; British: %s\n",
                        ($ref->{iso} ? "yes" : "no"),
                        ($ref->{us} ? "yes" : "no"),
                        ($ref->{br} ? "yes" : "no");
                }
                elsif ($info->{time})
                {
                    print " column contains time values\n";
                }
                else
                {
                    die "Logic error: $info->{label} is flagged as temporal, but"
                        . " not as any of the temporal types\n";
                }
            }
            else
            {
                print " column appears to be a string"
                    . " (cannot further narrow the type)\n";
            }
        }
    }
    
    # ----------------------------------------------------------------------
    
    my $prog = "guess_table.pl";
    my $usage = <<EOF;
    Usage: $prog [options] [data_file]
    
    Options:
    --help
            Print this message
    --labels, -l
            Interpret first input line as row of table column labels
            (default = c1, c2, ...)
    --lower, --upper
            Force column labels to be in lowercase or uppercase
    --quote-names, --skip-quote-names
            Quote or do not quote table and column identifiers with `` characters
            in case they are reserved words (default = quote identifiers)
    --report, -r
            Report mode; print findings rather than generating a CREATE
            TABLE statement
    --table=tbl_name, -t tbl_name
            Specify table name (default = t)
    EOF
    
    my $help;
    my $labels;     # expect a line of column labels?
    my $tbl_name = "t"; # table name (default: t)
    my $report;
    my $lower;
    my $upper;
    my $quote_names = 1;
    my $skip_quote_names;
    
    GetOptions (
        # =s means a string value is required after the option
        "help"              => \$help,            # print help message
        "labels|l"          => \$labels,          # expect row of column labels
        "table|t=s"         => \$tbl_name,        # table name
        "report|r"          => \$report,          # report mode
        "lower"             => \$lower,           # lowercase labels
        "upper"             => \$upper,           # uppercase labels
        "quote-names"       => \$quote_names,     # quote identifiers
        "skip-quote-names"  => \$skip_quote_names # don't quote identifiers
    ) or die "$usage\n";
    
    die  "$usage\n" if defined $help;
    
    $report = defined ($report);  # convert defined/undefined to boolean
    $lower = defined ($lower);
    $upper = defined ($upper);
    $quote_names = defined ($quote_names);
    $quote_names = 0 if defined ($skip_quote_names);
    
    die "--lower and --upper were both specified; that makes no sense\n"
                if $lower && $upper;
    
    my $line;
    my $line_count = 0;
    my @labels;   # column labels
    my $ncols;    # number of columns
    my $col_info_list;
    
    # If labels are expected, read the first line to get them
    if ($labels)
    {
        defined ($line = <>) or die "Input contained no line of column labels\n";
        chomp ($line);
        @labels = split (/\t/, $line);
    }
    
    # Arrays to hold line numbers of lines with too many/too few fields.
    # The first line in the file is assumed to be representative.  The
    # number of fields it contains becomes the norm against which any following
    # lines are assessed.
    
    my @excess_fields;
    my @too_few_fields;
    
    while (<>)
    {
        chomp ($line = $_);
        ++$line_count;
        if (!defined ($ncols))  # don't know this until first data line read
        {
            # determine number of columns (assume no more than 10,000)
            my @val = split (/\t/, $line, 10000);
            $ncols = @val;
            if (@labels)  # label count must match data column count
            {
                die "Label count doesn't match data column count\n"
                    if $ncols != @labels;
            }
            else      # if there were no labels, create them
            {
                @labels = map { "c" . $_ } 1 .. $ncols;
            }
            $col_info_list = init_col_info (@labels);
        }
    
        my @val = split (/\t/, $line, 10000);
        push (@excess_fields, $line_count) if @val > $ncols;
        push (@too_few_fields, $line_count) if @val < $ncols;
        for my $i (0 .. $ncols - 1)
        {
            my $val = ($i < @val ? $val[$i] : "");  # use "" if field is missing
            my $info = $col_info_list->[$i];
    
            $info->{max_length} = length ($val)
                if $info->{max_length} < length ($val);
    
            if ($val eq "")
            {
                # column does have empty values
                $info->{hasempty} = 1;
                next; # no other tests apply
            }
            $info->{hasnonempty} = 1;
    
            # perform numeric tests if no nonnumeric values have yet been seen
    
            if ($info->{numeric})
            {
                # numeric test (doesn't recognize scientific notation)
                if ($val =~ /^[-+]?(\d+(\.\d*)?|\.\d+)$/)
                {
                    # not int if contains decimal point
                    $info->{integer} = 0 if $val =~ /\./;
                    # not unsigned if begins with minus sign
                    $info->{nonnegative} = 0 if $val =~ /^-/;
    
                    # track min/max value
                    $info->{min_val} = $val
                        if !defined ($info->{min_val}) || $info->{min_val} > $val;
                    $info->{max_val} = $val
                        if !defined ($info->{max_val}) || $info->{max_val} < $val;
                }
                else
                {
                    # column contains nonnumeric information
                    $info->{numeric} = 0;
                    $info->{integer} = 0;
                }
            }
    
            # perform temporal tests if no nontemporal values have yet been seen
    
            if ($info->{temporal})
            {
                # date/datetime test
                # allow date, date hour:min, date hour:min:sec
                if (($info->{date} || $info->{datetime})
                    && $val =~ /^(\d+)[-\/](\d+)[-\/](\d+)\s*(\d+:\d+(:\d+)?)?$/)
                {
                    # it's not a time
                    $info->{time} = 0;
    
                    # not a date if time part was present; not a
                    # datetime if no time part was present
                    $info->{ defined ($4) ? "date" : "datetime" } = 0;
    
                    # a column that mixes dates and datetimes matches neither
                    # type; treat it as nontemporal so that the output routines
                    # don't die with a "Logic error" on it
                    $info->{temporal} = 0
                        unless $info->{date} || $info->{datetime};
    
                    # use the first three parts to track range of date parts
                    my @val = ($1, $2, $3);
                    my $ref = $info->{date_range};
                    foreach my $i (0 .. 2)
                    {
                        # if this is the first value we've checked, create the
                        # structure to hold the min and max; otherwise compare
                        # the stored min/max to the current value
                        if (!defined ($ref->[$i]))
                        {
                            $ref->[$i]->{min} = $val[$i];
                            $ref->[$i]->{max} = $val[$i];
                            next;
                        }
                        $ref->[$i]->{min} = $val[$i]
                            if $ref->[$i]->{min} > $val[$i];
                        $ref->[$i]->{max} = $val[$i]
                            if $ref->[$i]->{max} < $val[$i];
                    }
                }
                # time test
                # allow hour:min, hour:min:sec
                elsif ($info->{time} && $val =~ /^\d+:\d+(:\d+)?$/)
                {
                    # it's not a date or datetime
                    $info->{date} = 0;
                    $info->{datetime} = 0;
                }
                else
                {
                    # column contains nontemporal information
                    $info->{temporal} = 0;
                }
            }
        }
    }
    
    die "Input contained no data lines\n" if $line_count == 0;
    die "Input lines all were empty\n" if $ncols == 0;
    
    # Look at columns that look like DATE or DATETIME columns and attempt
    # to determine whether they appear to be in ISO, U.S., or British format.
    # (Skip columns that are always empty, because these assessments cannot
    # be made for such columns.)
    
    for my $i (0 .. $ncols - 1)
    {
        my $info = $col_info_list->[$i];
        next unless $info->{hasnonempty};
        next unless $info->{temporal} && ($info->{date} || $info->{datetime});
        my $ref = $info->{date_range};
        # assume that the column is valid as each of the types until ruled out
        my $valid_as_iso = 1; # [CC]YY-MM-DD
        my $valid_as_us = 1;  # MM-DD-[CC]YY
        my $valid_as_br = 1;  # DD-MM-[CC]YY
        # first segment is U.S. month, British day
        my $min = $ref->[0]->{min};
        my $max = $ref->[0]->{max};
        $valid_as_us = 0 if $min < 0 || $max > 12;
        $valid_as_br = 0 if $min < 0 || $max > 31;
        # second segment is U.S. day, British month, ISO month
        $min = $ref->[1]->{min};
        $max = $ref->[1]->{max};
        $valid_as_us = 0 if $min < 0 || $max > 31;
        $valid_as_br = 0 if $min < 0 || $max > 12;
        $valid_as_iso = 0 if $min < 0 || $max > 12;
        # third segment is ISO day
        $min = $ref->[2]->{min};
        $max = $ref->[2]->{max};
        $valid_as_iso = 0 if $min < 0 || $max > 31;
        if (!$valid_as_iso && !$valid_as_us && !$valid_as_br)
        {
            $info->{temporal} = 0;  # huh! guess it's not a date after all
        }
        else    # save date type results for later
        {
            $info->{date_type}->{iso} = $valid_as_iso;
            $info->{date_type}->{us} = $valid_as_us;
            $info->{date_type}->{br} = $valid_as_br;
        }
    }
    
    
    warn "# Number of lines = $line_count, columns = $ncols\n";
    warn "# Number of lines with too few fields: " . scalar (@too_few_fields) . "\n"
                        if @too_few_fields;
    warn "# Number of lines with excess fields: " . scalar (@excess_fields) . "\n"
                        if @excess_fields;
    
    if ($report)
    {
        print_report ($col_info_list);
    }
    else
    {
        for my $i (0 .. $ncols - 1)
        {
            my $info = $col_info_list->[$i];
            $info->{label} = lc ($info->{label}) if $lower;
            $info->{label} = uc ($info->{label}) if $upper;
        }
        print_create_table ($tbl_name, $col_info_list, $quote_names);
    }
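
    The script's "To do" notes mention that the collected min/max values are not yet used to suggest better INT types. A minimal sketch of how that could work is shown below. The helper suggest_int_type is hypothetical and not part of guess_table.pl; it assumes the standard MySQL integer ranges and follows the script's convention of emitting UNSIGNED for nonnegative columns.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (not part of guess_table.pl): pick the smallest
    # MySQL integer type whose range covers the observed min/max values.
    sub suggest_int_type
    {
    my ($min_val, $max_val) = @_;
    # each entry: type name, signed minimum, signed maximum, unsigned maximum
    my @types = (
        [ "TINYINT",   -128,        127,        255 ],
        [ "SMALLINT",  -32768,      32767,      65535 ],
        [ "MEDIUMINT", -8388608,    8388607,    16777215 ],
        [ "INT",       -2147483648, 2147483647, 4294967295 ],
    );

        for my $type (@types)
        {
            my ($name, $smin, $smax, $umax) = @{$type};
            if ($min_val >= 0)  # nonnegative column: use the unsigned range
            {
                return "$name UNSIGNED" if $max_val <= $umax;
            }
            else
            {
                return $name if $min_val >= $smin && $max_val <= $smax;
            }
        }
        # nothing smaller fits; fall back to BIGINT without range checking
        return ($min_val >= 0 ? "BIGINT UNSIGNED" : "BIGINT");
    }

    print suggest_int_type (0, 200), "\n";          # TINYINT UNSIGNED
    print suggest_int_type (-5, 40000), "\n";       # MEDIUMINT
    print suggest_int_type (0, 5000000000), "\n";   # BIGINT UNSIGNED
    ```

    In print_create_table, a call such as suggest_int_type ($info->{min_val}, $info->{max_val}) could stand in for the fixed "INT" (and the separate UNSIGNED suffix) for integer columns. Values outside the BIGINT range are not detected by this sketch.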
    