I have a huge CSV file. It's 4 GB; I don't know how many rows it has, but it has 320 columns.
Since it can't be opened in any program (except by using third-party programs to split the file into multiple pieces), I'm trying to find a way to extract the data I need. I only need about 10-15 columns from it.
I saw many solutions on the net (most in VBS), but I couldn't get any of them to work. I'd get errors, and I don't know VBS well enough to troubleshoot them.
Can anyone help, please?
Thank you.
PS: Here's one example of the VBS code I found and tried, with no luck.
The original error was “800a01f4 variable is undefined”; on the net it was suggested to take out OPTION EXPLICIT. Once I do that, the next error is “800a01fa class not defined”.
In both cases the line giving the error is “Set adoJetCommand = New ADODB.Command”
Option Explicit
Dim adoCSVConnection, adoCSVRecordSet, strPathToTextfile
Dim strCSVFile, adoJetConnection,adoJetCommand, strDBPath
Const adCmdText = &H0001
' Specify path to CSV file.
strPathToTextFile = "C:\Users\natalie.rynda\Documents\Temp\RemailMatch\"
' Specify CSV file name.
strCSVFile = "NPIOld.csv"
' Specify Access database file.
strDBPath = "C:\Users\natalie.rynda\Documents\Temp\RemailMatch\NPIs.mdb"
' Open connection to the CSV file.
Set adoCSVConnection = CreateObject("ADODB.Connection")
Set adoCSVRecordSet = CreateObject("ADODB.Recordset")
' Open CSV file with header line.
adoCSVConnection.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source=" & strPathtoTextFile & ";" & _
"Extended Properties=""text;HDR=YES;FMT=Delimited"""
adoCSVRecordset.Open "SELECT * FROM " & strCSVFile, adoCSVConnection
' Open connection to MS Access database.
Set adoJetConnection = CreateObject("ADODB.Connection")
adoJetConnection.ConnectionString = "DRIVER=Microsoft Access Driver (*.mdb);" _
& "FIL=MS Access;DriverId=25;DBQ=" & strDBPath & ";"
adoJetConnection.Open
' ADO command object to insert rows into Access database.
Set adoJetCommand = New ADODB.Command
Set adoJetCommand.ActiveConnection = adoJetConnection
adoJetCommand.CommandType = adCmdText
' Read the CSV file.
Do Until adoCSVRecordset.EOF
' Insert a row into the Access database.
adoJetCommand.CommandText = "INSERT INTO NPIs " _
& "(NPI, EntityTypeCode, ReplacementNPI, EIN, MAddress1, MAddress2, MCity, MState, MZIP, SAddress1, SAddress2, SCity, SState, SZIP, ProviderEnumerationDate, LastUpdateDate, NPIDeactivationReasonCode, NPIDeactivationDate, NPIReactivationDate) " _
& "VALUES (" _
& "'" & adoCSVRecordset.Fields("NPI").Value & "', " _
& "'" & adoCSVRecordset.Fields("Entity Type Code").Value & "', " _
& "'" & adoCSVRecordset.Fields("Replacement NPI").Value & "', " _
& "'" & adoCSVRecordset.Fields("Employer Identification Number (EIN)").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider First Line Business Mailing Address").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider Second Line Business Mailing Address").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider Business Mailing Address City Name").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider Business Mailing Address State Name").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider Business Mailing Address Postal Code").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider First Line Business Practice Location Address").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider Second Line Business Practice Location Address").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider Business Practice Location Address City Name").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider Business Practice Location Address State Name").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider Business Practice Location Address Postal Code").Value & "', " _
& "'" & adoCSVRecordset.Fields("Provider Enumeration Date").Value & "', " _
& "'" & adoCSVRecordset.Fields("Last Update Date").Value & "', " _
& "'" & adoCSVRecordset.Fields("NPI Deactivation Reason Code").Value & "', " _
& "'" & adoCSVRecordset.Fields("NPI Deactivation Date").Value & "', " _
& "'" & adoCSVRecordset.Fields("NPI Reactivation Date").Value & "')"
adoJetCommand.Execute
adoCSVRecordset.MoveNext
Loop
' Clean up.
adoCSVRecordset.Close
adoCSVConnection.Close
adoJetConnection.Close
If your CSV file is straightforward, without newlines or commas in unexpected places, then the standard *nix tool awk would be useful. It would allow you to easily extract the 15 columns you are looking for to a new CSV file. This blog post gives an explanation of how to use it on CSV files.
Suppose that you want to extract columns 1, 3 and 7 from file.csv; then you could do this with a single awk command that uses a comma as the field separator and prints fields 1, 3 and 7.
Your Windows machine probably does not have awk installed. There are a few options:
You can find it in
MSYS, which basically provides you with a Unix-like shell environment in Windows. To me, this seems to be the easiest way to go.
Another option seems to be Gawk for Windows, but I have no experience with that, so no guarantees.
You could try to achieve the same result using the Windows PowerShell, as explained in this blog post, if you have that available. Again, I have no experience trying that.
Last but not least, you could switch to Linux, for example in a virtual machine. awk is usually available in *nix environments.
If you are parsing a more awkward CSV file, then check out "parse csv file using gawk" for a bunch of suggestions.
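To make the column extraction described above concrete, here is a minimal sketch, assuming a simple CSV with no quoted fields, embedded commas or embedded newlines. The file names sample.csv, extracted.csv and by_name.csv and the header names a through h are made up for illustration; substitute your own file and columns. The second command is a variation that picks columns by header name instead of by position, which may be more convenient with 320 columns.

```shell
# Create a tiny sample file so the sketch runs on its own
# (hypothetical data; use your real CSV instead).
printf '%s\n' 'a,b,c,d,e,f,g,h' '1,2,3,4,5,6,7,8' > sample.csv

# Extract columns 1, 3 and 7 by position.
# -F',' sets the input field separator, OFS the output separator.
awk -F',' -v OFS=',' '{ print $1, $3, $7 }' sample.csv > extracted.csv

# Variation: extract columns by header name, in the order given in "want".
# Assumes the first line is a header with plain, unquoted names.
awk -F',' -v OFS=',' -v want='g,a' '
NR == 1 {
    n = split(want, names, ",")
    # Map each header name to its column position.
    for (i = 1; i <= NF; i++) pos[$i] = i
    for (j = 1; j <= n; j++) {
        if (!(names[j] in pos)) {
            print "missing column: " names[j] > "/dev/stderr"
            exit 1
        }
        idx[j] = pos[names[j]]
    }
}
{
    # Build the output line from the selected columns only.
    line = $(idx[1])
    for (j = 2; j <= n; j++) line = line OFS $(idx[j])
    print line
}' sample.csv > by_name.csv

cat extracted.csv by_name.csv
```

awk reads the input one line at a time, so memory use should not grow with file size, which is what makes this approach reasonable for a 4 GB file. If the real file wraps fields in quotes, this simple comma split will not be enough.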