Windows Server 2008 R2 Enterprise, SQL Server 2008 X64, SP3, Developer edition
I build and dynamically execute (via sp_executesql) a BULK INSERT command. The general form is:
BULK INSERT #HeaderRowCheck
from "\\Server\Share\Develop\PKelley\StressTesting\101\DataSet.csv"
with
(
lastrow = 1
,rowterminator = '\n'
,tablock
,maxerrors = 0
,errorfile = 'C:\SQL_Packages\TempFiles\#HeaderRowCheck_257626FB-A5CD-41B8-B862-FAF8C591C7A9.log'
)
(The errorfile name is based on a configured local folder, the table being loaded, and a guid generated freshly for every bulk insert run — it’s a subroutine wrapped in its own stored procedure.)
An outside process (was SQL Agent, is now a WCF service) launches DTEXEC, which starts an SSIS package, which calls stored procedures in a database that loop through sets, build the query, and run it for each. Up to four loads could be running at the same time from/into a given database, and multiple databases on the SQL instance could be running this at the same time, though historically volume has been low and we’ve generally only had one instance running this at a time. We do this a lot, and it has worked all but flawlessly for well over two years: security is properly configured, necessary files and folders exist, all the usual. (Luck? I like to think not.)
We are now anticipating some serious workloads, so we’re doing some stress testing, in which I launch 8 runs, each with four processes, where each set of four divides up the files to be loaded and processes them one by one (i.e. up to 32 simultaneous bulk inserts being performed; like I said, stress testing). Lo and behold, when launched, one or more will fail during the course of execution, with an error message like:
Error #4861 encountered while loading header information from file "DataSet.csv": Cannot bulk load because the file "C:\SQL_Packages\TempFiles\#HeaderRowCheck_D0070742-76A5-4175-A1A7-16494103EF25.log" could not be opened. Operating system error code 80(The file exists.).
From run to run, the error does not occur for the same file, data set, or point-in-overall-processing.
On the surface, it sounds like two processes are trying to access the same error file, which would mean that they’re independently generating the same guid(!). My understanding is that’s supposed to be all but impossible. An alternate theory is that so much is going on simultaneously (potentially up to 32 simultaneous BULK INSERT commands running) that SQL and/or the OS is getting confused somehow (I’m a DBA, not a network admin). I could do a work-around, building out my try-catch block to check for error 4861 and retrying up to three times, but I’d rather avoid such kludgery.
I have since tossed in a routine that logs the name of the error file (with the guid) to a table before it is used. After many runs and several fails, I see that (a) the failed file + guid is being logged in my table, and (b) there are no duplicate guids being logged.
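For illustration, a check of that kind might look something like the following. This is only a minimal sketch, not the actual routine: the table dbo.BulkInsertErrorFileLog and its columns are hypothetical names, and the folder is the one from the example command above.

```sql
-- Hypothetical logging table; the name and columns are assumptions for illustration.
CREATE TABLE dbo.BulkInsertErrorFileLog
(
    LogId         int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ErrorFileName nvarchar(400)     NOT NULL,
    LoggedAt      datetime          NOT NULL DEFAULT (GETDATE())
);

-- Build the error file name the same way as for a real run,
-- and record it just before the BULK INSERT that will use it.
DECLARE @ErrorFile nvarchar(400);
SET @ErrorFile = N'C:\SQL_Packages\TempFiles\#HeaderRowCheck_'
               + CAST(NEWID() AS nvarchar(36)) + N'.log';

INSERT INTO dbo.BulkInsertErrorFileLog (ErrorFileName)
VALUES (@ErrorFile);

-- After a stress test: any name logged more than once would point to a guid collision.
SELECT ErrorFileName, COUNT(*) AS TimesLogged
FROM dbo.BulkInsertErrorFileLog
GROUP BY ErrorFileName
HAVING COUNT(*) > 1;
```

An empty result from the last query, together with a logged row for the file named in the error, matches observations (a) and (b).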
Anyone know what might be going on?
Philip
I opened a case with Microsoft Tech Support, and after no small amount of back-and-forth, Pradeep M.M. (SQL Server Support Technical Lead) worked it all out.
The general process: read in a list of files in a folder, and one by one perform a series of bulk inserts on those files (first to read the first line, which we parse for columns, and then to read data from the second+ lines). All bulk inserts utilize the “ErrorFile” option, so as to provide users with what information we can when their data is mis-formatted. The process has worked for 3+ years, but under recent stress testing conditions (up to 8 simultaneous runs performed by a single SQL Server instance, with all files properly formatted), we got the errors listed above.
We initially thought there were errors with generating the GUID, because of that “file exists” error, but that idea was eventually discarded: if newid() wasn’t functioning properly, a lot more people would be having much more serious issues.
As per Pradeep, here is a Step By Step Process of how Bulk Insert Works:
[…] Plan for the same
[…] ERRORFILE parameter then we will create the ErrorFile.log and ErrorFile.Error.Txt to the folder location specified (important thing to understand here is the file will be of 0kb in size)
[…] windows API Calls
[…] and try to execute the Bulk Insert command as a part of it we will re-create the ErrorFile.log and ErrorFile.Error.Txt to the folder location specified (As Per Books Online Documentation the Error files should not be there in this location or else we will fail our execution: http://msdn.microsoft.com/en-us/library/ms188365.aspx)
[…] Bulk insert respective Errors are logged into the Error Files created if there are no errors these 2 files will be deleted.
Running ProcMon (Process Monitor) during failed runs revealed that the ErrorFile was successfully created and opened in step 3, but was NOT closed in step 4, resulting in step 5 generating the error we were seeing. (For successful runs, the file was created and closed as expected.)
Further analysis of the ProcMon output showed that another process running CMD.EXE was issuing “close handle” operations on the file, after the bulk insert attempt. We use a routine involving xp_cmdshell to retrieve the list of files to be processed, and that would be the cause of the CMD.EXE process. Here’s the kicker:
…there is some business logic which launches CMD.EXE inside SQL Server, and since CMD.EXE is a child process it inherits all the handles opened by the parent process. (So probably this is some kind of timing issue wherein CMD.EXE holds handles for files which are open when it got launched, and all those files whose handle is being inherited by CMD.EXE cannot be deleted and can only be released after CMD.EXE is destroyed.)
And that was it. A single run never hits this problem, as its xp_cmdshell call is completed before the bulk inserts are issued. But with parallel runs, particularly with many parallel runs (I only hit the problem with 5 or more going), a timing issue occurred such that:
[…] which internally uses XP_CMDSHELL and launches CMD.EXE to enumerate the Files
[…] and then starts the Bulk Insert Activity and it’s in Compilation phase for the BULK INSERT Command
[…] during compilation phase and then delete it after the compilation phase is done
[…] a stored procedure which internally uses XP_CMDSHELL and launches CMD.EXE to enumerate all the files
[…] Process SQLServr.exe so by default it inherits all the Handles created by SQLServr.exe (So this process gets all the handles for the ERRORFILE that have been created by BULK INSERT in the First Connection)
[…] hence we are trying to delete the file during which we have to close all the handles, We do see that CMD.EXE is holding an handle to the file and it’s still open and hence we cannot delete the file. So without deleting the File we move on to the Execution Phase and in the Execution phase we are trying to create a new ERRORFILE with the same name but since the file already exists we fail with the Error “Operating system error code 80(The file exists.).”
My short-term workaround was to (1) implement a retry loop, generating a new ErrorFile name and attempting a new bulk insert up to three times before giving up, and (2) build another routine into our nightly processes to delete all files found in our “ErrorFile folder”.
The long-term fix is to revise our code to not list files via xp_cmdshell. This would seem to be feasible, since the whole ETL process is wrapped in and managed by an SSIS package; alternatively, CLR routines could be built and worked in. For now, given our anticipated work load, the work-around is sufficient (particularly given everything else we’re working on just now), so it may be a bit before we implement the final fix.
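A retry loop along those lines could be sketched roughly as follows. This is an illustration under assumptions, not the production code: the variable names, the single-column shape of #HeaderRowCheck, and the choice to re-raise with RAISERROR are all made up for the example, while the paths and BULK INSERT options are copied from the command shown in the question.

```sql
-- All variable names and the temp table shape below are assumptions for illustration.
CREATE TABLE #HeaderRowCheck (HeaderRow varchar(max) NULL);

DECLARE @DataFile    nvarchar(260),
        @ErrorFolder nvarchar(260),
        @ErrorFile   nvarchar(400),
        @Sql         nvarchar(max),
        @Msg         nvarchar(2048),
        @Attempt     int,
        @MaxAttempts int,
        @Loaded      bit;

SET @DataFile    = N'\\Server\Share\Develop\PKelley\StressTesting\101\DataSet.csv';
SET @ErrorFolder = N'C:\SQL_Packages\TempFiles\';
SET @Attempt     = 0;
SET @MaxAttempts = 3;
SET @Loaded      = 0;

WHILE @Loaded = 0 AND @Attempt < @MaxAttempts
BEGIN
    SET @Attempt = @Attempt + 1;

    -- Fresh guid on every attempt, so a leftover file from a failed attempt is never reused.
    SET @ErrorFile = @ErrorFolder + N'#HeaderRowCheck_'
                   + CAST(NEWID() AS nvarchar(36)) + N'.log';

    SET @Sql = N'BULK INSERT #HeaderRowCheck FROM ''' + REPLACE(@DataFile, N'''', N'''''') + N'''
                 WITH (lastrow = 1, rowterminator = ''\n'', tablock, maxerrors = 0,
                       errorfile = ''' + REPLACE(@ErrorFile, N'''', N'''''') + N''');';

    BEGIN TRY
        EXEC sys.sp_executesql @Sql;
        SET @Loaded = 1;
    END TRY
    BEGIN CATCH
        -- Retry only on error 4861; any other error, or the final attempt, is reported.
        IF ERROR_NUMBER() <> 4861 OR @Attempt >= @MaxAttempts
        BEGIN
            SET @Msg = ERROR_MESSAGE();
            SET @Attempt = @MaxAttempts;  -- stop looping
            RAISERROR(N'%s', 16, 1, @Msg);
        END
    END CATCH
END
```

Error 4861 is also raised when the data file itself cannot be opened, so this sketch will retry that case as well; checking ERROR_MESSAGE() for "error code 80" would narrow the retry to the "file exists" situation, at the cost of matching on message text.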
Posted for posterity, in case it ever happens to you!