“The MS-XLS file format contains streams, substreams, and records.” –Understanding the Excel MS-XLS Binary Format
Given an xls file stream:
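// Requires: using System.Collections.Generic; using System.IO;
var xs = new List<int>();
using (FileStream stream = File.Open(filePath, FileMode.Open, FileAccess.Read))
{
    // ReadByte returns the next byte as an int, or -1 at the end of the stream.
    for (long i = 0; i < stream.Length; i++)
    {
        xs.Add(stream.ReadByte());
    }
}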
How would I go about detecting substreams? Is the name misleading, in that the substreams are actually stored inside the stream in some sort of sequence?
XLS (like the other MS Office formats from before Office 2007) is a structured storage file, also known as a compound file (see https://en.wikipedia.org/wiki/COM_Structured_Storage). Structured storage is like a filesystem inside a file, where files are called “streams” and directories are called “storages”. A structured storage file has a single root storage, which can contain streams and other storages, and the main streams of an xls file sit directly in that root storage. This also means the raw bytes of the file are not the stream itself: the stream has to be extracted from the container first. The “substreams” in the MS-XLS documentation are one level further down. The Workbook stream is a sequence of records, and each substream is a run of records that starts with a BOF record and ends with an EOF record. So yes, the substreams are stored in sequence inside the stream.
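Before reaching for a structured storage reader, it can help to confirm that the file really is a compound file. The sketch below checks for the eight-byte signature that starts a compound file header. The class and method names are made up for illustration, and matching the signature only shows that the container is structured storage, not that it holds a valid workbook.

```csharp
using System;
using System.IO;

static class CompoundFileCheck
{
    // The eight-byte signature at the start of a compound file header.
    static readonly byte[] Signature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Returns true when the buffer begins with the compound file signature.
    public static bool LooksLikeCompoundFile(byte[] header)
    {
        if (header == null || header.Length < Signature.Length) return false;
        for (int i = 0; i < Signature.Length; i++)
        {
            if (header[i] != Signature[i]) return false;
        }
        return true;
    }

    // Hypothetical convenience wrapper: checks the first eight bytes of a file.
    public static bool FileLooksLikeCompoundFile(string filePath)
    {
        using (FileStream stream = File.OpenRead(filePath))
        {
            byte[] header = new byte[Signature.Length];
            int total = 0;
            // Read may return fewer bytes than requested, so loop until full or EOF.
            while (total < header.Length)
            {
                int read = stream.Read(header, total, header.Length - total);
                if (read == 0) break;
                total += read;
            }
            return total == header.Length && LooksLikeCompoundFile(header);
        }
    }
}
```

A file that fails this check is likely something else entirely, for example an xlsx file (which is a zip archive) that was saved with an .xls extension.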
Normally, you would access structured storage using the IStorage interface (see http://msdn.microsoft.com/en-us/library/windows/desktop/aa380015%28v=vs.85%29.aspx), but that may not be the most convenient method in .NET.
For accessing the data in structured storage in .NET, I’d suggest using OpenMCDF – http://sourceforge.net/projects/openmcdf/ – but I haven’t tried it myself so I can’t make any promises regarding its quality.
“Records” are not part of the structured storage file format, and I think you will need to parse them out of streams yourself.
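As a rough sketch of that parsing step: per the MS-XLS documentation, each record starts with a 2-byte type and a 2-byte size (both little-endian), followed by that many bytes of data, and a substream starts with a BOF record (type 0x0809) and ends with an EOF record (type 0x000A). The code below assumes you have already extracted the Workbook stream from the compound file into a byte array. The class and method names are hypothetical, and it only lists where substreams begin rather than decoding their contents.

```csharp
using System;
using System.Collections.Generic;

static class BiffRecords
{
    const ushort BofType = 0x0809; // BOF: first record of a substream

    // Walks the records of an already-extracted Workbook stream and returns,
    // for each BOF record, its byte offset (Key) and its dt field (Value).
    public static List<KeyValuePair<int, ushort>> FindSubstreams(byte[] workbook)
    {
        var result = new List<KeyValuePair<int, ushort>>();
        int pos = 0;
        while (pos + 4 <= workbook.Length)
        {
            // Record header: 2-byte type, then 2-byte data size, little-endian.
            ushort type = (ushort)(workbook[pos] | (workbook[pos + 1] << 8));
            ushort size = (ushort)(workbook[pos + 2] | (workbook[pos + 3] << 8));
            if (pos + 4 + size > workbook.Length) break; // truncated record

            if (type == BofType && size >= 4)
            {
                // BOF data starts with a 2-byte version, then the 2-byte dt field.
                ushort dt = (ushort)(workbook[pos + 6] | (workbook[pos + 7] << 8));
                result.Add(new KeyValuePair<int, ushort>(pos, dt));
            }
            pos += 4 + size; // skip the header and data to reach the next record
        }
        return result;
    }
}
```

The dt value says what kind of substream follows: 0x0005 for the workbook globals, 0x0010 for a worksheet or dialog sheet, 0x0020 for a chart sheet, and 0x0040 for a macro sheet. Note that BOF/EOF pairs can nest (a chart embedded in a worksheet, for example), so a real parser would also track EOF records and nesting depth.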
Depending on what you’re trying to do, it may be appropriate to use a higher-level interface instead of worrying about the details of the XLS format.