I’m a junior programmer trying to solve a task. Using C# on .NET 4.0, I walk through folders, pick up all *.xml files, and write each one to a new folder with the new extension *.bin. Before writing each file I apply an algorithm that was written by another programmer; I don’t know its implementation.
So I read the *.xml file, deserialize it, and write it to a new *.bin file. Without parallel programming, 2000 files took 1 minute. I then decided to parallelize with Task: I create a new Task for each file (all processing, read-deserialize-write, happens in one Task), and it now takes 40 seconds. But I expected parallel programming to reduce the time to 25-30 seconds.
Please comment on what I am doing wrong and how I should implement this. Thanks.
// Read the whole source file once; every culture task reuses this buffer.
// File.ReadAllBytes guarantees a full read, unlike a single Stream.Read call.
byte[] buffer = File.ReadAllBytes(file);

foreach (var culture in supportedCultures)
{
    // Copy the loop variable so each closure captures its own culture.
    CultureInfo currentCulture = culture;
    Tasks.Add(Task.Factory.StartNew(() =>
    {
        // Each task gets its own read-only stream over the shared buffer.
        var memoryStream = new MemoryStream(buffer);
        Task<object> serializeTask = Task.Factory.StartNew(() =>
        {
            return typesManager.Load(memoryStream, currentCulture);
        }, TaskCreationOptions.AttachedToParent);

        string currentOutputDirectory = null;
        if (outputDirectory != null)
        {
            currentOutputDirectory = outputDirectory.Replace(
                PlaceForCultureInFolderPath,
                currentCulture.ToString());
            Directory.CreateDirectory(currentOutputDirectory);
        }

        string binFile = Path.ChangeExtension(Path.GetFileName(file), ".bin");
        string binPath = Path.Combine(
            currentOutputDirectory ?? Path.GetDirectoryName(file),
            binFile);

        // File.Create truncates an existing file; File.OpenWrite would leave
        // stale trailing bytes if the old file was longer.
        using (FileStream outputStream = File.Create(binPath))
        {
            try
            {
                // Accessing Result blocks until the nested load task completes.
                new BinaryFormatter().Serialize(outputStream, serializeTask.Result);
            }
            catch (SerializationException e)
            {
                ReportCompilationError(e.Message, null);
            }
        }
    }));
}
First, there is no guarantee that the TPL gives any performance gain.
As Jon says, writing to the HDD can decrease performance unless the OS caches these files for later sequential writes, and the cache size certainly has its limits.
Second, the default scheduler is oriented toward utilizing CPU cores, so it is possible that only a few tasks run in parallel while the others wait in a queue. You can change this default by explicitly setting ParallelOptions.MaxDegreeOfParallelism or by calling WithDegreeOfParallelism() in PLINQ queries. Still, it is the scheduler that decides how many tasks actually run in parallel.
There’s a nice free book about multithreading in .NET
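To illustrate the MaxDegreeOfParallelism setting mentioned above, here is a minimal sketch that processes files with Parallel.ForEach instead of one Task per file. ParallelOptions applies to the Parallel class methods, not to Task.Factory.StartNew. The class name ParallelConvertSketch, the ConvertAll method and the Convert placeholder are hypothetical; the real conversion algorithm from the question is not shown, so Convert simply passes the bytes through.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelConvertSketch
{
    // Hypothetical stand-in for the real work (deserialize, then serialize).
    // The actual algorithm is unknown here, so the bytes pass through unchanged.
    static byte[] Convert(byte[] xmlBytes)
    {
        return xmlBytes;
    }

    // Converts every *.xml file in inputDir to a *.bin file in outputDir,
    // running at most maxParallel conversions at a time.
    public static int ConvertAll(string inputDir, string outputDir, int maxParallel)
    {
        Directory.CreateDirectory(outputDir);

        // Upper bound only: the scheduler may still use fewer threads.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        int converted = 0;
        Parallel.ForEach(Directory.GetFiles(inputDir, "*.xml"), options, file =>
        {
            byte[] result = Convert(File.ReadAllBytes(file));
            string binName = Path.ChangeExtension(Path.GetFileName(file), ".bin");
            File.WriteAllBytes(Path.Combine(outputDir, binName), result);

            // The counter is shared between worker threads, so update it atomically.
            Interlocked.Increment(ref converted);
        });
        return converted;
    }

    static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("usage: ParallelConvertSketch <inputDir> <outputDir>");
            return;
        }
        int count = ConvertAll(args[0], args[1], Environment.ProcessorCount);
        Console.WriteLine("Converted {0} files", count);
    }
}
```

Because this workload mixes disk I/O with CPU work, it is probably worth timing a few values of maxParallel rather than assuming that the processor count is the best choice. In a PLINQ query the analogous call is AsParallel().WithDegreeOfParallelism(n).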