I have this chunk of code:
noObjs = 0
Dim oName As String
Dim i As Integer
Dim tripleIndex As Integer = 0
Do While sr.Peek() <> -1
    readCSV = sr.ReadLine.Split(sepChar(0))
    If readCSV.Length >= 3 Then
        oName = readCSV(0)
        For i = noObjs - 1 To 0 Step -1
            If oName = objNames(i) Then
                obIndOfTriple(tripleIndex) = i
                Exit For
            End If
        Next i
        If i = -1 Then
            objNames(noObjs) = oName
            obIndOfTriple(tripleIndex) = noObjs
            noObjs += 1
        End If
    End If
    tripleIndex += 1
Loop
sr.Close()
And I’m trying to parallelise as such:
noObjs = 0
Dim oName As String
Dim i As Integer
Dim tripleIndex As Integer = 0
Dim allData() As String = File.ReadAllLines(in_file)
Parallel.For(0, allData.Count, Sub(k)
    readCSV = allData(k).Split(sepChar(0))
    If readCSV.Length >= 3 Then
        oName = readCSV(0)
        For i = noObjs - 1 To 0 Step -1
            If oName = objNames(i) Then
                obIndOfTriple(tripleIndex) = i
                Exit For
            End If
        Next i
        If i = -1 Then
            objNames(noObjs) = oName
            obIndOfTriple(tripleIndex) = noObjs
            noObjs += 1
        End If
    End If
    tripleIndex += 1
End Sub)
However, I get an “index was outside the bounds of the array” at:
If oName = objNames(i) Then
I should also mention here that objNames() and obIndOfTriple() are declared globally (with a fixed size).
From some searching around, I understand that this has to do with thread safety, although I’m still a newbie in parallelism.
Could anyone point me in the right direction?
Thanks.
The crux of the problem is that you have multiple threads accessing shared resources without synchronizing access to them, which introduces a race condition.
For instance, consider noObjs in relation to objNames. I suspect you want noObjs to always reflect the number of actual items in objNames. Now suppose two threads reach objNames(noObjs) = oName at the same time, and noObjs is 4 at that moment. One thread will write a value into objNames(4) and then the other thread will immediately overwrite it, because the first thread hasn't reached the line that increments noObjs yet. Additionally, once both threads have executed noObjs += 1, noObjs can end up as 6 with nothing stored in objNames(5). That's not the exact case of the exception you're seeing, but it's another symptom of the fragility of the implementation.
In the code that each thread executes, you want to make sure that each thread has its own variable space to work with. You could do that by making objNames and obIndOfTriple two-dimensional arrays. The first dimension would be the loop iteration, k, and the second would be the index into the array just for that iteration. Likewise, noObjs would be an array, and noObjs(k) would be the number of elements in the objNames array associated with loop index k.
Technically that should work, but then you will need to coalesce objNames from a bunch of small arrays into a single large one after Parallel.For finishes, essentially completing the implementation of a map-reduce pattern.
If you do get all that implemented, you might want to take a look at the performance. You're parallelizing the processing of a single line of input, and from the code it doesn't appear you're doing much work for each line. In other words, parallelizing it as you have it, line by line, may actually add more overhead than doing it sequentially. If you have 1000 lines, you're essentially asking for 1000 tiny tasks to run at the same time, so managing the tasks becomes more work than actually executing them. Now, the TPL may decide whether or not to really do something in parallel based on what it thinks is best, so that could mitigate the performance hit.
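To make the map-reduce idea concrete, here is a minimal sketch rather than a drop-in fix. It assumes the goal is the same as in your sequential loop: give every distinct first field an index in first-seen order and record that index per line. The names BuildIndex, firstFields and indexOfName are made up for illustration, and a List and a Dictionary stand in for your fixed-size global arrays. It is also worth noticing that in your parallel version i, oName, readCSV and tripleIndex are declared outside the lambda, so every thread shares them; another thread can move i while yours is reading objNames(i), which is a likely source of the out-of-range index.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Threading.Tasks

Module ParallelIndexSketch
    ' Hypothetical helper: returns the distinct names in first-seen order and
    ' stores, for each usable line k, the index of its name in obIndOfTriple(k).
    Function BuildIndex(allData() As String, sepChar As Char,
                        obIndOfTriple() As Integer) As List(Of String)
        ' Map phase: iteration k only writes to firstFields(k), and all of its
        ' other variables are local to the lambda, so no locking is needed.
        Dim firstFields(allData.Length - 1) As String
        Parallel.For(0, allData.Length,
                     Sub(k As Integer)
                         Dim fields() As String = allData(k).Split(sepChar)
                         If fields.Length >= 3 Then firstFields(k) = fields(0)
                     End Sub)

        ' Reduce phase: runs on one thread, so the shared dictionary and list
        ' can be updated safely.
        Dim indexOfName As New Dictionary(Of String, Integer)
        Dim objNames As New List(Of String)
        For k As Integer = 0 To allData.Length - 1
            Dim name As String = firstFields(k)
            ' Lines with fewer than 3 fields leave their slot untouched,
            ' as in the sequential loop.
            If name Is Nothing Then Continue For
            Dim idx As Integer
            If Not indexOfName.TryGetValue(name, idx) Then
                idx = objNames.Count
                indexOfName.Add(name, idx)
                objNames.Add(name)
            End If
            obIndOfTriple(k) = idx
        Next
        Return objNames
    End Function

    Sub Main()
        ' Made-up sample input; the third line has too few fields and is skipped.
        Dim lines() As String = {"a,1,2", "b,3,4", "bad", "a,5,6"}
        Dim indices(lines.Length - 1) As Integer
        Dim names As List(Of String) = BuildIndex(lines, ","c, indices)
        Console.WriteLine(String.Join(",", names))
        Console.WriteLine(String.Join(",", indices))
    End Sub
End Module
```

Only the Split call runs in parallel here, so for short lines the earlier performance caveat still applies: measure it against the plain sequential loop before keeping it. The Dictionary lookup replaces the backwards linear scan over objNames, and that change alone may matter more than the parallelism.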