I have some fairly straightforward F# async code to download a hundred random articles off of Wikipedia (for research).
For some reason, the code hangs at arbitrary points in time during the download. Sometimes it’s after 50, sometimes it’s after 80.
The async code itself is fairly straightforward:
let parseWikiAsync(url:string, count:int ref) =
    async {
        use wc = new WebClientWithTimeout(Timeout = 5000)
        let! html = wc.AsyncDownloadString(Uri(url))
        let ret =
            try html |> parseDoc |> parseArticle
            with | ex -> printfn "%A" ex; None
        lock count (fun () ->
            if !count % 10 = 0 then
                printfn "%d" !count
            count := !count + 1
        )
        return ret
    }
Because I couldn’t figure out through fsi what the problem was, I made WebClientWithTimeout, a System.Net.WebClient wrapper that allows me to specify a timeout:
type WebClientWithTimeout() =
    inherit WebClient()
    member val Timeout = 60000 with get, set
    override x.GetWebRequest uri =
        let r = base.GetWebRequest(uri)
        r.Timeout <- x.Timeout
        r
And then I use the async combinators to retrieve just over a hundred pages, and weed out all the parseWikiAsync calls that return None (most of which are “disambiguation pages”) until I have exactly 100 articles:
let en100 =
    let count = ref 0
    seq { for _ in 1..110 -> parseWikiAsync("http://en.wikipedia.org/wiki/Special:Random", count) }
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.choose id
    |> Seq.take 100
When I compile the code and run it in the debugger, there are only three threads, of which only one is running actual code — the Async pipeline. The other two have “not available” for location, and nothing in the call stack.
Which I think means that it’s not stuck in AsyncDownloadString or in anywhere in parseWikiAsync. What else could be causing this?
Oh, also, initially it takes about a full minute before the async code actually starts. After that it goes at a fairly reasonable pace until it hangs again indefinitely.
Here’s the call stack for the main thread:
> mscorlib.dll!System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle waitableSafeHandle, long millisecondsTimeout, bool hasThreadAffinity, bool exitContext) + 0x22 bytes
mscorlib.dll!System.Threading.WaitHandle.WaitOne(int millisecondsTimeout, bool exitContext) + 0x28 bytes
FSharp.Core.dll!Microsoft.FSharp.Control.AsyncImpl.ResultCell<Microsoft.FSharp.Control.AsyncBuilderImpl.Result<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]>>.TryWaitForResultSynchronously(Microsoft.FSharp.Core.FSharpOption<int> timeout) + 0x36 bytes
FSharp.Core.dll!Microsoft.FSharp.Control.CancellationTokenOps.RunSynchronously<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]>(System.Threading.CancellationToken token, Microsoft.FSharp.Control.FSharpAsync<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]> computation, Microsoft.FSharp.Core.FSharpOption<int> timeout) + 0x1ba bytes
FSharp.Core.dll!Microsoft.FSharp.Control.FSharpAsync.RunSynchronously<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]>(Microsoft.FSharp.Control.FSharpAsync<Microsoft.FSharp.Core.FSharpOption<Program.ArticleData>[]> computation, Microsoft.FSharp.Core.FSharpOption<int> timeout, Microsoft.FSharp.Core.FSharpOption<System.Threading.CancellationToken> cancellationToken) + 0xb9 bytes
WikiSurvey.exe!<StartupCode$WikiSurvey>.$Program.main@() Line 97 + 0x55 bytes F#
Wikipedia is not to blame here; it’s a result of how Async.Parallel works internally. The type signature for Async.Parallel is seq<Async<'T>> -> Async<'T[]>. It returns a single Async value containing all of the results from the sequence, so it doesn’t return until all of the computations in the seq<Async<'T>> return.
To illustrate, I modified your code so it tracks the number of outstanding requests, i.e., requests which have been sent to the server, but have not yet received / parsed the response.
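The modified listing is not reproduced here, so what follows is only a minimal sketch of that kind of instrumentation, not the code this answer was written against. The names trackOutstanding and fakeDownload are hypothetical, and a short Async.Sleep stands in for the real download so the sketch runs without a network connection.

```fsharp
open System.Threading

/// Hypothetical helper: wraps a job so a shared counter goes up when the job
/// starts and comes back down when it finishes (successfully or not).
let trackOutstanding (outstanding: int ref) (peak: int ref) (job: Async<'T>) : Async<'T> =
    async {
        let now = Interlocked.Increment(outstanding)
        // Remember the highest number of jobs that were in flight at once.
        lock peak (fun () -> if now > peak.Value then peak.Value <- now)
        printfn "started, %d outstanding" now
        try
            return! job
        finally
            let left = Interlocked.Decrement(outstanding)
            printfn "finished, %d outstanding" left
    }

// Stand-in for a download: a short delay instead of a network call (assumption).
let fakeDownload (i: int) : Async<int> =
    async {
        do! Async.Sleep 100
        return i
    }

let outstanding = ref 0
let peak = ref 0

// Same shape as the original code: hand everything to Async.Parallel at once.
let results =
    [ for i in 1 .. 20 -> trackOutstanding outstanding peak (fakeDownload i) ]
    |> Async.Parallel
    |> Async.RunSynchronously

printfn "peak outstanding: %d" peak.Value
```

With real downloads in place of fakeDownload, the peak value is the number to watch: it shows how many requests were in flight at the same moment.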
If you compile and run that code, you’ll see output like this:
As you can see, all of the requests are made before any of them are parsed, so if you’re on a slower connection, or you’re trying to retrieve a large number of documents, the server could be dropping the connection because it may assume you’re not retrieving the response it’s trying to send. Another issue with the code is that you need to explicitly specify the number of elements to generate in the seq, which makes the code less reusable.
A better solution would be to retrieve and parse the pages as they’re needed by some consuming code. (And if you think about it, that’s exactly what an F# seq is good for.) We’ll start by creating a function that takes a Uri and produces a seq<Async<'T>>, i.e., it produces an infinite sequence of Async<'T> values, each of which will retrieve the content from the Uri, parse it, and return the result.
Now we use this function to retrieve the pages as a stream:
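The listings for this step are not shown here. As a rough sketch of the streaming idea only, assuming a hypothetical repeatJob helper and a fakeFetch stand-in that replaces "download Special:Random and parse it" with a short delay, it could look something like this:

```fsharp
open System.Threading

/// Hypothetical helper: an endless, lazy sequence that hands out the same job
/// again and again. Nothing runs until a consumer pulls an element and runs it.
let repeatJob (job: Async<'T>) : seq<Async<'T>> =
    Seq.initInfinite (fun _ -> job)

let started = ref 0

// Stand-in for the real download-and-parse step (assumption): no network call,
// and every third "page" is treated as a disambiguation page, i.e. None.
let fakeFetch : Async<string option> =
    async {
        let n = Interlocked.Increment(started)
        do! Async.Sleep 10
        return (if n % 3 = 0 then None else Some (sprintf "article %d" n))
    }

// Pull results one at a time until ten usable articles have been collected.
let articles =
    repeatJob fakeFetch
    |> Seq.map Async.RunSynchronously   // run one request, wait for it
    |> Seq.choose id                    // drop the None results
    |> Seq.take 10
    |> Seq.toList

printfn "%d articles from %d requests" articles.Length started.Value
```

Because the sequence is lazy, only as many requests are issued as are needed to satisfy Seq.take; there is no need to guess a count such as 110 up front.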
If you run that code, you’ll see that it pulls the results one at a time from the server and doesn’t leave outstanding requests hanging around. Also, it’s very easy to change the number of results you want to retrieve: all you need to do is change the value you pass to Seq.take.
Now while that streaming code works just fine, it doesn’t execute the requests in parallel so it could be slow for large numbers of documents. This is an easy problem to fix, though the solution may be a little non-intuitive. Instead of trying to execute the entire sequence of requests in parallel, which is the problem in the original code, let’s create a function which uses Async.Parallel to execute small batches of requests in parallel, then uses Seq.collect to combine the results back into a flat sequence.
To utilize this function, we just need a few small tweaks to the code from the streaming version:
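Again, the listing itself is missing here, so this is only one possible sketch of the batching approach. The name inBatches is hypothetical, and splitting the stream with Seq.chunkBySize (available from F# 4.0) is a choice made for this sketch, not necessarily how the batches were formed originally. fakeFetch is again a network-free stand-in for the real download.

```fsharp
open System.Threading

/// Hypothetical helper: runs the jobs a batch at a time. Each batch runs in
/// parallel; the next batch only starts when a consumer asks for more results.
let inBatches (batchSize: int) (jobs: seq<Async<'T>>) : seq<'T> =
    jobs
    |> Seq.chunkBySize batchSize
    |> Seq.collect (fun batch -> batch |> Async.Parallel |> Async.RunSynchronously)

let started = ref 0

// Stand-in for the real download-and-parse step (assumption): always succeeds.
let fakeFetch : Async<string option> =
    async {
        let n = Interlocked.Increment(started)
        do! Async.Sleep 50
        return Some (sprintf "article %d" n)
    }

let articles =
    Seq.initInfinite (fun _ -> fakeFetch)
    |> inBatches 5          // at most 5 requests in flight at a time
    |> Seq.choose id
    |> Seq.take 12
    |> Seq.toList

printfn "%d articles from %d requests" articles.Length started.Value
```

At most batchSize requests are outstanding at any moment, at the cost of possibly issuing a few more requests than strictly needed when the last batch is only partly consumed.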
Again, it’s easy to change the number of documents you want to retrieve, and the batch size can easily be modified (again, I suggest you keep it reasonably small). If you wanted to, you could make a few tweaks to the ‘streaming’ and ‘batching’ code so you could switch between them at run-time.
One last thing: with my code the requests shouldn’t time out, so you can probably get rid of the WebClientWithTimeout class and just use WebClient directly.