I’ve just started learning Go and have been working through the tour. The last exercise is to modify a web crawler so that it crawls in parallel and never fetches the same URL twice.
Here is the link to the exercise: http://tour.golang.org/#70
Here is the code. I only changed the Crawl and main functions, so I’ll post just those to keep it neat.
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
var used = make(map[string]bool)
var urlchan = make(chan string)

func Crawl(url string, depth int, fetcher Fetcher) {
	// TODO: Fetch URLs in parallel.
	// Done: Don't fetch the same URL twice.
	// This implementation doesn't do either:
	done := make(chan bool)
	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("\nfound: %s %q\n\n", url, body)
	go func() {
		for _, i := range urls {
			urlchan <- i
		}
		done <- true
	}()
	for u := range urlchan {
		if used[u] == false {
			used[u] = true
			go Crawl(u, depth-1, fetcher)
		}
		if <-done == true {
			break
		}
	}
	return
}

func main() {
	used["http://golang.org/"] = true
	Crawl("http://golang.org/", 4, fetcher)
}
The problem is that when I run the program the crawler stops after printing
not found: http://golang.org/cmd/
This only happens when I try to make the program run in parallel. If I run it sequentially, all the URLs are found correctly.
Note: If I am not doing this right (parallelism I mean) then I apologise.
When the main() func returns, all other goroutines are killed immediately.
Crawl() looks recursive, but it is not: the nested calls are started with go, so it returns without waiting for the other Crawl() goroutines. And if the first Crawl(), called by main(), returns, the main() func regards its mission as fulfilled.
What you can do is make the main() func wait until the last Crawl() returns. The sync package, or a chan, would help.
You could probably take a look at the last solution of this, which I did months ago:
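For illustration, here is a minimal sketch of the sync approach; it is not the solution referred to above. It assumes the Fetcher interface from the tour exercise, and the names fakeFetcher, crawler, and CrawlAll are made up for this example. A sync.WaitGroup counts the outstanding crawl goroutines so the caller can block until the last one returns, and a sync.Mutex guards the map of visited URLs, which several goroutines would otherwise read and write at the same time.

```go
package main

import (
	"fmt"
	"sync"
)

// Fetcher has the same shape as the interface used in the tour exercise.
type Fetcher interface {
	Fetch(url string) (body string, urls []string, err error)
}

// fakeFetcher is a hypothetical canned fetcher used only for this sketch.
// It maps a URL to the links found on that page.
type fakeFetcher map[string][]string

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if urls, ok := f[url]; ok {
		return "body of " + url, urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// crawler (hypothetical name) bundles the state shared by all goroutines.
type crawler struct {
	mu   sync.Mutex      // guards seen
	seen map[string]bool // URLs that have already been claimed
	wg   sync.WaitGroup  // counts crawl goroutines still running
}

// visit marks url as seen and reports whether it was new.
func (c *crawler) visit(url string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[url] {
		return false
	}
	c.seen[url] = true
	return true
}

// crawl fetches url and starts one goroutine per link it finds.
// Every call is matched by exactly one wg.Add(1) made by its caller.
func (c *crawler) crawl(url string, depth int, f Fetcher) {
	defer c.wg.Done()
	if depth <= 0 || !c.visit(url) {
		return
	}
	_, urls, err := f.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s\n", url)
	for _, u := range urls {
		c.wg.Add(1) // register the child before it starts
		go c.crawl(u, depth-1, f)
	}
}

// CrawlAll (hypothetical name) blocks until every crawl goroutine has
// returned, then hands back the set of URLs that were claimed.
func CrawlAll(url string, depth int, f Fetcher) map[string]bool {
	c := &crawler{seen: make(map[string]bool)}
	c.wg.Add(1)
	go c.crawl(url, depth, f)
	c.wg.Wait()
	return c.seen
}

func main() {
	f := fakeFetcher{
		"a": {"b", "c"},
		"b": {"a", "c"},
		"c": {},
	}
	seen := CrawlAll("a", 4, f)
	fmt.Println(len(seen), "URLs visited")
}
```

Because main() only continues after wg.Wait() returns, the program cannot exit while a crawl goroutine is still running. Calling wg.Add(1) before the go statement, rather than inside the new goroutine, keeps the counter from reaching zero too early.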