I have the following code that extracts urls from a given page using jsoup.

Asked: June 7, 2026

I have the following code that extracts urls from a given page using jsoup.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {

        String url = "http://shopping.yahoo.com";
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.getElementsByTag("a");

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.absUrl("href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}

What I'm trying to do is build a crawler that extracts only https sites. I give the crawler a seed link to start with; it should extract all https links from that page, then take each extracted link and do the same with it, until a certain number of URLs has been collected.

My question: the above code extracts all links in a given page. I need to extract only links that start with https://. What do I need to do to achieve this?

1 Answer

Editorial Team · Answer 1 · 2026-06-07T04:40:58+00:00

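You can use jsoup's selectors. They are pretty powerful.

doc.select("a[href^=https://]"); // the one you are looking for: value of href starts with https://
doc.select("a[href*=https]");    // value of href contains https anywhere, so it also matches e.g. http://example.com/?next=https
doc.select("a[href^=www]");      // value of href starts with www
doc.select("a[href$=.com]");     // value of href ends with .com

These selectors test the raw attribute value, so a relative link such as /cart is never matched by the first one, even on an https page. To test the resolved URL instead, check link.absUrl("href").startsWith("https://") inside your loop.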
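For the crawler part of the question (start from a seed, collect https links, repeat on each collected link until a limit is reached), here is a minimal breadth-first sketch. It is only one possible shape, not a definitive implementation. The class name HttpsCrawler, the crawl method, and the extractLinks parameter are all hypothetical names introduced for illustration; the link extraction is passed in as a function so that the loop can be tried without network access, and the example.* URLs in main are made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class HttpsCrawler {

    /**
     * Breadth-first crawl from the seed, collecting at most `limit` distinct
     * links that start with "https://". `extractLinks` (hypothetical helper)
     * returns the absolute URLs of all links found on a page.
     */
    public static List<String> crawl(String seed, int limit,
                                     Function<String, List<String>> extractLinks) {
        List<String> collected = new ArrayList<>();
        Set<String> seen = new HashSet<>();   // avoids collecting or visiting a URL twice
        Deque<String> queue = new ArrayDeque<>();
        seen.add(seed);
        queue.add(seed);

        while (!queue.isEmpty() && collected.size() < limit) {
            String page = queue.poll();
            for (String link : extractLinks.apply(page)) {
                // Simplification: the scheme check is case-sensitive.
                if (link.startsWith("https://") && seen.add(link)) {
                    collected.add(link);
                    queue.add(link);          // crawl this page later
                    if (collected.size() >= limit) {
                        return collected;
                    }
                }
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        // A made-up link graph standing in for real pages.
        Map<String, List<String>> fakeWeb = Map.of(
            "https://a.example/", List.of("https://b.example/", "http://plain.example/", "https://c.example/"),
            "https://b.example/", List.of("https://a.example/", "https://d.example/"));

        List<String> found = crawl("https://a.example/", 3,
            url -> fakeWeb.getOrDefault(url, List.of()));
        System.out.println(found);
    }
}
```

To use it with jsoup, the function passed as extractLinks would fetch the page with Jsoup.connect(url).get() and return link.absUrl("href") for each element of doc.select("a[href]"). Jsoup.connect(...).get() throws IOException, so that function needs to catch it, for example by returning an empty list for pages that fail to load.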
