The Archive Base

Editorial Team
Asked: June 10, 2026

I have to scrape 1000 links that are similar in structure and differ only in their contents.

I designed this spider, but I don't want to put each URL in start_urls, run it, and repeat that 1000 times. I have all the URLs in a file, so how can I pass each one to the spider as its start URL and loop over all 1000?

1 Answer

  1. Editorial Team
    Added an answer on June 10, 2026 at 1:18 am

    Create a spider that overrides BaseSpider's __init__ method. Within it, read the file and append each URL to the start_urls list.

    The code will look something like this:

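    def __init__(self, *args, **kwargs):
        # Pass the arguments through so the base class still sees them
        super(DmozSpider, self).__init__(*args, **kwargs)
        # Load the file here; some_file stands for any iterable of URL strings
        for url in some_file:
            self.start_urls.append(url.strip())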
    

    How you loop through the file depends on its format.
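For the simplest case, a plain text file with one URL per line, the loading step might look like the sketch below. The helper name `load_urls`, the file name `urls.txt`, and the `url_file` argument are assumptions made for illustration; they are not part of the original spider.

```python
def load_urls(path):
    """Return the URLs listed in a text file, one per line.

    Blank lines and lines starting with '#' are skipped, and surrounding
    whitespace (including the trailing newline) is removed.
    """
    urls = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls


# Inside the spider, the constructor could then become (hypothetical names):
#
#     def __init__(self, url_file="urls.txt", *args, **kwargs):
#         super(DmozSpider, self).__init__(*args, **kwargs)
#         self.start_urls = load_urls(url_file)
```

Scrapy passes `-a name=value` options to the spider's constructor as keyword arguments, so with a constructor like the one in the comment the file could be chosen per run with `scrapy crawl scrapyscopus -a url_file=urls.txt`. A CSV or database source would only need a different body for `load_urls`.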

    Also, you might look into using an item pipeline, for example one backed by MySQLdb, to save the data after parsing it.
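To give an idea of what such a pipeline could look like, here is a rough sketch. It assumes the spider returns dict-like items whose keys mirror the spider's attribute names (`abstracto`, `fecha_consulta`, `anio_publicacion`, `probabilidad`, `id_paper_web`); the class name `ScopusMySQLPipeline` and the `connect` hook are hypothetical. The `open_spider`, `process_item`, and `close_spider` methods are the hooks Scrapy calls on a pipeline.

```python
class ScopusMySQLPipeline(object):
    """Sketch of an item pipeline: one connection per crawl, one UPDATE per item."""

    def connect(self):
        # Imported lazily so the class can be loaded without the driver installed.
        # The connection settings are the ones used in the spider.
        import MySQLdb
        return MySQLdb.connect("localhost", "root", "", "proyectoacademias")

    def open_spider(self, spider):
        # Called once when the crawl starts
        self.db = self.connect()

    def process_item(self, item, spider):
        # Called for every item the spider returns; the driver fills in the
        # %s placeholders and takes care of quoting
        cursor = self.db.cursor()
        cursor.execute(
            "UPDATE test SET abstract=%s, fecha_consulta=%s, "
            "anio_publicacion=%s, probabilidad=%s WHERE id=%s",
            (item["abstracto"], item["fecha_consulta"],
             item["anio_publicacion"], item["probabilidad"],
             item["id_paper_web"]),
        )
        self.db.commit()
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends
        self.db.close()
```

The pipeline would be enabled through the project's `ITEM_PIPELINES` setting. Keeping the database code here leaves `parse` responsible only for extraction, and the connection is opened once instead of once per page.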

    EDIT

    I will rewrite your spider for you. Technically, it is best practice to use a pipeline for some of what you are doing, but, for the sake of time, I will make your current spider work. One moment.

    Try this:

    # -*- coding: utf-8 -*-
    # By: Daniel Ortiz Costa, Ivo Andres Astudillo, Ruben Quezada
    # Academias Web project - extract publications from Scopus
    
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import datetime
    import MySQLdb
    
    class DmozSpider(BaseSpider):
        name = "scrapyscopus"
    
        start_urls = ["http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=84858710280",]
    
        # id of the current URL
        id_paper_web = ""
    
        # Database fields
        abstracto = ""
        keywords = ""
        anio_publicacion = ""
        tipo_documento = ""
        tipo_publicacion = ""    
        descripcion = ""
        volume_info = ""
        idioma = ""
        fecha_consulta = ""
        nombres = {}
        instituciones = {}
    
        # Probability that the article belongs to someone we are looking for
        probabilidad = 0
    
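        def __init__(self, *args, **kwargs):
            # Pass the arguments through so the base class still sees them
            super(DmozSpider, self).__init__(*args, **kwargs)
    
            # Load the file here; some_file stands for any iterable of URL strings
            for url in some_file:
                self.start_urls.append(url.strip())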
    
    
        def parse(self, response):
    
            # The response carries the source of the page
            hxs = HtmlXPathSelector(response)
    
            self.obtenerId(response.url)
    
            # The probability of a match is made up of 4 factors
            # 1 - 25% for belonging to the country
            # 2 - 25% for having the same initial and surname
            # 3 - 35% if the article has someone from the university
            # 4 - 15% if everyone on the article is from the university
    
            # The first two conditions are already met, so the score starts at 50%;
            # the rest is determined by reading the institutions in the page.
            # Assigned (not added) so the score does not accumulate across URLs.
            self.probabilidad = 50
    
            # ABSTRACT
            # The abstract is the paragraph whose align attribute contains "justify"
            lista = hxs.select('//p[contains(@align, "justify")]/text()')
    
            # Extract the text
            self.abstracto = lista[0].extract()
    
            # KEYWORDS
            # They are in the last paragraph with class marginB3
            lista = hxs.select('//p[@class="marginB3"]/text()')
    
            # Extract the text of the last result
            self.keywords = lista[len(lista)-1].extract()
    
            # SOURCE TYPE, DOCUMENT TYPE AND LANGUAGE
            # All of them are in spans with class paddingR15
            lista = hxs.select('//span[@class="paddingR15"]')
    
            # Check each span in turn, looking for the ones we need
            for i in lista:
    
                # The <strong> child holds the label of the field.
                # To get the language, for example, look for "Original language: "
                # and then extract the text of the parent span.
                etiqueta = str(i.select('.//strong/text()').extract()[0])
    
                if etiqueta == "Source Type: ":
                    self.tipo_publicacion = i.select('text()').extract()[0]
    
                elif etiqueta == "Original language: ":
                    self.idioma = i.select('text()').extract()[1]
    
                elif etiqueta == "Document Type: ":
                    self.tipo_documento = i.select('text()').extract()[0]
    
            # QUERY DATE
            # The query date is simply today's date
            self.fecha_consulta = datetime.datetime.now().strftime("%Y-%m-%d")
    
    
            # DESCRIPTION
            # The description is built from the header area
            # First extract the title, an h2 with class sourceTitle
            lista = hxs.select('//h2[@class="sourceTitle"]/text()')
    
            # Start the description with the title (reset for each page)
            self.descripcion = str(lista[0].extract()) + "\n"
    
            # Get the volume information, which is also part of the description
            lista = hxs.select('//div[@class="volumeInfo"]/text()')
    
            # Extract it
            self.volume_info = str(lista[0].extract())
    
            # Append it to the description
            self.descripcion = self.descripcion + self.volume_info
    
            # The publication year has to be extracted from the volume information;
            # the method called here takes care of that
            self.obtenerAnioPublicacion()
    
    
            # AUTHORS
            # Find the paragraph that holds the author names
            lista = hxs.select('//p[@class="smallLink authorlink svDoNotLink paddingB5"]')
    
            # Select the spans that are direct children of that paragraph
            lista = lista.select('span')
    
            # Start with an empty mapping so authors from earlier pages are not kept
            self.nombres = {}
    
            for elemento in lista:
    
                lista2 = elemento.select('.//sup')
    
                # Map the author name to the text of the first <sup> only
                for i in lista2:
                    self.nombres[elemento.select('.//span[@class="previewTxt"]/text()').extract()[0]] = i.select('text()').extract()[0]
                    break
    
    
            # ADDRESSES
            # Find the paragraphs that hold the affiliations
            lista = hxs.select('//p[@class="affilTxt"]')
    
            # Build a new list with the text of each sup
            lista2 = lista.select('.//sup/text()')
    
            # The following list will hold the processed data
            letras = []
    
            # Get the letter of each affiliation
            for i in lista2:
                letra = str(i.extract()[0])
                letras.append(letra)
    
            # Select the text nodes of those paragraphs (the institution names)
            lista3 = lista.select('text()')
    
            institucion = []
    
            contador = 0
    
            for i in lista3:
    
                if(i.extract()!="\n"):
                    if "Loja" in i.extract():
                        contador=contador+1
    
                    institucion.append(i.extract())
    
            if contador>=1:
                if contador==1:
                    self.probabilidad=self.probabilidad+35
                else:
                    if contador==len(institucion):
                        self.probabilidad=self.probabilidad+15
    
            self.instituciones=dict(zip(letras, institucion))
    
            self.guardarDatos()
    
        def obtenerAnioPublicacion(self):
            """
            Extract the publication year of the article.
            """
    
            # Split the volume information on the ", " separator
            componentes = self.volume_info.split(', ')
    
            # The position of the year depends on the source type
            if self.tipo_publicacion == "Journal":
                self.anio_publicacion = componentes[2]
    
            else:
                self.anio_publicacion = componentes[0]
    
    
    
        def obtenerId(self, url):
            """
            Look up the id of the current URL in the database.
            """
    
            db = MySQLdb.connect("localhost", "root", "", "proyectoacademias")
    
            cursor = db.cursor()
    
            # Let the driver quote the URL instead of concatenating it into the SQL
            cursor.execute("SELECT id FROM test WHERE url LIKE %s", (url,))
    
            # fetchone() returns None when no row matches
            data = cursor.fetchone()
    
            if data is not None:
                self.id_paper_web = str(data[0])
                print(self.id_paper_web)
    
            db.close()
    
    
    
        def guardarDatos(self):
            """
            Save the extracted data.
            """
            db = MySQLdb.connect("localhost", "root", "", "proyectoacademias")
    
            cursor = db.cursor()
    
            # Parameterized query: the driver escapes quotes in the abstract
            sql = ("UPDATE test SET abstract=%s, fecha_consulta=%s, "
                   "anio_publicacion=%s, probabilidad=%s WHERE id=%s")
            cursor.execute(sql, (self.abstracto, self.fecha_consulta,
                                 self.anio_publicacion, self.probabilidad,
                                 self.id_paper_web))
            db.commit()
    
            # One row per author; the last column is the author's position, from 1
            for i, nombre in enumerate(self.nombres):
                sql = "INSERT INTO test_autores VALUES (%s, %s, %s, %s)"
                cursor.execute(sql, (nombre, self.id_paper_web,
                                     self.instituciones[self.nombres[nombre]], i + 1))
                db.commit()
    
            db.close()
    

    I didn’t change anything other than modifying the init and obtenerId methods.
