The Archive Base

Editorial Team
Asked: June 17, 2026

I’ve managed to write a simple indexer script for MongoDB using pymongo, but I have no idea why indexing, adding documents and querying would take up 96GB of RAM on my server.

Is it because my query isn’t optimized? How could I optimize my query, instead of using database.find_one({"eng":src})?

How else could I optimize my indexer script?

My inputs look like this (the actual input files have 2 million+ lines, with sentences of varying length):

#srcfile

You will be aware from the press and television that there have been a number of bomb explosions and killings in Sri Lanka.
One of the people assassinated very recently in Sri Lanka was Mr Kumar Ponnambalam, who had visited the European Parliament just a few months ago.
Would it be appropriate for you, Madam President, to write a letter to the Sri Lankan President expressing Parliament's regret at his and the other violent deaths in Sri Lanka and urging her to do everything she possibly can to seek a peaceful reconciliation to a very difficult situation?
Yes, Mr Evans, I feel an initiative of the type you have just suggested would be entirely appropriate.
If the House agrees, I shall do as Mr Evans has suggested.

#trgfile

Wie Sie sicher aus der Presse und dem Fernsehen wissen, gab es in Sri Lanka mehrere Bombenexplosionen mit zahlreichen Toten.
Zu den Attentatsopfern, die es in jüngster Zeit in Sri Lanka zu beklagen gab, zählt auch Herr Kumar Ponnambalam, der dem Europäischen Parlament erst vor wenigen Monaten einen Besuch abgestattet hatte.
Wäre es angemessen, wenn Sie, Frau Präsidentin, der Präsidentin von Sri Lanka in einem Schreiben das Bedauern des Parlaments zum gewaltsamen Tod von Herrn Ponnambalam und anderen Bürgern von Sri Lanka übermitteln und sie auffordern würden, alles in ihrem Kräften stehende zu tun, um nach einer friedlichen Lösung dieser sehr schwierigen Situation zu suchen?
Ja, Herr Evans, ich denke, daß eine derartige Initiative durchaus angebracht ist.
Wenn das Haus damit einverstanden ist, werde ich dem Vorschlag von Herrn Evans folgen.

An example doc looks like this:

{
    "_id" : ObjectId("50f5fe8916174763f6217994"),
    "deu" : "Wie Sie sicher aus der Presse und dem Fernsehen wissen, gab es in Sri Lanka mehrere Bombenexplosionen mit zahlreichen Toten.\n",
    "uid" : 13,
    "eng" : "You will be aware from the press and television that there have been a number of bomb explosions and killings in Sri Lanka."
}

My code:

# -*- coding: utf8 -*-
import codecs, glob, os
from pymongo import MongoClient
from itertools import izip
from bson.code import Code

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# Gets first instance of matching key given a value and a dictionary.
def getKey(dic, value):
  matches = [k for k,v in dic.items() if v == value]
  return matches[0] if matches else None

# Converts between 2-letter and 3-letter language codes.
def langiso (lang, isochar=3):
  languages = {"en":"eng",
               "da":"dan","de":"deu",
               "es":"spa",
               "fi":"fin","fr":"fre",
               "it":"ita",
               "nl":"nld",
               "zh":"mcn"}
  if len(lang) == 2 and isochar==3:
    return languages[lang]
  if len(lang) == 3 and isochar==2:
    return getKey(languages, lang)
  return lang  # already in the requested form

def txtPairs (bitextDir):
  txtpairs = {}
  for infile in glob.glob(os.path.join(bitextDir, '*')):
    #print infile
    k = infile[-8:-3]; lang = infile[-2:]
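    try:
      txtpairs[k] = (txtpairs[k],infile) if lang == "en" else (infile,txtpairs[k]) 
    except KeyError:
      # First file seen for this key; its partner has not shown up yet.
      txtpairs[k] = infile
  # Drop unpaired files. Iterate over a copy of the keys, because deleting
  # from a dict while iterating over it raises a RuntimeError.
  for i in list(txtpairs):
    if not isinstance(txtpairs[i], tuple):
      del txtpairs[i]
  return txtpairs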

def indexEuroparl(sfile, tfile, database):   
  trglang = langiso(tfile[-2:]) #; srclang = langiso(sfile[-2:]) 

  # Continue numbering after the highest uid already stored in the collection.
  maxdoc = database.find().sort("uid",-1).limit(1)
  uid = 1 if maxdoc.count() == 0 else maxdoc[0]["uid"] + 1

  counter = 0
  for src, trg in izip(codecs.open(sfile,"r","utf8"), \
                       codecs.open(tfile,"r","utf8")):
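    quid = database.find_one({"eng":src})
    # If sentence already exist in db
    if quid is not None:
      # Check this document for the target language; a bare find() cursor
      # is always truthy, so it cannot be used as the test here.
      if trglang in quid:
        print "Sentence uniqID",quid["uid"],"already exist."
        continue
      else:
        print "Reindexing uniqID",quid["uid"],"..."
        # $set stores the translation as a plain string, like insert() does.
        database.update({"uid":quid["uid"]}, {"$set":{trglang:trg}})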
    else:
      print "Indexing uniqID",uid,"..."
      doc = {"uid":uid,"eng":src,trglang:trg}
      database.insert(doc)
      uid+=1
    if counter == 1000:
      for i in database.find():
        print i
      counter = 0
    counter+=1

connection = MongoClient()
db = connection["europarl"]
v7 = db["v7"]

srcfile = "eng-deu.en"; trgfile = "eng-deu.de"
indexEuroparl(srcfile,trgfile,v7)

# After indexing the english-german pair, i'll perform the same indexing on other language pairs
srcfile = "eng-spa.en"; trgfile = "eng-spa.es"
indexEuroparl(srcfile,trgfile,v7)
1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 5:28 pm

    After several rounds of code profiling, I’ve found where the RAM was going.

    Firstly, if I want to query the "eng" field often, I should create an index on that field:

    v7.ensure_index([("eng",1)], unique=True)

    That removes the time spent on serial scans of the unindexed "eng" field.
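
For reference, here is a minimal sketch of the same index creation written against newer PyMongo releases, where `create_index` takes the place of `ensure_index`. The helper name `ensure_eng_index` is made up for this illustration, and `collection` is assumed to be a PyMongo collection such as `v7`. The `unique` option is passed as a keyword argument rather than as an entry in the key list.

```python
def ensure_eng_index(collection, field="eng"):
    """Create a unique ascending index on `field` and return its name.

    `collection` is assumed to be a pymongo Collection (e.g. db["v7"]).
    """
    # 1 means ascending order; `unique=True` is a keyword option of
    # create_index, not an entry in the key list.
    return collection.create_index([(field, 1)], unique=True)


# Usage (assumes a MongoDB server is reachable on the default host/port):
#   from pymongo import MongoClient
#   v7 = MongoClient()["europarl"]["v7"]
#   ensure_eng_index(v7)
```

Creating a unique index fails if the collection already holds duplicate values in that field, so it is easiest to create it before the first insert.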

    Second, the bleeding RAM problem is due to this costly function call:

    doc = {"uid":uid,"eng":src,trglang:trg}
    if counter == 1000:
      for i in database.find():
        print i
      counter = 0
    counter+=1
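
To get a rough feel for how much work that loop does, the sketch below counts how many documents the periodic `database.find()` scans would read in total. It assumes one new document per input line, a full scan after every 1000 inserts, and 2 million lines overall; the function name `docs_read_by_periodic_scans` is made up for this illustration.

```python
def docs_read_by_periodic_scans(total_docs, scan_every=1000):
    """Total documents read if the whole collection is scanned
    once after every `scan_every` inserts."""
    read = 0
    for size in range(scan_every, total_docs + 1, scan_every):
        # At this point the collection holds `size` documents,
        # and a full find() iterates over all of them.
        read += size
    return read


print(docs_read_by_periodic_scans(2000000))  # 2001000000
```

Under these assumptions that is about two billion document reads (and prints) for two million inserts, so the cost grows quadratically with the input size.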
    

    What MongoDB does is load the results into RAM, as @Sammaye noticed. Each time I call database.find(), it pulls into RAM the whole set of docs I’ve added to the collection so far. That’s how I burned through 96GB of RAM. The above snippet needs to be changed to:

    doc = {"uid":uid,"eng":src,trglang:trg}
    if counter == 1000:
      print doc  # print only the current doc; no query against the collection
      counter = 0
    counter+=1
    

    By eliminating the database.find() call and creating the index on the "eng" field, I’m now using at most 25GB of RAM, and I indexed 2 million sentences in less than 1 hour.
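
Putting both changes together, the indexing loop could look roughly like the sketch below. It is not the original script: it uses Python 3 and the newer PyMongo method names (`insert_one`, `update_one`), the function name `index_pairs` and the `log_every` parameter are made up for illustration, and `collection` is assumed to be a PyMongo collection with the index on "eng" already in place. Progress is reported from a local counter, so the collection is never re-read in full.

```python
def index_pairs(src_lines, trg_lines, collection, trglang,
                next_uid=1, log_every=1000):
    """Insert (src, trg) sentence pairs; return the next free uid.

    `collection` is assumed to behave like a pymongo Collection.
    """
    seen = 0
    for src, trg in zip(src_lines, trg_lines):
        seen += 1
        # Indexed lookup on "eng"; fetches at most one document.
        existing = collection.find_one({"eng": src})
        if existing is None:
            collection.insert_one({"uid": next_uid, "eng": src, trglang: trg})
            next_uid += 1
        elif trglang not in existing:
            # Sentence is known, but this translation is new.
            collection.update_one({"uid": existing["uid"]},
                                  {"$set": {trglang: trg}})
        # Report progress from the local counter instead of calling find().
        if seen % log_every == 0:
            print("processed", seen, "pairs")
    return next_uid
```

Passing open file objects as `src_lines` and `trg_lines` keeps the input streamed line by line, and the returned uid can be handed to the next call when indexing a further language pair.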


© 2021 The Archive Base. All Rights Reserved