The Archive Base


Editorial Team
Asked: May 27, 2026

I would like to use CPython in a hadoop streaming job that needs access


I would like to use CPython in a Hadoop Streaming job that needs access to supplementary information from a line-oriented file kept in a Hadoop file system. By “supplementary” I mean that this file is in addition to the information delivered via stdin. The supplementary file is large enough that I can’t just slurp it into memory and split it on the end-of-line characters. Is there a particularly elegant way (or library) to process this file one line at a time?

Thanks,

SetJmp


1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 4:59 pm

    Check out the Hadoop Streaming documentation on using the Hadoop Distributed Cache in streaming jobs. You first upload the file to HDFS, then tell Hadoop to replicate it to every node before the job runs; Hadoop then places a symlink to it in the job's working directory. You can then open the file with Python's open() and iterate over it with for line in f, which reads one line at a time instead of loading the whole file.

    The distributed cache is the most efficient out-of-the-box way to distribute files that a job uses as a resource. You do not want to open the HDFS file directly from your process, because each task would then stream the file over the network. With the distributed cache, only one copy is downloaded per node, even if several tasks run on that node.


    First, add -files hdfs://NN:9000/user/sup.txt#sup.txt to your command-line arguments when you run the job.
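For context, here is a rough sketch of where that argument sits in the full command line, written as a small Python helper that only builds the argument list. The function name, the jar path, the script name, and the input and output paths are all placeholders and not from the original answer; the part after the # in the -files entry is the name of the symlink that appears in the task's working directory.

```python
def build_streaming_command(streaming_jar, files, mapper, input_path, output_path):
    """Return an argv list for a Hadoop Streaming job (hypothetical helper).

    `files` is a list of entries for the generic -files option; an entry of
    the form "hdfs://host:port/path#name" is linked as "name" in each task's
    working directory.
    """
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options such as -files go before the streaming options.
        "-files", ",".join(files),
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
    ]


# Example with placeholder paths; only the -files entry comes from the answer.
command = build_streaming_command(
    "hadoop-streaming.jar",
    ["mapper.py", "hdfs://NN:9000/user/sup.txt#sup.txt"],
    "mapper.py",
    "/user/input",
    "/user/output",
)
print(" ".join(command))
```

The list form can be passed to subprocess.run if you launch the job from Python, or printed and pasted into a shell.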

    Then:

    with open('sup.txt') as f:  # symlink created by -files
        for line in f:  # reads one line at a time
            pass  # do stuff
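As a slightly fuller illustration, the sketch below shows one way a mapper might combine its stdin records with the cached file without ever holding the supplementary file in memory. It assumes both inputs are tab-separated key/value lines and that the keys in a single task's stdin split fit in memory; neither assumption comes from the question, and the function names are made up for the example.

```python
def iter_lines(path):
    """Yield lines from `path` one at a time, without the trailing newline."""
    with open(path) as f:
        for line in f:  # the file object buffers and yields a line per step
            yield line.rstrip("\n")


def join_with_supplement(stdin_lines, sup_path="sup.txt", sep="\t"):
    """Yield 'key<sep>stdin_value<sep>sup_value' for keys found in both inputs.

    Only the records from this task's stdin split are kept in memory; the
    supplementary file is scanned once, line by line.
    """
    wanted = {}
    for line in stdin_lines:
        key, _, value = line.rstrip("\n").partition(sep)
        wanted[key] = value  # a repeated key keeps its last value

    for line in iter_lines(sup_path):
        key, _, sup_value = line.partition(sep)
        if key in wanted:
            yield sep.join((key, wanted[key], sup_value))
```

In a mapper script you would import sys and loop over join_with_supplement(sys.stdin), printing each record. If the stdin split is itself too large to index this way, a reduce-side join is probably a better fit than reading the cached file in the mapper.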
    
