Editorial Team
Asked: June 9, 2026


I have a program which pulls data out of a MySQL database, decodes a pair of
binary columns, and then sums together a subset of the rows within the pair
of binary columns. Running the program on a sample data set takes 12-14 seconds,
with 9-10 of those taken up by unlist. I’m wondering if there is any way to
speed things up.

Structure of the table

The rows I’m getting from the database look like:

| array_length | mz_array        | intensity_array |
|--------------+-----------------+-----------------|
|           98 | 00c077e66340... | 002091c37240... |
|           74 | c04a7c7340...   | db87734000...   |

where array_length is the number of little-endian doubles in the two arrays
(they are guaranteed to be the same length). So the first row has 98 doubles in
each of mz_array and intensity_array. array_length has a mean of 825 and a
median of 620 with 13,000 rows.
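
As a rough illustration of that layout, the sketch below builds one binary column value by hand and decodes it again. The numbers and variable names are made up for the example; the truncated hex strings in the table above are not reproduced.

```r
# Hypothetical values standing in for one row's mz_array column.
mz <- c(159.2432, 300.5, 1200.125)

# Encode as little-endian 8-byte doubles, the layout described above.
mz_raw <- writeBin(mz, raw(), size = 8, endian = "little")

# Each double occupies 8 bytes, so the raw vector is 8 * array_length long.
array_length <- length(mz_raw) / 8

# Decode the bytes back into doubles.
decoded <- readBin(mz_raw, what = "double", n = array_length, endian = "little")
print(decoded)
```

The round trip is exact because the bytes are copied unchanged in both directions.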

Decoding the binary arrays

Each row gets decoded by being passed to the following function. Once the binary
arrays have been decoded, array_length is no longer needed.

DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
  sapply(list(mz_array=mz_array, intensity_array=intensity_array),
         readBin,
         what="double",
         endian="little",
         n=array_length)
}

Summing the arrays

The next step is to sum the values in intensity_array, but only if their
corresponding entry in mz_array is within a certain window. The arrays are
ordered by mz_array, ascending. I am using the following function to sum up
the intensity_array values:

SumInWindow <- function(spectrum, lower, upper) {
  sum(spectrum[spectrum[,1] > lower & spectrum[,1] < upper, 2])
}

Where spectrum is the output from DecodeSpectrum, a matrix.
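
To make the window rule concrete, here is a small worked example on a made-up spectrum. The matrix values are invented; only SumInWindow itself comes from the code above.

```r
SumInWindow <- function(spectrum, lower, upper) {
  sum(spectrum[spectrum[,1] > lower & spectrum[,1] < upper, 2])
}

# Made-up spectrum: column 1 is m/z (ascending), column 2 is intensity.
spectrum <- cbind(c(500, 650, 900, 1300),
                  c(10,  20,  30,  40))

# Only rows with m/z strictly between 600 and 1200 contribute: 20 + 30.
total <- SumInWindow(spectrum, 600, 1200)
print(total)
```

Both comparisons are strict, so an m/z value exactly equal to either bound is excluded from the sum.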

Operating over list of rows

Each row is handled by:

ProcessSegment <- function(spectra, window_bounds) {
  lower <- window_bounds[1]
  upper <- window_bounds[2]
  ## Decode a single spectrum and sum the intensities within the window.
  SumDecode <- function (...) {
    SumInWindow(DecodeSpectrum(...), lower, upper)
  }

  do.call("mapply", c(SumDecode, spectra))
}
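
The do.call("mapply", ...) idiom may be easier to follow on a toy input. The sketch below assumes the fetched data is a list with one named element per column; the `columns` list and `RowSummary` function are hypothetical stand-ins, not part of the original program.

```r
# Hypothetical stand-in for res$data: one list element per column,
# each holding one entry per row.
columns <- list(array_length    = c(2, 3),
                mz_array        = list(c(1, 2), c(3, 4, 5)),
                intensity_array = list(c(10, 20), c(30, 40, 50)))

# Toy per-row function; its argument names match the column names.
RowSummary <- function(array_length, mz_array, intensity_array) {
  sum(intensity_array[seq_len(array_length)])
}

# c(RowSummary, columns) builds list(RowSummary, array_length = ..., ...),
# so do.call passes the function as FUN and each column as one argument.
via_do_call <- do.call("mapply", c(RowSummary, columns))

# The same call written out by hand.
direct <- mapply(RowSummary,
                 array_length    = columns$array_length,
                 mz_array        = columns$mz_array,
                 intensity_array = columns$intensity_array)
print(via_do_call)
```

mapply then walks the columns in parallel, calling the function once per row.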

And finally, the rows are fetched and handed off to ProcessSegment with this
function:

ProcessAllSegments <- function(conn, window_bounds) {
  nextSeg <- function() odbcFetchRows(conn, max=batchSize, buffsize=batchSize)

  while ((res <- nextSeg())$stat == 1 && res$data[[1]] > 0) {
    print(ProcessSegment(res$data, window_bounds))
  }
}

I’m doing the fetches in segments so that R doesn’t have to load the entire data
set into memory at once (it was causing out of memory errors). I’m using the
RODBC driver because the RMySQL driver isn’t able to return unsullied binary
values (as far as I could tell).

Performance

For a sample data set of about 140MiB, the whole process takes around 14 seconds
to complete, which is not that bad for 13,000 rows. Still, I think there’s room
for improvement, especially when looking at the Rprof output:

$by.self
                 self.time self.pct total.time total.pct
"unlist"             10.26    69.99      10.30     70.26
"SumInWindow"         1.06     7.23      13.92     94.95
"mapply"              0.48     3.27      14.44     98.50
"as.vector"           0.44     3.00      10.60     72.31
"array"               0.40     2.73       0.40      2.73
"FUN"                 0.40     2.73       0.40      2.73
"list"                0.30     2.05       0.30      2.05
"<"                   0.22     1.50       0.22      1.50
"unique"              0.18     1.23       0.36      2.46
">"                   0.18     1.23       0.18      1.23
".Call"               0.16     1.09       0.16      1.09
"lapply"              0.14     0.95       0.86      5.87
"simplify2array"      0.10     0.68      11.48     78.31
"&"                   0.10     0.68       0.10      0.68
"sapply"              0.06     0.41      12.36     84.31
"c"                   0.06     0.41       0.06      0.41
"is.factor"           0.04     0.27       0.04      0.27
"match.fun"           0.04     0.27       0.04      0.27
"<Anonymous>"         0.02     0.14      13.94     95.09
"unique.default"      0.02     0.14       0.06      0.41

$by.total
                     total.time total.pct self.time self.pct
"ProcessAllSegments"      14.66    100.00      0.00     0.00
"do.call"                 14.50     98.91      0.00     0.00
"ProcessSegment"          14.50     98.91      0.00     0.00
"mapply"                  14.44     98.50      0.48     3.27
"<Anonymous>"             13.94     95.09      0.02     0.14
"SumInWindow"             13.92     94.95      1.06     7.23
"sapply"                  12.36     84.31      0.06     0.41
"DecodeSpectrum"          12.36     84.31      0.00     0.00
"simplify2array"          11.48     78.31      0.10     0.68
"as.vector"               10.60     72.31      0.44     3.00
"unlist"                  10.30     70.26     10.26    69.99
"lapply"                   0.86      5.87      0.14     0.95
"array"                    0.40      2.73      0.40     2.73
"FUN"                      0.40      2.73      0.40     2.73
"unique"                   0.36      2.46      0.18     1.23
"list"                     0.30      2.05      0.30     2.05
"<"                        0.22      1.50      0.22     1.50
">"                        0.18      1.23      0.18     1.23
".Call"                    0.16      1.09      0.16     1.09
"nextSeg"                  0.16      1.09      0.00     0.00
"odbcFetchRows"            0.16      1.09      0.00     0.00
"&"                        0.10      0.68      0.10     0.68
"c"                        0.06      0.41      0.06     0.41
"unique.default"           0.06      0.41      0.02     0.14
"is.factor"                0.04      0.27      0.04     0.27
"match.fun"                0.04      0.27      0.04     0.27

$sample.interval
[1] 0.02

$sampling.time
[1] 14.66

I’m surprised to see unlist taking up so much time; this says to me that there
might be some redundant copying or rearranging going on. I’m new at R, so it’s
entirely possible that this is normal, but I’d like to know if there’s anything
glaringly wrong.

Update: sample data posted

I’ve posted the full version of the program
here and the sample data I use
here. The sample data is the
gzipped output from mysqldump. You need to set the proper environment
variables for the script to connect to the database:

  • MZDB_HOST
  • MZDB_DB
  • MZDB_USER
  • MZDB_PW

To run the script, you must specify the run_id and the window boundaries. I
run the program like this:

Rscript ChromatoGen.R -i 1 -m 600 -M 1200

These window bounds are pretty arbitrary, but select roughly a half to a third
of the range. If you want to print the results, put a print() around the call
to ProcessSegment within ProcessAllSegments. Using those parameters, the
first 5 should be:

[1] 7139.682 4522.314 3435.512 5255.024 5947.999

You probably want to limit the number of results, unless you want 13,000
numbers filling your screen 🙂 The simplest way is to add LIMIT 5 to the end
of the query.

1 Answer

    Editorial Team
    Added an answer on June 9, 2026 at 9:54 am

    I’ve figured it out!

    The problem was in the sapply() call. sapply does a fair amount of
    renaming and property setting, which slows things down massively for arrays of
    this size. Replacing DecodeSpectrum with the following code brought the sampling
    time from 14.66 seconds down to 3.36 seconds, roughly a 4-fold speedup!

    Here’s the new body of DecodeSpectrum:

    DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
      ## Template value telling `vapply` the type and length of each result:
      ## a double vector with `array_length` elements.
      resultLength <- numeric(array_length)
    
      vapply(list(mz_array=mz_array, intensity_array=intensity_array),
             readBin,
             resultLength,
             what="double",
             endian="little",
             n=array_length,
             USE.NAMES=FALSE)
    }
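
For comparison, the sketch below runs both variants on a small made-up spectrum. It only shows that the two calls decode the same numbers and differ in the names attached to the result; it does not reproduce the timings, and the input values are invented.

```r
# Made-up spectrum encoded the same way as the database columns.
n <- 4
mz_raw  <- writeBin(c(500, 650, 900, 1300), raw(), endian = "little")
int_raw <- writeBin(c(10, 20, 30, 40),      raw(), endian = "little")
cols <- list(mz_array = mz_raw, intensity_array = int_raw)

# Original approach: sapply simplifies the result and carries the names along.
with_sapply <- sapply(cols, readBin, what = "double", endian = "little", n = n)

# Replacement: vapply is told the result shape up front and skips the names.
with_vapply <- vapply(cols, readBin, numeric(n),
                      what = "double", endian = "little", n = n,
                      USE.NAMES = FALSE)

print(dimnames(with_sapply))  # column names taken from the list
print(dimnames(with_vapply))  # NULL
```

Since SumInWindow indexes columns by position, dropping the column names does not affect the result.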
    

    The Rprof output now looks like:

    $by.self
                   self.time self.pct total.time total.pct
    "<Anonymous>"           0.64    19.75       2.14     66.05
    "DecodeSpectrum"        0.46    14.20       1.12     34.57
    ".Call"                 0.42    12.96       0.42     12.96
    "FUN"                   0.38    11.73       0.38     11.73
    "&"                     0.16     4.94       0.16      4.94
    ">"                     0.14     4.32       0.14      4.32
    "c"                     0.14     4.32       0.14      4.32
    "list"                  0.14     4.32       0.14      4.32
    "vapply"                0.12     3.70       0.66     20.37
    "mapply"                0.10     3.09       2.54     78.40
    "simplify2array"        0.10     3.09       0.30      9.26
    "<"                     0.08     2.47       0.08      2.47
    "t"                     0.04     1.23       2.72     83.95
    "as.vector"             0.04     1.23       0.08      2.47
    "unlist"                0.04     1.23       0.08      2.47
    "lapply"                0.04     1.23       0.04      1.23
    "unique.default"        0.04     1.23       0.04      1.23
    "NextSegment"           0.02     0.62       0.50     15.43
    "odbcFetchRows"         0.02     0.62       0.46     14.20
    "unique"                0.02     0.62       0.10      3.09
    "array"                 0.02     0.62       0.04      1.23
    "attr"                  0.02     0.62       0.02      0.62
    "match.fun"             0.02     0.62       0.02      0.62
    "odbcValidChannel"      0.02     0.62       0.02      0.62
    "parent.frame"          0.02     0.62       0.02      0.62
    
    $by.total
                         total.time total.pct self.time self.pct
    "ProcessAllSegments"       3.24    100.00      0.00     0.00
    "t"                        2.72     83.95      0.04     1.23
    "do.call"                  2.68     82.72      0.00     0.00
    "mapply"                   2.54     78.40      0.10     3.09
    "<Anonymous>"              2.14     66.05      0.64    19.75
    "DecodeSpectrum"           1.12     34.57      0.46    14.20
    "vapply"                   0.66     20.37      0.12     3.70
    "NextSegment"              0.50     15.43      0.02     0.62
    "odbcFetchRows"            0.46     14.20      0.02     0.62
    ".Call"                    0.42     12.96      0.42    12.96
    "FUN"                      0.38     11.73      0.38    11.73
    "simplify2array"           0.30      9.26      0.10     3.09
    "&"                        0.16      4.94      0.16     4.94
    ">"                        0.14      4.32      0.14     4.32
    "c"                        0.14      4.32      0.14     4.32
    "list"                     0.14      4.32      0.14     4.32
    "unique"                   0.10      3.09      0.02     0.62
    "<"                        0.08      2.47      0.08     2.47
    "as.vector"                0.08      2.47      0.04     1.23
    "unlist"                   0.08      2.47      0.04     1.23
    "lapply"                   0.04      1.23      0.04     1.23
    "unique.default"           0.04      1.23      0.04     1.23
    "array"                    0.04      1.23      0.02     0.62
    "attr"                     0.02      0.62      0.02     0.62
    "match.fun"                0.02      0.62      0.02     0.62
    "odbcValidChannel"         0.02      0.62      0.02     0.62
    "parent.frame"             0.02      0.62      0.02     0.62
    
    $sample.interval
    [1] 0.02
    
    $sampling.time
    [1] 3.24
    

    It’s possible that some additional performance could be squeezed out of messing
    with the do.call('mapply', ...) call, but I’m satisfied enough with the
    performance as is that I’m not willing to waste time on that.
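
For anyone who does want to look further, one possible direction is to skip the intermediate matrix and decode each column straight into a vector. This is an unmeasured sketch, not something that was profiled here; `SumDecodeDirect` and the `batch` list are hypothetical names and data.

```r
# Decode one row and sum the intensities inside the window,
# without building a matrix first.
SumDecodeDirect <- function(array_length, mz_array, intensity_array,
                            lower, upper) {
  mz        <- readBin(mz_array, what = "double",
                       n = array_length, endian = "little")
  intensity <- readBin(intensity_array, what = "double",
                       n = array_length, endian = "little")
  sum(intensity[mz > lower & mz < upper])
}

# Hypothetical batch with the same column layout as the question's rows.
enc <- function(x) writeBin(x, raw(), endian = "little")
batch <- list(array_length    = c(2L, 3L),
              mz_array        = list(enc(c(500, 700)), enc(c(650, 900, 1300))),
              intensity_array = list(enc(c(10, 20)),   enc(c(30, 40, 50))))

# The window bounds are constant across rows, so they go in MoreArgs.
sums <- mapply(SumDecodeDirect,
               batch$array_length, batch$mz_array, batch$intensity_array,
               MoreArgs = list(lower = 600, upper = 1200))
print(sums)
```

Whether this is actually faster would need to be checked with Rprof on the real data.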
