I have a dataset that looks like: ColA ColB ColC ColD ColE rs778 C

Question

0

Asked: June 7, 20262026-06-07T02:25:17+00:00 2026-06-07T02:25:17+00:00

I have a dataset that looks like: ColA ColB ColC ColD ColE rs778 C

0

I have a dataset that looks like:

    ColA  ColB   ColC  ColD  ColE
    rs778   C   Can     +   C/T
    rs778   C   Pro     +   C/T
    rs779   P   Can     +   A/G
    rs779   P   Can     -   A/G

I want to remove duplicate entries in Column A based on column C. Said another way, if two entries in Column A are the same, I want the row that stays to be determined by the entry in Column C. If the entries in Column C are the same, then the row that stays should be determined by Column D. If “Can” > “Pro” and “+” > “-“, then the final output I’m looking for would look like this:

    ColA   ColB ColC   ColD ColE
    rs778   C   Can     +   C/T
    rs779   P   Can     +   A/G

I removed completely duplicated data using:

data2 <- data[!duplicated(data[-2]),]

And am hoping my solution lies in some modification of this I have yet to discover. Thanks for your help!

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-07T02:25:19+00:00

Here is a solution that accomplishes what you need, but may not be the most graceful method possible.

data = read.table(header=TRUE, stringsAsFactors=FALSE,
                  text="ColA  ColB   ColC  ColD  ColE
                        rs778   C   Can     +   C/T
                        rs778   C   Pro     +   C/T
                        rs779   P   Can     +   A/G
                        rs779   P   Can     -   A/G")

# Convert ColC and ColD to factors, controlling sort order with levels arg.
# "Can" will sort before "Pro", and "+" will sort before "-".
data$ColC = factor(data$ColC, levels=c("Can", "Pro"))
data$ColD = factor(data$ColD, levels=c("+", "-"))

# Sort rows.
data = data[order(data$ColA, data$ColC, data$ColD), ]

# Works because prefered ColA duplicate sorts highest.
data2 = data[!duplicated(data$ColA), ]

data2
#    ColA ColB ColC ColD ColE
# 1 rs778    C  Can    +  C/T
# 3 rs779    P  Can    +  A/G

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have a dataset that looks like: ColA ColB ColC ColD ColE rs778 C

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply