Skip to content

Conversation

@ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Dec 27, 2025

Closes #635

  • basic functionality dt[i, .ROW := NULL]
  • support by clauses dt[i, .ROW := NULL, by]
  • documentation
  • NEWS
  • tests

Currently only supports deletion of i, but could also be integrated into dogroups.

Do we need/want a functional form like delete(x, irows). Depending on what we allow for irows this would need a rewrite/duplicate of the internals of [i evaluation.

@codecov
Copy link

codecov bot commented Dec 27, 2025

Codecov Report

❌ Patch coverage is 91.72932% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.91%. Comparing base (dc69b3d) to head (57a4bdc).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/deleterows.c 90.09% 10 Missing ⚠️
src/wrappers.c 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7536      +/-   ##
==========================================
- Coverage   98.97%   98.91%   -0.06%     
==========================================
  Files          87       88       +1     
  Lines       16733    16853     +120     
==========================================
+ Hits        16561    16670     +109     
- Misses        172      183      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link

github-actions bot commented Dec 27, 2025

  • HEAD=delete_by_ref slower P<0.001 for DT[by,verbose=TRUE] improved in #6296
    Comparison Plot

Generated via commit 57a4bdc

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 2 minutes and 41 seconds
Installing different package versions 42 seconds
Running and plotting the test cases 3 minutes and 52 seconds

@jangorecki

This comment was marked as resolved.

@ben-schwen
Copy link
Member Author

What if columns are not marked as resizable? AFAIU delete rows will have to create a resizable copy in order to remove it. Right? In such case I think we may want to provide a function to mark resizable explicitly, ahead of time, so further processing is more predicable.

Yes. I have added allocrow as a pendant to alloccol. In the future we can also use this to overallocate the number of rows so we go implement the fast insert #660

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. setallocrow needs to return invisibly.
  2. I was expecting following example to show difference, it does not
x1 = data.table(a = 1:2)
x2 = x1

y1 = data.table(a = 1:2)
invisible(setallocrow(y1))
y2 = y1

delete_first_row = function(x) x[1L, .ROW := NULL]

delete_first_row(x1)
delete_first_row(y1)

x2
y2

could verbose=T tell when object is re-allocated (as resizable) rather than truly updated in place?

we need example that shows why setallocrow() might be desired.

  1. does it make sense to mark resizable already in as.data.table / fread / rbindlist?

@ben-schwen
Copy link
Member Author

I have added verbose messages to allocrow. I'm also not convinced how useful allocrow is right now. In the future yes, it will be essential when overallocating the number rows, respectively shortening overallocated rows.

library(data.table)
dt = data.table(a = 1:5)
truelength(dt$a)
#> [1] 0
setallocrow(dt)
truelength(dt$a)
#> [1] 5
dt[3:5, .ROW := NULL]
truelength(dt$a)
#> [1] 5
setallocrow(dt)
truelength(dt$a)
#> [1] 2

Created on 2025-12-29 with reprex v2.1.1

Thinking about the bigger picture of fast insert we will need to think about how much we want to overallocate when inserting respectively when we want to tidy up.


Note that for efficiency no check is performed for duplicate assignments, i.e. if multiple values are passed for assignment to the same index, assignment to this index will occur repeatedly and sequentially; for a given use case, consider whether it makes sense to create your own test for duplicates, e.g. in production code.

Note that \code{.ROW := NULL} is a special case used to delete rows by reference. Unlike column assignment, this requires an \code{i} expression to specify which rows to delete, and does not support \code{by} or \code{keyby}. See \code{\link{.ROW}} or \code{\link{special-symbols}} for details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside for future consideration -- another use case would be DT[<optional i>, by=<grp>, having=<condition>, .ROW := NULL]

Copy link
Member

@MichaelChirico MichaelChirico Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, be sure to point to ?setallocrow as the functional equivalent

Oh, I misunderstood

\description{
\code{.SD}, \code{.BY}, \code{.N}, \code{.I}, \code{.GRP}, and \code{.NGRP} are \emph{read-only} symbols for use in \code{j}. \code{.N} can be used in \code{i} as well. \code{.I} can be used in \code{by} as well. See the vignettes, Details and Examples here and in \code{\link{data.table}}.
\code{.EACHI} is a symbol passed to \code{by}; i.e. \code{by=.EACHI}, \code{.NATURAL} is a symbol passed to \code{on}; i.e. \code{on=.NATURAL}
\code{.ROW} is a symbol used with \code{:= NULL} to delete rows by reference; i.e. \code{DT[i, .ROW := NULL]} deletes the rows selected by \code{i}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just point to ?assign, the info here is a bit repetitive

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a section to the Reference semantics vignette about this.

E.g., illustrating any issues that might obtain from shared references to data.tables, what happens here?

DT1 = data.table(i = 1:10)
DT2 = DT1

DT2[i > 5, .ROW := NULL]
nrow(DT1)

Comment on lines +1189 to +1198
if (is.null(jsub) || identical(jsub, quote(NULL))) {
if (is.null(irows))
stopf(".ROW := NULL requires i= condition to specify rows to delete")
if (!missingby)
stopf(".ROW := NULL with 'by' or 'keyby' is not supported yet")
.Call(CdeleteRows, x, irows)
return(suppPrint(x))
} else {
stopf(".ROW can only be used with := NULL to delete rows")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest unnesting:

Suggested change
if (is.null(jsub) || identical(jsub, quote(NULL))) {
if (is.null(irows))
stopf(".ROW := NULL requires i= condition to specify rows to delete")
if (!missingby)
stopf(".ROW := NULL with 'by' or 'keyby' is not supported yet")
.Call(CdeleteRows, x, irows)
return(suppPrint(x))
} else {
stopf(".ROW can only be used with := NULL to delete rows")
}
if (!is.null(jsub) && !identical(jsub, quote(NULL))) {
stopf(".ROW can only be used with := NULL to delete rows")
if (is.null(irows))
stopf(".ROW := NULL requires i= condition to specify rows to delete")
if (!missingby)
stopf(".ROW := NULL with 'by' or 'keyby' is not supported yet")
.Call(CdeleteRows, x, irows)
return(suppPrint(x))

# multirow
dt = data.table(a=1:5, b=letters[1:5])
test(2356.09, copy(dt)[c(1L, 3L), .ROW := NULL], dt[c(2,4,5)])
test(2356.10, copy(dt)[c(TRUE, FALSE, TRUE, FALSE, TRUE), .ROW := NULL], dt[c(2,4)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks the same as 2356.03?

Please note: over-allocation of the column pointer vector is not for efficiency \emph{per se}; it is so that \code{:=} can add columns by reference without a shallow copy.
\code{setallocrow} is a utility function that prepares columns for fast row operations (delete or insert (not implemented yet)) by reference and manages row capacity.
Copy link
Member

@MichaelChirico MichaelChirico Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mention the .ROW:=NULL alternative here by linking to ?assign I misunderstood

Please note: over-allocation of the column pointer vector is not for efficiency \emph{per se}; it is so that \code{:=} can add columns by reference without a shallow copy.
\code{setallocrow} is a utility function that prepares columns for fast row operations (delete or insert (not implemented yet)) by reference and manages row capacity.
Copy link
Member

@MichaelChirico MichaelChirico Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current wording doesn't make clear that delete is implemented, insert is not.

Suggested change
\code{setallocrow} is a utility function that prepares columns for fast row operations (delete or insert (not implemented yet)) by reference and manages row capacity.
\code{setallocrow} is a utility function that prepares columns for fast row operations (delete or insert) by reference and manages row capacity. (Note that 'insert' by reference is not yet implemented)

\code{setallocrow} is a utility function that prepares columns for fast row operations (delete or insert (not implemented yet)) by reference and manages row capacity.
Before deleting or inserting rows by reference, columns must be resizable.
\code{setallocrow} ensures all columns are in the appropriate state by converting ALTREP columns to materialized form and reallocating
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a bit of clarification -- when I read it, my thought was "oh, we can't do .ROW:=NULL on a table with ALTREP columns unless we first setallocrow()", but I don't see any examples/tests to that effect.

OTOH, I think (?) we strive to expand all ALTREP columns, so I haven't been able to come up with such an example


SEXP allocrowwrapper(SEXP dt, SEXP n) {
if (!isInteger(n) || length(n)!=1 || INTEGER(n)[0]<0 || INTEGER(n)[0]==NA_INTEGER)
error("n must be a single non-negative non-NA integer");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
error("n must be a single non-negative non-NA integer");
error(_("n must be a single non-negative non-NA integer"));

for (R_xlen_t i = 0; i < length(dt); i++) {
SEXP col = VECTOR_ELT(dt, i);
if (!isVector(col))
error("Cannot make non-vector column %lld resizable", (long long)(i + 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be sure to mark strings for translation as appropriate

Suggested change
error("Cannot make non-vector column %lld resizable", (long long)(i + 1));
error(_("Cannot make non-vector column %lld resizable"), (long long)(i + 1));

Copy link
Member

@MichaelChirico MichaelChirico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is kind of 2 PRs -- one about setallocrow() and .ROW := NULL, are they totally independent?

Or perhaps, we need setallocrow() first in order to enable .ROW := NULL?

It might be easier to review in that sequence -- PR 1 adds setallocrow() (with no way to actually use it), PR 2 enables DELETE operations by reference. WDYT?

@jangorecki
Copy link
Member

jangorecki commented Dec 29, 2025

  1. table that is already marked resizable, when it will get new column added, it should lose/reset it's resizable marker, or make newly added columns resizable:
d = data.table(a = 4:1)
.Internal(inspect(d[["a"]]))
#@55b90c4c1c28 13 INTSXP g0c2 [REF(2)] (len=4, tl=0) 4,3,2,1
setallocrow(d)
.Internal(inspect(d[["a"]]))
#@55b90c4c1e68 13 INTSXP g0c2 [REF(1),gp=0x20] (len=4, tl=4) 4,3,2,1
d[, "x" := c(1,2,3,4)]
.Internal(inspect(d[["a"]]))
#@55b90c4c1e68 13 INTSXP g0c2 [REF(1),gp=0x20] (len=4, tl=4) 4,3,2,1
.Internal(inspect(d[["x"]]))
#@55b910172c58 14 REALSXP g0c3 [REF(2)] (len=4, tl=0) 1,2,3,4

Thinking about the bigger picture of fast insert we will need to think about how much we want to overallocate when inserting respectively when we want to tidy up.

I think it might be explicitly requested from user each time

  1. I was expecting we can show why one might want to use setallocrow, any example? the one I made in my last post appeared to not be working...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Delete rows by reference

3 participants