| Title: | Wrangling Longitudinal Survival Data |
|---|---|
| Description: | Streamlines the process of transitioning between data formats commonly used in survival analysis. Functions convert longitudinal data between formats used as input for survival models as well as support overall preparation. Users are able to focus on model building rather than data wrangling. |
| Authors: | Charles Ingulli [aut, cre] |
| Maintainer: | Charles Ingulli <[email protected]> |
| License: | GPL-3 |
| Version: | 1.0.1.9000 |
| Built: | 2026-06-06 10:02:43 UTC |
| Source: | https://github.com/ci2131a/wlsd |
Creates a new row of values for subjects representing baseline observations in a data set of follow-up observations.
basedate(data, id)

| Argument | Description |
|---|---|
| data | Data frame with relevant columns. |
| id | Character string of the identification column name in `data`. |

Adds a new row for each level of the id column. Internal functions will try to determine any constant columns by checking for consistency within id groups in order to fill in some of the blanks.
A data frame with an added row for each level of id.
basedate(long_data, "id")
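The baseline-row idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the data frame `followup` and the helper `add_baseline_row()` are hypothetical names, and the sketch assumes that a column counts as constant when it takes a single unique value within an id group.

```r
# Hypothetical follow-up data (not the package's long_data).
followup <- data.frame(
  id    = c(1, 1, 2, 2),
  time  = c(5, 9, 3, 7),
  sex   = c("F", "F", "M", "M"),
  score = c(2.1, 3.4, 1.0, 1.8)
)

# Hypothetical helper: prepend one baseline row per id.
add_baseline_row <- function(data, id) {
  pieces <- lapply(split(data, data[[id]]), function(g) {
    base <- g[1, , drop = FALSE]
    for (col in setdiff(names(g), id)) {
      # Blank out columns that vary within the id group;
      # constant columns keep their (single) value.
      if (length(unique(g[[col]])) > 1) {
        base[[col]] <- g[[col]][NA_integer_]
      }
    }
    rbind(base, g)
  })
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

with_baseline <- add_baseline_row(followup, "id")
print(with_baseline)
```

In this sketch the time-varying columns (`time`, `score`) are left as NA in the new row, while the constant column (`sex`) is carried over.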
A toy data set in count format.
count_data
A data frame with 3 rows on the following 5 variables.
id: An identification variable
time: Aggregate time variable
event: Aggregated status indicator variable
var1: First example explanatory variable
var2: Second example explanatory variable
count_data
A toy data set in counting process format.
cp_data
A data frame with 6 rows on the following 6 variables.
id: An identification variable
time1: Starting time of observation interval
time2: Ending time of observation interval
event: Status indicator variable
var1: First example explanatory variable
var2: Second example explanatory variable
cp_data
Transforms data from counting process format to the long format.
cp2long(data, id, time1, time2, status = NULL, fill = FALSE)

| Argument | Description |
|---|---|
| data | A data frame with relevant columns. |
| id | A character string of the identification variable name in `data`. |
| time1 | A character string of the first time point variable in `data`. |
| time2 | A character string of the second time point variable in `data`. |
| status | A character string of the status column name in `data`. |
| fill | An optional argument that attempts to fill any NA values. |

The data transition consolidates information from the time1 and time2 arguments into a single time column. All other columns are assumed to correspond to the time2 point. Thus, the first row for each id generally consists of NA values. The fill argument will attempt to discern any constant columns within id groups in order to populate that first row.
A data frame in long format.
cp2long(data = cp_data, id = "id", time1 = "time1", time2 = "time2")
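The transition described above can be sketched in base R as follows. This is a rough illustration, not the package's code: `cp` and `cp_to_long()` are hypothetical names, the sketch assumes that intervals within an id are contiguous (each time1 equals the previous time2), and it treats a column as constant when it has one unique value within the id group.

```r
# Hypothetical counting process data (not the package's cp_data).
cp <- data.frame(
  id    = c(1, 1, 2),
  time1 = c(0, 4, 0),
  time2 = c(4, 9, 6),
  event = c(0, 1, 0),
  group = c("a", "a", "b")
)

# Hypothetical helper: one row per time point instead of one row per interval.
cp_to_long <- function(data, id, time1, time2, fill = FALSE) {
  other <- setdiff(names(data), c(id, time1, time2))
  pieces <- lapply(split(data, data[[id]]), function(g) {
    g <- g[order(g[[time1]]), , drop = FALSE]
    # New first row: the start of the first interval.
    first <- g[1, c(id, other), drop = FALSE]
    first$time <- g[[time1]][1]
    for (col in other) {
      constant <- length(unique(g[[col]])) == 1
      # Values belong to time2, so the first row is NA unless
      # fill is requested and the column is constant within id.
      if (!(fill && constant)) first[[col]] <- g[[col]][NA_integer_]
    }
    # Remaining rows: existing values placed at the interval end points.
    rest <- g[, c(id, other), drop = FALSE]
    rest$time <- g[[time2]]
    rbind(first, rest)[, c(id, "time", other), drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

long <- cp_to_long(cp, "id", "time1", "time2", fill = TRUE)
print(long)
```

Note that an id with a single interval makes every column look constant under this rule, so with `fill = TRUE` its first row is fully populated.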
Converts one or more event columns within a data frame to a single state vector whose values represent combinations of events.
events2state(data, events, number = TRUE, drop = TRUE, ...)

| Argument | Description |
|---|---|
| data | A data frame with relevant columns. |
| events | The names of the event variables as character strings in a vector. |
| number | A logical argument to determine whether the new state variable should be converted to a number representing the combination of events or left as is. Defaults to TRUE. |
| drop | Passed to interaction. |
| ... | Further arguments to be passed to interaction. |

For a data frame with the necessary inputs, the function will aggregate values across columns supplied to events through the interaction function. The key for the different combination levels is printed to the console.
Returns the input data frame with an added column called state.
events2state(data = long_data, events = c("event", "var2"))
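The aggregation through the base R `interaction()` function can be illustrated directly. The sketch below is not the package's implementation; the data frame `visits` and its columns are hypothetical, and the numbering of states simply follows the level order that `interaction()` produces.

```r
# Hypothetical data with two event indicators.
visits <- data.frame(
  id         = c(1, 1, 2, 2),
  pain       = c(0, 1, 1, 0),
  medication = c(0, 0, 1, 1)
)

# Combine the event columns into one factor; drop = TRUE removes
# combinations that never occur in the data.
combo <- interaction(visits[c("pain", "medication")], drop = TRUE)

# Key that maps each numeric state back to its combination of events.
key <- data.frame(
  state       = seq_along(levels(combo)),
  combination = levels(combo)
)
print(key)

# Numeric state column, analogous to number = TRUE.
visits$state <- as.integer(combo)
print(visits)
```

Keeping `combo` as a factor instead of converting it with `as.integer()` would correspond to leaving the state variable "as is".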
A long format data set from a longitudinal study of low back pain (LBP) on midwestern manufacturing workers.
LBP
A data frame on the following variables:

| Variable | Description | Class |
|---|---|---|
| sid | The subject identification variable for individuals. | Factor |
| Baseline.date | The date of baseline visit or enrollment of individuals into the study. | Date |
| Date | The calendar time of follow-up visit. | Date |
| time_to_row | The number of days between the current follow-up visit and the baseline date. | Integer |
| case.lbp | A status indicator for individuals possessing any LBP (0 for no and 1 for yes). | Integer |
| case.med | A status indicator determining whether individuals are taking medication for LBP (0 for no and 1 for yes). | Integer |
| case.sc | A status indicator to determine whether individuals are seeking care for LBP (0 for no and 1 for yes). | Integer |
| case.ls | A status indicator to determine whether individuals have lost time from work due to LBP (0 for no and 1 for yes). | Integer |
| gender | The gender of the individual (either M for Male or F for Female). | Factor |
| age | The age of the individual at baseline visit in years. | Numeric |
| weight | The weight of individuals in lbs. | Integer |
| height | The height of individuals in inches. | Integer |
| raceth | A categorical variable to determine the race/ethnicity of individuals (0 = White; 1 = Hispanic/Latino; 2 = Black; 3 = Asian; 4 = Native Hawaiian or Pacific Islander; 5 = Native American or Native Alaskan; 6 = Other/declined). | Factor |
| smoking | A smoking indicator variable (0 = smoked less than 100 cigarettes in life; 1 = smoked in the past, but no longer; 2 = currently smoke). | Factor |
| comptenure | A categorical variable to determine length of time at the current company (0 = less than 3 months; 1 = 3 months to 1 year; 2 = 1 year to 3 years; 3 = 3 years to 5 years; 4 = 5 years to 10 years; 5 = 10 or more years). | Factor |
| jobtenure | A categorical variable to determine length of time in their current job (0 = less than 3 months; 1 = 3 months to 1 year; 2 = 1 year to 3 years; 3 = 3 years to 5 years; 4 = 5 years to 10 years; 5 = 10 or more years). | Factor |
| control.order | A categorical variable to determine how much control individuals have over the order in which they complete tasks (0 = "Very Much", 1 = "Much", 2 = "Moderate Amounts", 3 = "A Little", 4 = "Very Little"). | Factor |
| control.pace | A categorical variable to determine how much control individuals have over the pace at which they complete tasks (0 = "Very Much", 1 = "Much", 2 = "Moderate Amounts", 3 = "A Little", 4 = "Very Little"). | Factor |
| control.breaks | A categorical variable to determine the amount of control individuals have in taking breaks between completing tasks (0 = "Very Much", 1 = "Much", 2 = "Moderate Amounts", 3 = "A Little", 4 = "Very Little"). | Factor |
| supervisor.support | A categorical variable determining how much support individuals feel they receive from their supervisor (0 = "Almost Always", 1 = "Some of the Time", 2 = "Hardly Ever"). | Factor |
| coworker.support | A categorical variable determining how much support individuals feel they receive from their coworkers (0 = "Almost Always", 1 = "Some of the Time", 2 = "Hardly Ever"). | Factor |
| job.satisfied | A categorical variable to determine whether individuals feel satisfied with their current job (0 = "Very Satisfied", 1 = "Somewhat Satisfied", 2 = "A Little Satisfied", 3 = "Not at all Satisfied"). | Factor |
| bmi | The calculated body mass index (BMI) of individuals based on height and weight. | Numeric |

The data set was constructed by consolidating various source files pulled from the original database. The final data frame contains follow-up information for selected individuals. The case definitions assessed over time were case.lbp, case.med, case.sc, and case.lt. Column time_to_row is constructed from the Baseline.date and Date columns as the number of days between the baseline date and each observation (denoted by rows). All other columns are constant with respect to time. Categorical variables were recorded through self-assessment on the part of the subject. The height and weight variables were physically measured and then used to calculate bmi.
LBP Research Consortium, University of Wisconsin-Milwaukee
Garg, Arun, Kurt Hegmann, J. Moore, Jay Kapellusch, Matthew Thiese, Sruthi Boda, Parag Bhoyar, Donald Bloswick, Andrew Merryweather, Richard Sesek, Gwen Deckow-Schaefer, James Foster, Eric Wood, Xiaoming Sheng, and Richard Holubkov (2013). Study protocol title: A prospective cohort study of low back pain. BMC Musculoskeletal Disorders 14(84), 84.
Ingulli, Charles. (2020). A Survey of Statistical Methods for Investigating Risk of Low Back Pain in a Cohort of Manufacturing Workers. (85696). [Master's Thesis, American University]
LBP
A toy data set in long format.
long_data
A data frame with 9 rows on the following 5 variables.
id: An identification variable
time: Time of observation
event: Status indicator variable
var1: First example explanatory variable
var2: Second example explanatory variable
long_data
Aggregates longitudinal data into a count format data set.
long2count(data, id, event = NULL, state = NULL, FUN, ...)

| Argument | Description |
|---|---|
| data | A data frame with relevant columns. |
| id | A character string of the identification variable name in `data`. |
| event | The name(s) of the event column(s) in `data`. |
| state | The name of the state variable in `data`. |
| FUN | The summary function to be applied to all time-dependent columns (wrapper for argument in |
| ... | Additional arguments supplied to |

The returned data frame aggregates any time-dependent values based on row-wise changes within id groups. Two columns are added: event.counts, which is the sum of the values in the event column for each level of id (or the totals for the levels of the state column, if state is supplied), and count.weight, which is the number of rows for each level of id.
A data frame aggregated into count format.
# if the "event" column should be summed
long2count(long_data, id = "id", event = "event")
# if the "event" column contains levels that should be summed separately
long2count(long_data, id = "id", state = "event")
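The two added columns described above can be reproduced by hand with base R's `aggregate()` and `merge()`. This is a minimal sketch rather than the package's implementation: the data frame `long` is hypothetical, and `max` is an assumed choice of summary function for the time-dependent columns.

```r
# Hypothetical long format data (not the package's long_data).
long <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  time  = c(0, 3, 6, 0, 5),
  event = c(0, 1, 1, 0, 0),
  dose  = c(10, 12, 14, 8, 8)
)

# Summarise the time-dependent columns per id (max is an assumption).
summary_cols <- aggregate(cbind(time, dose) ~ id, data = long, FUN = max)

# Sum of the event indicator per id.
event_counts <- aggregate(event ~ id, data = long, FUN = sum)
names(event_counts)[2] <- "event.counts"

# Number of rows contributed by each id.
count_weight <- aggregate(event ~ id, data = long, FUN = length)
names(count_weight)[2] <- "count.weight"

# One row per id with the summaries side by side.
counts <- Reduce(
  function(x, y) merge(x, y, by = "id"),
  list(summary_cols, event_counts, count_weight)
)
print(counts)
```

A count in this shape is typically paired with an exposure or weight term in a count model, which is the role count.weight appears to play here.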
Transforms data from long format to counting process format.
long2cp(data, id, time, status = NULL, drop = FALSE)

| Argument | Description |
|---|---|
| data | A data frame with relevant columns. |
| id | A character string of the identification column name in `data`. |
| time | A character string of the time column name in `data`. |
| status | A character string of the status column in `data`. |
| drop | Logical indicator for whether any id levels with only one row should be dropped. |

The transition is primarily done by shifting the column supplied to the time argument into two new columns for a column-wise time definition and adjusting rows accordingly. Columns supplied to the status argument are assumed to occur at the right endpoint, so the first value for each id of the input is dropped. All other time-varying columns are assumed to occur at the left endpoint, so the last value for each id of the input is dropped. The drop argument can be used for any id levels that have only one row, for which a two-column time definition may not be suitable. Since nothing useful is gained from an interval that starts and ends at the same time, it may be useful to drop those id levels altogether.
A data frame in counting process format.
long2cp(data = long_data, id = "id", time = "time", status = "event")
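The endpoint conventions described above can be sketched in base R. This is an illustration under assumptions, not the package's code: `long` and `long_to_cp()` are hypothetical names, and the sketch always discards ids with a single row, which corresponds to the dropping behaviour discussed for the drop argument.

```r
# Hypothetical long format data; id 3 has a single row.
long <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3),
  time  = c(0, 3, 6, 0, 5, 0),
  event = c(0, 0, 1, 0, 1, 0),
  dose  = c(10, 12, 14, 8, 9, 7)
)

# Hypothetical helper: consecutive time points become (time1, time2] intervals.
long_to_cp <- function(data, id, time, status) {
  pieces <- lapply(split(data, data[[id]]), function(g) {
    g <- g[order(g[[time]]), , drop = FALSE]
    n <- nrow(g)
    if (n < 2) return(NULL)  # no interval can be formed from one row
    # Covariates are taken at the left endpoint: drop the last row.
    out <- g[-n, setdiff(names(g), c(time, status)), drop = FALSE]
    out$time1 <- g[[time]][-n]
    out$time2 <- g[[time]][-1]
    # Status is taken at the right endpoint: drop the first value.
    out[[status]] <- g[[status]][-1]
    out
  })
  pieces <- Filter(Negate(is.null), pieces)
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

cp <- long_to_cp(long, "id", "time", "event")
print(cp)
```

The resulting time1, time2, and event columns have the shape expected by counting process model interfaces such as `Surv(time1, time2, event)` in the survival package.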
Takes all rows of a data frame up to and including the first occurrence of a supplied criteria for grouped data.
takefirst(data, id, criteria.column, criteria)

| Argument | Description |
|---|---|
| data | A data frame with relevant columns. |
| id | A character string of the identification vector name defining groups in `data`. |
| criteria.column | The name, as a character string, of the column in `data`. |
| criteria | The value of the cutoff for subsetting. |

Returns a data frame that takes all rows within the groups supplied by id up to and including the first occurrence of the value of criteria in criteria.column.
A data frame subset up to and including the first row matching criteria in criteria.column for each level of id.
takefirst(long_data, "id", criteria.column = "var1", criteria = 10.4)
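The subsetting rule can be sketched in base R as below. The names `visits` and `take_first()` are hypothetical, and the sketch makes one assumption the text does not settle: groups that never meet the criterion are kept in full.

```r
# Hypothetical grouped data; id 2 never has event == 1.
visits <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 3),
  time  = c(0, 2, 4, 0, 3, 0, 1),
  event = c(0, 1, 1, 0, 0, 1, 0)
)

# Hypothetical helper: keep rows up to and including the first match per id.
take_first <- function(data, id, criteria.column, criteria) {
  pieces <- lapply(split(data, data[[id]]), function(g) {
    hit <- which(g[[criteria.column]] == criteria)
    # Assumption: a group with no match is returned unchanged.
    if (length(hit) == 0) return(g)
    g[seq_len(hit[1]), , drop = FALSE]
  })
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

first_events <- take_first(visits, "id", "event", 1)
print(first_events)
```

This pattern is common when only the time to a first event matters, since rows after the first occurrence are discarded. Matching with `==` on non-integer numeric values can be sensitive to floating point representation, so a tolerance-based comparison may be preferable in that case.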
A toy data set in wide format.
wide_data
A data frame with 3 rows on the following 14 variables.
id: An identification variable
time1: First time observation column
time2: Second time observation column
time3: Third time observation column
time4: Fourth time observation column
event1: Status indicator at first time
event2: Status indicator at second time
event3: Status indicator at third time
event4: Status indicator at fourth time
var11: First explanatory variable at first time
var12: First explanatory variable at second time
var13: First explanatory variable at third time
var14: First explanatory variable at fourth time
var2: Second explanatory variable
wide_data