View Issue Details

This bug affects 1 person(s).
 24
IDProjectCategoryView StatusLast Update
05289Feature requests_ Unknownpublic2023-02-03 18:24
ReporterTMSWhite Assigned Toc_schmitz  
PrioritynormalSeverityfeature 
Status acknowledgedResolutionopen 
Summary05289: Redesign response storage to support more questions+answers (database columns) in a survey
Description

According to the documentation, LimeSurvey can hold at most 1000 questions & answers per survey. A more generic data model could let LimeSurvey support 1000s of questions and answers. I have used such a generic model to support surveys with 3500+ questions and 10,000+ answers.

A brief description of the data model, plus Entity Relationship diagrams can be found at issue 05106.

Given the Google 2011 Summer of Code project to redesign the database API, it is timely to consider alternate (or even parallel) database architectures.

FYI, the system I used took exactly that strategy - everything was written to both generic Entity-Attribute-Value, and specific (like LimeSurvey's current approach) database tables. There were no significant performance limitations.

TagsNo tags attached.
Attached Files
LS-EAV-DataModel.png (490,551 bytes)
Bug heat24
Story point estimate20
Users affected %10

Relationships

has duplicate 09620 closedc_schmitz proposed fix to the 1600 column issue 
has duplicate 07745 closedc_schmitz Increase database limit by joining tables 
related to 07411 acknowledged Warn survey authors about survey size 
related to 15021 closedc_schmitz Use VIEW to cirumvent the database column limit 
related to 07593 closedc_schmitz Increase the allowed length of the answer code 

Users monitoring this issue

There are no users monitoring this issue.

Activities

TMSWhite

TMSWhite

2012-01-16 16:45

reporter   ~16802

Now that EM integration is completed in 1.92 and Yii, this would be a relatively easy addition. EM already controls collecting all of the values that need to be updated, and keeps track of which questions were asked (so it can validate them upon page turn or submission), so it could be extended to support a vertical, Entity Attribute Value-type database design. Here are the steps in the task.

(A) Admin menu (General, Notification & data management):

  • add $DatabaseStyle option, which choices {horizontal, vertical-EAV, both}

(B) On activation, always create survey_SID table with the core columns
-id
-submitdate
-lastpage
-startlanguage
-token
-datestamp
-startdate

(C)if $DatabaseStyle is horizontal or both, add SGQA columns

(D) if $DatabaseStyle is vertical-EAV or both, create tables similar to that described in LS-EAV-DataModel.png attachment. LS doesn't need as many columns, but in general, here is the idea. We can change the names, but this data model is known to work:
(1) data_elements stores one row per SGQA code per survey instance. It keeps track of the current value for all data. At survey-start-time, X rows are added to data_elements, one per SGQA code, all NULL
(2) item_usages - this stores data about questions asked during the survey. At survey start, it has no content. Each time a page is generated, LS would insert rows into item_usages - one row per SGQA on the generated page. Then, if the subject answers the questions, the values are updated when responses are submitted prior to navigating to the next page. This table includes important meta-data including:
(a) starting answer values and codes,
(b) finishing answer values and codes (so can detect that someone has changed an answer),
(c) questionAsAsked - this is the fully tailored question in the current language - this helps you debug cases where you didn't ask what you expected,
(d) time, and
(e) visit_num - so you know how many times this question was asked
(3) page_usages - this stores usage timing data - such as
(a) how long server took to generate process requests and generate next page
(b) how long page took to render
(c) network latency
(d) how much time the user spent on the page

(E) Modify EM data saving functions to save into either or both of the $DatabaseStyle. With each page flip,
(1) Update survey_SID table so know $lastpage
(2) Update changed values in data_elements table
(3) Insert values into item_usages table
(4) Insert row onto page_usage table

(F) Modify database export to transpose the data_elements table to horizontal format for export of SPSS, R, etc. data

(G) Optionally add performance monitoring reports that show how survey performance changes with load or time.

tacman1123

tacman1123

2012-03-29 20:45

reporter   ~18123

"At survey-start-time, X rows are added to data_elements, one per SGQA code, all NULL". I don't think that's necessary. If a row is missing, it means it hasn't been answered yet.

I'd like to propose that we add a group_sequence to the data_elements. That would allow for repeatable groups or subquestions. In the flat/horizontal database structure, you'd have to have a column for up the the maximum number of items, but you wouldn't have that restriction in the vertical structure.

So your data might look like this (skipping survey_SID and timing data)

child|1|name|Peter
child|1|gender|M
child|2|name|Paul
child|2|gender|M
child|2|name|Mary
child|3|gender|F

The group_code (or prefix) would be "child", it would contain two questions (name and gender), be and tagged as repeatable. In the UI, you'd see the question in a group, but instead of "Next>", you'd see "Add Another>", which would repeat that group (and update the internal sequence).

The point here is to NOT generate fields like child1_age, child2_age, which then requires parsing out field names to map back to anything. Certainly you'd have the option of exporting data as such, but internally that's a very difficult way to keep the repeatable data.

TMSWhite

TMSWhite

2012-03-29 22:05

reporter   ~18124

Tac-

The reason for "At survey-start-time, X rows are added to data_elements, one per SGQA code, all NULL" is so that we can easily get the current value for all data elements. That way, there is a single SQL INSERT statement when the survey starts, and everything else is an UPDATE. However, as long as we do a good job of maintaining state, LS should know which variables have not been set, so could do INSERTs for those and UPDATEs for the rest.

That would be essential to support your proposal for repeating groups.

Since EM needs to know the names of all declared variables, we might need a different naming syntax for such variables. As you know, we already use a dot notation for question attributes. Perhaps we could use a tilde syntax for repeating groups - like <group>~<repeat#>~<varname>.<suffix>, like child~1~name.shown - but this will probably require more thought.

tacman1123

tacman1123

2012-03-29 22:28

reporter   ~18125

I think the idea of a group code is key in that setup. There will be other uses for that code as well, in EM.

I think the EM should also handle relative rules, so that within a repeatable group the rule doesn't need to spell out the number, e.g.

the condition for childhad_hpv_vaccine is childgender == 'F'
the condition for childhpv_vaccine_date is childhad_hpv_vaccdine == 'Y'

These condition would be relative to the group they're already in (and therefore the child~~ isn't necessary, but something would be needed to say "look up a question from within this group with my same sequence number".

In MySQL, you can do a single insert/update by doing an INSERT ... ON DUPLICATE KEY UPDATE (http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html), not sure how that would translate to the other databases.

TMSWhite

TMSWhite

2012-03-29 22:36

reporter   ~18126

Other things people might want from such repeating groups include:
(1) comparison of current group field to prior group - e.g. if you're supposed to enter ages of children in descending order
(2) functions across all entries for a variable in a group - e.g. if your group has you list products you purchased and their cost, you might want an easy way to sum the costs, get the max or min value of them, etc.

This sounds like it might be an extension of the "this" functionality proposed elsewhere.

Another option is to let EM support array processing, but that is a much bigger lift.

tacman1123

tacman1123

2012-03-29 22:46

reporter   ~18127

Brainstrorming: I wonder if we'd be able to have a "grid group", which would have an interface similar to labels/subquestions (code, assessment value, label text), but for any group with questions that could fit in a single line.

So your repeatable group would be displayed as a grid, with + - up/down icons, e.g.

Name | Age | Status
[ ] | [ ] | ()In School ()Not in School () Home Schooled. + - ^ v

The radio could also be a dropdown, the age would be a number, etc.

Right now, the grids are either all numeric or all text, no validation, rarely what we need. The alternative is to set up a group and repeat it, but that leads to a lot of questions.

The horizontal layout of a group would even allow us do to conditional questions, disabling some columns if they're not relevant.

ollehar

ollehar

2023-02-03 18:24

administrator   ~73662

Schemaless database might be a better approach here. But the first step would be to add an abstraction layer between EM and the storage itself, I think.

Also note that this is a complex tasks that affects few users of the software. Only the advanced users hit the limit.

Issue History

Date Modified Username Field Change
2011-06-19 22:17 TMSWhite New Issue
2012-01-16 16:45 TMSWhite Note Added: 16802
2012-01-16 16:46 TMSWhite File Added: LS-EAV-DataModel.png
2012-01-16 17:22 TMSWhite Project @2@ => Development
2012-03-29 20:45 tacman1123 Note Added: 18123
2012-03-29 22:05 TMSWhite Note Added: 18124
2012-03-29 22:28 tacman1123 Note Added: 18125
2012-03-29 22:36 TMSWhite Note Added: 18126
2012-03-29 22:46 tacman1123 Note Added: 18127
2021-03-07 20:39 c_schmitz Project Development => Feature requests
2021-03-07 20:39 c_schmitz Category Other => Other issues
2021-03-07 20:40 c_schmitz Severity @50@ => feature
2021-03-07 20:40 c_schmitz Category Other issues => _ Unknown
2021-03-07 20:40 c_schmitz Product Version 1.91 =>
2021-03-07 20:40 c_schmitz Summary Support more than 1000 questions+answers (database columns) in a survey => Redesign response storage to support more questions+answers (database columns) in a survey
2021-03-07 20:43 c_schmitz Relationship added has duplicate 09620
2021-03-07 20:43 c_schmitz Assigned To => c_schmitz
2021-03-07 20:43 c_schmitz Status new => acknowledged
2021-03-07 20:43 c_schmitz Relationship added related to 07411
2021-03-07 21:08 c_schmitz Relationship added related to 15021
2021-03-07 21:10 c_schmitz Relationship added related to 07593
2021-03-07 21:11 c_schmitz Relationship added has duplicate 07745
2023-02-03 18:23 ollehar Story point estimate => 20
2023-02-03 18:23 ollehar Users affected % => 10
2023-02-03 18:24 ollehar Note Added: 73662
2023-02-03 18:24 ollehar Bug heat 22 => 24