05289: Redesign response storage to support more questions+answers (database columns) in a survey

This bug affects 1 person(s).

ID	Project	Category	View Status	Date Submitted	Last Update

05289	Feature requests	_ Unknown	public	2011-06-19 22:17	2023-02-03 18:24

Reporter	TMSWhite	Assigned To	c_schmitz
Priority	normal	Severity	feature
Status	acknowledged	Resolution	open

Summary	05289: Redesign response storage to support more questions+answers (database columns) in a survey
Description	According to the documentation, LimeSurvey can hold at most 1000 questions & answers per survey. A more generic data model could let LimeSurvey support thousands of questions and answers. I have used such a generic model to support surveys with 3500+ questions and 10,000+ answers. A brief description of the data model, plus entity-relationship diagrams, can be found in issue 05106. Given the Google 2011 Summer of Code project to redesign the database API, it is timely to consider alternate (or even parallel) database architectures. FYI, the system I used took exactly that strategy: everything was written to both generic Entity-Attribute-Value tables and specific database tables (like LimeSurvey's current approach). There were no significant performance limitations.
Tags	No tags attached.
Attached Files	LS-EAV-DataModel.png (490,551 bytes)

Bug heat	24
Story point estimate	20
Users affected %	10

User List	There are no users monitoring this issue.

TMSWhite 2012-01-16 16:45 reporter ~16802	Now that EM integration is completed in 1.92 and Yii, this would be a relatively easy addition. EM already controls collecting all of the values that need to be updated, and keeps track of which questions were asked (so it can validate them upon page turn or submission), so it could be extended to support a vertical, Entity-Attribute-Value-type database design. Here are the steps in the task.
(A) Admin menu (General, Notification & data management): add a $DatabaseStyle option, with choices {horizontal, vertical-EAV, both}.
(B) On activation, always create the survey_SID table with the core columns: id, submitdate, lastpage, startlanguage, token, datestamp, startdate.
(C) If $DatabaseStyle is horizontal or both, add the SGQA columns.
(D) If $DatabaseStyle is vertical-EAV or both, create tables similar to those described in the LS-EAV-DataModel.png attachment. LS doesn't need as many columns, but in general, here is the idea. We can change the names, but this data model is known to work:
    (1) data_elements stores one row per SGQA code per survey instance. It keeps track of the current value for all data. At survey start time, X rows are added to data_elements, one per SGQA code, all NULL.
    (2) item_usages stores data about questions asked during the survey. At survey start, it has no content. Each time a page is generated, LS would insert rows into item_usages, one row per SGQA on the generated page. Then, if the subject answers the questions, the values are updated when responses are submitted prior to navigating to the next page. This table includes important metadata, including: (a) starting answer values and codes, (b) finishing answer values and codes (so you can detect that someone has changed an answer), (c) questionAsAsked, the fully tailored question in the current language, which helps you debug cases where you didn't ask what you expected, (d) time, and (e) visit_num, so you know how many times this question was asked.
    (3) page_usages stores usage timing data, such as (a) how long the server took to process requests and generate the next page, (b) how long the page took to render, (c) network latency, and (d) how much time the user spent on the page.
(E) Modify the EM data-saving functions to save into either or both of the $DatabaseStyle formats. With each page flip: (1) update the survey_SID table so that $lastpage is known, (2) update changed values in the data_elements table, (3) insert values into the item_usages table, (4) insert a row into the page_usages table.
(F) Modify database export to transpose the data_elements table to horizontal format for export of SPSS, R, etc. data.
(G) Optionally add performance monitoring reports that show how survey performance changes with load or time.
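A minimal sketch of steps (B), (D)(1), (E)(1)-(2) and (F) above, using SQLite through Python purely as a stand-in for LimeSurvey's real database layer. The per-survey table names, the pared-down data_elements columns (response_id, sgqa, value) and all function names are assumptions made for illustration; item_usages and page_usages are left out to keep it short.

```python
import sqlite3

def create_vertical_schema(conn, sid):
    """Step (B) + (D)(1): core survey table plus a pared-down data_elements table."""
    sid = int(sid)  # table names cannot be bound as parameters, so force an int
    conn.executescript(f"""
        CREATE TABLE survey_{sid} (
            id INTEGER PRIMARY KEY,
            submitdate TEXT, lastpage INTEGER, startlanguage TEXT,
            token TEXT, datestamp TEXT, startdate TEXT
        );
        CREATE TABLE data_elements_{sid} (
            response_id INTEGER NOT NULL REFERENCES survey_{sid}(id),
            sgqa TEXT NOT NULL,
            value TEXT,
            PRIMARY KEY (response_id, sgqa)
        );
    """)

def start_response(conn, sid, sgqa_codes, language="en"):
    """At survey start, add one NULL row per SGQA code."""
    sid = int(sid)
    cur = conn.execute(
        f"INSERT INTO survey_{sid} (startlanguage, lastpage) VALUES (?, 0)",
        (language,),
    )
    rid = cur.lastrowid
    conn.executemany(
        f"INSERT INTO data_elements_{sid} (response_id, sgqa, value) VALUES (?, ?, NULL)",
        [(rid, code) for code in sgqa_codes],
    )
    return rid

def save_page(conn, sid, rid, page, changed):
    """Step (E)(1)-(2): record the last page and update only the changed values."""
    sid = int(sid)
    conn.execute(f"UPDATE survey_{sid} SET lastpage = ? WHERE id = ?", (page, rid))
    conn.executemany(
        f"UPDATE data_elements_{sid} SET value = ? WHERE response_id = ? AND sgqa = ?",
        [(value, rid, code) for code, value in changed.items()],
    )

def export_horizontal(conn, sid, sgqa_codes):
    """Step (F): transpose data_elements into one dict per response."""
    sid = int(sid)
    rows = {}
    query = f"SELECT response_id, sgqa, value FROM data_elements_{sid} ORDER BY response_id"
    for rid, sgqa, value in conn.execute(query):
        rows.setdefault(rid, dict.fromkeys(sgqa_codes))[sgqa] = value
    return [{"id": rid, **cols} for rid, cols in rows.items()]
```

The export step returns plain dictionaries with one key per SGQA code, which is the shape a horizontal (one column per question) export would need.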

tacman1123 2012-03-29 20:45 reporter ~18123	"At survey-start-time, X rows are added to data_elements, one per SGQA code, all NULL". I don't think that's necessary. If a row is missing, it means it hasn't been answered yet. I'd like to propose that we add a group_sequence to data_elements. That would allow for repeatable groups or subquestions. In the flat/horizontal database structure, you'd have to have a column for up to the maximum number of items, but you wouldn't have that restriction in the vertical structure. So your data might look like this (skipping survey_SID and timing data):
child|1|name|Peter
child|1|gender|M
child|2|name|Paul
child|2|gender|M
child|3|name|Mary
child|3|gender|F
The group_code (or prefix) would be "child"; it would contain two questions (name and gender) and be tagged as repeatable. In the UI, you'd see the question in a group, but instead of "Next>", you'd see "Add Another>", which would repeat that group (and update the internal sequence). The point here is to NOT generate fields like child1_age, child2_age, which then require parsing out field names to map back to anything. Certainly you'd have the option of exporting data as such, but internally that's a very difficult way to keep the repeatable data.
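To make the group_sequence idea concrete, here is a small sketch of how vertical rows carrying a group code and a sequence number could be folded back into one record per repetition without parsing field names. The DataElement tuple and collect_group function are hypothetical names, not existing LimeSurvey code.

```python
from collections import defaultdict
from typing import NamedTuple

class DataElement(NamedTuple):
    group_code: str       # e.g. "child"
    group_sequence: int   # which repetition of the group this row belongs to
    question_code: str    # e.g. "name" or "gender"
    value: str

def collect_group(rows, group_code):
    """Return one dict per repetition of group_code, ordered by sequence."""
    instances = defaultdict(dict)
    for row in rows:
        if row.group_code == group_code:
            instances[row.group_sequence][row.question_code] = row.value
    return [instances[seq] for seq in sorted(instances)]

rows = [
    DataElement("child", 1, "name", "Peter"),
    DataElement("child", 1, "gender", "M"),
    DataElement("child", 2, "name", "Paul"),
    DataElement("child", 2, "gender", "M"),
    DataElement("child", 3, "name", "Mary"),
    DataElement("child", 3, "gender", "F"),
]
children = collect_group(rows, "child")
```

Because the repetition number is its own field, adding a fourth child only adds rows; no new columns or field names are needed.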

TMSWhite 2012-03-29 22:05 reporter ~18124	Tac- The reason for "At survey-start-time, X rows are added to data_elements, one per SGQA code, all NULL" is so that we can easily get the current value for all data elements. That way, there is a single SQL INSERT statement when the survey starts, and everything else is an UPDATE. However, as long as we do a good job of maintaining state, LS should know which variables have not been set, so it could do INSERTs for those and UPDATEs for the rest. That would be essential to support your proposal for repeating groups. Since EM needs to know the names of all declared variables, we might need a different naming syntax for such variables. As you know, we already use a dot notation for question attributes. Perhaps we could use a tilde syntax for repeating groups, like <group>~<repeat#>~<varname>.<suffix> (for example, child~1~name.shown), but this will probably require more thought.
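As a rough illustration of the proposed tilde syntax, the sketch below splits a name of the form <group>~<repeat#>~<varname>.<suffix> into its parts. The syntax itself is only a proposal in this thread, and the VarRef type, the parse_var function and the choice to make the suffix optional are assumptions for the example.

```python
import re
from typing import NamedTuple, Optional

class VarRef(NamedTuple):
    group: str
    repeat: int
    varname: str
    suffix: Optional[str]  # dot-notation attribute such as "shown", if present

# <group>~<repeat#>~<varname> with an optional .<suffix>
_PATTERN = re.compile(
    r"^(?P<group>\w+)~(?P<repeat>\d+)~(?P<varname>\w+)(?:\.(?P<suffix>\w+))?$"
)

def parse_var(name):
    """Parse a repeating-group variable name, or raise ValueError."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a repeating-group variable: {name!r}")
    return VarRef(
        match["group"], int(match["repeat"]), match["varname"], match["suffix"]
    )
```

For example, parse_var("child~1~name.shown") yields the group "child", repetition 1, variable "name" and suffix "shown".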

tacman1123 2012-03-29 22:28 reporter ~18125	I think the idea of a group code is key in that setup. There will be other uses for that code as well, in EM. I think EM should also handle relative rules, so that within a repeatable group the rule doesn't need to spell out the number. For example, the condition for child~~had_hpv_vaccine is child~~gender == 'F', and the condition for child~~hpv_vaccine_date is child~~had_hpv_vaccine == 'Y'. These conditions would be relative to the group they're already in (so the child~~ prefix isn't strictly necessary, but something would be needed to say "look up a question from within this group with my same sequence number"). In MySQL, you can do a single insert/update with INSERT ... ON DUPLICATE KEY UPDATE (http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html); I'm not sure how that would translate to the other databases.
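On the question of other databases: SQLite (3.24 and later) and PostgreSQL (9.5 and later) offer a comparable INSERT ... ON CONFLICT ... DO UPDATE clause, so a single insert-or-update statement is not limited to MySQL. The sketch below shows the pattern with Python's built-in sqlite3 module; it assumes the bundled SQLite is at least 3.24, and the simplified data_elements table and save_value function are hypothetical.

```python
import sqlite3

# Insert a new answer, or overwrite the existing one for the same key.
UPSERT = """
INSERT INTO data_elements (response_id, sgqa, value)
VALUES (?, ?, ?)
ON CONFLICT (response_id, sgqa) DO UPDATE SET value = excluded.value
"""

def save_value(conn, response_id, sgqa, value):
    """Store one answer without checking first whether a row exists."""
    conn.execute(UPSERT, (response_id, sgqa, value))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_elements ("
    " response_id INTEGER, sgqa TEXT, value TEXT,"
    " PRIMARY KEY (response_id, sgqa))"
)
save_value(conn, 1, "123X1X1", "M")  # first call inserts the row
save_value(conn, 1, "123X1X1", "F")  # second call updates it in place
```

With this pattern there is no need to pre-populate NULL rows at survey start, since a missing row and an existing row are handled by the same statement.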

TMSWhite 2012-03-29 22:36 reporter ~18126	Other things people might want from such repeating groups include: (1) comparison of current group field to prior group - e.g. if you're supposed to enter ages of children in descending order (2) functions across all entries for a variable in a group - e.g. if your group has you list products you purchased and their cost, you might want an easy way to sum the costs, get the max or min value of them, etc. This sounds like it might be an extension of the "this" functionality proposed elsewhere. Another option is to let EM support array processing, but that is a much bigger lift.
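Both kinds of request can be sketched once the repetitions of a group are available as a list of records. This is only an illustration in Python of the intended behaviour, not EM syntax; the column and is_descending helpers and the sample data are made up.

```python
def column(instances, varname):
    """Collect one variable across all repetitions, skipping unanswered ones."""
    return [inst[varname] for inst in instances if inst.get(varname) is not None]

def is_descending(values):
    """True when each value is less than or equal to the one before it."""
    return all(a >= b for a, b in zip(values, values[1:]))

# (2) Functions across all entries for a variable in a group.
purchases = [
    {"product": "pen", "cost": 2.5},
    {"product": "book", "cost": 12.0},
    {"product": "bag", "cost": 30.0},
]
costs = column(purchases, "cost")
total, highest, lowest = sum(costs), max(costs), min(costs)

# (1) Comparison of the current entry to the prior one.
ages = column([{"age": 12}, {"age": 9}, {"age": 9}, {"age": 4}], "age")
ages_ok = is_descending(ages)
```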

tacman1123 2012-03-29 22:46 reporter ~18127	Brainstorming: I wonder if we'd be able to have a "grid group", which would have an interface similar to labels/subquestions (code, assessment value, label text), but for any group with questions that could fit in a single line. So your repeatable group would be displayed as a grid, with + - up/down icons, e.g.
Name | Age | Status
[ ] | [ ] | ()In School ()Not in School ()Home Schooled    + - ^ v
The radio could also be a dropdown, the age would be a number, etc. Right now, the grids are either all numeric or all text, with no validation, which is rarely what we need. The alternative is to set up a group and repeat it, but that leads to a lot of questions. The horizontal layout of a group would even allow us to do conditional questions, disabling some columns if they're not relevant.

ollehar 2023-02-03 18:24 administrator ~73662	A schemaless database might be a better approach here. But the first step would be to add an abstraction layer between EM and the storage itself, I think. Also note that this is a complex task that affects few users of the software. Only the advanced users hit the limit.
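One way to picture such an abstraction layer is a small storage interface that the saving code calls without knowing which layout sits behind it. The sketch below uses in-memory Python classes only to show the shape of the idea; ResponseStorage, HorizontalStorage, VerticalStorage and save_page are hypothetical names and do not correspond to existing LimeSurvey classes.

```python
from abc import ABC, abstractmethod

class ResponseStorage(ABC):
    """What the saving code needs from any storage backend."""

    @abstractmethod
    def save(self, response_id, values):
        """Persist a dict of {sgqa: value} for one response."""

    @abstractmethod
    def load(self, response_id):
        """Return all stored values for one response as {sgqa: value}."""

class HorizontalStorage(ResponseStorage):
    """One record per response, one key per SGQA code (like table columns)."""

    def __init__(self):
        self._rows = {}

    def save(self, response_id, values):
        self._rows.setdefault(response_id, {}).update(values)

    def load(self, response_id):
        return dict(self._rows.get(response_id, {}))

class VerticalStorage(ResponseStorage):
    """One (response_id, sgqa) entry per answer (EAV style)."""

    def __init__(self):
        self._cells = {}

    def save(self, response_id, values):
        for sgqa, value in values.items():
            self._cells[(response_id, sgqa)] = value

    def load(self, response_id):
        return {
            sgqa: value
            for (rid, sgqa), value in self._cells.items()
            if rid == response_id
        }

def save_page(storage, response_id, changed):
    """Caller-side code depends only on the interface, not on the layout."""
    storage.save(response_id, changed)
```

A schemaless or document-store backend would then be one more implementation of the same interface, with the calling code unchanged.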

Date Modified	Username	Field	Change
2011-06-19 22:17	TMSWhite	New Issue
2012-01-16 16:45	TMSWhite	Note Added: 16802
2012-01-16 16:46	TMSWhite	File Added: LS-EAV-DataModel.png
2012-01-16 17:22	TMSWhite	Project	@2@ => Development
2012-03-29 20:45	tacman1123	Note Added: 18123
2012-03-29 22:05	TMSWhite	Note Added: 18124
2012-03-29 22:28	tacman1123	Note Added: 18125
2012-03-29 22:36	TMSWhite	Note Added: 18126
2012-03-29 22:46	tacman1123	Note Added: 18127
2021-03-07 20:39	c_schmitz	Project	Development => Feature requests
2021-03-07 20:39	c_schmitz	Category	Other => Other issues
2021-03-07 20:40	c_schmitz	Severity	@50@ => feature
2021-03-07 20:40	c_schmitz	Category	Other issues => _ Unknown
2021-03-07 20:40	c_schmitz	Product Version	1.91 =>
2021-03-07 20:40	c_schmitz	Summary	Support more than 1000 questions+answers (database columns) in a survey => Redesign response storage to support more questions+answers (database columns) in a survey
2021-03-07 20:43	c_schmitz	Relationship added	has duplicate 09620
2021-03-07 20:43	c_schmitz	Assigned To	=> c_schmitz
2021-03-07 20:43	c_schmitz	Status	new => acknowledged
2021-03-07 20:43	c_schmitz	Relationship added	related to 07411
2021-03-07 21:08	c_schmitz	Relationship added	related to 15021
2021-03-07 21:10	c_schmitz	Relationship added	related to 07593
2021-03-07 21:11	c_schmitz	Relationship added	has duplicate 07745
2023-02-03 18:23	ollehar	Story point estimate	=> 20
2023-02-03 18:23	ollehar	Users affected %	=> 10
2023-02-03 18:24	ollehar	Note Added: 73662
2023-02-03 18:24	ollehar	Bug heat	22 => 24

Relationships

has duplicate	09620	closed	c_schmitz	proposed fix to the 1600 column issue
has duplicate	07745	closed	c_schmitz	Increase database limit by joining tables
related to	07411	acknowledged		Warn survey authors about survey size
related to	15021	closed	c_schmitz	Use VIEW to cirumvent the database column limit
related to	07593	closed	c_schmitz	Increase the allowed length of the answer code