View Issue Details

This bug affects 1 person(s).
ID: 04006
Project: Bug reports
Category: Other
View Status: public
Last Update: 2010-02-11 17:32
Reporter: erick
Assigned To: user1548
Priority: normal
Severity: minor
Status: closed
Resolution: fixed
Product Version: 1.85+
Fixed in Version: 1.87+
Summary: 04006: Possible inaccurate data when exporting to R
Description

When exporting answers to R format, the script generated by LimeSurvey may produce inaccurate data because it converts factors to numeric values with the function as.numeric(). Applied to a factor, this function returns the indices of the factor's levels, not the original values.
This can be corrected by reading the CSV answers file with the option stringsAsFactors = FALSE, so that the data are read as character strings; as.numeric() then converts them to numbers without this problem.
I have made a patch for the PHP file responsible for exporting R data, which also incorporates two other minor changes:

  • maximum size for text fields changed from 255 to 25500
  • at the end of the script, instead of printing str(data), which is irrelevant when running in batch mode, I remove the temporary variable v.names.
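
The factor conversion problem described above can be reproduced in a few lines of R. This is only a minimal sketch with made-up values (they are not taken from the attached export); the factor is built explicitly so that the result does not depend on the default value of stringsAsFactors in a given R version.

```r
# Hypothetical answer codes, stored as a factor the way a text column
# is stored when strings are converted to factors on import.
answers <- factor(c("10", "2", "10", "7"))

# Levels are the sorted unique strings: "10", "2", "7".
levels(answers)

# as.numeric() on a factor returns the level indices, not the values.
wrong <- as.numeric(answers)

# Going through character (which is what as.numeric() receives when the
# column was read with stringsAsFactors = FALSE) recovers the values.
right <- as.numeric(as.character(answers))

print(wrong)  # 1 2 1 3
print(right)  # 10 2 10 7
```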
Tags: No tags attached.
Attached Files
export_data_r.php.patch (1,234 bytes)   
--- export_data_r.php.~1~	2009-07-21 11:30:33.000000000 -0300
+++ export_data_r.php	2009-12-16 21:50:06.000000000 -0200
@@ -30,7 +30,7 @@
  * Optimization opportunities remain in the VALUE LABELS section, which runs a query / column
  */
 
-$length_varlabel = '255'; // Set the max text length of Variable Labels
+$length_varlabel = '25500'; // Set the max text length of Variable Labels
 $headerComment = '';
 $tempFile = '';
 
@@ -160,7 +160,7 @@
 	 * be sent to the client.
 	 */
 	echo $headerComment;
-	echo "data=read.table(\"survey_".$surveyid."_data_file.csv\", sep=\",\", quote = \"'\", na.strings=\"\")\n names(data)=paste(\"V\",1:dim(data)[2],sep=\"\")\n";
+	echo "data=read.table(\"survey_".$surveyid."_data_file.csv\", sep=\",\", quote = \"'\", na.strings=\"\")\n names(data)=paste(\"V\",1:dim(data)[2],sep=\"\", stringAsFactors=FALSE)\n";
 	foreach ($fields as $field){
 		if($field['SPSStype'] == 'DATETIME23.2') $field['size']='';
 		if($field['LStype'] == 'N' || $field['LStype']=='K') {
@@ -236,7 +236,7 @@
 			}
 		}
 	}
-	echo "NA); names(data)= v.names[-length(v.names)]\nprint(str(data))\n";
+	echo "NA); names(data)= v.names[-length(v.names)]\nrm(v.names)\n";
 	echo $errors;
 	exit;
 }
Surveydata_syntax.R (705 bytes)
export_data_r2.php.patch (1,235 bytes)   
--- export_data_r.php.~1~	2009-07-21 11:30:33.000000000 -0300
+++ export_data_r.php	2009-12-21 16:51:51.000000000 -0200
@@ -30,7 +30,7 @@
  * Optimization opportunities remain in the VALUE LABELS section, which runs a query / column
  */
 
-$length_varlabel = '255'; // Set the max text length of Variable Labels
+$length_varlabel = '25500'; // Set the max text length of Variable Labels
 $headerComment = '';
 $tempFile = '';
 
@@ -160,7 +160,7 @@
 	 * be sent to the client.
 	 */
 	echo $headerComment;
-	echo "data=read.table(\"survey_".$surveyid."_data_file.csv\", sep=\",\", quote = \"'\", na.strings=\"\")\n names(data)=paste(\"V\",1:dim(data)[2],sep=\"\")\n";
+	echo "data=read.table(\"survey_".$surveyid."_data_file.csv\", sep=\",\", quote = \"'\", na.strings=\"\", stringsAsFactors=FALSE)\n names(data)=paste(\"V\",1:dim(data)[2],sep=\"\")\n";
 	foreach ($fields as $field){
 		if($field['SPSStype'] == 'DATETIME23.2') $field['size']='';
 		if($field['LStype'] == 'N' || $field['LStype']=='K') {
@@ -236,7 +236,7 @@
 			}
 		}
 	}
-	echo "NA); names(data)= v.names[-length(v.names)]\nprint(str(data))\n";
+	echo "NA); names(data)= v.names[-length(v.names)]\nrm(v.names)\n";
 	echo $errors;
 	exit;
 }
Bug heat: 8
Complete LimeSurvey version number (& build): 7191
I will donate to the project if issue is resolved:
Browser:
Database type & version: 138
Server OS (if known): Linux Debian
Webserver software & version (if known): Apache
PHP Version: 5.2.6

Activities

user372

2009-12-18 01:13

  ~10599

@ mdekker: what do you think about that issue?

mdekker

2009-12-18 09:10

reporter   ~10602

Hey Livio,

As you did the R export and this only involves R code I'll leave this one to you. If you are okay with the changes please commit the patch and close the report.

mdekker

2009-12-21 09:59

reporter   ~10615

Hey Livio,

I will commit the patch, no problem.

mdekker

2009-12-21 11:13

reporter   ~10616

Not committing (yet) as the stringAsFactors line seems to be giving trouble: all set factors are appended at the end of the dataframe.

mdekker

2009-12-21 11:15

reporter   ~10617

@erick, can you please provide an export where you found the data was inaccurate? Please make the export as simple (small) as possible.

erick

2009-12-21 22:17

reporter   ~10633

I've uploaded an R script and a corresponding csv file. Note that the variable "Q1" will be read incorrectly, after converting a factor with levels ("", 4, 1).

The previously uploaded patch is incorrect. The option stringsAsFactors = FALSE was given to the wrong function. I've also uploaded a corrected version for the patch. Sorry for the confusion.
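
A short sketch of what happens when the option is handed to paste() instead of read.table(), as in the first patch. Because paste() collects extra arguments through `...`, the unknown argument is not rejected but pasted into every name. The three-column width is an arbitrary choice for illustration, and whether this fully explains the behaviour seen with the first patch is an assumption.

```r
# paste() accepts any number of vectors through `...`, so an argument
# it does not recognise is pasted into the result instead of rejected.
# The misspelled argument name is copied from the first patch.
bad_names <- paste("V", 1:3, sep = "", stringAsFactors = FALSE)

# Without the stray argument the intended column names are produced.
good_names <- paste("V", 1:3, sep = "")

print(bad_names)   # "V1FALSE" "V2FALSE" "V3FALSE"
print(good_names)  # "V1" "V2" "V3"
```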

mdekker

2009-12-22 09:53

reporter   ~10635

Now I see the problem :) The first patch had stringAsFactors, the second has stringsAsFactors (so stringS). That's why moving the code to the correct part didn't work for me :)

I'll try again with this change on my problem dataset. Thanks for clarifying and attaching the example!

mdekker

2009-12-22 10:05

reporter   ~10636

It seems to work ok now, but I get warnings for all the missings. Do you have a clue how to fix that?

In eval.with.vis(expr, envir, enclos) : NAs introduced by coercion

I think the missings are incorrectly exported as "" and should be just empty. I fixed it in my install and it seems to work perfectly. No more warnings and correct NA (as far as a quick scan shows me).

mdekker

2009-12-22 11:09

reporter   ~10639

Coming back to this, the stringsAsFactors option doesn't seem to do much. The change for the missings seems to be the important change. In R we always get a factor with numbers from 1 to x and then the value labels. The original answer value is not stored! This is something to think about.

read.spss does the same thing when I read my spss file to a dataframe and R help gives a warning about not using the values from the vector. If erick or livio can come up with a solution for this we can see how to implement this.

So take the question "How many children do you have?" with these answer codes:
1 - None
3 - 1 child
5 - 2 children

In SPSS this would become the values 1, 3, 5 with the corresponding value labels, but in R it would become a factor with values 1, 2, 3 and the labels. If you want to calculate a score based on the actual answer value, this factor approach doesn't seem to be the best one. I don't know if there is a datatype better fit for the job, or whether using just a number with some extra attributes can be of any help.

I am committing the patch including the stringsAsFactors change and leaving the topic open to solve the answer value vs. answer label problem.
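
The answer value vs. answer label problem from the children example can be shown directly in R. This is a hedged sketch, not the code LimeSurvey generates: the response vector is made up, and mapping the factor back through the code vector is just one possible workaround.

```r
# Answer codes and labels from the example question above.
codes  <- c(1, 3, 5)
labels <- c("None", "1 child", "2 children")

# Hypothetical responses, stored as the original answer codes.
responses <- c(3, 1, 5, 3)

# A factor keeps the labels, but its internal codes run from 1 to n.
as_factor <- factor(responses, levels = codes, labels = labels)
internal  <- as.integer(as_factor)

# One possible workaround: map the internal codes back through the
# original code vector (or simply keep `responses` as a numeric column).
recovered <- codes[as.integer(as_factor)]

print(internal)   # 2 1 3 2
print(recovered)  # 3 1 5 3
```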

user1548

2009-12-22 20:44

  ~10642

Hi, I committed my proposal. I just changed the read.csv line. Unfortunately I don't have the time to check whether it works now.
Anyway, the resulting code should be:
na.strings=c("","\"\""),
We can also treat a blank field as NA, e.g.
na.strings=c("","\"\""," "),

Now those data are missing data.
Menno, if you have some time to check, it would be great.
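
The effect of the extended na.strings can be tried on a tiny inline data set. This is a sketch under assumptions: the two rows are invented, and the read.table() options mirror the ones in the patch (sep, quote, stringsAsFactors) rather than the exact line that was committed.

```r
# Two rows of hypothetical export data; the second field of row 1 is
# the two-character string "" that stands for a missing answer.
csv_text <- "1,\"\"\n2,3\n"

# With quote = "'" the double quotes are ordinary characters, so the
# field is kept as a literal string and the column stays character.
raw <- read.table(text = csv_text, sep = ",", quote = "'",
                  na.strings = "", stringsAsFactors = FALSE)

# Listing the literal "" in na.strings turns it into NA while reading,
# so the column can be read as numbers.
fixed <- read.table(text = csv_text, sep = ",", quote = "'",
                    na.strings = c("", "\"\""), stringsAsFactors = FALSE)

print(raw$V2)    # character column holding the literal "" and "3"
print(fixed$V2)  # NA 3
```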

mdekker

2009-12-23 12:27

reporter   ~10658

I fixed a problem with the \ you needed one more :) Seems to work ok for me now.

user1548

2009-12-23 12:45

  ~10659

nice Menno thanks again!!

Issue History

Date Modified Username Field Change
2009-12-17 22:39 erick New Issue
2009-12-17 22:39 erick Status new => assigned
2009-12-17 22:39 erick Assigned To => user372
2009-12-17 22:39 erick File Added: export_data_r.php.patch
2009-12-17 22:39 erick LimeSurvey build number => 7191
2009-12-17 22:39 erick Database & DB-Version => 138
2009-12-17 22:39 erick Operating System (Server) => Linux Debian
2009-12-17 22:39 erick Webserver => Apache
2009-12-17 22:39 erick PHP Version => 5.2.6
2009-12-18 01:13 user372 Assigned To user372 => mdekker
2009-12-18 01:13 user372 Note Added: 10599
2009-12-18 09:10 mdekker Note Added: 10602
2009-12-18 09:10 mdekker Assigned To mdekker => user1548
2009-12-18 09:10 mdekker Status assigned => feedback
2009-12-21 09:59 mdekker Note Added: 10615
2009-12-21 11:13 mdekker Note Added: 10616
2009-12-21 11:15 mdekker Note Added: 10617
2009-12-21 22:04 erick File Added: Surveydata_syntax.R
2009-12-21 22:04 erick File Added: survey_99164_data_file.csv
2009-12-21 22:17 erick Note Added: 10633
2009-12-21 22:18 erick File Added: export_data_r2.php.patch
2009-12-22 09:53 mdekker Note Added: 10635
2009-12-22 10:05 mdekker Note Added: 10636
2009-12-22 11:09 mdekker Note Added: 10639
2009-12-22 20:44 user1548 Note Added: 10642
2009-12-23 12:27 mdekker Note Added: 10658
2009-12-23 12:45 user1548 Note Added: 10659
2010-02-11 17:32 c_schmitz Status feedback => closed
2010-02-11 17:32 c_schmitz Resolution open => fixed
2010-02-11 17:32 c_schmitz Fixed in Version => 1.87+
2010-05-06 10:27 c_schmitz Category Import / Export => (No Category)