Slow access to data

Jul 25, 2013 at 5:13 PM
First of all thanks for creating this tool!

I am trying to access an .xlsx file with three sheets, each about 4400x50 cells. I would like to load all the data into memory and then loop over the resulting tables and select what is needed. Therefore I do:
private DataSet m_ds = new DataSet();
...

FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read);
IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(fs);
excelReader.IsFirstRowAsColumnNames = true;
m_ds = excelReader.AsDataSet();
fs.Close();
...
foreach (DataTable dt in m_ds.Tables)
{
   if (dt.TableName == sheet)
   {
      foreach (DataRow r in dt.Rows)
      {
         string str = Convert.ToString(r["Date"]);
         // access some other columns...
      }
   }
}
I measured the times required by the different operations and concluded that opening the file is quite fast (about 2 s), but looping over the DataTables is quite slow (about 15 s). This is a problem for my application. I am a bit surprised, since I expected I/O to be slower than accessing objects already in memory... or I have completely misunderstood how the DataSet object works. Can anyone explain this behavior, or suggest how I could improve my code?
Thanks in advance,
Valentin

PS. I am new to C# so comments regarding my implementation are welcome.
Jul 25, 2013 at 6:51 PM
Opening would include the "m_ds = excelReader.AsDataSet()" line. I ran a test opening three fairly large sheets and loading them into data grids:

13 seconds to load the sheets into the DataSet.
0.003 seconds to load each table into a data grid.

I believe the I/O occurs in the "m_ds =" line. I don't see any timing in your code that separates the load time from the DataSet access time; the file-open I/O itself is minimal.

I suggest you put Stopwatch timers around the section from file open to file close, and around the table access.
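Something along these lines, as a sketch (the `file` and `sheet` variables and the "Date" column are taken from your snippet):

// Time the load separately from the in-memory access.
Stopwatch swLoad = Stopwatch.StartNew();
DataSet ds;
using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read))
using (IExcelDataReader reader = ExcelReaderFactory.CreateOpenXmlReader(fs))
{
    reader.IsFirstRowAsColumnNames = true;
    ds = reader.AsDataSet();   // the actual parsing/I/O happens here
}
swLoad.Stop();
Console.WriteLine("Load:   " + swLoad.Elapsed);

Stopwatch swAccess = Stopwatch.StartNew();
foreach (DataRow r in ds.Tables[sheet].Rows)
{
    string str = Convert.ToString(r["Date"]);   // pure in-memory access
}
swAccess.Stop();
Console.WriteLine("Access: " + swAccess.Elapsed);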

Let me know what you find.
Jul 26, 2013 at 10:05 AM
Edited Jul 26, 2013 at 10:05 AM
I am sorry, I didn't include the timers in the piece of code posted above, but I have already done that. Here are more exact results:
FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read);
IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(fs);
excelReader.IsFirstRowAsColumnNames = true;
m_ds = excelReader.AsDataSet();
fs.Close();
This takes about 2 s. If I understand correctly, this is where the I/O takes place. If so, I think this is excellent.
foreach (DataRow r in dt.Rows)
{
    string str = Convert.ToString(r["Date"]);
    DateTime d;
    double v;
    try
    {
        if (str.Contains('.'))
        {
            d = DateTime.Parse(ControlExcel.ConvertToDateTime(str), provider);
        }
        else
        {
            d = DateTime.Parse(ControlExcel.ConvertToDateTime(str));
        }
        v = Convert.ToDouble(r[p.Key]);
    }
    catch (Exception)
    {
        // skip rows whose date or value fails to parse
        continue;
    }
    dat.Add(d);
    val.Add(v);
}
This takes about 13 s.

My first thought was that access to the DataTable container is slow, so I tried some alternatives: copying the contents of the columns I am interested in into List or System.Array objects and then looping over those containers instead. But the time was exactly the same.
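A sketch of the kind of copy I mean (assuming `dt` is the sheet's DataTable):

List<string> dates = new List<string>(dt.Rows.Count);
foreach (DataRow r in dt.Rows)
{
    dates.Add(Convert.ToString(r["Date"]));   // pull the column out of the DataTable
}
// ...then loop over `dates` instead of dt.Rows -- same time in my tests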

As a reference, I am comparing to what I observe when using the Microsoft.Office.Interop functionality. In that case the file access time is comparable, but I only read the columns I am interested in rather than the whole workbook (unlike with the ExcelDataReader --> this might be a big advantage!). I get the column data into System.Array objects, and then looping over these containers is very fast (about 2 seconds), with the same code as in the foreach loop above.
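The Interop pattern I am referring to looks roughly like this (a sketch; the range bounds are placeholders):

using Interop = Microsoft.Office.Interop.Excel;

Interop.Application app = new Interop.Application();
Interop.Workbook wb = app.Workbooks.Open(file);
Interop.Worksheet ws = (Interop.Worksheet)wb.Sheets[sheet];

// Value2 on a multi-cell Range comes back as a 1-based object[,],
// so a single COM call pulls an entire column into memory.
object[,] dates = (object[,])ws.Range["A2", "A4401"].Value2;   // hypothetical bounds

wb.Close(false);
app.Quit();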
Jul 26, 2013 at 3:09 PM
I think you should pull everything out of your loop except the string str = Convert.ToString(r["Date"]); line, and do something with the str value just to make sure it doesn't get optimized out. I like keeping the Stopwatch calls outside the loop so they don't add any delay inside it.
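For example, a stripped-down timing loop might look like this (a sketch; `dt` is the table you are timing):

Stopwatch sw = Stopwatch.StartNew();
int nonEmpty = 0;
foreach (DataRow r in dt.Rows)
{
    string str = Convert.ToString(r["Date"]);
    if (str.Length > 0) nonEmpty++;   // use str so the loop body isn't optimized away
}
sw.Stop();
Console.WriteLine("Loop: " + sw.Elapsed + " (" + nonEmpty + " non-empty)");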

I just have a list box on a form that displays the times, and I create the grid views dynamically for each table. It's possible I messed up the timing code, but I don't see anything like the times you're seeing. Here is the code I used:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.IO;
using Excel;
using System.Diagnostics;

namespace ExcelParserFormsTest
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            LoadSheets();
        }

        private void LoadSheets()
        {
            Stopwatch swLoad = new Stopwatch();
            Stopwatch swTot = new Stopwatch();
            Stopwatch swTbl = new Stopwatch();

            //string _PathFilename = @"C:\Temp\test3.xlsx";
            string _PathFilename = @"C:\Temp\tri_201918.xlsx";
            swTot.Start();
            using (FileStream streamIn = File.Open(_PathFilename, FileMode.Open, FileAccess.Read))
            using (IExcelDataReader excelReader = (Path.GetExtension(_PathFilename) == ".xlsx"
                ? ExcelReaderFactory.CreateOpenXmlReader(streamIn)
                : ExcelReaderFactory.CreateBinaryReader(streamIn)))
            {
                swLoad.Start();
                DataSet ds = excelReader.AsDataSet();
                swLoad.Stop();
                lbTimes.Items.Add("Load " + swLoad.Elapsed.ToString());

                if (ds.Tables.Count < 1)
                {
                    MessageBox.Show("  No sheets in workbook.");
                }
                else
                {
                    int x = 0, y = 0;
                    foreach (DataTable sheet in ds.Tables)
                    {
                        lbTimes.Items.Add("Rows " + sheet.Rows.Count.ToString());
                        lbTimes.Items.Add("Cols " + sheet.Columns.Count.ToString());
                        swTbl.Restart();   // Restart, not Start, so each table is timed separately
                        DataGridView dgv = new DataGridView();
                        dgv.Location = new Point(x += 100, y += 100);
                        dgv.DataSource = sheet;
                        this.Controls.Add(dgv);
                        swTbl.Stop();
                        lbTimes.Items.Add("Table " + swTbl.Elapsed.ToString());
                    }
                }
            }
            swTot.Stop();
            lbTimes.Items.Add("Tot " + swTot.Elapsed.ToString());
        }
    }
}
Jul 29, 2013 at 2:30 PM
OK, I figured out what was slowing things down: it was the part that parses the date/time string into a DateTime object.
Using the Microsoft.Office.Interop functionality, the value of the cell converted to a string gives a serial "number" (e.g., "35432"), while the ExcelDataReader returns a date string (e.g., "1/2/1997 12:00:00 AM"). In the latter case I had to change my code a little, and the foreach loop is again quite fast, around 2 s, which is OK for my application.
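In case anyone hits the same thing, a minimal sketch of handling both representations (the `ParseCellDate` helper is hypothetical and assumes invariant-culture strings; it needs using System.Globalization;):

static DateTime ParseCellDate(string str)
{
    double serial;
    // Interop gives the raw OLE Automation date serial, e.g. "35432"
    if (double.TryParse(str, NumberStyles.Float, CultureInfo.InvariantCulture, out serial))
        return DateTime.FromOADate(serial);

    // ExcelDataReader gives a formatted date string, e.g. "1/2/1997 12:00:00 AM"
    return DateTime.Parse(str, CultureInfo.InvariantCulture);
}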
Thanks for the discussion!
Regards,
Valentin