A little bit of C#… – Extracting correct table structure from a Word document

After a summer break (or, I should say, a busy summer) I decided to blog about a problem that I ran into. It seems to bother many people, yet I couldn’t find a nice implementation that solves it.

Recently I had a task to extract data from Word documents, and I thought: “Easy one… just use the API that Microsoft provides and it will all be done in no time”. I was naïve to believe that the API would be free of bugs (some of them reported, many of them not…).

One of the greatest challenges I faced was determining the exact structure of a table in a Word document. A table can have horizontally or vertically merged cells and one or more header rows. I didn’t pay much attention to borders and shading, since those are cosmetic and not really important for the table structure (that is, for the information the table holds).

So the first case is a simple table (no merging, just a plain table):

Column 1 Column 2 Column 3
Line 1
Line 2
Line 3

Let’s suppose that I have created the following classes to represent cells and tables:

using System;
using System.Collections.Generic;
using Word = Microsoft.Office.Interop.Word;
using Office = Microsoft.Office.Core;
using Microsoft.Office.Tools.Word;
using System.Windows.Forms;
using Microsoft.Office.Interop.Word;
using System.Xml;

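// One cell of the extracted table (row and column are 1-based, as in Word).
public class CustomCell
{
    public int cellRow;
    public int cellColumn;
    public int cellRowSpan;
    public int cellColumnSpan;
    public String cellText;
}

// The extracted table: its dimensions and a flat list of its cells.
public class CustomTable
{
    public int rowsField;
    public int columnsField;
    public List<CustomCell> cells;
}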

In order to get the structure of the given table we simply need to loop through each row and each cell and read its contents. Simple as 1, 2, 3… If a table is selected in Word, we can use the following code:


// The first table of the current selection (Word collections are 1-based).
Table wordTable = Application.Selection.Tables[1];
CustomTable table = new CustomTable();
table.cells = new List<CustomCell>();
for (int row = 1; row <= wordTable.Rows.Count; row++)
{
    for (int column = 1; column <= wordTable.Rows[row].Cells.Count; column++)
    {
        CustomCell cell = new CustomCell();
        cell.cellColumn = column;
        cell.cellRow = row;
        // Range.Text includes Word's trailing end-of-cell mark.
        cell.cellText = wordTable.Cell(row, column).Range.Text;
        table.cells.Add(cell);
    }
}

This code also works for a table like this:

Merged cell
Line 1
Line 2

Users usually get creative and want something more in their tables, such as merged columns or cells, for better data representation. In many of these cases this code is useless… With vertically merged cells the code will still work, but that is not the case with horizontally merged cells. A nice solution is to use the Cell.Next property that the Word API offers. In that case the code would look like:


Table wordTable = Application.Selection.Tables[1];
Cell wordCell = wordTable.Cell(1, 1);
CustomTable table = new CustomTable();

table.cells = new List<CustomCell>();
// Cell.Next returns the next cell of the table, or null after the last one.
while (wordCell != null)
{
    CustomCell cell = new CustomCell();
    cell.cellRow = wordCell.RowIndex;
    cell.cellColumn = wordCell.ColumnIndex;
    cell.cellText = wordCell.Range.Text;
    table.cells.Add(cell);
    wordCell = wordCell.Next;
}
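
A side note on cellText in the snippets above: the text Word returns for a cell ends with an end-of-cell mark, a carriage return followed by a bell character. If you only want the visible text, a small helper along these lines could strip it. This is a minimal sketch; the class and method names (CellText, Clean) are my own and not part of the Word API, and it also drops empty paragraphs at the end of the cell.

```csharp
using System;

public static class CellText
{
    // Removes trailing '\r' (paragraph mark) and '\a' (end-of-cell mark)
    // characters from the raw Range.Text of a table cell.
    // Paragraph marks inside the text are left untouched.
    public static string Clean(string rawCellText)
    {
        if (rawCellText == null)
        {
            return string.Empty;
        }
        return rawCellText.TrimEnd('\r', '\a');
    }
}
```

With this in place, the assignment in the loops would become cell.cellText = CellText.Clean(wordCell.Range.Text);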

Here is a slightly more complex table:

Text Text Text
Text
Text

So just when you think all your problems are solved, someone comes and asks for the exact row and column span of each cell, maybe because they want a Word document viewer or just want to embed the table in a web site. Here the API doesn’t help with a method or property; you have to find your own algorithm to get this information. I read many ideas on many forums: some broke the merged cells apart and then merged them again (?!), some used the cell height or width to decide whether a cell has a row or column span, and there were many more creative ideas. They all seemed unworkable to me, because I could think of many exceptions where the code wouldn’t work properly. Then I bumped into a comment saying that the XML structure of the table is a good place to start. Since there is a lot of documentation on how to form a legal XML structure for a table, I will simply refer to the following link: http://msdn.microsoft.com/en-us/library/office/ff951689.aspx
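
To give a feeling for what that XML looks like, here is a small hand-written fragment in the Word 2003 XML namespace together with a helper that reads the merge markers of one cell. Treat it as a sketch: the fragment is heavily simplified (the real Range.XML output wraps the table in a full document and carries many more properties), and the names WordMlPeek, SampleXml and Describe are mine. A cell that spans several columns carries w:gridSpan; a cell that starts a vertical merge carries w:vmerge with w:val="restart", and the cells below it carry w:vmerge without a value.

```csharp
using System;
using System.Xml;

public static class WordMlPeek
{
    public const string W = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Row 1: a cell spanning two columns, then a cell that starts a vertical merge.
    // Row 2: two plain cells, then the cell that continues the vertical merge.
    public const string SampleXml =
        "<w:tbl xmlns:w='" + W + "'>" +
        "<w:tr>" +
        "<w:tc><w:tcPr><w:gridSpan w:val='2'/></w:tcPr><w:p/></w:tc>" +
        "<w:tc><w:tcPr><w:vmerge w:val='restart'/></w:tcPr><w:p/></w:tc>" +
        "</w:tr>" +
        "<w:tr>" +
        "<w:tc><w:p/></w:tc>" +
        "<w:tc><w:p/></w:tc>" +
        "<w:tc><w:tcPr><w:vmerge/></w:tcPr><w:p/></w:tc>" +
        "</w:tr>" +
        "</w:tbl>";

    // Describes the merge markers of the cell at the given 1-based row
    // and 1-based cell position within that row.
    public static string Describe(string xml, int row, int cell)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("w", W);

        XmlNode tc = doc.SelectSingleNode("//w:tr[" + row + "]/w:tc[" + cell + "]", ns);
        if (tc == null)
        {
            return "no such cell";
        }

        XmlNode gridSpan = tc.SelectSingleNode("w:tcPr/w:gridSpan", ns);
        XmlNode vmerge = tc.SelectSingleNode("w:tcPr/w:vmerge", ns);

        // A missing w:gridSpan means the cell covers a single column.
        string span = gridSpan == null ? "1" : gridSpan.Attributes["val", W].Value;

        // w:vmerge without w:val marks a cell that continues a merge from above.
        string merge = "none";
        if (vmerge != null)
        {
            XmlAttribute val = vmerge.Attributes["val", W];
            merge = val == null ? "continue" : val.Value;
        }
        return "gridSpan=" + span + ", vmerge=" + merge;
    }
}
```

For example, WordMlPeek.Describe(WordMlPeek.SampleXml, 1, 1) returns "gridSpan=2, vmerge=none", and the third cell of the second row is reported as "gridSpan=1, vmerge=continue".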

Once you have a reasonable grasp of that structure, you will understand the following code, which goes through the cells and gets the information you need:

Table wordTable = Application.Selection.Tables[1];
Cell wordCell = wordTable.Cell(1, 1);
CustomTable table = new CustomTable();
table.cells = new List<CustomCell>();

// Load the Word 2003 XML of the table so it can be queried with XPath.
String s = wordTable.Range.XML;
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(s);
XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDoc.NameTable);
nsmgr.AddNamespace("w", "http://schemas.microsoft.com/office/word/2003/wordml");

while (wordCell != null)
{
    CustomCell cell = new CustomCell();
    cell.cellRow = wordCell.RowIndex;
    cell.cellColumn = wordCell.ColumnIndex;

    // XPath to the properties (w:tcPr) of the current cell.
    String cellProps = "//w:tr[" + wordCell.RowIndex + "]/w:tc[" + wordCell.ColumnIndex + "]/w:tcPr";

    // Column span: w:gridSpan holds the number of columns the cell covers.
    int colspan = 1;
    XmlNode exactCell = xmlDoc.SelectSingleNode(cellProps + "/w:gridSpan", nsmgr);
    if (exactCell != null)
    {
        colspan = Convert.ToInt32(exactCell.Attributes["w:val"].Value);
    }

    // Row span: a w:vmerge element with a w:val attribute starts a vertical
    // merge; cells below that continue it have w:vmerge without w:val.
    int rowspan = 1;
    XmlNode exactCellVMerge = xmlDoc.SelectSingleNode(cellProps + "/w:vmerge", nsmgr);
    if (exactCellVMerge != null && exactCellVMerge.Attributes["w:val"] != null)
    {
        int nextRow = wordCell.RowIndex + 1;
        while (nextRow <= wordTable.Rows.Count)
        {
            XmlNode nextCellMerge = xmlDoc.SelectSingleNode("//w:tr[" + nextRow + "]/w:tc[" + wordCell.ColumnIndex + "]/w:tcPr/w:vmerge", nsmgr);
            // Stop at the first row whose cell does not continue the merge.
            if (nextCellMerge == null || nextCellMerge.Attributes["w:val"] != null)
            {
                break;
            }
            nextRow++;
            rowspan++;
        }
    }

    cell.cellRowSpan = rowspan;
    cell.cellColumnSpan = colspan;
    cell.cellText = wordCell.Range.Text;
    table.cells.Add(cell);
    wordCell = wordCell.Next;
}
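
One case worth testing with the loop above: it looks up the cell below by its position within the row, and that position can differ from the visual column when an earlier cell in either row has a w:gridSpan. A possible variant, sketched here under my own assumptions, works on the XML alone and tracks the grid column at which each cell starts. The names SpanInfo and TableSpans are hypothetical, and the sketch ignores row-level offsets and w:hmerge, which real documents may contain.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public class SpanInfo
{
    public int Row;         // 1-based row index
    public int GridColumn;  // 1-based grid column where the cell starts
    public int RowSpan;
    public int ColumnSpan;
}

public static class TableSpans
{
    const string W = "http://schemas.microsoft.com/office/word/2003/wordml";

    public static List<SpanInfo> Compute(string tableXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(tableXml);
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("w", W);

        XmlNode tbl = doc.SelectSingleNode("//w:tbl", ns);
        if (tbl == null) throw new ArgumentException("No w:tbl element found.");

        // First pass: for every row, map starting grid column -> cell node.
        var rows = new List<SortedDictionary<int, XmlNode>>();
        foreach (XmlNode tr in tbl.SelectNodes("w:tr", ns))
        {
            var map = new SortedDictionary<int, XmlNode>();
            int gridColumn = 1;
            foreach (XmlNode tc in tr.SelectNodes("w:tc", ns))
            {
                map[gridColumn] = tc;
                gridColumn += GridSpan(tc, ns);
            }
            rows.Add(map);
        }

        // Second pass: emit every cell that is not a merge continuation.
        var result = new List<SpanInfo>();
        for (int r = 0; r < rows.Count; r++)
        {
            foreach (KeyValuePair<int, XmlNode> entry in rows[r])
            {
                string merge = VMerge(entry.Value, ns);
                if (merge == "continue") continue;

                int rowSpan = 1;
                XmlNode below;
                // Count rows below whose cell at the same grid column continues the merge.
                while (merge == "restart"
                       && r + rowSpan < rows.Count
                       && rows[r + rowSpan].TryGetValue(entry.Key, out below)
                       && VMerge(below, ns) == "continue")
                {
                    rowSpan++;
                }

                result.Add(new SpanInfo
                {
                    Row = r + 1,
                    GridColumn = entry.Key,
                    RowSpan = rowSpan,
                    ColumnSpan = GridSpan(entry.Value, ns)
                });
            }
        }
        return result;
    }

    // Number of grid columns covered by the cell (1 when w:gridSpan is absent).
    static int GridSpan(XmlNode tc, XmlNamespaceManager ns)
    {
        XmlNode node = tc.SelectSingleNode("w:tcPr/w:gridSpan", ns);
        return node == null ? 1 : int.Parse(node.Attributes["val", W].Value);
    }

    // "none" without w:vmerge, "restart" for w:val="restart", otherwise "continue".
    static string VMerge(XmlNode tc, XmlNamespaceManager ns)
    {
        XmlNode node = tc.SelectSingleNode("w:tcPr/w:vmerge", ns);
        if (node == null) return "none";
        XmlAttribute val = node.Attributes["val", W];
        return (val != null && val.Value == "restart") ? "restart" : "continue";
    }
}
```

You would call it as TableSpans.Compute(wordTable.Range.XML) and then match the results to the Word cells. How Word's ColumnIndex relates to the grid column is something you should verify on your own documents.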

Last words: Copy, paste, test, use, reuse but don’t abuse 🙂