I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table.
If, for example, I had an HTML table with three columns (marked by header tags), 'Event', 'Start Date', and 'End Date' and that table had 5 entries, I would like to parse through that table to get back a list of length 5 where each element is a dictionary with keys 'Event', 'Start Date', and 'End Date'.
Thanks for the help!
Andrew

4 Answers
You should use an HTML parsing library such as lxml.
Sven Marnach

Sven Marnach's excellent solution translates directly to ElementTree, which is part of recent Python distributions, and it gives the same output as Sven Marnach's answer.
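
A minimal sketch of the ElementTree approach, assuming the table is well-formed XML and that its first row holds the header cells. The sample markup and the function name `table_to_dicts` are illustrative and not taken from the answer above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample table; ElementTree only accepts well-formed XML.
HTML = """
<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
</table>
"""


def table_to_dicts(markup):
    """Return one dict per data row, keyed by the header cell text."""
    table = ET.fromstring(markup.strip())
    rows = table.iter("tr")
    # The first <tr> is assumed to contain the <th> header cells.
    headers = ["".join(cell.itertext()).strip() for cell in next(rows)]
    # Every remaining <tr> becomes a dict that pairs headers with cell text.
    return [
        dict(zip(headers, ("".join(cell.itertext()).strip() for cell in row)))
        for row in rows
    ]


records = table_to_dicts(HTML)
print(records)
```

Each element of `records` is a dictionary with the keys 'Event', 'Start Date', and 'End Date', which is the shape the question asks for.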
Hands down, the easiest way to parse an HTML table is pandas.read_html(); it accepts both URLs and HTML.
The only downside is that read_html() doesn't preserve hyperlinks by default (pandas 1.5 and later offer an extract_links parameter for this).
If the HTML is not well-formed XML, you can't parse it with etree. Even then, you don't need an external library to parse an HTML table. In Python 3 you can reach your goal with HTMLParser from html.parser. I've put the code for a simple derived HTMLParser class in a GitHub repo.
You can feed your HTML to that class (here named HTMLTableParser). The output is a list of 2D lists, one per table.
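
The repository code isn't reproduced here, so the following is only a rough sketch of the same idea using the standard library. The class name `SimpleTableParser`, the helper `rows_to_dicts`, and the sample data are hypothetical; nested tables are not handled.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one 2D list per <table> seen so far
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self.tables and self.tables[-1]:
                self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def rows_to_dicts(table):
    """Treat the first row as headers and turn the other rows into dicts."""
    headers, *body = table
    return [dict(zip(headers, row)) for row in body]


# Hypothetical sample: not valid XML (unquoted attribute, unclosed <br>).
HTML = """
<table border=1>
<tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
<tr><td>Launch<br></td><td>2018-01-30</td><td>2018-02-01</td></tr>
</table>
"""

parser = SimpleTableParser()
parser.feed(HTML)
parser.close()
print(parser.tables)
print(rows_to_dicts(parser.tables[0]))
```

`parser.tables` is the list of 2D lists; `rows_to_dicts` converts one of those tables into the list of dictionaries asked for in the question.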
schmijos
Table objects are constructed using the add_table() method on Document.

Table objects
docx.table.Table(tbl, parent)
Proxy class for a WordprocessingML <w:tbl> element.

add_column(width)
Return a _Column object of width, newly added rightmost to the table.

add_row()
Return a _Row instance, newly added bottom-most to the table.

alignment
Read/write. A member of WD_TABLE_ALIGNMENT or None, specifying the positioning of this table between the page margins. None if no setting is specified, causing the effective value to be inherited from the style hierarchy.

autofit
True if column widths can be automatically adjusted to improve the fit of cell contents. False if table layout is fixed. Column widths are adjusted in either case if total column width exceeds page width. Read/write boolean.

cell(row_idx, col_idx)
Return _Cell instance corresponding to table cell at row_idx, col_idx intersection, where (0, 0) is the top, left-most cell.

column_cells(column_idx)
Sequence of cells in the column at column_idx in this table.

columns
_Columns instance representing the sequence of columns in this table.

row_cells(row_idx)
Sequence of cells in the row at row_idx in this table.

rows
_Rows instance containing the sequence of rows in this table.

style
Read/write. A _TableStyle object representing the style applied to this table. The default table style for the document (often Normal Table) is returned if the table has no directly-applied style. Assigning None to this property removes any directly-applied table style, causing it to inherit the default table style of the document. Note that the style name of a table style differs slightly from that displayed in the user interface; a hyphen, if it appears, must be removed. For example, Light Shading - Accent 1 becomes Light Shading Accent 1.

table_direction
A member of WD_TABLE_DIRECTION indicating the direction in which the table cells are ordered, e.g. WD_TABLE_DIRECTION.LTR. None indicates the value is inherited from the style hierarchy.
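
The hyphen rule described under style can be sketched as a small helper. The function name `ui_name_to_style_name` is hypothetical and not part of python-docx; it assumes the only difference from the user-interface name is the hyphen and the spaces around it.

```python
def ui_name_to_style_name(ui_name):
    """Drop hyphens from a table style name as displayed in the user interface."""
    # Split on the hyphen, trim each piece, and rejoin with single spaces.
    parts = (part.strip() for part in ui_name.split("-"))
    return " ".join(part for part in parts if part)


print(ui_name_to_style_name("Light Shading - Accent 1"))  # Light Shading Accent 1
print(ui_name_to_style_name("Table Grid"))                # Table Grid
```

The returned string is what would be used when setting the style property of a table.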

_Cell objects

docx.table._Cell(tc, parent)
Table cell

add_paragraph(text=u'', style=None)
Return a paragraph newly added to the end of the content in this cell. If present, text is added to the paragraph in a single run. If specified, the paragraph style style is applied. If style is not specified or is None, the result is as though the 'Normal' style was applied. Note that the formatting of text in a cell can be influenced by the table style. text can contain tab (\t) characters, which are converted to the appropriate XML form for a tab. text can also include newline (\n) or carriage return (\r) characters, each of which is converted to a line break.

add_table(rows, cols)
Return a table newly added to this cell after any existing cell content, having rows rows and cols columns. An empty paragraph is added after the table because Word requires a paragraph element as the last element in every cell.

merge(other_cell)
Return a merged cell created by spanning the rectangular region having this cell and other_cell as diagonal corners. Raises InvalidSpanError if the cells do not define a rectangular region.

paragraphs
List of paragraphs in the cell. A table cell is required to contain at least one block-level element and end with a paragraph. By default, a new cell contains a single paragraph. Read-only.

tables
List of tables in the cell, in the order they appear. Read-only.

text
The entire contents of this cell as a string of text. Assigning a string to this property replaces all existing content with a single paragraph containing the assigned text in a single run.

vertical_alignment
Member of WD_CELL_VERTICAL_ALIGNMENT or None. A value of None indicates vertical alignment for this cell is inherited. Assigning None causes any explicitly defined vertical alignment to be removed, restoring inheritance.

width
The width of this cell in EMU, or None if no explicit width is set.
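
Widths are reported in EMU (English Metric Units), with 914400 EMU to the inch, 360000 to the centimetre, and 12700 to the point. The sketch below shows the arithmetic only; the helper names are hypothetical, and python-docx ships its own length helpers in docx.shared (Inches, Cm, Pt, Emu) that are normally the better choice.

```python
EMU_PER_INCH = 914400
EMU_PER_CM = 360000
EMU_PER_PT = 12700


def inches_to_emu(inches):
    """Convert inches to a whole number of EMU."""
    return int(round(inches * EMU_PER_INCH))


def emu_to_inches(emu):
    """Convert EMU to inches as a float."""
    return emu / EMU_PER_INCH


def emu_to_cm(emu):
    """Convert EMU to centimetres as a float."""
    return emu / EMU_PER_CM


def emu_to_pt(emu):
    """Convert EMU to points as a float."""
    return emu / EMU_PER_PT


print(inches_to_emu(1.5))      # 1371600
print(emu_to_inches(914400))   # 1.0
print(emu_to_cm(914400))       # 2.54
```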

_Row objects

docx.table._Row(tr, parent)
Table row

cells
Sequence of _Cell instances corresponding to cells in this row.

height
Return a Length object representing the height of this row, or None if no explicit height is set.

height_rule
Return the height rule of this row as a member of the WD_ROW_HEIGHT_RULE enumeration, or None if no explicit height_rule is set.

table
Reference to the Table object this row belongs to.

_Column objects

docx.table._Column(gridCol, parent)
Table column

cells
Sequence of _Cell instances corresponding to cells in this column.

table
Reference to the Table object this column belongs to.

width
The width of this column in EMU, or None if no explicit width is set.

_Rows objects

docx.table._Rows(tbl, parent)
Sequence of _Row objects corresponding to the rows in a table. Supports len(), iteration, indexed access, and slicing.

table
Reference to the Table object this row collection belongs to.

_Columns objects

docx.table._Columns(tbl, parent)
Sequence of _Column instances corresponding to the columns in a table. Supports len(), iteration and indexed access.

table
Reference to the Table object this column collection belongs to.