Calculating words from plain t

2006/11/09 by Lassi A. Liikkanen

Calculating words in formatted text

Say you have to calculate the number of words or characters of several paragprahs in a plain text file. For instance, the word 'summary' is always followed by a paragraph and you would like to know how long these paragraphs are. This would be trivial if the text was of a structured format (XML or other SGML derivative) and could parsed with a dedicated tool (if you're into programming). So normally, to do this using Ms Word, you could manually go through all the paragrahs, selecting them and performing a word count for each. But that would be laborous if the number fields was considerable.

Solution is to use Word and Excel together. First, we will use the Find (Ctrl+F) function of Word to find all the fields. We will use two special options, Highlight all items found in Main document and Use wildcards. The former allows us to select all found fields at once (clever, huh), which is very difficult by hand. The latter option allows us to search using wildcards, the Ms Word version of regular expressions. Using wildcards we can neatly specify what we're looking for:
Summary:^13*^13
is the magic word. In english, it says that we would like to find all occurrences of a string starting with the word "summary:" and followed by a paragraph mark ^13 (or line feed/carriage return/enter). We use the strange ^13 because the normally used ^p is not supported with wildcards (for unknown reason). Then we use the standard asterisk (*) wildcard that will capture all text within the paragraph before the paragraph is terminated by a second paragraph mark. Click ok you should see word highlight all the fields we are looking for. Then just Copy (Ctrl+C) all information to clipboard.

From clipboard, open a new worksheet in Excel and select Paste special... and choose Text. This will paste all selected fields along with their headers (summary:) to a single column. You will likely need to re-adjust the size and formatting of the cells to fit the text into cell better. Next we will do the count, which goes easily for the character using the workbook function len(). Say the first field of interest is in the cell A2, then go to B2 and enter the formula:
=len(a2)
Next, select the cells b1:b2 and copy their formula to all cells you want to include. The empty cell is included to avoid counting the header field repeatedly. Then select all the data you have and Sort it by the column B descending, and there it is!

Counting Words in Excel is more tricky, because word count function is non-existent. However, Ozgrid provides a tip on how to do this properly. You will need to formula
=len(trim(A2))-len(substitute(A2," ",""))+1
this will effectively give the number of words. Again, sort and add average() and stdev() to know all about your text file.

Keywords: [computers] , [windows] Document's status: Ok (Document dates explained)

This document created: 2006/11/09
Modified: 2006/11/09
Published: 2006/11/09


This document's permanent URI (linking):
http://iki.fi/lassial/guides/computers/061109-Calculating_words_from_plain_t

© Lassi A. Liikkanen 2006 - 2016. All rights reserved.
^Top of the Page^

*Change layout:
Printable printable
Large text

Myydään:

@lassial Twitter feed: