Text Processing

ela.textproc module

Text processing features for lithology analysis.

ela.textproc.DEM_ELEVATION_COL = 'DEM_elevation': Default column name expected in lithodescription data frames

ela.textproc.DEPTH_FROM_AHD_COL = 'Depth From (AHD)': Default column name expected in lithodescription data frames

ela.textproc.DEPTH_FROM_COL = 'Depth From (m)': Default column name expected in lithodescription data frames

ela.textproc.DEPTH_TO_AHD_COL = 'Depth To (AHD)': Default column name expected in lithodescription data frames

ela.textproc.DEPTH_TO_COL = 'Depth To (m)': Default column name expected in lithodescription data frames

ela.textproc.DISTANCE_COL = 'distance': Default column name expected in lithodescription data frames

ela.textproc.EASTING_COL = 'Easting': Default column name expected in lithodescription data frames

ela.textproc.GEOMETRY_COL = 'geometry': Default column name expected in lithodescription data frames

ela.textproc.LITHO_DESC_COL = 'Lithological Description': Default column name expected in lithodescription data frames

ela.textproc.NORTHING_COL = 'Northing': Default column name expected in lithodescription data frames

ela.textproc.PRIMARY_LITHO_COL = 'Lithology_1': Default column name expected in lithodescription data frames

ela.textproc.PRIMARY_LITHO_NUM_COL = 'Lithology_1_num': Default column name expected in lithodescription data frames

ela.textproc.SECONDARY_LITHO_COL = 'Lithology_2': Default column name expected in lithodescription data frames

ela.textproc.SECONDARY_LITHO_NUM_COL = 'Lithology_2_num': Default column name expected in lithodescription data frames

ela.textproc.as_numeric(x)

ela.textproc.clean_lithology_descriptions(description_series, lex)

Preparatory cleanup of lithology descriptions for further analysis.

Replaces abbreviations and misspellings according to a lexicon, and converts the text to lower case.

Parameters

description_series (iterable of str, or pd.Series) – lithology descriptions
lex (striplog.Lexicon) – an instance of striplog’s Lexicon

Returns

processed descriptions.

Return type

(iterable of str)
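
The cleanup described above can be pictured with a minimal sketch. It does not use striplog: the `ABBREVIATIONS` dictionary and the function name `clean_descriptions_sketch` are hypothetical stand-ins for the lexicon and for the library function, chosen only for illustration.

```python
import re

# Hypothetical stand-in for a lexicon: maps abbreviations or misspellings
# to the full term. striplog's Lexicon is richer than this plain dict.
ABBREVIATIONS = {"sdy": "sandy", "cly": "clay", "grvl": "gravel"}


def clean_descriptions_sketch(descriptions, abbreviations=ABBREVIATIONS):
    """Lower-case each description and expand whole-word abbreviations."""
    cleaned = []
    for text in descriptions:
        text = text.lower()
        for short, full in abbreviations.items():
            # \b restricts the replacement to whole words
            text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
        cleaned.append(text)
    return cleaned


print(clean_descriptions_sketch(["Sdy CLY with Grvl", "Brown clay"]))
# ['sandy clay with gravel', 'brown clay']
```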

ela.textproc.columns_as_numeric(df, colnames=None)

Convert the given columns to numeric types. The data frame is modified in place.

Parameters

df (pandas data frame) – bore lithology data
colnames (iterable of str) – column names

ela.textproc.find_litho_markers(tokens, regex)

Find lithology terms that match a regular expression.

Parameters

tokens (iterable of str) – the tokens of a tokenised sentence.
regex (regex) – compiled regular expression, e.g. re.compile('sand|clay')

Returns

tokens found to be matching the expression

Return type

(list of str)
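
As an illustration of the matching step, here is a small sketch. It assumes a token counts as a match when the pattern is found anywhere in it (`re.search`); the library may anchor the match differently. The name `find_markers_sketch` is hypothetical.

```python
import re


def find_markers_sketch(tokens, regex):
    """Keep the tokens in which the compiled pattern is found.

    Assumption: 'found anywhere in the token' (re.search) semantics.
    """
    return [t for t in tokens if regex.search(t) is not None]


pattern = re.compile("sand|clay")
tokens = ["brown", "sandy", "clay", "with", "gravel"]
markers = find_markers_sketch(tokens, pattern)
print(markers)  # ['sandy', 'clay']
```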

ela.textproc.find_primary_lithology(tokens, lithologies_dict)

Find a primary lithology in a tokenised sentence.

Parameters

tokens (iterable of str) – the tokens of a tokenised sentence.
lithologies_dict (dict) – dictionary, where keys are exact markers as match for lithologies. Values are the lithology classes.

Returns

primary lithology if detected. empty string for none.

Return type

str
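
A minimal sketch of the dictionary lookup follows. It assumes that the first token found among the dictionary keys decides the class, and that a single class string is returned; both are assumptions for illustration, and `find_primary_sketch` is a hypothetical name.

```python
def find_primary_sketch(tokens, lithologies_dict):
    """Return the class of the first token that is an exact marker, else ''.

    Assumption: the first matching token in sentence order wins.
    """
    for token in tokens:
        if token in lithologies_dict:
            return lithologies_dict[token]
    return ""


# Keys are exact markers, values are the lithology classes.
lithologies = {"sand": "sand", "clay": "clay", "clays": "clay"}

print(find_primary_sketch(["brown", "clays", "and", "sand"], lithologies))  # clay
print(repr(find_primary_sketch(["topsoil"], lithologies)))  # ''
```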

ela.textproc.find_regex_df(df, expression, colname)

Select the rows where the column colname matches the regular expression.

Parameters: df (pandas data frame) – bore lithology data containing the column colname
Return type: dataframe

ela.textproc.find_secondary_lithology(tokens_and_primary, lithologies_adjective_dict, lithologies_dict)

Find a secondary lithology in a tokenised sentence.

Parameters

tokens_and_primary (tuple ([str], str)) – tokens and the primary lithology
lithologies_adjective_dict (dict) – dictionary, where keys are exact, “clear” markers for secondary lithologies (e.g. ‘clayey’). Values are the lithology classes.
lithologies_dict (dict) – dictionary, where keys are exact markers as match for lithologies. Values are the lithology classes.

Returns

secondary lithology if detected. empty string for none.

Return type

str
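
The following sketch shows one way such a lookup could work. It assumes adjective markers are tried before plain markers and that a candidate equal to the primary lithology is skipped; the search order is an assumption, not a statement about the library's implementation. `find_secondary_sketch` is a hypothetical name.

```python
def find_secondary_sketch(tokens_and_primary, lithologies_adjective_dict, lithologies_dict):
    """Return a secondary lithology class different from the primary, else ''."""
    tokens, primary = tokens_and_primary
    # Assumption: "clear" adjective markers (e.g. 'clayey') take precedence
    # over plain lithology markers.
    for lookup in (lithologies_adjective_dict, lithologies_dict):
        for token in tokens:
            candidate = lookup.get(token, "")
            if candidate and candidate != primary:
                return candidate
    return ""


adjectives = {"sandy": "sand", "clayey": "clay"}
lithologies = {"sand": "sand", "clay": "clay", "gravel": "gravel"}

print(find_secondary_sketch((["sandy", "clay"], "clay"), adjectives, lithologies))  # sand
print(find_secondary_sketch((["clay", "with", "gravel"], "clay"), adjectives, lithologies))  # gravel
```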

ela.textproc.find_word_from_root(tokens, root)

Filter tokens (words) to retain only those containing a root term.

Parameters

tokens (iterable of str) – the list of tokens.
root (str) – regular expression for the root term to look for (e.g. ‘clay’ or ‘cl(a|e)y’), which will be padded with ‘[a-z]*’ for searching

Returns

terms matching the root term.

Return type

list
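
To make the padding concrete, here is a sketch that assumes the root is padded with '[a-z]*' on both sides and that tokens are already lower case. The non-capturing group around the root and the use of `fullmatch` are choices made for this illustration; `find_word_from_root_sketch` is a hypothetical name.

```python
import re


def find_word_from_root_sketch(tokens, root):
    """Keep lower-case tokens made of letters that contain the root pattern."""
    # The (?:...) group keeps an alternation inside `root` from leaking
    # into the padding on either side.
    pattern = re.compile("[a-z]*(?:" + root + ")[a-z]*")
    return [t for t in tokens if pattern.fullmatch(t)]


tokens = ["clay", "sand", "clayey", "cley", "silty", "claystone"]
print(find_word_from_root_sketch(tokens, "cl(a|e)y"))
# ['clay', 'clayey', 'cley', 'claystone']
```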

ela.textproc.flat_list_tokens(descriptions)

Convert a collection of strings to a flat list of tokens. English stopwords (from NLTK) are removed.

Parameters: descriptions (iterable of str) – lithology descriptions.
Returns: List of tokens.
Return type: list
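
The flattening step can be sketched without NLTK. In this illustration whitespace splitting replaces the NLTK tokeniser and a small hand-written set replaces the NLTK English stopword list; both substitutions, and the name `flat_list_tokens_sketch`, are assumptions made to keep the example self-contained.

```python
from itertools import chain

# Hypothetical, deliberately tiny stopword set (NLTK's list is much longer).
STOPWORDS_SKETCH = {"and", "with", "of", "the", "a", "some"}


def flat_list_tokens_sketch(descriptions, stopwords=STOPWORDS_SKETCH):
    """Tokenise each description, flatten to one list, drop stopwords."""
    tokenised = (d.split() for d in descriptions)
    return [t for t in chain.from_iterable(tokenised) if t not in stopwords]


print(flat_list_tokens_sketch(["clay with sand", "gravel and some silt"]))
# ['clay', 'sand', 'gravel', 'silt']
```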

ela.textproc.match_and_sample_df(df, litho_class_name, colname='Lithology_1', out_colname=None, size=50, seed=None)

Sample a random subset of rows where the lithology column matches a particular class name.

Parameters: df (pandas data frame) – bore lithology data with columns named PRIMARY_LITHO_COL
Returns: a list of strings, compound primary+optional_secondary lithology descriptions e.g. ‘sand/clay’, ‘loam/’

ela.textproc.plot_freq(dataframe, y_log=False, x='token', figsize=(15, 10), fontsize=14)

Plot a sorted histogram of word frequencies.

Parameters

dataframe (pandas dataframe) – frequency of tokens, typically with colnames ["token", "frequency"]
y_log (bool) – whether to use a log scale on the y axis
x (str) – name of the column containing the tokens (i.e. words)
figsize (tuple) –
fontsize (int) –

Returns

plot

Return type

barplot

ela.textproc.plot_freq_for_root(tokens, root, y_log=True)

Plot a sorted histogram of word frequencies.

Parameters

tokens (iterable of str) – the list of tokens.
root (str) – regular expression for the root term to look for (e.g. ‘clay’ or ‘cl(a|e)y’), which will be padded with ‘[a-z]*’ for searching
y_log (bool) – whether to use a log scale on the y axis

Returns

plot

Return type

barplot

ela.textproc.remove_punctuations(text): Remove the punctuation characters (string.punctuation) from a string.

ela.textproc.replace_punctuations(text, replacement=' '): Replace the punctuation characters (string.punctuation) in a string with a replacement string.
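
Both operations can be sketched with `str.translate` over `string.punctuation`. This is one possible way to do it, not necessarily how the library does; the `_sketch` function names are hypothetical.

```python
import string


def remove_punctuations_sketch(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


def replace_punctuations_sketch(text, replacement=" "):
    """Substitute every character listed in string.punctuation."""
    table = str.maketrans({c: replacement for c in string.punctuation})
    return text.translate(table)


print(remove_punctuations_sketch("clay, sandy (wet)."))  # clay sandy wet
print(replace_punctuations_sketch("sand/clay"))  # sand clay
```

Replacing rather than removing keeps terms such as ‘sand/clay’ from fusing into a single token.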

ela.textproc.split_composite_term(x, joint_re='with')

Split composite terms made of several words fused together without spaces. This has been observed in one case study but may not be prevalent.

Parameters

x (str) – the term to split if matching, e.g. ‘claywithsand’ to ‘clay with sand’
joint_re (str) – regular expression for the joining word that fuses the terms, typically ‘with’

Returns

tokens split from the joining term.

Return type

split wording (str)
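
A sketch of the splitting, assuming lower-case input and at least one letter on each side of the joining word. The exact pattern used by the library is not shown in this page, so the regular expression below is an assumption, and `split_composite_sketch` is a hypothetical name.

```python
import re


def split_composite_sketch(x, joint_re="with"):
    """Insert spaces around a joining word fused between two words."""
    # Assumption: letters are required on both sides of the joining word,
    # so a standalone 'with' is left untouched.
    pattern = re.compile(r"([a-z]+)(" + joint_re + r")([a-z]+)")
    return pattern.sub(r"\1 \2 \3", x)


print(split_composite_sketch("claywithsand"))  # clay with sand
print(split_composite_sketch("clay"))  # clay
```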

ela.textproc.split_with_term(x)

Split words that are joined by a ‘with’, e.g. ‘sandwithclay’.

Parameters: x (str) – the term to split if matching, e.g. ‘claywithsand’ to ‘clay with sand’

Returns: tokens split from the joining term.
Return type: split wording (str)

ela.textproc.token_freq(tokens, n_most_common=50)

Get the most frequent tokens and their counts.

Parameters

tokens (iterable of str) – the list of tokens to analyse for frequency.
n_most_common (int) – keep only this number of the most frequent tokens

Returns

data frame with columns ["token", "frequency"]

Return type

pandas DataFrame
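
The counting itself can be sketched with `collections.Counter`. Unlike the documented function, this illustration returns a list of (token, count) tuples rather than a pandas DataFrame, to stay within the standard library; `token_freq_sketch` is a hypothetical name.

```python
from collections import Counter


def token_freq_sketch(tokens, n_most_common=50):
    """Return the n most common tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n_most_common)


tokens = ["clay", "sand", "clay", "gravel", "clay", "sand"]
print(token_freq_sketch(tokens, n_most_common=2))
# [('clay', 3), ('sand', 2)]
```

The pairs map directly onto the two documented columns, so passing them to `pandas.DataFrame(rows, columns=["token", "frequency"])` would give a comparable table.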

ela.textproc.v_find_litho_markers(v_tokens, regex)

Vectorised function to find lithology terms that match a regular expression.

Parameters

v_tokens (iterable of iterable of str) – the list of tokenised sentences.
regex (regex) – compiled regular expression, e.g. re.compile('sand|clay')

Returns

tokens found to be matching the expression

Return type

(iterable of iterable of str)

ela.textproc.v_find_primary_lithology(v_tokens, lithologies_dict)

Vectorised function to find a primary lithology in a list of tokenised sentences.

Parameters

v_tokens (iterable of iterable of str) – the list of tokenised sentences.
lithologies_dict (dict) – dictionary, where keys are exact markers as match for lithologies. Values are the lithology classes.

Returns

list of primary lithologies if detected. empty string for none.

Return type

list
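
The vectorised form can be read as the per-sentence lookup applied element by element, giving one result per input sentence. The sketch below uses a plain list comprehension under that assumption; the library may rely on numpy vectorisation instead, and both function names are hypothetical.

```python
def find_primary_sketch(tokens, lithologies_dict):
    """Class of the first token that is an exact marker, else ''."""
    for token in tokens:
        if token in lithologies_dict:
            return lithologies_dict[token]
    return ""


def v_find_primary_sketch(v_tokens, lithologies_dict):
    """Apply the per-sentence lookup to every tokenised sentence."""
    return [find_primary_sketch(tokens, lithologies_dict) for tokens in v_tokens]


lithologies = {"clay": "clay", "gravel": "gravel"}
v_tokens = [["sandy", "clay"], ["topsoil"], ["coarse", "gravel"]]
print(v_find_primary_sketch(v_tokens, lithologies))
# ['clay', '', 'gravel']
```

Because the output keeps the order and length of `v_tokens`, it can be stored alongside the descriptions, for example as a new column.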

ela.textproc.v_find_secondary_lithology(v_tokens, prim_litho, lithologies_adjective_dict, lithologies_dict)

Vectorised function to find a secondary lithology in a list of tokenised sentences.

Parameters

v_tokens (iterable of iterable of str) – the list of tokenised sentences.
prim_litho (list of str) – the list of primary lithologies already detected for v_tokens. The secondary lithology cannot be the same as the primary.
lithologies_adjective_dict (dict) – dictionary, where keys are exact, “clear” markers for secondary lithologies (e.g. ‘clayey’). Values are the lithology classes.
lithologies_dict (dict) – dictionary, where keys are exact markers as match for lithologies. Values are the lithology classes.

Returns

list of secondary lithologies if detected. empty string for none.

Return type

list

ela.textproc.v_lower = <numpy.vectorize object>: vectorised, unicode-aware function to convert strings to lower case

ela.textproc.v_remove_punctuations(textlist)

Vectorised function to remove punctuations.

Parameters: textlist (iterable of str) – list of terms
Return type: (list)

ela.textproc.v_replace_punctuations(textlist, replacement=' ')

Vectorised function to replace punctuations.

Parameters: textlist (iterable of str) – list of terms
Return type: (list)

ela.textproc.v_split_with_term(xlist)

Split words that are joined by a ‘with’, e.g. ‘sandwithclay’.

Parameters: xlist (iterable of str) – the terms to split if matching, e.g. ‘claywithsand’ to ‘clay with sand’

Returns: tokens split from the joining term.
Return type: split tokens (list of str)

ela.textproc.v_word_tokenize(descriptions)

Vectorised tokenisation of lithology descriptions.

Parameters: descriptions (iterable of str) – lithology descriptions.
Returns: list of lists of tokens, as produced by the NLTK tokeniser.
Return type: list