Text Processing
ela.textproc module
Text processing features for lithology analysis.
-
ela.textproc.DEM_ELEVATION_COL = 'DEM_elevation'
Default column name expected in lithology description data frames
-
ela.textproc.DEPTH_FROM_AHD_COL = 'Depth From (AHD)'
Default column name expected in lithology description data frames
-
ela.textproc.DEPTH_FROM_COL = 'Depth From (m)'
Default column name expected in lithology description data frames
-
ela.textproc.DEPTH_TO_AHD_COL = 'Depth To (AHD)'
Default column name expected in lithology description data frames
-
ela.textproc.DEPTH_TO_COL = 'Depth To (m)'
Default column name expected in lithology description data frames
-
ela.textproc.DISTANCE_COL = 'distance'
Default column name expected in lithology description data frames
-
ela.textproc.EASTING_COL = 'Easting'
Default column name expected in lithology description data frames
-
ela.textproc.GEOMETRY_COL = 'geometry'
Default column name expected in lithology description data frames
-
ela.textproc.LITHO_DESC_COL = 'Lithological Description'
Default column name expected in lithology description data frames
-
ela.textproc.NORTHING_COL = 'Northing'
Default column name expected in lithology description data frames
-
ela.textproc.PRIMARY_LITHO_COL = 'Lithology_1'
Default column name expected in lithology description data frames
-
ela.textproc.PRIMARY_LITHO_NUM_COL = 'Lithology_1_num'
Default column name expected in lithology description data frames
-
ela.textproc.SECONDARY_LITHO_COL = 'Lithology_2'
Default column name expected in lithology description data frames
-
ela.textproc.SECONDARY_LITHO_NUM_COL = 'Lithology_2_num'
Default column name expected in lithology description data frames
-
ela.textproc.as_numeric(x)
-
ela.textproc.clean_lithology_descriptions(description_series, lex)
Preparatory cleanup of lithology descriptions for further analysis.
Replaces abbreviations and misspellings according to a lexicon, and transforms the text to lower case.
- Parameters
description_series (iterable of str, or pd.Series) – lithology descriptions
lex (striplog.Lexicon) – an instance of striplog’s Lexicon
- Returns
processed descriptions.
- Return type
(iterable of str)
-
ela.textproc.columns_as_numeric(df, colnames=None)
Process some columns to make sure they are numeric. The data frame is changed in place.
- Parameters
df (pandas data frame) – bore lithology data
colnames (iterable of str) – column names
-
ela.textproc.find_litho_markers(tokens, regex)
Find lithology terms that match a regular expression
- Parameters
tokens (iterable of str) – the list of tokens of a tokenised sentence.
regex (regex) – compiled regular expression e.g. re.compile('sand|clay')
- Returns
tokens found to be matching the expression
- Return type
(list of str)
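The filtering described above can be approximated with the standard library alone. This is a minimal sketch, not the implementation in ela: the name find_markers_sketch is hypothetical, and it assumes matching is anchored at the start of each token (regex.match), which the documentation does not state.

```python
import re

def find_markers_sketch(tokens, regex):
    """Keep the tokens that match a compiled regular expression.

    Assumption: the match is anchored at the start of the token.
    """
    return [t for t in tokens if regex.match(t) is not None]

tokens = ["brown", "sandy", "clay", "with", "gravel"]
markers = find_markers_sketch(tokens, re.compile("sand|clay"))
print(markers)  # ['sandy', 'clay']
```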
-
ela.textproc.find_primary_lithology(tokens, lithologies_dict)
Find a primary lithology in a tokenised sentence.
- Parameters
tokens (iterable of str) – the tokens of a tokenised sentence.
lithologies_dict (dict) – dictionary whose keys are the exact markers to match for lithologies and whose values are the lithology classes.
- Returns
primary lithology if detected, empty string otherwise.
- Return type
str
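As an illustration of the marker-to-class lookup, here is a minimal sketch under stated assumptions. The function name primary_lithology_sketch and the example dictionary are hypothetical, and the rule that the first matching token wins is an assumption rather than documented behaviour.

```python
def primary_lithology_sketch(tokens, lithologies_dict):
    """Return the lithology class of the first token found among the markers.

    Assumption: tokens are scanned in order and the first match wins.
    Returns an empty string when no token is a known marker.
    """
    for token in tokens:
        if token in lithologies_dict:
            return lithologies_dict[token]
    return ""

# Hypothetical markers (keys) mapped to lithology classes (values)
lithologies = {"sand": "sand", "clay": "clay", "clays": "clay", "gravel": "gravel"}

found = primary_lithology_sketch(["brown", "clays", "with", "sand"], lithologies)
missing = primary_lithology_sketch(["topsoil"], lithologies)
print(repr(found), repr(missing))  # 'clay' ''
```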
-
ela.textproc.find_regex_df(df, expression, colname)
Sample a random subset of rows where the lithology column matches a particular class name.
- Parameters
df (pandas data frame) – bore lithology data with columns named PRIMARY_LITHO_COL
- Returns
- Return type
dataframe
-
ela.textproc.find_secondary_lithology(tokens_and_primary, lithologies_adjective_dict, lithologies_dict)
Find a secondary lithology in a tokenised sentence.
- Parameters
tokens_and_primary (tuple ([str], str)) – tokens and the primary lithology
lithologies_adjective_dict (dict) – dictionary whose keys are exact, “clear” markers for secondary lithologies (e.g. ‘clayey’) and whose values are the lithology classes.
lithologies_dict (dict) – dictionary whose keys are the exact markers to match for lithologies and whose values are the lithology classes.
- Returns
secondary lithology if detected, empty string otherwise.
- Return type
str
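One way such a search could work is sketched below. This is not the code in ela: the name secondary_lithology_sketch and the dictionaries are hypothetical, and trying adjective markers before plain lithology markers is an assumption. The only constraint taken from this page is that the secondary lithology differs from the primary one.

```python
def secondary_lithology_sketch(tokens_and_primary, lithologies_adjective_dict, lithologies_dict):
    """Return a secondary lithology class, or an empty string if none is found."""
    tokens, primary = tokens_and_primary
    # Assumption: "clear" adjective markers (e.g. 'clayey') are tried first,
    # then the plain lithology markers.
    for mapping in (lithologies_adjective_dict, lithologies_dict):
        for token in tokens:
            litho = mapping.get(token)
            if litho is not None and litho != primary:
                return litho
    return ""

# Hypothetical example dictionaries
adjectives = {"clayey": "clay", "sandy": "sand"}
lithologies = {"sand": "sand", "clay": "clay", "gravel": "gravel"}

a = secondary_lithology_sketch((["clayey", "sand", "with", "gravel"], "sand"), adjectives, lithologies)
b = secondary_lithology_sketch((["sand", "with", "gravel"], "sand"), adjectives, lithologies)
c = secondary_lithology_sketch((["sand"], "sand"), adjectives, lithologies)
print(repr(a), repr(b), repr(c))  # 'clay' 'gravel' ''
```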
-
ela.textproc.find_word_from_root(tokens, root)
Filter tokens (words) to retain only those containing a root term
- Parameters
tokens (iterable of str) – the list of tokens.
root (str) – regular expression for the root term to look for (e.g. ‘clay’ or ‘cl(a|e)y’), which will be padded with ‘[a-z]*’ for searching
- Returns
terms matching the root term.
- Return type
a list
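The padding of the root term can be illustrated as follows. This is a sketch only: the name words_from_root_sketch is hypothetical, and padding on both sides of the root is an assumption, since the text says only that ‘[a-z]*’ is added.

```python
import re

def words_from_root_sketch(tokens, root):
    """Keep the tokens that contain the root term.

    Assumption: the root is padded with '[a-z]*' on both sides.
    """
    pattern = re.compile("[a-z]*" + root + "[a-z]*")
    return [t for t in tokens if pattern.match(t) is not None]

tokens = ["clay", "clayey", "sand", "claystone", "cley"]
matches = words_from_root_sketch(tokens, "cl(a|e)y")
print(matches)  # ['clay', 'clayey', 'claystone', 'cley']
```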
-
ela.textproc.flat_list_tokens(descriptions)
Convert a collection of strings to a flat list of tokens. English NLTK stopwords are removed.
- Parameters
descriptions (iterable of str) – lithology descriptions.
- Returns
List of tokens.
- Return type
list
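The flattening step can be pictured with the standard library. This sketch deliberately swaps in whitespace splitting for NLTK tokenisation and a tiny illustrative stopword set for the NLTK English stopword list. The names flat_list_tokens_sketch and STOPWORDS_SKETCH are hypothetical.

```python
from itertools import chain

# Illustrative stand-in for the NLTK English stopword list (assumption)
STOPWORDS_SKETCH = {"and", "with", "the", "of", "some"}

def flat_list_tokens_sketch(descriptions):
    """Tokenise each description, then flatten into a single list of tokens."""
    tokenised = (d.split() for d in descriptions)  # whitespace split, not NLTK
    return [t for t in chain.from_iterable(tokenised) if t not in STOPWORDS_SKETCH]

flat = flat_list_tokens_sketch(["brown clay with sand", "gravel and sand"])
print(flat)  # ['brown', 'clay', 'sand', 'gravel', 'sand']
```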
-
ela.textproc.match_and_sample_df(df, litho_class_name, colname='Lithology_1', out_colname=None, size=50, seed=None)
Sample a random subset of rows where the lithology column matches a particular class name.
- Parameters
df (pandas data frame) – bore lithology data with columns named PRIMARY_LITHO_COL
- Returns
a list of strings, compound primary+optional_secondary lithology descriptions e.g. ‘sand/clay’, ‘loam/’
-
ela.textproc.plot_freq(dataframe, y_log=False, x='token', figsize=(15, 10), fontsize=14)
Plot a sorted histogram of word frequencies
- Parameters
dataframe (pandas dataframe) – frequency of tokens, typically with column names ["token", "frequency"]
y_log (bool) – should there be a log scale on the y axis
x (str) – name of the column with the tokens (i.e. words)
figsize (tuple) – figure size
fontsize (int) – font size
- Returns
plot
- Return type
barplot
-
ela.textproc.plot_freq_for_root(tokens, root, y_log=True)
Plot a sorted histogram of word frequencies
- Parameters
tokens (iterable of str) – the list of tokens.
root (str) – regular expression for the root term to look for (e.g. ‘clay’ or ‘cl(a|e)y’), which will be padded with ‘[a-z]*’ for searching
y_log (bool) – should there be a log scale on the y axis
- Returns
plot
- Return type
barplot
-
ela.textproc.remove_punctuations(text)
Remove the punctuations (string.punctuation) in a string.
-
ela.textproc.replace_punctuations(text, replacement=' ')
Replace the punctuations (string.punctuation) in a string.
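Removing and replacing the characters in string.punctuation can both be done with str.translate. The following is a minimal sketch with hypothetical function names, and the library may well use a different technique.

```python
import string

def remove_punctuations_sketch(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

def replace_punctuations_sketch(text, replacement=" "):
    """Substitute every character listed in string.punctuation."""
    table = str.maketrans({c: replacement for c in string.punctuation})
    return text.translate(table)

removed = remove_punctuations_sketch("sand, fine-grained (wet)")
replaced = replace_punctuations_sketch("sand, fine-grained (wet)", "_")
print(removed)   # sand finegrained wet
print(replaced)  # sand_ fine_grained _wet_
```

Note that removal can fuse words (‘fine-grained’ becomes ‘finegrained’), which is one reason to prefer replacement with a space before tokenising.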
-
ela.textproc.split_composite_term(x, joint_re='with')
Break up terms that are composites joining several words without spaces. This has been observed in one case study but may not be prevalent.
- Parameters
x (str) – the term to split if matching, e.g. ‘claywithsand’ to ‘clay with sand’
joint_re (str) – regular expression for the word used as the fusing joint, typically ‘with’
- Returns
tokens split from the joining term.
- Return type
split wording (str)
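The splitting of a fused term can be expressed as a single regular expression substitution. This is a sketch under assumptions: the name split_composite_sketch is hypothetical, and the pattern requires at least one lowercase letter on each side of the joining word.

```python
import re

def split_composite_sketch(x, joint_re="with"):
    """Insert spaces around a joining word fused between two words."""
    pattern = "([a-z]+)(" + joint_re + ")([a-z]+)"
    return re.sub(pattern, r"\1 \2 \3", x)

split = split_composite_sketch("claywithsand")
untouched = split_composite_sketch("sand")
print(split)      # clay with sand
print(untouched)  # sand
```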
-
ela.textproc.split_with_term(x)
Split words that are joined by a ‘with’, e.g. ‘sandwithclay’
- Parameters
x (str) – the term to split if matching, e.g. ‘claywithsand’ to ‘clay with sand’
- Returns
tokens split from the joining term.
- Return type
split wording (str)
-
ela.textproc.token_freq(tokens, n_most_common=50)
Get the most frequent tokens and their counts
- Parameters
tokens (iterable of str) – the list of tokens to analyse for frequency.
n_most_common (int) – subset to this number of most frequent tokens
- Returns
data frame with columns ["token", "frequency"]
- Return type
pandas DataFrame
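The counting itself can be sketched with collections.Counter. Unlike the function documented above, this hypothetical token_freq_sketch returns a list of (token, frequency) tuples rather than a pandas DataFrame, to stay within the standard library.

```python
from collections import Counter

def token_freq_sketch(tokens, n_most_common=50):
    """Return (token, frequency) pairs for the most frequent tokens."""
    return Counter(tokens).most_common(n_most_common)

tokens = ["clay", "sand", "clay", "gravel", "sand", "clay"]
freq = token_freq_sketch(tokens, n_most_common=2)
print(freq)  # [('clay', 3), ('sand', 2)]
```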
-
ela.textproc.v_find_litho_markers(v_tokens, regex)
Vectorised function to find lithology terms that match a regular expression
- Parameters
v_tokens (iterable of iterable of str) – the list of tokenised sentences.
regex (regex) – compiled regular expression e.g. re.compile('sand|clay')
- Returns
tokens found to be matching the expression
- Return type
(iterable of iterable of str)
-
ela.textproc.v_find_primary_lithology(v_tokens, lithologies_dict)
Vectorised function to find a primary lithology in a list of tokenised sentences.
- Parameters
v_tokens (iterable of iterable of str) – the list of tokenised sentences.
lithologies_dict (dict) – dictionary whose keys are the exact markers to match for lithologies and whose values are the lithology classes.
- Returns
list of primary lithologies, one per sentence, with an empty string where none is detected.
- Return type
list
-
ela.textproc.v_find_secondary_lithology(v_tokens, prim_litho, lithologies_adjective_dict, lithologies_dict)
Vectorised function to find a secondary lithology in a list of tokenised sentences.
- Parameters
v_tokens (iterable of iterable of str) – the list of tokenised sentences.
prim_litho (list of str) – the list of primary lithologies already detected for v_tokens. The secondary lithology cannot be the same as the primary.
lithologies_adjective_dict (dict) – dictionary whose keys are exact, “clear” markers for secondary lithologies (e.g. ‘clayey’) and whose values are the lithology classes.
lithologies_dict (dict) – dictionary whose keys are the exact markers to match for lithologies and whose values are the lithology classes.
- Returns
list of secondary lithologies, one per sentence, with an empty string where none is detected.
- Return type
list
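To show how prim_litho lines up with v_tokens, the sketch below runs both vectorised steps over a small list of tokenised sentences. Everything here is an assumption for illustration: the helper first_class, the dictionaries, and the search order (adjective markers before plain markers) are hypothetical and not taken from ela.

```python
def first_class(tokens, mapping, exclude=None):
    """Return the class of the first token found in mapping, skipping `exclude`."""
    for token in tokens:
        litho = mapping.get(token)
        if litho is not None and litho != exclude:
            return litho
    return ""

# Hypothetical inputs
v_tokens = [["clayey", "sand"], ["gravel"], ["topsoil"]]
lithologies = {"sand": "sand", "clay": "clay", "gravel": "gravel"}
adjectives = {"clayey": "clay", "sandy": "sand"}

# One primary lithology per sentence
prim_litho = [first_class(tokens, lithologies) for tokens in v_tokens]

# One secondary lithology per sentence, never equal to the primary
sec_litho = [
    first_class(tokens, adjectives, exclude=p) or first_class(tokens, lithologies, exclude=p)
    for tokens, p in zip(v_tokens, prim_litho)
]
print(prim_litho)  # ['sand', 'gravel', '']
print(sec_litho)   # ['clay', '', '']
```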
-
ela.textproc.v_lower = <numpy.vectorize object>
Vectorised, unicode version of the function to convert strings to lower case
-
ela.textproc.v_remove_punctuations(textlist)
Vectorised function to remove punctuations
- Parameters
textlist (iterable of str) – list of terms
- Returns
- Return type
(list)
-
ela.textproc.v_replace_punctuations(textlist, replacement=' ')
Vectorised function to replace punctuations
- Parameters
textlist (iterable of str) – list of terms
- Returns
- Return type
(list)
-
ela.textproc.v_split_with_term(xlist)
Vectorised function to split words that are joined by a ‘with’, e.g. ‘sandwithclay’
- Parameters
xlist (iterable of str) – the terms to split if matching, e.g. ‘claywithsand’ to ‘clay with sand’
- Returns
tokens split from the joining term.
- Return type
split tokens (list of str)
-
ela.textproc.v_word_tokenize(descriptions)
Vectorised tokenisation of lithology descriptions.
- Parameters
descriptions (iterable of str) – lithology descriptions.
- Returns
list of lists of tokens, as produced by NLTK.
- Return type
list