Text Processing

ela.textproc module

Text processing features for lithology analysis.

ela.textproc.DEM_ELEVATION_COL = 'DEM_elevation'

Default column name expected in lithodescription data frames

ela.textproc.DEPTH_FROM_AHD_COL = 'Depth From (AHD)'

Default column name expected in lithodescription data frames

ela.textproc.DEPTH_FROM_COL = 'Depth From (m)'

Default column name expected in lithodescription data frames

ela.textproc.DEPTH_TO_AHD_COL = 'Depth To (AHD)'

Default column name expected in lithodescription data frames

ela.textproc.DEPTH_TO_COL = 'Depth To (m)'

Default column name expected in lithodescription data frames

ela.textproc.DISTANCE_COL = 'distance'

Default column name expected in lithodescription data frames

ela.textproc.EASTING_COL = 'Easting'

Default column name expected in lithodescription data frames

ela.textproc.GEOMETRY_COL = 'geometry'

Default column name expected in lithodescription data frames

ela.textproc.LITHO_DESC_COL = 'Lithological Description'

Default column name expected in lithodescription data frames

ela.textproc.NORTHING_COL = 'Northing'

Default column name expected in lithodescription data frames

ela.textproc.PRIMARY_LITHO_COL = 'Lithology_1'

Default column name expected in lithodescription data frames

ela.textproc.PRIMARY_LITHO_NUM_COL = 'Lithology_1_num'

Default column name expected in lithodescription data frames

ela.textproc.SECONDARY_LITHO_COL = 'Lithology_2'

Default column name expected in lithodescription data frames

ela.textproc.SECONDARY_LITHO_NUM_COL = 'Lithology_2_num'

Default column name expected in lithodescription data frames

ela.textproc.as_numeric(x)
ela.textproc.clean_lithology_descriptions(description_series, lex)

Preparatory cleanup of lithology descriptions for further analysis

Replace abbreviations and misspellings according to a lexicon, and convert to lower case.

Parameters
  • description_series (iterable of str, or pd.Series) – lithology descriptions

  • lex (striplog.Lexicon) – an instance of striplog’s Lexicon

Returns

processed descriptions.

Return type

(iterable of str)

ela.textproc.columns_as_numeric(df, colnames=None)

Convert the specified columns to numeric values. The data frame is modified in place.

Parameters
  • df (pandas data frame) – bore lithology data

  • colnames (iterable of str) – column names

ela.textproc.find_litho_markers(tokens, regex)

Find lithology terms that match a regular expression

Parameters
  • tokens (iterable of str) – the list of tokenised sentences.

  • regex (regex) – compiled regular expression, e.g. re.compile('sand|clay')

Returns

tokens found to be matching the expression

Return type

(list of str)
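
The filtering described above can be sketched with the standard library alone. This is an illustrative stand-in, not the library's code: the name find_markers_sketch is hypothetical, and anchoring the match at the start of each token (regex.match rather than regex.search) is an assumption.

```python
import re

def find_markers_sketch(tokens, regex):
    """Keep the tokens matched by a compiled pattern (hypothetical helper)."""
    # Assumption: the pattern is tested from the start of each token.
    return [t for t in tokens if regex.match(t)]

tokens = ["brown", "sandy", "clay", "with", "gravel"]
markers = find_markers_sketch(tokens, re.compile("sand|clay"))
print(markers)
```

Under that assumption, a token such as ‘sandy’ is retained because it starts with ‘sand’.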

ela.textproc.find_primary_lithology(tokens, lithologies_dict)

Find a primary lithology in a tokenised sentence.

Parameters
  • tokens (iterable of str) – the tokens of one tokenised sentence.

  • lithologies_dict (dict) – dictionary, where keys are exact markers to match for lithologies. Values are the lithology classes.

Returns

primary lithology if detected, empty string if none.

Return type

str
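
A minimal sketch of the dictionary lookup this entry describes, for one tokenised sentence. The name primary_lithology_sketch and the sample dictionary are hypothetical, and the rule that the first token found among the keys wins is an assumption rather than documented behaviour.

```python
def primary_lithology_sketch(tokens, lithologies_dict):
    """Return the class of the first token that is an exact marker, else ''."""
    for token in tokens:
        # Exact key match only: 'sandy' would not match the key 'sand'.
        if token in lithologies_dict:
            return lithologies_dict[token]
    return ""

# Hypothetical marker dictionary: keys are markers, values are classes.
lithologies = {"sand": "sand", "clay": "clay", "clays": "clay"}

primary = primary_lithology_sketch(["brown", "clays", "with", "sand"], lithologies)
none_found = primary_lithology_sketch(["brown", "wet"], lithologies)
print(repr(primary), repr(none_found))
```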

ela.textproc.find_regex_df(df, expression, colname)

Sample a random subset of rows where the lithology column matches a particular class name.

Parameters

df (pandas data frame) – bore lithology data with columns named PRIMARY_LITHO_COL

Returns

Return type

dataframe

ela.textproc.find_secondary_lithology(tokens_and_primary, lithologies_adjective_dict, lithologies_dict)

Find a secondary lithology in a tokenised sentence.

Parameters
  • tokens_and_primary (tuple ([str], str)) – the tokens of a sentence and its primary lithology

  • lithologies_adjective_dict (dict) – dictionary, where keys are exact, “clear” markers for secondary lithologies (e.g. ‘clayey’). Values are the lithology classes.

  • lithologies_dict (dict) – dictionary, where keys are exact markers to match for lithologies. Values are the lithology classes.

Returns

secondary lithology if detected, empty string if none.

Return type

str
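
One way to picture the two dictionaries working together is sketched below. It is a guess at the logic, not the library's implementation: the name secondary_lithology_sketch is hypothetical, and checking the adjective markers before the plain lithology markers is an assumption. The only rule taken from this documentation is that the secondary lithology cannot equal the primary.

```python
def secondary_lithology_sketch(tokens_and_primary, adjective_dict, lithologies_dict):
    """Return a secondary lithology class different from the primary, else ''."""
    tokens, primary = tokens_and_primary
    # Assumed order: "clear" adjective markers first, then plain markers.
    for markers in (adjective_dict, lithologies_dict):
        for token in tokens:
            litho = markers.get(token)
            if litho is not None and litho != primary:
                return litho
    return ""

# Hypothetical dictionaries: keys are markers, values are classes.
adjectives = {"sandy": "sand", "clayey": "clay"}
lithologies = {"sand": "sand", "clay": "clay"}

from_adjective = secondary_lithology_sketch((["sandy", "clay"], "clay"), adjectives, lithologies)
from_noun = secondary_lithology_sketch((["clay", "with", "sand"], "clay"), adjectives, lithologies)
nothing = secondary_lithology_sketch((["clay"], "clay"), adjectives, lithologies)
print(from_adjective, from_noun, repr(nothing))
```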

ela.textproc.find_word_from_root(tokens, root)

Filter tokens (words) to retain only those containing a root term

Parameters
  • tokens (iterable of str) – the list of tokens.

  • root (str) – regular expression for the root term to look for (e.g. ‘clay’ or ‘cl(a|e)y’), which will be padded with ‘[a-z]*’ for searching

Returns

terms matching the root term.

Return type

a list
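
The padding of the root term can be illustrated as follows. This is a sketch under assumptions: the name words_from_root_sketch is hypothetical, and padding the root with ‘[a-z]*’ on both sides is a guess, since the description does not say which side is padded.

```python
import re

def words_from_root_sketch(tokens, root):
    """Keep the tokens that contain the root term (hypothetical helper)."""
    # Assumption: the root is padded with '[a-z]*' before and after.
    pattern = re.compile("[a-z]*" + root + "[a-z]*")
    return [t for t in tokens if pattern.match(t)]

tokens = ["clay", "clayey", "claystone", "sand", "cley", "sandyclay"]
found = words_from_root_sketch(tokens, "cl(a|e)y")
print(found)
```

With padding on both sides, a fused token such as ‘sandyclay’ is also retained.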

ela.textproc.flat_list_tokens(descriptions)

Convert a collection of strings to a flat list of tokens, with English NLTK stopwords removed.

Parameters

descriptions (iterable of str) – lithology descriptions.

Returns

List of tokens.

Return type

list

ela.textproc.match_and_sample_df(df, litho_class_name, colname='Lithology_1', out_colname=None, size=50, seed=None)

Sample a random subset of rows where the lithology column matches a particular class name.

Parameters

df (pandas data frame) – bore lithology data with columns named PRIMARY_LITHO_COL

Returns

a list of strings, compound primary+optional_secondary lithology descriptions e.g. ‘sand/clay’, ‘loam/’

ela.textproc.plot_freq(dataframe, y_log=False, x='token', figsize=(15, 10), fontsize=14)

Plot a sorted histogram of word frequencies

Parameters
  • dataframe (pandas dataframe) – frequency of tokens, typically with colnames [“token”, “frequency”]

  • y_log (bool) – should there be a log scale on the y axis

  • x (str) – name of the columns with the tokens (i.e. words)

  • figsize (tuple) –

  • fontsize (int) –

Returns

plot

Return type

barplot

ela.textproc.plot_freq_for_root(tokens, root, y_log=True)

Plot a sorted histogram of word frequencies for the tokens containing a root term

Parameters
  • tokens (iterable of str) – the list of tokens.

  • root (str) – regular expression for the root term to look for (e.g. ‘clay’ or ‘cl(a|e)y’), which will be padded with ‘[a-z]*’ for searching

  • y_log (bool) – should there be a log scale on the y axis

Returns

plot

Return type

barplot

ela.textproc.remove_punctuations(text)

Remove the punctuation characters (string.punctuation) from a string.

ela.textproc.replace_punctuations(text, replacement=' ')

Replace the punctuation characters (string.punctuation) in a string.
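
Both punctuation helpers can be approximated with str.translate and string.punctuation. The sketch below is one possible implementation, not necessarily the one used by the library; the function names ending in _sketch are hypothetical.

```python
import string

def remove_punctuations_sketch(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

def replace_punctuations_sketch(text, replacement=" "):
    """Substitute every character listed in string.punctuation."""
    table = str.maketrans({c: replacement for c in string.punctuation})
    return text.translate(table)

removed = remove_punctuations_sketch("sand, fine-grained (wet).")
replaced = replace_punctuations_sketch("sand,clay")
print(removed)
print(replaced)
```

Replacing rather than removing keeps terms such as ‘sand,clay’ from being fused into a single token.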

ela.textproc.split_composite_term(x, joint_re='with')

Break up composite terms made of several words fused together without spaces. This has been observed in one case study but may not be prevalent.

Parameters
  • x (str) – the term to split if matching, e.g. ‘claywithsand’ to ‘clay with sand’

  • joint_re (str) – regular expression for the word that joins the fused terms, typically ‘with’

Returns

tokens split from the joining term.

Return type

split wording (str)
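
The splitting of a fused term can be sketched with a single regular expression substitution. The name split_composite_sketch is hypothetical, and the requirement that the joining word has lower-case letters on both sides is an assumption made so that a standalone ‘with’ is left untouched.

```python
import re

def split_composite_sketch(x, joint_re="with"):
    """Insert spaces around a joining word fused between two words."""
    # Named groups keep the substitution valid if joint_re has its own groups.
    pattern = re.compile(
        r"(?P<left>[a-z]+)(?P<joint>" + joint_re + r")(?P<right>[a-z]+)"
    )
    return pattern.sub(r"\g<left> \g<joint> \g<right>", x)

split = split_composite_sketch("claywithsand")
unchanged = split_composite_sketch("clay")
print(split)
print(unchanged)
```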

ela.textproc.split_with_term(x)

Split words that are joined by ‘with’, e.g. ‘sandwithclay’.

Parameters

x (str) – the term to split if matching, e.g. ‘claywithsand’ to ‘clay with sand’

Returns

tokens split from the joining term.

Return type

split wording (str)

ela.textproc.token_freq(tokens, n_most_common=50)

Gets the most frequent (counts) tokens

Parameters
  • tokens (iterable of str) – the list of tokens to analyse for frequency.

  • n_most_common (int) – subset to this number of most frequent tokens

Returns

data frame with columns [“token”, “frequency”]

Return type

pandas DataFrame
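
The counting step can be illustrated with collections.Counter. This sketch returns a list of (token, frequency) pairs rather than the pandas DataFrame documented above, to stay within the standard library; the name token_freq_sketch is hypothetical.

```python
from collections import Counter

def token_freq_sketch(tokens, n_most_common=50):
    """Return (token, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common(n_most_common)

tokens = ["clay", "sand", "clay", "gravel", "clay", "sand"]
top_two = token_freq_sketch(tokens, n_most_common=2)
print(top_two)
```

Such pairs could then be loaded into a data frame with the columns “token” and “frequency” for plotting.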

ela.textproc.v_find_litho_markers(v_tokens, regex)

Find lithology terms that match a regular expression, in a list of tokenised sentences.

Parameters
  • v_tokens (iterable of iterable of str) – the list of tokenised sentences.

  • regex (regex) – compiled regular expression, e.g. re.compile('sand|clay')

Returns

tokens found to be matching the expression

Return type

(iterable of iterable of str)

ela.textproc.v_find_primary_lithology(v_tokens, lithologies_dict)

Vectorised function to find a primary lithology in a list of tokenised sentences.

Parameters
  • v_tokens (iterable of iterable of str) – the list of tokenised sentences.

  • lithologies_dict (dict) – dictionary, where keys are exact markers to match for lithologies. Values are the lithology classes.

Returns

list of primary lithologies if detected, with an empty string where none is found.

Return type

list

ela.textproc.v_find_secondary_lithology(v_tokens, prim_litho, lithologies_adjective_dict, lithologies_dict)

Vectorised function to find a secondary lithology in a list of tokenised sentences.

Parameters
  • v_tokens (iterable of iterable of str) – the list of tokenised sentences.

  • prim_litho (list of str) – the list of primary lithologies already detected for v_tokens. The secondary lithology cannot be the same as the primary.

  • lithologies_adjective_dict (dict) – dictionary, where keys are exact, “clear” markers for secondary lithologies (e.g. ‘clayey’). Values are the lithology classes.

  • lithologies_dict (dict) – dictionary, where keys are exact markers to match for lithologies. Values are the lithology classes.

Returns

list of secondary lithologies if detected, with an empty string where none is found.

Return type

list

ela.textproc.v_lower = <numpy.vectorize object>

Vectorised, Unicode version of the function converting strings to lower case

ela.textproc.v_remove_punctuations(textlist)

Vectorised function to remove punctuations.

Parameters

textlist (iterable of str) – list of terms

Returns

Return type

(list)

ela.textproc.v_replace_punctuations(textlist, replacement=' ')

Vectorised function to replace punctuations.

Parameters

textlist (iterable of str) – list of terms

Returns

Return type

(list)

ela.textproc.v_split_with_term(xlist)

Split words that are joined by ‘with’, e.g. ‘sandwithclay’.

Parameters

xlist (iterable of str) – the terms to split if matching, e.g. ‘claywithsand’ to ‘clay with sand’

Returns

tokens split from the joining term.

Return type

split tokens (list of str)

ela.textproc.v_word_tokenize(descriptions)

Vectorised tokenisation of lithology descriptions.

Parameters

descriptions (iterable of str) – lithology descriptions.

Returns

list of lists of tokens, as produced by NLTK.

Return type

list